Adaptive GC Effects Estimation for ChIP-seq Read Counts

Estimates GC-content-dependent effects on ChIP-seq read counts using an expectation-maximization (EM) algorithm with generalized linear models. Regions are modeled as a mixture of background and foreground components (e.g. non-peak and peak-like regions), each following either a Poisson or negative binomial distribution. GC effects are fitted separately for each component using natural spline regression. Optional visualization shows fitted curves and mixture probabilities.

adp_gcEffects(
  gc = gc,
  region = region,
  rc = rc,
  peakwidth = 501,
  plot = TRUE,
  gcrange = c(0.3, 0.8),
  emtrace = TRUE,
  model = c("nbinom", "poisson"),
  mu0 = 1,
  mu1 = 50,
  theta0 = mu0,
  theta1 = mu1,
  p = 0.02,
  converge = 0.001,
  max_pts = 1e+05,
  max_line = 5000
)

Arguments

gc: A numeric vector of GC content values for genomic windows or regions.
region: A genomic region object or identifier (not directly used in modeling but retained for consistency or downstream reference).
rc: A numeric vector of read counts corresponding to the same regions as gc.
peakwidth: A non-negative integer specifying the expected ChIP-seq binding width (default 501). Used for labeling and downstream interpretation.
plot: Logical; if TRUE (default), generates a scatter plot showing GC content vs read counts, colored by posterior mixture probabilities, and overlays fitted foreground (red) and background (blue) curves.
gcrange: A numeric vector of length 2 specifying the GC-content range to include for model fitting. Regions outside this range are ignored. Default is c(0.3, 0.8).
emtrace: Logical; if TRUE (default), prints log-likelihood and convergence progress at each EM iteration.
model: Character string specifying the distribution model for read counts. Supported options are "nbinom" (default) and "poisson".
mu0, mu1: Numeric values giving the initial mean read counts for background and foreground components, respectively. Defaults are mu0 = 1, mu1 = 50.
theta0, theta1: Numeric values specifying initial shape parameters for negative binomial models of background and foreground. Used only when model = "nbinom".
p: Numeric value specifying the initial mixture proportion of foreground regions. Default is 0.02.
converge: Numeric value specifying the EM convergence threshold. Iteration stops when the relative log-likelihood change is below this threshold. Default is 1e-3.
max_pts: Integer specifying the maximum number of points plotted (default 100000).
max_line: Integer specifying the maximum number of points used to draw fitted curves (default 5000).

Value

A list containing:

gc: GC-content values at which GC effects were estimated.
lmns0: Fitted GLM object for the background component.
lmns1: Fitted GLM object for the foreground component.
z: Posterior probabilities of each region belonging to the foreground component.
mu0, mu1: Predicted read counts (fitted means) at each GC content for background and foreground components, respectively.
mu0med0, mu1med1: Medians of fitted background and foreground signals in their respective component regions.
mu0med1, mu1med0: Cross-median values: background prediction in foreground-like regions, and vice versa.
g: A ggplot object (if plot=TRUE) showing fitted GC effects.
dat: Data frame containing the filtered counts and GC content used for fitting.

Details

The algorithm iteratively updates posterior probabilities (E step) and fits GC-dependent regression models for both mixture components (M step). The GC dependence is modeled using a natural spline with 2 degrees of freedom.

The fitted curves can reveal whether sequencing depth or read coverage is biased by GC content, separately for peak-enriched and background regions.

Examples

if (FALSE) { # \dontrun{
set.seed(1)
gc <- runif(10000, 0.2, 0.9)
rc <- rpois(10000, lambda = exp(5 * (gc - 0.5)^2))
res <- adp_gcEffects(gc = gc, rc = rc, model = "poisson", plot = TRUE)

# visualize the estimated GC effects
print(res$g)
} # }