Estimates GC-content-dependent effects on ChIP-seq read counts using an expectation-maximization (EM) algorithm with generalized linear models. Regions are modeled as a mixture of background and foreground components (e.g. non-peak and peak-like regions), each following either a Poisson or negative binomial distribution. GC effects are fitted separately for each component using natural spline regression. Optional visualization shows fitted curves and mixture probabilities.

adp_gcEffects(
  gc = gc,
  region = region,
  rc = rc,
  peakwidth = 501,
  plot = TRUE,
  gcrange = c(0.3, 0.8),
  emtrace = TRUE,
  model = c("nbinom", "poisson"),
  mu0 = 1,
  mu1 = 50,
  theta0 = mu0,
  theta1 = mu1,
  p = 0.02,
  converge = 0.001,
  max_pts = 1e+05,
  max_line = 5000
)

Arguments

gc

A numeric vector of GC content values for genomic windows or regions.

region

A genomic region object or identifier (not directly used in modeling but retained for consistency or downstream reference).

rc

A numeric vector of read counts corresponding to the same regions as gc.

peakwidth

A non-negative integer specifying the expected ChIP-seq binding width (default 501). Used for labeling and downstream interpretation.

plot

Logical; if TRUE (default), generates a scatter plot showing GC content vs read counts, colored by posterior mixture probabilities, and overlays fitted foreground (red) and background (blue) curves.

gcrange

A numeric vector of length 2 specifying the GC-content range to include for model fitting. Regions outside this range are ignored. Default is c(0.3, 0.8).

emtrace

Logical; if TRUE (default), prints log-likelihood and convergence progress at each EM iteration.

model

Character string specifying the distribution model for read counts. Supported options are "nbinom" (default) and "poisson".

mu0, mu1

Numeric values giving the initial mean read counts for background and foreground components, respectively. Defaults are mu0 = 1, mu1 = 50.

theta0, theta1

Numeric values specifying initial shape parameters for negative binomial models of background and foreground. Used only when model = "nbinom".

p

Numeric value specifying the initial mixture proportion of foreground regions. Default is 0.02.

converge

Numeric value specifying the EM convergence threshold. Iteration stops when the relative log-likelihood change is below this threshold. Default is 1e-3.

max_pts

Integer specifying the maximum number of points plotted (default 100000).

max_line

Integer specifying the maximum number of points used to draw fitted curves (default 5000).

Value

A list containing:

gc

GC-content values at which GC effects were estimated.

lmns0

Fitted GLM object for the background component.

lmns1

Fitted GLM object for the foreground component.

z

Posterior probabilities of each region belonging to the foreground component.

mu0, mu1

Predicted read counts (fitted means) at each GC content for background and foreground components, respectively.

mu0med0, mu1med1

Medians of fitted background and foreground signals in their respective component regions.

mu0med1, mu1med0

Cross-median values: background prediction in foreground-like regions, and vice versa.

g

A ggplot object (if plot=TRUE) showing fitted GC effects.

dat

Data frame containing the filtered counts and GC content used for fitting.

Details

The algorithm iteratively updates posterior probabilities (E step) and fits GC-dependent regression models for both mixture components (M step). The GC dependence is modeled using a natural spline with 2 degrees of freedom.

The fitted curves can reveal whether sequencing depth or read coverage is biased by GC content, separately for peak-enriched and background regions.

Examples

if (FALSE) { # \dontrun{
set.seed(1)
gc <- runif(10000, 0.2, 0.9)
rc <- rpois(10000, lambda = exp(5 * (gc - 0.5)^2))
res <- adp_gcEffects(gc = gc, rc = rc, model = "poisson", plot = TRUE)

# visualize the estimated GC effects
print(res$g)
} # }