adp_gcEffects.Rd
Estimates GC-content-dependent effects on ChIP-seq read counts using an expectation-maximization (EM) algorithm with generalized linear models. Regions are modeled as a mixture of background and foreground components (e.g., non-peak and peak-like regions), each following either a Poisson or a negative binomial distribution. GC effects are fitted separately for each component using natural spline regression. An optional plot shows the fitted curves and mixture probabilities.
A numeric vector of GC content values for genomic windows or regions.
A genomic region object or identifier (not directly used in modeling but retained for consistency or downstream reference).
A numeric vector of read counts corresponding to the same
regions as gc.
A non-negative integer specifying the expected ChIP-seq binding width (default 501). Used for labeling and downstream interpretation.
Logical; if TRUE (default), generates a scatter plot
showing GC content vs read counts, colored by posterior mixture
probabilities, and overlays fitted foreground (red) and background (blue)
curves.
A numeric vector of length 2 specifying the GC-content range
to include for model fitting. Regions outside this range are ignored.
Default is c(0.3, 0.8).
Logical; if TRUE (default), prints log-likelihood and
convergence progress at each EM iteration.
Character string specifying the distribution model for read
counts. Supported options are "nbinom" (default) and "poisson".
Numeric values giving the initial mean read counts for
background and foreground components, respectively. Defaults are
mu0 = 1, mu1 = 50.
Numeric values specifying initial shape parameters
for negative binomial models of background and foreground. Used only when
model = "nbinom".
Numeric value specifying the initial mixture proportion of foreground regions. Default is 0.02.
Numeric value specifying the EM convergence threshold. Iteration stops when the relative log-likelihood change is below this threshold. Default is 1e-3.
Integer specifying the maximum number of points plotted (default 100000).
Integer specifying the maximum number of points used to draw fitted curves (default 5000).
A list containing:
GC-content values at which GC effects were estimated.
Fitted GLM object for the background component.
Fitted GLM object for the foreground component.
Posterior probabilities of each region belonging to the foreground component.
Predicted read counts (fitted means) at each GC content for background and foreground components, respectively.
Medians of fitted background and foreground signals in their respective component regions.
Cross-median values: the median background prediction in foreground-like regions, and the median foreground prediction in background-like regions.
A ggplot object (if plot=TRUE) showing fitted GC effects.
Data frame containing the filtered counts and GC content used for fitting.
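The median and cross-median summaries listed above can be illustrated with a small sketch. The variable names (`mu0_hat`, `mu1_hat`, `z`), the toy values, and the 0.5 cutoff used to call a region "foreground-like" are all assumptions made for illustration; the function may assign regions to components differently.

```r
# Toy inputs: every value here is made up for illustration.
mu0_hat <- c(2.0, 2.5, 3.0, 3.5, 4.0, 4.5)        # fitted background means
mu1_hat <- c(20, 25, 30, 35, 40, 45)              # fitted foreground means
z       <- c(0.01, 0.02, 0.97, 0.03, 0.99, 0.95)  # posterior P(foreground)

# Assumed hard assignment at 0.5; the actual rule may differ.
fg_like <- z >= 0.5
bg_like <- !fg_like

# Medians of each fit within its own component's regions.
mu0_med <- median(mu0_hat[bg_like])
mu1_med <- median(mu1_hat[fg_like])

# Cross-medians: each fit evaluated in the other component's regions.
mu0_in_fg <- median(mu0_hat[fg_like])
mu1_in_bg <- median(mu1_hat[bg_like])

print(c(mu0_med = mu0_med, mu1_med = mu1_med,
        mu0_in_fg = mu0_in_fg, mu1_in_bg = mu1_in_bg))
```

Comparing a median with its cross-median indicates how much the predicted signal of one component shifts when it is evaluated at the GC content typical of the other.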
The algorithm alternates between two steps until convergence: the E step updates the posterior probability that each region belongs to the foreground component, and the M step refits the GC-dependent regression models for both mixture components. GC dependence is modeled with a natural spline with 2 degrees of freedom.
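The E and M steps described above can be sketched for the Poisson case as follows. This is a minimal illustration on simulated data, not the package's implementation: the helper `em_gc_poisson`, its argument names `p`, `converge`, and `max_iter`, the use of posterior probabilities as GLM weights, and the simulated means are all assumptions. Only `mu0 = 1`, `mu1 = 50`, the 0.02 initial proportion, the 1e-3 threshold, and the 2-df natural spline come from the description.

```r
library(splines)  # provides ns() for natural splines

# Simulated toy data: about 10% foreground regions with higher means.
set.seed(1)
n      <- 2000
gc     <- runif(n, 0.3, 0.8)
is_fg  <- rbinom(n, 1, 0.1)
counts <- rpois(n, ifelse(is_fg == 1, exp(3 + gc), exp(0.5 + gc)))

# Hypothetical helper for illustration; not the package function.
em_gc_poisson <- function(counts, gc, mu0 = 1, mu1 = 50, p = 0.02,
                          converge = 1e-3, max_iter = 100) {
  dat        <- data.frame(y = counts, gc = gc)
  lambda0    <- rep(mu0, nrow(dat))  # initial background means
  lambda1    <- rep(mu1, nrow(dat))  # initial foreground means
  loglik_old <- -Inf

  for (iter in seq_len(max_iter)) {
    # E step: posterior foreground probability, computed in log space.
    l0     <- log(1 - p) + dpois(dat$y, lambda0, log = TRUE)
    l1     <- log(p)     + dpois(dat$y, lambda1, log = TRUE)
    m      <- pmax(l0, l1)
    ll_i   <- m + log(exp(l0 - m) + exp(l1 - m))
    z      <- exp(l1 - ll_i)
    loglik <- sum(ll_i)

    # Stop when the relative log-likelihood change is below the threshold.
    if (is.finite(loglik_old) &&
        abs(loglik - loglik_old) / abs(loglik_old) < converge) break
    loglik_old <- loglik

    # M step: weighted Poisson GLMs with a 2-df natural spline in GC.
    dat$w0  <- 1 - z
    dat$w1  <- z
    fit0    <- glm(y ~ ns(gc, df = 2), family = poisson(),
                   data = dat, weights = w0)
    fit1    <- glm(y ~ ns(gc, df = 2), family = poisson(),
                   data = dat, weights = w1)
    lambda0 <- fitted(fit0)
    lambda1 <- fitted(fit1)
    p       <- mean(z)  # updated foreground proportion
  }
  list(fit0 = fit0, fit1 = fit1, z = z, p = p, loglik = loglik, iter = iter)
}

res <- em_gc_poisson(counts, gc)
print(res$p)     # estimated foreground proportion
print(res$iter)  # iterations used
```

Working in log space in the E step avoids underflow when a count is very unlikely under one component. A negative binomial variant would follow the same structure with a different density and an additional shape parameter per component.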
The fitted curves can reveal whether sequencing depth or read coverage is biased by GC content, separately for peak-enriched and background regions.