get_gc_baseline.RdThis function is the main function that computes GC-content baseline correction parameters (lambda values) for each genomic region and topic combination. It processes count matrices with topic modeling results, estimates GC effects for each topic separately, and returns normalized baseline parameters that can be used for GC bias correction.
get_gc_baseline(
count_matrix,
Lmat,
Fmat,
outdir,
genome = "hg19",
peakwidth = 501,
emtrace = FALSE,
verbose = FALSE,
plot = FALSE,
gcrange = c(0.3, 0.8),
mu0 = 1,
mu1 = 50,
theta0 = mu0,
theta1 = mu1,
p = 0.02,
converge = 0.001,
max_pts = 1e+05,
max_line = 5000
)A SummarizedExperiment or similar object containing a count
matrix accessible via assay(). Rows represent genomic peaks and columns
represent cells or samples.
A matrix of topic loadings where rows correspond to cells (matching
column names in count_matrix) and columns represent topics.
A character string specifying the genome build. Must be a valid BSgenome package name (e.g., "hg19", "hg38", "mm10"). Default is "hg19".
An integer specifying the width of genomic peaks in base pairs. Default is 501.
Logical; if TRUE, prints iteration details during EM
algorithm convergence for each topic. Default is FALSE.
Logical; if TRUE, prints progress messages during GC
content calculation and data preparation. Default is FALSE.
Logical; if TRUE, generates diagnostic plots showing GC
effects for each topic. Default is FALSE.
A numeric vector of length 2 specifying the GC content range
to include in model fitting. Default is c(0.3, 0.8).
Numeric initial values for background and foreground mean read
counts. Defaults are mu0 = 1 and mu1 = 50.
Numeric initial shape parameters for negative binomial
distribution. Default to mu0 and mu1 respectively.
Numeric initial mixture proportion for foreground regions. Default is 0.02.
Numeric convergence threshold for EM algorithm. Default is 1e-3.
Integer specifying maximum number of points to plot. Default is 100000.
Integer specifying maximum number of points for fitted curve lines. Default is 5000.
A list containing:
A data frame where rows represent genomic regions (peaks) and columns represent topics. Each entry is the normalized baseline parameter (lambda) for that region-topic combination.
A named numeric vector containing the total read count for each topic.
A named list where each element contains the GC content vector for all regions, organized by topic.
The function performs the following steps:
Multiplies the count matrix by the topic loading matrix to obtain topic-specific read counts.
Calculates GC content for each genomic peak region.
Fits adaptive GC effect models for each topic using negative binomial regression with an EM algorithm.
Predicts baseline mean values (mu0) for each region and topic.
Normalizes by total topic counts to obtain lambda parameters.
The lambda values can be used to correct for GC bias in downstream analyses such as differential accessibility or peak calling.
if (FALSE) { # \dontrun{
gc_baseline_res <- get_gc_baseline(count_matrix = count_mat_raw, Lmat = L_mat_topic)
} # }