This function is the main function that computes GC-content baseline correction parameters (lambda values) for each genomic region and topic combination. It processes count matrices with topic modeling results, estimates GC effects for each topic separately, and returns normalized baseline parameters that can be used for GC bias correction.

get_gc_baseline(
  count_matrix,
  Lmat,
  Fmat,
  outdir,
  genome = "hg19",
  peakwidth = 501,
  emtrace = FALSE,
  verbose = FALSE,
  plot = FALSE,
  gcrange = c(0.3, 0.8),
  mu0 = 1,
  mu1 = 50,
  theta0 = mu0,
  theta1 = mu1,
  p = 0.02,
  converge = 0.001,
  max_pts = 1e+05,
  max_line = 5000
)

Arguments

count_matrix

A SummarizedExperiment or similar object containing a count matrix accessible via assay(). Rows represent genomic peaks and columns represent cells or samples.

Lmat

A matrix of topic loadings where rows correspond to cells (matching column names in count_matrix) and columns represent topics.

genome

A character string specifying the genome build. Must be a valid BSgenome package name (e.g., "hg19", "hg38", "mm10"). Default is "hg19".

peakwidth

An integer specifying the width of genomic peaks in base pairs. Default is 501.

emtrace

Logical; if TRUE, prints iteration details during EM algorithm convergence for each topic. Default is FALSE.

verbose

Logical; if TRUE, prints progress messages during GC content calculation and data preparation. Default is FALSE.

plot

Logical; if TRUE, generates diagnostic plots showing GC effects for each topic. Default is FALSE.

gcrange

A numeric vector of length 2 specifying the GC content range to include in model fitting. Default is c(0.3, 0.8).

mu0, mu1

Numeric initial values for background and foreground mean read counts. Defaults are mu0 = 1 and mu1 = 50.

theta0, theta1

Numeric initial shape parameters for negative binomial distribution. Default to mu0 and mu1 respectively.

p

Numeric initial mixture proportion for foreground regions. Default is 0.02.

converge

Numeric convergence threshold for EM algorithm. Default is 1e-3.

max_pts

Integer specifying maximum number of points to plot. Default is 100000.

max_line

Integer specifying maximum number of points for fitted curve lines. Default is 5000.

Value

A list containing:

lambda_jk

A data frame where rows represent genomic regions (peaks) and columns represent topics. Each entry is the normalized baseline parameter (lambda) for that region-topic combination.

N_k

A named numeric vector containing the total read count for each topic.

gc_k

A named list where each element contains the GC content vector for all regions, organized by topic.

Details

The function performs the following steps:

  1. Multiplies the count matrix by the topic loading matrix to obtain topic-specific read counts.

  2. Calculates GC content for each genomic peak region.

  3. Fits adaptive GC effect models for each topic using negative binomial regression with an EM algorithm.

  4. Predicts baseline mean values (mu0) for each region and topic.

  5. Normalizes by total topic counts to obtain lambda parameters.

The lambda values can be used to correct for GC bias in downstream analyses such as differential accessibility or peak calling.

Examples

if (FALSE) { # \dontrun{
  gc_baseline_res <- get_gc_baseline(count_matrix = count_mat_raw, Lmat = L_mat_topic)
} # }