Prepare GC content input for GC-effect correction — prep_gcEffects

This function computes the GC content of fixed-width genomic peak regions, weighting the positions within each region by either a uniform ("ladder") or a smoothed ("tricube") scheme. It returns a list containing the GC content, the genomic regions, and the supplied read counts, ready for downstream GC-effect correction or modeling.

prep_gcEffects_input(
  rc,
  peak_names,
  genome = "hg19",
  peakwidth = 501,
  gctype = c("ladder", "tricube"),
  verbose = TRUE
)

Arguments

rc: A numeric vector, matrix, or data frame representing read counts (e.g., from ATAC-seq or ChIP-seq). This can be sparse or dense, and should align with the provided peak_names.
peak_names: A character vector of peak names, typically in the format "chr_start_end", e.g., "chr1_12345_12845".
genome: A character string specifying the genome build to use. Must correspond to an installed BSgenome package (e.g., "hg19" for BSgenome.Hsapiens.UCSC.hg19; "hg38" and "mm10" are other common builds). Default is "hg19".
peakwidth: An integer specifying the width of each peak (default: 501). Used to determine the smoothing window for GC content calculation.
gctype: A character string specifying the GC weighting scheme. Options are "ladder" (uniform weights) or "tricube" (smooth kernel weights). Default is "ladder".
verbose: A logical flag, conventionally used to toggle progress messages. Default is TRUE.
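
As an illustration of the "chr_start_end" naming convention expected for peak_names, the sketch below splits such names into coordinates using base R only. The helper parse_peak_names() is hypothetical and not part of this package, and treating start and end as inclusive coordinates is an assumption; the function's own parsing may differ.

```r
# Hypothetical helper (not part of the package): split "chr_start_end"
# peak names into a data frame of coordinates.
parse_peak_names <- function(peak_names) {
  # Anchor on the last two numeric fields so that sequence names that
  # themselves contain underscores (e.g. "chr1_gl000191_random") still parse.
  pattern <- "^(.+)_([0-9]+)_([0-9]+)$"
  ok <- grepl(pattern, peak_names)
  if (!all(ok)) {
    stop("Malformed peak name(s): ", paste(peak_names[!ok], collapse = ", "))
  }
  data.frame(
    chr   = sub(pattern, "\\1", peak_names),
    start = as.integer(sub(pattern, "\\2", peak_names)),
    end   = as.integer(sub(pattern, "\\3", peak_names)),
    stringsAsFactors = FALSE
  )
}

peaks <- parse_peak_names(c("chr1_12345_12845", "chr2_200000_200500"))

# Assumption: coordinates are inclusive, so width = end - start + 1.
widths <- peaks$end - peaks$start + 1L
```

Under the inclusive-coordinate assumption, both example peaks are 501 bp wide, which matches the default peakwidth.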

Value

A list containing:

gc: A numeric vector of GC content values for each region.
region: A GRanges object representing the genomic regions.
rc: The input read count data, returned for convenience.

Details

The function extracts the sequence of each genomic region with getSeq() from a BSgenome object, then computes that region's GC content by summing the weights at positions holding a G or C nucleotide. The weights are scaled to sum to 1, so each GC content value lies between 0 and 1.
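
The weighting described above can be sketched in base R on a plain character string, without BSgenome. This is a minimal illustration, not the package's implementation: weighted_gc() is a hypothetical helper, "ladder" is taken as uniform weights as stated in the Arguments, and the exact form of the tricube kernel is an assumption.

```r
# Hypothetical helper (not part of the package): weighted GC content of a
# single DNA string.
weighted_gc <- function(dna, gctype = c("ladder", "tricube")) {
  gctype <- match.arg(gctype)
  bases <- strsplit(toupper(dna), "", fixed = TRUE)[[1]]
  n <- length(bases)

  if (gctype == "ladder") {
    # Uniform weights across the region.
    w <- rep(1, n)
  } else {
    # Assumed tricube kernel centred on the region midpoint; scaled
    # distances u stay strictly inside (-1, 1), so all weights are positive.
    half <- (n + 1) / 2
    u <- (seq_len(n) - half) / half
    w <- (1 - abs(u)^3)^3
  }

  # Scale the weights to sum to 1, then add up those at G or C positions.
  w <- w / sum(w)
  sum(w[bases %in% c("G", "C")])
}

gc_uniform <- weighted_gc("AAGGAA", gctype = "ladder")
gc_tricube <- weighted_gc("AAGGAA", gctype = "tricube")
```

Because the tricube kernel gives more weight to central positions, a region whose G and C bases sit near its midpoint, as in "AAGGAA", scores higher under "tricube" than under uniform weights.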

Examples

if (FALSE) { # \dontrun{
library(BSgenome.Hsapiens.UCSC.hg19)
peaks <- c("chr1_100000_100500", "chr2_200000_200500")
rc <- c(50, 75)
res <- prep_gcEffects_input(rc = rc, peak_names = peaks, genome = "hg19", gctype = "tricube")
head(res$gc)
} # }