This function computes the GC content of genomic peak regions, applying a position weighting scheme (either uniform "ladder" or smoothed "tricube") across each fixed-width region. It returns a list containing the GC content, the genomic regions, and the supplied read counts, ready for downstream GC-effect correction or modeling.

prep_gcEffects_input(
  rc,
  peak_names,
  genome = "hg19",
  peakwidth = 501,
  gctype = c("ladder", "tricube"),
  verbose = TRUE
)

Arguments

rc

A numeric vector, matrix, or data frame of read counts (e.g., from ATAC-seq or ChIP-seq). It may be sparse or dense, and its entries should correspond one-to-one with the entries of peak_names.

peak_names

A character vector of peak names, typically in the format "chr_start_end", e.g., "chr1_12345_12845".
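As an illustration of the "chr_start_end" naming format, the following sketch splits peak names into chromosome, start, and end using base R only. The helper name parse_peak_names is hypothetical and is not part of this package; the width calculation assumes 1-based, inclusive coordinates, which is an assumption consistent with the 501 bp example above rather than something the documentation states.

```r
# Hypothetical helper (not part of the package): split "chr_start_end"
# peak names into their components using base R only.
parse_peak_names <- function(peak_names) {
  # Anchor on the last two underscore-separated integer fields so that
  # chromosome names containing underscores are kept intact.
  pattern <- "^(.+)_([0-9]+)_([0-9]+)$"
  ok <- grepl(pattern, peak_names)
  if (!all(ok)) {
    stop("Unparseable peak names: ", paste(peak_names[!ok], collapse = ", "))
  }
  data.frame(
    chr   = sub(pattern, "\\1", peak_names),
    start = as.integer(sub(pattern, "\\2", peak_names)),
    end   = as.integer(sub(pattern, "\\3", peak_names)),
    stringsAsFactors = FALSE
  )
}

peaks <- parse_peak_names(c("chr1_12345_12845",
                            "chr1_KI270706v1_random_100_600"))

# Assumption: coordinates are 1-based and inclusive.
peaks$width <- peaks$end - peaks$start + 1L
```

A data frame in this shape could then be converted to a GRanges object, for example with GenomicRanges::makeGRangesFromDataFrame(); whether the function itself does this internally is not documented here.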

genome

A character string specifying the genome build to use (e.g., "hg19", "hg38", "mm10"). The corresponding BSgenome package (e.g., BSgenome.Hsapiens.UCSC.hg19 for "hg19") must be installed. Default is "hg19".

peakwidth

An integer specifying the width of each peak in base pairs (default: 501). It sets the length of the weighting window used in the GC content calculation.

gctype

A character string specifying the GC weighting scheme. Options are "ladder" (uniform weights) or "tricube" (smooth kernel weights). Default is "ladder".
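To make the two weighting schemes concrete, here is a minimal sketch of how such weights could be built for a window of peakwidth bases. The helper gc_weights is hypothetical; the "ladder" branch simply follows the "uniform weights" description above, and the "tricube" branch uses the standard tricube kernel, which is an assumption about the exact formula used internally.

```r
# Hypothetical helper (not part of the package): position weights for a
# window of `peakwidth` bases, normalized to sum to 1.
gc_weights <- function(peakwidth = 501, gctype = c("ladder", "tricube")) {
  gctype <- match.arg(gctype)  # first choice ("ladder") is the default
  if (gctype == "ladder") {
    # Uniform weights, following the description in this documentation.
    w <- rep(1, peakwidth)
  } else {
    # Assumed form: standard tricube kernel evaluated at each base,
    # with positions mapped into the open interval (-1, 1).
    half <- peakwidth / 2
    u <- seq(-half + 0.5, half - 0.5) / half
    w <- (1 - abs(u)^3)^3
  }
  w / sum(w)
}

w_ladder  <- gc_weights(501, "ladder")
w_tricube <- gc_weights(501, "tricube")
```

Under these assumptions the tricube weights peak at the center of the window and taper smoothly toward its edges, so bases near the peak center contribute more to the GC estimate than bases near the boundaries.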

verbose

A logical value; if TRUE, progress messages are printed. Default is TRUE.

Value

A list containing:

gc

A numeric vector of GC content values for each region.

region

A GRanges object representing the genomic regions.

rc

The input read count data, returned for convenience.

Details

The function extracts the sequence of each genomic region using getSeq() from a BSgenome object and computes the GC content of the region by summing the weights at positions occupied by G or C nucleotides. The weights are scaled to sum to 1, so each GC value lies between 0 and 1.
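The weighted sum described above can be sketched on a plain character string, without BSgenome. The helper weighted_gc is hypothetical and only illustrates the arithmetic; in the actual function the sequences come from getSeq(), and the internal implementation may differ.

```r
# Hypothetical helper (not part of the package): weighted GC fraction of a
# single sequence given one weight per base.
weighted_gc <- function(seq_string, weights) {
  bases <- strsplit(toupper(seq_string), "")[[1]]
  stopifnot(length(bases) == length(weights))
  weights <- weights / sum(weights)  # normalize so the result lies in [0, 1]
  sum(weights[bases %in% c("G", "C")])
}

seq_string <- "ATGCGCAT"

# Uniform weights reduce to the plain GC fraction (4 of 8 bases).
gc_uniform <- weighted_gc(seq_string, rep(1, nchar(seq_string)))

# Center-heavy weights emphasize the G/C bases in the middle of the window.
gc_center <- weighted_gc(seq_string, c(1, 2, 3, 4, 4, 3, 2, 1))
```

With uniform weights the value is 0.5; with the center-heavy weights it rises to 0.7, because the G and C bases in this toy sequence sit in the middle of the window.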

Examples

if (FALSE) { # \dontrun{
library(BSgenome.Hsapiens.UCSC.hg19)
peaks <- c("chr1_100000_100500", "chr2_200000_200500")
rc <- c(50, 75)
res <- prep_gcEffects_input(rc = rc, peak_names = peaks, genome = "hg19", gctype = "tricube")
head(res$gc)
} # }