
Downsample edgelist and export to parquet files
downsample_to_parquet.Rd
Filters an edgelist by probabilistically downsampling read counts and exports the results to parquet files. This simulates the effect of sequencing at lower depth.
Usage
downsample_to_parquet(pxl_file, outdir, components, fracs = seq(0.1, 0.9, 0.1))
Arguments
- pxl_file
Path to the PXL file.
- outdir
Directory to write the output parquet files. Created if it does not exist.
- components
A character vector of component names to include. All components must be present in the PXL file.
- fracs
A numeric vector of fractions to downsample by. Values must be strictly between 0 and 1. Default is
seq(0.1, 0.9, 0.1).
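The constraint on fracs can be checked before calling the function. The helper below is a minimal sketch of such a check; check_fracs is a hypothetical name used for illustration and is not part of the documented API, and the function's own validation may differ.

```r
# Hypothetical helper, not part of the documented API: verifies that every
# fraction is a non-missing number strictly between 0 and 1.
check_fracs <- function(fracs) {
  stopifnot(
    is.numeric(fracs),
    length(fracs) > 0,
    !anyNA(fracs),
    all(fracs > 0 & fracs < 1)
  )
  invisible(TRUE)
}

check_fracs(seq(0.1, 0.9, 0.1))  # the default passes silently
try(check_fracs(c(0.5, 1)))      # 1 is not strictly below 1, so this errors
```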
Value
A tibble with columns:
- fracs
The downsampling fraction.
- pq_files
Path to the corresponding parquet file.
Details
For each edge with a given read_count, the probability of the edge being
retained at fraction p is:
$$P(\text{edge retained}) = 1 - (1 - p)^{\text{read\_count}}$$
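This is the probability that at least one of the edge's reads survives when each read is kept independently with probability p. Edges with higher read counts are therefore more likely to be retained, which models the effect of sequencing at lower depth.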
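The retention formula can be explored with a short base R sketch. It is illustrative only: retention_prob and simulate_retention are hypothetical names, and the simulation assumes that each read is kept independently with probability p, which may not be how the function is implemented internally.

```r
# Closed-form retention probability from the formula in Details.
retention_prob <- function(read_count, p) {
  1 - (1 - p)^read_count
}

# Assumed model for illustration: thin the reads of an edge binomially and
# keep the edge if at least one read survives.
simulate_retention <- function(read_count, p, n_sim = 1e5) {
  kept_reads <- rbinom(n_sim, size = read_count, prob = p)
  mean(kept_reads > 0)
}

set.seed(1)
read_counts <- c(1, 2, 5, 10)
p <- 0.3

# Compare the closed form with the simulated retention rate per read count.
data.frame(
  read_count  = read_counts,
  closed_form = retention_prob(read_counts, p),
  simulated   = vapply(read_counts, simulate_retention, numeric(1), p = p)
)
```

With p = 0.3, an edge supported by a single read is retained with probability 0.3, while an edge supported by 10 reads is retained with probability of about 0.97.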
The filtered edgelists are written to parquet files in outdir, one file
per fraction. File names follow the pattern edgelist_001.parquet,
edgelist_002.parquet, etc., corresponding to the order of fractions.
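The naming pattern can be reproduced as follows. This is a sketch that assumes a three-digit, zero-padded index following the order of fracs, inferred from the two example names above; the outdir value is a hypothetical placeholder.

```r
fracs <- seq(0.1, 0.9, 0.1)
outdir <- file.path(tempdir(), "downsampled")  # hypothetical output directory

# Assumption: three-digit, zero-padded index in the order of `fracs`.
pq_files <- file.path(
  outdir,
  sprintf("edgelist_%03d.parquet", seq_along(fracs))
)

# Same shape as the documented return value: one row per fraction.
data.frame(fracs = fracs, pq_files = pq_files)
```

Under this assumption, the first fraction (0.1) maps to edgelist_001.parquet and the last (0.9) to edgelist_009.parquet.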