Filters an edgelist by probabilistically downsampling read counts and exports the results to parquet files. This simulates the effect of sequencing at lower depth.

Usage

downsample_to_parquet(pxl_file, outdir, components, fracs = seq(0.1, 0.9, 0.1))

Arguments

pxl_file

Path to the PXL file.

outdir

Directory to write the output parquet files. Created if it does not exist.

components

A character vector of component names to include. All components must be present in the PXL file.

fracs

A numeric vector of fractions to downsample by. Values must be strictly between 0 and 1. Default is seq(0.1, 0.9, 0.1).

Value

A tibble with columns:

fracs

The downsampling fraction.

pq_files

Path to the corresponding parquet file.

Details

For each edge with a given read_count, the probability of the edge being retained at fraction p is: $$P(\text{edge retained}) = 1 - (1 - p)^{\text{read\_count}}$$

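Equivalently, this is the probability that at least one of the edge's read_count reads survives if each read is kept independently with probability p. Edges with higher read counts are therefore more likely to be retained, which mimics the effect of sequencing at lower depth.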
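The retention rule can be illustrated with a minimal base R sketch. It is not the package's implementation; the helper `retention_prob()` and the toy `edges` data frame (including its `edge_id` column) are hypothetical names introduced only for this example.

```r
# Retention probability for an edge with `read_count` reads at fraction `p`,
# following the formula in Details. Hypothetical helper, not part of the package.
retention_prob <- function(read_count, p) {
  1 - (1 - p)^read_count
}

# Hypothetical toy edgelist; column names are illustrative only.
edges <- data.frame(
  edge_id = 1:5,
  read_count = c(1, 2, 5, 10, 50)
)

p <- 0.1
edges$prob <- retention_prob(edges$read_count, p)

# One Bernoulli draw per edge: keep the edge with its own retention probability.
set.seed(1)
edges$keep <- runif(nrow(edges)) < edges$prob
downsampled <- edges[edges$keep, ]
```

At p = 0.1, an edge with a single read is kept with probability 0.1, while an edge with 50 reads is kept with probability above 0.99, so low-count edges are the first to disappear as the fraction shrinks.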

The filtered edgelists are written to parquet files in outdir, one file per fraction. File names follow the pattern edgelist_001.parquet, edgelist_002.parquet, etc., corresponding to the order of fractions.
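The mapping between fractions and file names can be sketched in base R as follows. This only reconstructs the documented naming pattern under the assumption that the index is the zero-padded position of each fraction in fracs; the `outdir` value is a hypothetical path, and a plain data frame stands in for the returned tibble.

```r
fracs <- seq(0.1, 0.9, 0.1)
outdir <- file.path(tempdir(), "downsampled")  # hypothetical output directory

# Zero-padded index in the order of `fracs`, as in edgelist_001.parquet.
pq_files <- file.path(
  outdir,
  sprintf("edgelist_%03d.parquet", seq_along(fracs))
)

# Base R stand-in for the returned tibble with columns `fracs` and `pq_files`.
index <- data.frame(
  fracs = fracs,
  pq_files = pq_files,
  stringsAsFactors = FALSE
)
```

Because the index follows the order of fracs rather than their values, passing the fractions in a different order changes which file corresponds to which fraction; the fracs column of the returned table is the reliable way to look up a file.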

See also

lcc_sizes for computing LCC sizes from the output parquet files. lcc_curve for a high-level wrapper that combines downsampling and LCC computation.