Skip to contents

Computes an approximate saturation curve for nodes and edges in an edgelist by estimating the expected number of retained elements at various downsampling fractions. This provides insight into how sequencing depth affects saturation without requiring actual downsampling.

Usage

approximate_saturation_curve(
  db,
  fracs = seq(0.04, 0.96, by = 0.04),
  components = NULL,
  node_reads_multiplier = 2,
  verbose = TRUE
)

Arguments

db

A PixelDB object with an active DuckDB connection.

fracs

A numeric vector of fractions to downsample the edgelist. Values must be strictly between 0 and 1. Default is seq(0.04, 0.96, by = 0.04).

components

Optional character vector of component names to compute saturation for. If NULL (default), all components are included.

node_reads_multiplier

Numeric; multiplier for the read count when calculating node saturation. Default is 2, reflecting that each edge contributes to two nodes. Optionally, set to 1 to use the same read count for nodes and edges, which may be more appropriate in certain contexts.

verbose

Logical; if TRUE (default), print progress messages.

Value

A tibble with the following columns:

component

The component (cell) identifier.

read_count

The estimated number of reads after downsampling.

p

The downsampling fraction.

nodes

The estimated number of nodes retained.

edges

The estimated number of edges retained.

degree

The estimated average node degree.

node_saturation

The node saturation at the given fraction.

edge_saturation

The edge saturation at the given fraction.

Details

Downsampling model

The downsampling is performed probabilistically using expected values rather than actual random sampling. For each edge with a given read_count, the probability of the edge being retained at fraction p is: $$P(\text{edge retained}) = 1 - (1 - p)^{\text{read\_count}}$$

For nodes, the probability of retention depends on the node's degree. A node is retained if at least one of its incident edges is retained: $$P(\text{node retained}) = 1 - (1 - p_{edges})^{\text{degree}}$$ where \(p_{edges}\) is the fraction of edges retained.

Saturation calculation

The saturation values are calculated using: $$S = 1 - \frac{N_{downsampled}}{R_{downsampled}}$$

where \(N_{downsampled}\) is the number of nodes or edges in the downsampled graph, and \(R_{downsampled}\) is the number of reads supporting those elements. Since each read supports two nodes (one at each endpoint), \(R_{downsampled, nodes} = 2 \times R_{downsampled, edges}\).

Implementation notes

This function requires that approximate_node_saturation has been called first (either directly or through this function) to create the nodes temporary view in the DuckDB connection.

See also

approximate_node_saturation and approximate_edge_saturation for computing saturation at full depth. lcc_curve for computing largest connected component sizes at different downsampling fractions.

Examples

if (FALSE) { # \dontrun{
library(ggplot2)
library(dplyr)

pxl_file <- minimal_pna_pxl_file()
db <- PixelDB$new(pxl_file)

sat_curve <- approximate_saturation_curve(db, fracs = seq(0.1, 0.9, by = 0.1))

# Plot node saturation curve
sat_curve %>%
  tidyr::pivot_longer(
    cols = c(node_saturation, edge_saturation),
    names_to = "type",
    values_to = "saturation"
  ) %>%
  ggplot(aes(read_count, saturation, color = component)) +
  geom_line() +
  geom_point() +
  labs(x = "Reads", y = "Saturation") +
  theme_minimal() +
  facet_grid(~type) +
  scale_y_continuous(limits = c(0, 1))

db$close()
} # }