Coverage-based FASTQ downsampling tool for amplicon sequencing data
|
CovTrim: Coverage-Based Downsampling of FASTQ Files |
CovTrim is a specialized bioinformatics tool designed for precise coverage-based downsampling of FASTQ files from amplicon sequencing data. It uses pysam for efficient FASTQ processing and implements robust algorithms for coverage calculation and read selection.
# Using conda
conda install -c gosahan covtrim
# Dependencies
- Python ≥ 3.7
- pandas
- numpy
- pysam
The total sequencing space (TSS) is calculated based on the number of amplicons and their size:
$$TSS = Na \times La$$
Where: - $Na$ = Number of amplicons - $La$ = Amplicon length (bp)
Number of amplicons is calculated as:
$$Na = \left\lceil \frac{G}{La} \right\rceil$$
Where: - $G$ = Genome size (bp)
Code implementation:
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
sequencing_space = num_amplicons * amplicon_size # TSS calculation
# ...
# In downsample_fastq:
num_amplicons = np.ceil(genome_size / amplicon_size) # Na calculation
The mean coverage depth ($\bar{C}$) is calculated as:
$$\bar{C} = \frac{\sum{i=1}^{n} li}{TSS}$$
Where: - $l_i$ = Length of read $i$ - $n$ = Total number of reads - $TSS$ = Theoretical sequencing space
Code implementation:
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
sequencing_space = num_amplicons * amplicon_size
mean_coverage = total_bases / sequencing_space # Mean coverage calculation
# ...
# In downsample_fastq:
total_bases = df['len'].sum() # Sum of all read lengths
The expected number of reads per amplicon assuming uniform distribution:
$$R{amplicon} = \frac{\sum{i=1}^{n} li}{Na \times L_a}$$
Code implementation:
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
theoretical_reads_per_amplicon = total_bases / (num_amplicons * amplicon_size)
For a target coverage $Ct$, the sampling fraction ($fs$) is calculated as:
$$fs = \min(1, \frac{Ct}{\bar{C}})$$
Where: - $C_t$ = Target coverage - $\bar{C}$ = Current mean coverage
Code implementation:
# In downsample_fastq:
sampling_fraction = min(1.0, target_coverage / current_coverage)
The expected coverage after sampling ($C_e$) is:
$$Ce = fs \times \bar{C} = \frac{\sum{i \in S} li}{TSS}$$
Where: - $S$ = Set of sampled reads - $l_i$ = Length of read $i$
Code implementation:
# In downsample_fastq:
sampled_bases = sampled_reads['len'].sum()
expected_coverage = sampled_bases / coverage_stats['sequencing_space']
Coverage uniformity is assessed using the coefficient of variation:
$$CV = \frac{\sigmaC}{\muC}$$
Where: - $\sigmaC$ = Standard deviation of coverage - $\muC$ = Mean coverage
The representation of each amplicon is calculated as:
$$R{amplicon} = \frac{N{observed}}{N_{expected}}$$
Where: - $N{observed}$ = Observed number of reads per amplicon - $N{expected}$ = Expected number of reads based on uniform distribution
Code implementation:
# These metrics are calculated and included in the final report
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
# ...
return {
'mean_coverage': mean_coverage,
'sequencing_space': sequencing_space,
'theoretical_reads_per_amplicon': theoretical_reads_per_amplicon
}
[Rest of the README remains the same...]
The overall time complexity of the main operations:
$$T(n) = O(n \log n)$$
Where $n$ is the number of reads, due to sorting operations in the sampling process.
The space complexity for the core operations:
$$M(n) = O(n)$$
Where $n$ is the number of reads, primarily from storing read identifiers.
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
"""
Calculate coverage metrics for amplicon sequencing
Returns:
- mean_coverage
- sequencing_space
- theoretical_reads_per_amplicon
"""
def downsample_fastq(input_fastq, target_coverage, output_dir,
genome_size=16000, amplicon_size=500, random_seed=42):
"""
Main function for FASTQ downsampling
"""
covtrim -i <input.fastq> -o <output_dir> -tc <target_coverage> [options]
Required Arguments:
-i, --input Input FASTQ file
-o, --output Output directory
-tc, --target-coverage Target coverage depth (X)
Optional Arguments:
-gs, --genome-size Size of target genome (default: 16000 bp)
-as, --amplicon-size Size of amplicons (default: 500 bp)
-s, --seed Random seed (default: 42)
# Basic usage
covtrim -i sample.fastq -o output_dir -tc 30
# Specify custom genome and amplicon size
covtrim -i sample.fastq -o output_dir -tc 30 -gs 29903 -as 400
coverage_<target>X/
├── <filename>_<target>X.fastq # Downsampled FASTQ file
└── <filename>_<target>X_report.md # Analysis report
For questions or issues, contact Gurasis Osahan at the National Microbiology Laboratory.
Licensed under Apache License, Version 2.0. See LICENSE for details.
Copyright: Government of Canada
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Ensuring public health through advanced genomics. Developed with unwavering commitment and expertise by National Microbiology Laboratory, Public Health Agency of Canada.