covtrim
Coverage-based FASTQ downsampling tool for amplicon sequencing data
Coverage-based FASTQ downsampling tool for amplicon sequencing data
To install this package, run one of the following:
|
|
CovTrim: Coverage-Based Downsampling of FASTQ Files |
CovTrim is a specialized bioinformatics tool designed for precise coverage-based downsampling of FASTQ files from amplicon sequencing data. It uses pysam for efficient FASTQ processing and implements robust algorithms for coverage calculation and read selection.
# Using conda
conda install -c gosahan covtrim
# Dependencies
- Python ≥ 3.7
- pandas
- numpy
- pysam
The total sequencing space (TSS) is calculated based on the number of amplicons and their size:
$$TSS = Na \times La$$
Where: - $Na$ = Number of amplicons - $La$ = Amplicon length (bp)
Number of amplicons is calculated as:
$$Na = \left\lceil \frac{G}{La} \right\rceil$$
Where: - $G$ = Genome size (bp)
Code implementation:
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
sequencing_space = num_amplicons * amplicon_size # TSS calculation
# ...
# In downsample_fastq:
num_amplicons = np.ceil(genome_size / amplicon_size) # Na calculation
The mean coverage depth ($\bar{C}$) is calculated as:
$$\bar{C} = \frac{\sum{i=1}^{n} li}{TSS}$$
Where: - $l_i$ = Length of read $i$ - $n$ = Total number of reads - $TSS$ = Theoretical sequencing space
Code implementation:
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
sequencing_space = num_amplicons * amplicon_size
mean_coverage = total_bases / sequencing_space # Mean coverage calculation
# ...
# In downsample_fastq:
total_bases = df['len'].sum() # Sum of all read lengths
The expected number of reads per amplicon assuming uniform distribution:
$$R{amplicon} = \frac{\sum{i=1}^{n} li}{Na \times L_a}$$
Code implementation:
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
theoretical_reads_per_amplicon = total_bases / (num_amplicons * amplicon_size)
For a target coverage $Ct$, the sampling fraction ($fs$) is calculated as:
$$fs = \min(1, \frac{Ct}{\bar{C}})$$
Where: - $C_t$ = Target coverage - $\bar{C}$ = Current mean coverage
Code implementation:
# In downsample_fastq:
sampling_fraction = min(1.0, target_coverage / current_coverage)
The expected coverage after sampling ($C_e$) is:
$$Ce = fs \times \bar{C} = \frac{\sum{i \in S} li}{TSS}$$
Where: - $S$ = Set of sampled reads - $l_i$ = Length of read $i$
Code implementation:
# In downsample_fastq:
sampled_bases = sampled_reads['len'].sum()
expected_coverage = sampled_bases / coverage_stats['sequencing_space']
Coverage uniformity is assessed using the coefficient of variation:
$$CV = \frac{\sigmaC}{\muC}$$
Where: - $\sigmaC$ = Standard deviation of coverage - $\muC$ = Mean coverage
The representation of each amplicon is calculated as:
$$R{amplicon} = \frac{N{observed}}{N_{expected}}$$
Where: - $N{observed}$ = Observed number of reads per amplicon - $N{expected}$ = Expected number of reads based on uniform distribution
Code implementation:
# These metrics are calculated and included in the final report
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
# ...
return {
'mean_coverage': mean_coverage,
'sequencing_space': sequencing_space,
'theoretical_reads_per_amplicon': theoretical_reads_per_amplicon
}
[Rest of the README remains the same...]
The overall time complexity of the main operations:
$$T(n) = O(n \log n)$$
Where $n$ is the number of reads, due to sorting operations in the sampling process.
The space complexity for the core operations:
$$M(n) = O(n)$$
Where $n$ is the number of reads, primarily from storing read identifiers.
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
"""
Calculate coverage metrics for amplicon sequencing
Returns:
- mean_coverage
- sequencing_space
- theoretical_reads_per_amplicon
"""
def downsample_fastq(input_fastq, target_coverage, output_dir,
genome_size=16000, amplicon_size=500, random_seed=42):
"""
Main function for FASTQ downsampling
"""
covtrim -i <input.fastq> -o <output_dir> -tc <target_coverage> [options]
Required Arguments:
-i, --input Input FASTQ file
-o, --output Output directory
-tc, --target-coverage Target coverage depth (X)
Optional Arguments:
-gs, --genome-size Size of target genome (default: 16000 bp)
-as, --amplicon-size Size of amplicons (default: 500 bp)
-s, --seed Random seed (default: 42)
# Basic usage
covtrim -i sample.fastq -o output_dir -tc 30
# Specify custom genome and amplicon size
covtrim -i sample.fastq -o output_dir -tc 30 -gs 29903 -as 400
coverage_<target>X/
├── <filename>_<target>X.fastq # Downsampled FASTQ file
└── <filename>_<target>X_report.md # Analysis report
For questions or issues, contact Gurasis Osahan at the National Microbiology Laboratory.
Licensed under Apache License, Version 2.0. See LICENSE for details.
Copyright: Government of Canada
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Ensuring public health through advanced genomics. Developed with unwavering commitment and expertise by National Microbiology Laboratory, Public Health Agency of Canada.
Summary
Coverage-based FASTQ downsampling tool for amplicon sequencing data
Last Updated
Oct 29, 2024 at 22:32
License
Apache-2.0
Total Downloads
7
Version Downloads
7
Supported Platforms
GitHub Repository
https://github.com/phac-nml/covtrimDocumentation
https://github.com/phac-nml/covtrim/blob/main/README.md