covtrim

Coverage-based FASTQ downsampling tool for amplicon sequencing data

Installation

To install this package, run one of the following:

Conda

$ conda install gosahan::covtrim

Description

CovTrim: Coverage-Based Downsampling of FASTQ Files

CovTrim: Technical Documentation and Implementation Guide

Introduction
Installation
Mathematical Framework
Implementation Details
Usage Guide
Output Files
Performance Considerations
Support
License

1. Introduction

CovTrim is a specialized bioinformatics tool designed for precise coverage-based downsampling of FASTQ files from amplicon sequencing data. It uses pysam for efficient FASTQ processing and implements robust algorithms for coverage calculation and read selection.

2. Installation

# Using conda
conda install -c gosahan covtrim

# Dependencies
- Python ≥ 3.7
- pandas
- numpy
- pysam

3. Mathematical Framework

3.1 Core Coverage Metrics

Theoretical Sequencing Space

The total sequencing space (TSS) is calculated based on the number of amplicons and their size:

$$TSS = N_a \times L_a$$

Where:
- $N_a$ = Number of amplicons
- $L_a$ = Amplicon length (bp)

Number of amplicons is calculated as:

$$N_a = \left\lceil \frac{G}{L_a} \right\rceil$$

Where:
- $G$ = Genome size (bp)

Code implementation:

def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    sequencing_space = num_amplicons * amplicon_size  # TSS calculation
    # ...

# In downsample_fastq:
num_amplicons = np.ceil(genome_size / amplicon_size)  # Na calculation
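
As a quick check of the two formulas above, the following sketch computes the number of amplicons and the theoretical sequencing space for the default parameters (16000 bp genome, 500 bp amplicons) and for the custom values used in the usage examples (29903 bp, 400 bp). The helper name `sequencing_space` is hypothetical and not part of CovTrim, and `math.ceil` stands in for `np.ceil` so that the amplicon count is an integer.

```python
import math
from typing import Tuple


def sequencing_space(genome_size: int, amplicon_size: int) -> Tuple[int, int]:
    """Return (number of amplicons, theoretical sequencing space in bp)."""
    if genome_size <= 0 or amplicon_size <= 0:
        raise ValueError("genome_size and amplicon_size must be positive")
    num_amplicons = math.ceil(genome_size / amplicon_size)  # N_a = ceil(G / L_a)
    return num_amplicons, num_amplicons * amplicon_size     # TSS = N_a * L_a


print(sequencing_space(16000, 500))  # (32, 16000)
print(sequencing_space(29903, 400))  # (75, 30000)
```

Because of the ceiling, the sequencing space can be slightly larger than the genome itself (30000 bp versus 29903 bp in the second case).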

Mean Coverage Depth

The mean coverage depth ($\bar{C}$) is calculated as:

$$\bar{C} = \frac{\sum_{i=1}^{n} l_i}{TSS}$$

Where:
- $l_i$ = Length of read $i$
- $n$ = Total number of reads
- $TSS$ = Theoretical sequencing space

Code implementation:

def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    sequencing_space = num_amplicons * amplicon_size
    mean_coverage = total_bases / sequencing_space  # Mean coverage calculation
    # ...

# In downsample_fastq:
total_bases = df['len'].sum()  # Sum of all read lengths
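
A small worked example may make the mean coverage calculation concrete. The function below is an illustrative stand-in (its name and list-based input are assumptions, not CovTrim's API) that sums read lengths and divides by the sequencing space.

```python
def mean_coverage(read_lengths, num_amplicons, amplicon_size):
    """Mean coverage depth: total sequenced bases divided by the sequencing space."""
    sequencing_space = num_amplicons * amplicon_size
    if sequencing_space <= 0:
        raise ValueError("sequencing space must be positive")
    return sum(read_lengths) / sequencing_space


# 3,200 reads of 500 bp give 1,600,000 bases over a 16,000 bp sequencing space.
lengths = [500] * 3200
print(mean_coverage(lengths, num_amplicons=32, amplicon_size=500))  # 100.0
```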

Theoretical Reads per Amplicon

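The expected number of reads per amplicon, assuming reads are distributed uniformly across amplicons:

$$R_{amplicon} = \frac{\sum_{i=1}^{n} l_i}{N_a \times L_a}$$

Because $TSS = N_a \times L_a$, this expression evaluates to the same value as $\bar{C}$. It equals a read count when each read spans one full amplicon of length $L_a$, since then $\sum_{i=1}^{n} l_i = n \times L_a$ and $R_{amplicon} = n / N_a$.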

Code implementation:

def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    theoretical_reads_per_amplicon = total_bases / (num_amplicons * amplicon_size)

3.2 Downsampling Calculations

Sampling Fraction

For a target coverage $C_t$, the sampling fraction ($f_s$) is calculated as:

$$f_s = \min\left(1, \frac{C_t}{\bar{C}}\right)$$

Where:
- $C_t$ = Target coverage
- $\bar{C}$ = Current mean coverage

Code implementation:

# In downsample_fastq:
sampling_fraction = min(1.0, target_coverage / current_coverage)
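
The effect of the `min` clamp is easiest to see with two inputs, one above and one below the target. This is a minimal sketch; the function wrapper and the zero-coverage guard are additions for illustration, not taken from the CovTrim source.

```python
def sampling_fraction(target_coverage, current_coverage):
    """Fraction of reads to keep so that coverage drops to the target (never above 1)."""
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    return min(1.0, target_coverage / current_coverage)


print(sampling_fraction(30, 100))  # 0.3  (keep 30% of reads)
print(sampling_fraction(30, 20))   # 1.0  (already below target, keep everything)
```

When the input is already below the target coverage, the fraction is clamped to 1, so all reads are kept and the output coverage stays below the target.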

Expected Coverage After Sampling

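The expected coverage after sampling ($C_e$) is:

$$C_e = f_s \times \bar{C} \approx \frac{\sum_{i \in S} l_i}{TSS}$$

Where:
- $S$ = Set of sampled reads
- $l_i$ = Length of read $i$

The right-hand side is the coverage realized by the sampled reads. It approximates $f_s \times \bar{C}$ rather than equalling it, because random sampling does not preserve the total base count exactly when read lengths vary.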

Code implementation:

# In downsample_fastq:
sampled_bases = sampled_reads['len'].sum()
expected_coverage = sampled_bases / coverage_stats['sequencing_space']
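
To illustrate how the realized coverage follows from a seeded random sample, here is a sketch that uses only the standard library. It is an assumption-laden stand-in: CovTrim is described as sampling with pandas, whereas this example uses `random.Random.sample`, and the function name and dictionary input are hypothetical.

```python
import random


def sample_reads(read_lengths, fraction, seed=42):
    """Pick round(n * fraction) read names at random, reproducibly for a given seed.

    read_lengths maps read name -> read length (bp).
    """
    rng = random.Random(seed)
    names = sorted(read_lengths)            # fixed order so the seed fully determines the result
    k = round(len(names) * fraction)
    return rng.sample(names, k)


reads = {f"read{i}": 500 for i in range(3200)}   # current coverage: 100X over 16,000 bp
sampled = sample_reads(reads, fraction=0.3)
realized = sum(reads[name] for name in sampled) / 16000
print(len(sampled), realized)  # 960 30.0
```

All reads have the same length here, so the realized coverage matches the target exactly; with variable read lengths it would differ slightly from run to run of different seeds.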

3.3 Quality Control Metrics

Coverage Uniformity

Coverage uniformity is assessed using the coefficient of variation:

$$CV = \frac{\sigma_C}{\mu_C}$$

Where:
- $\sigma_C$ = Standard deviation of coverage
- $\mu_C$ = Mean coverage
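
The coefficient of variation can be sketched as follows. The text does not say which coverage values it is computed over or which standard deviation is used, so this example assumes a list of per-amplicon coverage values and the population standard deviation; both choices, and the function name, are assumptions.

```python
import statistics


def coefficient_of_variation(coverages):
    """CV = standard deviation / mean of a list of coverage values (population SD assumed)."""
    mean = statistics.mean(coverages)
    if mean == 0:
        raise ValueError("mean coverage is zero; CV is undefined")
    return statistics.pstdev(coverages) / mean


print(coefficient_of_variation([30, 30, 30, 30]))            # 0.0  (perfectly uniform)
print(round(coefficient_of_variation([10, 20, 30, 40]), 3))  # 0.447
```

A CV of 0 means every amplicon has the same coverage; larger values indicate less uniform coverage.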

Amplicon Representation

The representation of each amplicon is calculated as:

$$R_{amplicon} = \frac{N_{observed}}{N_{expected}}$$

Where:
- $N_{observed}$ = Observed number of reads per amplicon
- $N_{expected}$ = Expected number of reads based on uniform distribution

Code implementation:

# These metrics are calculated and included in the final report
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    # ...
    return {
        'mean_coverage': mean_coverage,
        'sequencing_space': sequencing_space,
        'theoretical_reads_per_amplicon': theoretical_reads_per_amplicon
    }
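
The representation ratio defined in this subsection can be illustrated with a short sketch. It assumes per-amplicon read counts are already available (for example from an alignment step that is not described here) and takes the expected count to be the total divided evenly across amplicons; the function name and input format are hypothetical.

```python
def amplicon_representation(observed_counts):
    """Return observed / expected read count per amplicon under a uniform expectation.

    observed_counts maps amplicon name -> observed number of reads.
    """
    total = sum(observed_counts.values())
    if not observed_counts or total == 0:
        raise ValueError("need at least one amplicon with reads")
    expected = total / len(observed_counts)   # uniform expectation per amplicon
    return {amp: n / expected for amp, n in observed_counts.items()}


counts = {"amp1": 50, "amp2": 100, "amp3": 150}
print(amplicon_representation(counts))  # {'amp1': 0.5, 'amp2': 1.0, 'amp3': 1.5}
```

Values below 1 mark under-represented amplicons and values above 1 mark over-represented ones.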

Performance Analysis

Time Complexity

The overall time complexity of the main operations:

$$T(n) = O(n \log n)$$

Where $n$ is the number of reads, due to sorting operations in the sampling process.

Memory Complexity

The space complexity for the core operations:

$$M(n) = O(n)$$

Where $n$ is the number of reads, primarily from storing read identifiers.

4.2 Key Functions

calculate_amplicon_coverage

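def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    """
    Calculate coverage metrics for amplicon sequencing

    Returns a dict with:
    - mean_coverage
    - sequencing_space
    - theoretical_reads_per_amplicon
    """
    sequencing_space = num_amplicons * amplicon_size  # TSS
    mean_coverage = total_bases / sequencing_space
    theoretical_reads_per_amplicon = total_bases / (num_amplicons * amplicon_size)
    return {
        'mean_coverage': mean_coverage,
        'sequencing_space': sequencing_space,
        'theoretical_reads_per_amplicon': theoretical_reads_per_amplicon
    }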

downsample_fastq

def downsample_fastq(input_fastq, target_coverage, output_dir, 
                    genome_size=16000, amplicon_size=500, random_seed=42):
    """
    Main function for FASTQ downsampling
    """

5. Usage Guide

5.1 Command Line Interface

covtrim -i <input.fastq> -o <output_dir> -tc <target_coverage> [options]

Required Arguments:
  -i, --input          Input FASTQ file
  -o, --output         Output directory
  -tc, --target-coverage  Target coverage depth (X)

Optional Arguments:
  -gs, --genome-size   Size of target genome (default: 16000 bp)
  -as, --amplicon-size Size of amplicons (default: 500 bp)
  -s, --seed          Random seed (default: 42)
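
For readers who want to see how the documented flags map onto Python values, the sketch below rebuilds the interface with `argparse`. It mirrors only the flags and defaults listed above; the argument types (for example `float` for the target coverage) and the help strings are assumptions, and this is not CovTrim's actual parser.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags and defaults documented above."""
    parser = argparse.ArgumentParser(
        prog="covtrim",
        description="Coverage-based FASTQ downsampling for amplicon sequencing data",
    )
    parser.add_argument("-i", "--input", required=True, help="Input FASTQ file")
    parser.add_argument("-o", "--output", required=True, help="Output directory")
    parser.add_argument("-tc", "--target-coverage", type=float, required=True,
                        help="Target coverage depth (X)")
    parser.add_argument("-gs", "--genome-size", type=int, default=16000,
                        help="Size of target genome in bp")
    parser.add_argument("-as", "--amplicon-size", type=int, default=500,
                        help="Size of amplicons in bp")
    parser.add_argument("-s", "--seed", type=int, default=42, help="Random seed")
    return parser


args = build_parser().parse_args(["-i", "sample.fastq", "-o", "output_dir", "-tc", "30"])
print(args.target_coverage, args.genome_size, args.amplicon_size, args.seed)
# 30.0 16000 500 42
```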

5.2 Example Usage

# Basic usage
covtrim -i sample.fastq -o output_dir -tc 30

# Specify custom genome and amplicon size
covtrim -i sample.fastq -o output_dir -tc 30 -gs 29903 -as 400

6. Output Files

6.1 Directory Structure

coverage_<target>X/
├── <filename>_<target>X.fastq      # Downsampled FASTQ file
└── <filename>_<target>X_report.md  # Analysis report
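
The naming pattern above can be sketched with `pathlib`. Several details are assumptions made for illustration: that `coverage_<target>X/` is created inside the output directory, that `<filename>` is the input file name without its extension, and that the target is printed without a trailing `.0`. The helper name `output_paths` is hypothetical.

```python
from pathlib import Path


def output_paths(input_fastq, output_dir, target_coverage):
    """Build (downsampled FASTQ path, report path) following the documented naming pattern."""
    label = f"{target_coverage:g}X"                    # e.g. 30 -> "30X"
    stem = Path(input_fastq).stem                      # "sample.fastq" -> "sample"
    run_dir = Path(output_dir) / f"coverage_{label}"
    return run_dir / f"{stem}_{label}.fastq", run_dir / f"{stem}_{label}_report.md"


fastq_path, report_path = output_paths("sample.fastq", "output_dir", 30)
print(fastq_path.as_posix())   # output_dir/coverage_30X/sample_30X.fastq
print(report_path.as_posix())  # output_dir/coverage_30X/sample_30X_report.md
```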

6.2 Report Contents

- Analysis parameters
- Sequencing metrics
- Coverage calculations
- Sampling results
- Detailed methodology

7. Performance Considerations

7.1 Memory Usage

- Streaming FASTQ processing using pysam
- Only read headers stored in memory
- Linear memory complexity: O(n), where n is the number of reads
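
One way to keep only read identifiers in memory is a two-pass approach: collect names on the first pass, choose which to keep, then stream the file again and write the selected records. The sketch below shows that idea under stated assumptions. It reads uncompressed four-line FASTQ records with plain file I/O instead of pysam, assumes read names are unique, and uses hypothetical function names; it is not CovTrim's implementation.

```python
import random
from itertools import islice


def iter_fastq(path):
    """Yield (read name, list of 4 raw lines) for each record in an uncompressed FASTQ file."""
    with open(path) as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            name = record[0][1:].split()[0]   # drop '@' and any description after whitespace
            yield name, record


def downsample_two_pass(in_path, out_path, fraction, seed=42):
    """Write a random subset of reads to out_path and return how many were written."""
    # Pass 1: only read names are held in memory, never sequences or qualities.
    names = [name for name, _ in iter_fastq(in_path)]
    keep = set(random.Random(seed).sample(names, round(len(names) * fraction)))
    # Pass 2: stream the file again and copy the selected records.
    written = 0
    with open(out_path, "w") as out:
        for name, record in iter_fastq(in_path):
            if name in keep:
                out.writelines(record)
                written += 1
    return written
```

Memory use grows with the number of reads (the name list and the `keep` set) but not with read length, which is consistent with the O(n) figure above.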

7.2 Processing Speed

- Efficient indexing using pysam
- Fast random sampling with pandas
- Time complexity: O(n log n) for sorting operations

8. Support

For questions or issues, contact Gurasis Osahan at the National Microbiology Laboratory.

9. License

Licensed under Apache License, Version 2.0. See LICENSE for details.

Legal

Copyright: Government of Canada

Written by: National Microbiology Laboratory, Public Health Agency of Canada

Ensuring public health through advanced genomics. Developed with unwavering commitment and expertise by National Microbiology Laboratory, Public Health Agency of Canada.

About

Last Updated

Oct 29, 2024 at 22:32

License

Apache-2.0

Supported Platforms

linux-64

Home

https://github.com/phac-nml/covtrim

GitHub Repository

https://github.com/phac-nml/covtrim

Documentation

https://github.com/phac-nml/covtrim/blob/main/README.md
