<div>
  <a class="close-reveal-modal" aria-label="Close">&times;</a>
  <h4 class="modal-title" id="myModalLabel"> linux-64/covtrim-0.1.0-py310he53d0f1_0.tar.bz2</h4>
  <p class="subheader">&lt;table align=&#34;center&#34; style=&#34;margin: 0px auto;&#34;&gt;
  &lt;tr&gt;
    &lt;td&gt;
      &lt;img src=&#34;https://raw.githubusercontent.com/phac-nml/covtrim/refs/heads/main/extra/covtrim-logo.svg&#34; alt=&#34;covtrim Logo&#34; width=&#34;400&#34; height=&#34;auto&#34;/&gt;
    &lt;/td&gt;
    &lt;td&gt;
      &lt;h1&gt;CovTrim: Coverage-Based Downsampling of FASTQ Files&lt;/h1&gt;
&lt;a href=&#34;https://anaconda.org/gosahan/covtrim&#34;&gt;
  &lt;img src=&#34;https://anaconda.org/gosahan/covtrim/badges/version.svg&#34; alt=&#34;covtrim on Anaconda&#34;/&gt;
&lt;/a&gt;
&lt;a href=&#34;&#34;&gt;
  &lt;img src=&#34;https://anaconda.org/gosahan/covtrim/badges/platforms.svg&#34;/&gt;        
&lt;/a&gt;
&lt;a href=&#34;&#34;&gt;
  &lt;img src=&#34;https://anaconda.org/gosahan/covtrim/badges/latest_release_date.svg&#34;/&gt;        
&lt;/a&gt;
    &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;

# CovTrim: Technical Documentation and Implementation Guide

## Table of Contents
1. [Introduction](#1-introduction)
2. [Installation](#2-installation)
3. [Mathematical Framework](#3-mathematical-framework)
4. [Implementation Details](#4-implementation-details)
5. [Usage Guide](#5-usage-guide)
6. [Output Files](#6-output-files)
7. [Performance Considerations](#7-performance-considerations)
8. [Support](#8-support)
9. [License](#9-license)

## 1. Introduction

CovTrim is a bioinformatics tool for coverage-based downsampling of FASTQ files from amplicon sequencing data. It reads FASTQ files with pysam, estimates the mean coverage from the total read length and the theoretical sequencing space, and randomly samples reads to reach a target coverage.

## 2. Installation

```bash
# Using conda
conda install -c gosahan covtrim
```

Dependencies:
- Python ≥ 3.7 (this build, `py310he53d0f1_0`, requires Python 3.10)
- pandas
- numpy
- pysam

## 3. Mathematical Framework

### 3.1 Core Coverage Metrics

#### Theoretical Sequencing Space
The total sequencing space (TSS) is calculated based on the number of amplicons and their size:

$$TSS = N_a \times L_a$$

Where:
- $N_a$ = Number of amplicons
- $L_a$ = Amplicon length (bp)

Number of amplicons is calculated as:

$$N_a = \left\lceil \frac{G}{L_a} \right\rceil$$

Where:
- $G$ = Genome size (bp)

Code implementation:
```python
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    sequencing_space = num_amplicons * amplicon_size  # TSS calculation
    # ...

# In downsample_fastq:
num_amplicons = np.ceil(genome_size / amplicon_size)  # Na calculation
```
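
As a quick check of the two formulas above, the following sketch computes $N_a$ and $TSS$ for the default sizes and for the sizes used in the usage example. The helper name `sequencing_space_for` is an illustrative assumption, not part of CovTrim, and it uses `math.ceil` rather than `np.ceil` so that the amplicon count is an integer.

```python
import math

def sequencing_space_for(genome_size, amplicon_size):
    """Hypothetical helper: return (N_a, TSS) for the given sizes in bp."""
    num_amplicons = math.ceil(genome_size / amplicon_size)  # N_a
    sequencing_space = num_amplicons * amplicon_size         # TSS = N_a * L_a
    return num_amplicons, sequencing_space

# Defaults documented for CovTrim: 16000 bp genome, 500 bp amplicons
default_na, default_tss = sequencing_space_for(16000, 500)

# Sizes from the usage example: 29903 bp genome, 400 bp amplicons
custom_na, custom_tss = sequencing_space_for(29903, 400)

print(default_na, default_tss)  # 32 16000
print(custom_na, custom_tss)    # 75 30000
```

Because $N_a$ is rounded up, $TSS$ can be slightly larger than the genome size (30000 bp against 29903 bp in the second case).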

#### Mean Coverage Depth
The mean coverage depth ($\bar{C}$) is calculated as:

$$\bar{C} = \frac{\sum_{i=1}^{n} l_i}{TSS}$$

Where:
- $l_i$ = Length of read $i$
- $n$ = Total number of reads
- $TSS$ = Theoretical sequencing space

Code implementation:
```python
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    sequencing_space = num_amplicons * amplicon_size
    mean_coverage = total_bases / sequencing_space  # Mean coverage calculation
    # ...

# In downsample_fastq:
total_bases = df[&#39;len&#39;].sum()  # Sum of all read lengths
```
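
To make the formula concrete, here is a minimal sketch that computes the mean coverage from a list of read lengths. The helper name and the toy read lengths are assumptions made for illustration; CovTrim itself sums a `len` column of a DataFrame, as shown above.

```python
def mean_coverage_from_lengths(read_lengths, num_amplicons, amplicon_size):
    """Hypothetical helper: mean coverage = total bases / TSS."""
    total_bases = sum(read_lengths)
    sequencing_space = num_amplicons * amplicon_size
    return total_bases / sequencing_space

# Toy input (made up for illustration): 3200 reads of 500 bp each
toy_lengths = [500] * 3200
toy_coverage = mean_coverage_from_lengths(toy_lengths, num_amplicons=32, amplicon_size=500)
print(toy_coverage)  # 100.0
```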

#### Theoretical Reads per Amplicon
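The expected number of reads per amplicon, assuming reads are distributed uniformly across amplicons and the mean read length equals the amplicon length. Since $N_a \times L_a = TSS$, this value is numerically equal to the mean coverage depth $\bar{C}$: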

$$R_{amplicon} = \frac{\sum_{i=1}^{n} l_i}{N_a \times L_a}$$

Code implementation:
```python
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    theoretical_reads_per_amplicon = total_bases / (num_amplicons * amplicon_size)
```

### 3.2 Downsampling Calculations

#### Sampling Fraction
For a target coverage $C_t$, the sampling fraction ($f_s$) is calculated as:

$$f_s = \min(1, \frac{C_t}{\bar{C}})$$

Where:
- $C_t$ = Target coverage
- $\bar{C}$ = Current mean coverage

Code implementation:
```python
# In downsample_fastq:
sampling_fraction = min(1.0, target_coverage / current_coverage)
```
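
The effect of the `min` cap can be seen in a short sketch. The function name is an illustrative assumption; the expression inside it is the one shown above.

```python
def sampling_fraction_for(target_coverage, current_coverage):
    """Fraction of reads to keep, capped at 1.0 so reads are never duplicated."""
    return min(1.0, target_coverage / current_coverage)

print(sampling_fraction_for(30, 100.0))   # 0.3
print(sampling_fraction_for(150, 100.0))  # 1.0
```

When the target exceeds the current coverage, the fraction is 1.0 and every read is kept. A current coverage of zero (an empty input) would raise `ZeroDivisionError` in this sketch, so a guard may be needed in practice.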

#### Expected Coverage After Sampling
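The expected coverage after sampling ($C_e$) is $f_s \times \bar{C}$. The coverage actually achieved is computed from the sampled reads and matches this expectation only approximately, because read lengths vary:

$$C_e = f_s \times \bar{C} \approx \frac{\sum_{i \in S} l_i}{TSS}$$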

Where:
- $S$ = Set of sampled reads
- $l_i$ = Length of read $i$

Code implementation:
```python
# In downsample_fastq:
sampled_bases = sampled_reads[&#39;len&#39;].sum()
expected_coverage = sampled_bases / coverage_stats[&#39;sequencing_space&#39;]
```
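
The sketch below illustrates why the coverage measured after sampling is close to, but not necessarily equal to, $f_s \times \bar{C}$. It uses the standard library `random` module as a stand-in for the sampling step; the helper name, the rounding of the read count, and the toy read lengths are all assumptions made for illustration.

```python
import random

def sample_reads(read_lengths, sampling_fraction, seed=42):
    """Hypothetical stand-in for the sampling step: keep round(f_s * n) reads."""
    rng = random.Random(seed)
    keep = round(sampling_fraction * len(read_lengths))
    return rng.sample(read_lengths, keep)

# Toy read lengths (made up): equal numbers of 400 bp and 600 bp reads
toy_lengths = [400, 600] * 1600          # 3200 reads, 1,600,000 bases
sequencing_space = 32 * 500              # TSS for the default settings
current_coverage = sum(toy_lengths) / sequencing_space
sampled = sample_reads(toy_lengths, sampling_fraction=0.3)
realized_coverage = sum(sampled) / sequencing_space

print(len(sampled))                      # 960
print(current_coverage * 0.3)            # expectation f_s * mean coverage
print(realized_coverage)                 # close to the expectation, rarely identical
```

Fixing the seed makes the selection reproducible, which matches the purpose of the `random_seed` parameter.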

### 3.3 Quality Control Metrics

#### Coverage Uniformity
Coverage uniformity is assessed using the coefficient of variation:

$$CV = \frac{\sigma_C}{\mu_C}$$

Where:
- $\sigma_C$ = Standard deviation of coverage
- $\mu_C$ = Mean coverage
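
A minimal sketch of the coefficient of variation follows. It assumes that per-amplicon coverage values are already available and that the population standard deviation is intended; the text above does not specify either point, and the function name and toy values are illustrative.

```python
import statistics

def coefficient_of_variation(per_amplicon_coverage):
    """CV = population standard deviation / mean of the coverage values."""
    mean_cov = statistics.fmean(per_amplicon_coverage)
    std_cov = statistics.pstdev(per_amplicon_coverage)
    return std_cov / mean_cov

# Toy per-amplicon coverage values (made up for illustration)
uniform = [30.0, 30.0, 30.0, 30.0]
uneven = [10.0, 20.0, 40.0, 50.0]

print(coefficient_of_variation(uniform))  # 0.0
print(coefficient_of_variation(uneven))   # larger value, less uniform coverage
```

A CV of 0 means perfectly even coverage; larger values indicate that some amplicons are covered much more deeply than others.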

#### Amplicon Representation
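The representation of each amplicon is the ratio of observed to expected reads (written $\rho_{amplicon}$ here to distinguish it from the theoretical reads per amplicon, $R_{amplicon}$, in Section 3.1):

$$\rho_{amplicon} = \frac{N_{observed}}{N_{expected}}$$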

Where:
- $N_{observed}$ = Observed number of reads per amplicon
- $N_{expected}$ = Expected number of reads based on uniform distribution

Code implementation:
```python
# These metrics are calculated and included in the final report
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    # ...
    return {
        &#39;mean_coverage&#39;: mean_coverage,
        &#39;sequencing_space&#39;: sequencing_space,
        &#39;theoretical_reads_per_amplicon&#39;: theoretical_reads_per_amplicon
    }
```
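
The returned dictionary above does not include a per-amplicon breakdown, so the following sketch shows one way the representation ratio could be computed. It assumes that read counts per amplicon are already known (how reads are assigned to amplicons is not described here) and that the uniform expectation is the mean count; the function name and the counts are illustrative.

```python
def amplicon_representation(observed_counts):
    """Observed reads per amplicon divided by the uniform expectation."""
    expected = sum(observed_counts) / len(observed_counts)  # N_expected
    return [count / expected for count in observed_counts]

# Toy read counts for four amplicons (made up for illustration)
toy_counts = [50, 100, 100, 150]
ratios = amplicon_representation(toy_counts)
print(ratios)  # [0.5, 1.0, 1.0, 1.5]
```

A ratio of 1.0 means the amplicon received exactly its uniform share of reads; values below 1.0 indicate under-representation.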


## Performance Analysis

### Time Complexity
The overall time complexity of the main operations:

$$T(n) = O(n \log n)$$

Where $n$ is the number of reads, due to sorting operations in the sampling process.

### Memory Complexity
The space complexity for the core operations:

$$M(n) = O(n)$$

Where $n$ is the number of reads, primarily from storing read identifiers.

## 4. Implementation Details

### 4.2 Key Functions

#### calculate_amplicon_coverage
```python
def calculate_amplicon_coverage(total_bases, num_amplicons, amplicon_size):
    &#34;&#34;&#34;
    Calculate coverage metrics for amplicon sequencing
    
    Returns:
    - mean_coverage
    - sequencing_space
    - theoretical_reads_per_amplicon
    &#34;&#34;&#34;
```

#### downsample_fastq
```python
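def downsample_fastq(input_fastq, target_coverage, output_dir,
                    genome_size=16000, amplicon_size=500, random_seed=42):
    &#34;&#34;&#34;
    Main function for FASTQ downsampling

    input_fastq / output_dir: input FASTQ file and output directory
    target_coverage: target coverage depth (X)
    genome_size / amplicon_size: sizes in bp; random_seed: sampling seed
    &#34;&#34;&#34;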
```

## 5. Usage Guide

### 5.1 Command Line Interface
```bash
covtrim -i &lt;input.fastq&gt; -o &lt;output_dir&gt; -tc &lt;target_coverage&gt; [options]

Required Arguments:
  -i, --input          Input FASTQ file
  -o, --output         Output directory
  -tc, --target-coverage  Target coverage depth (X)

Optional Arguments:
  -gs, --genome-size   Size of target genome (default: 16000 bp)
  -as, --amplicon-size Size of amplicons (default: 500 bp)
  -s, --seed          Random seed (default: 42)
```
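
For readers who want to see how these options map onto values in Python, the sketch below rebuilds the documented interface with `argparse`. It is a hypothetical reconstruction from the option list above, not CovTrim's actual parser; in particular, the argument types (`float` for coverage, `int` for sizes and seed) are assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented CovTrim options."""
    parser = argparse.ArgumentParser(prog="covtrim")
    parser.add_argument("-i", "--input", required=True, help="Input FASTQ file")
    parser.add_argument("-o", "--output", required=True, help="Output directory")
    parser.add_argument("-tc", "--target-coverage", required=True, type=float,
                        help="Target coverage depth (X)")
    parser.add_argument("-gs", "--genome-size", type=int, default=16000,
                        help="Size of target genome in bp")
    parser.add_argument("-as", "--amplicon-size", type=int, default=500,
                        help="Size of amplicons in bp")
    parser.add_argument("-s", "--seed", type=int, default=42, help="Random seed")
    return parser

args = build_parser().parse_args(["-i", "sample.fastq", "-o", "output_dir", "-tc", "30"])
print(args.target_coverage, args.genome_size, args.amplicon_size, args.seed)
# 30.0 16000 500 42
```

Omitting the optional flags yields the documented defaults, as the printed values show.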

### 5.2 Example Usage
```bash
# Basic usage
covtrim -i sample.fastq -o output_dir -tc 30

# Specify custom genome and amplicon size
covtrim -i sample.fastq -o output_dir -tc 30 -gs 29903 -as 400
```

## 6. Output Files

### 6.1 Directory Structure
```
coverage_&lt;target&gt;X/
├── &lt;filename&gt;_&lt;target&gt;X.fastq      # Downsampled FASTQ file
└── &lt;filename&gt;_&lt;target&gt;X_report.md  # Analysis report
```
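
The layout above can be turned into concrete paths with a small helper, which may be useful when a pipeline needs to locate the outputs. This is a sketch under several assumptions that the layout does not confirm: that the coverage directory is created inside the output directory, that the filename placeholder is the input name without its extension, and that the target is written exactly as given on the command line. The helper name is hypothetical.

```python
from pathlib import Path

def expected_output_paths(input_fastq, output_dir, target_coverage):
    """Hypothetical helper: build the two output paths from the layout above."""
    stem = Path(input_fastq).stem               # "sample" for "sample.fastq"
    label = f"{target_coverage}X"               # e.g. "30X"
    run_dir = Path(output_dir) / f"coverage_{label}"
    return run_dir / f"{stem}_{label}.fastq", run_dir / f"{stem}_{label}_report.md"

fastq_path, report_path = expected_output_paths("sample.fastq", "output_dir", 30)
print(fastq_path.as_posix())   # output_dir/coverage_30X/sample_30X.fastq
print(report_path.as_posix())  # output_dir/coverage_30X/sample_30X_report.md
```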

### 6.2 Report Contents
- Analysis parameters
- Sequencing metrics
- Coverage calculations
- Sampling results
- Detailed methodology

## 7. Performance Considerations

### 7.1 Memory Usage
- Streaming FASTQ processing using pysam
- Only read headers stored in memory
- Linear memory complexity: O(n) where n is number of reads

### 7.2 Processing Speed
- Efficient indexing using pysam
- Fast random sampling with pandas
- Time complexity: O(n log n) for sorting operations

## 8. Support

For questions or issues, contact [Gurasis Osahan](mailto:gurasis.osahan@phac-aspc.gc.ca) at the National Microbiology Laboratory.

## 9. License

Licensed under Apache License, Version 2.0. See [LICENSE](http://www.apache.org/licenses/LICENSE-2.0) for details.

## Legal

**Copyright**: Government of Canada 

**Written by**: National Microbiology Laboratory, Public Health Agency of Canada

---

*Ensuring public health through advanced genomics. Developed with unwavering commitment and expertise by National Microbiology Laboratory, Public Health Agency of Canada.*</p>
  <div>
    <table class="full-width">
      <tbody>
        <tr>
          <td>Uploaded</td>
          <td>Tue Oct 29 22:32:23 2024</td>
        </tr>
        <tr>
          <td>md5 checksum</td>
          <td>56eb0c8a4bb608e7b84cb38acdc28ec8</td>
        </tr>
        
        
        
        <tr>
          <td>arch</td>
          <td>x86_64</td>
        </tr>
        
        
        
        <tr>
          <td>build</td>
          <td>py310he53d0f1_0</td>
        </tr>
        
        
        
        
        
        <tr>
          <td>depends</td>
          <td>numpy 1.24.3, pandas 1.5.3, pip, poetry-core, pysam &gt;=0.20.0, python &gt;=3.10,&lt;3.11.0a0, python_abi 3.10.* *_cp310, setuptools, wheel</td>
        </tr>
        
        
        
        <tr>
          <td>has_prefix</td>
          <td>True</td>
        </tr>
        
        
        
        <tr>
          <td>license</td>
          <td>Apache-2.0</td>
        </tr>
        
        
        
        <tr>
          <td>license_family</td>
          <td>Apache</td>
        </tr>
        
        
        
        <tr>
          <td>machine</td>
          <td>x86_64</td>
        </tr>
        
        
        
        <tr>
          <td>operatingsystem</td>
          <td>linux</td>
        </tr>
        
        
        
        <tr>
          <td>platform</td>
          <td>linux</td>
        </tr>
        
        
        
        <tr>
          <td>subdir</td>
          <td>linux-64</td>
        </tr>
        
        
        
        <tr>
          <td>target-triplet</td>
          <td>x86_64-any-linux</td>
        </tr>
        
        
        
        <tr>
          <td>timestamp</td>
          <td>1730240847734</td>
        </tr>
        
        
      </tbody>
    </table>
  </div>
</div><!-- /.modal-content -->