BiRD / packages / singlecellpipeline 0.8.0

104 total downloads
Last upload: 8 years and 2 months ago

Installers

linux-64 v0.8.0

conda install

To install this package run one of the following:
conda install bird::singlecellpipeline

Description

This pipeline provides unsupervised analysis of single-cell RNA-seq data. The fastq files are aligned to a reference genome with TopHat2 and the reads are counted with htseq-count. After filtering, the counts are normalized with DESeq2 and transformed with VST. Quality control is provided with FastQC. Finally, the unsupervised analysis is done with WGCNA.

Prerequisites

The computing grid is expected to run on a BeeGFS partition (or at least a multi-thread capable partition).
Miniconda3 is required.

Input data

The fastq(.gz) files need to be gathered in a directory. The path to this directory is specified in config.json.
A conditionSheet.csv file is also expected in this directory. It gathers the technical and functional information about the samples to be analysed.
- The first column is expected to be the functional names of the samples
- The second column is expected to be the technical names of the samples (without the fastq(.gz) extension)
- The remaining columns provide technical and functional information about the samples
Example of a conditionSheet.csv file:

~~~
Samplename,SeqID,samplenumber,Singlecell,libraryprep,Plate,sequencingrun,Familly,CultureBatch,Embryoscore,Embryonumber,TCmedia,Sampletype
L019H01,LD96sH1,,TRUE,1,2,1,NA,NA,NA,NA,TeSR,L019
WA09H10,LD96sH10,,TRUE,1,2,1,NA,NA,NA,NA,TeSR,WA09
WA09H11,LD96sH11,,TRUE,1,2,1,NA,NA,NA,NA,TeSR,WA09
WA09H12,LD96s_H12,,TRUE,1,2,1,NA,NA,NA,NA,TeSR,WA09
~~~
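Before launching the pipeline, it can be useful to check that every technical name in the second column has a matching fastq(.gz) file. The following is a minimal sketch, not part of the pipeline: the function name `missing_fastq` and the directory listing are hypothetical, and the paired-end naming follows the `fastqName.R1.fastq(.gz)` / `fastqName.R2.fastq(.gz)` pattern described for FASTQTYPE.

```python
import csv
import io


def missing_fastq(condition_sheet_text, fastq_names, fastq_type="singleEnd"):
    """Return the technical names (2nd column) that have no matching fastq(.gz) file."""
    reader = csv.reader(io.StringIO(condition_sheet_text))
    next(reader)  # skip the header row
    available = set(fastq_names)
    # Paired-end files are expected as name.R1.fastq(.gz) and name.R2.fastq(.gz)
    suffixes = [".R1", ".R2"] if fastq_type == "pairedEnd" else [""]
    missing = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        seq_id = row[1]
        for suffix in suffixes:
            name = seq_id + suffix + ".fastq"
            if name not in available and name + ".gz" not in available:
                missing.append(seq_id)
                break
    return missing


sheet = """Samplename,SeqID,samplenumber,Singlecell,libraryprep,Plate,sequencingrun,Familly,CultureBatch,Embryoscore,Embryonumber,TCmedia,Sampletype
L019H01,LD96sH1,,TRUE,1,2,1,NA,NA,NA,NA,TeSR,L019
WA09H10,LD96sH10,,TRUE,1,2,1,NA,NA,NA,NA,TeSR,WA09
"""

# Hypothetical directory listing: only the first sample has a fastq file.
print(missing_fastq(sheet, ["LD96sH1.fastq.gz"]))  # ['LD96sH10']
```

In practice the list of file names could come from `os.listdir` on the FASTQPATH directory.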

A csv file which contains genes of interest (optional)
- The first column is expected to be the name of the genes
- The second column is expected to be the gene group
Example of a specificMarkers.csv file:

~~~
GENE,Lineage
CDX2,"TE"
CLDN10,"TE"
DAB2,"TE"
~~~
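To illustrate the expected layout (header, genes in the first column, gene group in the second), here is a small sketch that reads such a file and groups the genes by their group. The helper name `markers_by_group` is hypothetical and is not part of the pipeline.

```python
import csv
import io
from collections import defaultdict


def markers_by_group(markers_text):
    """Map each gene group (2nd column) to the list of its genes (1st column)."""
    reader = csv.reader(io.StringIO(markers_text))
    next(reader)  # skip the header row
    groups = defaultdict(list)
    for row in reader:
        if len(row) >= 2:  # ignore blank or incomplete lines
            groups[row[1]].append(row[0])
    return dict(groups)


markers = 'GENE,Lineage\nCDX2,"TE"\nCLDN10,"TE"\nDAB2,"TE"\n'
print(markers_by_group(markers))  # {'TE': ['CDX2', 'CLDN10', 'DAB2']}
```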

Installation of the virtual environments

~~~
conda create -n myVirtualEnvironment singlecellpipeline -c bird -c conda-forge -c bioconda -c r
conda info --env # To get the path of the directory of myVirtualEnvironment : myCondaPath
cd myCondaPath/singlecellpipeline/
conda env create -f virtualEnvs/TophatEnv.yml
~~~

config.json

~~~
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| FASTQPATH: | directory of the fastq data |
| | fastq files will be moved to the subdirectory fastqSE/ for single-end or the subdirectory fastqPE/ for paired-end |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| FASTQTYPE: | "singleEnd" for single-end fastq files and "pairedEnd" for paired-end fastq files |
| | for paired-end files, the expected name pattern is fastqName.R1.fastq(.gz) and fastqName.R2.fastq(.gz) |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| GENEREFERENCE: | path of the gtf file for the analysis |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| BOWTIEINDEX: | path of the bowtie fasta reference for the analysis (ex: "/mnt/beegfs/ylelievre/singlecell/index-bowtie-2.2.4/humang1kv37") |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| TOPHATCPU: | number of threads used by TopHat2 |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| TRIMMINGGENEMIN: | for a valid sample, the minimum number of genes |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| TRIMMINGQ30MIN: | for a valid sample, the minimum Q30 |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| TRIMMINGCOUNTSMIN: | for a valid gene, the minimum number of reads |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| TRIMMINGSAMPLESMIN: | for a valid gene, the minimum number of samples that contain at least one read |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| ISA2SEEDS: | isa2 parameter, it corresponds to the number of seeds (starting points) for the search of biclusters |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| ISA2ROWSTRINGENCY: | isa2 parameter, it corresponds to an arbitrary stringency value for the correlation between samples |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| ISA2COLUMNSTRINGENCY: | isa2 parameter, it corresponds to an arbitrary stringency value for the correlation between genes |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| WGCNAPOWER: | WGCNA parameter, it corresponds to the soft-thresholding power (optional: calculated automatically if not provided) |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| WGCNASPECIES: | WGCNA parameter, it corresponds to the 2-letter species abbreviation. org.XX.eg.db R annotation package must be installed |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| WGCNA_MARKERS: | WGCNA parameter, it corresponds to the file which contains genes of interest (optional). |
| | format: csv, header, 1st col = genes, 2nd col = gene group (ex: the tissue the gene is specific to) |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------|
~~~
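As an illustration of how these parameters might be assembled, the sketch below builds a config.json from Python. It assumes the file is a flat JSON object keyed by the parameter names listed above; that layout, the value types, and every value except the BOWTIEINDEX example are assumptions rather than documented defaults.

```python
import json

# All values are hypothetical placeholders unless stated otherwise.
config = {
    "FASTQPATH": "/path/to/fastq",               # directory holding the fastq(.gz) files
    "FASTQTYPE": "singleEnd",                    # or "pairedEnd"
    "GENEREFERENCE": "/path/to/annotation.gtf",
    # Example path given in the parameter table
    "BOWTIEINDEX": "/mnt/beegfs/ylelievre/singlecell/index-bowtie-2.2.4/humang1kv37",
    "TOPHATCPU": 4,
    "TRIMMINGGENEMIN": 1000,
    "TRIMMINGQ30MIN": 75,
    "TRIMMINGCOUNTSMIN": 10,
    "TRIMMINGSAMPLESMIN": 3,
    "ISA2SEEDS": 100,
    "ISA2ROWSTRINGENCY": 2,
    "ISA2COLUMNSTRINGENCY": 2,
    "WGCNAPOWER": 6,                             # described as optional
    "WGCNASPECIES": "Hs",                        # requires the org.Hs.eg.db R package
    "WGCNA_MARKERS": "/path/to/specificMarkers.csv",  # described as optional
}

# Basic sanity check on the only parameter with a documented set of values.
if config["FASTQTYPE"] not in ("singleEnd", "pairedEnd"):
    raise ValueError('FASTQTYPE must be "singleEnd" or "pairedEnd"')

config_text = json.dumps(config, indent=4)
print(config_text)
```

Writing `config_text` to a file named config.json in the pipeline directory would complete the step; check the config.json shipped with the package for the authoritative layout.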

Execution of the pipeline

~~~
source activate myVirtualEnvironment
snakemake -p --latency-wait 60 --cluster "qsub -o ./logs/ -e ./logs/" --jobs 100 --jobscript singlecell.sh
~~~

Output data

fastQC : this directory contains the results obtained with FastQC on the fastq(.gz) files
BAM : this directory contains the bam files obtained with TopHat2 from the fastq(.gz) files
counts : this directory contains the counts files obtained with htseq-count from the bam files
QC : this directory contains the results of the quality control
analysis : this directory contains the results of the different analyses of the data
- the quality control overviews
- the PCA
- the biclusters
- WGCNA
tables : this directory contains the raw counts table, the trimmed counts table with its corresponding trimmed condition sheet, the counts table normalized with DESeq2 and the counts table transformed with VST
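The trimmed counts table results from the TRIMMINGGENEMIN, TRIMMINGCOUNTSMIN and TRIMMINGSAMPLESMIN thresholds of config.json. The sketch below shows one plausible reading of those thresholds on a toy table. It is an assumption, not the pipeline's code: the order of the two filters, the use of total reads per gene for TRIMMINGCOUNTSMIN, and the function name `trim_counts` are illustrative, and the Q30 filter is left out because it needs sequencing quality data.

```python
def trim_counts(counts, samples, gene_min, counts_min, samples_min):
    """Filter samples, then genes, of a raw counts table.

    counts: dict mapping gene -> list of read counts, one per sample,
            in the same order as `samples`.
    """
    # Keep samples in which at least `gene_min` genes have one read or more.
    kept_idx = [
        j for j in range(len(samples))
        if sum(1 for row in counts.values() if row[j] > 0) >= gene_min
    ]
    trimmed = {}
    for gene, row in counts.items():
        kept = [row[j] for j in kept_idx]
        total_reads = sum(kept)
        samples_with_reads = sum(1 for c in kept if c > 0)
        # Keep genes with enough reads overall and detected in enough samples.
        if total_reads >= counts_min and samples_with_reads >= samples_min:
            trimmed[gene] = kept
    return [samples[j] for j in kept_idx], trimmed


samples = ["s1", "s2", "s3"]
counts = {"geneA": [5, 0, 3], "geneB": [0, 0, 1], "geneC": [2, 0, 4]}
kept_samples, trimmed = trim_counts(
    counts, samples, gene_min=2, counts_min=5, samples_min=2
)
print(kept_samples)  # ['s1', 's3']
print(trimmed)       # {'geneA': [5, 3], 'geneC': [2, 4]}
```

Here sample s2 is dropped because no gene is detected in it, and geneB is dropped because it has a single read among the remaining samples.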
