Nextflow for WGS

Nextflow has been set up for Cedars-Sinai on the compbio cluster (esplhpccompbio-lv01.csmc.edu). If you have access to this cluster, you can load Nextflow and Apptainer directly with these commands:

module load nextflow
module load singularity-apptainer/1.1.6

It is also recommended to set the directories used for the Singularity container cache and for temporary storage:

export NXF_SINGULARITY_CLI=apptainer
export NXF_SINGULARITY_CACHEDIR=/common/group_folder/data/project_folder/singularity_cache
export SINGULARITY_TMPDIR=/common/group_folder/data/project_folder
export SINGULARITY_CACHEDIR=/common/group_folder/data/project_folder/singularity_cache

export TMPDIR=/common/group_folder/projects/temp/project_folder
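
It is safest to make sure these directories exist and are writable before the first run, since a missing temporary directory tends to surface only as a failure partway through a container pull. A minimal sketch, assuming the placeholder paths above have been replaced with real ones; ensure_dirs is a hypothetical helper name, not something provided by the cluster or by Nextflow:

```shell
# ensure_dirs is a hypothetical helper: create each directory passed to it
# and confirm that the current user can write there.
ensure_dirs() {
    for dir in "$@"; do
        mkdir -p "$dir" || return 1
        if [ ! -w "$dir" ]; then
            echo "Not writable: $dir" >&2
            return 1
        fi
    done
}

# Intended use, after the export lines above:
#   ensure_dirs "$NXF_SINGULARITY_CACHEDIR" "$SINGULARITY_TMPDIR" "$TMPDIR"
```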

Next, create a samplesheet.csv file describing the samples you are using for WGS analysis. It is a comma-separated file whose columns give the patient ID, the sample name, the lane, and the paths of the paired FASTQ files. Each row represents one pair of FASTQ files.

patient,sample,lane,fastq_1,fastq_2
ID1,S1,L001,ID1_S1_L001_R1_001.fastq.gz,ID1_S1_L001_R2_001.fastq.gz
ID1,S1,L002,ID1_S1_L002_R1_001.fastq.gz,ID1_S1_L002_R2_001.fastq.gz
ID1,S2,L001,ID1_S2_L001_R1_001.fastq.gz,ID1_S2_L001_R2_001.fastq.gz
ID2,S1,L001,ID2_S1_L001_R1_001.fastq.gz,ID2_S1_L001_R2_001.fastq.gz
ID2,S1,L002,ID2_S1_L002_R1_001.fastq.gz,ID2_S1_L002_R2_001.fastq.gz
ID2,S2,L001,ID2_S2_L001_R1_001.fastq.gz,ID2_S2_L001_R2_001.fastq.gz
ID3,S1,L001,ID3_S1_L001_R1_001.fastq.gz,ID3_S1_L001_R2_001.fastq.gz
ID3,S1,L002,ID3_S1_L002_R1_001.fastq.gz,ID3_S1_L002_R2_001.fastq.gz
ID3,S2,L001,ID3_S2_L001_R1_001.fastq.gz,ID3_S2_L001_R2_001.fastq.gz

This sample sheet format supports multi-lane and multi-sample runs as well as tumor/normal pairings.
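
For many samples, the sheet can be generated from the file names instead of being typed by hand. The sketch below assumes every FASTQ follows the naming shown in the example (patient_sample_lane_R1_001.fastq.gz with a matching R2 file) and that the patient and sample names contain no underscores. make_samplesheet is a hypothetical helper, not part of Sarek:

```shell
# make_samplesheet is a hypothetical helper. It assumes file names of the form
#   <patient>_<sample>_<lane>_R1_001.fastq.gz  (with a matching _R2_ file).
make_samplesheet() {
    fastq_dir=$1
    echo "patient,sample,lane,fastq_1,fastq_2"
    for r1 in "$fastq_dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue                  # glob matched nothing
        r2=${r1%_R1_001.fastq.gz}_R2_001.fastq.gz
        if [ ! -e "$r2" ]; then
            echo "Skipping $r1: no matching R2 file" >&2
            continue
        fi
        name=$(basename "$r1" _R1_001.fastq.gz)   # e.g. ID1_S1_L001
        patient=${name%%_*}                       # text before the first underscore
        rest=${name#*_}
        sample=${rest%%_*}                        # text between first and second underscore
        lane=${rest#*_}                           # everything after the second underscore
        echo "$patient,$sample,$lane,$r1,$r2"
    done
}
```

Calling make_samplesheet /path/to/fastq_folder > samplesheet.csv writes one row per R1/R2 pair, with the FASTQ paths prefixed by the folder you passed in.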

With this samplesheet.csv you are ready to run Nextflow using the command shown further down. The sample sheet can also carry two optional columns: the sex of the patient and the status of each sample (0 for normal, 1 for tumor, as in the example). This is the full sample sheet:

patient,sex,status,sample,lane,fastq_1,fastq_2
patient1,XX,0,normal_sample,lane_1,test_L001_1.fastq.gz,test_L001_2.fastq.gz
patient1,XX,0,normal_sample,lane_2,test_L002_1.fastq.gz,test_L002_2.fastq.gz
patient1,XX,0,normal_sample,lane_3,test_L003_1.fastq.gz,test_L003_2.fastq.gz
patient1,XX,1,tumor_sample,lane_1,test2_L001_1.fastq.gz,test2_L001_2.fastq.gz
patient1,XX,1,tumor_sample,lane_2,test2_L002_1.fastq.gz,test2_L002_2.fastq.gz
patient1,XX,1,relapse_sample,lane_1,test3_L001_1.fastq.gz,test3_L001_2.fastq.gz
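
A quick check of the sheet before launching can catch a missing column or a mistyped path early. The sketch below is one possible check, assuming the FASTQ paths are the last two columns as in both layouts above. check_samplesheet is a hypothetical helper and is much less thorough than the validation Sarek performs itself:

```shell
# check_samplesheet is a hypothetical helper, not part of Sarek.
# It reports rows whose column count differs from the header row
# and FASTQ paths (last two columns) that do not exist.
check_samplesheet() {
    sheet=$1
    problems=$(
        awk -F, 'NR == 1 { n = NF; next }
                 NF != n { printf "line %d: expected %d columns, found %d\n", NR, n, NF }' "$sheet"
        tail -n +2 "$sheet" | awk -F, 'NF >= 2 { print $(NF-1); print $NF }' |
        while IFS= read -r fq; do
            [ -e "$fq" ] || echo "missing file: $fq"
        done
    )
    if [ -n "$problems" ]; then
        echo "$problems" >&2
        return 1
    fi
}
```

check_samplesheet samplesheet.csv returns a non-zero status and prints the problems to standard error if anything looks wrong. Relative FASTQ paths are resolved from the directory where the check is run.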

These sample sheets are used when running the WGS pipeline from the mapping stage. You can also start the pipeline from any of the later stages: duplicate marking, preparing recalibration tables, base quality score recalibration, variant calling, or annotation. The sample sheet looks different for each stage and should be updated accordingly. For more information, please refer to this page, which contains very detailed information on using the Nextflow Sarek pipeline.

The nf-core/sarek pipeline is a best-practice, production-ready pipeline for germline and somatic variant calling from whole genome sequencing (WGS) or whole exome sequencing (WES) data. Built on Nextflow, it supports containers (Docker/Singularity), cloud computing, and HPC environments, which makes it reproducible and scalable.

Run the pipeline with the following command:

nextflow run nf-core/sarek \
   -profile singularity \
   --input samplesheet.csv \
   --outdir {path_for_your_results_folder}/results

This is the default command line that you can use when you don’t want to change any parameters.

  • -profile singularity is used because we use Singularity to run Nextflow on the HPC cluster. Note the single dash: this is a Nextflow option, not a pipeline parameter.

  • --input is the samplesheet.csv that you created for your samples using the format above.

  • --outdir is the folder where Nextflow saves all results of the pipeline.

Note: The default genome is GATK.GRCh38. To use a different genome, pass its reference ID with --genome. The reference IDs for the available genomes can be found here.

Key pipeline options

Additional Parameters   Description
--genome                Genome build (e.g. GRCh38, GRCh37)
--tools                 Comma-separated list of variant callers
--somatic               Enables somatic calling (requires tumor/normal pairs)
--germline              Enables germline variant calling
--step                  Run from a specific step (mapping, variant_calling, etc.)
--saveReference         Saves intermediate reference files (useful for large-scale runs)
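
Additional parameters are appended to the default command. The sketch below wraps that command in a small function so that the full command line can be printed and reviewed before anything is launched. run_sarek and DRY_RUN are hypothetical names, results stands in for your own output folder, and the tool identifiers passed to --tools are assumptions that should be checked against the Sarek version you run:

```shell
# run_sarek is a hypothetical wrapper around the default command.
# Extra parameters are appended; DRY_RUN=1 prints the command instead of running it.
run_sarek() {
    set -- nextflow run nf-core/sarek \
        -profile singularity \
        --input samplesheet.csv \
        --outdir results \
        "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print the command for a GRCh37 run with two variant callers, without launching it.
DRY_RUN=1 run_sarek --genome GATK.GRCh37 --tools haplotypecaller,strelka
```

Leaving out DRY_RUN=1 launches the pipeline with the same parameters.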

Tools used in Sarek

Step                  Tools
QC                    FastQC, MultiQC, BCFtools stats
Trimming (optional)   FastP
Alignment             bwa-mem (default), bwa-mem2, dragmap, sentieon-bwamem
MarkDuplicates        GATK MarkDuplicates, Sentieon LocusCollector and Sentieon Dedup
Base Recalibration    GATK BaseRecalibrator and GATK ApplyBQSR
Variant Calling       GATK HaplotypeCaller (germline), Mutect2 (somatic), Strelka2, FreeBayes, VarDict
Annotation            VEP (Variant Effect Predictor)
Structural Variant    Manta (optional)

Extensive quality control tools also run with the minimal parameters above. You can enable additional tools depending on your end goals. Please refer to the detailed tutorial developed by the Nextflow developers here.

Results

The results folder will contain the alignment, annotation, and variant calling files. It will also contain all the files generated by the quality control steps, such as MultiQC and FastQC. For more information about the results generated, navigate to the “Results” section.