Step 1: Calculate QC metrics on raw data
In this step, Fastqc, quality control (QC) metrics are calculated for the raw sequencing data. This is done using the tool FastQC. This tool will run a series of tests on the input file. The output is a text file containing the output data which is used to create a summary in the form of several HTML pages with graphs for each test. Both the text file and the HTML document provide a flag for each test: pass, warning or fail. This flag is based on criteria set by the makers of this tool. Warnings or even failures do not necessarily mean that there is a problem with the data, only that it is unusual compared to the used criteria. It is possible that the biological nature of the sample means that this particular bias is to be expected.
Toolname: FastQC
Scriptname: Fastqc
Input: FastQ files (${filePrefix}${lane}${barcode}.fq.gz)
Output: ${filePrefix}.fastqc.zip archive containing amongst others the HTML document and the text file
fastqc \
fastq1.gz \
fastq2.gz \
-o outputDirectory
Step 2: Read alignment against reference sequence
In this step, the Hisat Aligner is used to align the (mostly paired end) sequencing data to the reference genome. The output is a SAM file.
Scriptname: HisatAlignment
Input: raw sequence file in the form of a gzipped fastq file (${filePrefix}.fq.gz) Output: SAM formatted file (${filePrefix}.sam)
Toolname: Hisat
hisat -x ${hisatIndex} \
${input} \
-p 8 \
--rg-id ${externalSampleID} \
--rg PL:illumina \
--rg PU:${sequencer}_${flowcell}_${run}_${lane}_${barcode} \
--rg LB:${sequencer}_${flowcell}_${run}_${lane}_${barcode} \
--rg SM:${externalSampleID} \
-S ${tmpAlignedSam} > ${intermediateDir}/${externalSampleID}_L${lane}.hisat.log 2>&1
Step 3: Add Or Replace Readgroup
This step adds readgroup information to the reads in a bamfile. This additional meta data inproves the mark duplicate step by making it possible the identify different sequence runs for indiviual samples, but also various technical features that are associated with artifacts.
Toolname: Picard AddOrReplaceReadGroups
ToolVersion: 1.130-Java-1.7.0_80
Input: BAM files from step 2 ${filePrefix}_${barcode}.sorted.bam)
Output: merged BAM file ${filePrefix}_${barcode}.sorted.rg.bam)
java -Xmx6g -XX:ParallelGCThreads=8 -jar ${EBROOTPICARD}/${picardJar} AddOrReplaceReadGroups \
INPUT=${sortedBam} \
OUTPUT=${tmpAddOrReplaceGroupsBam} \
SORT_ORDER=coordinate \
RGID=${externalSampleID} \
RGLB=${sequencer}_${flowcell}_${run}_${lane}_${barcode} \
RGPL=ILLUMINA \
RGPU=${sequencer}_${flowcell}_${run}_${lane}_${barcode} \
RGSM=${externalSampleID} \
RGDT=$(date --rfc-3339=date) \
CREATE_INDEX=true \
MAX_RECORDS_IN_RAM=4000000 \
TMP_DIR=${tempDir}
Step 4: Merge BAM and build index
To improve the coverage of sequence alignments, a sample can be sequenced on multiple lanes and/or flowcells. If this is the case for the sample(s) being analyzed, this step merges all BAM files of one sample and indexes this new file. If there is just one BAM file for a sample, nothing happens.
Toolname: Picard mergeBam
ToolVersion: 1.130-Java-1.7.0_80
Input: BAM files from step 2 ${filePrefix}_${barcode}.sorted.bam)
Output: merged BAM file (${sample}. sorted .merged.bam)
java -XX:ParallelGCThreads=4 -jar -Xmx6g ${EBROOTPICARD}/${picardJar} ${mergeSamFilesJar} \
${INPUTS[@]} \
SORT_ORDER=coordinate \
CREATE_INDEX=true \
USE_THREADING=true \
TMP_DIR=${tempDir} \
MAX_RECORDS_IN_RAM=6000000 \
VALIDATION_STRINGENCY=LENIENT \
OUTPUT=${tmpSampleMergedBam}
Step 5: Mark duplicates + creating dedup metrics
In this step, the BAM file is examined to locate duplicate reads, using Picard MarkDuplicates. A mapped read is considered to be duplicate if the start and end base of two or more reads are located at the same chromosomal position in comparison to the reference genome. For paired-end data the start and end locations for both ends need to be the same to be called duplicate. One read pair of these duplicates is kept, the remaining ones are flagged as being duplicate.
Toolname: Picard MarkDuplicates
Scriptname: MarkDuplicates
Input: Merged BAM file from generated in step 3 (${sample}.sorted.merged.bam)
Output: BAM file with duplicates flagged (${sample}.sorted.merged dedup.bam) BAM index file (${sample}.dedup.bam.bai) Dedup metrics file (${sample}.merged.dedup.metrics)
java -XX:ParallelGCThreads=4 -jar -Xmx6g ${EBROOTPICARD}/${picardJar} MarkDuplicates \
I=${sampleMergedBam} \
O=${tmpSampleMergedDedupBam} \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=LENIENT \
M=${dupStatMetrics} AS=true
Step 6: Calculate alignment QC metrics
In this step, QC metrics are calculated for the alignments created in the previous steps. This is done using several QC related tools: • ${PICARD}/CollectRnaSeqMetrics • ${GCC}/gentrap_graph_seqgc (GC bias plot) • ${SAMTOOLS}/flagstat • ${PICARD}/MarkDuplicates • ${PICARD}/CollectInsertSizeMetrics (only paired end)
Toolname: several Picard or Samtools QC tools ScriptVersions: Samtools/1.2-goolf-1.7.20, picard-tools/1.130-Java-1.7.0_80 Input: BAM file generated in step 4 Output: collectrnaseqmetrics, alignmentmetrics, gcbiasplots, insertsizemetrics, dedupmetrics, (text files and matching PDF files)
These metrics are later used to create tables and graphs (step 9). The Picard tools also output a PDF version of the data themselves, containing graphs.
Step 7: HTSeq Count, Gene expression quantification
Before gene expression quantification SAMtools was used to sort the aligned reads by name. The gene level quantification was performed by HTSeq-0.6.1p1 using --mode=union --stranded=no|yes and, Ensembl version 75 was used as gene annotation database.
toolName: HTSeq Count toolVersion: HTSeq/0.6.1p1, Samtools/1.2-goolf-1.7.20 Input: BAM file Output: Textfile with gene level quantification per sample.
samtools \
view -h \
${sampleMergedBam}.nameSorted.bam | \
$EBROOTHTSEQ/scripts/htseq-count \
-m union \
-s ${STRANDED} \
- \
${annotationGtf} | \
head -n -5 \
> ${tmpSampleHTseqExpressionText}
Step 8a: GATK: SplitAndTrim
In this step the data is pre-processed for variantcalling using GATK SplitNCigarReads, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.
Toolname: GATK SplitNCigarReads Scriptname: SplitAndTrim Input: merged BAM files Output: BAM file (${sample}.sorted.merged.dedup.splitAndTrim.bam)
java -Xmx9g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=${tmpTmpDataDir} -jar ${EBROOTGATK}/GenomeAnalysisTK.jar \
-T SplitNCigarReads \
-R ${indexFile} \
-I ${sampleMergedDedupBam} \
-o ${tmpsplitAndTrimBam} \
-rf ReassignOneMappingQuality \
-RMQF 255 \
-RMQT 60 \
-U ALLOW_N_CIGAR_READS
Step 8b: IndelRealignment
The local realignment process is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. Toolname GATK SplitAndTrim Input: BAM file from step 8a Output: BAM file (${sample}.sorted.merged.dedup.splitAndTrim. realigned.bam)
java -Xmx8g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=${tmpTmpDataDir} -jar $EBROOTGATK/GenomeAnalysisTK.jar \
-T IndelRealigner \
-R ${indexFile} \
-I ${splitAndTrimBam} \
-o ${tmpIndelRealignedBam} \
-targetIntervals ${indelRealignmentTargets} \
-known ${oneKgPhase1IndelsVcf} \
-known ${goldStandardVcf} \
-U ALLOW_N_CIGAR_READS \
--consensusDeterminationModel KNOWNS_ONLY \
--LODThresholdForCleaning 0.4
Step 8c: BQSR
In this step BQSR is performed. This is a data pre-processing step that detects systematic errors made by the sequencer when it estimates the quality score of each base call. Toolname GATK BQSR Input: BAM file from step 8b Output: BAM file (${sample}.sorted.merged.dedup.splitAndTrim. realigned.bqsr.bam)
java -Xmx14g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=${tmpTmpDataDir} -jar $EBROOTGATK/GenomeAnalysisTK.jar \
-T BaseRecalibrator\
-R ${indexFile} \
-I ${IndelRealignedBam} \
-o ${bqsrBeforeGrp} \
-knownSites ${dbsnpVcf} \
-knownSites ${goldStandardVcf} \
-knownSites ${oneKgPhase1IndelsVcf} \
-nct 2
java -Xmx14g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=${tmpTmpDataDir} -jar $EBROOTGATK/GenomeAnalysisTK.jar \
-T PrintReads \
-R ${indexFile} \
-I ${IndelRealignedBam} \
-o ${tmpBqsrBam} \
-BQSR ${bqsrBeforeGrp} \
-nct 2
Step 9a: HaplotypeCallerGvcf (VariantCalling)
The GATK HaplotypeCaller estimates the most likely genotypes and allele frequencies in a alignment using a Bayesian likelihood model for every position of the genome regardless of whether a variant was detected at that site or not. This information can later be used in the project based genotyping step.
Toolname: GATK HaplotypeCallerGvcf Scriptname: HaplotypeCallerGvcf Input: (${sample.sorted.merged.dedup.splitAndTrim.bam Output: gVCF file (${sample}.${batchBed}.variant.calls.g.vcf)
java -Xmx10g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=${tmpTmpDataDir} -jar ${EBROOTGATK}/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ${indexFile} \
${inputs[@]} \
--dbsnp ${dbsnpVcf}\
-dontUseSoftClippedBases \
-stand_call_conf 10.0 \
-stand_emit_conf 20.0 \
-o ${tmpGatkHaplotypeCallerGvcf} \
-variant_index_type LINEAR \
-variant_index_parameter 128000 \
--emitRefConfidence GVCF
Step 9b: Combine variants
When there 200 or more samples the gVCF files should be combined into batches of equal size. (NB: These batches are different then the ${batchBed}.) The batches will be calculated and created in this step. If there are less then 200, this step will automatically be skipped.
Toolname: GATK CombineGVCFs Scriptname: VariantGVCFCombine Input: gVCF file (from step 9a) Output: Multiple combined gVCF files ${project}.${batchBed}.variant.calls.combined.g.vcf{batch}
java -Xmx30g -XX:ParallelGCThreads=2 -Djava.io.tmpdir=${tmpTmpDataDir} -jar \
${EBROOTGATK}/GenomeAnalysisTK.jar \
-T CombineGVCFs \
-R ${indexFile} \
-L ${indexChrIntervalList} \
-o ${tmpProjectBatchCombinedVariantCalls}.${b} \
${ALLGVCFs[@]}
Step 9c: Genotype variants
In this step there will be a joint analysis over all the samples in the project. This leads to a posterior probability of a variant allele at a site. SNPs and small Indels are written to a VCF file, along with information such as genotype quality, allele frequency, strand bias and read depth for that SNP/Indel.
Toolname: GATK GenotypeGVCFs Scriptname: VariantGVCFGenotype Input: gVCF files from step 9a or combined gVCF files from step 9b Output: VCF file for all the samples in the project: ${project}.${batchBed}.variant.calls.genotyped.vcf
java -Xmx16g -XX:ParallelGCThreads=2 -Djava.io.tmpdir=${tmpTmpDataDir} -jar ${EBROOTGATK}/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R ${indexFile} \
--dbsnp ${dbsnpVcf}\
-o ${tmpProjectBatchGenotypedVariantCalls} \
${ALLGVCFs[@]} \
-stand_call_conf 10.0 \
-stand_emit_conf 20.0
Step 10: MakeExpressionTable
In this step, gene level quantification files per sample, created in step 6 are merged into one table, using a inhouse developt script ProcessReadCounts.
Scriptname: MakeExpressionTable Input: textfile with gene level quantification per sample. Output: gene level quantification table with all samples in the project.
java -Xmx1g -XX:ParallelGCThreads=1 -Djava.io.tmpdir=${tmpTmpDataDir} -jar ${EBROOTNGSMINUTILS}/${processReadCountsJar} \
--mode makeExpressionTable \
--fileList ${intermediateDir}/fileList.txt \
--annot ${geneAnnotationTxt} \
--out ${tmpProjectHTseqExpressionTable}
Step 11: Generate quality control report
This step is to collect the statistics and metrics from step 3. Tables and graphs merged into a HTML and PDF Reports using KnitR. QC reports are then written to a quality control (QC) directory.
Scriptname: QC_report Input: QC statistics and metrics. Output: QC report (.QCReport.pdf)
Step 12: Kallisto
In this step kallisto is for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.
Scriptname: Kallisto Input: raw sequence file in the form of a gzipped fastq file (.fq.gz) Output: results of the main quantification, i.e. the abundance estimate using Kallisto on the data is in the abundance.txt
kallisto quant \
-i ${kallistoIndex} \
-o ${tmpIntermediateDir}/${uniqueID} \
${peEnd1BarcodeFqGz} ${peEnd2BarcodeFqGz}
Step 13: Prepare data to ship to the customer
In this last step the final results of the inhouse sequence analysis pipeline are gathered and prepared to be shipped to the customer. The pipeline tools and scripts write intermediate results to a temporary directory. From these, a selection is copied to a results directory. This directory has five subdirectories:
- alignment: merged BAM file with index
- expression: textfiles with gene level quantification per sample, and per project
- fastqc: FastQC output
- images: QC images
- variants: VCF file with calling SNPs and indels.
- rawdata: raw sequence file in the form of a gzipped fastq file (.fq.gz)
Additionally, the results directory contains the final QC report, a README.txt with used pipeline description and toolversions, and the samplesheet which was the basis for this analysis and a zipped archive with the data that will be shipped to the client. The archive is accompanied by an md5 sum.