TheiaCoV Workflow Series
The TheiaCoV Workflow Series is a collection of WDL workflows developed for performing genomic characterization and genomic epidemiology of SARS-CoV-2 samples to support public health decision-making.
TheiaCoV Workflows for Genomic Characterization
Genomic characterization, i.e. generating consensus assemblies (FASTA format) from next-generation sequencing (NGS) read data (FASTQ format) in order to assign samples relevant nomenclature designations (e.g. PANGO lineages and NextClade clades), is an increasingly critical function for public health laboratories around the world.
The TheiaCoV Genomic Characterization Series includes four separate WDL workflows (TheiaCoV_Illumina_PE, TheiaCoV_Illumina_SE, TheiaCoV_ClearLabs, and TheiaCoV_ONT) that process NGS read data from four different sequencing approaches: Illumina paired-end, Illumina single-end, Clear Labs, and Oxford Nanopore Technologies (ONT). Each workflow generates a consensus assembly, produces relevant quality-control metrics for both the input read data and the generated assembly, and assigns the sample a lineage and clade designation using Pangolin and NextClade, respectively.
All four TheiaCoV workflows for genomic characterization generate a viral assembly by mapping input read data to a reference genome, removing primer sequences from that alignment, and then calling the consensus assembly from the primer-trimmed alignment. These consensus assemblies are then fed into the Pangolin and NextClade CLI tools for lineage and clade assignments.
The major difference among these TheiaCoV Genomic Characterization workflows is in how read mapping, primer trimming, and consensus genome calling are performed. Technical details of these processes, and information on how to apply these workflows to public health investigations, are available below.
A fifth WDL workflow, TheiaCoV_FASTA, was added to take in assembled SARS-CoV-2 (SC2) genomes, perform basic QC (e.g. number of Ns), and assign samples a lineage and clade designation using Pangolin and NextClade, respectively.
A series of introductory training videos that provide conceptual overviews of methodologies and walkthrough tutorials on how to utilize these TheiaCoV workflows through Terra is available on the Theiagen Genomics YouTube page.
Note: the Titan workflows shown in the videos have since been renamed to TheiaCoV.
TheiaCoV_Illumina_PE
The TheiaCoV_Illumina_PE workflow was written to process Illumina paired-end (PE) read data. Input reads are assumed to be the product of sequencing tiled PCR amplicons designed for the SARS-CoV-2 genome. The most common read data analyzed by the TheiaCoV_Illumina_PE workflow are generated with the Artic V3 protocol. However, alternative primer schemes such as the Qiaseq Primer Panel, the Swift Amplicon SARS-CoV-2 Panel, and the Artic V4 Amplicon Sequencing Panel can also be analysed with this workflow, because the primer sequence coordinates of the PCR scheme utilized must be provided (in BED file format) along with the raw paired-end Illumina read data (in FASTQ file format).
Note
By default, this workflow assumes that input reads were generated using a 300-cycle kit (i.e. 2 x 150 bp reads). The optional parameter trimmomatic_minlen may need to be modified to accommodate shorter read data, such as 2 x 75 bp reads generated using a 150-cycle kit.
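As a rough illustration of overriding this parameter outside the Terra interface, the sketch below builds a Cromwell-style inputs file for one sample. It assumes the common WDL convention of naming inputs `<workflow>.<task>.<variable>`, with the task and variable names taken from the input tables on this page. The sample name, file paths, and the `trimmomatic_minlen` value of 20 are hypothetical and are not recommendations.

```python
import json

WORKFLOW = "theiacov_illumina_pe"

def build_inputs(samplename, read1_raw, read2_raw, primer_bed, trimmomatic_minlen=None):
    """Build a Cromwell-style inputs mapping for one sample."""
    inputs = {
        f"{WORKFLOW}.samplename": samplename,
        f"{WORKFLOW}.read1_raw": read1_raw,
        f"{WORKFLOW}.read2_raw": read2_raw,
        f"{WORKFLOW}.primer_bed": primer_bed,
    }
    # Only write the optional key when overriding it; when the key is
    # absent, the workflow default applies.
    if trimmomatic_minlen is not None:
        inputs[f"{WORKFLOW}.read_QC_trim.trimmomatic_minlen"] = trimmomatic_minlen
    return inputs

# Hypothetical sample name, paths, and minlen value (illustration only).
example_inputs = build_inputs(
    samplename="sample01",
    read1_raw="gs://example-bucket/sample01_R1.fastq.gz",
    read2_raw="gs://example-bucket/sample01_R2.fastq.gz",
    primer_bed="gs://example-bucket/primer_scheme.bed",
    trimmomatic_minlen=20,
)
print(json.dumps(example_inputs, indent=2))
```

Leaving `trimmomatic_minlen` unset produces a file with only the four required inputs.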
Upon initiating a TheiaCoV_Illumina_PE job, the input primer scheme coordinates and raw paired-end Illumina read data provided for each sample will be processed to perform consensus genome assembly, infer the quality of both raw read data and the generated consensus genome, and assign SARS-CoV-2 lineage and clade types as outlined in the TheiaCoV_Illumina_PE data workflow below.
Consensus genome assembly with the TheiaCoV_Illumina_PE workflow is performed by first de-hosting read data with the NCBI SRA-Human-Scrubber tool, then trimming low-quality reads with Trimmomatic and removing adapter sequences with BBDuk. The cleaned read data are then aligned to the Wuhan-1 reference genome with BWA to generate a Binary Alignment Map (BAM) file. Primer sequences are removed from the BAM file using the iVar trim sub-command, and the iVar consensus sub-command is then used to generate a consensus assembly in FASTA format. This assembly is used to assign lineage and clade designations with Pangolin and NextClade. NCBI's VADR tool is also employed to screen for potentially errant features (e.g. erroneous frame-shift mutations) in the consensus assembly.
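To make the roles of the consensus thresholds more concrete, here is a minimal sketch of threshold-based consensus calling at a single position. It is a simplification written for illustration and is not iVar's implementation: the function name is hypothetical, and real tools may treat mixed positions differently (for example, by emitting ambiguity codes). The defaults mirror the min_depth, min_freq, and char_unknown values listed in the optional inputs.

```python
from collections import Counter

def call_consensus_base(bases, min_depth=10, min_freq=0.6, char_unknown="N"):
    """Return the majority base at one position, or char_unknown.

    bases: the base calls observed at this position, one per read.
    """
    depth = len(bases)
    if depth < min_depth:
        return char_unknown  # too little coverage to make a call
    base, count = Counter(bases).most_common(1)[0]
    if count / depth >= min_freq:
        return base
    # Simplification: no single base reaches min_freq, so mask the position.
    return char_unknown

def call_consensus(columns, **thresholds):
    """Join per-position calls into a consensus string."""
    return "".join(call_consensus_base(col, **thresholds) for col in columns)

# Toy pileup columns (made-up data): well supported, low depth, mixed.
columns = ["A" * 12, "C" * 5, "G" * 6 + "T" * 6, "T" * 8 + "C" * 4]
consensus = call_consensus(columns)
```

With these toy columns, the second position is masked for low depth and the third because neither base reaches the frequency threshold.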
The required user inputs, optional user inputs, default tool parameters, and outputs generated by TheiaCoV_Illumina_PE are outlined below.
Required User Inputs
Download CSV: TheiaCoV_Illumina_PE_required_inputs.csv
Task | Input Variable | Data Type | Description |
---|---|---|---|
theiacov_illumina_pe | primer_bed | File | Primer sequence coordinates of the PCR scheme utilized in BED file format |
theiacov_illumina_pe | read1_raw | File | Forward Illumina read in FASTQ file format |
theiacov_illumina_pe | read2_raw | File | Reverse Illumina read in FASTQ file format |
theiacov_illumina_pe | samplename | String | Name of the sample being analyzed |
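Because primer trimming depends on the coordinates in primer_bed, it can help to sanity-check the file before launching a run. The sketch below is one possible check, not part of the workflow: it assumes tab-separated rows with at least four columns (reference name, start, end, primer name), and the example rows are made up rather than taken from a real primer scheme.

```python
def parse_primer_bed(text):
    """Parse BED text into (chrom, start, end, name) tuples, validating coordinates."""
    primers = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 4:
            raise ValueError(f"line {line_no}: expected at least 4 tab-separated columns")
        chrom, name = fields[0], fields[3]
        start, end = int(fields[1]), int(fields[2])
        if start < 0 or end <= start:
            raise ValueError(f"line {line_no}: invalid interval {start}-{end}")
        primers.append((chrom, start, end, name))
    return primers

# Made-up rows for illustration; not a real primer scheme.
EXAMPLE_BED = (
    "REF\t30\t54\tamplicon_1_LEFT\n"
    "REF\t385\t410\tamplicon_1_RIGHT\n"
)
primers = parse_primer_bed(EXAMPLE_BED)
```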
Optional User Inputs
Download CSV: TheiaCoV_Illumina_PE_optional_inputs.csv
Task | Variable Name | Data Type | Description | Default |
---|---|---|---|---|
bwa | reference_genome | String | Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta |
bwa | cpus | Int | CPU resources allocated to the BWA task runtime environment | 6 |
consensus | char_unknown | String | Character to print in regions with less than minimum coverage for iVar consensus | N |
consensus | count_orphans | Boolean | Do not skip anomalous read pairs in variant calling for SAMtools mpileup before running iVar consensus | TRUE |
consensus | disable_baq | Boolean | Disable base alignment quality (BAQ) computation for SAMtools mpileup before running iVar consensus | TRUE |
consensus | max_depth | Int | Maximum reads read at a position per input file for SAMtools mpileup before running iVar consensus | 600000 |
consensus | min_bq | Int | Minimum base quality for a base to be considered by SAMtools mpileup before running iVar consensus | 0 |
consensus | min_depth | Int | Minimum read depth to call variants for iVar consensus | 10 |
consensus | min_freq | Float | Minimum frequency threshold (0 - 1) to call variants for iVar consensus | 0.6 |
consensus | min_qual | Int | Minimum quality score threshold to count a base for iVar consensus | 20 |
consensus | ref_genome | String | Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta |
consensus | ref_gff | String | Path to the general feature format (GFF) file of the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /reference/GCF_009858895.2_ASM985889v3_genomic.gff |
nextclade_one_sample | docker | String | Docker tag used for running NextClade | neherlab/nextclade:0.14.2 |
nextclade_output_parser_one_sample | docker | String | Docker tag used for parsing NextClade output | python:slim |
pangolin3 | docker | String | Docker tag used for running Pangolin | staphb/pangolin:3.1.11-pangolearn-2021-08-24 |
pangolin3 | inference_engine | String | Pangolin inference engine for lineage designations (usher or pangolearn) | usher |
pangolin3 | min_length | Int | Minimum query length allowed for Pangolin to attempt assignment | 10000 |
pangolin3 | max_ambig | Float | Maximum proportion of Ns allowed for Pangolin to attempt assignment | 0.5 |
primer_trim | keep_noprimer_reads | Boolean | Include reads with no primers for iVar trim | TRUE |
read_QC_trim | bbduk_mem | Int | Memory allocated to the BBDuk VM | 8 |
read_QC_trim | trimmomatic_minlen | Int | Specifies the minimum length of reads to be kept for Trimmomatic | 25 |
read_QC_trim | trimmomatic_quality_trim_score | Int | Specifies the average quality required for Trimmomatic | 30 |
read_QC_trim | trimmomatic_window_size | Int | Specifies the number of bases to average across for Trimmomatic | 4 |
theiacov_illumina_pe | nextclade_dataset_name | String | Nextclade organism dataset | sars-cov-2 |
theiacov_illumina_pe | nextclade_dataset_reference | String | Nextclade reference genome | MN908947 |
theiacov_illumina_pe | nextclade_dataset_tag | String | Nextclade dataset tag | 2021-06-25T00:00:00Z |
theiacov_illumina_pe | seq_method | String | Description of the sequencing methodology used to generate the input read data | Illumina paired-end |
vadr | docker | String | Docker tag used for running VADR | staphb/vadr:1.2.1 |
vadr | maxlen | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | 30000 |
vadr | minlen | Int | Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script | 50 |
vadr | skip_length | Int | Minimum assembly length (unambiguous) to run VADR | 10000 |
vadr | vadr_opts | String | Options for the v-annotate.pl VADR script | --glsearch -s -r --nomisc --mkey sarscov2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --mdir /opt/vadr/vadr-models/ |
variant_call | count_orphans | Boolean | Do not skip anomalous read pairs in variant calling for SAMtools mpileup before running iVar variants | TRUE |
variant_call | disable_baq | Boolean | Disable base alignment quality (BAQ) computation for SAMtools mpileup before running iVar variants | TRUE |
variant_call | max_depth | Int | Maximum reads read at a position per input file for SAMtools mpileup before running iVar variants | 600000 |
variant_call | min_bq | Int | Minimum base quality for a base to be considered by SAMtools mpileup before running iVar variants | 0 |
variant_call | min_depth | Int | Minimum read depth to call variants for iVar variants | 10 |
variant_call | min_freq | Float | Minimum frequency threshold (0 - 1) to call variants for iVar variants | 0.6 |
variant_call | min_qual | Int | Minimum quality score threshold to count a base for iVar variants | 20 |
variant_call | ref_gff | String | Path to the general feature format (GFF) file of the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /reference/GCF_009858895.2_ASM985889v3_genomic.gff |
variant_call | ref_genome | String | Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta |
version_capture | timezone | String | User time zone in valid Unix TZ string (e.g. America/New_York) | None |
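Several of these defaults act as gates on downstream steps: min_length and max_ambig determine whether Pangolin attempts an assignment, and skip_length determines whether VADR runs. The sketch below is one reading of the table descriptions, written for illustration only; it is not the tools' actual logic, and the function and key names are hypothetical.

```python
def downstream_gates(sequence, min_length=10000, max_ambig=0.5, skip_length=10000):
    """Report which downstream steps a consensus sequence would qualify for."""
    seq = sequence.upper()
    total = len(seq)
    n_count = seq.count("N")
    unambiguous = sum(seq.count(base) for base in "ACGT")
    # Pangolin gate: long enough, and not too large a proportion of Ns.
    pangolin_ok = total > 0 and total >= min_length and n_count / total <= max_ambig
    # VADR gate: enough unambiguous bases in the assembly.
    vadr_ok = unambiguous >= skip_length
    return {"pangolin_attempts_assignment": pangolin_ok, "vadr_runs": vadr_ok}

# Toy sequences (made-up data) exercising each gate.
good = downstream_gates("A" * 12000)
mostly_n = downstream_gates("A" * 4000 + "N" * 8000)
borderline = downstream_gates("A" * 9000 + "N" * 3000)
```

In the last case the sequence passes the Pangolin gate (25% Ns over 12,000 positions) but has too few unambiguous bases for the VADR gate.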
Outputs
Download CSV: TheiaCoV_Illumina_PE_default_outputs.csv
Output Name | Data Type | Description |
---|---|---|
aligned_bai | File | Index companion file to the BAM file generated during the consensus assembly process |
aligned_bam | File | Primer-trimmed BAM file; generated during the consensus assembly process |
assembly_fasta | File | Consensus genome assembly |
assembly_length_unambiguous | Int | Number of unambiguous basecalls within the SC2 consensus assembly |
assembly_mean_coverage | Float | Mean sequencing depth throughout the consensus assembly generated after performing primer trimming; calculated using the SAMtools coverage command |
assembly_method | String | Method employed to generate consensus assembly |
auspice_json | File | Auspice-compatible JSON output generated from NextClade analysis that includes the NextClade default samples for clade-typing and the single sample placed on this tree |
bbduk_docker | String | Docker image used to run BBDuk |
bwa_version | String | Version of BWA used to map read data to the reference genome |
consensus_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
consensus_stats | File | Output from the SAMtools stats command to assess quality of the alignment file (BAM) |
fastqc_clean1 | Int | Number of forward reads after SeqyClean filtering as determined by FastQC |
fastqc_clean2 | Int | Number of reverse reads after SeqyClean filtering as determined by FastQC |
fastqc_clean_pairs | String | Number of paired reads after SeqyClean filtering as determined by FastQC |
fastqc_raw1 | Int | Number of forward reads identified in the input FASTQ files as determined by FastQC |
fastqc_raw2 | Int | Number of reverse reads identified in the input FASTQ files as determined by FastQC |
fastqc_raw_pairs | String | Number of paired reads identified in the input FASTQ files as determined by FastQC |
fastqc_version | String | Version of the FastQC software used for read QC analysis |
ivar_tsv | File | Variant descriptor file generated by iVar variants |
ivar_variant_version | String | Version of iVar for running the iVar variants command |
ivar_vcf | File | iVar TSV output converted to VCF format |
ivar_version_consensus | String | Version of iVar for running the iVar consensus command |
ivar_version_primtrim | String | Version of iVar for running the iVar trim command |
kraken_human | Float | Percent of human read data detected using the Kraken2 software |
kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal |
kraken_report | File | Full Kraken report |
kraken_report_dehosted | File | Full Kraken report after host removal |
kraken_sc2 | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software |
kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal |
kraken_version | String | Version of Kraken software used |
meanbaseq_trim | Float | Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming |
meanmapq_trim | Float | Mean quality of the mapped reads to the reference genome after primer trimming |
nextclade_aa_dels | String | Amino-acid deletions as detected by NextClade |
nextclade_aa_subs | String | Amino-acid substitutions as detected by NextClade |
nextclade_clade | String | NextClade clade designation |
nextclade_json | File | NextClade output in JSON file format |
nextclade_tsv | File | NextClade output in TSV file format |
nextclade_version | String | Version of NextClade software used |
number_Degenerate | Int | Number of degenerate basecalls within the consensus assembly |
number_N | Int | Number of fully ambiguous basecalls within the consensus assembly |
number_Total | Int | Total number of nucleotides within the consensus assembly |
pango_lineage | String | Pango lineage as determined by Pangolin |
pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
pangolin_assignment_version | String | Version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
pangolin_docker | String | Docker image used to run Pangolin |
pangolin_notes | String | Lineage notes as determined by Pangolin |
pangolin_versions | String | All Pangolin software and database versions |
percent_reference_coverage | Float | Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of reference genome (SC2: 29,903) x 100 |
primer_bed_name | String | Name of the primer BED file used for primer trimming |
primer_trimmed_read_percent | Float | Percent of read data with primers trimmed as determined by iVar trim |
read1_clean | File | Forward read file after quality trimming and adapter removal |
read1_dehosted | File | Dehosted forward reads; suggested read file for SRA submission |
read2_clean | File | Reverse read file after quality trimming and adapter removal |
read2_dehosted | File | Dehosted reverse reads; suggested read file for SRA submission |
samtools_version | String | Version of SAMtools used to sort and index the alignment file |
samtools_version_consensus | String | Version of SAMtools used to create the pileup before running iVar consensus |
samtools_version_primtrim | String | Version of SAMtools used to create the pileup before running iVar trim |
samtools_version_stats | String | Version of SAMtools used to assess quality of read mapping |
seq_platform | String | Description of the sequencing methodology used to generate the input read data |
theiacov_illumina_pe_analysis_date | String | Date of analysis |
theiacov_illumina_pe_version | String | Version of the Public Health Viral Genomics (PHVG) repository used |
trimmomatic_version | String | Version of Trimmomatic used |
vadr_alerts_list | File | File containing all of the fatal alerts as determined by VADR |
vadr_docker | String | Docker image used to run VADR |
vadr_num_alerts | String | Number of fatal alerts as determined by VADR |
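The assembly-level metrics in this table are related by simple arithmetic. The sketch below recomputes them from a consensus sequence as a way of reading the definitions; it assumes that "degenerate" means IUPAC ambiguity codes other than N and uses the 29,903 bp reference length quoted for percent_reference_coverage. It is an illustration, not the workflow's own code.

```python
REFERENCE_LENGTH = 29903  # SC2 reference length quoted in the table above

def assembly_metrics(sequence, reference_length=REFERENCE_LENGTH):
    """Recompute the assembly QC metrics from a consensus sequence string."""
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    # Assumption: degenerate calls are IUPAC ambiguity codes other than N.
    degenerate = sum(seq.count(code) for code in "RYSWKMBDHV")
    return {
        "number_N": seq.count("N"),
        "number_Degenerate": degenerate,
        "number_Total": len(seq),
        "assembly_length_unambiguous": unambiguous,
        "percent_reference_coverage": unambiguous / reference_length * 100,
    }

# Toy consensus (made-up data): 28,000 unambiguous bases, 1,000 Ns, 2 degenerate calls.
metrics = assembly_metrics("ACGT" * 7000 + "N" * 1000 + "RY")
```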
TheiaCoV_Illumina_SE
The TheiaCoV_Illumina_SE workflow was written to process Illumina single-end (SE) read data. Input reads are assumed to be the product of sequencing tiled PCR amplicons designed for the SARS-CoV-2 genome. The most common read data analyzed by the TheiaCoV_Illumina_SE workflow are generated with the Artic V3 protocol. However, alternative primer schemes such as the Qiaseq Primer Panel can also be analysed with this workflow, because the primer sequence coordinates of the PCR scheme utilized must be provided (in BED file format) along with the raw single-end Illumina read data (in FASTQ file format).
Note
By default, this workflow assumes that input reads were generated using a 35-cycle kit (i.e. 1 x 35 bp reads). The optional parameter trimmomatic_minlen may need to be modified to accommodate longer read data.
Upon initiating a TheiaCoV_Illumina_SE job, the input primer scheme coordinates and raw single-end Illumina read data provided for each sample will be processed to perform consensus genome assembly, infer the quality of both raw read data and the generated consensus genome, and assign SARS-CoV-2 lineage and clade types as outlined in the TheiaCoV_Illumina_SE data workflow below.
Consensus genome assembly with the TheiaCoV_Illumina_SE workflow is performed by first trimming low-quality reads with Trimmomatic and removing adapter sequences with BBDuk. The cleaned read data are then aligned to the Wuhan-1 reference genome with BWA to generate a Binary Alignment Map (BAM) file. Primer sequences are removed from the BAM file using the iVar trim sub-command, and the iVar consensus sub-command is then used to generate a consensus assembly in FASTA format. This assembly is used to assign lineage and clade designations with Pangolin and NextClade. NCBI's VADR tool is also employed to screen for potentially errant features (e.g. erroneous frame-shift mutations) in the consensus assembly.
The required user inputs, optional user inputs, default tool parameters, and outputs generated by TheiaCoV_Illumina_SE are outlined below.
Required User Inputs
Download CSV: TheiaCoV_Illumina_SE_required_inputs.csv
Task | Input Variable | Data Type | Description |
---|---|---|---|
theiacov_illumina_se | primer_bed | File | Primer sequence coordinates of the PCR scheme utilized in BED file format |
theiacov_illumina_se | read1_raw | File | Single-end Illumina read in FASTQ file format |
theiacov_illumina_se | samplename | String | Name of the sample being analyzed |
Optional User Inputs
Download CSV: TheiaCoV_Illumina_SE_optional_inputs.csv
Task | Variable Name | Data Type | Description | Default |
---|---|---|---|---|
bwa | reference_genome | String | Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta |
bwa | cpus | Int | CPU resources allocated to the BWA task runtime environment | 6 |
bwa | read2 | File | Optional input file for the Kraken task that is not applicable to this workflow | None |
consensus | char_unknown | String | Character to print in regions with less than minimum coverage for iVar consensus | N |
consensus | count_orphans | Boolean | Do not skip anomalous read pairs in variant calling for SAMtools mpileup before running iVar consensus | TRUE |
consensus | disable_baq | Boolean | Disable base alignment quality (BAQ) computation for SAMtools mpileup before running iVar consensus | TRUE |
consensus | max_depth | Int | Maximum reads read at a position per input file for SAMtools mpileup before running iVar consensus | 600000 |
consensus | min_bq | Int | Minimum base quality for a base to be considered by SAMtools mpileup before running iVar consensus | 0 |
consensus | min_depth | Int | Minimum read depth to call variants for iVar consensus | 10 |
consensus | min_freq | Float | Minimum frequency threshold (0 - 1) to call variants for iVar consensus | 0.6 |
consensus | min_qual | Int | Minimum quality score threshold to count a base for iVar consensus | 20 |
consensus | ref_genome | String | Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta |
consensus | ref_gff | String | Path to the general feature format (GFF) file of the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /reference/GCF_009858895.2_ASM985889v3_genomic.gff |
nextclade_one_sample | docker | String | Docker tag used for running NextClade | neherlab/nextclade:0.14.2 |
nextclade_output_parser_one_sample | docker | String | Docker tag used for parsing NextClade output | python:slim |
pangolin3 | docker | String | Docker tag used for running Pangolin | staphb/pangolin:3.1.11-pangolearn-2021-08-24 |
pangolin3 | inference_engine | String | Pangolin inference engine for lineage designations (usher or pangolearn) | usher |
pangolin3 | min_length | Int | Minimum query length allowed for Pangolin to attempt assignment | 10000 |
pangolin3 | max_ambig | Float | Maximum proportion of Ns allowed for Pangolin to attempt assignment | 0.5 |
primer_trim | keep_noprimer_reads | Boolean | Include reads with no primers for iVar trim | TRUE |
read_QC_trim | bbduk_mem | Int | Memory allocated to the BBDuk VM | 8 |
read_QC_trim | trimmomatic_minlen | Int | Specifies the minimum length of reads to be kept for Trimmomatic | 25 |
read_QC_trim | trimmomatic_quality_trim_score | Int | Specifies the average quality required for Trimmomatic | 30 |
read_QC_trim | trimmomatic_window_size | Int | Specifies the number of bases to average across for Trimmomatic | 4 |
theiacov_illumina_se | nextclade_dataset_name | String | Nextclade organism dataset | sars-cov-2 |
theiacov_illumina_se | nextclade_dataset_reference | String | Nextclade reference genome | MN908947 |
theiacov_illumina_se | nextclade_dataset_tag | String | Nextclade dataset tag | 2021-06-25T00:00:00Z |
theiacov_illumina_se | seq_method | String | Description of the sequencing methodology used to generate the input read data | Illumina paired-end |
vadr | docker | String | Docker tag used for running VADR | staphb/vadr:1.2.1 |
vadr | maxlen | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | 30000 |
vadr | minlen | Int | Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script | 50 |
vadr | skip_length | Int | Minimum assembly length (unambiguous) to run VADR | 10000 |
vadr | vadr_opts | String | Options for the v-annotate.pl VADR script | --glsearch -s -r --nomisc --mkey sarscov2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --mdir /opt/vadr/vadr-models/ |
variant_call | count_orphans | Boolean | Do not skip anomalous read pairs in variant calling for SAMtools mpileup before running iVar variants | TRUE |
variant_call | disable_baq | Boolean | Disable base alignment quality (BAQ) computation for SAMtools mpileup before running iVar variants | TRUE |
variant_call | max_depth | Int | Maximum reads read at a position per input file for SAMtools mpileup before running iVar variants | 600000 |
variant_call | min_bq | Int | Minimum base quality for a base to be considered by SAMtools mpileup before running iVar variants | 0 |
variant_call | min_depth | Int | Minimum read depth to call variants for iVar variants | 10 |
variant_call | min_freq | Float | Minimum frequency threshold (0 - 1) to call variants for iVar variants | 0.6 |
variant_call | min_qual | Int | Minimum quality score threshold to count a base for iVar variants | 20 |
variant_call | ref_gff | String | Path to the general feature format (GFF) file of the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /reference/GCF_009858895.2_ASM985889v3_genomic.gff |
variant_call | ref_genome | String | Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container | /artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta |
version_capture | timezone | String | User time zone in valid Unix TZ string (e.g. America/New_York) | None |
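The three read_QC_trim parameters work together: a window of trimmomatic_window_size bases is averaged, the read is cut where that average drops below trimmomatic_quality_trim_score, and reads left shorter than trimmomatic_minlen are discarded. The sketch below is a simplified illustration of that idea and is not Trimmomatic's actual implementation; the function names and toy reads are hypothetical.

```python
def sliding_window_keep(qualities, window_size=4, quality_trim_score=30):
    """Return how many leading bases to keep under a sliding-window quality cut."""
    for start in range(len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < quality_trim_score:
            return start  # cut at the first window whose mean quality is too low
    return len(qualities)  # no window failed; keep the whole read

def trim_read(seq, qualities, window_size=4, quality_trim_score=30, minlen=25):
    """Trim one read; return the trimmed sequence, or None if it is too short."""
    keep = sliding_window_keep(qualities, window_size, quality_trim_score)
    trimmed = seq[:keep]
    return trimmed if len(trimmed) >= minlen else None

# Toy reads (made-up data): quality drops after base 30 and after base 20.
kept = trim_read("A" * 40, [35] * 30 + [10] * 10)
dropped = trim_read("A" * 40, [35] * 20 + [10] * 20)
```

The first read is cut to 27 bases and kept; the second is cut to 17 bases, which is below the minimum length, so it is dropped.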
Outputs
Download CSV: TheiaCoV_Illumina_SE_default_outputs.csv
Output Name | Data Type | Description |
---|---|---|
aligned_bai | File | Index companion file to the BAM file generated during the consensus assembly process |
aligned_bam | File | Primer-trimmed BAM file; generated during the consensus assembly process |
assembly_fasta | File | Consensus genome assembly |
assembly_length_unambiguous | Int | Number of unambiguous basecalls within the SC2 consensus assembly |
assembly_mean_coverage | Float | Mean sequencing depth throughout the consensus assembly generated after performing primer trimming; calculated using the SAMtools coverage command |
assembly_method | String | Method employed to generate consensus assembly |
auspice_json | File | Auspice-compatible JSON output generated from NextClade analysis that includes the NextClade default samples for clade-typing and the single sample placed on this tree |
bbduk_docker | String | Docker image used to run BBDuk |
bwa_version | String | Version of BWA used to map read data to the reference genome |
consensus_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
consensus_stats | File | Output from the SAMtools stats command to assess quality of the alignment file (BAM) |
fastqc_clean | Int | Number of reads after SeqyClean filtering as determined by FastQC |
fastqc_raw | Int | Number of reads identified in the input FASTQ file as determined by FastQC |
fastqc_version | String | Version of the FastQC software used for read QC analysis |
ivar_tsv | File | Variant descriptor file generated by iVar variants |
ivar_variant_version | String | Version of iVar for running the iVar variants command |
ivar_vcf | File | iVar TSV output converted to VCF format |
ivar_version_consensus | String | Version of iVar for running the iVar consensus command |
ivar_version_primtrim | String | Version of iVar for running the iVar trim command |
kraken_human | Float | Percent of human read data detected using the Kraken2 software |
kraken_report | String | Full Kraken report |
kraken_sc2 | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software |
kraken_version | String | Version of Kraken software used |
meanbaseq_trim | Float | Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming |
meanmapq_trim | Float | Mean quality of the mapped reads to the reference genome after primer trimming |
nextclade_aa_dels | String | Amino-acid deletions as detected by NextClade |
nextclade_aa_subs | String | Amino-acid substitutions as detected by NextClade |
nextclade_clade | String | NextClade clade designation |
nextclade_json | File | NextClade output in JSON file format |
nextclade_tsv | File | NextClade output in TSV file format |
nextclade_version | String | Version of NextClade software used |
number_Degenerate | Int | Number of degenerate basecalls within the consensus assembly |
number_N | Int | Number of fully ambiguous basecalls within the consensus assembly |
number_Total | Int | Total number of nucleotides within the consensus assembly |
pango_lineage | String | Pango lineage as determined by Pangolin |
pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
pangolin_assignment_version | String | Version of the pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
pangolin_docker | String | Docker image used to run Pangolin |
pangolin_notes | String | Lineage notes as determined by Pangolin |
pangolin_versions | String | All Pangolin software and database versions |
percent_reference_coverage | Float | Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of reference genome (SC2: 29,903) x 100 |
primer_bed_name | String | Name of the primer BED file used for primer trimming |
primer_trimmed_read_percent | Float | Percent of read data with primers trimmed as determined by iVar trim |
read1_clean | File | Forward read file after quality trimming and adapter removal |
samtools_version | String | Version of SAMtools used to sort and index the alignment file |
samtools_version_consensus | String | Version of SAMtools used to create the pileup before running iVar consensus |
samtools_version_primtrim | String | Version of SAMtools used to create the pileup before running iVar trim |
samtools_version_stats | String | Version of SAMtools used to assess quality of read mapping |
seq_platform | String | Description of the sequencing methodology used to generate the input read data |
theiacov_illumina_se_analysis_date | String | Date of analysis |
theiacov_illumina_se_version | String | Version of the Public Health Viral Genomics (PHVG) repository used |
trimmomatic_version | String | Version of Trimmomatic used |
vadr_alerts_list | File | File containing all of the fatal alerts as determined by VADR |
vadr_docker | String | Docker image used to run VADR |
vadr_num_alerts | String | Number of fatal alerts as determined by VADR |
TheiaCoV_ClearLabs
The TheiaCoV_ClearLabs workflow was written to process Clear Labs WGS read data for SARS-CoV-2 amplicon sequencing. Currently, Clear Labs sequencing is performed with the Artic V3 protocol. If alternative primer schemes such as the Qiaseq Primer Panel, the Swift Amplicon SARS-CoV-2 Panel, and the Artic V4 Amplicon Sequencing Panel become available on the platform, these data can also be analysed with this workflow, because the primer sequence coordinates of the PCR scheme utilized must be provided (in BED file format) along with the raw Clear Labs read data (in FASTQ file format).
Upon initiating a TheiaCoV_ClearLabs run, the input Clear Labs read data provided for each sample will be processed to perform consensus genome assembly, infer the quality of both raw read data and the generated consensus genome, and assign SARS-CoV-2 lineage and clade types as outlined in the TheiaCoV_ClearLabs data workflow below.
Consensus genome assembly with the TheiaCoV_ClearLabs workflow is performed by first de-hosting read data with the NCBI SRA-Human-Scrubber tool and then following the Artic nCoV-2019 novel coronavirus bioinformatics protocol (https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html). Briefly, input reads are aligned to the Wuhan-1 reference genome with minimap2 to generate a Binary Alignment Map (BAM) file. Primer sequences are then removed from the BAM file and a consensus assembly file is generated using the Artic medaka command. This assembly is used to assign lineage and clade designations with Pangolin and NextClade. NCBI's VADR tool is also employed to screen for potentially errant features (e.g. erroneous frame-shift mutations) in the consensus assembly.
Note
Read trimming is performed on raw read data on the Clear Labs instrument itself and is therefore not a required step in the TheiaCoV_ClearLabs workflow.
More information on required user inputs, optional user inputs, default tool parameters and the outputs generated by TheiaCoV_CLearLabs are outlined below.
Required User Inputs¶
Download CSV: TheiaCoV_ClearLabs_required_inputs.csv
Task | Input Variable | Data Type | Description |
---|---|---|---|
theiacov_clearlabs | clear_lab_fastq | File | Clear Labs FASTQ read files |
theiacov_clearlabs | primer_bed | File | Primer sequence coordinates of the PCR scheme utilized in BED file format |
theiacov_clearlabs | samplename | String | Name of the sample being analyzed |
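As a rough illustration of how the three required inputs fit together, the sketch below builds an inputs mapping using the Cromwell-style `workflow_name.input_name` key convention. It assumes the workflow is named `theiacov_clearlabs`, as listed in the Task column; the sample name and file paths are hypothetical placeholders, and on platforms such as Terra these values would normally be set through the workspace interface instead.

```python
import json


def clearlabs_required_inputs(samplename, fastq_path, primer_bed_path):
    """Map the three required inputs to Cromwell-style fully qualified keys."""
    return {
        "theiacov_clearlabs.samplename": samplename,
        "theiacov_clearlabs.clear_lab_fastq": fastq_path,
        "theiacov_clearlabs.primer_bed": primer_bed_path,
    }


# Hypothetical sample name and paths, for illustration only.
example_inputs = clearlabs_required_inputs(
    "sample01", "reads/sample01.fastq.gz", "primers/scheme.bed"
)
print(json.dumps(example_inputs, indent=2))
```

The resulting JSON could be saved to a file and passed to a WDL execution engine that accepts an inputs file.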
Optional User Inputs¶
Download CSV: TheiaCoV_ClearLabs_optional_inputs.csv
Task | Variable Name | Data Type | Description | Default |
---|---|---|---|---|
consensus | cpu | Int | CPU resources allocated to the Artic Medaka task runtime environment | 8 |
consensus | docker | String | Docker tag used for running the Medaka assembler | staphb/artic-ncov2019:1.3.0 |
consensus | medaka_model | String | Model for consensus genome assembly via Medaka | r941_min_high_g360 |
fastqc_se_clean | cpus | Int | CPU resources allocated to the FastQC task runtime environment for assessing clean read data | |
fastqc_se_clean | read1_name | String | Name of the sample being analyzed | Inferred from the input read file |
fastqc_se_raw | cpus | Int | CPU resources allocated to the FastQC task runtime environment for assessing raw read data | |
fastqc_se_raw | read1_name | String | Name of the sample being analyzed | Inferred from the input read file |
kraken2_dehosted | cpus | Int | CPU resources allocated to the Kraken task runtime environment for assessing dehosted read data | 4 |
kraken2_dehosted | kraken2_db | String | Path to the Kraken2 database within the staphb/kraken2:2.0.8-beta_hv Docker container | /kraken2-db |
kraken2_dehosted | read2 | File | Optional input file for the Kraken task that is not applicable to this workflow | None |
kraken2_raw | cpus | Int | CPU resources allocated to the Kraken task runtime environment for assessing raw read data | 4 |
kraken2_raw | kraken2_db | String | Path to the Kraken2 database within the staphb/kraken2:2.0.8-beta_hv Docker container | /kraken2-db |
kraken2_raw | read2 | File | Optional input file for the Kraken task that is not applicable to this workflow | None |
ncbi_scrub_se | docker | String | Docker tag used for running the NCBI SRA Human-Scrubber tool | gcr.io/ncbi-sys-gcr-public-research/sra-human-scrubber@sha256:b7dba71079344daea4ea3363e1a67fa54edb7ec65459d039669c68a66d38b140 |
nextclade_one_sample | docker | String | Docker tag used for running NextClade | neherlab/nextclade:0.14.2 |
nextclade_output_parser_one_sample | docker | String | Docker tag used for parsing NextClade output | python:slim |
pangolin3 | docker | String | Docker tag used for running Pangolin | staphb/pangolin:3.1.11-pangolearn-2021-08-24 |
pangolin3 | inference_engine | String | Pangolin inference engine for lineage designations (usher or pangolearn) | usher |
pangolin3 | min_length | Int | Minimum query length allowed for Pangolin to attempt assignment | 10000 |
pangolin3 | max_ambig | Float | Maximum proportion of Ns allowed for Pangolin to attempt assignment | 0.5 |
theiacov_clearlabs | nextclade_dataset_name | String | Nextclade organism dataset | sars-cov-2 |
theiacov_clearlabs | nextclade_dataset_reference | String | Nextclade reference genome | MN908947 |
theiacov_clearlabs | nextclade_dataset_tag | String | Nextclade dataset tag | 2021-06-25T00:00:00Z |
theiacov_clearlabs | normalise | Int | Value to normalize read counts | 200 |
theiacov_clearlabs | seq_method | String | Description of the sequencing methodology used to generate the input read data | ONT via Clear Labs WGS |
vadr | docker | String | Docker tag used for running VADR | staphb/vadr:1.2.1 |
vadr | maxlen | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | 30000 |
vadr | minlen | Int | Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script | 50 |
vadr | skip_length | Int | Minimum assembly length (unambiguous) to run VADR | 10000 |
vadr | vadr_opts | String | Options for the v-annotate.pl VADR script | --glsearch -s -r --nomisc --mkey sarscov2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --mdir /opt/vadr/vadr-models/ |
version_capture | timezone | String | User time zone in valid Unix TZ string (e.g. America/New_York) | None |
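To make the `min_length` and `max_ambig` defaults above concrete, the following sketch checks whether a consensus sequence would fall inside both thresholds. It is a simplified illustration and not Pangolin's own implementation; in particular, treating the `max_ambig` boundary as inclusive is an assumption, and the function name is hypothetical.

```python
def within_pangolin_thresholds(sequence, min_length=10000, max_ambig=0.5):
    """Return True when the sequence is long enough and not too N-rich.

    Assumption: a proportion of Ns exactly equal to max_ambig is accepted.
    """
    seq = sequence.upper()
    if not seq or len(seq) < min_length:
        return False
    n_fraction = seq.count("N") / len(seq)
    return n_fraction <= max_ambig


# A 12,000-base sequence with 1,000 Ns is inside both default thresholds.
print(within_pangolin_thresholds("A" * 11000 + "N" * 1000))
```

Sequences that fail either check would not be expected to receive a lineage assignment under the default settings.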
Outputs¶
Download CSV: TheiaCoV_ClearLabs_default_outputs.csv
Output Name | Data Type | Description |
---|---|---|
aligned_bai | File | Index companion file to the BAM file generated during the consensus assembly process |
aligned_bam | File | Primer-trimmed BAM file; generated during the consensus assembly process |
artic_version | String | Version of the Artic software utilized for read trimming and consensus genome assembly |
assembly_fasta | File | Consensus genome assembly |
assembly_length_unambiguous | Int | Number of unambiguous basecalls within the SC2 consensus assembly |
assembly_mean_coverage | Float | Mean sequencing depth throughout the consensus assembly generated after performing primer trimming; calculated using the SAMtools coverage command |
assembly_method | String | Method employed to generate consensus assembly |
auspice_json | File | Auspice-compatible JSON output generated from NextClade analysis that includes the NextClade default samples for clade-typing and the single sample placed on this tree |
consensus_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
consensus_stats | File | Output from the SAMtools stats command to assess quality of the alignment file (BAM) |
dehosted_reads | File | Dehosted reads; suggested read file for SRA submission |
fastqc_clean | Int | Number of reads after dehosting as determined by FastQC |
fastqc_raw | Int | Number of raw input reads as determined by FastQC |
fastqc_version | String | Version of the FastQC software used |
kraken_human | Float | Percent of human read data detected using the Kraken2 software |
kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal |
kraken_report | String | Full Kraken report |
kraken_report_dehosted | File | Full Kraken report after host removal |
kraken_sc2 | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software |
kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal |
kraken_version | String | Version of Kraken software used |
meanbaseq_trim | Float | Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming |
meanmapq_trim | Float | Mean quality of the mapped reads to the reference genome after primer trimming |
nextclade_aa_dels | String | Amino-acid deletions as detected by NextClade |
nextclade_aa_subs | String | Amino-acid substitutions as detected by NextClade |
nextclade_clade | String | NextClade clade designation |
nextclade_json | File | NextClade output in JSON file format |
nextclade_tsv | File | NextClade output in TSV file format |
nextclade_version | String | Version of NextClade software used |
number_Degenerate | Int | Number of degenerate basecalls within the consensus assembly |
number_N | Int | Number of fully ambiguous basecalls within the consensus assembly |
number_Total | Int | Total number of nucleotides within the consensus assembly |
pango_lineage | String | Pango lineage as determined by Pangolin |
pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
pangolin_assignment_version | String | Version of the Pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
pangolin_docker | String | Docker image used to run Pangolin |
pangolin_notes | String | Lineage notes as determined by Pangolin |
pangolin_versions | String | All Pangolin software and database versions |
percent_reference_coverage | Float | Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of reference genome (SC2: 29,903) x 100 |
primer_bed_name | String | Name of the primer BED file used for primer trimming |
reads_dehosted | File | De-hosted read files |
samtools_version | String | Version of SAMtools used to sort and index the alignment file |
seq_platform | String | Description of the sequencing methodology used to generate the input read data |
theiacov_clearlabs_analysis_date | String | Date of analysis |
theiacov_clearlabs_version | String | Version of the Public Health Viral Genomics (PHVG) repository used |
vadr_alerts_list | File | File containing all of the fatal alerts as determined by VADR |
vadr_docker | String | Docker image used to run VADR |
vadr_num_alerts | String | Number of fatal alerts as determined by VADR |
variants_from_ref_vcf | File | Variants called relative to the reference genome (VCF format) |
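The `percent_reference_coverage` formula given in the outputs table can be written out as a short worked example. This is a sketch of the stated arithmetic only; the function name is hypothetical and any rounding applied by the workflow itself is not specified here.

```python
SC2_REFERENCE_LENGTH = 29903  # length of the SARS-CoV-2 reference genome cited in the table


def percent_reference_coverage(assembly_length_unambiguous,
                               reference_length=SC2_REFERENCE_LENGTH):
    """assembly_length_unambiguous / reference length x 100."""
    return assembly_length_unambiguous / reference_length * 100


# Example: an assembly with 29,000 unambiguous basecalls.
print(round(percent_reference_coverage(29000), 2))
```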
TheiaCoV_ONT¶
The TheiaCoV_ONT workflow was written to process basecalled and demultiplexed Oxford Nanopore Technology (ONT) read data. The most common read data analyzed by the TheiaCoV_ONT workflow are generated with the Artic V3 protocol. However, alternative primer schemes such as the Qiaseq Primer Panel, the Swift Amplicon SARS-CoV-2 Panel, and the Artic V4 Amplicon Sequencing Panel can also be analysed with this workflow, since the primer sequence coordinates of the PCR scheme utilized must be provided along with the raw ONT read data in BED and FASTQ file formats, respectively.
Upon initiating a TheiaCoV_ONT run, input ONT read data provided for each sample will be processed to perform consensus genome assembly, infer the quality of both raw read data and the generated consensus genome, and assign SARS-CoV-2 lineage and clade types as outlined in the TheiaCoV_ONT data workflow below.
Consensus genome assembly with the TheiaCoV_ONT workflow is performed by first de-hosting read data with the NCBI SRA-Human-Scrubber tool and then following the Artic nCoV-2019 novel coronavirus bioinformatics protocol <https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html>. Briefly, input reads are filtered by size (min-length: 400bp; max-length: 700bp) with the Artic guppyplex command. These size-selected read data are aligned to the Wuhan-1 reference genome with minimap2 to generate a Binary Alignment Map (BAM) file. Primer sequences are then removed from the BAM file and a consensus assembly file is generated using the Artic medaka command. This assembly is then used to assign lineage and clade designations with Pangolin and NextClade. NCBI’s VADR tool is also employed to screen for potentially errant features (e.g. erroneous frame-shift mutations) in the consensus assembly.
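The size-selection step can be pictured with a small Python sketch. This is not the Artic guppyplex implementation; it only illustrates keeping reads whose length falls within the 400 to 700 bp window, and it assumes both bounds are inclusive. The function name and the toy reads are hypothetical.

```python
def filter_reads_by_length(reads, min_length=400, max_length=700):
    """Keep (name, sequence) pairs whose sequence length is within the window.

    Assumption: both bounds are inclusive.
    """
    return [
        (name, seq)
        for name, seq in reads
        if min_length <= len(seq) <= max_length
    ]


# Toy reads for illustration: one too short, one in range, one too long.
toy_reads = [
    ("read_short", "A" * 150),
    ("read_ok", "A" * 520),
    ("read_long", "A" * 1200),
]
kept = filter_reads_by_length(toy_reads)
print([name for name, _ in kept])
```

Reads outside the window are typically incomplete amplicons or chimeric reads, which is why they are excluded before alignment.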
More information on required user inputs, optional user inputs, default tool parameters, and the outputs generated by TheiaCoV_ONT is outlined below.
Required User Inputs¶
Download CSV: TheiaCoV_ONT_required_inputs.csv
Task | Input Variable | Data Type | Description |
---|---|---|---|
theiacov_ont | demultiplexed_reads | File | Basecalled and demultiplexed ONT read data (single FASTQ file per sample) |
theiacov_ont | primer_bed | File | Primer sequence coordinates of the PCR scheme utilized in BED file format |
theiacov_ont | samplename | String | Name of the sample being analyzed |
Optional User Inputs¶
Download CSV: TheiaCoV_ONT_optional_inputs.csv
Task | Variable Name | Data Type | Description | Default |
---|---|---|---|---|
consensus | cpu | Int | CPU resources allocated to the Artic Medaka task runtime environment | |
consensus | docker | String | Docker tag used for running the Medaka assembler | staphb/artic-ncov2019:1.3.0 |
consensus | medaka_model | String | Model for consensus genome assembly via Medaka | r941_min_high_g360 |
fastqc_se_clean | cpus | Int | CPU resources allocated to the FastQC task runtime environment for assessing size-selected read data | 2 |
fastqc_se_clean | read1_name | String | Name of the sample being analyzed | Inferred from the input read file |
fastqc_se_raw | cpus | Int | CPU resources allocated to the FastQC task runtime environment for assessing raw read data | |
fastqc_se_raw | read1_name | String | Name of the sample being analyzed | Inferred from the input read file |
kraken2_dehosted | cpus | Int | CPU resources allocated to the Kraken task runtime environment for assessing dehosted read data | 4 |
kraken2_dehosted | kraken2_db | String | Path to the Kraken2 database within the staphb/kraken2:2.0.8-beta_hv Docker container | /kraken2-db |
kraken2_dehosted | read2 | File | Optional input file for the Kraken task that is not applicable to this workflow | None |
kraken2_raw | cpus | Int | CPU resources allocated to the Kraken task runtime environment for assessing raw read data | 4 |
kraken2_raw | kraken2_db | String | Path to the Kraken2 database within the staphb/kraken2:2.0.8-beta_hv Docker container | /kraken2-db |
kraken2_raw | read2 | File | Optional input file for the Kraken task that is not applicable to this workflow | None |
ncbi_scrub_se | docker | String | Docker tag used for running the NCBI SRA Human-Scrubber tool | gcr.io/ncbi-sys-gcr-public-research/sra-human-scrubber@sha256:b7dba71079344daea4ea3363e1a67fa54edb7ec65459d039669c68a66d38b140 |
nextclade_one_sample | docker | String | Docker tag used for running NextClade | neherlab/nextclade:0.14.2 |
nextclade_output_parser_one_sample | docker | String | Docker tag used for parsing NextClade output | python:slim |
pangolin3 | docker | String | Docker tag used for running Pangolin | staphb/pangolin:3.1.11-pangolearn-2021-08-24 |
pangolin3 | inference_engine | String | Pangolin inference engine for lineage designations (usher or pangolearn) | usher |
pangolin3 | min_length | Int | Minimum query length allowed for Pangolin to attempt assignment | 10000 |
pangolin3 | max_ambig | Float | Maximum proportion of Ns allowed for Pangolin to attempt assignment | 0.5 |
read_filtering | cpu | Int | CPU resources allocated to the read filtering task (Artic guppyplex) runtime environment | 8 |
read_filtering | max_length | Int | Maximum sequence length | 700 |
read_filtering | min_length | Int | Minimum sequence length | 400 |
read_filtering | run_prefix | String | Run name | artic_ncov2019 |
theiacov_ont | nextclade_dataset_name | String | Nextclade organism dataset | sars-cov-2 |
theiacov_ont | nextclade_dataset_reference | String | Nextclade reference genome | MN908947 |
theiacov_ont | nextclade_dataset_tag | String | Nextclade dataset tag | 2021-06-25T00:00:00Z |
theiacov_ont | artic_primer_version | String | Version of the Artic PCR protocol used to generate input read data | V3 |
theiacov_ont | normalise | Int | Value to normalize read counts | 200 |
theiacov_ont | seq_method | String | Description of the sequencing methodology used to generate the input read data | ONT |
theiacov_ont | pangolin_docker_image | String | Docker tag used for running Pangolin | staphb/pangolin:2.4.2-pangolearn-2021-05-19 |
vadr | docker | String | Docker tag used for running VADR | staphb/vadr:1.2.1 |
vadr | maxlen | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | 30000 |
vadr | minlen | Int | Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script | 50 |
vadr | vadr_opts | String | Options for the v-annotate.pl VADR script | --glsearch -s -r --nomisc --mkey sarscov2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --mdir /opt/vadr/vadr-models/ |
vadr | skip_length | Int | Minimum assembly length (unambiguous) to run VADR | 10000 |
version_capture | timezone | String | User time zone in valid Unix TZ string (e.g. America/New_York) | None |
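When a primer scheme produces amplicons outside the default 400 to 700 bp window, the `read_filtering` length bounds are the optional inputs most likely to need changing. The sketch below builds the corresponding override keys, assuming the Cromwell-style `workflow_name.call_name.input_name` convention and that the call is named `read_filtering` as in the Task column. The helper name and the example bounds are hypothetical and should be replaced with values appropriate to the scheme in use.

```python
import json


def ont_read_filter_overrides(min_length, max_length):
    """Return Cromwell-style keys overriding the read_filtering length bounds."""
    if min_length <= 0 or min_length >= max_length:
        raise ValueError("expected 0 < min_length < max_length")
    return {
        "theiacov_ont.read_filtering.min_length": min_length,
        "theiacov_ont.read_filtering.max_length": max_length,
    }


# Hypothetical bounds for illustration only; they are not a recommendation.
overrides = ont_read_filter_overrides(min_length=250, max_length=1500)
print(json.dumps(overrides, indent=2))
```

These keys would be merged with the required inputs before launching the workflow; any optional input left out keeps the default shown in the table.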
Outputs¶
Download CSV: TheiaCoV_ONT_default_outputs.csv
Output Name | Data Type | Description |
---|---|---|
aligned_bai | File | Index companion file to the BAM file generated during the consensus assembly process |
aligned_bam | File | Primer-trimmed BAM file; generated during the consensus assembly process |
amp_coverage | File | Sequence coverage per amplicon |
artic_version | String | Version of the Artic software utilized for read trimming and consensus genome assembly |
assembly_fasta | File | Consensus genome assembly |
assembly_length_unambiguous | Int | Number of unambiguous basecalls within the SC2 consensus assembly |
assembly_mean_coverage | Float | Mean sequencing depth throughout the consensus assembly generated after performing primer trimming; calculated using the SAMtools coverage command |
assembly_method | String | Method employed to generate consensus assembly |
auspice_json | File | Auspice-compatible JSON output generated from NextClade analysis that includes the NextClade default samples for clade-typing and the single sample placed on this tree |
bedtools_version | String | bedtools version utilized when calculating amplicon read coverage |
consensus_flagstat | File | Output from the SAMtools flagstat command to assess quality of the alignment file (BAM) |
consensus_stats | File | Output from the SAMtools stats command to assess quality of the alignment file (BAM) |
dehosted_reads | File | Dehosted reads; suggested read file for SRA submission |
fastqc_clean | Int | Number of reads after size filtering and dehosting as determined by FastQC |
fastqc_raw | Int | Number of raw input reads as determined by FastQC |
fastqc_version | String | Version of the FastQC software used |
kraken_human | Float | Percent of human read data detected using the Kraken2 software |
kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal |
kraken_report | File | Full Kraken report |
kraken_report_dehosted | File | Full Kraken report after host removal |
kraken_sc2 | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software |
kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal |
kraken_version | String | Version of Kraken software used |
meanbaseq_trim | Float | Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming |
meanmapq_trim | Float | Mean quality of the mapped reads to the reference genome after primer trimming |
nextclade_aa_dels | String | Amino-acid deletions as detected by NextClade |
nextclade_aa_subs | String | Amino-acid substitutions as detected by NextClade |
nextclade_clade | String | NextClade clade designation |
nextclade_json | File | NextClade output in JSON file format |
nextclade_tsv | File | NextClade output in TSV file format |
nextclade_version | String | Version of NextClade software used |
number_Degenerate | Int | Number of degenerate basecalls within the consensus assembly |
number_N | Int | Number of fully ambiguous basecalls within the consensus assembly |
number_Total | Int | Total number of nucleotides within the consensus assembly |
pango_lineage | String | Pango lineage as determined by Pangolin |
pango_lineage_report | File | Full Pango lineage report generated by Pangolin |
pangolin_assignment_version | String | Version of the Pangolin software (e.g. PANGO or PUSHER) used for lineage assignment |
pangolin_conflicts | String | Number of lineage conflicts as determined by Pangolin |
pangolin_docker | String | Docker image used to run Pangolin |
pangolin_notes | String | Lineage notes as determined by Pangolin |
pangolin_versions | String | All Pangolin software and database versions |
percent_reference_coverage | Float | Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of reference genome (SC2: 29,903) x 100 |
primer_bed_name | String | Name of the primer BED file used for primer trimming |
reads_dehosted | File | De-hosted read files |
samtools_version | String | Version of SAMtools used to sort and index the alignment file |
seq_platform | String | Description of the sequencing methodology used to generate the input read data |
theiacov_ont_analysis_date | String | Date of analysis |
theiacov_ont_version | String | Version of the Public Health Viral Genomics (PHVG) repository used |
vadr_alerts_list | File | File containing all of the fatal alerts as determined by VADR |
vadr_docker | String | Docker image used to run VADR |
vadr_num_alerts | String | Number of fatal alerts as determined by VADR |
variants_from_ref_vcf | File | Variants called relative to the reference genome (VCF format) |
TheiaCoV_FASTA¶
The TheiaCoV_FASTA workflow was written to process SARS-CoV-2 assembly files to infer the quality of the input assembly and assign SARS-CoV-2 lineage and clade types as outlined in the TheiaCoV_FASTA data workflow below.
The quality of input SARS-CoV-2 genome assemblies is assessed by the TheiaCoV_FASTA workflow using a series of bash shell scripts. Input assemblies are then used to assign lineage and clade designations with Pangolin and NextClade. NCBI’s VADR tool is also employed to screen for potentially errant features (e.g. erroneous frame-shift mutations) in the input assembly.
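The kind of basic assembly QC described here (e.g. number of Ns) can be approximated in a few lines of Python. The workflow itself is described as using bash shell scripts, so the sketch below is only an illustrative equivalent; the function name is hypothetical, the metric names are borrowed from the output tables of the read-based workflows, and treating the ten IUPAC codes R, Y, S, W, K, M, B, D, H, and V as "degenerate" is an assumption.

```python
IUPAC_DEGENERATE = set("RYSWKMBDHV")  # assumption: partial-ambiguity IUPAC codes


def assembly_base_counts(fasta_text):
    """Count unambiguous, N, and degenerate basecalls in a single-record FASTA."""
    seq = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    ).upper()
    unambiguous = sum(1 for base in seq if base in "ACGT")
    number_n = seq.count("N")
    degenerate = sum(1 for base in seq if base in IUPAC_DEGENERATE)
    return {
        "assembly_length_unambiguous": unambiguous,
        "number_N": number_n,
        "number_Degenerate": degenerate,
        "number_Total": unambiguous + number_n + degenerate,
    }


# Toy FASTA record for illustration.
print(assembly_base_counts(">sample01\nACGTNNRY\nacgt\n"))
```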
More information on required user inputs, optional user inputs, default tool parameters, and the outputs generated by TheiaCoV_FASTA is outlined below.
Required User Inputs¶
Download CSV: TheiaCoV_FASTA_required_inputs.csv
Task | Input Variable | Data Type | Description |
---|---|---|---|
theiacov_fasta | assembly_fasta | File | SARS-CoV-2 assembly file in FASTA format |
theiacov_fasta | input_assembly_method | String | Description of the method utilized to generate the input assembly FASTA file; if unknown, “NA” will be accepted |
theiacov_fasta | samplename | String | Name of the sample being analyzed |
theiacov_fasta | seq_method | String | Description of the sequencing method utilized to generate the raw sequencing data; if unknown, “NA” will be accepted |
Optional User Inputs¶
Download CSV: TheiaCoV_FASTA_optional_inputs.csv
Task | Variable Name | Data Type | Description | Default |
---|---|---|---|---|
nextclade_one_sample | docker | String | Docker tag used for running NextClade | neherlab/nextclade:0.14.2 |
nextclade_output_parser_one_sample | docker | String | Docker tag used for parsing NextClade output | python:slim |
pangolin3 | docker | String | Docker tag used for running Pangolin | staphb/pangolin:3.1.11-pangolearn-2021-08-24 |
pangolin3 | inference_engine | String | Pangolin inference engine for lineage designations (usher or pangolearn) | usher |
pangolin3 | max_ambig | Float | Maximum proportion of Ns allowed for Pangolin to attempt assignment | 0.5 |
pangolin3 | min_length | Int | Minimum query length allowed for Pangolin to attempt assignment | 10000 |
theiacov_fasta | nextclade_dataset_name | String | Nextclade organism dataset | sars-cov-2 |
theiacov_fasta | nextclade_dataset_reference | String | Nextclade reference genome | MN908947 |
theiacov_fasta | nextclade_dataset_tag | String | Nextclade dataset tag | 2021-06-25T00:00:00Z |
vadr | docker | String | Docker tag used for running VADR | staphb/vadr:1.2.1 |
vadr | maxlen | Int | Maximum length for the fasta-trim-terminal-ambigs.pl VADR script | 30000 |
vadr | minlen | Int | Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script | 50 |
vadr | skip_length | Int | Minimum assembly length (unambiguous) to run VADR | 10000 |
vadr | vadr_opts | String | Options for the v-annotate.pl VADR script | --glsearch -s -r --nomisc --mkey sarscov2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --mdir /opt/vadr/vadr-models/ |
version_capture | timezone | String | User time zone in valid Unix TZ string (e.g. America/New_York) | None |
Outputs¶
Download CSV: TheiaCoV_FASTA_default_outputs.csv
TheiaCoV Workflows for Genomic Epidemiology¶
Genomic Epidemiology, i.e. generating phylogenetic trees from a set of consensus assemblies (FASTA format) to track the spread and evolution of viruses on a local, national or global scale, has been an important methodological approach in the effort to mitigate disease transmission.
The TheiaCoV Genomic Epidemiology Series contains two separate WDL workflows (TheiaCoV_Augur_Prep and TheiaCoV_Augur_Run) that process a set of viral genomic assemblies to generate phylogenetic trees (JSON format) and metadata files, which can be used to assign epidemiological data to each assembly for subsequent analyses.
The two TheiaCoV workflows for genomic epidemiology must be run sequentially to first prepare the data for phylogenetic analysis and second to generate the phylogenetic trees. More information on the technical details of these processes and information on how to utilize and apply these workflows for public health investigations is available below.
TheiaCoV_Augur_Prep¶
The TheiaCoV_Augur_Prep workflow was written to process consensus assemblies (FASTA format) and the associated metadata in preparation for running the TheiaCoV_Augur_Run workflow. Input assemblies should be of similar quality (percent reference coverage, number of ambiguous bases, etc.). Inputs with highly discordant quality metrics may result in inaccurate inference of genetic relatedness.
Note
There must be some sequence diversity in the input set of assemblies to be analyzed. As a rule of thumb, the smaller the input set, the more sequence diversity will be required to make any sort of genomic inference. If a small (~10) set of viral genomic assemblies is used as the input then it may be necessary to add one significantly divergent assembly.
Upon initiating a TheiaCoV_Augur_Prep run, input assembly/consensus files and associated metadata will be used to produce the array of assembly/consensus files and the array of metadata files to be used as inputs for the TheiaCoV_Augur_Run workflow.
Metadata files are prepared by the TheiaCoV_Augur_Prep workflow using Bash commands that first de-identify and then parse the headers of the input assembly files.
Required User Inputs¶
Download CSV: TheiaCoV_Augur_Prep_required_inputs.csv
Task | Input Variable | Data Type | Description |
---|---|---|---|
prep_augur_metadata | assembly | File | Assembly/consensus file (single FASTA file per sample) |
prep_augur_metadata | collection_date | String | Collection date of the sample to be included in the analysis |
prep_augur_metadata | iso_country | String | Country of the sample to be included in the analysis |
prep_augur_metadata | iso_state | String | State of the sample to be included in the analysis |
prep_augur_metadata | iso_continent | String | Continent of the sample to be included in the analysis |
prep_augur_metadata | pango_lineage | String | Pango Lineage of the sample to be included in the analysis |
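To give a sense of what a per-sample metadata file might look like, the sketch below writes a one-row TSV from the inputs listed above. The column names (`strain`, `date`, `region`, `country`, `division`, `pango_lineage`) and the mapping of continent to `region` and state to `division` are assumptions based on common Nextstrain conventions, not a description of the file the workflow actually emits. The `strain` argument is a hypothetical identifier standing in for the name parsed from the assembly header.

```python
import csv
import io


def augur_metadata_tsv(strain, collection_date, iso_country, iso_state,
                       iso_continent, pango_lineage):
    """Return a two-line TSV (header plus one sample row) as a string."""
    # Assumed column names; adjust to match the metadata schema in use.
    columns = ["strain", "date", "region", "country", "division", "pango_lineage"]
    row = [strain, collection_date, iso_continent, iso_country, iso_state,
           pango_lineage]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerow(row)
    return buffer.getvalue()


# Hypothetical sample values for illustration only.
print(augur_metadata_tsv("sampleA", "2021-08-01", "USA", "California",
                         "North America", "B.1.617.2"))
```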
TheiaCoV_Augur_Run¶
The TheiaCoV_Augur_Run workflow was written to process an array of assembly/consensus files (FASTA format) and an array of sample metadata files (TSV format) using a modified version of The Broad Institute’s sarscov2_nextstrain WDL workflow to create an Auspice JSON file; output from the modified sarscov2_nextstrain workflow will also be used to infer SNP distances and create a static PDF report.
Upon initiating a TheiaCoV_Augur_Run run, the input assembly/consensus file array and the associated metadata file array will be used to generate a JSON file that is compatible with phylogenetic tree visualization software. This JSON can then be used in Auspice or Nextstrain to view the phylogenetic tree. This phylogenetic tree can be used in genomic epidemiological analysis to visualize the genetic relatedness of a set of samples. The associated metadata can then be used to add context to the phylogenetic visualization.
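The idea of a SNP distance can be illustrated with a minimal sketch that compares aligned sequences position by position. This is a conceptual example only and is not the method used by the workflow; it assumes the sequences are already aligned to the same length and that positions where either sequence has an ambiguous base or gap are ignored. The function names and toy sequences are hypothetical.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count positions where both aligned sequences have differing A/C/G/T calls."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in UNAMBIGUOUS and b in UNAMBIGUOUS and a != b
    )


def pairwise_snp_distances(aligned):
    """Return {(name_1, name_2): distance} for every pair of named sequences."""
    return {
        (n1, n2): snp_distance(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


# Toy alignment for illustration only.
toy_alignment = {"a": "ACGTAC", "b": "ACGTAA", "c": "NCGTTA"}
print(pairwise_snp_distances(toy_alignment))
```

Small pairwise distances suggest closely related samples, although interpretation should always take assembly quality and the associated epidemiological metadata into account.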
Required User Inputs¶
Download CSV: TheiaCoV_Augur_Run_required_inputs.csv
Task | Input Variable | Data Type | Description |
---|---|---|---|
sarscov2_nextstrain | assembly_fastas | Array[File] | An array of assembly/consensus files (FASTA) |
sarscov2_nextstrain | sample_metadata_tsvs | Array[File] | An array of sample metadata files (TSV) |
sarscov2_nextstrain | build_name | String | The name of the Augur build to be used in this analysis |