Titan Workflow Series

The Titan Workflow Series is a collection of WDL workflows developed for performing genomic characterization and genomic epidemiology of viral samples to support public health decision-making. As of today (May 4th, 2021) these workflows are specific to SARS-CoV-2 amplicon read data, but work is underway to allow for the analysis of other viral pathogens of concern.

Titan Workflows for Genomic Characterization

Genomic characterization, i.e. generating consensus assemblies (FASTA format) from next-generation sequencing (NGS) read data (FASTQ format) and assigning samples relevant nomenclature designations (e.g. PANGO lineages and NextClade clades), is an increasingly critical function for public health laboratories around the world.

The Titan Series includes four separate WDL workflows (Titan_Illumina_PE, Titan_Illumina_SE, Titan_ClearLabs, and Titan_ONT) that process NGS read data from four different sequencing approaches (Illumina paired-end, Illumina single-end, Clear Labs, and Oxford Nanopore Technologies (ONT)) to generate consensus assemblies, produce relevant quality-control metrics for both the input read data and the generated assembly, and assign each sample a lineage and clade designation using Pangolin and NextClade, respectively.

All four Titan workflows for genomic characterization generate a viral assembly by mapping input read data to a reference genome, removing primer sequences from that alignment, and then calling the consensus assembly based on the primer-trimmed alignment. These consensus assemblies are then fed into the Pangolin and NextClade CLI tools for lineage and clade assignments.

The major difference between these Titan workflows is in how read mapping, primer trimming, and consensus genome calling are performed. The technical details of these processes, and guidance on applying these workflows to public health investigations, are given below.

A series of introductory training videos, which provide conceptual overviews of the methodologies and walkthrough tutorials on how to use these Titan workflows through Terra, is available on the Theiagen Genomics YouTube page.


Titan_Illumina_PE

The Titan_Illumina_PE workflow was written to process Illumina paired-end (PE) read data. Input reads are assumed to be the product of sequencing tiled PCR amplicons designed for the SARS-CoV-2 genome. The most common read data analyzed by the Titan_Illumina_PE workflow are generated with the Artic V3 protocol. Alternative primer schemes, such as the Qiaseq Primer Panel, can also be analyzed with this workflow. The primer sequence coordinates of the PCR scheme utilized must be provided along with the raw paired-end Illumina read data, in BED and FASTQ file formats, respectively.
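As a rough illustration of what the primer BED input contains, the sketch below parses the first four tab-separated BED columns (reference name, 0-based start, exclusive end, primer name) into simple records. The `Primer` type, the `parse_primer_bed` helper, and the two example rows are hypothetical and shown only to make the file layout concrete; they are not part of the workflow.

```python
from typing import List, NamedTuple


class Primer(NamedTuple):
    chrom: str   # reference sequence name
    start: int   # 0-based, inclusive (BED convention)
    end: int     # 0-based, exclusive (BED convention)
    name: str    # primer name


def parse_primer_bed(text: str) -> List[Primer]:
    """Parse the first four columns of a tab-separated BED file."""
    primers = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            raise ValueError(f"expected at least 4 tab-separated columns: {line!r}")
        chrom, start, end, name = fields[:4]
        primers.append(Primer(chrom, int(start), int(end), name))
    return primers


# Illustrative rows only; real coordinates come from the primer scheme in use.
example_bed = (
    "MN908947.3\t30\t54\tnCoV-2019_1_LEFT\n"
    "MN908947.3\t385\t410\tnCoV-2019_1_RIGHT\n"
)
primers = parse_primer_bed(example_bed)
for p in primers:
    print(p.name, p.end - p.start, "bp")
```

Because BED intervals are half-open, the primer length is simply `end - start`.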

Note

By default, this workflow assumes that input reads were generated using a 300-cycle kit (i.e. 2 x 150 bp reads). The optional parameter trimmomatic_minlen may need to be lowered to accommodate shorter read data, such as 2 x 75 bp reads generated using a 150-cycle kit.
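The reasoning behind this note can be sketched in a few lines: with the default trimmomatic_minlen of 75, a 150 bp read can lose bases to trimming and still be kept, whereas a 75 bp read is discarded as soon as a single base is trimmed. The helper names below are hypothetical and only illustrate that arithmetic.

```python
DEFAULT_PE_MINLEN = 75  # default listed for trimmomatic_minlen in this workflow


def read_length_from_kit(cycles: int, paired_end: bool = True) -> int:
    """Nominal read length for a kit, e.g. 300 cycles paired-end -> 150 bp."""
    return cycles // 2 if paired_end else cycles


def minlen_leaves_headroom(read_length: int, minlen: int = DEFAULT_PE_MINLEN) -> bool:
    """True if a read can lose at least one base to trimming and still be kept."""
    return read_length > minlen


for cycles in (300, 150):
    length = read_length_from_kit(cycles)
    print(cycles, "cycle kit:", length, "bp reads, headroom:",
          minlen_leaves_headroom(length))
```

When the check returns False, a lower trimmomatic_minlen chosen for the run is likely needed; the appropriate value depends on the data and is not prescribed here.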

Upon initiating a Titan_Illumina_PE job, the input primer scheme coordinates and raw paired-end Illumina read data provided for each sample will be processed to perform consensus genome assembly, infer the quality of both raw read data and the generated consensus genome, and assign samples SARS-CoV-2 lineage and clade types as outlined in the Titan_Illumina_PE data workflow below.

Figure: Titan_Illumina_PE v1.4.4 Data Workflow

Consensus genome assembly with the Titan_Illumina_PE workflow is performed by first de-hosting read data with the NCBI SRA-Human-Scrubber tool, then trimming low-quality reads with Trimmomatic and removing adapter sequences with BBDuk. These cleaned read data are then aligned to the Wuhan-1 reference genome with BWA to generate a Binary Alignment Map (BAM) file. Primer sequences are then removed from the BAM file using the iVar trim sub-command. The iVar consensus sub-command is then used to generate a consensus assembly in FASTA format. This assembly is then used to assign lineage and clade designations with Pangolin and NextClade. NCBI's VADR tool is also employed to screen for potentially errant features (e.g. erroneous frame-shift mutations) in the consensus assembly.
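To give a feel for how the consensus thresholds interact, here is a deliberately simplified toy model of a per-position consensus call using this workflow's default min_depth (10), min_freq (0.6), and char_unknown (N). It is not iVar's algorithm: iVar works from a SAMtools pileup, applies base-quality filtering, and can emit IUPAC ambiguity codes, none of which is modelled here. The function name is hypothetical.

```python
from collections import Counter


def call_consensus_base(bases, min_depth=10, min_freq=0.6, char_unknown="N"):
    """Toy consensus call for one position from the bases observed there.

    Returns char_unknown when depth is too low or when no single base
    reaches the frequency threshold (a simplification of real behaviour).
    """
    depth = len(bases)
    if depth < min_depth:
        return char_unknown
    base, count = Counter(bases).most_common(1)[0]
    return base if count / depth >= min_freq else char_unknown


# One entry per reference position: the bases seen in reads covering it.
pileup = ["A" * 12, "A" * 8 + "G" * 2, "C" * 5 + "T" * 5, "G" * 4]
consensus = "".join(call_consensus_base(column) for column in pileup)
print(consensus)
```

The third column fails the frequency threshold and the fourth fails the depth threshold, so both are masked in this toy model.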

More information on required user inputs, optional user inputs, default tool parameters, and the outputs generated by Titan_Illumina_PE is provided below.

Required User Inputs

Download CSV: Titan_Illumina_PE_required_inputs.csv

Task

Input Variable

Data Type

Description

titan_illumina_pe

primer_bed

File

Primer sequence coordinates of the PCR scheme utilized in BED file format

titan_illumina_pe

read1_raw

File

Forward Illumina read in FASTQ file format

titan_illumina_pe

read2_raw

File

Reverse Illumina read in FASTQ file format

titan_illumina_pe

samplename

String

Name of the sample being analyzed
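A minimal sketch of how the four required inputs might be assembled into an inputs file for a single sample follows. It assumes the Cromwell-style `workflow.variable` key convention for WDL inputs JSON; the helper name, the sample name, and the bucket paths are hypothetical placeholders.

```python
import json


def required_inputs(samplename, read1, read2, primer_bed,
                    workflow="titan_illumina_pe"):
    """Build a dict of the required inputs, keyed as 'workflow.variable'."""
    return {
        f"{workflow}.samplename": samplename,
        f"{workflow}.read1_raw": read1,
        f"{workflow}.read2_raw": read2,
        f"{workflow}.primer_bed": primer_bed,
    }


# Hypothetical sample name and file locations, for illustration only.
inputs = required_inputs(
    samplename="sample01",
    read1="gs://example-bucket/sample01_R1.fastq.gz",
    read2="gs://example-bucket/sample01_R2.fastq.gz",
    primer_bed="gs://example-bucket/primer_scheme.bed",
)
print(json.dumps(inputs, indent=2))
```

On Terra these values are normally supplied through a data table rather than a hand-written JSON file, so this is only meant to show which four values every sample needs.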


Optional User Inputs

Download CSV: Titan_Illumina_PE_optional_inputs.csv

Task

Variable Name

Data Type

Description

Default

bedtools_cov

primer_bed

String

Path to the primer sequence coordinates of the PCR scheme utilized in BED file format

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019_amplicon.bed

bedtools_cov

fail_threshold

String

Minimum coverage threshold to determine amplicon sequencing failure

20x

bwa

reference_genome

String

Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta

bwa

cpus

Int

CPU resources allocated to the BWA task runtime environment

6

consensus

ref_gff

String

Path to the general feature format of the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/reference/GCF_009858895.2_ASM985889v3_genomic.gff

consensus

ref_genome

String

Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta

consensus

min_qual

Int

Minimum quality threshold for sliding window to pass for iVar consensus

20

consensus

min_freq

Float

Minimum frequency threshold (0 - 1) to call variants for iVar consensus

0.6

consensus

min_depth

Int

Minimum read depth to call variants for iVar consensus

10

consensus

min_bq

Int

Minimum base quality for a base to be considered by SAMtools mpileup before running iVar consensus

0

consensus

max_depth

Int

Maximum number of reads read at a position per input file for SAMtools mpileup before running iVar consensus

600000

consensus

disable_baq

Boolean

Disable base alignment quality (BAQ) computation for SAMtools mpileup before running iVar consensus

TRUE

consensus

count_orphans

Boolean

Do not skip anomalous read pairs in variant calling for SAMtools mpileup before running iVar consensus

TRUE

consensus

char_unknown

String

Character to print in regions with less than minimum coverage for iVar consensus

N

nextclade_one_sample

root_sequence

File

Custom reference sequence file for NextClade

None

nextclade_one_sample

qc_config_json

File

Custom QC configuration file for NextClade

None

nextclade_one_sample

pcr_primers_csv

File

Custom PCR primers file for NextClade

None

nextclade_one_sample

gene_annotations_json

File

Custom gene annotation file for NextClade

None

nextclade_one_sample

docker

String

Docker tag used for running NextClade

neherlab/nextclade:0.14.2

nextclade_one_sample

auspice_reference_tree_json

File

Custom reference tree file for NextClade

None

pangolin3

inference_engine

String

Pangolin inference engine for lineage designations (usher or pangolearn)

usher

pangolin3

min_length

Int

Minimum query length allowed for pangolin to attempt assignment

10000

pangolin3

max_ambig

Float

Maximum proportion of Ns allowed for pangolin to attempt assignment

0.5

primer_trim

keep_noprimer_reads

Boolean

Include reads with no primers for iVar trim

True

read_QC_trim

trimmomatic_window_size

Int

Specifies the number of bases to average across for Trimmomatic

4

read_QC_trim

trimmomatic_quality_trim_score

Int

Specifies the average quality required for Trimmomatic

30

read_QC_trim

trimmomatic_minlen

Int

Specifies the minimum length of reads to be kept for Trimmomatic

75

titan_illumina_pe

seq_method

String

Description of the sequencing methodology used to generate the input read data

Illumina paired-end

titan_illumina_pe

pangolin_docker_image

String

Docker tag used for running Pangolin

staphb/pangolin:2.4.2-pangolearn-2021-05-19

vadr

docker

String

Docker tag used for running VADR

staphb/vadr:1.2.1

vadr

maxlen

Int

Maximum length for the fasta-trim-terminal-ambigs.pl VADR script

30000

vadr

minlen

Int

Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script

50

vadr

vadr_opts

String

Options for the v-annotate.pl VADR script

--glsearch -s -r --nomisc --mkey sarscov2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --mdir /opt/vadr/vadr-models/

vadr

skip_length

Int

Minimum assembly length (unambiguous) to run VADR

10000

variant_call

ref_gff

String

Path to the general feature format of the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/reference/GCF_009858895.2_ASM985889v3_genomic.gff

variant_call

ref_genome

String

Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta

variant_call

min_qual

Int

Minimum quality threshold for sliding window to pass for iVar variants

20

variant_call

min_freq

Float

Minimum frequency threshold (0 - 1) to call variants for iVar variants

0.6

variant_call

min_depth

Int

Minimum read depth to call variants for iVar variants

10

variant_call

min_bq

Int

Minimum base quality for a base to be considered by SAMtools mpileup before running iVar variants

0

variant_call

max_depth

Int

Maximum number of reads read at a position per input file for SAMtools mpileup before running iVar variants

600000

variant_call

disable_baq

Boolean

Disable base alignment quality (BAQ) computation for SAMtools mpileup before running iVar variants

TRUE

variant_call

count_orphans

Boolean

Do not skip anomalous read pairs in variant calling for SAMtools mpileup before running iVar variants

TRUE

version_capture

timezone

String

User time zone in valid Unix TZ string (e.g. America/New_York)

None
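The pangolin3 and vadr length thresholds in the table above act as gates on the consensus assembly. The sketch below shows one plausible reading of those gates using the listed defaults (min_length 10000, max_ambig 0.5, skip_length 10000). The function names are hypothetical, and the exact boundary handling (inclusive versus exclusive comparisons) is an assumption rather than something stated in this documentation.

```python
def unambiguous_length(seq: str) -> int:
    """Count A, C, G and T basecalls in an assembly."""
    return sum(1 for base in seq.upper() if base in "ACGT")


def pangolin_would_attempt(seq: str, min_length: int = 10000,
                           max_ambig: float = 0.5) -> bool:
    """Assumed gate: long enough, and proportion of Ns not above max_ambig."""
    if len(seq) < min_length:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction <= max_ambig


def vadr_would_run(seq: str, skip_length: int = 10000) -> bool:
    """Assumed gate: unambiguous length at least skip_length."""
    return unambiguous_length(seq) >= skip_length


# Synthetic sequences, used only to exercise the thresholds.
good = "A" * 12000 + "N" * 3000
poor = "A" * 4000 + "N" * 8000
print(pangolin_would_attempt(good), vadr_would_run(good))
print(pangolin_would_attempt(poor), vadr_would_run(poor))
```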


Outputs

Download CSV: Titan_Illumina_PE_default_outputs.csv

Output Name

Data Type

Description

aligned_bai

File

Index companion file to the bam file generated during the consensus assembly process

aligned_bam

File

Primer-trimmed BAM file; generated during the consensus assembly process

assembly_fasta

File

Consensus genome assembly

assembly_length_unambiguous

Int

Number of unambiguous basecalls within the SC2 consensus assembly

assembly_mean_coverage

Float

Mean sequencing depth throughout the consensus assembly generated after performing primer trimming; calculated using the SAMtools coverage command

assembly_method

String

Method employed to generate consensus assembly

auspice_json

File

Auspice-compatible JSON output generated from NextClade analysis that includes the NextClade default samples for clade-typing and the single sample placed on this tree

bbduk_docker

String

Docker image used to run BBDuk

bwa_version

String

Version of BWA used to map read data to the reference genome

consensus_flagstat

File

Output from the SAMtools flagstat command to assess quality of the alignment file (BAM)

consensus_stats

File

Output from the SAMtools stats command to assess quality of the alignment file (BAM)

dehosted_read1

File

Dehosted forward reads; suggested read file for SRA submission

dehosted_read2

File

Dehosted reverse reads; suggested read file for SRA submission

fastqc_clean_pairs

String

Number of paired reads after quality trimming and adapter removal as determined by FastQC

fastqc_clean1

Int

Number of forward reads after quality trimming and adapter removal as determined by FastQC

fastqc_clean2

Int

Number of reverse reads after quality trimming and adapter removal as determined by FastQC

fastqc_raw_pairs

String

Number of paired reads identified in the input fastq files as determined by FastQC

fastqc_raw1

Int

Number of forward reads identified in the input fastq files as determined by FastQC

fastqc_raw2

Int

Number of reverse reads identified in the input fastq files as determined by FastQC

fastqc_version

String

Version of the FastQC software used for read QC analysis

ivar_tsv

File

Variant descriptor file generated by iVar variants

ivar_variant_version

String

Version of iVar for running the iVar variants command

ivar_version_consensus

String

Version of iVar for running the iVar consensus command

ivar_version_primtrim

String

Version of iVar for running the iVar trim command

kraken_human

Float

Percent of human read data detected using the Kraken2 software

kraken_human_dehosted

Float

Percent of human read data detected using the Kraken2 software after host removal

kraken_report

File

Full Kraken report

kraken_report_dehosted

File

Full Kraken report after host removal

kraken_sc2

Float

Percent of SARS-CoV-2 read data detected using the Kraken2 software

kraken_sc2_dehosted

Float

Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal

kraken_version

String

Version of Kraken software used

meanbaseq_trim

Float

Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming

meanmapq_trim

Float

Mean quality of the mapped reads to the reference genome after primer trimming

nextclade_aa_dels

String

Amino-acid deletions as detected by NextClade

nextclade_aa_subs

String

Amino-acid substitutions as detected by NextClade

nextclade_clade

String

NextClade clade designation

nextclade_json

File

NextClade output in JSON file format

nextclade_tsv

File

NextClade output in TSV file format

nextclade_version

String

Version of NextClade software used

number_Degenerate

Int

Number of degenerate basecalls within the consensus assembly

number_N

Int

Number of fully ambiguous basecalls within the consensus assembly

number_Total

Int

Total number of nucleotides within the consensus assembly

pango_lineage

String

Pango lineage as determined by Pangolin

pango_lineage_report

File

Full Pango lineage report generated by Pangolin

pangolin_conflicts

String

Number of lineage conflicts as determined by Pangolin

pangolin_docker

String

Docker image used to run Pangolin

pangolin_notes

String

Lineage notes as determined by Pangolin

pangolin_version

String

Pangolin and PangoLEARN versions used

percent_reference_coverage

Float

Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of reference genome (SC2: 29,903) x 100

primer_trimmed_read_percent

Float

Percent of read data with primers trimmed as determined by iVar trim

read1_clean

File

Forward read file after quality trimming and adapter removal

read2_clean

File

Reverse read file after quality trimming and adapter removal

samtools_version

String

Version of SAMtools used to sort and index the alignment file

samtools_version_consensus

String

Version of SAMtools used to create the pileup before running iVar consensus

samtools_version_primtrim

String

Version of SAMtools used to create the pileup before running iVar trim

samtools_version_stats

String

Version of SAMtools used to assess quality of read mapping

seq_platform

String

Description of the sequencing methodology used to generate the input read data

titan_illumina_pe_analysis_date

String

Date of analysis

titan_illumina_pe_version

String

Version of the Public Health Viral Genomics (PHVG) repository used

trimmomatic_version

String

Version of Trimmomatic used

vadr_alerts_list

File

File containing all of the fatal alerts as determined by VADR

vadr_docker

String

Docker image used to run VADR

vadr_num_alerts

String

Number of fatal alerts as determined by VADR
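The assembly-level counts and the percent_reference_coverage formula described in the table above can be sketched as follows. The function is hypothetical, and treating every character that is neither A/C/G/T nor N as "degenerate" is an assumption made for this illustration.

```python
SC2_REFERENCE_LENGTH = 29903  # reference genome length quoted in the table above


def assembly_metrics(seq: str, reference_length: int = SC2_REFERENCE_LENGTH) -> dict:
    """Compute basecall counts and percent reference coverage for one assembly."""
    seq = seq.upper()
    number_total = len(seq)
    number_n = seq.count("N")
    unambiguous = sum(seq.count(base) for base in "ACGT")
    # Assumption: anything that is not A/C/G/T/N counts as degenerate.
    number_degenerate = number_total - number_n - unambiguous
    return {
        "assembly_length_unambiguous": unambiguous,
        "number_N": number_n,
        "number_Degenerate": number_degenerate,
        "number_Total": number_total,
        # assembly_length_unambiguous / reference length x 100
        "percent_reference_coverage": unambiguous / reference_length * 100,
    }


# Tiny synthetic sequence with a matching toy reference length.
metrics = assembly_metrics("ACGTNNRY", reference_length=8)
print(metrics)
```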


Titan_Illumina_SE

The Titan_Illumina_SE workflow was written to process Illumina single-end (SE) read data. Input reads are assumed to be the product of sequencing tiled PCR amplicons designed for the SARS-CoV-2 genome. The most common read data analyzed by the Titan_Illumina_SE workflow are generated with the Artic V3 protocol. Alternative primer schemes, such as the Qiaseq Primer Panel, can also be analyzed with this workflow. The primer sequence coordinates of the PCR scheme utilized must be provided along with the raw single-end Illumina read data, in BED and FASTQ file formats, respectively.

Note

By default, this workflow assumes that input reads were generated using a 35-cycle kit (i.e. 1 x 35 bp reads). The optional parameter trimmomatic_minlen may need to be modified to accommodate longer read data.

Upon initiating a Titan_Illumina_SE job, the input primer scheme coordinates and raw single-end Illumina read data provided for each sample will be processed to perform consensus genome assembly, infer the quality of both raw read data and the generated consensus genome, and assign samples SARS-CoV-2 lineage and clade types as outlined in the Titan_Illumina_SE data workflow below.

Figure: Titan_Illumina_SE v1.4.4 Data Workflow

Consensus genome assembly with the Titan_Illumina_SE workflow is performed by first trimming low-quality reads with Trimmomatic and removing adapter sequences with BBDuk. These cleaned read data are then aligned to the Wuhan-1 reference genome with BWA to generate a Binary Alignment Map (BAM) file. Primer sequences are then removed from the BAM file using the iVar trim sub-command. The iVar consensus sub-command is then used to generate a consensus assembly in FASTA format. This assembly is then used to assign lineage and clade designations with Pangolin and NextClade. NCBI's VADR tool is also employed to screen for potentially errant features (e.g. erroneous frame-shift mutations) in the consensus assembly.
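The quality-trimming step is governed by three optional parameters: trimmomatic_window_size (4), trimmomatic_quality_trim_score (30), and trimmomatic_minlen (25 for this workflow). The sketch below is a simplified model of sliding-window trimming that cuts a read at the start of the first window whose average quality drops below the threshold, then applies the minimum-length filter. Trimmomatic's own implementation may handle window edges differently, and the function names are hypothetical.

```python
def sliding_window_trim(quals, window_size=4, min_avg_quality=30):
    """Return how many leading bases are kept after sliding-window trimming."""
    for start in range(0, len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            # Cut at the start of the first failing window (simplification).
            return start
    return len(quals)


def keep_read(quals, minlen=25, window_size=4, min_avg_quality=30):
    """True if the trimmed read is still at least minlen bases long."""
    return sliding_window_trim(quals, window_size, min_avg_quality) >= minlen


# Synthetic per-base Phred scores: a good stretch followed by a poor tail.
long_good_start = [35] * 30 + [10] * 10
short_good_start = [35] * 20 + [10] * 10
print(sliding_window_trim(long_good_start), keep_read(long_good_start))
print(sliding_window_trim(short_good_start), keep_read(short_good_start))
```

In this model the first read keeps enough bases to pass the length filter, while the second is discarded.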

More information on required user inputs, optional user inputs, default tool parameters, and the outputs generated by Titan_Illumina_SE is provided below.

Required User Inputs

Download CSV: Titan_Illumina_SE_required_inputs.csv

Task

Input Variable

Data Type

Description

titan_illumina_se

primer_bed

File

Primer sequence coordinates of the PCR scheme utilized in BED file format

titan_illumina_se

read1_raw

File

Single-end Illumina read in FASTQ file format

titan_illumina_se

samplename

String

Name of the sample being analyzed


Optional User Inputs

Download CSV: Titan_Illumina_SE_optional_inputs.csv

Task

Variable Name

Data Type

Description

Default

bedtools_cov

primer_bed

String

Path to the primer sequence coordinates of the PCR scheme utilized in BED file format

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019_amplicon.bed

bedtools_cov

fail_threshold

String

Minimum coverage threshold to determine amplicon sequencing failure

20x

bwa

reference_genome

String

Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta

bwa

cpus

Int

CPU resources allocated to the BWA task runtime environment

6

bwa

read2

File

Optional input file for the bwa task that is not applicable to this workflow

None

consensus

ref_gff

String

Path to the general feature format of the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/reference/GCF_009858895.2_ASM985889v3_genomic.gff

consensus

ref_genome

String

Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta

consensus

min_qual

Int

Minimum quality threshold for sliding window to pass for iVar consensus

20

consensus

min_freq

Float

Minimum frequency threshold (0 - 1) to call variants for iVar consensus

0.6

consensus

min_depth

Int

Minimum read depth to call variants for iVar consensus

10

consensus

min_bq

Int

Minimum base quality for a base to be considered by SAMtools mpileup before running iVar consensus

0

consensus

max_depth

Int

Maximum number of reads read at a position per input file for SAMtools mpileup before running iVar consensus

600000

consensus

disable_baq

Boolean

Disable base alignment quality (BAQ) computation for SAMtools mpileup before running iVar consensus

TRUE

consensus

count_orphans

Boolean

Do not skip anomalous read pairs in variant calling for SAMtools mpileup before running iVar consensus

TRUE

consensus

char_unknown

String

Character to print in regions with less than minimum coverage for iVar consensus

N

nextclade_one_sample

root_sequence

File

Custom reference sequence file for NextClade

None

nextclade_one_sample

qc_config_json

File

Custom QC configuration file for NextClade

None

nextclade_one_sample

pcr_primers_csv

File

Custom PCR primers file for NextClade

None

nextclade_one_sample

gene_annotations_json

File

Custom gene annotation file for NextClade

None

nextclade_one_sample

docker

String

Docker tag used for running NextClade

neherlab/nextclade:0.14.2

nextclade_one_sample

auspice_reference_tree_json

File

Custom reference tree file for NextClade

None

pangolin3

inference_engine

String

Pangolin inference engine for lineage designations (usher or pangolearn)

usher

pangolin3

min_length

Int

Minimum query length allowed for pangolin to attempt assignment

10000

pangolin3

max_ambig

Float

Maximum proportion of Ns allowed for pangolin to attempt assignment

0.5

primer_trim

keep_noprimer_reads

Boolean

Include reads with no primers for iVar trim

True

read_QC_trim

trimmomatic_window_size

Int

Specifies the number of bases to average across for Trimmomatic

4

read_QC_trim

trimmomatic_quality_trim_score

Int

Specifies the average quality required for Trimmomatic

30

read_QC_trim

trimmomatic_minlen

Int

Specifies the minimum length of reads to be kept for Trimmomatic

25

titan_illumina_se

seq_method

String

Description of the sequencing methodology used to generate the input read data

Illumina paired-end

titan_illumina_se

pangolin_docker_image

String

Docker tag used for running Pangolin

staphb/pangolin:2.4.2-pangolearn-2021-05-19

vadr

docker

String

Docker tag used for running VADR

staphb/vadr:1.2.1

vadr

maxlen

Int

Maximum length for the fasta-trim-terminal-ambigs.pl VADR script

30000

vadr

minlen

Int

Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script

50

vadr

vadr_opts

String

Options for the v-annotate.pl VADR script

--glsearch -s -r --nomisc --mkey sarscov2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --mdir /opt/vadr/vadr-models/

vadr

skip_length

Int

Minimum assembly length (unambiguous) to run VADR

10000

variant_call

ref_gff

String

Path to the general feature format of the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/reference/GCF_009858895.2_ASM985889v3_genomic.gff

variant_call

ref_genome

String

Path to the reference genome within the staphb/ivar:1.2.2_artic20200528 Docker container

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta

variant_call

min_qual

Int

Minimum quality threshold for sliding window to pass for iVar variants

20

variant_call

min_freq

Float

Minimum frequency threshold (0 - 1) to call variants for iVar variants

0.6

variant_call

min_depth

Int

Minimum read depth to call variants for iVar variants

10

variant_call

min_bq

Int

Minimum base quality for a base to be considered by SAMtools mpileup before running iVar variants

0

variant_call

max_depth

Int

Maximum number of reads read at a position per input file for SAMtools mpileup before running iVar variants

600000

variant_call

disable_baq

Boolean

Disable base alignment quality (BAQ) computation for SAMtools mpileup before running iVar variants

TRUE

variant_call

count_orphans

Boolean

Do not skip anomalous read pairs in variant calling for SAMtools mpileup before running iVar variants

TRUE

version_capture

timezone

String

User time zone in valid Unix TZ string (e.g. America/New_York)

None
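The bedtools_cov fail_threshold parameter above is given as a string such as "20x". A minimal sketch of how such a threshold could be applied to per-amplicon depth values follows. How the workflow itself summarizes coverage per amplicon is not described here, so the use of a mean depth per amplicon, the helper names, and the example values are all assumptions.

```python
def parse_threshold(value: str) -> float:
    """Convert a coverage string such as '20x' into a number."""
    return float(value.strip().rstrip("xX"))


def failed_amplicons(mean_depth_by_amplicon: dict, fail_threshold: str = "20x") -> list:
    """Return the names of amplicons whose depth is below the threshold."""
    cutoff = parse_threshold(fail_threshold)
    return sorted(
        name for name, depth in mean_depth_by_amplicon.items() if depth < cutoff
    )


# Hypothetical amplicon names and depths, for illustration only.
depths = {"amplicon_1": 150.0, "amplicon_2": 3.5, "amplicon_3": 20.0}
print(failed_amplicons(depths))
```

An amplicon sitting exactly at the threshold is treated as passing in this sketch; whether the workflow uses a strict or inclusive comparison is not stated in this documentation.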


Outputs

Download CSV: Titan_Illumina_SE_default_outputs.csv

Output Name

Data Type

Description

aligned_bai

File

Index companion file to the bam file generated during the consensus assembly process

aligned_bam

File

Primer-trimmed BAM file; generated during the consensus assembly process

assembly_fasta

File

Consensus genome assembly

assembly_length_unambiguous

Int

Number of unambiguous basecalls within the SC2 consensus assembly

assembly_mean_coverage

Float

Mean sequencing depth throughout the consensus assembly generated after performing primer trimming; calculated using the SAMtools coverage command

assembly_method

String

Method employed to generate consensus assembly

auspice_json

File

Auspice-compatible JSON output generated from NextClade analysis that includes the NextClade default samples for clade-typing and the single sample placed on this tree

bbduk_docker

String

Docker image used to run BBDuk

bwa_version

String

Version of BWA used to map read data to the reference genome

consensus_flagstat

File

Output from the SAMtools flagstat command to assess quality of the alignment file (BAM)

consensus_stats

File

Output from the SAMtools stats command to assess quality of the alignment file (BAM)

fastqc_clean

Int

Number of reads after quality trimming and adapter removal as determined by FastQC

fastqc_raw

Int

Number of reads identified in the input fastq file as determined by FastQC

fastqc_version

String

Version of the FastQC software used for read QC analysis

ivar_tsv

File

Variant descriptor file generated by iVar variants

ivar_variant_version

String

Version of iVar for running the iVar variants command

ivar_version_consensus

String

Version of iVar for running the iVar consensus command

ivar_version_primtrim

String

Version of iVar for running the iVar trim command

kraken_human

Float

Percent of human read data detected using the Kraken2 software

kraken_report

String

Full Kraken report

kraken_sc2

Float

Percent of SARS-CoV-2 read data detected using the Kraken2 software

kraken_version

String

Version of Kraken software used

meanbaseq_trim

Float

Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming

meanmapq_trim

Float

Mean quality of the mapped reads to the reference genome after primer trimming

nextclade_aa_dels

String

Amino-acid deletions as detected by NextClade

nextclade_aa_subs

String

Amino-acid substitutions as detected by NextClade

nextclade_clade

String

NextClade clade designation

nextclade_json

File

NextClade output in JSON file format

nextclade_tsv

File

NextClade output in TSV file format

nextclade_version

String

Version of NextClade software used

number_Degenerate

Int

Number of degenerate basecalls within the consensus assembly

number_N

Int

Number of fully ambiguous basecalls within the consensus assembly

number_Total

Int

Total number of nucleotides within the consensus assembly

pango_lineage

String

Pango lineage as determined by Pangolin

pango_lineage_report

File

Full Pango lineage report generated by Pangolin

pangolin_conflicts

String

Number of lineage conflicts as determined by Pangolin

pangolin_docker

String

Docker image used to run Pangolin

pangolin_notes

String

Lineage notes as determined by Pangolin

pangolin_version

String

Pangolin and PangoLEARN versions used

percent_reference_coverage

Float

Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of reference genome (SC2: 29,903) x 100

primer_trimmed_read_percent

Float

Percent of read data with primers trimmed as determined by iVar trim

read1_clean

File

Forward read file after quality trimming and adapter removal

samtools_version

String

Version of SAMtools used to sort and index the alignment file

samtools_version_consensus

String

Version of SAMtools used to create the pileup before running iVar consensus

samtools_version_primtrim

String

Version of SAMtools used to create the pileup before running iVar trim

samtools_version_stats

String

Version of SAMtools used to assess quality of read mapping

seq_platform

String

Description of the sequencing methodology used to generate the input read data

titan_illumina_se_analysis_date

String

Date of analysis

titan_illumina_se_version

String

Version of the Public Health Viral Genomics (PHVG) repository used

trimmomatic_version

String

Version of Trimmomatic used

vadr_alerts_list

File

File containing all of the fatal alerts as determined by VADR

vadr_docker

String

Docker image used to run VADR

vadr_num_alerts

String

Number of fatal alerts as determined by VADR


Titan_ClearLabs

The Titan_ClearLabs workflow was written to process ClearLabs WGS read data for SARS-CoV-2 Artic V3 amplicon sequencing.

Upon initiating a Titan_ClearLabs run, input ClearLabs read data provided for each sample will be processed to perform consensus genome assembly, infer the quality of both raw read data and the generated consensus genome, and assign samples SARS-CoV-2 lineage and clade types as outlined in the Titan_ClearLabs data workflow below.

Figure: Titan_ClearLabs v1.4.4 Data Workflow

Consensus genome assembly with the Titan_ClearLabs workflow is performed by first de-hosting read data with the NCBI SRA-Human-Scrubber tool, then following the Artic nCoV-2019 novel coronavirus bioinformatics protocol (https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html). Briefly, input reads are aligned to the Wuhan-1 reference genome with minimap2 to generate a Binary Alignment Map (BAM) file. Primer sequences are then removed from the BAM file and a consensus assembly file is generated using the Artic medaka command. This assembly is then used to assign lineage and clade designations with Pangolin and NextClade. NCBI's VADR tool is also employed to screen for potentially errant features (e.g. erroneous frame-shift mutations) in the consensus assembly.

Note

Read trimming is performed on raw read data generated on the ClearLabs instrument and is thus not a required step in the Titan_ClearLabs workflow.

More information on required user inputs, optional user inputs, default tool parameters, and the outputs generated by Titan_ClearLabs is provided below.

Required User Inputs

Download CSV: Titan_ClearLabs_required_inputs.csv

Task

Input Variable

Data Type

Description

titan_clearlabs

clear_lab_fastq

File

Clear Labs FASTQ read files

titan_clearlabs

samplename

String

Name of the sample being analyzed


Optional User Inputs

Download CSV: Titan_ClearLabs_optional_inputs.csv

Task

Variable Name

Data Type

Description

Default

bedtools_cov

primer_bed

String

Path to the primer sequence coordinates of the PCR scheme utilized in BED file format

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019_amplicon.bed

bedtools_cov

fail_threshold

String

Minimum coverage threshold to determine amplicon sequencing failure

20x

consensus

cpu

Int

CPU resources allocated to the Artic Medaka task runtime environment

8

fastqc_se_raw

cpus

Int

CPU resources allocated to the FastQC task runtime environment for assessing raw read data

fastqc_se_raw

read1_name

String

Name of the sample being analyzed

Inferred from the input read file

kraken2_raw

cpus

Int

CPU resources allocated to the Kraken task runtime environment for assessing raw read data

4

kraken2_raw

kraken2_db

String

Path to the Kraken2 database within the staphb/kraken2:2.0.8-beta_hv Docker container

/kraken2-db

kraken2_raw

read2

File

Optional input file for the Kraken task that is not applicable to this workflow

None

nextclade_one_sample

root_sequence

File

Custom reference sequence file for NextClade

None

nextclade_one_sample

qc_config_json

File

Custom QC configuration file for NextClade

None

nextclade_one_sample

pcr_primers_csv

File

Custom PCR primers file for NextClade

None

nextclade_one_sample

gene_annotations_json

File

Custom gene annotation file for NextClade

None

nextclade_one_sample

docker

String

Docker tag used for running NextClade

neherlab/nextclade:0.14.2

nextclade_one_sample

auspice_reference_tree_json

File

Custom reference tree file for NextClade

None

pangolin3

inference_engine

String

pangolin inference engine for lineage designations (usher or pangolearn)

usher

pangolin3

min_length

Int

Minimum query length allowed for pangolin to attempt assignment

10000

pangolin3

max_ambig

Float

Maximum proportion of Ns allowed for pangolin to attempt assignment

0.5

titan_clearlabs

artic_primer_version

String

Version of the Artic PCR protocol used to generate input read data

V3

titan_clearlabs

normalise

Int

Value to normalize read counts

200

titan_clearlabs

seq_method

String

Description of the sequencing methodology used to generate the input read data

ONT via Clear Labs WGS

titan_clearlabs

pangolin_docker_image

String

Docker tag used for running Pangolin

staphb/pangolin:2.4.2-pangolearn-2021-05-19

vadr

docker

String

Docker tag used for running VADR

staphb/vadr:1.2.1

vadr

maxlen

Int

Maximum length for the fasta-trim-terminal-ambigs.pl VADR script

30000

vadr

minlen

Int

Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script

50

vadr

vadr_opts

String

Options for the v-annotate.pl VADR script

--glsearch -s -r --nomisc --mkey sarscov2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --mdir /opt/vadr/vadr-models/

vadr

skip_length

Int

Minimum assembly length (unambiguous) to run vadr

10000

version_capture

timezone

String

User time zone in valid Unix TZ string (e.g. America/New_York)

None
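
The pangolin3 min_length and max_ambig inputs listed above act as gates on the consensus assembly before lineage assignment is attempted. The following is a minimal sketch of such a gate using the defaults from the table; the function name is hypothetical, and treating both thresholds as inclusive is an assumption rather than a statement about Pangolin's internals.

```python
def passes_pangolin_filters(sequence: str, min_length: int = 10000,
                            max_ambig: float = 0.5) -> bool:
    """Return True if a consensus sequence clears both thresholds.

    min_length: minimum query length (default taken from the table above).
    max_ambig:  maximum proportion of N bases (default taken from the table above).
    """
    if not sequence or len(sequence) < min_length:
        return False
    # Proportion of fully ambiguous (N) bases in the sequence.
    n_fraction = sequence.upper().count("N") / len(sequence)
    return n_fraction <= max_ambig


# A 10,000 bp sequence with 20% Ns clears both default thresholds.
print(passes_pangolin_filters("A" * 8000 + "N" * 2000))
```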


Outputs

Download CSV: Titan_ClearLabs_default_outputs.csv

Output Name

Data Type

Description

aligned_bai

File

Index companion file to the bam file generated during the consensus assembly process

aligned_bam

File

Primer-trimmed BAM file; generated during consensus assembly process

artic_version

String

Version of the Artic software utilized for read trimming and consensus genome assembly

assembly_fasta

File

Consensus genome assembly

assembly_length_unambiguous

Int

Number of unambiguous basecalls within the SC2 consensus assembly

assembly_mean_coverage

Float

Mean sequencing depth throughout the consensus assembly generated after performing primer trimming; calculated using the SAMtools coverage command

assembly_method

String

Method employed to generate consensus assembly

auspice_json

File

Auspice-compatible JSON output generated from NextClade analysis that includes the NextClade default samples for clade-typing and the single sample placed on this tree

consensus_flagstat

File

Output from the SAMtools flagstat command to assess quality of the alignment file (BAM)

consensus_stats

File

Output from the SAMtools stats command to assess quality of the alignment file (BAM)

dehosted_reads

File

Dehosted reads; suggested read file for SRA submission

fastqc_clean

Int

Number of reads after dehosting as determined by FastQC

fastqc_raw

Int

Number of raw input reads as determined by FastQC

fastqc_version

String

Version of the FastQC software used

kraken_human

Float

Percent of human read data detected using the Kraken2 software

kraken_human_dehosted

Float

Percent of human read data detected using the Kraken2 software after host removal

kraken_report

File

Full Kraken report

kraken_report_dehosted

File

Full Kraken report after host removal

kraken_sc2

Float

Percent of SARS-CoV-2 read data detected using the Kraken2 software

kraken_sc2_dehosted

Float

Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal

kraken_version

String

Version of Kraken software used

meanbaseq_trim

Float

Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming

meanmapq_trim

Float

Mean quality of the mapped reads to the reference genome after primer trimming

nextclade_aa_dels

String

Amino-acid deletions as detected by NextClade

nextclade_aa_subs

String

Amino-acid substitutions as detected by NextClade

nextclade_clade

String

NextClade clade designation

nextclade_json

File

NextClade output in JSON file format

nextclade_tsv

File

NextClade output in TSV file format

nextclade_version

String

Version of NextClade software used

number_Degenerate

Int

Number of degenerate basecalls within the consensus assembly

number_N

Int

Number of fully ambiguous basecalls within the consensus assembly

number_Total

Int

Total number of nucleotides within the consensus assembly

pango_lineage

String

Pango lineage as determined by Pangolin

pango_lineage_report

File

Full Pango lineage report generated by Pangolin

pangolin_conflicts

String

Number of lineage conflicts as determined by Pangolin

pangolin_docker

String

Docker image used to run Pangolin

pangolin_notes

String

Lineage notes as determined by Pangolin

pangolin_version

String

Pangolin and PangoLEARN versions used

percent_reference_coverage

Float

Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of reference genome (SC2: 29,903) x 100

pool1_percent

Float

Percentage of aligned read data associated with the pool 1 amplicons

pool2_percent

Float

Percentage of aligned read data associated with the pool 2 amplicons

samtools_version

String

Version of SAMtools used to sort and index the alignment file

seq_platform

String

Description of the sequencing methodology used to generate the input read data

titan_clearlabs_analysis_date

String

Date of analysis

titan_clearlabs_version

String

Version of the Public Health Viral Genomics (PHVG) repository used

vadr_alerts_list

File

File containing all of the fatal alerts as determined by VADR

vadr_docker

String

Docker image used to run VADR

vadr_num_alerts

String

Number of fatal alerts as determined by VADR

variants_from_ref_vcf

File

Number of variants relative to the reference genome
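
Several of the assembly outputs above (number_N, number_Degenerate, number_Total, assembly_length_unambiguous, and percent_reference_coverage) are simple counts over the consensus sequence. The sketch below illustrates how they relate to one another, using the reference length of 29,903 given in the table. Classifying A, C, G, T as unambiguous and the remaining IUPAC codes other than N as degenerate is an assumption about how the workflow groups bases, and the function name is hypothetical.

```python
REFERENCE_LENGTH = 29903  # SC2 reference length quoted in the table above
UNAMBIGUOUS = set("ACGT")
DEGENERATE = set("RYSWKMBDHV")  # IUPAC codes other than A, C, G, T, N (assumed grouping)


def consensus_metrics(sequence: str, reference_length: int = REFERENCE_LENGTH) -> dict:
    """Compute basic composition metrics for a consensus sequence string."""
    seq = sequence.upper()
    unambiguous = sum(1 for base in seq if base in UNAMBIGUOUS)
    return {
        "number_Total": len(seq),
        "number_N": seq.count("N"),
        "number_Degenerate": sum(1 for base in seq if base in DEGENERATE),
        "assembly_length_unambiguous": unambiguous,
        # assembly_length_unambiguous / reference length x 100
        "percent_reference_coverage": 100 * unambiguous / reference_length,
    }


# Toy sequence scored against a 100 bp "reference" to keep the numbers readable.
print(consensus_metrics("ACGT" * 5 + "NN" + "R", reference_length=100))
```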


Titan_ONT

The Titan_ONT workflow was written to process basecalled and demultiplexed Oxford Nanopore Technology (ONT) read data. Input reads are assumed to be the product of sequencing Artic V3 tiled PCR-amplicons designed for the SARS-CoV-2 genome.

Note

As of May 2021, alternative primer schemes are not supported for the Titan_ONT workflow, but active development is underway to allow for such analysis in the near future.

Upon initiating a Titan_ONT run, input ONT read data provided for each sample will be processed to perform consensus genome assembly, infer the quality of both raw read data and the generated consensus genome, and assign samples SARS-CoV-2 lineage and clade types as outlined in the Titan_ONT data workflow below.

Titan_ONT workflow

Titan_ONT v1.4.4 Data Workflow

Consensus genome assembly with the Titan_ONT workflow is performed by first de-hosting read data with the NCBI SRA-Human-Scrubber tool and then following the Artic nCoV-2019 novel coronavirus bioinformatics protocol <https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html>. Briefly, input reads are filtered by size (min-length: 400bp; max-length: 700bp) with the Artic guppyplex command. These size-selected read data are aligned to the Wuhan-1 reference genome with minimap2 to generate a Binary Alignment Map (BAM) file. Primer sequences are then removed from the BAM file and a consensus assembly file is generated using the Artic medaka command. This assembly is then used to assign lineage and clade designations with Pangolin and NextClade. NCBI’s VADR tool is also employed to screen for potentially errant features (e.g. erroneous frame-shift mutations) in the consensus assembly.
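
To illustrate the size-selection step described above, the following sketch keeps only reads whose length falls between 400 and 700 bp. It is a conceptual stand-in written in plain Python, not the Artic guppyplex implementation. It assumes simple four-line FASTQ records and treats both bounds as inclusive; the function names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    # Step through complete groups of four lines; a trailing partial record is ignored.
    for i in range(0, len(stripped) - 3, 4):
        header, seq, _plus, qual = stripped[i:i + 4]
        yield header, seq, qual


def filter_by_length(records: Iterable[FastqRecord], min_length: int = 400,
                     max_length: int = 700) -> Iterator[FastqRecord]:
    """Keep reads with min_length <= read length <= max_length."""
    for header, seq, qual in records:
        if min_length <= len(seq) <= max_length:
            yield header, seq, qual


# Toy input: one read that is too short and one inside the size window.
demo = ["@short\n", "A" * 120 + "\n", "+\n", "I" * 120 + "\n",
        "@ok\n", "A" * 500 + "\n", "+\n", "I" * 500 + "\n"]
for header, seq, _ in filter_by_length(parse_fastq(demo)):
    print(header, len(seq))
```

The 400 to 700 bp window matches the read_filtering defaults (min_length and max_length) listed in the optional inputs for this workflow.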

More information on required user inputs, optional user inputs, default tool parameters, and the outputs generated by Titan_ONT is outlined below.

Required User Inputs

Download CSV: Titan_ONT_required_inputs.csv

Task

Input Variable

Data Type

Description

titan_ont

demultiplexed_reads

File

Basecalled and demultiplexed ONT read data (single FASTQ file per sample)

titan_ont

samplename

String

Name of the sample being analyzed


Optional User Inputs

Download CSV: Titan_ONT_optional_inputs.csv

Task

Variable Name

Data Type

Description

Default

bedtools_cov

primer_bed

String

Path to the primer sequence coordinates of the PCR scheme utilized in BED file format

/artic-ncov2019/primer_schemes/nCoV-2019/V3/nCoV-2019_amplicon.bed

bedtools_cov

fail_threshold

String

Minimum coverage threshold to determine amplicon sequencing failure

20x

consensus

cpu

Int

CPU resources allocated to the Artic Medaka task runtime environment

8

fastqc_se_clean

cpus

Int

CPU resources allocated to the FastQC task runtime environment for assessing size-selected read data

2

fastqc_se_clean

read1_name

String

Name of the sample being analyzed

Inferred from the input read file

fastqc_se_raw

cpus

Int

CPU resources allocated to the FastQC task runtime environment for assessing raw read data

fastqc_se_raw

read1_name

String

Name of the sample being analyzed

Inferred from the input read file

kraken2_raw

cpus

Int

CPU resources allocated to the Kraken task runtime environment for assessing raw read data

4

kraken2_raw

kraken2_db

String

Path to the reference genome within the staphb/kraken2:2.0.8-beta_hv Docker container

/kraken2-db

kraken2_raw

read2

File

Optional input file for the Kraken task that is not applicable to this workflow

None

nextclade_one_sample

root_sequence

File

Custom reference sequence file for NextClade

None

nextclade_one_sample

qc_config_json

File

Custom QC configuration file for NextClade

None

nextclade_one_sample

pcr_primers_csv

File

Custom PCR primers file for NextClade

None

nextclade_one_sample

gene_annotations_json

File

Custom gene annotation file for NextClade

None

nextclade_one_sample

docker

String

Docker tag used for running NextClade

neherlab/nextclade:0.14.2

nextclade_one_sample

auspice_reference_tree_json

File

Custom reference tree file for NextClade

None

pangolin3

inference_engine

String

pangolin inference engine for lineage designations (usher or pangolearn)

usher

pangolin3

min_length

Int

Minimum query length allowed for pangolin to attempt assignment

10000

pangolin3

max_ambig

Float

Maximum proportion of Ns allowed for pangolin to attempt assignment

0.5

read_filtering

cpu

Int

CPU resources allocated to the read filtering task (Artic guppyplex) runtime environment

8

read_filtering

max_length

Int

Maximum sequence length

700

read_filtering

min_length

Int

Minimum sequence length

400

read_filtering

run_prefix

String

Run name

artic_ncov2019

titan_ont

artic_primer_version

String

Version of the Artic PCR protocol used to generate input read data

V3

titan_ont

normalise

Int

Value to normalize read counts

200

titan_ont

seq_method

String

Description of the sequencing methodology used to generate the input read data

ONT

titan_ont

pangolin_docker_image

String

Docker tag used for running Pangolin

staphb/pangolin:2.4.2-pangolearn-2021-05-19

vadr

docker

String

Docker tag used for running VADR

staphb/vadr:1.2.1

vadr

maxlen

Int

Maximum length for the fasta-trim-terminal-ambigs.pl VADR script

30000

vadr

minlen

Int

Minimum length subsequence to possibly replace Ns for the fasta-trim-terminal-ambigs.pl VADR script

50

vadr

vadr_opts

String

Options for the v-annotate.pl VADR script

--glsearch -s -r --nomisc --mkey sarscov2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --mdir /opt/vadr/vadr-models/

vadr

skip_length

Int

Minimum assembly length (unambiguous) to run vadr

10000

version_capture

timezone

String

User time zone in valid Unix TZ string (e.g. America/New_York)

None
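
Optional inputs only need to be supplied when a default should be overridden. As a hedged sketch, the helper below combines the two required Titan_ONT inputs with a set of overrides keyed by the (Task, Variable Name) pairs from the table above. It assumes the usual WDL naming convention, in which workflow-level inputs are written as "titan_ont.<variable>" and task-level inputs as "titan_ont.<task>.<variable>"; the helper name, file path, and override values are hypothetical.

```python
import json
from typing import Dict, Optional, Tuple


def build_titan_ont_inputs(reads_path: str, samplename: str,
                           overrides: Optional[Dict[Tuple[str, str], object]] = None) -> dict:
    """Build a WDL-style inputs mapping with optional per-task overrides."""
    inputs = {
        "titan_ont.demultiplexed_reads": reads_path,
        "titan_ont.samplename": samplename,
    }
    for (task, variable), value in (overrides or {}).items():
        # Workflow-level inputs carry no task segment in the key (assumed convention).
        if task == "titan_ont":
            key = f"titan_ont.{variable}"
        else:
            key = f"titan_ont.{task}.{variable}"
        inputs[key] = value
    return inputs


# Hypothetical example: relax the read-length window and lower the normalise value.
example = build_titan_ont_inputs(
    "reads/barcode01.fastq.gz",
    "sample01",
    overrides={
        ("read_filtering", "min_length"): 300,
        ("read_filtering", "max_length"): 800,
        ("titan_ont", "normalise"): 100,
    },
)
print(json.dumps(example, indent=2))
```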


Outputs

Download CSV: Titan_ONT_default_outputs.csv

Output Name

Data Type

Description

aligned_bai

File

Index companion file to the bam file generated during the consensus assembly process

aligned_bam

File

Primer-trimmed BAM file; generated during consensus assembly process

amp_coverage

File

Sequence coverage per amplicon

artic_version

String

Version of the Artic software utilized for read trimming and consensus genome assembly

assembly_fasta

File

Consensus genome assembly

assembly_length_unambiguous

Int

Number of unambiguous basecalls within the SC2 consensus assembly

assembly_mean_coverage

Float

Mean sequencing depth throughout the consensus assembly generated after performing primer trimming; calculated using the SAMtools coverage command

assembly_method

String

Method employed to generate consensus assembly

auspice_json

File

Auspice-compatible JSON output generated from NextClade analysis that includes the NextClade default samples for clade-typing and the single sample placed on this tree

bedtools_version

String

bedtools version utilized when calculating amplicon read coverage

consensus_flagstat

File

Output from the SAMtools flagstat command to assess quality of the alignment file (BAM)

consensus_stats

File

Output from the SAMtools stats command to assess quality of the alignment file (BAM)

dehosted_reads

File

Dehosted reads; suggested read file for SRA submission

fastqc_clean

Int

Number of reads after size filtering and dehosting as determined by FastQC

fastqc_raw

Int

Number of raw input reads as determined by FastQC

fastqc_version

String

Version of the FastQC software used

kraken_human

Float

Percent of human read data detected using the Kraken2 software

kraken_human_dehosted

Float

Percent of human read data detected using the Kraken2 software after host removal

kraken_report

File

Full Kraken report

kraken_report_dehosted

File

Full Kraken report after host removal

kraken_sc2

Float

Percent of SARS-CoV-2 read data detected using the Kraken2 software

kraken_sc2_dehosted

Float

Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal

kraken_version

String

Version of Kraken software used

meanbaseq_trim

Float

Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming

meanmapq_trim

Float

Mean quality of the mapped reads to the reference genome after primer trimming

nextclade_aa_dels

String

Amino-acid deletions as detected by NextClade

nextclade_aa_subs

String

Amino-acid substitutions as detected by NextClade

nextclade_clade

String

NextClade clade designation

nextclade_json

File

NextClade output in JSON file format

nextclade_tsv

File

NextClade output in TSV file format

nextclade_version

String

Version of NextClade software used

number_Degenerate

Int

Number of degenerate basecalls within the consensus assembly

number_N

Int

Number of fully ambiguous basecalls within the consensus assembly

number_Total

Int

Total number of nucleotides within the consensus assembly

pango_lineage

String

Pango lineage as determined by Pangolin

pango_lineage_report

File

Full Pango lineage report generated by Pangolin

pangolin_conflicts

String

Number of lineage conflicts as determined by Pangolin

pangolin_docker

String

Docker image used to run Pangolin

pangolin_notes

String

Lineage notes as determined by Pangolin

pangolin_version

String

Pangolin and PangoLEARN versions used

percent_reference_coverage

Float

Percent coverage of the reference genome after performing primer trimming; calculated as assembly_length_unambiguous / length of reference genome (SC2: 29,903) x 100

pool1_percent

Float

Percentage of aligned read data associated with the pool 1 amplicons

pool2_percent

Float

Percentage of aligned read data associated with the pool 2 amplicons

samtools_version

String

Version of SAMtools used to sort and index the alignment file

seq_platform

String

Description of the sequencing methodology used to generate the input read data

titan_ont_analysis_date

String

Date of analysis

titan_ont_version

String

Version of the Public Health Viral Genomics (PHVG) repository used

vadr_alerts_list

File

File containing all of the fatal alerts as determined by VADR

vadr_docker

String

Docker image used to run VADR

vadr_num_alerts

String

Number of fatal alerts as determined by VADR

variants_from_ref_vcf

File

Number of variants relative to the reference genome
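
The amp_coverage output can be read alongside the bedtools_cov fail_threshold input (default 20x) to spot amplicons that failed to sequence. The sketch below shows the idea on an in-memory mapping of amplicon name to depth. The layout of the amp_coverage file is not described here, so the mapping, the amplicon names, and the rule that a depth strictly below the threshold counts as a failure are all assumptions made for illustration.

```python
from typing import Dict, List


def parse_threshold(value: str) -> int:
    """Convert a threshold string such as '20x' into an integer depth."""
    return int(value.rstrip("xX"))


def failed_amplicons(coverage: Dict[str, float], fail_threshold: str = "20x") -> List[str]:
    """Return the names of amplicons whose depth falls below the threshold."""
    cutoff = parse_threshold(fail_threshold)
    return [name for name, depth in coverage.items() if depth < cutoff]


# Hypothetical per-amplicon depths, for illustration only.
demo_coverage = {"amplicon_1": 150.0, "amplicon_2": 19.9, "amplicon_3": 20.0}
print(failed_amplicons(demo_coverage))
```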