Avian influenza viral strain analysis from gene segment sequencing data
OverviewQuestions:Objectives:
With reassortment of gene segments being a common event in avian influenza virus (AIV) evolution, does it make sense to use a reference-based mapping approach for constructing consensus genome sequences for AIV samples?
Is it possible to reuse existing tools and workflows developed for the analysis of sequencing data from other viruses?
How can we obtain meaningful phylogenetic insight from AIV consensus sequences?
Requirements:
Determine how reassortment impacts reference-based mapping approaches
Use a collection of per-segment reference sequences to construct a hybrid reference genome that is sufficiently close to a sequenced sample to be useful as a reference for mapping
Construct a sample consensus genome from mapped reads
Generate per-segment phylogenetic trees of AIV consensus sequences
- Introduction to Galaxy Analyses
- slides Slides: Quality Control
- tutorial Hands-on: Quality Control
- slides Slides: Mapping
- tutorial Hands-on: Mapping
Time estimation: 4 hoursLevel: Intermediate IntermediateSupporting Materials:Published: Nov 21, 2022Last modification: Oct 4, 2024License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITpurl PURL: https://gxy.io/GTN:T00308rating Rating: 4.9 (0 recent ratings, 7 all time)version Revision: 10
Of the four species of influenza viruses (Influenza A-D), Influenza A is the most virulent in human hosts and subtypes of it have been responsible for all historic flu pandemics.
The different influenza species infect distinct animal hosts (though all of them can infect humans and pigs) and the most important natural reservoir for Influenza A are birds (in particular, wild aquatic birds), in which it causes Avian influenza.
Importantly, all flu pandemics of the last century have been triggered by reassortment events between avian and human Influenza A strains (although at least the 2009 swine flu pandemic involved an additional mixing event with an Influenza A strain from pigs).
Reassortment events are a key trait of Influenza A made possible by its wide range of natural hosts combined with two molecular characteristics of the species:
- the linear negative-sense single-stranded RNA genomes of influenza viruses (and of all viruses of the Orthomyxoviridae family) are segmented, i.e. consist of several (eight in the case of Influenza A and B) distinct pieces of RNA, which typically encode just one, sometimes two viral proteins.
- the mutation rate in the genome is high in influenza viruses, and particularly high in Influenza A, since their RNA polymerase lacks exonuclease activity necessary for proof-reading.
Together these characteristics enable Influenza A to evolve relatively rapidly in non-human hosts, then re-adapt to humans through exchange of segments (reassortment) in a host co-infected with a human and, e.g., an avian strain. If such a reassortment introduces new variants of the two most antigenic proteins of influenza, HA (hemagglutinin) and NA (neuraminidase), this antigenic shift provides the reassorted strain with immune escape potential, which, if it happens in a genetic background compatible with efficient transmission in humans, can trigger an unusually strong wave of influenza or even a new pandemic.
In order to estimate the likelihood of an epidemic or pandemic influenza event in the near future it is, therefor, important to monitor closely the genome composition of circulating Avian Influenza strains in wild and domestic birds and the huge technological advances over the last decade make it possible to use next-generation sequencing for this purpose.
At the same time, the segmented nature of their genomes combined with high genetic variability, especially in the HA and NA genes, requires rather specialized bioinformatics workflows for the analysis of data from influenza viruses compared to other viruses with similar genome size (like, e.g. SARS-CoV-2):
The viral surface proteins HA and (to a lesser extent) NA are the main targets of the host antibody response and are, thus, under constant selection pressure to mutate into forms capable of evading an existing host immune response. As a consequence, these segments have evolved into a much richer panel of sequence variants than the other segments, to the point that sequences of the HA segment can, at present, be classified into 18 distinct subfamilies, H1-H18, while there are 11 recognized subfamilies of NA, and Influenza A strains are subtyped (as, for example, H5N1, H3N2, etc.) according to the combination of HA- and NA-encoding segments they are carrying.
Importantly, the sequence diversity of HA and (again to a lesser extent) of NA segments is big enough to prevent a naive approach of mapping sequenced reads to one specific agreed-upon Influenza A reference sequence. While this would work for the other six segments, mapping software would regularly fail to find enough plausible mappings for sequenced reads of HA and NA origin to continue analysis with. This is why, in this tutorial we are going to explore an alternative approach, which is also mapping-based but chooses a suitable reference for each segment dynamically based on the input sequencing data.
AgendaIn this tutorial, we will cover:
Prepare analysis history and data
In this tutorial you are going to work on a single avian influenza sample sequenced in paired-end mode on the Illumina platform, i.e. we are going to download two datasets of sequenced reads for that sample.
In addition, we are going to base the analysis on a small collection of multiple reference sequences for each influenza gene segment. We prepared this collection for you from public INSAFlu data.
If you have your own curated collection of reference sequences, you should be able to use it to follow this tutorial without any problem. Note, however, that the analysis results referred to in many of the questions will be different if you are exchanging the reference collection. If you want to learn how you can create a Galaxy collection from your own references that is structured like the default we are suggesting here, you may want to follow the dedicated tutorial on Using dataset collections.
Get a history with reference data
Any analysis should get its own Galaxy history, but in this tutorial we won’t start with creating one.
Instead, to save some time, you can copy an existing shared history on Galaxy Europe.
Hands-on: Prepare the Galaxy history
On Galaxy Europe, open the shared history with pre-prepared INSAFlu reference data.
On the Galaxy US server, use this shared history instead.
Rename the history
- Click on galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)
- Type the new name:
Avian influenza tutorial
- Click on Save
- To cancel renaming, click the galaxy-undo “Cancel” button
If you do not have the galaxy-pencil (Edit) next to the history name (which can be the case if you are using an older version of Galaxy) do the following:
- Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
- Type the new name:
Avian influenza tutorial
- Press Enter
Get the sequencing data
Hands-on: Get the data
Import the forward and reverse reads of the sequenced sample into your history and organize them into a Paired Collection.
https://usegalaxy.eu/api/datasets/4838ba20a6d86765171e4f201205515c/display?to_ext=fastqsanger.gz https://usegalaxy.eu/api/datasets/4838ba20a6d867657d09280baae4e1ec/display?to_ext=fastqsanger.gz
- Copy the links above
- Open the Upload Manager
- In the top row of tabs select Collection
- Configure the drop-down select boxes on that tab like this:
- “Collection Type”:
Pair
- “File Type”:
fastqsanger.gz
- Click on Paste/Fetch data and paste the links you copied into the empty text box
- Press Start
- Wait for the Build button to become enabled, click it, and, in the next dialogue, give a suitable Name, like
Sequencing data
, to the new collection.- Click on Create collection
Inspect the reference data
Hands-on: What is in the reference collection?
- Step inside the imported reference collection by clicking on it.
- Expand individual elements of the collection by clicking on them to reveal more details about them.
View the content of any element by clicking its galaxy-eye.
You can then scroll through the data and use your browser’s search function to look for particular sequences.
Question: Questions
- How many elements are in the collection? What do they represent?
- How many sequences are stored in each element?
- Are all subtypes of HA and NA represented in the sequences?
- Are there sequences of non-A Influenza species?
- The collection consists of eight elements and each element provides reference sequences for one genome segment.
- Each element has data for 56 sequences (Galaxy knows how to interpret the fasta format of the sequencing data and counts sequences for you; you can see the count for any of the elements by expanding it).
- The collection has reference sequence data for H1-H18 (with the exception of H11) and for N1-N11. For many of those subtypes, however, there is only one reference, i.e., within-subtype variation is not captured well by the collection.
- There are 6 Influenza B references in the collection. Since Influenza B is not normally seen in birds and we are going to analyze an avian influenza sample here, these can be considered controls: if we get an Influenza B assignment for the sample at any step in our analysis, we know something is very suspicious. This also means that we only have 50 informative references in the collection.
galaxy-info There are lots of text processing and filtering tools for Galaxy, of which this tutorial is going to introduce just a small subset. You could use them to extract statistics about which subtypes are found how many times in the collection of references instead of relying on scrolling and searching. If you are interested and have the time, you may want to try the Data Manipulation Olympics to learn more about available such tools and how to combine them.
Quality control
As a very first step, we would like to make sure that we base our analysis only on the high-quality parts of the sequenced reads.
NGS reads often have lower base calling quality near their ends, so here we are going to trim low-quality stretches of bases from both ends. In addition, we are going to discard reads shorter than 30 bases after trimming. This is reasonable since our next step will involve extracting all possible 21-mers from the reads and there are very few possibilities for a read shorter than 30 bases.
The tool fastp lets us perform these tasks and obtain a nice quality report for our reads before and after processing in one go, but many other options exist to perform sequenced reads quality control and trimming/filtering in Galaxy and the dedicated tutorial on quality control introduces more of them.
Hands-on: QC and read trimming/filtering with fastp
- fastp ( Galaxy version 0.23.4+galaxy1)
- “Single-end or paired reads”:
Paired Collection
- “Select paired collection(s)”: the uploaded paired collection of sequenced reads
- In “Filter Options” under “Length filtering options”:
- “Length required”:
30
- In “Read Modification Options” under “Per read cutting by quality options”:
- “Cut by quality in front (5’)”:
Yes
- “Cut by quality in tail (3’)”:
Yes
- “Cutting mean quality”:
30
fastp comes with basic default settings for some of its many parameters. In particular, with no explicit parameter specifications, the tool will still remove reads that
- have base qualities < 15 for more than 40% of their bases
- have more than 5 Ns in their sequence
have a length < 15 bases (after removing poly-G tails from Illumina reads)
Because our input data has far longer MiSeq-generated reads, we can increase this length threshold to 30 bases as done above.
These are very relaxed criteria that will only remove the least usable reads, but still assure minimal quality standards that are good enough for many (including training) purposes.
Running the tool will result in two outputs. One is a new paired collection with the processed reads, the other one is a report of initial quality, the processing actions performed and their effect on key quality metrics.
galaxy-eye Inspect the report and try to answer the following questions:
Question: Questions
- Which set of reads (forward or reverse reads) did profit most quality-wise from our low-quality base trimming?
- What percentage of reads got discarded completely with our settings, and what percentage of bases?
- Why, according to the Filtering result, have some reads been discarded because of too many Ns?
- The forward reads were of worse quality in particular near their 3’-ends than the reverse reads and, consequently, were affected more strongly by trimming (see the different “quality” plots in the report).
- Less than 2% of reads got discarded completely (the “Filtering result” section of the report says 98.03% of all reads were retained). In contrast, around 10% of all bases got discarded (compare total bases before and after filtering) so most reads got trimmed but not discarded.
- As explained in the “fastp default settings” box above, fastp is a tool with many hidden default actions in its various sections. Inside the Filter Options inspect Quality filtering options and the various defaults mentioned in the parameters help therein. One of these defaults is to discard reads with more than 5 Ns. These defaults are often reasonable, but it’s good to be aware of them.
Per-segment subtyping and hybrid reference construction
Our goal at this step is to find best matches between our sequencing data and the reference sequences of each genome segment. This will:
- give us preliminary subtyping results with regard to the HA and NA segments for the sequenced sample
- suggest the best reference to map the sequencing data to for each segment.
We are then going to combine the best reference of each segment into a hybrid reference genome to use for mapping our sequenced reads against.
To identify the best matching reference segments, we are going to run the tool VAPOR asking it to report the stats for any hits it identifies.
Hands-on: Exploring best-matching reference segment scores
- VAPOR ( Galaxy version 1.0.2+galaxy3)
- param-collection “Reference sequences”:
References per segment (INSAFlu)
- “Type of sequencing data”:
Paired-end as collection
- “Paired collection of sequenced reads”: quality-trimmed reads; output of fastp
- “Desired output”:
Return scores of best matches
- “Limit number of reported matches to”:
0
- In “Optional arguments”:
- “Read kmer filtering threshold”:
0.1
- “Minimum k-mer proportion”:
0.0
galaxy-eye Explore the output collection produced by the tool
Question: Questions
- Is there a difference between the results for segment 4 (encoding HA), segment 6 (encoding NA) and those for the remaining segments?
- According to the tool, what is the likely subtype with regard to HA and NA of the sample?
- The values for
% of query bases in reads
and forTotal score
are drastically lower for the HA and somewhat lower for the NA segment than for any of the other segments. This could be due to the structure of the collection of references we have used, but also fits very well with the higher selection pressure on NA and, in particular on HA, as the main antigenic proteins of the virus, which leads to higher variability in these segments than in the others.The best match (assigned an almost 2x higher score than the second-best match) found for the HA segment is from an H4N6 strain. The top two matches with regard to the NA segment (again assigned ~ 2x higher scores than the next best matches) are from strains with the N6 subtype of NA.
The most likely subtype of the sample, thus, appears to be H4N6.
Now that we have established that things may make sense, we can use the output of VAPOR to extract the actual sequence of the top hit for each reference segment. We then concatenate these best matches into a hybrid reference genome for mapping.
Hands-on: Obtaining sequences of top hits identified by VAPOR
- Replace parts of text ( Galaxy version 9.3+galaxy1) to extract just the name of the sequence from each line of VAPOR’s output
- param-collection “File to process”: collection of score outputs of VAPOR
- In param-repeat “1. Find and Replace”:
- “Find pattern”:
^.+\t>(.+)$
- “Replace with”:
$1
- “Find-Pattern is a regular expression”:
Yes
- Select first lines from a dataset to get the first line, i.e. the best match from VAPOR’s output
- “Select first”:
1
lines- param-collection “from”: output of Replace
- seqtk_subseq ( Galaxy version 1.4+galaxy0) to extract the reference sequences based on their names reported by VAPOR
- param-collection “Input FASTA/Q file”:
References per segment (INSAFlu)
- “Select source of sequence choices”:
FASTA/Q ID list
- param-collection “Input ID list”: the collection output of Select first
- Collapse Collection ( Galaxy version 5.1.0) to combine the best-matching sequence of each segment into one dataset
- param-collection “Collection of files to collapse into single dataset”: the selected sequences collection produced by seqtk_subseq
At this point, you may want to galaxy-eye inspect the output of the last step to see if it is an eight-segments reference genome as expected, and if the segments correspond to the top hits found by VAPOR.
Mapping to a hybrid reference
If things went well, the hybrid reference we just obtained should be close enough across all segments to our sample to allow successful mapping of reads. Before we start the mapping we may want to truncate the segment names in our hybrid reference genome though because currently these names still reflect the full origin of the segment sequences, but from now on we are fine with just the segment ID. We can use the Replace tool from before again to truncate the names.
Hands-on: Shortening sequence titles
- Replace parts of text ( Galaxy version 9.3+galaxy1)
- param-file “File to process”: the hybrid reference genome; output of Collapse Collection
- In param-repeat “1. Find and Replace”:
- “Find pattern”:
^>([^|]+).+$
- “Replace with”:
>$1
- “Find-Pattern is a regular expression”:
Yes
Add a tag to the output dataset
This is the hybrid reference genome that we are going to use for mapping now. It is an important intermediate result, to which we will need to return later, and to make that easier we are going to add a tag to it now for extra visibility in the history.
- Click on the dataset to expand it
- Click on galaxy-tags Add Tags
- Enter a tag like
HybridRef
(tags have to consist of a single word)- Confirm your choice by pressing Enter on the keyboard or clicking on the new name.
Having polished the titles of the segments in our hybrid reference genome we are finally ready for mapping, which we will carry out with BWA-MEM, clean up a bit with Samtools view and produce a quality report for with QualiMap BamQC.
Hands-on: Read mapping and quality control
- Map with BWA-MEM ( Galaxy version 0.7.18)
- “Will you select a reference genome from your history or use a built-in index?”:
Use a genome from history and build index
- param-file “Use the following dataset as the reference sequence”: hybrid reference genome with shortened names; output of Replace
- “Single or Paired-end reads”:
Paired Collection
- “Select a paired collection”: quality-trimmed reads; output of fastp in QC and Trimming
- Samtools view ( Galaxy version 1.20+galaxy3)
- param-file “SAM/BAM/CRAM data set”: mapped reads BAM dataset; output of BWA-MEM
- “What would you like to look at?”:
A filtered/subsampled selection of reads
- In “Configure filters”:
- “Filter by quality”:
20
- “Require that these flags are set”:
Read is paired
andRead is mapped in a proper pair
- QualiMap BamQC ( Galaxy version 2.2.2d+galaxy3)
- param-file “Mapped reads input dataset”: filtered mapped reads BAM dataset; output of Samtools view
- “Skip duplicate reads”:
Unselect all
- In “Settings affecting specific plots”:
- “Number of bins to use in across-reference plots”:
40
galaxy-eye Study the report generated with QualiMap
Question: Questions
- What is the coverage of each segment by the sequenced reads, and is it uniform?
- Look for a plot showing read mapping quality across the reference. What can you conclude?
- Coverage is rather different for the different segments. Looking at “Mean coverage” and its “Standard deviation” in the “Chromosome stats” table and at the “Coverage across reference” plot, coverage of the HA segment seems to be most critical since it approaches zero in some regions.
- Mapping quality is almost constant at or very near 60 (which happens to be the maximum mapping quality value emitted by BWA-MEM) across all segments with the exception of HA.
These two observations combined with the VAPOR statistics for HA show again that even the best matching reference for this segment is not exactly close to the sequence of our sample. The read mapper had difficulties placing the sequenced reads on that sub-optimal reference and it looks as if we might have very little sequence information for some HA regions.
Consensus sequence construction
From the polished mapping of reads to our custom reference we can now construct the consensus sequence of our sample.
Unfortunately, the tool we are going to use for this, ivar consensus, is not capable of working with more than one reference name at a time, but because this is influenza data we have mappings to 8 different segments described in our data. So we need to take a little detour and split the mapped reads data into a collection of datasets each containing the mappings for just one segment first again, then perform the consensus construction for all of them in parallel.
Hands-on: Splitting mapped reads by genome segment
- Split BAM by Reference ( Galaxy version 2.5.2+galaxy2)
- param-file “BAM dataset to split by reference”: filtered mapped reads; output of Samtools view
The output from this step has the desired collection structure, but the names of the collection elements are not the nicest. Ideally, we would just reuse the segment names, which are already provided in our mapping reference genome. So lets extract these names again and use them as new element labels
Hands-on: Relabeling collection elements
- Select lines that match an expression
param-file “Select lines from”: the mapping reference; output of Replace in Mapping to a hybrid reference
tip This is the dataset you should have galaxy-tags tagged before. Lets hope you did!
- “that”:
Matching
- “the pattern”:
^>.+
- Replace parts of text ( Galaxy version 9.3+galaxy1)
- param-file “File to process”: output of Select
- In param-repeat “1. Find and Replace”:
- “Find pattern”:
^>(.+)$
- “Replace with”:
$1
- “Find-Pattern is a regular expression”:
Yes
- Relabel identifiers
- param-file “Input collection”: the split mappings; output of Split BAM by Reference
- “How should the new labels be specified?”:
Using lines in a simple text file.
- param-file “New Identifiers”: output of Replace
- “Ensure strict mapping”:
Yes
And with that we are ready for consensus sequence generation!
To accept any base suggested by the mapped sequenced reads as the consensus base for the corresponding genome position, we ask for the following requirements to be fulfilled:
- at least ten sequenced reads have to provide information about the base in question
- at a minimum, 70% of these reads have to agree on the base at this position.
To avoid getting misled too much by sequencing errors, we are also going to ignore bases with a base calling quality less than 20 in the above counts (i.e., we are going to base our decisions only on bases in sequenced reads that the basecaller of the sequencer was reasonably sure about.
Now what if we cannot obtain a consensus base for a position with the above criteria? In such cases of uncertainty we want to insert an N (i.e. an unknown base) to express that we either did not have enough information about the position or that this information was ambiguous.
galaxy-info All of the above limits for consensus base calling are arbitrary to some degree, and depend somewhat on the quality of the sequencing data. With very high overall coverage, for example, it is possible to increase the coverage threshold, but if you increase that threshold too much, you may end up with a consensus sequence consisting mostly of Ns.
Hands-on: Per-segment consensus construction
- ivar consensus ( Galaxy version 1.4.3+galaxy0)
- param-collection “Bam file”: the relabeled collection of mapped reads; output of Relabel identifiers
- “Minimum quality score threshold to count base”:
20
- “Minimum frequency threshold”:
0.7
- “Minimum indel frequency threshold”:
0.8
- “Minimum depth to call consensus”:
10
- “How to represent positions with coverage less than the minimum depth threshold”:
Represent as N (-n N)
The output is a consensus sequence in FASTA format, one per segment, with the names just providing a bit too much detail for our purpose.
- Replace parts of text ( Galaxy version 9.3+galaxy1)
- param-collection “File to process”: consensus sequences; output of ivar consensus
- In param-repeat “1: Find and Replace”:
- “Find pattern”:
^>Consensus_(.*)_threshold_.*
- “Replace with”:
>$1
- “Find-Pattern is a regular expression”:
Yes
galaxy-eye Inspect each consensus sequence generated for the different segments
QuestionDoes everything look ok?
As expected from the findings so far, the consensus sequence for the HA segment has stretches of Ns in it, which likely reflect the mapping issues and associated loss of coverage caused by our insufficiently sized collection of references.
Placing segments on a phylogenetic tree
The next logical step after obtaining the consensus sequences of segments of our sample is to explore how those sequences are related to the sequences in our reference collection. To do so, we are going to combine the reference sequences of all segments with their corresponding consensus sequence into one multiple sequence alignment (MSA) per segment, and use these to generate phylogenetic trees, again one per segment. We are going to use two rather standard tools, MAFFT and IQTree, for generating MSAs and trees, respectively.
Hands-on: Exploring phylogeny
- MAFFT ( Galaxy version 7.526+galaxy0)
- “For multiple inputs generate”:
one or several MSAs depending on input structure
- In “Input batch”:
- param-repeat “1: Input batch”
- param-collection “Sequences to align”: collection of
References per segment (INSAFlu)
- param-repeat “2: Input batch”
- param-collection “Sequences to align”: collection of renamed consensus sequences; output of Replace on consensus sequences
- “Type of sequences”:
Nucleic acids
Because both input batches are collections of eight elements each, the result is also a collection of eight MSAs, each aligning all reference sequences of one genome segment plus the consensus sequence we have obtained for that segment against each other.
- IQ-TREE ( Galaxy version 2.3.6+galaxy0)
- param-collection “Specify input alignment file in PHYLIP, FASTA, NEXUS, CLUSTAL or MSF format.”: output of MAFFT
- “Specify sequence type …“:
DNA
galaxy-eye Explore each of the final trees produced by IQTree for the different segments
There are two ways to explore the trees in Galaxy:
One option is to open the Report and Final Tree datasets produced by IQ-Tree. When you scroll down through their content, you will find that these feature simple text representations of the trees.
Another option is to expand the MaxLikelihood Tree datasets generated by IQ-Tree, then click on galaxy-barchart Visualize. This will let you launch the Phylogenetic Tree Visualization in the middle panel, which lets you explore the tree interactively and in different representations.
QuestionWhat are your conclusions about the sample in general and its HA and NA segments in particular?
- For most of its segments the sample resembles relatively recent (from the last decade) Eurasian reference sequences.
- For HA and NA the sample clusters with the few available samples of the corresponding subtype.
- None of the references closest to the sample with respect to HA and NA are close to the recent Eurasian reference cluster for their remaining segments.
- A plausible explanation is that the H4 and N6 segments of the sample have been brought into the recent Eurasian background through a reassortment event. Caveat: interpretations like this can be heavily influenced by the size of the reference collection!
Conclusion
Analysis workflows for influenza whole-genome sequencing data need to take into account the specific characteristics of the viral genome. Due to their higher natural variability this is especially true for avian influenza samples and for the HA- and NA-encoding segments of the genome.
Nevertheless, it looks possible, with carefully chosen reference segment sequences and bioinformatic tools, to avoid a computationally expensive de-novo assembly approach and to use mapping against a dynamically compiled reference genome instead.
The rather small reference segment collection suggested for this tutorial consists of 56 different samples, of which only a single one has the H4N6 subtype of the sample analyzed here. Still it allowed us to perform subtyping of the sample, to construct complete consensus sequences for 7 segments including NA, and to draw valuable conclusions about the origin of the sample. It is conceivable that a larger collection of references chosen to capture several strains from each HA subtype could solve the remaining issue of the incomplete HA consensus sequence.