Analyses of metagenomics data - The global picture

Author(s)	Saskia Hiltemann Bérénice Batut
Reviewers

Overview
Questions:

How to analyze metagenomics data?

What information can be extracted of metagenomics data?

What is the difference between amplicon and shotgun data?

What are the difference in the analyses of amplicon and shotgun data?

Objectives:

Choosing the best approach to analyze metagenomics data

Selection of tools to analyze amplicon data or shotgun data

Visualisation of a community structure

Requirements:

Introduction to Galaxy Analyses

Time estimation: 2 hours 30 minutes

Supporting Materials:

Datasets

Workflows

FAQs

instances Available on these Galaxies

Possibly Working

UseGalaxy.fr

UseGalaxy.no

UseGalaxy.org.au

Published: Jun 22, 2017

Last modification: Aug 9, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

purl PURL: https://gxy.io/GTN:T00385

rating Rating: 5.0 (1 recent ratings, 17 all time)

version Revision: 4

In metagenomics, information about micro-organisms in an environment can be extracted with two main techniques:

Amplicon sequencing, which sequences only the rRNA or ribosomal DNA of organisms
Shotgun sequencing, which sequences full genomes of the micro-organisms in the environment

In this tutorial, we will introduce the two main types of analyses with their general principles and differences. For a more in-depth look at these analyses, we recommend our detailed tutorials on each analysis.

We will use two datasets (one amplicon and one shotgun) from the same project on the Argentinean agricultural pampean soils. In this project, three different geographic regions that are under different types of land uses and two soil types (bulk and rhizospheric) were analyzed using shotgun and amplicon sequencing. We will focus on data from the Argentina Anguil and Pampas Bulk Soil (the original study included one more geographical regions Rascovan et al. 2013.

Agenda

In this tutorial, we will deal with:

Amplicon data

Importing the data

Quality Control

Sequence Alignment

Extraction of taxonomic information

Visualization

Shotgun metagenomics data

Data upload

Extraction of taxonomic information

Extraction of functional information

Conclusion

Amplicon data

Amplicon sequencing is a highly targeted approach for analyzing genetic variation in specific genomic regions. In the metagenomics fields, amplicon sequencing refers to capture and sequence of rRNA data in a sample. It can be 16S for bacteria or archea or 18S for eukaryotes.

Comment: Background: The 16S ribosomal RNA gene

Open image in new tab

Figure 1: The 16S ribosomal RNA gene. Credit: https://www.slideshare.net/beiko/ccbc-tutorial-beiko

The 16S rRNA gene has several properties that make it ideally suited to our purposes

Present in all living organisms

Highly conserved + highly variable regions

Huge reference databases

Open image in new tab

Figure 2: Variable regions of the 16S rRNA

The highly conserved regions make it easy to target the gene across different organisms, while the highly variable regions allow us to distinguish between different species.

With amplicon data, we can determine the micro-organisms from which the sequences in our sample are coming from. This is called taxonomic assignation. We try to assign sequences to taxons and then classify or extract the taxonomy in our sample.

In this analysis, we will use the mothur tool suite, but only a small portion of its tools and possibilities. To learn more in detail about how to use this tool, check out the full mothur tutorial.

Importing the data

Our datasets come from a soil samples in two different Argentinian locations, for which the 16S rDNA V4 region has been sequenced using 454 GS FLX Titanium. For the tutorial, the original fastq data has been down sampled and converted to fasta. The original data are available at EBI Metagenomics under the following run numbers:

Pampa soil: SRR531818
Anguil soil: SRR651839

Hands-on: Data upload
Import the following files from Zenodo, or from the shared data library (in GTN - Material -> microbiome -> Analyses of metagenomics data - The global picture):

SRR531818_pampa.fasta

SRR651839_anguil.fasta
https://zenodo.org/record/815875/files/SRR531818_pampa.fasta
https://zenodo.org/record/815875/files/SRR651839_anguil.fasta
Copy the link location

Open the Galaxy Upload Manager

Select Paste/Fetch Data

Paste the link into the text field

Press Start

Go into “Data” (top panel) then “Data libraries”

Click on “Training data” and then “Analyses of metagenomics data”

Select interesting file

Click on “Import selected datasets into history”

Import in a new history

As default, Galaxy takes the link as name, so rename them.

Preparing datasets

We will perform a multisample analysis with mothur. To do this, we will merge all reads into a single file, and create a group file, indicating which reads belong to which samples.

Hands-on: Prepare multisample analysis

Merge.files tool with the following parameters

“Merge” to fasta files

“Inputs” to the two sample fasta files

Make.group tool with the following parameters

“Method to create group file” to Manually specify fasta files and group names

“Additional”: Add two elements to this repeat

First element

“fasta - Fasta to group” to SRR531818_pampa file

“group - Group name” to pampa

Second element (click on “Insert Additional”)

“fasta - Fasta to group” to SRR651839_anguil file

“group - Group name” to anguil

Comment: Note

Because we only have a small number of samples, we used the manual specification. If you have hundreds of samples this would quickly become bothersome. The solution? use a collection! To read more about collections in Galaxy, please see dedicated collections tutorial.

Have a look at the group file. It is a very simple file, it contains two columns: the first contains the read names, the second contains the group (sample) name, in our case pampa or anguil.

Optimization of files for computation

Because we are sequencing many of the same organisms, we anticipate that many of our sequences are duplicates of each other. Because it’s computationally wasteful to align the same thing a bazillion times, we’ll unique our sequences using the Unique.seqs command:

Hands-on: Remove duplicate sequences

Unique.seqs tool with the following parameters

“fasta” to the merged fasta file

“output format” to Name File

Question

How many sequences were unique? How many duplicates were removed?

19,502 unique sequences and 498 duplicates.

This can be determined from the number of lines in the fasta (or names) output, compared to the number of lines in the fasta file before this step.

This Unique.seqs tool outputs two files: a fasta file containing only the unique sequences, and a names file. The names file consists of two columns, the first contains the sequence names for each of the unique sequences, and the second column contains all other sequence names that are identical to the representative sequence in the first column.

name          representatives
read_name1    read_name2,read_name,read_name5,read_name11
read_name4    read_name6,read_name,read_name10
read_name7    read_name8
...

Hands-on: Count sequences

Count.seqs tool with the following parameters

“name” to the name file from Unique.seqs

“Use a group file” to yes

“group” to the group file from Make.group

The Count.seqs file keeps track of the number of sequences represented by each unique representative across multiple samples. We will pass this file to many of the following tools to be used or updated as needed.

Quality Control

The first step in any analysis should be to check and improve the quality of our data.

Comment

For more information on the topic of quality control, please see our dedicated training materials.

First, let’s get a feel for our data:

Hands-on: Summarize data

Summary.seqs tool with the following parameters

“fasta” parameter to the fasta from Unique.seqs

“count” to count table from Count.seqs

“output logfile?” to yes

The summary output files give information per read. The logfile outputs also contain some summary statistics:

              Start    End        NBases     Ambigs   Polymer  NumSeqs
Minimum:      1        80         80         0        3        1
2.5%-tile:    1        104        104        0        3        501
25%-tile:     1        242        242        0        4        5001
Median:       1        245        245        0        4        10001
75%-tile:     1        245        245        0        4        15001
97.5%-tile:   1        247        247        0        6        19501
Maximum:      1        275        275        2        31       20000
Mean:         1        237.519    237.519    0.00495  4.24965
# of unique seqs:   19502
total # of seqs:    20000

This tells us that we have a total of 19,502 unique sequences, representing 20,000 total sequences that vary in length between 80 and 275 bases. Also, note that at least some of our sequences had some ambiguous base calls. Furthermore, at least one read had a homopolymer stretch of 31 bases, this is likely an error so we would like to filter such reads out as well.

If you are thinking that 20,000 is an oddly round number, you are correct; we downsampled the original datasets to 10,000 reads per sample for this tutorial to reduce the amount of time the analysis steps will take.

We can filter our dataset on length, base quality, and maximum homopolymer length using the Screen.seqs tool

The following tool will remove any sequences with ambiguous bases (maxambig parameter), homopolymer stretches of 9 or more bases (maxhomop parameter) and any reads longer than 275 bp or shorter than 225 bp.

Hands-on: Filter reads based on quality and length

Screen.seqs tool with the following parameters

“fasta” to the fasta file from Unique.seqs

“minlength” parameter to 225

“maxlength” parameter to 275

“maxambig” parameter to 0

“maxhomop” parameter to 8

“count” to the count file from Count.seqs

Question

How many reads were removed in this screening step? (Hint: run the Summary.seqs tool again)

1,804.

This can be determined by looking at the number of lines in bad.accnos output of screen.seqs step or by comparing the total number of seqs between of the summary.seqs log before and after this screening step

Sequence Alignment

Aligning our sequences to a reference helps improve OTU assignment [Schloss et. al.], so we will now align our sequences to an alignment of the V4 variable region of the 16S rRNA. This alignment has been created as described in [mothur’s MiSeq SOP] from the Silva reference database.

Hands-on: Align sequences
Import the silva.v4.fasta file in your history
https://zenodo.org/record/815875/files/silva.v4.fasta
Align.seqs tool with the following parameters

“fasta” to the good.fasta output from Screen.seqs

“Select Reference Template from” to Your history

“reference” to the silva.v4.fasta reference file

“flip” to Yes

This step may take a few minutes, please be patient.

Summary.seqs tool with the following parameters

“fasta” parameter to the aligned output from Align.seqs

“count” parameter to count_table output from Screen.seqs

To get an idea of the quality of the alignment, we can view the log output from the summary step:

        Start   End NBases  Ambigs  Polymer NumSeqs
Minimum:    2391    10674   9   0   2   1
2.5%-tile:  3080    12071   234 0   4   455
25%-tile:   3080    13424   244 0   4   4545
Median:     3080    13424   245 0   4   9090
75%-tile:   3080    13424   245 0   4   13634
97.5%-tile: 3082    13424   246 0   6   17724
Maximum:    13396   13425   267 0   7   18178
Mean:   3080.6  13380   244.212 0   4.27946
# of unique seqs:   17698
total # of seqs:    18178

Question

How many sequences have been aligned?

Between which positions most of the reads are aligned to this references?

17,698 are aligned

From this we can see that most of our reads align nicely to positions 3080-13424 on this reference. This corresponds exactly to the V4 target region of the 16S gene.

To make sure that everything overlaps the same region we’ll re-run Screen.seqs to get sequences that start at or before position 3,080 and end at or after position 13,424.

Hands-on: Remove poorly aligned sequences

Screen.seqs tool with the following parameters

“fasta” to the aligned fasta file

“start” to 3080

“end” to 13424

“count” to the group file created by the previous run of Screen.seqs

Question

How many sequences were removed in this step?

4,579 sequences were removed. This is the number of lines in the bad.accnos output.

Now we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we’ll filter the sequences to remove the overhangs at both ends. In addition, there are many columns in the alignment that only contain external gap characters (i.e. “.”), while columns containing only internal gap characters (i.e., “-“) are not considered. These can be pulled out without losing any information. We’ll do all this with Filter.seqs:

Hands-on: Filter sequences

Filter.seqs tool with the following parameters

“fasta”” to good.fasta output from Screen.seqs

“trump” to .

Extraction of taxonomic information

The main questions when analyzing amplicon data are: Which micro-organisms are present in an environmental samples? And in what proportion? What is the structure of the community of the micro-organisms?

The idea is to take the sequences and assign them to a taxon. To do that, we group (or cluster) sequences based on their similarity to defined Operational Taxonomic Units (OTUs): groups of similar sequences that can be treated as a single “genus” or “species” (depending on the clustering threshold)

Comment: Background: Operational Taxonomic Units (OTUs)

In 16S metagenomics approaches, OTUs are clusters of similar sequence variants of the 16S rDNA marker gene sequence. Each of these clusters is intended to represent a taxonomic unit of a bacterial species or genus depending on the sequence similarity threshold. Typically, OTU clusters are defined by a 97% identity threshold of the 16S gene sequence variants at genus level. 98% or 99% identity is suggested for species separation.

Open image in new tab

Figure 3: OTU and cluster with 97% identity threshold

Open image in new tab

Figure 4: Cladogram of operational taxonomic units (OTUs). Credit: Danzeisen et al. 2013, 10.7717/peerj.237

The first thing we want to do is to further de-noise our sequences from potential sequencing errors, by pre-clustering the sequences using the Pre.cluster command, allowing for up to 2 differences between sequences. This command will split the sequences by group and then sort them by abundance, then go from most abundant to least and identify sequences that differ by no more than 2 nucleotides from on another. If this is the case, then they get merged. We generally recommend allowing 1 difference for every 100 basepairs of sequence:

Hands-on: Perform preliminary clustering of sequences and remove undesired sequences

Pre.cluster tool with the following parameters

“fasta” to the fasta output from the last Filter.seqs run

“name file or count table” to the count table from the last Screen.seqs step

“diffs” to 2

Question

How many unique sequences are we left with after this clustering of highly similar sequences?

10,398

This is the number of lines in the fasta output

We would like to classify the sequences using a training set, which is again is provided on [mothur’s MiSeq SOP].

Hands-on: Classify the sequences into phylotypes
Import the trainset16_022016.pds.fasta and trainset16_022016.pds.tax in your history
https://zenodo.org/record/815875/files/trainset16_022016.pds.fasta
https://zenodo.org/record/815875/files/trainset16_022016.pds.tax
Classify.seqs tool with the following parameters

“fasta” to the fasta output from Pre.cluster

“Select Reference Template from” to History

“reference” to trainset16_022016.pds.fasta from your history

“Select Taxonomy from” to History

“taxonomy” to trainset16_022016.pds.tax from your history

“count” to the count table from Pre.cluster
This step may take a couple of minutes, now may be a good time to grab a cup of tea/coffee.

Have a look at the taxonomy output.

name    taxonomy
SRR651839.9109-HXY9DLL01BSHQO-2 Bacteria(99);Verrucomicrobia(99);Spartobacteria(99);unclassified;unclassified;unclassified;
SRR651839.11437-HXY9DLL01AMIPI-2    Bacteria(100);Verrucomicrobia(97);Spartobacteria(97);unclassified;unclassified;unclassified;
SRR651839.15884-HXY9DLL01BDG2F-2    Bacteria(99);unclassified;unclassified;unclassified;unclassified;unclassified;
SRR651839.12048-HXY9DLL01DSOWS-2    Bacteria(100);Verrucomicrobia(100);Spartobacteria(100);unclassified;unclassified;unclassified;
SRR651839.9410-HXY9DLL01BGRG2-2 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhizobiales(100);Bradyrhizobiaceae(100);unclassified;
SRR651839.9029-HXY9DLL01E4W9Z-2 Bacteria(100);Verrucomicrobia(100);Spartobacteria(100);unclassified;unclassified;unclassified;
SRR651839.6283-HXY9DLL01CHX25-2 Bacteria(100);Acidobacteria(100);Acidobacteria_Gp4(100);Gp4(100);unclassified;unclassified;
SRR651839.3134-HXY9DLL01DOM67-2 Bacteria(100);Acidobacteria(100);Acidobacteria_Gp6(100);Gp6(100);unclassified;unclassified;
SRR651839.13044-HXY9DLL01ETN7L-2    Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhizobiales(100);Bradyrhizobiaceae(100);unclassified;
SRR531818.61708-G88ZSJI01AVPPR-2    Bacteria(100);Acidobacteria(99);Acidobacteria_Gp6(99);Gp6(99);unclassified;unclassified;

You will see that every read now has a classification.

The next step is to use this information to determine the abundances of the different found taxa. This consists of three steps:

first, all individual sequences are classified, and get assigned a confidence score (0-100%)
next, sequences are grouped at 97% identity threshold (not using taxonomy info)
finally, for each cluster, a consensus classification is determined based on the classification of the individual sequences and taking their confidence scores into account

Hands-on: Assign sequences to OTUs

Cluster.split tool with the following parameters

“Split by” to Classification using fasta

“fasta” to the fasta output from Pre.cluster

“taxonomy” to the taxonomy output from Classify.seqs

“count” to the count table output from Pre.cluster

“Clustering method” to Average Neighbour

“cutoff” to 0.15

We obtain a table with the columns being the different identified OTUs, the rows the different distances and the cells the ids of the sequences identified for these OTUs for the different distances.

Next we want to know how many sequences are in each OTU from each group with a distance of 0.03 (97% similarity). We can do this using the Make.shared command with the 0.03 cutoff level:

Hands-on: Estimate OTU abundance

Make.shared tool with the following parameters

“Select input type” to OTU list

“list” to list output from Cluster.split

“name file or count table” to the count table from Pre.cluster

“label” to 0.03

We probably also want to know the taxonomy for each of our OTUs. We can get the consensus taxonomy for each OTU using the Classify.otu command:

Hands-on: Classify the OTUs

Classify.otu tool with the following parameters

“list” to output from Cluster.split

“count” to the count table from Pre.cluster

“Select Taxonomy from” to History

“taxonomy” to the taxonomy output from Classify.seqs

“label” to 0.03

Question

How many OTUs with taxonomic assignation are found for the Anguil sample? And for the Pampa sample?

What is the annotation of first OTU and its size?

2,195 for Anguil and 2,472 for Pampa (“tax.summary”)

Otu00001 is associated to 929 sequences and to Bacteria (kingdom), Verrucomicrobia (phylum), Spartobacteria (class) in “taxonomy” file

Visualization

We have now determined our OTUs and classified them, but looking at a long text file is not very informative. Let’s visualize our data using Krona:

Hands-on: Krona

First we convert our mothur taxonomy file to a format compatible with Krona

Taxonomy-to-Krona tool with the following parameters

“Taxonomy file” to the taxonomy output from Classify.otu (note: this is a collection input)

Krona pie chart tool with the following parameters

“Type of input” to Tabular

“Input file” to taxonomy output from Taxonomy-to-Krona (collection)

The result is an HTML file with an interactive visualization, for instance try clicking on one of the rings in the image or playing around with some of the settings.

This produced a single plot for both your samples, but what if you want to compare the two samples?

Hands-on: Per-sample Krona plots

Classify.otu tool

Hit the rerun button on the Classify.otu job in your history and see if you can find settings that will give you per-sample taxonomy data

Krona tool

Now use this new output collection to create per-sample Krona plots

In this new Krona output you can switch between the combined plot and the per-sample plots via the selector in the top-left corner.

Question

Which soil sample had a higher percentage of Acidobacteria, anguil or pampa? what were the respective percentages?

The anguil sample had a higher proportion of Acidobacteria. The exact percentages can be found by looking at the pie charts at the top right-hand corner after clicking on the label Acidobacteria. For anguil the percentage is 36%, for the pampa sample it is 26%.

To further explore the community structure, we can visualize it with dedicated tools such as Phinch.

Hands-on: Visualization of the community structure with Phinch

Make.biom tool with the following parameters

“shared” to Make.shared output

“constaxonomy” to taxonomy output from the first run of Classify.otu (collection)

Expand the dataset and click on the “view biom at phinch” link

Comment

If this link is not present on your Galaxy, you can download the generated BIOM file and upload directly to Phinch server at http://phinch.org.

Play with the several interactive visualisations:

Open image in new tab

Figure 5: Phinch visualizations

Once we have information about the community structure (OTUs with taxonomic structure), we can do more analysis on it: estimation of the diversity of micro-organism, comparison of diversity between samples, analysis of populations, … We will not go into detail of such analyses here but you can follow our tutorials on amplicon data analyses to learn more about them.

Shotgun metagenomics data

In the previous section, we saw how to analyze amplicon data to extract the community structure. Such information can also be extracted from shotgun metagenomic data.

In shotgun data, full genomes of the micro-organisms in the environment are sequenced (not only the 16S or 18S). We can then have access to the rRNA (only a small part of the genomes), but also to the other genes of the micro-organisms. Using this information, we can try to answer questions such as “What are the micro-organisms doing?” in addition to the question “What micro-organisms are present?”.

In this second part, we will use a metagenomic sample of the Pampas Soil (SRR606451).

Data upload

Hands-on: Data upload
Create a new history
Import the SRR606451_pampa Fasta file from Zenodo or from the data library (in “Analyses of metagenomics data”)
https://zenodo.org/record/815875/files/SRR606451_pampa.fasta

Extraction of taxonomic information

As for amplicon data, we can extract taxonomic and community structure information from shotgun data. Different approaches can be used:

Same approach as for amplicon data with identification and classification of OTUs

Such an approach requires a first step of sequence sorting to extract only the 16S and 18S sequences, then using the same tools as for amplicon data. However, rRNA sequences represent a low proportion (< 1%) of the shotgun sequences so such an approach is not the most statistically supported
Assignation of taxonomy on the whole sequences using databases with marker genes

In this tutorial, we use the second approach with MetaPhlAn2. This tools is using a database of ~1M unique clade-specific marker genes (not only the rRNA genes) identified from ~17,000 reference (bacterial, archeal, viral and eukaryotic) genomes.

Hands-on: Taxonomic assignation with MetaPhlAn2

MetaPhlAN2 tool with

“Input file” to the imported file

“Database with clade-specific marker genes” to locally cached

“Cached database with clade-specific marker genes” to MetaPhlAn2 clade-specific marker genes

This step may take a couple of minutes.

3 files are generated:

A tabular file with the community structure

  #SampleID   Metaphlan2_Analysis
  k__Bacteria 100.0
  k__Bacteria|p__Proteobacteria   86.20712
  k__Bacteria|p__Actinobacteria   13.79288
  k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria    86.20712
  k__Bacteria|p__Actinobacteria|c__Actinobacteria 13.79288

Each line contains a taxa and its relative abundance found for our sample. The file starts with high level taxa (kingdom: k__) and go to more precise taxa.

A BIOM file with the same information as the previous file but in BIOM format

It can be used then by mothur and other tools requiring community structure information in BIOM format
A SAM file with the results of the mapping of the sequences on the reference database

Question

What is the most precise level we have access to with MetaPhlAn2?

What are the two orders found in our sample?

What is the most abundant family in our sample?

We have access to species level

Pseudomonadales and Solirubrobacterales are found in our sample

The most abundant family is Pseudomonadaceae with 86.21 % of the assigned sequences

Even if the output of MetaPhlAn2 is bit easier to parse than the BIOM file, we want to visualize and explore the community structure with KRONA

Hands-on: Interactive visualization with KRONA

Format MetaPhlAn2 output for Krona tool with

“Input file” to Community profile output of MetaPhlAn2

KRONA pie chart tool with

“What is the type of your input data” as MetaPhlan

“Input file” to the output of Format MetaPhlAn2

Extraction of functional information

We would now like to answer the question “What are the micro-organisms doing?” or “Which functions are performed by the micro-organisms in the environment?”.

In the shotgun data, we have access to the gene sequences from the full genome. We use that to identify the genes, associate them with a function, build pathways, etc., to investigate the functional part of the community.

Hands-on: Metabolism function identification
HUMAnN2 tool with

“Input sequence file” to the imported sequence file

“Use of a custom taxonomic profile” to Yes

“Taxonomic profile file” to Community profile output of MetaPhlAn2

“Nucleotide database” to Locally cached

“Nucleotide database” to Full

“Protein database” to Locally cached

“Protein database” to Full UniRef50

“Search for uniref50 or uniref90 gene families?” to uniref50

“Database to use for pathway computations” to MetaCyc

“Advanced Options”

“Remove stratification from output” to Yes

This step is long so we generated the output for you!
Import the 3 files whose the name is starting with “humann2”
https://zenodo.org/record/815875/files/humann2_gene_families_abundance.tsv
https://zenodo.org/record/815875/files/humann2_pathways_abundance.tsv
https://zenodo.org/record/815875/files/humann2_pathways_coverage.tsv

HUMAnN2 generates 3 files

A file with the abundance of gene families

Gene family abundance is reported in RPK (reads per kilobase) units to normalize for gene length. It reflects the relative gene (or transcript) copy number in the community.

The “UNMAPPED” value is the total number of reads which remain unmapped after both alignment steps (nucleotide and translated search). Since other gene features in the table are quantified in RPK units, “UNMAPPED” can be interpreted as a single unknown gene of length 1 kilobase recruiting all reads that failed to map to known sequences.
A file with the coverage of pathways

Pathway coverage provides an alternative description of the presence (1) and absence (0) of pathways in a community, independent of their quantitative abundance.
A file with the abundance of pathways

Question

How many gene families and pathways have been identified?

44 gene families but no pathways are identified

The RPK for the gene families are quite difficult to interpret in term of relative abundance. We decide then to normalize the values

Hands-on: Normalize the gene family abundances

Renormalize a HUMAnN2 generated table tool with

“Gene/pathway table” to the gene family table generated with HUMAnN2

“Normalization scheme” to Relative abundance

“Normalization level” to Normalization of all levels by community total

Question

What percentage of sequences has not be assigned to a gene family?

What is the most abundant gene family?

55% of the sequences have not be assigned to a gene family

The most abundant gene family with 25% of sequences is a putative secreted protein

With the previous analyses, we investigate “Which micro-organims are present in my sample?” and “What function are performed by the micro-organisms in my sample?”. We can go further in these analyses (for example, with a combination of functional and taxonomic results). We did not detail that in this tutorial but you can find more analyses in our tutorials on shotgun metagenomic data analyses.

Conclusion

We can summarize the analyses with amplicon and shotgun metagenomic data:

Both analyses are quite complex! However, in this tutorial, we only showed simple cases of metagenomics data analysis with subset of real data.

Check our other tutorials to learn more in detail of how to analyze metagenomics data.

You've Finished the Tutorial

Key points

With amplicon data, we can extract information about the studied community structure

With shotgun data, we can extract information about the studied community structure and also the functions realised by the community

The tools used to analyze amplicon and shotgun data are different, except for the visualisation

Metagenomics data analyses are complex and time-consuming

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Microbiome topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Rascovan, N., B. Carbonetto, S. Revale, M. D. Reinert, R. Alvarez et al., 2013 The PAMPA datasets: a metagenomic survey of microbial communities in Argentinean pampean soils. Microbiome 1: 10.1186/2049-2618-1-21

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Citing this Tutorial

Saskia Hiltemann, Bérénice Batut, Analyses of metagenomics data - The global picture (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/general-tutorial/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{microbiome-general-tutorial,
author = "Saskia Hiltemann and Bérénice Batut",
	title = "Analyses of metagenomics data - The global picture (Galaxy Training Materials)",
	year = "",
	month = "",
	day = "",
	url = "\url{https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/general-tutorial/tutorial.html}",
	note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
	doi = {10.1371/journal.pcbi.1010752},
	url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
	year = 2023,
	month = {jan},
	publisher = {Public Library of Science ({PLoS})},
	volume = {19},
	number = {1},
	pages = {e1010752},
	author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
	editor = {Francis Ouellette},
	title = {Galaxy Training: A powerful framework for teaching!},
	journal = {PLoS Comput Biol}
}

                   

Congratulations on successfully completing this tutorial!

You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/microbiome/tutorials/general-tutorial/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: taxonomy_krona_chart
  owner: crs4
  revisions: 73486cc6170e
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: humann2
  owner: iuc
  revisions: 1ab06263e083
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: humann2_regroup_table
  owner: iuc
  revisions: a1747df2bc21
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: humann2_renorm_table
  owner: iuc
  revisions: ba4dfbdc1295
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: metaphlan2
  owner: iuc
  revisions: 8c82c4d90cc6
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: metaphlan2krona
  owner: iuc
  revisions: ebb74774457e
  tool_panel_section_label: Metagenomic Analysis
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_align_seqs
  owner: iuc
  revisions: 2692b3107533
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_classify_otu
  owner: iuc
  revisions: e664787de3ab
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_classify_seqs
  owner: iuc
  revisions: cc9c7b550c29
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_cluster_split
  owner: iuc
  revisions: a29ac7993cb0
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_count_seqs
  owner: iuc
  revisions: ecbd9ed733e6
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_filter_seqs
  owner: iuc
  revisions: bf20a30c263a
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_make_biom
  owner: iuc
  revisions: 5dc4d49a238a
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_make_group
  owner: iuc
  revisions: 8c964d119da8
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_make_shared
  owner: iuc
  revisions: 83dabeeadb55
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_merge_files
  owner: iuc
  revisions: a70b25374a23
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_pre_cluster
  owner: iuc
  revisions: 58d0fe3cc0f2
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_screen_seqs
  owner: iuc
  revisions: 8743aecd26a9
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_summary_seqs
  owner: iuc
  revisions: e4a4cb8b6bbf
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: mothur_unique_seqs
  owner: iuc
  revisions: 1f72d9a3d09f
  tool_panel_section_label: Mothur
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: krona_text
  owner: saskia-hiltemann
  revisions: b14f1444e464
  tool_panel_section_label: Graph/Display Data
  tool_shed_url: https://toolshed.g2.bx.psu.edu/

5 stars 8

4 stars 6

3 stars 1

2 stars 1

1 stars 1

May 2024

5 stars: Disliked: Usegalaxy.org did not support the metaplAN2 and HUManN2.

November 2022

5 stars: Liked: Hands in Disliked: Videos might help us more

May 2022

5 stars: Liked: Simplicity of explaining complicate issue (-; Disliked: n.a.

July 2020

4 stars: Liked: well described Disliked: help when stuck

December 2019

5 stars: Liked: flow of the tutorial Disliked: more accurate galaxy functions reference.

April 2019

4 stars: Liked: step by step very well described Disliked: data analysis with OTU / phylotype

November 2018

4 stars: Liked: clear step by step module

4 stars: Liked: The overview of all the tools available. Disliked: The time for the processes on Galaxy.

October 2018

1 stars: Liked: Some instructions were not clear at all, please be more specific as i encounter errors very likely despite following the steps Disliked: Please review the protocol and revise some segments

2 stars: Liked: Krona Disliked: It needs to be updated, some returns error and some is missing. Be more detail and specific