Functional annotation of protein sequences

Author(s): Anthony Bretaudeau

Overview
Questions:

How to perform functional annotation on protein sequences?

Objectives:

Perform functional annotation using EggNOG-mapper and InterProScan

Requirements:

Introduction to Galaxy Analyses

Time estimation: 1 hour

Level: Introductory


Available on these Galaxies

Known Working

UseGalaxy.eu ✅ ⭐️

UseGalaxy.fr ✅ ⭐️

UseGalaxy.org (Main) ✅ ⭐️

UseGalaxy.org.au ✅ ⭐️

UseGalaxy.cz ✅

Possibly Working

UseGalaxy.no

Published: Jul 20, 2022

Last modification: Oct 15, 2024

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT

PURL: https://gxy.io/GTN:T00173

Revision: 10

When performing the structural annotation of a genome sequence, you get the position of each gene, but you don’t have information about their name or their function. Assigning that information is the goal of functional annotation.

In this short tutorial, we will run the most commonly used tools to perform functional annotation, starting from the predicted protein sequences of a few example genes.

For a more complete view of how this step integrates into a whole genome sequencing and annotation process, you can have a look at the Funannotate tutorial.

Agenda

In this tutorial, we will cover:

Data upload

Functional annotation

EggNOG Mapper

InterProScan

Conclusion

Data upload

We will annotate a small set of protein sequences. These sequences were predicted from the gene structures obtained in the Funannotate tutorial. Though these sequences come from a fungal species, you can run the same tools on proteins from any organism, including prokaryotes.

Hands-on: Data upload
Create a new history for this tutorial

To create a new history, click the new-history icon at the top of the history panel.
Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Functional annotation of protein sequences):
https://zenodo.org/record/6861851/files/proteins.fasta
Copy the link location

Click Upload Data at the top of the tool panel

Select Paste/Fetch Data

Paste the link(s) into the text field

Press Start

Close the window

As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

Go into Data (top panel) then Data libraries

Navigate to the correct folder as indicated by your instructor.

On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.

Select the desired files

Click on Add to History near the top and select as Datasets from the dropdown menu

In the pop-up window, choose

“Select history”: the history you want to import the data to (or create a new one)

Click on Import
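Once the file is in your history, you may want to check how many protein sequences it contains before launching the annotation tools. The following is a minimal Python sketch of that check, outside Galaxy. The function name `read_fasta` and the two example records are hypothetical and only serve as an illustration; they are not taken from proteins.fasta.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence identifier: sequence} from FASTA-formatted lines."""
    parts: Dict[str, List[str]] = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'
            words = line[1:].split()
            current_id = words[0] if words else ""
            parts[current_id] = []
        elif current_id is not None:
            # Sequence lines are concatenated until the next header
            parts[current_id].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in parts.items()}


# Made-up records, for illustration only
EXAMPLE_FASTA = """>protein_1 hypothetical example
MSTNPKPQRK
TKRNTNRRPQ
>protein_2
MKVLAAGIVG
"""

proteins = read_fasta(EXAMPLE_FASTA.splitlines())
print(len(proteins), "sequences")
for seq_id, seq in proteins.items():
    print(seq_id, len(seq))
```

To run it on a downloaded copy of the dataset, you could pass an open file handle instead of `EXAMPLE_FASTA.splitlines()`.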

Functional annotation

EggNOG Mapper

EggNOG Mapper compares each protein sequence of the annotation to a huge set of ortholog groups from the EggNOG database. In this database, each ortholog group is associated with functional annotation like Gene Ontology (GO) terms or KEGG pathways. When the protein sequence of a new gene is found to be very similar to one of these ortholog groups, the corresponding functional annotation is transferred to this new gene.

Hands-on

Run eggNOG Mapper (Galaxy version 2.1.8+galaxy3) with the following parameters:

“Fasta sequences to annotate”: proteins.fasta (Input dataset)

“Version of eggNOG Database”: select the latest version available

In “Output Options”:

“Exclude header lines and stats from output files”: No

The output of this tool is a tabular file, where each line represents a gene from our annotation, with the functional annotation that was found by EggNOG-mapper. It includes a predicted protein name, GO terms, EC numbers, KEGG identifiers, …

Display the file and explore which kinds of identifiers were found by EggNOG Mapper.
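If you download the tabular output, you can also explore it programmatically. The sketch below is one possible way to do so in Python, under several assumptions: that comment lines start with `##`, that the column header line starts with `#query`, that column names such as `Preferred_name`, `GOs`, `EC` and `KEGG_ko` are present, and that `-` marks a missing annotation. Check these against your own file. The function names and the example rows are hypothetical and reduced to a few columns.

```python
from typing import Dict, Iterable, List


def read_eggnog_annotations(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse annotation lines into one dict per annotated protein."""
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and comment/stats lines
        fields = line.split("\t")
        if line.startswith("#"):
            # Header line: drop the leading '#' of the first column name
            header = [name.lstrip("#") for name in fields]
            continue
        if header is None:
            raise ValueError("no header line found before the first data line")
        rows.append(dict(zip(header, fields)))
    return rows


def split_terms(value: str) -> List[str]:
    """Split a comma-separated annotation field; '-' is assumed to mean 'none'."""
    return [] if value in ("", "-") else value.split(",")


# Made-up rows, reduced to a few columns, for illustration only
EXAMPLE = "\n".join([
    "## made-up example",
    "#query\tPreferred_name\tGOs\tEC\tKEGG_ko",
    "protein_1\tHIS3\tGO:0000105,GO:0004424\t4.2.1.19\tko:K01693",
    "protein_2\t-\t-\t-\t-",
    "## end of made-up example",
])

rows = read_eggnog_annotations(EXAMPLE.splitlines())
for row in rows:
    go_terms = split_terms(row["GOs"])
    print(row["query"], row["Preferred_name"], len(go_terms), "GO terms")
```

Keeping the header lines in the output (the “Exclude header lines and stats from output files”: No setting above) is what lets this kind of parser look up columns by name rather than by position.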

InterProScan

InterPro is a huge integrated database of protein families. Each family is characterized by one or multiple signatures (i.e. sequence motifs) that are specific to the protein family, and by corresponding functional annotation like protein names or Gene Ontology (GO) terms. A good proportion of the signatures are manually curated, which means they are of very good quality.

InterProScan is a tool that analyses each protein sequence from our annotation to determine whether it contains one or several of the signatures from InterPro. When a protein contains a known signature, the corresponding functional annotation will be assigned to it by InterProScan.

InterProScan itself runs multiple applications to search for the signatures in the protein sequences. It is possible to select exactly which ones we want to use when launching the analysis (by default all will be run).

Hands-on

Run InterProScan (Galaxy version 5.59-91.0+galaxy3) with the following parameters:

“Protein FASTA File”: proteins.fasta (Input dataset)

“InterProScan database”: select the latest version available

“Use applications with restricted license, only for non-commercial use?”: Yes (set it to No if you run InterProScan for commercial use)

“Output format”: Tab-separated values format (TSV) and XML

Comment

To speed up the processing by InterProScan during this tutorial, you can disable the Pfam and PANTHER applications. When analysing real data, it is advised to keep them enabled.

When some applications are disabled, you will of course miss the corresponding results in the output of InterProScan.

The output of this tool is both a tabular file and an XML file. Both contain the same information, but the tabular one is more readable for a human: each line represents a domain or motif that InterProScan found in one of the proteins from our annotation.

If you display the TSV file you should see something like this:

Figure: InterProScan TSV output

Each line corresponds to a motif found in one of the annotated proteins. The most interesting columns are:

Column 1: the protein identifier
Column 4: the databank where this signature comes from (InterProScan groups several motif databanks)
Column 5: the identifier of the signature that was found in the protein sequence
Column 6: the human-readable description of the motif
Columns 7 and 8: the start and stop positions of the motif in the protein sequence
Column 9: a score for the match (if available)
Columns 12 and 13: the identifier and description of the InterPro entry that integrates the signature (if available). Have a look at an example webpage for IPR036859 on InterPro.
The following columns contain various identifiers that were assigned to the protein based on the match with the signature (Gene Ontology terms, Reactome, …)
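The column layout described above can be sketched in Python as follows. This is only an illustration, assuming a header-less tab-separated file in which `-` marks an empty value and in which the InterPro columns may be absent on some lines. The `Match` type, the function name and the example lines are hypothetical, and the example values are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Match(NamedTuple):
    protein: str                # column 1
    databank: str               # column 4
    signature: str              # column 5
    description: str            # column 6
    start: int                  # column 7
    stop: int                   # column 8
    interpro_id: Optional[str]  # column 12, None when not available


def parse_interproscan_tsv(lines: Iterable[str]) -> List[Match]:
    """Turn TSV lines into Match records, skipping lines that are too short."""
    matches = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        interpro_id = None
        if len(fields) > 11 and fields[11] not in ("", "-"):
            interpro_id = fields[11]
        matches.append(Match(fields[0], fields[3], fields[4], fields[5],
                             int(fields[6]), int(fields[7]), interpro_id))
    return matches


# Made-up lines, for illustration only
EXAMPLE = [
    "protein_1\tmd5a\t350\tSUPERFAMILY\tSSF74924\tCap-Gly domain\t10\t90\t1.2E-20\tT\t01-01-2024\tIPR036859\tCAP Gly-rich domain superfamily",
    "protein_1\tmd5a\t350\tCoils\tCoil\tCoil\t200\t220\t-\tT\t01-01-2024\t-\t-",
    "protein_2\tmd5b\t120\tPfam\tPF00000\tHypothetical family\t5\t100\t3.4E-10\tT\t01-01-2024",
]

matches = parse_interproscan_tsv(EXAMPLE)

# Group the matches by protein identifier
by_protein: Dict[str, List[Match]] = defaultdict(list)
for match in matches:
    by_protein[match.protein].append(match)

for protein, hits in by_protein.items():
    print(protein, [(h.databank, h.signature, h.start, h.stop) for h in hits])
```

Grouping by the first column reflects the fact that a protein appears on as many lines as it has matches.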

The XML output file contains the same information in a computer-friendly format; we will use it in the next step.

Conclusion

Congratulations on reaching the end of this tutorial! Now you know how to perform the functional annotation of a set of protein sequences, using EggNOG Mapper and InterProScan.

If you want to collect more functional annotation, you can try to run the NCBI BLAST+ blastp (Galaxy version 2.10.1+galaxy2) or Diamond (Galaxy version 2.0.15+galaxy0) tools against the UniProt or NR databases (Diamond runs much faster on big datasets). These tools will search for similarities between your protein sequences and the ones already described in big international databases.

Also note that many other more specialised tools exist to collect even more functional annotation, in particular for certain species (prokaryotes for example), or enzyme/protein families.


Key points

EggNOG Mapper compares sequences to a database of annotated orthologous sequences

InterProScan detects known motifs in protein sequences


Citing this Tutorial

Anthony Bretaudeau, Functional annotation of protein sequences (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/functional/tutorial.html Online; accessed TODAY
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


Funding

These individuals or organisations provided funding support for the development of this resource

Gallantries

This project (2020-1-NL01-KA203-064717) is funded with the support of the Erasmus+ programme of the European Union. Their funding has supported a large number of tutorials within the GTN across a wide array of topics.

EuroScienceGateway

EuroScienceGateway was funded by the European Union programme Horizon Europe (HORIZON-INFRA-2021-EOSC-01-04) under grant agreement number 101057388 and by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee grant number 10038963.

ELIXIR Europe

IFB


You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.

shed-tools install [-g GALAXY] [-a API_KEY] -t <(curl https://training.galaxyproject.org/training-material/api/topics/genome-annotation/tutorials/functional/tutorial.json | jq .admin_install_yaml -r)

Alternatively you can copy and paste the following YAML

---
install_tool_dependencies: true
install_repository_dependencies: true
install_resolver_dependencies: true
tools:
- name: diamond
  owner: bgruening
  revisions: e8ac2b53f262
  tool_panel_section_label: NCBI Blast
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: interproscan
  owner: bgruening
  revisions: 74810db257cc
  tool_panel_section_label: Annotation
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: ncbi_blast_plus
  owner: devteam
  revisions: 0e3cf9594bb7
  tool_panel_section_label: NCBI Blast
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
- name: eggnog_mapper
  owner: galaxyp
  revisions: 844fa988236b
  tool_panel_section_label: Proteomics
  tool_shed_url: https://toolshed.g2.bx.psu.edu/
