Working with very large fasta datasets
- Run FastQC on your data to make sure the format/content is what you expect. Run more QA as needed.
  - Search GTN tutorials with the keyword “qa-qc” for examples.
  - Search Galaxy Help with the keywords “qa-qc” and “fasta” for more help.
- Assembly result?
  - Consider filtering by length to remove reads that did not assemble.
  - Formatting criteria:
    - All sequence identifiers must be unique.
    - Some tools require that the fasta title line (“>” line) contain only the identifier, with no description content. Use NormalizeFasta to remove the description (all content after the first whitespace) and wrap the sequences to 80 bases.
- Custom genome, transcriptome, or exome?
  - Only appropriate for smaller genomes (bacterial, viral, most insects).
  - Not appropriate for any mammalian genomes, or some plants/fungi.
  - Sequence identifiers must be an exact match with all other inputs or expect problems. See GFF GTF GFF3.
  - Formatting criteria:
    - All sequence identifiers must be unique.
    - ALL tools will require that the fasta title line (“>” line) contain only the identifier, with no description content. Use NormalizeFasta to remove the description (all content after the first whitespace) and wrap the sequences to 80 bases.
    - The only exception is when running the MakeBLASTdb tool with an input fasta in NCBI BLAST format (see the tool form).
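
The formatting criteria above can be checked outside of Galaxy with a short script. What follows is a minimal Python sketch, not the NormalizeFasta implementation: it only imitates the behaviour described above (keep the identifier before the first whitespace, wrap sequences to a fixed width), and it also shows a length filter, a duplicate-identifier check, and a comparison of fasta identifiers against the first column of a GTF/GFF-style annotation. All function names and the example records are hypothetical, chosen for illustration only.

```python
from collections import Counter


def parse_fasta(lines):
    """Yield (title, sequence) pairs; title is the ">" line without the ">"."""
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def identifier(title):
    """Return the identifier: everything before the first whitespace."""
    parts = title.split(None, 1)
    return parts[0] if parts else ""


def duplicate_ids(records):
    """Return the sorted identifiers that occur more than once."""
    counts = Counter(identifier(title) for title, _ in records)
    return sorted(name for name, n in counts.items() if n > 1)


def filter_by_length(records, min_length):
    """Yield only records whose sequence has at least min_length bases."""
    for title, seq in records:
        if len(seq) >= min_length:
            yield title, seq


def normalize_record(title, seq, width=80):
    """Drop the description and wrap the sequence to `width` bases per line."""
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([">" + identifier(title)] + wrapped)


def annotation_ids_missing_from_fasta(fasta_ids, annotation_lines):
    """Return sequence names (column 1 of tab-separated lines) absent from the fasta."""
    seqnames = {
        line.split("\t", 1)[0]
        for line in annotation_lines
        if line.strip() and not line.startswith("#")
    }
    return sorted(seqnames - set(fasta_ids))


# Hypothetical example data, small enough to inspect by eye.
example_fasta = [
    ">contig1 length=10 cov=3.2",
    "ACGTACGTAC",
    ">contig2 some description",
    "ACGT",
    ">contig1 duplicate identifier",
    "GG",
]
example_gtf = [
    "# hypothetical annotation lines",
    "contig1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id \"g1\";",
    "chr9\tsrc\texon\t1\t4\t.\t+\t.\tgene_id \"g2\";",
]

records = list(parse_fasta(example_fasta))
print("Duplicate identifiers:", duplicate_ids(records))

kept = list(filter_by_length(records, min_length=4))
for title, seq in kept:
    print(normalize_record(title, seq, width=4))

fasta_ids = [identifier(title) for title, _ in records]
print("In annotation but not in fasta:",
      annotation_ids_missing_from_fasta(fasta_ids, example_gtf))
```

For a very large file, an open file handle can be passed to `parse_fasta` in place of the example list, so records are read one at a time; only the identifier counts are held for the whole run. The width of 4 in the example is just to make the wrapping visible; the default of 80 matches the text above.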
Persistent URL
PURL: https://gxy.io/GTN:F00050