Author(s): Polina Polunina
Posted on: 3 June 2024 | PURL: https://gxy.io/GTN:N00081
As part of the EuroScienceGateway project, and in cooperation with Onedata and EGI, we are making all GTN training data available on publicly accessible cloud storage. These training datasets are curated, small but meaningful for educational purposes, and comprise 1530 files with a total size of 170 GB. They are an invaluable set of resources for anyone working in data science and training. Please thank the more than 350 contributors to the GTN.
What does this mean for you?
- Teachers: if you’re a teacher contributing to the GTN, you can now be sure that the datasets you use in your materials are even more accessible and easier to use.
- GTN Users: when following training materials, you’ll now find it easier to access those same datasets in new locations.
- Galaxy Admins: if you’re running a Galaxy server, you can now more easily integrate the GTN data into your server without taking up unnecessary storage space, while still making it available to all your users.
Accessing GTN Data in the Cloud
GTN training data is always accessible, annotated and linked for every tutorial. Usually, it’s stored in Zenodo, referenced via a DOI. You can access all GTN training data using several methods:
- Onedata Share: access without authentication.
  a. Visit the public share link to browse and download the data via the Onedata Web UI.
  b. Use the public REST API to access the data; on the share page (see above) you will find ready-to-use curl examples by right-clicking a file or directory and choosing Information from the context menu.
- Galaxy Server integration: access the data on the European Galaxy server. Click the “Upload data” button, select “Choose remote files”, and navigate to the GTN repository.
- Configure your own Galaxy server: to include the GTN data in your Galaxy server, use the following configuration:

      - type: onedata
        id: gtn_public_onedata
        label: GTN training data
        doc: Training data from the Galaxy Training Network (powered by Onedata)
        # The following token is a public, read-only token that can be shared.
        accessToken: "MDAxY2xvY2F00aW9uIGRhdGFodWIuZWdpLmV1CjAwNmJpZGVudGlmaWVyIDIvbm1kL3Vzci00yNmI4ZTZiMDlkNDdjNGFkN2E3NTU00YzgzOGE3MjgyY2NoNTNhNS9hY3QvMGJiZmY1NWU4NDRiMWJjZGEwNmFlODViM2JmYmRhNjRjaDU00YjYKMDAxNmNpZCBkYXRhLnJlYWRvbmx5CjAwNDljaWQgZGF00YS5wYXRoID00gTHpaa1pUTTROMkl4WmpjMllXVmpOMlU00WWpreU5XWmtNV00ZpT1RKbU1ETXlZMmhoWTJReAowMDJmc2lnbmF00dXJlIIQvnXp01Oey02LnaNwEkFJAyArzhHN8SlXSYFsBbSkqdqCg"
        onezoneDomain: "datahub.egi.eu"
- Onedata clients: access the data using the public read-only access token and Oneclient (local POSIX mount) or OnedataFS (PyFilesystem interface), e.g.:

      mkdir ~/oneclient
      oneclient \
        -H plg-cyfronet-01.datahub.egi.eu \
        -t MDAxY2xvY2F00aW9uIGRhdGFodWIuZWdpLmV1CjAwNmJpZGVudGlmaWVyIDIvbm1kL3Vzci00yNmI4ZTZiMDlkNDdjNGFkN2E3NTU00YzgzOGE3MjgyY2NoNTNhNS9hY3QvMGJiZmY1NWU4NDRiMWJjZGEwNmFlODViM2JmYmRhNjRjaDU00YjYKMDAxNmNpZCBkYXRhLnJlYWRvbmx5CjAwNDljaWQgZGF00YS5wYXRoID00gTHpaa1pUTTROMkl4WmpjMllXVmpOMlU00WWpreU5XWmtNV00ZpT1RKbU1ETXlZMmhoWTJReAowMDJmc2lnbmF00dXJlIIQvnXp01Oey02LnaNwEkFJAyArzhHN8SlXSYFsBbSkqdqCg \
        ~/oneclient
      ls ~/oneclient/GTN\ data
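Once the space is mounted with Oneclient, it behaves like an ordinary directory, so standard file tools work on it. As a minimal sketch, assuming the mount point `~/oneclient` and the `GTN data` directory from the Oneclient example, the following Python function counts the files and sums their sizes. The function name `summarize_tree` is hypothetical and is not part of any Onedata tooling.

```python
import os


def summarize_tree(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Count only regular files; this skips e.g. broken symlinks.
            if os.path.isfile(path):
                file_count += 1
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes


if __name__ == "__main__":
    # Assumption: mount point and directory name taken from the Oneclient example.
    root = os.path.expanduser("~/oneclient/GTN data")
    count, size = summarize_tree(root)
    print(f"{count} files, {size / 1e9:.1f} GB")
```

Walking the whole tree reads metadata for every file over the network, so on a remote mount this may take noticeably longer than on a local disk.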
What is the GTN Downloader?
The GTN Downloader makes it easier for users to access and organize data from the Galaxy Training Network (GTN).
It is a Python script that automates the download of data from GTN tutorials. It goes through the tutorials in the GTN repository, finds data-library.yaml files, and creates a structured directory based on the tutorial names and file contents.
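The discovery step described here can be sketched as follows. This is an illustration, not the GTN Downloader's actual code: it assumes a local checkout of the GTN repository with the layout `topics/<topic>/tutorials/<tutorial>/data-library.yaml`, and the function names `find_data_libraries` and `plan_directories` are hypothetical. It only locates the files and plans one destination directory per tutorial; parsing the YAML contents and downloading are left out.

```python
from pathlib import Path


def find_data_libraries(repo_root):
    """Map '<topic>/<tutorial>' to the data-library.yaml file found there."""
    repo_root = Path(repo_root)
    found = {}
    for yaml_path in sorted(repo_root.rglob("data-library.yaml")):
        rel = yaml_path.relative_to(repo_root).parts
        # Assumed layout: topics/<topic>/tutorials/<tutorial>/data-library.yaml
        if len(rel) == 5 and rel[0] == "topics" and rel[2] == "tutorials":
            found[f"{rel[1]}/{rel[3]}"] = yaml_path
    return found


def plan_directories(repo_root, dest_root):
    """Return the destination directory planned for each tutorial."""
    dest_root = Path(dest_root)
    return {key: dest_root / key for key in find_data_libraries(repo_root)}
```

For example, `plan_directories("training-material", "gtn-data")` would return a dictionary keyed by topic and tutorial name, which a download step could then iterate over; both paths in that call are placeholders.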
Key Features:
- Automated Data Download: The script finds data-library.yaml files in the GTN repository and downloads the associated data files.
- Structured Organization: It creates directories based on the tutorial names and the information in the data-library.yaml files, so the files stay organized.
- Download Summary: It generates a download-summary.tsv file, which includes metadata about the downloaded files, a download report (error, success, already downloaded), and the overall size of the files.
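To make the download summary more concrete, here is a small sketch of how such a TSV report could be written. Only the file name download-summary.tsv and the three status values come from the description above; the column names, the record layout, and the function `write_summary` are assumptions made for illustration and may not match the real script's output.

```python
import csv

# Status values named in the feature description.
STATUSES = ("success", "error", "already downloaded")

# Hypothetical column layout; the real summary file may differ.
COLUMNS = ["tutorial", "url", "status", "size_bytes"]


def write_summary(records, path="download-summary.tsv"):
    """Write one TSV row per file and return the total size of available files."""
    total_bytes = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(COLUMNS)
        for rec in records:
            if rec["status"] not in STATUSES:
                raise ValueError(f"unknown status: {rec['status']!r}")
            # Failed downloads contribute nothing to the overall size.
            size = 0 if rec["status"] == "error" else rec["size_bytes"]
            total_bytes += size
            writer.writerow([rec["tutorial"], rec["url"], rec["status"], size])
    return total_bytes
```

Using the `csv` module with a tab delimiter, rather than joining strings by hand, keeps the file valid even if a URL or tutorial name contains unusual characters.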
Seamless Integration with Onedata
In addition to local downloads, the GTN Downloader can upload data to Onedata, a distributed data management platform. This integration ensures that the latest GTN data is always available to users.
Automated Workflow with GitHub CI/CD:
- Automated Workflow: A GitHub Actions workflow runs once a week on weekends to download the latest data from the GTN tutorials and upload it to Onedata.
- Environment Setup: The workflow sets up necessary environment variables and installs dependencies, including Oneclient, the Onedata POSIX client.
- Data Upload: After downloading the data, the workflow uploads it to Onedata, making it publicly accessible.
Funding
These organisations or grants provided funding support for the development of this resource