Bulk matrix to ESet | Creating the bulk RNA-seq dataset for deconvolution

Overview
Questions:
  • Where can I find good quality RNA-seq datasets?

  • How can I reformat and manipulate these downloads to create the right format for MuSiC?

Objectives:
  • You will retrieve raw data from the EMBL-EBI Expression Atlas.

  • You will manipulate the metadata and matrix files.

  • You will combine the metadata and matrix files into an ESet object for MuSiC deconvolution.

  • You will create multiple ESet objects - both combined and separated out by disease phenotype for your bulk dataset.

Requirements:
Time estimation: 1 hour
Supporting Materials:
Published: Jan 20, 2023
Last modification: Feb 13, 2024
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00242
Revision: 8

After completing the MuSiC deconvolution tutorial (Wang et al. 2019), you are hopefully excited to apply this analysis to data of your choice. Annoyingly, getting data into the right format is often what prevents us from successfully applying analyses. This tutorial is all about reformatting a raw bulk RNA-seq dataset pulled from a public resource, the EMBL-EBI Expression Atlas (Moreno et al. 2021). Let’s get started!

Agenda

In this tutorial, we will cover:

  1. Metadata Manipulation
    1. Find the data
  2. Manipulate the expression matrix
  3. Construct Expression Set Objects
  4. Conclusion

Metadata Manipulation

Just as in our scRNA-seq dataset preparation tutorial, we will tackle the metadata first. We roughly follow the same concept as the previous bulk deconvolution tutorial, comparing human pancreas data across a disease variable (type II diabetes vs healthy), but this time using public datasets.

Find the data

We explored the Expression Atlas, browsing experiments in order to find the bulk RNA-seq pancreas dataset (Segerstolpe et al. 2016). You can explore this dataset here using their browser. These samples come from 7 healthy individuals and 4 individuals with Type II diabetes, so we will create reference Expression Set objects for the total as well as for each phenotype separately, as you may have reason to do this in your analysis (or you may not!). This dataset comes from the same lab that produced our scRNA-seq reference, so we should get quite accurate results.

Hands-on: Data upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo or from the shared data library (GTN - Material -> single-cell -> Bulk matrix to ESet | Creating the bulk RNA-seq dataset for deconvolution):

    https://zenodo.org/record/7319173/files/E-MTAB-5060-experiment-design.tsv
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel
    • Select Paste/Fetch Data
    • Paste the link(s) into the text field
    • Press Start
    • Close the window

  3. Rename the datasets as needed

  4. Check that the datatype is tabular

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab at the top
    • Under Assign Datatype, select tabular from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  5. Add to experiment-design the following tags #metadata #bulk #ebi

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall, generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

As before, the metadata object annoyingly has a bunch of unnecessary columns. You can examine this by clicking the eye icon in the Galaxy history. Let’s remove them!

Columns in a table where some contain run info or Sample Characteristic[age] while others are empty.

Tools are frequently updated to new versions. Your Galaxy may have multiple versions of the same tool available. By default, you will be shown the latest version of the tool. This may NOT be the same tool used in the tutorial you are accessing. Furthermore, if you use a newer tool in one step, and try using an older tool in the next step… this may fail! To ensure you use the same tool versions of a given tutorial, use the Tutorial mode feature.

  • Open your Galaxy server
  • Click on the curriculum icon on the top menu, this will open the GTN inside Galaxy.
  • Navigate to your tutorial
  • Tool names in tutorials will be blue buttons that open the correct tool for you
  • Note: this does not work for all tutorials (yet)
    (gif showing how GTN-in-Galaxy works)
  • You can click anywhere in the greyed-out area outside of the tutorial box to return to the Galaxy analytical interface
Warning: Not all browsers work!
  • We’ve had some issues with Tutorial mode on Safari for Mac users.
  • Try a different browser if you aren’t seeing the button.

Hands-on: Remove unnecessary columns
  1. Advanced Cut (Galaxy version 1.1.0) with the following parameters:
    • “File to cut”: output (Input dataset)
    • “Operation”: Discard
    • “Cut by”: fields
      • “List of Fields”: 3 5 7 8 9 10 11 12 13 15 16 17 18
    Comment

    Advanced cut works slightly differently in a workflow versus running the tool independently. Independently, there is a list and you can click through the list to note your columns, while in a workflow it appears as a text option and you put each column on a different line. The point is, each number above represents a column, so remove them!
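If it helps to see what discarding fields does to each row, here is a minimal Python sketch of the same idea. It is not the code behind the Advanced Cut tool; the helper name `discard_fields` and the placeholder cells `c1` to `c18` are invented for illustration, and only the field numbers come from the step above.

```python
# Field numbers (1-based) taken from the "List of Fields" parameter above.
DISCARD = {3, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18}


def discard_fields(line, discard=DISCARD):
    """Drop the listed 1-based fields from one tab-separated line."""
    cells = line.rstrip("\n").split("\t")
    kept = [cell for i, cell in enumerate(cells, start=1) if i not in discard]
    return "\t".join(kept)


# Toy row with 18 placeholder cells (c1..c18); not real metadata.
row = "\t".join(f"c{i}" for i in range(1, 19))
kept_row = discard_fields(row)
print(kept_row)
```

With 18 input fields and 13 discarded, five fields remain (1, 2, 4, 6 and 14), which is consistent with the five-column table you should end up with.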

Now let’s take care of the excessively wordy header titles - and note that oftentimes various programmes struggle with titles or cells that have any spaces ` ` in them, so removing those now often saves hassle later.

Comment

You might also remember in the MuSiC tutorial that we can analyse numeric parameters in the metadata (in that case, HbA1c content). Reformatting to ensure numerical values in these columns (i.e. taking the ` years` out of an age cell) is helpful then too.

Hands-on: Fixing titles
  1. Regex Find And Replace (Galaxy version 1.0.2) with the following parameters:
    • “Select lines from”: output (output of Advanced Cut tool)
    • In “Check”:
      • “Insert Check”
        • “Find Regex”: Sample Characteristic\[age\]
        • “Replacement”: Age
      • “Insert Check”
        • “Find Regex”: year
        • “Replacement”: (leave empty)
      • “Insert Check”
        • “Find Regex”: Sample Characteristic\[body mass index\]
        • “Replacement”: BMI
      • “Insert Check”
        • “Find Regex”: Sample Characteristic\[disease\]
        • “Replacement”: Disease
      • “Insert Check”
        • “Find Regex”: Sample Characteristic\[individual\]
        • “Replacement”: Individual
      • “Insert Check”
        • “Find Regex”: Sample Characteristic\[sex\]
        • “Replacement”: Sex
  2. Change the datatype to tabular

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab at the top
    • Under Assign Datatype, select tabular from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Now examine your resultant metadata file by clicking the eye icon in the Galaxy history. Better, right?

5 columns with numerical or string information on Run, Age, BMI, Disease and Sex.

Figure 1: Look at the pretty metadata
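The find-and-replace checks used to fix the titles can be sketched in Python with the standard `re` module. This is only an illustration of how the patterns behave, not the Galaxy tool itself: the helper `apply_checks` and the toy header are made up, and the empty replacement for `year` is an assumption based on the goal of leaving a plain number in the age cells.

```python
import re

# (pattern, replacement) pairs from the hands-on box, applied in order.
# The square brackets are escaped because "[" and "]" are regex syntax.
# ASSUMPTION: "year" is replaced with nothing.
CHECKS = [
    (r"Sample Characteristic\[age\]", "Age"),
    (r"year", ""),
    (r"Sample Characteristic\[body mass index\]", "BMI"),
    (r"Sample Characteristic\[disease\]", "Disease"),
    (r"Sample Characteristic\[individual\]", "Individual"),
    (r"Sample Characteristic\[sex\]", "Sex"),
]


def apply_checks(line):
    """Run every check over a single line, one after the other."""
    for pattern, replacement in CHECKS:
        line = re.sub(pattern, replacement, line)
    return line


# Toy header line laid out like the trimmed metadata file.
header = "\t".join([
    "Run",
    "Sample Characteristic[age]",
    "Sample Characteristic[body mass index]",
    "Sample Characteristic[disease]",
    "Sample Characteristic[sex]",
])
fixed_header = apply_checks(header)
print(fixed_header)
```

Note that a check only removes the text it matches, so any space that sat next to the matched word stays in the cell.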

This is ready to go, so now we’ll reformat the matrix!

Manipulate the expression matrix

Let’s upload the dataset.

Hands-on: Data upload
  1. Import the files from Zenodo or from the shared data library (GTN - Material -> single-cell -> Bulk matrix to ESet | Creating the bulk RNA-seq dataset for deconvolution):

    https://zenodo.org/record/7319173/files/E-MTAB-5060-raw-counts.tsv
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel
    • Select Paste/Fetch Data
    • Paste the link(s) into the text field
    • Press Start
    • Close the window

  2. Rename the dataset as needed
  3. Check that the datatype is tabular

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab at the top
    • Under Assign Datatype, select tabular from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  4. Add to raw-counts the following tags #matrix #bulk #ebi

Now examine your raw counts file by clicking the eye icon in the Galaxy history.

Question
  1. Are samples in the rows or columns?
Column 1 contains Gene ID followed by many lines of ENSG####. Column 2 contains the gene names. The following columns contain numerous iterations of ERR#####.

Figure 2: Gene info
  1. By examining the matrix, you can find that genes are the rows while samples are the columns.

While it’s awesome that there’s a gene name column, unfortunately the gene names will be duplicated: different ENS IDs can refer to the same Gene Name. This is going to be a problem later, so we need to get the matrix into a format where we can collapse the ENS IDs, just as we did previously in the scRNA-seq data reference preparation. Sadly, we’ll start by removing the column of gene names to prepare for the ENS ID collapse.

Hands-on: Remove gene names column
  1. Remove columns (Galaxy version 1.0) with the following parameters:
    • “Tabular file”: raw-counts (Input dataset)
    • In “Select Columns”:
      • “Insert Select Columns”
        • “Header name”: Gene Name
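As a rough picture of what removing a column by its header name involves, the sketch below does the same on a tiny tab-separated table using Python's `csv` module. The function `drop_column`, the gene IDs and the counts are all invented for illustration; this is not how the Galaxy tool is implemented.

```python
import csv
import io


def drop_column(tsv_text, header_name):
    """Return the table with the column named header_name removed."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    idx = rows[0].index(header_name)  # raises ValueError if the header is absent
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row[:idx] + row[idx + 1:])
    return out.getvalue()


# Toy matrix: genes in rows, samples in columns (all values invented).
toy = (
    "Gene ID\tGene Name\tS1\tS2\n"
    "ENSG_A\tGENE1\t5\t0\n"
    "ENSG_B\tGENE1\t2\t3\n"
)
trimmed = drop_column(toy, "Gene Name")
print(trimmed)
```

The toy table also shows the duplication problem: two different IDs share the name GENE1, so the name column alone cannot identify a row.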

Now that your data has ENS IDs as rows and samples as columns, you can apply the handy ENS ID collapsing workflow as we did in the scRNA-seq reference. If you already imported this workflow during the first tutorial, you can use it again now.

Hands-on: Convert from Ensembl to GeneSymbol using workflow
  1. Import this workflow.

    • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
    • Click on Import at the top-right of the screen
    • Provide your workflow
      • Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”
      • Option 2: Upload the workflow file in the box labelled “Archived Workflow File”
    • Click the Import workflow button

    Below is a short video demonstrating how to import a workflow from GitHub using this procedure:

    Video: Importing a workflow from URL

  2. Run the workflow on your sample with the following parameters:

    • “Organism”: Human
    • “Expression Matrix (Gene Rows)”: output (output of Remove columns tool)
    • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
    • Click on the Run workflow button next to your workflow
    • Configure the workflow as needed
    • Click the Run Workflow button at the top-right of the screen
    • You may have to refresh your history to see the queued jobs

The output will likely be called Text transformation and will look like this:

Alphabetised gene symbols appear in column one with integers in the following columns corresponding to samples.

Figure 3: Output of the ENS ID collapsing workflow for bulk dataset
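To make the idea of collapsing concrete, here is a small Python sketch that merges rows whose IDs map to the same gene symbol. It rests on loudly stated assumptions: duplicates are combined by summing their counts, unmapped IDs are dropped, and the mapping table is supplied by hand. The workflow itself may handle any of these differently, and the function name and toy values are invented.

```python
def collapse_by_symbol(rows, id_to_symbol):
    """Merge count rows that share a gene symbol.

    rows: list of (ensembl_id, [counts per sample])
    id_to_symbol: dict mapping an ID to a gene symbol
    ASSUMPTION: duplicates are summed and unmapped IDs are skipped.
    """
    totals = {}
    for ens_id, counts in rows:
        symbol = id_to_symbol.get(ens_id)
        if symbol is None:
            continue  # no symbol known for this ID
        if symbol in totals:
            totals[symbol] = [a + b for a, b in zip(totals[symbol], counts)]
        else:
            totals[symbol] = list(counts)
    # Sort by symbol so the output is alphabetised.
    return dict(sorted(totals.items()))


# Toy data: two IDs share GENE1, one ID has no mapping.
rows = [
    ("ENSG_C", [1, 1]),
    ("ENSG_A", [5, 0]),
    ("ENSG_B", [2, 3]),
    ("ENSG_X", [9, 9]),
]
id_to_symbol = {"ENSG_A": "GENE1", "ENSG_B": "GENE1", "ENSG_C": "GENE2"}
collapsed = collapse_by_symbol(rows, id_to_symbol)
print(collapsed)
```

After collapsing, each gene symbol appears once, which is what lets the symbols serve as unique row names in the next step.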

Success! You’ve now prepared your metadata and your matrix. It’s time to put it together to create the Expression Set objects needed for MuSiC deconvolution.

Construct Expression Set Objects

We have three more tasks to do: first, we need to create the expression set object with all the phenotypes combined. Then, we will create the two objects we actually need - one for healthy and one for diseased.

Hands-on: Creating the combined object
  1. Construct Expression Set Object (Galaxy version 0.1.1+galaxy4) with the following parameters:
    • “Assay Data”: out_file #matrix (output of Text transformation tool)
    • “Phenotype Data”: out_file1 #metadata (output of Regex Find And Replace tool)
  2. Remove the #metadata #matrix tags from the output RData ESet Object
Question
  1. How many genes are in your object?
  2. How many samples?
  3. What metadata categories are there?

The trick with all of these questions is to examine the General info output of the Construct Expression Set Object tool by clicking its eye icon.

Lines showing ExpressionSet; assayData: 34997 features, 7 samples; protocolData: none; phenoData; sampleNames: ERR### (7 total); varLabels: Age BMI Disease Sex; varMetadata: labelDescription; and 3 more useless lines.

Figure 4: General info output
  1. There are 34997 features, which are the genes.
  2. There are 7 samples.
  3. The metadata categories are the same you prepared earlier, shown here in a category of phenoData: Age BMI Disease Sex
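An Expression Set pairs the matrix with the metadata sample by sample, so if your own data gives an unexpected sample count it is worth comparing the sample IDs in the two files. The Python sketch below is one hypothetical way to do that before building the object; the helper `compare_samples` and the toy IDs are invented, and nothing here describes what the Galaxy tool checks internally.

```python
def compare_samples(matrix_header, metadata_rows):
    """Report sample IDs present in one file but not the other.

    matrix_header: header cells of the matrix (first cell is the gene column)
    metadata_rows: metadata rows without the header (first cell is the sample ID)
    """
    matrix_samples = set(matrix_header[1:])
    metadata_samples = {row[0] for row in metadata_rows}
    only_in_matrix = sorted(matrix_samples - metadata_samples)
    only_in_metadata = sorted(metadata_samples - matrix_samples)
    return only_in_matrix, only_in_metadata


# Toy inputs with invented sample IDs and values.
matrix_header = ["Gene", "S1", "S2", "S3"]
metadata_rows = [
    ["S1", "43", "30.8", "normal", "male"],
    ["S2", "57", "24.2", "type II diabetes mellitus", "female"],
]
only_in_matrix, only_in_metadata = compare_samples(matrix_header, metadata_rows)
print(only_in_matrix, only_in_metadata)
```

Two empty lists mean every sample column has a metadata row and vice versa.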
Hands-on: Creating the disease-only object
  1. Manipulate Expression Set Object (Galaxy version 0.1.1+galaxy4) with the following parameters:
    • “Expression Set Dataset”: out_rds (output of Construct Expression Set Object tool)
    • “Concatenate other Expression Set objects?”: No
    • “Subset the dataset?”: Yes
      • “By”: Filter Samples and Genes by Phenotype Values
        • In “Filter Samples by Condition”:
          • “Insert Filter Samples by Condition”
            • “Name of phenotype column”: Disease
            • “List of values in this column to filter for, comma-delimited”: type II diabetes mellitus
  2. Add the tag #T2D to the output RData ESet Object

You can either re-run this tool or set it up again to create the healthy-only object.

Hands-on: Creating the healthy-only object
  1. Manipulate Expression Set Object (Galaxy version 0.1.1+galaxy4) with the following parameters:
    • “Expression Set Dataset”: out_rds (output of Construct Expression Set Object tool)
    • “Concatenate other Expression Set objects?”: No
    • “Subset the dataset?”: Yes
      • “By”: Filter Samples and Genes by Phenotype Values
        • In “Filter Samples by Condition”:
          • “Insert Filter Samples by Condition”
            • “Name of phenotype column”: Disease
            • “List of values in this column to filter for, comma-delimited”: normal
  2. Add the tag #healthy to the output RData ESet Object
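The two subsetting steps above boil down to keeping the samples whose Disease value matches a given list. A minimal Python sketch of that selection follows; the helper `samples_matching`, the sample IDs and the whitespace trimming around commas are assumptions for illustration, while the column name and the two phenotype values come from the steps above.

```python
def samples_matching(metadata, column, values):
    """Return the Run IDs whose `column` equals one of the listed values.

    values is a comma-delimited string, as in the tool parameter.
    ASSUMPTION: whitespace around each listed value is ignored.
    """
    wanted = {value.strip() for value in values.split(",")}
    return [row["Run"] for row in metadata if row[column] in wanted]


# Toy metadata with invented sample IDs.
metadata = [
    {"Run": "S1", "Disease": "normal"},
    {"Run": "S2", "Disease": "type II diabetes mellitus"},
    {"Run": "S3", "Disease": "type II diabetes mellitus"},
]
t2d = samples_matching(metadata, "Disease", "type II diabetes mellitus")
healthy = samples_matching(metadata, "Disease", "normal")
print(t2d, healthy)
```

The match is exact, so the value you type must be spelled exactly as it appears in the Disease column.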

Conclusion

Congrats! You have successfully reformatted the RNA-seq samples into ESet objects: a combined object plus two consisting of disease-only or healthy-only samples. You’re ready to take all this hard work and start comparing cell compositions in the next tutorial.

You can find the workflow for generating the ESet object and the answer key history.

7 boxes in the workflow editor and a subworkflow box for converting Ensembl to GeneSymbol.

Figure 5: Workflow: Generating the bulk ESet Objects

This tutorial is part of the https://singlecell.usegalaxy.eu portal (Tekman et al. 2020).

To discuss with like-minded scientists, join our Galaxy Training Network chatspace in Slack and talk with fellow users of Galaxy single cell analysis tools on #single-cell-users

We also post new tutorials / workflows there from time to time, as well as any other news.

If you’d like to contribute ideas, requests or feedback as part of the wider community building single-cell and spatial resources within Galaxy, you can also join our Single cell & sPatial Omics Community of Practice.

You can request tools here on our Single Cell and Spatial Omics Community Tool Request Spreadsheet