Python - Globbing
Overview
Questions:
- How can I collect a list of files?
Objectives:
- Use glob to collect a list of files
- Learn about the potential pitfalls of glob
Time estimation: 15 minutes
Level: Intermediate
Published: Apr 25, 2022
Last modification: Feb 13, 2023
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00090
Revision: 3
Best viewed in a Jupyter Notebook
This tutorial is best viewed in a Jupyter notebook! You can load this notebook in one of the following ways:
Run on the GTN with JupyterLite (in-browser computations)
Launching the notebook in Jupyter in Galaxy
- Instructions to Launch JupyterLab
- Open a Terminal in JupyterLab with File -> New -> Terminal
- Run
wget https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-glob/data-science-python-glob.ipynb
- Select the notebook that appears in the list of files on the left.
Downloading the notebook
- Right click one of these links: Jupyter Notebook (With Solutions), Jupyter Notebook (Without Solutions)
- Save Link As..
Globbing is the term used in computer science when we have a bunch of files and we want to list all of them matching some pattern.
Setup
We’ll start by creating some files for use in the rest of this tutorial:
import os
import subprocess

dirs = ['a', 'a/b', 'c', 'c/e', 'd', '.']
files = ['a.txt', 'a.csv', 'b.csv', 'b.txt', 'e.glm']

for d in dirs:
    # Create some directories
    os.makedirs(d, exist_ok=True)
    # Create some files
    for f in files:
        subprocess.check_output(['touch', os.path.join(d, f)])
Now we should have a pretty full folder!
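The touch command used above is not available on every system (a plain Windows shell, for example). As a hedged alternative, the sketch below builds the same layout using only pathlib. The make_demo_tree helper and the use of a temporary directory are assumptions for illustration, not part of the original setup.

```python
import tempfile
from pathlib import Path

DIRS = ['a', 'a/b', 'c', 'c/e', 'd', '.']
FILES = ['a.txt', 'a.csv', 'b.csv', 'b.txt', 'e.glm']


def make_demo_tree(root):
    """Hypothetical helper: create the tutorial's folders and empty files under root."""
    root = Path(root)
    for d in DIRS:
        folder = root / d
        folder.mkdir(parents=True, exist_ok=True)
        for f in FILES:
            # Path.touch() creates an empty file without calling an external command
            (folder / f).touch()
    return root


with tempfile.TemporaryDirectory() as tmp:
    demo_root = make_demo_tree(tmp)
    # Collect every created file as a path relative to the demo root
    created = sorted(
        p.relative_to(demo_root).as_posix()
        for p in demo_root.rglob('*')
        if p.is_file()
    )

print(len(created))
```

To follow the rest of the tutorial with this variant, you could call make_demo_tree('.') in an empty working directory instead of using a temporary one.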
Finding Files
We can use the glob module to find files:
import glob
print(glob.glob('*.csv'))
print(glob.glob('*.txt'))
Here we use an asterisk (*) as a wildcard: it matches any run of characters within a name, but it does not cross into folders. The two calls above list all csv and txt files in the current directory. This is great for finding files that match a pattern.
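To make the "not into folders" behaviour concrete, here is a minimal, self-contained sketch. It works in a temporary directory so it does not depend on the files created earlier; the file names top.csv, top.txt and sub/nested.csv are made up for this example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # One CSV at the top level, one inside a subfolder
    os.makedirs(os.path.join(tmp, 'sub'))
    for name in ['top.csv', 'top.txt', os.path.join('sub', 'nested.csv')]:
        open(os.path.join(tmp, name), 'w').close()

    # glob.escape() protects any special characters in the temporary path itself
    matches = glob.glob(os.path.join(glob.escape(tmp), '*.csv'))
    top_level_csv = sorted(os.path.relpath(m, tmp) for m in matches)

print(top_level_csv)
```

Only top.csv should be listed: the single asterisk never descends into sub to find nested.csv.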
An asterisk can appear anywhere in the pattern; it does not have to stand in for the filename portion:
print(glob.glob('a*'))
Here we even see a third entry: the directory a, because glob matches directories as well as files.
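If you only want files, one possible approach is to filter the results with os.path.isfile. The sketch below assumes a small temporary layout (a.txt, a.csv and a folder named a) rather than the tutorial's working directory.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Two files and one directory, all starting with 'a'
    os.makedirs(os.path.join(tmp, 'a'))
    for name in ['a.txt', 'a.csv']:
        open(os.path.join(tmp, name), 'w').close()

    everything = glob.glob(os.path.join(glob.escape(tmp), 'a*'))

    # Split the matches into regular files and directories
    files_only = sorted(os.path.relpath(p, tmp) for p in everything if os.path.isfile(p))
    dirs_only = sorted(os.path.relpath(p, tmp) for p in everything if os.path.isdir(p))

print(files_only)
print(dirs_only)
```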
Finding files in directories
Until now we’ve found only files in a single top level directory, but what if we wanted to find files in subdirectories?
Only need a single directory? Just include that!
print(glob.glob('a/*.csv'))
But if you need more levels, or want to look in all folders, then you need the double wildcard! With two asterisks (**) and recursive=True we can search recursively through directories for files:
print(glob.glob('**/a.csv', recursive=True))
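One detail worth checking for yourself: in Python's glob module, ** only descends through several directory levels when recursive=True is passed; without it, ** behaves like a single *. The sketch below compares the two on a small temporary tree (the folder names are chosen for this example only).

```python
import glob
import os
import tempfile


def relative(paths, root):
    """Return sorted paths relative to root, using forward slashes for readability."""
    return sorted(os.path.relpath(p, root).replace(os.sep, '/') for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    # Put an a.csv at the top level, one level down, and two levels down
    for d in ['a', 'a/b', 'c']:
        os.makedirs(os.path.join(tmp, d), exist_ok=True)
    for d in ['.', 'a', 'a/b', 'c']:
        open(os.path.join(tmp, d, 'a.csv'), 'w').close()

    pattern = os.path.join(glob.escape(tmp), '**', 'a.csv')

    # Without recursive=True, '**' matches exactly one directory level
    shallow = relative(glob.glob(pattern), tmp)
    # With recursive=True, '**' matches zero or more directory levels
    deep = relative(glob.glob(pattern, recursive=True), tmp)

print(shallow)
print(deep)
```

The shallow list misses both the top-level a.csv and the one in a/b, while the recursive list contains all four.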
Exercise
Question: Where in the world is the CSV?
- How would you find all .csv files?
- How would you find all .txt files?
- How would you find all files starting with the letter ‘e’?
glob.glob('**/*.csv', recursive=True)
glob.glob('**/*.txt', recursive=True)
glob.glob('**/e*', recursive=True)
# Try things out here!
Pitfalls
Some analyses (especially simulations) can depend on the order in which input data is supplied or sorted. This was recently seen in Neupane et al. 2019, where the data files were sorted one way on Windows and another way on Linux, giving different results for the same code and the same datasets! Yikes!
If you know your analyses are dependent on file ordering, then you can use sorted() to make sure the data is provided in a uniform way every time.
print(sorted(glob.glob('**/a.csv', recursive=True)))
If you’re not sure whether your results depend on input order, you can sort anyway. Or better yet, randomise the list of inputs to check that your code behaves properly in any scenario.
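As a rough sketch of that randomisation check, the example below shuffles a list of paths several times and compares each result with the result for the sorted list. The analyse function is a hypothetical stand-in for your real analysis, and the path list is hard-coded here instead of coming from glob.

```python
import random


def analyse(paths):
    """Hypothetical stand-in for a real analysis: total length of all path names."""
    return sum(len(p) for p in paths)


# Example input, in the arbitrary order a file system might return it
paths = ['c/a.csv', 'a/a.csv', 'd/a.csv', 'a/b/a.csv']

# Reference result using a fixed, sorted order
baseline = analyse(sorted(paths))

rng = random.Random(0)  # fixed seed so the check itself is reproducible
order_independent = True
for _ in range(10):
    shuffled = paths[:]      # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    if analyse(shuffled) != baseline:
        order_independent = False

print(order_independent)
```

If this ever prints False for your own analysis function, the results depend on input order, and sorting the glob output before use is the safer choice.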