name: inverse layout: true class: center, middle, inverse
---
# Galaxy Troubleshooting
Martin Čech
Nate Coraor
last_modification
Updated:
purl
PURL
:
gxy.io/GTN:S00026
text-document
Plain-text slides
|
Tip:
press
P
to view the presenter notes |
arrow-keys
Use arrow keys to move between slides
??? Presenter notes contain extra information which might be useful if you intend to use these slides for teaching. Press `P` again to switch presenter notes off Press `C` to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other. Useful when presenting. --- In system administration... # Everything always goes wrong --- # When things go wrong, where do you begin? -- ## LOGS -- .left[Check **all** the logs] - gunicorn/gravity logs - handler logs (if not all-in-one) - Pulsar logs - nginx error and access logs - syslog/messages - authlog - browser console log --- # Most common problems - Startup problems - Disk Full, Database Disk Full - Web/UI problems - Tool failures - Job execution problems - General performance problems - Dependency issues - Other stuff --- # Startup problems -- Where do you begin? In the... -- ## LOGS Specifically, gravity/gunicorn and job handler logs. --- # Database migration .reduce90[ ```console galaxy.model.migrations.OutdatedDatabaseError: Your galaxy database has version e0e3bb173ee6, but this code expects version caa7742f7bca. To upgrade your database, run `manage_db.sh upgrade`. For more options (e.g. upgrading/downgrading to a specific version) see instructions in that file. Please remember to backup your database before migrating. ``` ] Use Ansible! This is permanently solved for you. Otherwise, upgrade as instructed. .left[If you believe this message is in error, you can:] - Check DB table `alembic_version`, column `version`. - Check folder `lib/galaxy/model/migrations/alembic/versions_gxy/` - the latest migration should match the DB. - Clean the `*.pyc` files in migrate versions folder to make sure there is no remnant from other code revisions. - See [makepyc.py in galaxyproject.galaxy](https://github.com/galaxyproject/ansible-galaxy/blob/master/files/makepyc.py). - Ansible role does this for you. --- class: left # Database migration **Downgrading** - Perform in this order: 1. Downgrade the DB with `manage_db.sh` 2. Downgrade Galaxy with `git checkout` 3. Clean `*.pyc` --- # Stuck in a restart loop .reduce90[ ```console Executing: PYTHONPATH=lib GALAXY_CONFIG_FILE=/srv/galaxy/config/galaxy.yml VIRTUAL_ENV=/srv/galaxy/venv /srv/galaxy/venv/bin/gunicorn 'galaxy.webapps.galaxy.fast_factory:factory()' --timeout 300 ... DEBUG:galaxy.app:python path is: /srv/galaxy/server, /srv/galaxy/venv/bin, /srv/galaxy/server/lib, ... ... bunch of stuff ... Executing: PYTHONPATH=lib GALAXY_CONFIG_FILE=/srv/galaxy/config/galaxy.yml VIRTUAL_ENV=/srv/galaxy/venv /srv/galaxy/venv/bin/gunicorn 'galaxy.webapps.galaxy.fast_factory:factory()' --timeout 300 ... DEBUG:galaxy.app:python path is: /srv/galaxy/server, /srv/galaxy/venv/bin, /srv/galaxy/server/lib, ... ... bunch of stuff ... Executing: PYTHONPATH=lib GALAXY_CONFIG_FILE=/srv/galaxy/config/galaxy.yml VIRTUAL_ENV=/srv/galaxy/venv /srv/galaxy/venv/bin/gunicorn 'galaxy.webapps.galaxy.fast_factory:factory()' --timeout 300 ... DEBUG:galaxy.app:python path is: /srv/galaxy/server, /srv/galaxy/venv/bin, /srv/galaxy/server/lib, ... ... ``` ] 1. Open the log *without* following 2. Scroll to the end 3. Search/scroll back to the last startup attempt 4. The lines *directly preceding* should offer some insight --- # Stuck in a restart loop The reason that Galaxy is dying might not be in the log. If it's not, then If using **systemd** then it should be in `journalctl -eu galaxy-gunicorn` If using something else, it's wherever `stderr` is being redirected to (supervisor?) Or maybe the system is killing it... (`/var/log/{messages,syslog,kern.log}`, `dmesg`) --- # Web/UI Problems -- Where do you begin? In the... -- ## LOGS Specifically: gunicorn, nginx, and browser console logs --- # 502 Bad Gateway Your reverse proxy has failed to connect to the Galaxy socket - `systemctl status galaxy-gunicorn` - Check that Galaxy is running / not in a reboot loop - Check that your proxy and Galaxy/gunicorn socket options match - For UNIX domain sockets, ensure the web server user can r/w the socket - Check proxy server logs (e.g. `/var/log/nginx`) - Check gunicorn logs for `[<PID>] [ERROR] Connection in use: ('localhost', 8080)` - Check that something is listening on the correct port (`sudo ss -tlpn 'sport = :port'` (demo!) or `sudo lsof -i :port` (demo!)) - If the culprit is gunicorn, make sure it's not an old one! --- # 504 Gateway Timeout Use `htop` (demo!) and `ps` (demo!) - `htop` process state **D** (*kernel uninterruptable sleep*)? .reduce90[ ```console $ htop -u galaxy 0[|| 1.6%] Tasks: 98, 244 thr; 1 running 1[|| 2.7%] Load average: 105.04 102.08 91.04 Mem[||||||||||||||||||||||||||||||||||||||2.28G/15.6G] Uptime: 3 days, 15:46:31 Swp[ 0K/0K] PID USER PRI NI VIRT RES SHR S CPU%▽MEM% TIME+ Command 3440 galaxy 20 0 2806M 297M 53364 D 17.2 5.2 3:11.40 python3 gunicorn ... ... ``` ] --- # 504 Gateway Timeout - `ps` process state **D**? .reduce90[ ```console $ ps uwwU galaxy USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND galaxy 3440 17.2 5.2 2873344 304128 ? D Nov10 3:11 python3 gunicorn ... ... ``` ] --- # 504 Gateway Timeout Cycling state **D**/**S**/**R** is fine, but stuck on **D** is Very Bad™ Probably IO - check filesystems, disks. Restart Galaxy .reduce70[.footnote[A later slide will cover how to investigate kernel uninterruptable sleep processes]] --- # 504 Gateway Timeout or slow UI Check load ```console $ uptime 16:08:15 up 3 days, 8:34, 8 users, load average: 0.12, 0.23, 0.24 ``` Averages are 1, 5, 15 minutes. Average should be <= number of cores (`lscpu`, (beware SMT) or just `htop`) More on troubleshooting load later --- # Where do you begin? In the... ## LOGS* -- *often I now begin in Grafana ![Grafana Load Graph](../../images/troubleshooting/grafana-load-graph.png "Grafana Load Graph") --- # 504 Gateway Timeout or slow UI - Investigate load - Web server(s) - Database server - Investigate memory usage, swapping - Investigate iowait --- # Galaxy UI is slow - [Tutorial from @mvdbeek](https://docs.galaxyproject.org/en/master/dev/finding_and_improving_slow_code.html#profiling) - Use Galaxy [heartbeat](https://github.com/galaxyproject/galaxy/blob/8c2066f07ac7bb48c59cf8378e5a852e6fb82a4e/config/galaxy.yml.sample#L1036) (not super relevant anymore thanks to...) - Use `py-spy` (demo later!) --- # Blank page or no CSS/JavaScript Serving of static content is broken. - Check browser console for 404 errors. - Check proxy error log for permission errors. - Verify that your proxy static configuration is correct. - If you have recently upgraded Galaxy or changed the GUI in some way, you will need to rebuild the client --- # Tool failures Tools can fail for a variety of reasons, some valid, some invalid. Some made up examples follow. --- # Tool missing from Galaxy - Restart Galaxy and watch the log for ```console Loaded tool id: toolshed.g2.bx.psu.edu/repos/iuc/sickle/sickle/1.33, version: 1.33 into tool panel.... ``` - After startup, check `integrated_tool_panel.xml` for ```xml <tool id="toolshed.g2.bx.psu.edu/repos/iuc/sickle/sickle/1.33" /> ``` - If it is TS tool check `shed_tool_conf.xml` for ```xml <tool file="toolshed.g2.bx.psu.edu/repos/iuc/sickle/43e081d32f90/sickle/sickle.xml" guid="toolshed.g2.bx.psu.edu/repos/iuc/sickle/sickle/1.33"> ... </tool> ``` - Multiple job handlers? Sometimes they don't all get the update. --- # Tool errors Tool stdout/stderr is available in UI under "i" icon on history dataset .left[Debugging on the filesystem:] 1. Set `cleanup_job` to `onsuccess` 2. Cause a job failure 3. Go to job working directory (find in logs or `/srv/galaxy/jobs/<hash>/<job_id>`) 4. Poke around, try running things (`srun --pty bash` considered useful) Familiarize yourself with the places Galaxy keeps things (demo?) --- # Tool errors - stderr **Galaxy considers *any* output to standard error (stderr) to be an error when no tool profile version is set by a tool.** Why would it do this????!?!!!11 In the old days, tools were bad about setting the exit code on failure, so it could not be trusted. Galaxy had no functionality to inspect output to decide on failure. --- class: left # Tool errors So what if tool stderr contains: ```console Congratulations, running SuperAwesome tool was successful! ``` What happens: job fails (!!?!?) Solutions: - If the wrapped tool uses proper exit codes, use `<tool profile="16.04">` or later to ignore stderr - Using current tool development best practices ensures this is the case - [List of tool profiles](https://docs.galaxyproject.org/en/latest/dev/schema.html#tool-profile) --- class: left # Tool errors Tool output contains: ```console Warning: Discarded 10000 lines of /input/dataset_1.dat because they looked funny ``` Maybe a problem, maybe not. Solutions: - Check tool input(s) and parameters (maybe requires some biological knowledge) - Verify input is not corrupt - User education --- class: left # Tool errors - memory errors Tool output contains one of: ```console MemoryError # Python what(): std::bad_alloc # C++ Segmentation Fault # C - but could be other problems too Killed # Linux OOM Killer ``` Solutions: - Change input sizes or params - Map/reduce? - Decrease the amount of memory the tool needs - Increase the amount of memory available to the job - Request more memory from cluster scheduler - Use job resubmission to automatically rerun with a larger memory allocation - Cross your fingers and rerun the job --- class: left # Tool errors - system errors Tool output contains: ```console open(): /input/dataset_1.dat: No such file or directory ``` Solutions: - Verify that `/input/dataset_1.dat` exists. - On node the job ran on - As the user the job ran as - Fix the filesystem error (NFS?) and rerun the job - See NFS caching errors slide later --- class: left # Tool errors - dependency problems Tool output contains: ```console sh: command not found: samtools ``` Solutions: - Verify that `tool_dependency_dir` is accessible *on the cluster, as the user running Galaxy jobs* - Verify that tool dependencies are properly installed: Galaxy UI "Admin > Manage dependencies" - Use BioContainers (Docker/Singularity) --- # Manage Dependencies An incredibly useful feature to hit unruly tool dependencies with a large hammer. ![Manage Dependencies](../../images/troubleshooting/manage-dependencies.png "Manage Dependencies") -- (demo!) --- # An aside - dependency problems on geriatric Galaxies Galaxy Tool Shed dependencies are dead.<sup>1</sup> If you don't know what Tool Shed dependencies are (lucky you) skip this slide. Don't attempt to fix Galaxy Tool Shed dependencies, just update tool and use container (or Conda) deps. Put Conda first in [dependency_resolvers_conf.xml](https://github.com/galaxyproject/galaxy/blob/dev/config/dependency_resolvers_conf.xml.sample) if you still need to support both. .reduce70[.footnote[<sup>1</sup> The Tool Shed should only be used for Galaxy Tool configuration files and wrapper scripts]] --- class: left # Tool errors - dependency problems Tool output contains: ```console foo: error while loading shared libraries: libdeepthought.so.42: cannot open shared object file: No such file or directory ``` `foo` was compiled against `libdeepthought.so.42` but it's not on the runtime linker path Solutions: - Reinstall tool dependencies ("Admin > Manage dependencies"). If it still fails<sup>2</sup>: - Determine conda package providing `libdeepthought.so` - Downgrade it to the version that provides `libdeepthought.so.42` - Use BioContainers (Docker/Singularity) .center[.reduce70[.footnote[<sup>2</sup> First, Google the error because someone named Björn has probably already found and fixed the problem.]]] ??? Galaxy ensures that the correct version(s) of tool(s) immediate dependencies are controlled, but dependencies of dependencies are up to conda. Occasionally, upgrades to these second level dependencies break existing conda packages, requiring package authors to update the first level dependencies to correct the issue. --- class: left # Tool errors - Empty green history item 1. The tool is not correctly detecting error conditions: inspect stdout/stderr 2. The tool correctly produced an empty dataset for the given params, inputs Solutions: - Fix the tool wrapper to detect errors - User education --- # Summary of types of tool failures - Input/parameter problem (user or tool wrapper author problem) - Tool wrapper bug (tool wrapper author problem) - Tool bug (tool wrapper author or tool developer problem) - Resource problem (sysadmin problem) -- Everything else: always the sysadmin's problem --- # One last word on tool errors All IUC/devteam tools in the Tool Shed have tests Use these tests (automateable with [Ephemeris](https://github.com/galaxyproject/ephemeris)!) to verify that the tool works in the basic case --- # Job execution problems --- class: smaller # Jobs aren't running: always gray Corresponds to `job.state = 'new'` or `'queued'` in database .left[Check the Galaxy server log for errors. Successful job lifecycle:] <pre class="reduce50" style="text-align: left"> galaxy.tools INFO 2023-04-20 13:51:20,736 [pN:main.1,p:273250,tN:WSGI_0] Validated and populated state for tool request (4.308 ms) galaxy.tools.actions INFO 2023-04-20 13:51:20,747 [pN:main.1,p:273250,tN:WSGI_0] Handled output named out_file1 for tool secure_hash_message_digest (0.756 ms) galaxy.tools.actions INFO 2023-04-20 13:51:20,759 [pN:main.1,p:273250,tN:WSGI_0] Added output datasets to history (12.102 ms) galaxy.tools.actions INFO 2023-04-20 13:51:20,761 [pN:main.1,p:273250,tN:WSGI_0] Setup for job Job[unflushed,tool_id=secure_hash_message_digest] complete, ready to be enqueued (1.097 ms) galaxy.tools.execute DEBUG 2023-04-20 13:51:20,761 [pN:main.1,p:273250,tN:WSGI_0] Tool secure_hash_message_digest created job None (21.968 ms) galaxy.web_stack.handlers INFO 2023-04-20 13:51:20,792 [pN:main.1,p:273250,tN:WSGI_0] (Job[id=2,tool_id=secure_hash_message_digest]) Handler '_default_' assigned using 'db-skip-locked' assignment method galaxy.tools.execute DEBUG 2023-04-20 13:51:20,797 [pN:main.1,p:273250,tN:WSGI_0] Created 1 job(s) for tool secure_hash_message_digest request (60.168 ms) galaxy.jobs.handler DEBUG 2023-04-20 13:51:20,879 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Grabbed Job(s): 2 tpv.rules.gateway INFO 2023-04-20 13:51:20,920 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] loading tpv rules from: ['https://gxy.io/tpv/db.yml', '/srv/galaxy/config/TPV_DO_NOT_TOUCH/tpv_rules_local.yml'] tpv.rules.gateway INFO 2023-04-20 13:51:21,266 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Watching for changes in file: /srv/galaxy/config/TPV_DO_NOT_TOUCH/tpv_rules_local.yml galaxy.util.watcher DEBUG 2023-04-20 13:51:21,266 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Watching for changes to file: /srv/galaxy/config/TPV_DO_NOT_TOUCH/tpv_rules_local.yml tpv.core.entities DEBUG 2023-04-20 13:51:21,354 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Ranking destinations: [runner=local_runner, ... (line truncated) galaxy.jobs.mapper DEBUG 2023-04-20 13:51:21,354 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Mapped job to destination id: local_env galaxy.jobs.handler DEBUG 2023-04-20 13:51:21,370 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Dispatching to local_runner runner galaxy.jobs DEBUG 2023-04-20 13:51:21,387 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Persisting job destination (destination id: local_env) galaxy.jobs DEBUG 2023-04-20 13:51:21,402 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Working directory for job is: /data/jobs/000/2 galaxy.jobs.runners DEBUG 2023-04-20 13:51:21,412 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Job [2] queued (41.730 ms) galaxy.jobs.handler INFO 2023-04-20 13:51:21,423 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Job dispatched galaxy.jobs DEBUG 2023-04-20 13:51:21,500 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] Job wrapper for Job [2] prepared (70.012 ms) galaxy.jobs.command_factory INFO 2023-04-20 13:51:21,508 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] Built script [/data/jobs/000/2/tool_script.sh] for tool command [python '/srv/galaxy/server/tools/filters/secure_hash_message_digest.py' --input '/data/datasets/5/9/6/dataset_596da164-5b91-4003-8699-db43afd9f26d.dat' --output '/data/jobs/000/2/outputs/galaxy_dataset_bfc51e23-4d98-4968-b9f6-a7fc970fad28.dat' --algorithm "sha256"] galaxy.jobs.runners DEBUG 2023-04-20 13:51:21,554 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] (2) command is: (multi-line output trimmed) galaxy.jobs.runners.local DEBUG 2023-04-20 13:51:21,558 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] (2) executing job script: /data/jobs/000/2/galaxy_2.sh galaxy.jobs.runners.util.process_groups DEBUG 2023-04-20 13:51:25,191 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] check_pg(): No process found in process group 587010 galaxy.jobs.runners.local DEBUG 2023-04-20 13:51:25,192 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] execution finished: /data/jobs/000/2/galaxy_2.sh galaxy.jobs DEBUG 2023-04-20 13:51:25,202 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] finish(): Moved /data/jobs/000/2/outputs/galaxy_dataset_bfc51e23-4d98-4968-b9f6-a7fc970fad28.dat to /data/datasets/b/f/c/dataset_bfc51e23-4d98-4968-b9f6-a7fc970fad28.dat galaxy.model.metadata DEBUG 2023-04-20 13:51:25,212 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] loading metadata from file for: HistoryDatasetAssociation 2 galaxy.jobs INFO 2023-04-20 13:51:25,235 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] Collecting metrics for Job 2 in /data/jobs/000/2 galaxy.jobs DEBUG 2023-04-20 13:51:25,250 [pN:handler_1,p:272701,tN:LocalRunner.work_thread-3] job_wrapper.finish for job 2 executed (51.404 ms) </pre> --- class: smaller, left # Successful job lifecycle Note `[pN:NNNN,p:PPPP,tN:NNNN]` in log messages: .reduce70[ ```console galaxy.tools.execute DEBUG 2023-04-20 13:51:20,761 [pN:main.1,p:273250,tN:WSGI_0] Tool secure_hash_message_digest created job None (21.968 ms) galaxy.jobs.mapper DEBUG 2023-04-20 13:51:21,354 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Mapped job to destination id: local_env ``` ] - `[pN:main.1,p:273250,tN:WSGI_0]`: PID 273250, web worker (not a job handler) - `[pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread]`: PID 272701, job handler (not a web worker) --- class: smaller # Successful job lifecycle Dissecting the lifecycle messages .reduce70[ ```console galaxy.tools INFO 2023-04-20 13:51:20,736 [pN:main.1,p:273250,tN:WSGI_0] Validated and populated state for tool request (4.308 ms) galaxy.tools.actions INFO 2023-04-20 13:51:20,747 [pN:main.1,p:273250,tN:WSGI_0] Handled output named out_file1 for tool secure_hash_message_digest (0.756 ms) galaxy.tools.actions INFO 2023-04-20 13:51:20,759 [pN:main.1,p:273250,tN:WSGI_0] Added output datasets to history (12.102 ms) galaxy.tools.actions INFO 2023-04-20 13:51:20,761 [pN:main.1,p:273250,tN:WSGI_0] Setup for job Job[unflushed,tool_id=secure_hash_message_digest] complete, ready to be enqueued (1.097 ms) galaxy.tools.execute DEBUG 2023-04-20 13:51:20,761 [pN:main.1,p:273250,tN:WSGI_0] Tool secure_hash_message_digest created job None (21.968 ms) galaxy.web_stack.handlers INFO 2023-04-20 13:51:20,792 [pN:main.1,p:273250,tN:WSGI_0] (Job[id=2,tool_id=secure_hash_message_digest]) Handler '_default_' assigned using 'db-skip-locked' assignment method galaxy.tools.execute DEBUG 2023-04-20 13:51:20,797 [pN:main.1,p:273250,tN:WSGI_0] Created 1 job(s) for tool secure_hash_message_digest request (60.168 ms) ``` ] - Job is assigned **Galaxy** job ID 2 - "Executed" is misleading - the Job has been created in the `job` table of the database, but is not picked up by a job handler yet. - All of this has occurred in the web worker. - If "Executed" is the last message you see, verify job assigned to a valid handler. --- class: smaller # Successful job lifecycle .reduce70[ ```console galaxy.jobs.handler DEBUG 2023-04-20 13:51:20,879 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Grabbed Job(s): 2 tpv.rules.gateway INFO 2023-04-20 13:51:20,920 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] loading tpv rules from: ['https://gxy.io/tpv/db.yml', '/srv/galaxy/config/TPV_DO_NOT_TOUCH/tpv_rules_local.yml'] tpv.rules.gateway INFO 2023-04-20 13:51:21,266 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Watching for changes in file: /srv/galaxy/config/TPV_DO_NOT_TOUCH/tpv_rules_local.yml galaxy.util.watcher DEBUG 2023-04-20 13:51:21,266 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Watching for changes to file: /srv/galaxy/config/TPV_DO_NOT_TOUCH/tpv_rules_local.yml tpv.core.entities DEBUG 2023-04-20 13:51:21,354 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Ranking destinations: [runner=local_runner, dest_name=local_env, min_accepted_cores=None, min_accepted_mem=None, min_accepted_gpus=None, max_accepted_cores=1, max_accepted_mem=None, max_accepted_gpus=None, tpv_dest_tags=<class 'tpv.core.entities.TagSetManager'> tags=[], handler_tags=None<class 'tpv.core.entities.Destination'> id=local_env, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=None, params={'tmp_dir': True}, resubmit=None, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=None, context=None, rules={}, runner=local_runner, dest_name=singularity, min_accepted_cores=None, min_accepted_mem=None, min_accepted_gpus=None, max_accepted_cores=1, max_accepted_mem=None, max_accepted_gpus=None, tpv_dest_tags=<class 'tpv.core.entities.TagSetManager'> tags=[], handler_tags=None<class 'tpv.core.entities.Destination'> id=singularity, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[{'name': 'LC_ALL', 'value': 'C'}, {'name': 'SINGULARITY_CACHEDIR', 'value': '/tmp/singularity'}, {'name': 'SINGULARITY_TMPDIR', 'value': '/tmp'}], params={'singularity_enabled': True}, resubmit=None, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=None, context=None, rules={}, runner=slurm, dest_name=slurm, min_accepted_cores=None, min_accepted_mem=None, min_accepted_gpus=None, max_accepted_cores=24, max_accepted_mem=256, max_accepted_gpus=None, tpv_dest_tags=<class 'tpv.core.entities.TagSetManager'> tags=[], handler_tags=None<class 'tpv.core.entities.Destination'> id=slurm, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = 2, max_mem = 8, max_gpus = None, env=[{'name': 'LC_ALL', 'value': 'C'}, {'name': 'SINGULARITY_CACHEDIR', 'value': '/tmp/singularity'}, {'name': 'SINGULARITY_TMPDIR', 'value': '/tmp'}], params={'singularity_enabled': True, 'native_specification': "--nodes=1 --ntasks=1 --cpus-per-task={cores} --time={params['walltime']}:00:00"}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=singularity, context={}, rules={}] for entity: <class 'tpv.core.entities.Tool'> id=default, abstract=False, cores=1, mem=4, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'walltime': 8}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=helpers.we, inherits=None, context={}, rules={} using custom function ``` ] - handler_1 "grabbed" (pulled the job from the database and assigned itself) by setting `job.handler` to its own `server_name` - The job is passed to TPV, which evaluates its rules and decides where to send the job --- class: smaller # Successful job lifecycle .reduce70[ ```console galaxy.jobs.mapper DEBUG 2023-04-20 13:51:21,354 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Mapped job to destination id: slurm_env galaxy.jobs.handler DEBUG 2023-04-20 13:51:21,370 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Dispatching to slurm_runner runner galaxy.jobs DEBUG 2023-04-20 13:51:21,387 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Persisting job destination (destination id: slurm_env) galaxy.jobs DEBUG 2023-04-20 13:51:21,402 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Working directory for job is: /data/jobs/000/2 galaxy.jobs.runners DEBUG 2023-04-20 13:51:21,412 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] Job [2] queued (41.730 ms) galaxy.jobs.handler INFO 2023-04-20 13:51:21,423 [pN:handler_1,p:272701,tN:JobHandlerQueue.monitor_thread] (2) Job dispatched ``` ] - `(2)` at beginning of job log messages is **Galaxy** job ID - Job is mapped to job environment `slurm_env` - Job is dispatched to job runner plugin `slurm_runner` - The job working directory is created --- class: smaller # Successful job lifecycle .reduce70[ ```console galaxy.jobs DEBUG 2023-04-20 13:51:21,500 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] Job wrapper for Job [2] prepared (70.012 ms) galaxy.jobs.command_factory INFO 2023-04-20 13:51:21,508 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] Built script [/data/jobs/000/2/tool_script.sh] for tool command [python '/srv/galaxy/server/tools/filters/secure_hash_message_digest.py' --input '/data/datasets/5/9/6/dataset_596da164-5b91-4003-8699-db43afd9f26d.dat' --output '/data/jobs/000/2/outputs/galaxy_dataset_bfc51e23-4d98-4968-b9f6-a7fc970fad28.dat' --algorithm "sha256"] galaxy.jobs.runners DEBUG 2023-04-20 13:51:21,554 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] (2) command is: [command trimmed, see tool script (above), job script (below)] galaxy.jobs.runners.drmaa DEBUG 2023-04-20 13:51:21,821 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] (2) submitting file /srv/galaxy/jobs/000/2/galaxy_2.sh galaxy.jobs.runners.drmaa DEBUG 2023-04-20 13:51:21,843 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] (2) queued as 1 galaxy.jobs.runners.drmaa DEBUG 2023-04-20 13:51:21,867 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] (2) Persisting job destination (destination id: slurm_env) ``` ] - The job has been dispatched and has been assigned **Slurm** job ID 1 --- class: smaller # Successful job lifecycle ```console galaxy.jobs.runners.drmaa DEBUG 2023-04-20 13:51:22,158 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] (2/1) state change: job is queued and active galaxy.jobs.runners.drmaa DEBUG 2023-04-20 13:51:23,171 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] (2/1) state change: job is running galaxy.jobs.runners.drmaa DEBUG 2023-04-20 13:51:51,666 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] (2/1) state change: job finished normally galaxy.model.metadata DEBUG 2023-04-20 13:51:25,212 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] loading metadata from file for: HistoryDatasetAssociation 2 galaxy.jobs INFO 2023-04-20 13:51:25,235 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] Collecting metrics for Job 2 in /data/jobs/000/2 galaxy.jobs DEBUG 2023-04-20 13:51:25,250 [pN:handler_1,p:272701,tN:SlurmRunner.work_thread-3] job_wrapper.finish for job 2 executed (51.404 ms) ``` - `(2/1)` at beginning of job log messages now includes **Slurm** job ID - The job is done - Its `state` column and its output datasets' `state` columns have been updated to `ok` --- # Jobs aren't running ![Job in new state](../../images/troubleshooting/hda-new.png "Job in new state") Corresponds to `job.state = 'new'` in database Not yet picked up by the Galaxy job handler subsystem Troubleshooting depends on your job handler configuration --- Solutions: - If using single process - Ensure no `handlers` in job conf - Check Galaxy logs - If using `assign: "db-skip-locked"` - Ensure no `default` key in `handling:` - Ensure no individual `handler:` defined - Ensure `handler` column of `job` table is being set to `_default_` - Ensure handlers started with `--attach-to-pool=job-handlers` --- class: left, reduce90 # Handler ID to server_name match check Job handler assignment check: ```console $ journalctl -g ' is running$' -u 'galaxy-handler*' galaxy.web.stack INFO 2019-01-31 15:18:35,228 [MainThread] Galaxy server instance 'handler_0' is running galaxy.web.stack INFO 2019-01-31 15:18:35,228 [MainThread] Galaxy server instance 'handler_1' is running ``` ```console $ gxadmin query job-info 2 ``` .reduce70[ id | tool_id | state | handler | username | create_time | job_runner_name | job_runner_external_id --- | ------- | ----- | ----------------- | -------- | -------------------------- | --------------- | ----------------------- 21 | testing | new | **job_handler_0** | admin | 2019-02-01 14:30:20.540327 | | ] In this case, `job_handler_0` is not `handler_0` or `handler_1` (probably due to a misconfiguration in `handlers` section of job conf), so no handler will find this job. --- # An aside - unruly logs `galaxyctl follow` or `journalctl -fu 'galaxy*'` follows *all* logs, which can make it hard to find what you're lookng for when watching live logs. .left[Limit to specific services, e.g.:] `galaxyctl` | `journalctl` --- | --- `galaxyctl follow gunicorn` | `journalctl -fu galaxy-gunicorn` `galaxyctl follow handler` | `journalctl -fu 'galaxy-handler*'` `galaxyctl follow tusd` | `journalctl -fu 'galaxy-tusd'` --- # Jobs aren't running ![Job in queued state](../../images/troubleshooting/hda-queued.png "Job in queued state") Corresponds to `job.state = 'queued'` in database A handler has seen this job .left[Verify concurrency limits unmet in job conf:] - Local jobs: value of plugin `workers` attribute (default: 4) - All jobs: Entire `limits:` section: Users are *not* informed when they have reached the job limits. Exceeding disk quota also pauses jobs, but users are notified of this. --- # Jobs aren't running .left[Check database for:] - Destination ID (`gxadmin query job-info`) - Job runner external (DRM) ID (`gxadmin query job-info`) - Jobs owned by user in non-terminal state if limits in use (`gxadmin query jobs-nonterminal [user]`) - Check handler **logs** If external ID is assigned, job is queued on a cluster. Use cluster queue status tool(s) to investigate further. --- # Jobs aren't finishing ![Job in running state](../../images/troubleshooting/hda-running.png "Job in running state") Corresponds to `job.state = 'running'` in database Get job runner external (DRM) ID (`gxadmin query job-info`), use cluster queue status tool(s) to investigate further. .left[If the DRM job is finished but the Galaxy job is still "running":] - Check last job handler log message for job - Setting/collecting metadata can be slow: wait - Check handlers health (`py-spy` or `gdb` (demos later!)) --- # Overall slow processing of jobs Process state **D**? ```console USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND galaxy 3440 17.2 5.2 1696480 863536 ? D Nov10 167:46 python3 scripts/galaxy-main --server-name=handler0 ... ``` Kernel uninterruptable sleep. Probably IO. Check filesystems, disks. .reduce70[.footnote[A later slide will cover how to investigate kernel uninterruptable sleep processes]] --- # Overall slow processing of jobs Process state *not* **D**? - Use `py-spy` (demo!) - Use `gdb` (demo!) --- # General performance problems --- # Kernel uninterruptable sleep Process state **D**? ```console USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND galaxy 3440 17.2 5.2 1696480 863536 ? D Nov10 167:46 python3 scripts/galaxy-main --server-name=handler0 ... ``` Kernel uninterruptable sleep. Probably IO. Check filesystems, disks. - Investigate `/proc/<pid>/fd` (demo!) or `lsof -p <pid>` (demo!) - Use `strace` (demo!) .reduce70[.footnote[This slide covers how to investigate kernel uninterruptable sleep processes]] --- # Local or Network FS slow/down - Use `iostat` and/or `nfsiostat` to see device performance (demo!) - Use `dstat` to see pretty device performance summary (demo!) - Use `time dd` (quick and dirty) or `fio` to test write performance (demo!) - Use Wireshark ![Packet capture in Wireshark](../../images/troubleshooting/wireshark.png "Packet capture in Wireshark") --- # Local or Network FS slow/down Solutions: - Don't put the Galaxy server in that FS. Local distribution, CVMFS, ?? - Install Galaxy in two places and use [Pulsar Embedded Mode](https://github.com/galaxyproject/galaxy/blob/3a90b833d1169100b07ba64332886a817aebb55d/lib/galaxy/config/sample/job_conf.sample.yml#L319) with [file actions](https://pulsar.readthedocs.io/en/latest/galaxy_conf.html#data-staging) to rewrite paths<sup>3</sup> - Get a better network FS .reduce70[.footnote[<sup>3</sup> Ply @jmchilton with McDonalds hashbrowns for more Pulsar Embedded Mode documentation]] --- # NFS caching errors Galaxy gets "No such file or directory" for files that exist. NFS attribute caching is to blame. Set: ```yaml galaxy: retry_job_output_collection: 5 ``` [retry_job_output_collection](https://github.com/galaxyproject/galaxy/blob/476b1948190436664581cfe56c8c1c1dcc0b2370/lib/galaxy/jobs/__init__.py#L1469) --- # Database problems Slow queries, high load, etc. - Use `EXPLAIN ANALYZE` (demo!) - [Example of the "jobs ready to run" query](https://gist.github.com/natefoo/da46bea8136ba67d65f616c86a27c454) - `database_engine_option_echo` (but warning, extremely verbose) - `slow_query_log_threshold` logs to Galaxy log file - `sentry_sloreq_threshold` if using Sentry - [Postgres EXPLAIN Visualizer (PEV)](https://tatiyants.com/pev/#/plans) considered useful ([demo data](https://gist.github.com/hexylena/467c7726b5ab9e18c47080893dbc072e)) - Use `VACUUM ANALYZE` - `gxadmin query pg-*` commands Increase `shared_buffers`. 2GB on Main (16GB of memory on VM). Better: [Use PGTune!](https://pgtune.leopard.in.ua/) --- # Other stuff --- # error: [Errno 32] Broken pipe This error is not indicative of any kind of failure. It just means that the client closed a connection before the server finished sending a response. --- # Other stuff - Galaxy server process (not jobs): Run Galaxy in a cgroup with a memory limit - `scripts/helper.py` and `scripts/secret_decoder_ring.py` can encode/decode IDs from UI - `scripts/helper.py` can also fail jobs with a notice to the user --- # Job failures "Unable to run job due to a misconfiguration of the Galaxy job running system. Please contact a site administrator." There is a traceback in the Galaxy log. Go find it. --- # Node failures ```console $ sinfo -Nel Fri Nov 11 09:50:01 2016 NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON localhost 1 debug* down 2 2:1:1 1 0 1 (null) Node unexpectedly rebooted ``` Figure out why it rebooted and: ```console $ sudo scontrol update nodename=localhost state=resume ``` --- # User over quota Instruct user (or use impersonate with consent) to check for *deleted* but not *purged* histories In the Admin/Users menu you can run `Recalculate Disk Usage` on a certain user There is also a command line version: `scripts/set_user_disk_usage.py` --- # Any troubling Galaxy situations you have? --- # Where to get help - [Admins Chat](https://matrix.to/#/#galaxyproject_admins:gitter.im) - [Galaxy Help](https://help.galaxyproject.org/) - [Hub Support page](https://galaxyproject.org/support/) --- ## Thank You! This material is the result of a collaborative work. Thanks to the [Galaxy Training Network](https://training.galaxyproject.org) and all the contributors!
Authors:
Martin Čech
Nate Coraor
Tutorial Content is licensed under
Creative Commons Attribution 4.0 International License
.