Condor troubleshooting
This page collects Condor knowledge and troubleshooting help.
There are some example templates for the HTCondor plugin here. In order to properly point the configuration to your Condor resources, you need to:
- Look up the name of the condor-ce endpoint in AGIS (replace the PanDA queue name in the URL with your own).
- Look up the schedd name:
$ condor_q -p <ce-endpoint> -g | grep Schedd
-- Schedd: <schedd>...
- Change the grid_resource line in the sdf template to:
grid_resource = condor <schedd> <ce-endpoint>
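The last two steps can be combined in a small helper. This is only a sketch: the function name make_grid_resource is hypothetical, and it assumes the Schedd line has the form `-- Schedd: <schedd> : <address> ...` as in the condor_q output shown on this page.

```shell
# Print the grid_resource line for the CE endpoint given as $1,
# reading `condor_q -p <ce-endpoint> -g` output on stdin.
# ASSUMPTION: the schedd name is the third whitespace-separated field
# of the first line starting with "-- Schedd:".
make_grid_resource() {
    ce_endpoint=$1
    schedd=$(awk '/^-- Schedd:/ { print $3; exit }')
    if [ -z "$schedd" ]; then
        echo "no Schedd line found in input" >&2
        return 1
    fi
    printf 'grid_resource = condor %s %s\n' "$schedd" "$ce_endpoint"
}

# Usage:
#   condor_q -p <ce-endpoint> -g | make_grid_resource <ce-endpoint>
```

Check the printed line by eye before pasting it into the sdf template.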
Harvester may stop submitting workers for many reasons, one of which is abnormal events on the condor schedd.
Many held condor jobs are a bad sign. For example:
[root@aipanda024 ~]# condor_q -nob
-- Schedd: aipanda024.cern.ch : <137.138.157.183:19696> @ 07/20/18 10:30:07
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
...
5400.0 atlpan 7/18 14:18 1+20:08:57 R 0 0.0 runpilot3-wrapper.sh -s RRC-KI-T1 -h RRC-KI-T1 -p 25443 -w https://pandaserver.cern.ch -u manag
6575.0 atlpan 7/18 20:31 0+18:31:09 R 0 0.0 runpilot3-wrapper.sh -s UKI-SCOTGRID-ECDF_MCORE_SL7 -h UKI-SCOTGRID-ECDF_MCORE_SL7 -p 25443 -w
8341.0 atlpan 7/19 06:07 0+16:03:32 R 0 0.0 runpilot3-wrapper.sh -s FMPhI-UNIBA_MCORE -h FMPhI-UNIBA-all-prod-CEs_MCORE -p 25443 -w https:/
9482.0 atlpan 7/19 09:15 1+00:17:21 R 0 1221.0 runpilot3-wrapper.sh -s BNL_PROD -h BNL_PROD-condor -p 25443 -w https://pandaserver.cern.ch -u
9662.0 atlpan 7/19 09:18 0+00:00:00 H 0 0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
9665.0 atlpan 7/19 09:18 0+05:36:20 H 0 2198.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
9666.0 atlpan 7/19 09:18 0+00:00:00 H 0 0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
9670.0 atlpan 7/19 09:18 0+00:00:00 H 0 0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
9671.0 atlpan 7/19 09:18 0+00:00:00 H 0 0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
9673.0 atlpan 7/19 09:18 0+05:35:16 H 0 2686.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
9682.0 atlpan 7/19 09:18 0+00:00:00 H 0 0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
...
9998.0 atlpan 7/19 09:20 0+00:00:00 H 0 0.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
10038.0 atlpan 7/19 09:21 1+00:15:13 R 0 733.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
10147.0 atlpan 7/19 09:24 0+05:36:20 H 0 2198.0 runpilot3-wrapper.sh -s CERN-PROD_UCORE -h CERN-PROD_UCORE -p 25443 -w https://pandaserver.cern
Since the held status of a condor job is not a final state (the job can later become idle or running), harvester treats held condor jobs as submitted workers until the timeout is reached (default 2 hours) and then cancels those workers.
Thus, too many held jobs can block new worker submission once the nQueueLimitWorkers limit is reached.
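To see at a glance how many jobs are held, the listing can be tallied by its ST column. The helper below is a sketch, not part of harvester: the name count_jobs_by_status is hypothetical, and it assumes the `condor_q -nob` column layout shown above (ID, OWNER, SUBMITTED as date and time, RUN_TIME, ST), which makes ST the sixth field.

```shell
# Count jobs per status letter (R, H, I, ...) from `condor_q -nob`
# output read on stdin. Only lines whose first field looks like a
# job ID (cluster.proc) are counted; headers and "..." are skipped.
# ASSUMPTION: ST is the 6th whitespace-separated field.
count_jobs_by_status() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage (on the schedd host):
#   condor_q -nob | count_jobs_by_status
```

A large H count compared with R suggests that held jobs are occupying the worker quota of the queue.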
Check the HoldReason of the condor job for more detail, e.g.:
[root@aipanda024 ~]# condor_q -l -constraint " JobStatus == 5" | egrep 'GridResource |HoldReason '
GridResource = "condor ce516.cern.ch ce516.cern.ch:9619"
HoldReason = "Error connecting to schedd ce516.cern.ch: SECMAN:2007:Failed to received post-auth ClassAd|AUTHENTICATE:1004:Failed to authenticate using FS"
GridResource = "condor ce507.cern.ch ce507.cern.ch:9619"
HoldReason = "CE job in status 1 put on hold by SYSTEM_PERIODIC_HOLD due to non-existent route in JOB_ROUTER_ENTRIES or route job limit."
...
Handle the issues indicated in HoldReason.
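When many jobs are held, it helps to group identical hold reasons so the most common problem is handled first. The following is a minimal sketch; the helper name tally_lines is hypothetical, and the usage line relies on the `-af` (autoformat) option of condor_q to print only the HoldReason attribute.

```shell
# Tally identical non-empty lines on stdin, most frequent first.
# Output format: "<count><TAB><line>".
tally_lines() {
    awk 'NF { c[$0]++ }
         END { for (r in c) printf "%d\t%s\n", c[r], r }' | sort -rn
}

# Usage (on the schedd host):
#   condor_q -constraint 'JobStatus == 5' -af HoldReason | tally_lines
```

Replacing HoldReason with GridResource in the usage line shows instead which CEs the held jobs were sent to.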
When condor jobs are submitted in the grid universe, communication errors with the remote CE or schedd may cause the jobs to be held.
Check the condor GridmanagerLog for details, for example /var/log/condor/GridmanagerLog.atlpan.
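The log can be large, so it is usually easier to filter it for one held job. This is a sketch under an assumption that should be checked against your own log: that job-specific lines carry the job ID in parentheses, as in `(9662.0)`. The helper name gridmanager_lines_for_job is hypothetical.

```shell
# Print the GridmanagerLog lines that mention one job.
# $1: job ID as cluster.proc (e.g. 9662.0)
# $2: path to the log file
# ASSUMPTION: job-specific log lines contain "(cluster.proc)".
# grep -F treats the pattern as a fixed string, so the dot and the
# parentheses are matched literally.
gridmanager_lines_for_job() {
    grep -F "($1)" "$2"
}

# Usage:
#   gridmanager_lines_for_job 9662.0 /var/log/condor/GridmanagerLog.atlpan | tail -n 20
```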