condor: can't find address of local schedd


I'm trying to submit my condor job but it keeps giving me an error saying:

ERROR: Can't find address of local schedd

I'm a beginner condor user and I'm not quite sure what this means.

Also, when I type condor_q I get the following error message:

Error: Can't find address for schedd (name)

Extra Info: You probably saw this error because the condor_schedd is not  running on the machine you are trying to query. If the condor_schedd is not  running, the Condor system will not be able to find an address and port to  connect to and satisfy this request. Please make sure the Condor daemons are  running and try again.

  Extra Info: If the condor_schedd is running on the machine you are trying to  query and you still see the error, the most likely cause is that you have  setup a personal Condor, you have not defined SCHEDD_NAME in your  condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE  setting. You must define either or both of those settings in your config  file, or you must use the -name option to condor_q. Please see the Condor  manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.

Interestingly, condor_status works just fine (I can see the status of all the clusters).

I did some research, and it suggests that I need to use a public directory in order to access it. Is there a specific directory for Condor submissions and queues?


There are 4 answers

1
user4992136 On

Check whether the Condor scheduler (condor_schedd) is running. You can use $ ps aux | grep condor to see all the condor* processes on your machine.

If condor_schedd is not running, you need to add SCHEDD to the DAEMON_LIST setting in the Condor configuration of the machine you submit from (the line that contains a list like MASTER, STARTD, NEGOTIATOR ...), and then restart Condor.
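As a rough sketch of that check, the helper below tests whether a daemon list contains SCHEDD. The function name has_schedd is hypothetical and introduced only for illustration; condor_config_val is the standard tool for printing a configuration value, but it is assumed here to be on your PATH.

```shell
# has_schedd (hypothetical helper): succeed if the comma-separated
# daemon list passed as $1 contains SCHEDD (case-insensitive).
has_schedd() {
    printf '%s\n' "$1" | tr ',' '\n' | tr -d ' \t' | grep -qix 'SCHEDD'
}

# On a real host, feed it the live value (assumes condor_config_val is on PATH):
#   has_schedd "$(condor_config_val DAEMON_LIST)" || echo "SCHEDD is not in DAEMON_LIST"
```

If the check fails, the fix is a configuration line such as DAEMON_LIST = MASTER, SCHEDD, ... followed by a restart of the Condor service.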

BTW: condor_status works OK because the COLLECTOR daemon is certainly running.

0
Charlie Parker On

For me, the cause was that you can't submit a batch job from inside an interactive job. Make sure you are on the head node.

Head node for me:

(automl-meta-learning) miranda9~/automl-meta-learning $ hostname
vision-sched.cs.illinois.edu

Compute node:

(automl-meta-learning) miranda9~/automl-meta-learning $ hostname
vision-19.cs.illinois.edu
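
A submit script could guard against this with a hostname check. This is only a sketch: on_submit_host is a hypothetical helper, and the pattern 'vision-sched.*' is an assumption based on the two hostnames shown above, so adjust it for your own cluster.

```shell
# on_submit_host (hypothetical helper): succeed if hostname $1
# matches the shell glob pattern $2.
on_submit_host() {
    case "$1" in
        $2) return 0 ;;   # unquoted on purpose so $2 is treated as a pattern
        *)  return 1 ;;
    esac
}

# Example guard before calling condor_submit (pattern is an assumption):
#   on_submit_host "$(hostname)" 'vision-sched.*' || { echo "Not on the head node" >&2; exit 1; }
```
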
0
Fernando Augusto On

I've had this problem twice, both times after a CERN system crash. The best thing to do is wait a few hours for everything to return to normal on its own, and be careful not to create bigger problems while trying to fix it!

0
alper On

This may be related to a permission error. I had the same error, and running the following commands fixed it.

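sudo mkdir -p /var/run/condor /var/lock/condor # Create these if they do not exist

# Recreate the Condor directories from scratch.
# Warning: this deletes the spool, including any jobs still in the queue.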
sudo rm -rf /var/lib/condor
sudo mkdir -p /var/lib/condor/spool/local_univ_execute
sudo mkdir -p /var/lib/condor/execute
sudo chown -R condor: /var/lib/condor
sudo chmod 1777 /var/lib/condor/spool/local_univ_execute
sudo chmod 1777 /var/lib/condor/execute

sudo mkdir -p /var/log/condor/
sudo chown -R condor: /var/log/condor
sudo chmod 1777 /var/log/condor

# Stop and kill all the Condor daemons you have running.
sudo service condor stop
sudo killall condor
sudo killall condor_procd

$ sudo service condor start # Condor should run as a system service.
$ ps auxwwww | grep condor # You should see all processes run under condor.
condor      7656  0.0  0.2  47508  4644 ?        Ss   08:43   0:00 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
root        7699  0.2  0.1  24384  3920 ?        S    08:43   0:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 126
condor      7700  0.0  0.2  47004  5436 ?        Ss   08:43   0:00 condor_shared_port -f
condor      7701  0.1  0.3  57252  6620 ?        Ss   08:43   0:00 condor_collector -f
condor      7704  0.1  0.3  48352  6816 ?        Ss   08:43   0:00 condor_startd -f
condor      7705  0.0  0.3  58052  7188 ?        Ss   08:43   0:00 condor_schedd -f
condor      7706  0.0  0.2  47500  5880 ?        Ss   08:43   0:00 condor_negotiator -f

$ condor_q # Check that condor_q now works.
-- Schedd: condor@ebloc : <127.0.0.1:9618?... @ 10/26/18 08:46:06
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
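
To avoid reading the ps listing by eye, a small filter can report which daemons are absent. This is a minimal sketch: missing_daemons is a hypothetical helper, and the daemon list is taken from the ps output above, which reflects a single-machine setup (a submit-only machine would not normally run condor_collector or condor_negotiator).

```shell
# missing_daemons (hypothetical helper): read ps-style output on stdin and
# print the name of each expected Condor daemon that does not appear in it.
missing_daemons() {
    ps_output=$(cat)
    for d in condor_master condor_schedd condor_collector condor_negotiator condor_startd; do
        printf '%s\n' "$ps_output" | grep -qF "$d" || printf '%s\n' "$d"
    done
}

# Usage on a live machine:
#   ps auxww | missing_daemons
```

If condor_schedd shows up in the output, the "Can't find address of local schedd" error is expected until that daemon is started.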