Hi all,

some of my jobs are failing. It happens very rarely and for no apparent reason. The log says the job got SIGKILL, although sacct just says COMPLETED. I had a job this week with this problem, and it ran without issue after restarting it. This is particularly annoying since my jobs usually take more than 1 day. I'm not exceeding my requested runtime or memory limits.
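For reference, this is roughly the kind of sacct query I use to check a job afterwards (the field list is only an example, insert the job ID in question):

sacct -j <jobid> --format=JobID,JobName,State,ExitCode,MaxRSS,ReqMem,Elapsed,NodeList

For the failing jobs, State shows COMPLETED even though the application log reports a SIGKILL.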
I just had another one like it. I restarted it and believe it will run through without issue. I attached what sacct reported. It failed on ncm0217. Anyone had issues like this?

Best
Johannes

--
M.Sc. Johannes Sauer
Researcher
Institut fuer Nachrichtentechnik
RWTH Aachen University
Melatener Str. 23
52074 Aachen
Tel +49 241 80-27678
Fax +49 241 80-22196
sauer@ient.rwth-aachen.de
http://www.ient.rwth-aachen.de
I cannot really find anything in the logfiles that would point me in the right direction. On the master and on ncm0217:

[2019-05-24T17:33:04.169] _slurm_rpc_submit_batch_job: JobId=2204307 InitPrio=109755 usec=25672
[2019-05-24T17:33:06.654] sched: Allocate JobId=2204307 NodeList=ncm0217 #CPUs=1 Partition=c18m
[2019-05-24T17:33:08.055] prolog_running_decr: Configuration for JobId=2204307 is complete
[2019-05-24T17:33:08.205] task_p_slurmd_batch_request: 2204307
[2019-05-24T17:33:08.205] task/affinity: job 2204307 CPU input mask for node: 0x000000000080
[2019-05-24T17:33:08.205] task/affinity: job 2204307 CPU final HW mask for node: 0x000000002000
[2019-05-24T17:33:08.653] _run_prolog: prolog with lock for job 2204307 ran for 0 seconds
[2019-05-24T17:33:08.677] [2204307.extern] Considering each NUMA node as a socket
[2019-05-24T17:33:08.698] [2204307.extern] task/cgroup: /slurm/uid_26982/job_2204307: alloc=10240MB mem.limit=10240MB memsw.limit=unlimited
[2019-05-24T17:33:08.715] [2204307.extern] task/cgroup: /slurm/uid_26982/job_2204307/step_extern: alloc=10240MB mem.limit=10240MB memsw.limit=unlimited
[2019-05-24T17:33:09.107] Launching batch job 2204307 for UID 26982
[2019-05-24T17:33:09.134] [2204307.batch] Considering each NUMA node as a socket
[2019-05-24T17:33:09.161] [2204307.batch] task/cgroup: /slurm/uid_26982/job_2204307: alloc=10240MB mem.limit=10240MB memsw.limit=unlimited
[2019-05-24T17:33:09.174] [2204307.batch] task/cgroup: /slurm/uid_26982/job_2204307/step_batch: alloc=10240MB mem.limit=10240MB memsw.limit=unlimited
[2019-05-24T17:33:09.224] [2204307.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-05-24T17:37:49.526] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=2204307 uid 35249
[2019-05-24T17:37:49.526] error: Security violation, REQUEST_KILL_JOB RPC for JobId=2204307 from uid 35249
[2019-05-24T17:37:49.526] _slurm_rpc_kill_job: job_str_signal() JobId=2204307 sig 9 returned Access/permission denied
[2019-05-25T12:01:24.419] [2204307.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2019-05-25T12:01:24.458] _job_complete: JobId=2204307 WEXITSTATUS 0
[2019-05-25T12:01:24.459] _job_complete: JobId=2204307 done
[2019-05-25T12:01:24.478] [2204307.batch] done with job
[2019-05-25T12:01:32.453] [2204307.extern] _oom_event_monitor: oom-kill event count: 1
[2019-05-25T12:01:32.646] [2204307.extern] done with job
[2019-05-25T12:01:41.412] epilog for job 2204307 ran for 8 seconds

Interestingly, someone tried to kill your job a few minutes after it started, but without success. Nothing else in any logs, neither in the journal nor in messages.

So, to me and to the system, it looks like the job ended normally. I do not see any sign that Slurm killed the job.
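If you want to double-check from your side, something along these lines should show a bit more (the job ID, node and uid are taken from the log above; the exact options may differ with your Slurm version, and you may not have access to the node's journal yourself):

# who owns uid 35249, i.e. who sent the denied kill request
getent passwd 35249

# per-step accounting, including the extern step that reported the oom-kill event
sacct -j 2204307 --format=JobID,State,ExitCode,DerivedExitCode,MaxRSS,ReqMem,NodeList

# kernel OOM killer activity on the node around the end of the job
ssh ncm0217 'journalctl -k --since "2019-05-25 11:50" | grep -i oom'

The oom-kill event in the extern step might be harmless, but it could also mean the kernel killed a process inside the job's cgroup, which would match the SIGKILL in your application log.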
With kind regards
Marcus

On 5/26/19 11:55 AM, Johannes Sauer wrote:
Hi all,
some of my jobs are failing. It happens very rarely and for no apparent reason. The log says the job got SIGKILL, although sacct just says COMPLETED. I had a job this week with this problem, and it ran without issue after restarting it. This is particularly annoying since my jobs usually take more than 1 day. I'm not exceeding my requested runtime or memory limits.

I just had another one like it. I restarted it and believe it will run through without issue. I attached what sacct reported. It failed on ncm0217.
Anyone had issues like this?
Best
Johannes
--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner@itc.rwth-aachen.de
www.itc.rwth-aachen.de