I cannot really find anything in the logfiles that would point me in the right direction.
On the master and on ncm0217:
[2019-05-24T17:33:04.169] _slurm_rpc_submit_batch_job: JobId=2204307 InitPrio=109755 usec=25672
[2019-05-24T17:33:06.654] sched: Allocate JobId=2204307 NodeList=ncm0217 #CPUs=1 Partition=c18m
[2019-05-24T17:33:08.055] prolog_running_decr: Configuration for JobId=2204307 is complete
[2019-05-24T17:33:08.205] task_p_slurmd_batch_request: 2204307
[2019-05-24T17:33:08.205] task/affinity: job 2204307 CPU input mask for node: 0x000000000080
[2019-05-24T17:33:08.205] task/affinity: job 2204307 CPU final HW mask for node: 0x000000002000
[2019-05-24T17:33:08.653] _run_prolog: prolog with lock for job 2204307 ran for 0 seconds
[2019-05-24T17:33:08.677] [2204307.extern] Considering each NUMA node as a socket
[2019-05-24T17:33:08.698] [2204307.extern] task/cgroup: /slurm/uid_26982/job_2204307: alloc=10240MB mem.limit=10240MB memsw.limit=unlimited
[2019-05-24T17:33:08.715] [2204307.extern] task/cgroup: /slurm/uid_26982/job_2204307/step_extern: alloc=10240MB mem.limit=10240MB memsw.limit=unlimited
[2019-05-24T17:33:09.107] Launching batch job 2204307 for UID 26982
[2019-05-24T17:33:09.134] [2204307.batch] Considering each NUMA node as a socket
[2019-05-24T17:33:09.161] [2204307.batch] task/cgroup: /slurm/uid_26982/job_2204307: alloc=10240MB mem.limit=10240MB memsw.limit=unlimited
[2019-05-24T17:33:09.174] [2204307.batch] task/cgroup: /slurm/uid_26982/job_2204307/step_batch: alloc=10240MB mem.limit=10240MB memsw.limit=unlimited
[2019-05-24T17:33:09.224] [2204307.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-05-24T17:37:49.526] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=2204307 uid 35249
[2019-05-24T17:37:49.526] error: Security violation, REQUEST_KILL_JOB RPC for JobId=2204307 from uid 35249
[2019-05-24T17:37:49.526] _slurm_rpc_kill_job: job_str_signal() JobId=2204307 sig 9 returned Access/permission denied
[2019-05-25T12:01:24.419] [2204307.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2019-05-25T12:01:24.458] _job_complete: JobId=2204307 WEXITSTATUS 0
[2019-05-25T12:01:24.459] _job_complete: JobId=2204307 done
[2019-05-25T12:01:24.478] [2204307.batch] done with job
[2019-05-25T12:01:32.453] [2204307.extern] _oom_event_monitor: oom-kill event count: 1
[2019-05-25T12:01:32.646] [2204307.extern] done with job
[2019-05-25T12:01:41.412] epilog for job 2204307 ran for 8 seconds
Interestingly, someone tried to kill your job a few minutes after it started, but without success.
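(If you want to find out who that was: uid 35249 should map to a login name on the system, e.g. with

    getent passwd 35249

assuming the uid is resolvable through the usual passwd/LDAP lookup.)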
There is nothing else in any of the logs, neither in the journal nor in messages.
So, to me and to the system, it looks like the job ended normally. I do not see any signs that SLURM killed the job.
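Just to cross-check the accounting side, something along the lines of

    sacct -j 2204307 --format=JobID,State,ExitCode,DerivedExitCode,Elapsed,MaxRSS,ReqMem

(field names may need adjusting for our Slurm version) should likewise show the batch step as COMPLETED with exit code 0:0, matching the WEXITSTATUS 0 in the log above.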
With kind regards
Marcus
On 5/26/19 11:55 AM, Johannes Sauer wrote:
Hi all,
some of my jobs are failing. It happens very rarely and for no apparent reason. The log says the job got SIGKILL, although sacct just says COMPLETED. I had a job with this problem this week, and it ran without issue after restarting it. This is particularly annoying since my jobs usually take > 1 day. I'm not exceeding my requested runtime or memory limits.
I just had another one like it. I restarted it and believe it will run through without issue. I have attached what sacct reported. It failed on ncm0217.
Anyone had issues like this?
Best
Johannes
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner@itc.rwth-aachen.de
www.itc.rwth-aachen.de