Hi all,
some of my jobs are failing. It happens very rarely and for no apparent reason. The log says the job received SIGKILL, although sacct just reports COMPLETED. I had a job with this problem this week, and it ran without issue after I restarted it. This is particularly annoying since my jobs usually take about 1 day. I'm not exceeding my requested runtime or memory limits.
I just had another one like it. I restarted it and expect it to run through without issue. I've attached the sacct output. It failed on ncm0217.
Has anyone had issues like this?
Best
Johannes