Hi,

we had a problem with the scheduler (at least yesterday): it kept requeueing jobs that it wanted to start. As a result, the cluster was using only one sixth of its capacity. I had to rewrite the whole prolog part of the scheduler; now it is performant again. This should also decrease the probability of the hated "socket send/receive" errors.

The queue is much smaller now:

$> squeue -t pd | wc -l
1251

This can also be seen in the following picture:

[hourly graph]

Nonetheless, the length of the queue, and therefore how long users need to wait, is nothing we can influence. It is you, the users, who submit jobs.

Regarding the accounting, there might be a misunderstanding: it is not that we do not record the data. The problem is that the tools needed to do the final accounting need rewriting. But SLURM does not behave the way we expected, so I am again and again distracted from continuing my work on the accounting. It is not simply a matter of switching on accounting. The "empty cluster" phenomenon, for example, was my work yesterday.

With kind regards
Marcus

On 5/22/19 6:42 AM, Pavel Ondračka wrote:
Hi,
I second this; some jobs I submitted on Monday have not started yet (though those have a 2-day runtime, so I guess it might be tricky for the scheduler to squeeze them in somewhere). In general there seems to be a large number of queued jobs:

squeue | grep " PD " | wc -l
12577

so I guess this is expected?
What is bugging me more is that I'm not getting any start time estimates. For some of the mentioned jobs,

squeue --start -j 2121545

returns:

2121545 c18m openmx-j sl119982 PD N/A 1 (null) (Priority)

i.e. there is no estimated start time. What command should I use to get some estimate?
For example, on Monday around noon I queued an interactive job to do some post-processing (8 CPUs, 1 hour, no additional requirements). I was thinking that such a small task would surely be scheduled quickly, but it still had not started by Tuesday morning, when I killed it (and ran it elsewhere). For interactive jobs the lack of a start time estimate is especially annoying.
It would be nice if smaller jobs could get some priority boost when the user has no (or very few) running jobs.
And in general, an email or link explaining how job scheduling and job priority are currently set up would be nice. It is possible I missed it, but https://doc.itc.rwth-aachen.de/display/CC/ has almost no info...
BTW, at first I thought I was out of CPU hours; however, r_batch_submission does not show any usage for the last month, which brings me to another question: why does the accounting still seem to be disabled? I know it was disabled during the trial phase, but now that we should be in production it might be a good idea to enable it.
Just my two cents.
Best regards
Pavel
On Tue, 2019-05-21 at 18:15 +0000, simon@isf.rwth-aachen.de wrote:
Hi everybody,
It seems like something happened in the past two weeks, and now the queue is extremely long. I submitted a job yesterday afternoon and it only started today at 2pm (and immediately exited, as it wasn't debugged yet, but this is not the issue). Before that, jobs usually started within moments (<5 min).
It seems like an enormous amount of jobs is being started at once. Is this intentional? Is there any chance to restrict that (e.g. the number of jobs per user)?
Or is it just that my project has such a low priority?
Best Regards,
Marek
_______________________________________________
claix18-slurm-pilot mailing list -- claix18-slurm-pilot@lists.rwth-aachen.de
To unsubscribe send an email to claix18-slurm-pilot-leave@lists.rwth-aachen.de
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner@itc.rwth-aachen.de
www.itc.rwth-aachen.de