Hi everybody,

It seems that something happened in the past two weeks, and the queue is now extremely long. I submitted a job yesterday afternoon and it only started today at 2pm (it exited immediately because it wasn't debugged yet, but that is not the issue). Before that, jobs usually started within moments (<5 min).

It seems like an enormous number of jobs is being started at once. Is this intentional? Is there any chance to restrict that (e.g. the number of jobs per user)?

Or is it just that my project has such a low priority?

Best regards,
Marek
In reply to the message from simon@isf.rwth-aachen.de of Tue, 2019-05-21 at 18:15 +0000:

Hi,

I second this. Some jobs I submitted on Monday have not started yet (though those have a 2-day runtime, so I guess it might be tricky for the scheduler to squeeze them in somewhere). In general there seems to be a large number of queued jobs:

    squeue | grep " PD " | wc -l
    12577

so I guess this is expected?

What is bugging me more is that I'm not getting any start time estimates. For example, for one of the mentioned jobs,

    squeue --start -j 2121545

returns:

    2121545 c18m openmx-j sl119982 PD N/A 1 (null) (Priority)

i.e. there is no estimated start time. What command should I use to get an estimate?

For example, on Monday around noon I queued an interactive job to do some post-processing (8 CPUs, 1 hour, no additional requirements). I thought that such a small task would surely be scheduled quickly, but it still had not started by Tuesday morning, when I killed it (and ran it elsewhere). For interactive jobs, the lack of a start time estimate is especially annoying.

It would be nice if smaller jobs could get a priority boost when the user has no (or very few) running jobs already.

In general, an email or link explaining how job scheduling and job priority are currently set up would be nice. It is possible I missed it, but https://doc.itc.rwth-aachen.de/display/CC/ has almost no info...

BTW, at first I thought I was out of CPU hours; however, r_batch_submission does not show any usage for the last month, which brings me to another question: why does the accounting still seem to be disabled? I know it was disabled during the trial phase, but now that we should be in production it might be a good idea to enable it.

Just my two cents.

Best regards,
Pavel

_______________________________________________
claix18-slurm-pilot mailing list -- claix18-slurm-pilot@lists.rwth-aachen.de
To unsubscribe send an email to claix18-slurm-pilot-leave@lists.rwth-aachen.de
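As a rough way to see who the pending jobs counted above belong to, a sketch along these lines may help. It assumes that squeue is on the PATH and that regular users are allowed to list other users' jobs on this cluster; the helper name count_by_user is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads one user name per line on stdin and prints
# "<count> <user>" lines, largest count first.
count_by_user() {
    sort | uniq -c | sort -rn
}

# Only query the scheduler if squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    # -h: no header, -t PD: pending jobs only, -o "%u": print only the user name
    squeue -h -t PD -o "%u" | count_by_user | head -n 10
fi
```

If a handful of users account for most of the pending jobs, that would at least explain a sudden jump in queue length without any change in the scheduler configuration.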
In reply to Pavel Ondračka's message of 5/22/19 6:42 AM:

Hi,

yesterday (at least) we had a problem with the scheduler: it kept requeueing jobs that it wanted to start. As a result, the cluster was using only one sixth of its capacity. I had to rewrite the whole prolog part of the scheduler; now it performs well again. This should also decrease the probability of the hated "socket send/receive" errors.

The queue is much smaller now:

    $> squeue -t pd | wc -l
    1251

This can also be seen in the following picture:

[Figure: hourly graph]

Nonetheless, the length of the queue, and therefore how long users need to wait, is nothing we can influence. It's you, the users, who submit jobs.

Regarding the accounting, there may be a misunderstanding: it is not that we do not record the data. The problem is that the tools needed to do the final accounting need rewriting. But SLURM does not behave the way we expected, so I am distracted again and again from continuing my work on the accounting. It is not simply a matter of switching accounting on. The "empty cluster" phenomenon, for example, was my work yesterday.

With kind regards
Marcus
-- Marcus Wagner, Dipl.-Inf. IT Center Abteilung: Systeme und Betrieb RWTH Aachen University Seffenter Weg 23 52074 Aachen Tel: +49 241 80-24383 Fax: +49 241 80-624383 wagner@itc.rwth-aachen.de www.itc.rwth-aachen.de
On Thu, 2019-05-23 at 08:20 +0200, Marcus Wagner wrote:
> Hi,
>
> we had (at least) yesterday a problem with the scheduler, resulting in requeueing jobs that it wanted to start. This led to a cluster which was only using one sixth of its capacity. I had to rewrite the whole prolog part of the scheduler; now it is performant again. This should also decrease the probability of the hated "socket send/receive" errors.
>
> The queue is much smaller now
> $> squeue -t pd | wc -l
> 1251
>
> This can also be seen in the following picture:

OK, thank you for the fix.

> Nonetheless, the length of the queue, and therefore how long users need to wait, is nothing we can influence. It's you, the users, who submit jobs.

I can understand that, and in no way was I suggesting that the long queue is your fault. If you got this feeling from my email, then I apologize.

What about the start time estimates? Any chance of getting this working?

I would also really appreciate some more info about the job scheduling priority, but I guess this has low priority at the moment.

> Regarding the accounting, it might be misunderstood that we do not record the data. The problem is that the tools needed to do the final accounting need rewriting. But SLURM does not behave the way we expected, so I'm again and again distracted from continuing my work on the accounting. It is not simply switching on accounting.

So just to make this clear: you do record the used hours, it is just not possible to show them at the moment (e.g. with r_batch_submission)? So can I somehow tell whether I'm currently burning CPU hours from last month's, this month's or next month's quota (could I currently use up all of my project's CPU hours without knowing)?

BTW, I share your sentiment towards SLURM. It also distracts me from my real work far more than I would like; I'm missing the old scheduler already ;-)

Best regards
Pavel
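One possible way to get a rough figure for consumed core hours, independent of r_batch_submission, is to sum up what Slurm's own job accounting reports. This is only a sketch: it assumes that sacct is usable by regular users on this cluster, the start date is an arbitrary example, and the result is not necessarily what the site's quota accounting will eventually charge. The helper name sum_core_hours is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads CPU time in seconds, one value per line,
# and prints the total in core hours with one decimal place.
sum_core_hours() {
    awk '{ total += $1 } END { printf "%.1f\n", total / 3600 }'
}

# Only query the accounting database if sacct is actually available.
if command -v sacct >/dev/null 2>&1; then
    # -X: one line per job allocation (no job steps), -n: no header,
    # -S: start of the reporting window (example date),
    # CPUTimeRAW: CPU time of the job in seconds
    sacct -u "$USER" -X -n -S 2019-05-01 -o CPUTimeRAW | sum_core_hours
fi
```

This only tells you what has been used since the chosen date, not which month's quota it will be booked against.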
In reply to Pavel Ondračka's message of 5/23/19 9:02 AM:

Hi,

is the multifactor priority plugin enabled? If so, the scheduling priority can be affected by several factors. What I see at the moment is:

    sprio -w
    JOBID PARTITION   PRIORITY     AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS
    Weights                      10000      10000    10000     100000    1

I was wondering what affects the fairshare. I believe it is bad if my jobs use (much) less memory than requested. How about the requested time? Does it also influence the fairshare?

Best
Johannes

PS: One can display one's own calculated factors with sshare -l -a
-- M.Sc. Johannes Sauer Researcher Institut fuer Nachrichtentechnik RWTH Aachen University Melatener Str. 23 52074 Aachen Tel +49 241 80-27678 Fax +49 241 80-22196 sauer@ient.rwth-aachen.de http://www.ient.rwth-aachen.de
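To make the weights quoted above concrete: with the multifactor priority plugin, a job's priority is essentially a weighted sum of factors that each lie between 0.0 and 1.0. The sketch below plugs in the weights from the sprio -w output in this thread. The factor values are invented purely for illustration, the helper name job_priority is hypothetical, and the real computation can include further terms (for example TRES weights or a nice value) that are left out here.

```shell
#!/bin/sh
# Hypothetical helper: weighted sum of the five factors listed by sprio -w.
# Arguments: age fairshare jobsize partition qos (each between 0.0 and 1.0).
job_priority() {
    awk -v age="$1" -v fs="$2" -v js="$3" -v part="$4" -v qos="$5" 'BEGIN {
        # Weights taken from the sprio -w output quoted in this thread.
        printf "%.0f\n", 10000*age + 10000*fs + 10000*js + 100000*part + 1*qos
    }'
}

# Invented example values: age factor 0.5, fairshare factor 0.25,
# job size factor 0.0, partition factor 1.0, QOS factor 0.0.
job_priority 0.5 0.25 0.0 1.0 0.0
```

With these weights the partition term is ten times larger than the age, fairshare and job size terms can ever be, and the QOS term is practically irrelevant.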
participants (4)
- Johannes Sauer
- Marcus Wagner
- Pavel Ondračka
- simon@isf.rwth-aachen.de