On Thu, 2019-05-23 at 08:20 +0200, Marcus Wagner wrote:
Hi,
we had (at least yesterday) a problem with the scheduler: it kept requeueing jobs that it actually wanted to start. This left the cluster using only one sixth of its capacity. I had to rewrite the whole prolog part of the scheduler; now it is performant again. This should also decrease the probability of the hated "socket send/receive" errors.
The queue is much smaller now:
$> squeue -t pd | wc -l
1251
This can also be seen in the following picture:
OK, thank you for the fix.
Nonetheless, the length of the queue, and therefore how long users need to wait, is nothing we can influence. It's you, the users, who submit the jobs.
I can understand that, and in no way was I suggesting that the long queue is your fault; if you got that feeling from my email, then I apologize. What about the start time estimates? Any chance to get this working? I would also really appreciate some more info about the job scheduling priority, but this has low priority ATM, I guess.
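(I'm only guessing at the stock Slurm commands here, and I have no idea whether they are enabled or meaningful on our cluster, but is this roughly what you mean?)

$> squeue --start -u $USER   # expected start times of my pending jobs, if the scheduler fills them in
$> sprio -u $USER            # the priority factors (age, fairshare, ...) behind my pending jobs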
Regarding the accounting, it might be misunderstood: it is not that we do not record the data. The problem is that the tools needed to do the final accounting need rewriting. But SLURM does not behave the way we expected, so I'm again and again distracted from continuing my work on the accounting. It is not simply a matter of switching accounting on.
So just to make this clear: you do record the used hours, it is just not possible to show them at the moment (e.g., with the r_batch_submission)? Can I somehow tell whether I'm ATM burning CPU hours from last month's/this month's/next month's quota (can I ATM use all of my/the project's CPU hours without knowing it)? BTW, I share your sentiment towards SLURM; it also distracts me from my real work far more than I would like, and I'm missing the old scheduler already ;-)

Best regards,
Pavel
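P.S. Is the "final accounting" you mention roughly what the stock Slurm accounting tools would report? I'm just guessing at the commands; I don't know whether slurmdbd on our cluster exposes this to users:

$> sreport -t hours cluster AccountUtilizationByUser start=2019-05-01 end=2019-06-01 user=$USER   # CPU hours charged per account this month
$> sacct -X -S 2019-05-01 --format=JobID,Account,Elapsed,AllocCPUS,CPUTimeRAW                     # per-job allocation and core-seconds since May 1st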