On Thu, 2019-05-23 at 08:20 +0200, Marcus Wagner wrote:
Hi,
we had (at least yesterday) a problem with the scheduler: it kept requeueing jobs that it actually wanted to start. This left the cluster using only one sixth of its capacity. I had to rewrite the whole prolog part of the scheduler; now it is performant again. This should also decrease the probability of the hated "socket send/receive" errors.
The queue is much smaller now:
$> squeue -t pd | wc -l
1251
This can also be seen in the following picture:
OK, thank you for the fix.
Nonetheless, the length of the queue, and therefore how long users need to wait, is nothing we can influence. It's you, the users, who submit the jobs.
I can understand that, and in no way was I suggesting that the long queue is your fault; if you got that feeling from my email, then I apologize. What about the start time estimates? Any chance to get this working? I would also really appreciate some more info about the job scheduling priority, but this has low priority ATM, I guess.
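(I'm only guessing at the stock Slurm commands here, and I have no idea whether they are enabled or meaningful on our cluster, but is this roughly what you mean?)

$> squeue --start -u $USER   # expected start times of my pending jobs, if the scheduler fills them in
$> sprio -u $USER            # the priority factors (age, fairshare, ...) behind my pending jobs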
Regarding the accounting, it might be misunderstood: it is not that we do not record the data. The problem is that the tools needed to do the final accounting need rewriting. But SLURM does not behave the way we expected, so I'm again and again distracted from continuing my work on the accounting. It is not simply a matter of switching accounting on.
So just to make this clear: you do record the used hours, it is just not possible to show them at the moment (e.g., with the r_batch_submission)? Can I somehow tell whether I'm ATM burning CPU hours from last month's/this month's/next month's quota (can I ATM use all of my/the project's CPU hours without knowing it)? BTW, I share your sentiment towards SLURM; it also distracts me from my real work far more than I would like, and I'm missing the old scheduler already ;-)

Best regards,
Pavel
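P.S. Is the "final accounting" you mention roughly what the stock Slurm accounting tools would report? I'm just guessing at the commands; I don't know whether slurmdbd on our cluster exposes this to users:

$> sreport -t hours cluster AccountUtilizationByUser start=2019-05-01 end=2019-06-01 user=$USER   # CPU hours charged per account this month
$> sacct -X -S 2019-05-01 --format=JobID,Account,Elapsed,AllocCPUS,CPUTimeRAW                     # per-job allocation and core-seconds since May 1st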