Hi,

At least yesterday, we had a problem with the scheduler: it kept requeueing jobs that it was trying to start.
As a result, the cluster was running at only one sixth of its capacity.
I had to rewrite the whole prolog part of the scheduler; it now performs well again.
This should also make the hated "socket send/receive" errors less likely.

The queue is much shorter now:
$> squeue -t pd | wc -l
1251
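
Note that squeue prints a header line by default, so the count above is one higher than the number of pending jobs. For anyone who wants to see how the pending jobs are spread across users, here is a minimal sketch. The helper name pending_per_user is only for illustration, and I am assuming squeue is in your PATH:

```shell
# Count entries per user, largest first.
# Reads one user name per line on stdin, so it can be tried without a cluster.
pending_per_user() {
    sort | uniq -c | sort -rn
}

# On the cluster:
#   -h       suppress the header line
#   -t PD    only pending jobs
#   -o "%u"  print only the user name of each job
# squeue -h -t PD -o "%u" | pending_per_user | head
```

Leaving out the helper, "squeue -h -t PD | wc -l" gives the plain number of pending jobs without the header.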

This can also be seen in the following picture:

[image: hourly graph]


Nonetheless, the length of the queue, and therefore how long users have to wait, is not something we can influence. It is you, the users, who submit the jobs.

Regarding the accounting, there may be a misunderstanding: it is not that we do not record the data.
The problem is that the tools needed for the final accounting need rewriting. But SLURM does not behave the way we expected, so I am distracted again and again from continuing my work on the accounting.
It is not simply a matter of switching accounting on.

The "empty cluster" phenomenon, for example, was what I worked on yesterday.


With kind regards
Marcus


On 5/22/19 6:42 AM, Pavel Ondračka wrote:
Hi,

I second this, some jobs I submitted on Monday have not started yet
(though those have 2 days runtime, so I guess it might be tricky for
the scheduler to squeeze them somewhere). In general there seems to be
a large amount of queued jobs,
squeue | grep " PD " | wc -l
12577
 so I guess this is expected?

What is bugging me more is that I'm not getting any start time
estimates, e.g. for some of the mentioned jobs, squeue --start -j
2121545 returns:
 2121545  c18m openmx-j sl119982 PD  N/A   1 (null)  (Priority)
i.e. there is no estimated start time. What command should I use to get
some estimate?

For example, on Monday around noon I queued an interactive job to do
some post-processing (8 CPUs/1 hour/no additional requirements). I was
thinking that such a small task must surely be scheduled quickly, but it
still had not started by Tuesday morning, when I killed it (and ran it
elsewhere).
For the interactive jobs the lack of start time estimate is especially
annoying.

It would be nice if smaller jobs could get some priority boost when the
user has no (or very few) running jobs already.

And in general some email/link to how the job scheduling and job
priority is currently set up would be nice. It is possible I missed it,
but the https://doc.itc.rwth-aachen.de/display/CC/ has almost no
info...

BTW, at first I thought I was out of CPU hours, however r_batch_submission
does not show any usage for the last month, which brings me to the
other question: why does the accounting still seem to be disabled? I know
it was disabled during the trial phase, however now that we should be
in production it might be a good idea to enable it?

Just my two cents.

Best regards
Pavel


On Tue, 2019-05-21 at 18:15 +0000, simon@isf.rwth-aachen.de wrote:
Hi everybody,

It seems like in the past two weeks something happened and now the
queue is extremely long. I submitted a job yesterday afternoon and
only today at 2pm it started (and immediately exited as it wasn't
debugged yet, but this is not the issue). Before that it usually
started within moments (<5min).

It seems like an enormous number of jobs is being started at once.
Is this on purpose? Is there any chance to restrict that (like # of
jobs per user)?

Or is it just that my project has such a low priority?

Best Regards,
Marek
_______________________________________________
claix18-slurm-pilot mailing list -- 
claix18-slurm-pilot@lists.rwth-aachen.de
To unsubscribe send an email to 
claix18-slurm-pilot-leave@lists.rwth-aachen.de

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner@itc.rwth-aachen.de
www.itc.rwth-aachen.de