Hi,

I've recently noticed that some of my jobs hang (or at least take many hours to finish, while normally they run in about 3 hours). The problematic part seems to be the rsync calls. Since I have had bad experiences with the throughput and stability of the network disks, I do something like this in my jobfiles:

CASE=$(basename $SLURM_SUBMIT_DIR)
TMP=/w0/tmp/slurm_$(username).$SLURM_JOB_ID/

rsync -a $SLURM_SUBMIT_DIR $TMP
cd $TMP/$CASE

# DO WORK

rsync -a --exclude '*.dayfile' $TMP/$CASE/ $SLURM_SUBMIT_DIR
scp $TMP/$CASE/*.dayfile $SLURM_SUBMIT_DIR/
rm -rf $TMP/$CASE/

i.e., I copy the data to the local drive to speed the calculations up. Mostly it works fine, but sometimes the job hangs. Connecting to the node with "srun --jobid <jobid> --pty /bin/zsh" and running ps ux shows:

USER          PID %CPU %MEM    VSZ  RSS TTY   STAT START  TIME COMMAND
sl119982   212008  0.0  0.0 124916 2032 ?     S    11:56  0:00 /bin/zsh /var/spool/slurm/job437663/slurm_script
sl119982   218531  0.0  0.0 118248 1532 ?     S    12:01  0:00 rsync -a --exclude *.dayfile /w0/tmp/slurm_sl119982.437663//6-4/ /rwthfs/rz/cluster/work/sl119982/TiAlON/x_0.6667_y_0.0000_g_0.0625/6-4
sl119982   218532  0.0  0.0 117928  876 ?     S    12:01  0:00 rsync -a --exclude *.dayfile /w0/tmp/slurm_sl119982.437663//6-4/ /rwthfs/rz/cluster/work/sl119982/TiAlON/x_0.6667_y_0.0000_g_0.0625/6-4
sl119982   218533  0.0  0.0 118188  768 ?     D    12:01  0:00 rsync -a --exclude *.dayfile /w0/tmp/slurm_sl119982.437663//6-4/ /rwthfs/rz/cluster/work/sl119982/TiAlON/x_0.6667_y_0.0000_g_0.0625/6-4
sl119982   223991  0.0  0.0 127240 2440 pts/0 Ss   12:13  0:00 /bin/zsh
sl119982   224467  0.0  0.0 115588 2220 pts/0 S    12:13  0:00 bash
sl119982   233247  0.0  0.0 155380 1928 pts/0 R+   12:28  0:00 ps ux

There are 3?!? rsync processes running, and all of them are sleeping? I have no idea what is going on. I tried to attach to one of the rsync processes to see what is happening:

gdb attach 218533
bt
#0  0x00002b2909102620 in __close_nocancel () from /lib64/libc.so.6
#1  0x0000564544f79de6 in recv_files ()
#2  0x0000564544f84161 in do_recv ()
#3  0x0000564544f849ac in start_server ()
#4  0x0000564544f84af5 in child_main ()
#5  0x0000564544fa3ce9 in local_child ()
#6  0x0000564544f67e9b in main ()

And actually, after detaching from the process it somehow resumed, switched back to running status, and everything finished.

Any ideas? The jobid of the last stuck job was 437663, if anyone wants to investigate. I'll send more jobids when I see this again...

Best regards
Pavel
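
P.S. In case it helps to reproduce the pattern without my actual data, below is a minimal, self-contained sketch of the staging logic. The use of $USER (instead of the site-specific username helper in my real jobfile), the mkdir, the timeout, and the stage-out log file are illustrative additions for debugging, not what the production script currently does:

#!/usr/bin/env bash
# Sketch of the stage-in / stage-out pattern: copy the submit directory to
# node-local scratch, work there, then copy the results back to the network
# filesystem. Illustrative only; names and extra flags are assumptions.

set -eu

CASE=$(basename "$SLURM_SUBMIT_DIR")
TMP=/w0/tmp/slurm_${USER}.${SLURM_JOB_ID}/    # $USER assumed here

mkdir -p "$TMP"

# Stage in: submit directory -> local scratch
rsync -a "$SLURM_SUBMIT_DIR" "$TMP"
cd "$TMP/$CASE"

# ... DO WORK ...

# Stage out: everything except *.dayfile back to the submit directory.
# The timeout and log file are only there so a future hang leaves a trace
# instead of silently blocking the job.
timeout 1h rsync -av --stats --exclude '*.dayfile' \
    "$TMP/$CASE/" "$SLURM_SUBMIT_DIR" \
    > "$SLURM_SUBMIT_DIR/stageout.rsync.log" 2>&1 \
    || echo "stage-out rsync failed or timed out (exit $?)" >&2

scp "$TMP/$CASE/"*.dayfile "$SLURM_SUBMIT_DIR"/
rm -rf "$TMP/$CASE"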