Hi,

I've recently noticed that some of my jobs hang (or at least take many hours to finish, while normally they run in about 3 hours). The problematic part seems to be the rsync calls. Since I have had bad experiences with the throughput and stability of the network disks, I do something like this in my jobfiles:

CASE=$(basename $SLURM_SUBMIT_DIR)
TMP=/w0/tmp/slurm_$(username).$SLURM_JOB_ID/

rsync -a $SLURM_SUBMIT_DIR $TMP
cd $TMP/$CASE

# DO WORK

rsync -a --exclude '*.dayfile' $TMP/$CASE/ $SLURM_SUBMIT_DIR
scp $TMP/$CASE/*.dayfile $SLURM_SUBMIT_DIR/
rm -rf $TMP/$CASE/

i.e., I copy the data to the local drive to speed the calculations up. Mostly it works fine, but sometimes the job hangs. Connecting to the node with "srun --jobid <jobid> --pty /bin/zsh" and running ps ux shows:

USER          PID %CPU %MEM    VSZ  RSS TTY   STAT START  TIME COMMAND
sl119982   212008  0.0  0.0 124916 2032 ?     S    11:56  0:00 /bin/zsh /var/spool/slurm/job437663/slurm_script
sl119982   218531  0.0  0.0 118248 1532 ?     S    12:01  0:00 rsync -a --exclude *.dayfile /w0/tmp/slurm_sl119982.437663//6-4/ /rwthfs/rz/cluster/work/sl119982/TiAlON/x_0.6667_y_0.0000_g_0.0625/6-4
sl119982   218532  0.0  0.0 117928  876 ?     S    12:01  0:00 rsync -a --exclude *.dayfile /w0/tmp/slurm_sl119982.437663//6-4/ /rwthfs/rz/cluster/work/sl119982/TiAlON/x_0.6667_y_0.0000_g_0.0625/6-4
sl119982   218533  0.0  0.0 118188  768 ?     D    12:01  0:00 rsync -a --exclude *.dayfile /w0/tmp/slurm_sl119982.437663//6-4/ /rwthfs/rz/cluster/work/sl119982/TiAlON/x_0.6667_y_0.0000_g_0.0625/6-4
sl119982   223991  0.0  0.0 127240 2440 pts/0 Ss   12:13  0:00 /bin/zsh
sl119982   224467  0.0  0.0 115588 2220 pts/0 S    12:13  0:00 bash
sl119982   233247  0.0  0.0 155380 1928 pts/0 R+   12:28  0:00 ps ux

There are 3?!? rsync processes running, and all of them are sleeping? I have no idea what is going on. I tried to attach to one of the rsync processes to see what is happening:

gdb attach 218533
bt
#0  0x00002b2909102620 in __close_nocancel () from /lib64/libc.so.6
#1  0x0000564544f79de6 in recv_files ()
#2  0x0000564544f84161 in do_recv ()
#3  0x0000564544f849ac in start_server ()
#4  0x0000564544f84af5 in child_main ()
#5  0x0000564544fa3ce9 in local_child ()
#6  0x0000564544f67e9b in main ()

And actually, after detaching from the process it somehow resumed, switched back to running status, and everything finished.

Any ideas? The jobid of the last stuck job was 437663, if anyone wants to investigate. I'll send more jobids when I see this again...

Best regards
Pavel
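
P.S. In case it helps to reproduce the pattern without my actual data, below is a minimal, self-contained sketch of the staging logic. The use of $USER (instead of the site-specific username helper in my real jobfile), the mkdir, the timeout, and the stage-out log file are illustrative additions for debugging, not what the production script currently does:

#!/usr/bin/env bash
# Sketch of the stage-in / stage-out pattern: copy the submit directory to
# node-local scratch, work there, then copy the results back to the network
# filesystem. Illustrative only; names and extra flags are assumptions.

set -eu

CASE=$(basename "$SLURM_SUBMIT_DIR")
TMP=/w0/tmp/slurm_${USER}.${SLURM_JOB_ID}/    # $USER assumed here

mkdir -p "$TMP"

# Stage in: submit directory -> local scratch
rsync -a "$SLURM_SUBMIT_DIR" "$TMP"
cd "$TMP/$CASE"

# ... DO WORK ...

# Stage out: everything except *.dayfile back to the submit directory.
# The timeout and log file are only there so a future hang leaves a trace
# instead of silently blocking the job.
timeout 1h rsync -av --stats --exclude '*.dayfile' \
    "$TMP/$CASE/" "$SLURM_SUBMIT_DIR" \
    > "$SLURM_SUBMIT_DIR/stageout.rsync.log" 2>&1 \
    || echo "stage-out rsync failed or timed out (exit $?)" >&2

scp "$TMP/$CASE/"*.dayfile "$SLURM_SUBMIT_DIR"/
rm -rf "$TMP/$CASE"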