| Summary: | removing a completing MPI job from squeue | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Hadrian <hxd58> |
| Component: | Scheduling | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | Priority: | --- |
| Version: | 14.11.5 | Hardware: | Linux |
| OS: | Linux | Site: | Case |
| Attachments: | slurm.conf, comp001 log | | |

Can you grab the slurmd.log from the affected compute nodes? Those should at least hint at why the job is still completing. The last two lines there indicate that a non-admin, non-job-owner attempted to 'scancel' the job; I'm guessing someone noticed it was stuck and tried to cancel it?

The failed job removal was my feeble attempt to remove the job as a regular user. slurm.log from the compute nodes:

xdsh comp002,comp010,comp012-comp016,comp125 grep 186018 /var/log/slurm/slurm.log

comp013: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp013: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 58843)
comp013: [2016-02-03T14:08:05.631] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp013: [2016-02-03T14:08:05.631] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x020000
comp013: [2016-02-03T14:08:05.677] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 25297)
comp013: [2016-02-03T14:08:05.780] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 32442)
comp013: [2016-02-03T14:08:05.798] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.809] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 50641)
comp013: [2016-02-03T14:08:05.828] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.839] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 11666)
comp013: [2016-02-03T14:08:05.858] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.870] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 23503)
comp013: [2016-02-03T14:08:05.890] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.900] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 43157)
comp013: [2016-02-03T14:08:05.918] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.929] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 53687)
comp013: [2016-02-03T14:08:05.946] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.957] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 9179)
comp013: [2016-02-03T14:08:05.978] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:05.987] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 23749)
comp013: [2016-02-03T14:08:06.009] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:06.016] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 40656)
comp013: [2016-02-03T14:08:06.034] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:06.049] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 35770)
comp013: [2016-02-03T14:08:06.068] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:06.097] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp013: [2016-02-03T14:08:06.626] [186018.1] done with job
comp013: [2016-02-03T14:08:06.752] [186018.2] done with job
comp013: [2016-02-03T14:08:06.852] [186018.5] done with job
comp013: [2016-02-03T14:08:06.880] [186018.3] done with job
comp013: [2016-02-03T14:08:25.088] [186018.12] done with job
comp013: [2016-02-03T14:08:25.092] [186018.7] done with job
comp013: [2016-02-03T14:08:25.093] [186018.8] done with job
comp013: [2016-02-03T14:08:25.098] [186018.6] done with job
comp013: [2016-02-03T14:08:25.099] [186018.4] done with job
comp013: [2016-02-03T14:08:25.100] [186018.10] done with job
comp013: [2016-02-03T14:08:25.101] [186018.11] done with job
comp013: [2016-02-05T15:17:15.608] Warning: revoke on job 186018 has no expiration
comp013: [2016-02-05T15:18:24.744] error: Error reading step 186018.0 memory limits
comp013: [2016-02-05T15:26:34.767] Warning: revoke on job 186018 has no expiration
comp013: [2016-02-05T15:26:36.990] error: Error reading step 186018.0 memory limits
comp014: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp014: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 13445)
comp014: [2016-02-03T14:08:05.632] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp014: [2016-02-03T14:08:05.632] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp014: [2016-02-03T14:08:05.676] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.752] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 11731)
comp014: [2016-02-03T14:08:05.781] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 59071)
comp014: [2016-02-03T14:08:05.796] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.809] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 25275)
comp014: [2016-02-03T14:08:05.824] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.838] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 19369)
comp014: [2016-02-03T14:08:05.854] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.868] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 62101)
comp014: [2016-02-03T14:08:05.883] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.896] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 37815)
comp014: [2016-02-03T14:08:05.912] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.924] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 53677)
comp014: [2016-02-03T14:08:05.939] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.953] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 54192)
comp014: [2016-02-03T14:08:05.968] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:05.981] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 7575)
comp014: [2016-02-03T14:08:05.997] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:06.010] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 15053)
comp014: [2016-02-03T14:08:06.027] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:06.039] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 62399)
comp014: [2016-02-03T14:08:06.054] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:06.082] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp014: [2016-02-03T14:08:06.664] [186018.1] done with job
comp014: [2016-02-03T14:08:06.756] [186018.2] done with job
comp014: [2016-02-03T14:08:06.852] [186018.5] done with job
comp014: [2016-02-03T14:08:06.886] [186018.3] done with job
comp014: [2016-02-03T14:08:25.075] [186018.4] done with job
comp014: [2016-02-03T14:08:25.087] [186018.11] done with job
comp014: [2016-02-03T14:08:25.090] [186018.7] done with job
comp014: [2016-02-03T14:08:25.096] [186018.9] done with job
comp014: [2016-02-03T14:08:25.096] [186018.10] done with job
comp014: [2016-02-03T14:08:25.097] [186018.6] done with job
comp014: [2016-02-03T14:08:25.103] [186018.12] done with job
comp014: [2016-02-05T15:17:16.752] Warning: revoke on job 186018 has no expiration
comp014: [2016-02-05T15:18:24.745] error: Error reading step 186018.0 memory limits
comp014: [2016-02-05T15:26:34.758] Warning: revoke on job 186018 has no expiration
comp014: [2016-02-05T15:26:36.991] error: Error reading step 186018.0 memory limits
comp012: [2016-02-03T14:08:05.547] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp012: [2016-02-03T14:08:05.632] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 2447)
comp012: [2016-02-03T14:08:05.632] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp012: [2016-02-03T14:08:05.632] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp012: [2016-02-03T14:08:05.676] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.752] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 43945)
comp012: [2016-02-03T14:08:05.780] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 42971)
comp012: [2016-02-03T14:08:05.796] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.808] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 21473)
comp012: [2016-02-03T14:08:05.823] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.838] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 57541)
comp012: [2016-02-03T14:08:05.853] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.867] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 7389)
comp012: [2016-02-03T14:08:05.882] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.896] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 23271)
comp012: [2016-02-03T14:08:05.912] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.926] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 45961)
comp012: [2016-02-03T14:08:05.941] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.955] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 12685)
comp012: [2016-02-03T14:08:05.970] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:05.983] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 5355)
comp012: [2016-02-03T14:08:05.999] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:06.013] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 64902)
comp012: [2016-02-03T14:08:06.028] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:06.044] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 46043)
comp012: [2016-02-03T14:08:06.060] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:06.090] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp012: [2016-02-03T14:08:06.665] [186018.1] done with job
comp012: [2016-02-03T14:08:06.757] [186018.2] done with job
comp012: [2016-02-03T14:08:06.847] [186018.5] done with job
comp012: [2016-02-03T14:08:06.887] [186018.3] done with job
comp012: [2016-02-03T14:08:25.077] [186018.12] done with job
comp012: [2016-02-03T14:08:25.087] [186018.8] done with job
comp012: [2016-02-03T14:08:25.087] [186018.9] done with job
comp012: [2016-02-03T14:08:25.095] [186018.11] done with job
comp012: [2016-02-03T14:08:25.100] [186018.10] done with job
comp012: [2016-02-03T14:08:25.102] [186018.4] done with job
comp012: [2016-02-03T14:08:25.105] [186018.6] done with job
comp012: [2016-02-05T15:17:15.506] Warning: revoke on job 186018 has no expiration
comp012: [2016-02-05T15:18:24.746] error: Error reading step 186018.0 memory limits
comp012: [2016-02-05T15:26:34.789] Warning: revoke on job 186018 has no expiration
comp012: [2016-02-05T15:26:36.991] error: Error reading step 186018.0 memory limits
comp002: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp002: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 47503)
comp002: [2016-02-03T14:08:05.631] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp002: [2016-02-03T14:08:05.631] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp002: [2016-02-03T14:08:05.675] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 12744)
comp002: [2016-02-03T14:08:05.781] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 30179)
comp002: [2016-02-03T14:08:05.797] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.810] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 14300)
comp002: [2016-02-03T14:08:05.825] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.839] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 5269)
comp002: [2016-02-03T14:08:05.854] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.868] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 18304)
comp002: [2016-02-03T14:08:05.883] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.896] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 29586)
comp002: [2016-02-03T14:08:05.912] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.925] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 41637)
comp002: [2016-02-03T14:08:05.941] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.954] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 6087)
comp002: [2016-02-03T14:08:05.970] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:05.983] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 16523)
comp002: [2016-02-03T14:08:05.999] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:06.013] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 24241)
comp002: [2016-02-03T14:08:06.029] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:06.043] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 33251)
comp002: [2016-02-03T14:08:06.060] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:06.089] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp002: [2016-02-03T14:08:06.666] [186018.1] done with job
comp002: [2016-02-03T14:08:06.758] [186018.2] done with job
comp002: [2016-02-03T14:08:06.854] [186018.5] done with job
comp002: [2016-02-03T14:08:06.950] [186018.3] done with job
comp002: [2016-02-03T14:08:25.104] [186018.9] done with job
comp002: [2016-02-03T14:08:25.104] [186018.10] done with job
comp002: [2016-02-03T14:08:25.105] [186018.6] done with job
comp002: [2016-02-03T14:08:25.107] [186018.12] done with job
comp002: [2016-02-03T14:08:25.107] [186018.4] done with job
comp002: [2016-02-03T14:08:25.109] [186018.8] done with job
comp002: [2016-02-03T14:08:25.109] [186018.7] done with job
comp002: [2016-02-05T15:17:15.267] Warning: revoke on job 186018 has no expiration
comp002: [2016-02-05T15:18:24.726] error: Error reading step 186018.0 memory limits
comp002: [2016-02-05T15:26:34.994] Warning: revoke on job 186018 has no expiration
comp002: [2016-02-05T15:26:36.980] error: Error reading step 186018.0 memory limits
comp010: [2016-02-03T14:08:05.547] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp010: [2016-02-03T14:08:05.632] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 41910)
comp010: [2016-02-03T14:08:05.632] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp010: [2016-02-03T14:08:05.632] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp010: [2016-02-03T14:08:05.677] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.752] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 44971)
comp010: [2016-02-03T14:08:05.781] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 32934)
comp010: [2016-02-03T14:08:05.795] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.810] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 61321)
comp010: [2016-02-03T14:08:05.824] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.838] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 183)
comp010: [2016-02-03T14:08:05.853] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.868] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 63953)
comp010: [2016-02-03T14:08:05.883] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.896] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 17856)
comp010: [2016-02-03T14:08:05.911] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.926] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 48040)
comp010: [2016-02-03T14:08:05.941] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.955] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 34504)
comp010: [2016-02-03T14:08:05.970] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:05.985] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 55465)
comp010: [2016-02-03T14:08:06.000] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:06.015] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 48343)
comp010: [2016-02-03T14:08:06.030] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:06.050] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 36006)
comp010: [2016-02-03T14:08:06.066] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:06.095] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp010: [2016-02-03T14:08:06.664] [186018.1] done with job
comp010: [2016-02-03T14:08:06.757] [186018.2] done with job
comp010: [2016-02-03T14:08:06.853] [186018.5] done with job
comp010: [2016-02-03T14:08:06.885] [186018.3] done with job
comp010: [2016-02-03T14:08:25.088] [186018.9] done with job
comp010: [2016-02-03T14:08:25.097] [186018.12] done with job
comp010: [2016-02-03T14:08:25.099] [186018.11] done with job
comp010: [2016-02-03T14:08:25.102] [186018.10] done with job
comp010: [2016-02-03T14:08:25.102] [186018.7] done with job
comp010: [2016-02-03T14:08:25.104] [186018.4] done with job
comp010: [2016-02-03T14:08:25.104] [186018.8] done with job
comp010: [2016-02-05T15:17:15.774] Warning: revoke on job 186018 has no expiration
comp010: [2016-02-05T15:18:24.734] error: Error reading step 186018.0 memory limits
comp010: [2016-02-05T15:26:34.760] Warning: revoke on job 186018 has no expiration
comp010: [2016-02-05T15:26:36.980] error: Error reading step 186018.0 memory limits
comp015: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp015: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 6331)
comp015: [2016-02-03T14:08:05.631] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp015: [2016-02-03T14:08:05.631] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp015: [2016-02-03T14:08:05.675] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 65460)
comp015: [2016-02-03T14:08:05.780] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 26766)
comp015: [2016-02-03T14:08:05.795] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.808] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 56782)
comp015: [2016-02-03T14:08:05.823] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.836] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 25281)
comp015: [2016-02-03T14:08:05.851] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.865] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 42180)
comp015: [2016-02-03T14:08:05.880] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.894] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 38786)
comp015: [2016-02-03T14:08:05.909] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.923] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 57803)
comp015: [2016-02-03T14:08:05.938] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.951] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 56241)
comp015: [2016-02-03T14:08:05.967] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:05.980] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 64433)
comp015: [2016-02-03T14:08:05.996] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:06.010] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 1950)
comp015: [2016-02-03T14:08:06.026] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:06.038] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 30094)
comp015: [2016-02-03T14:08:06.054] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:06.085] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp015: [2016-02-03T14:08:06.623] [186018.1] done with job
comp015: [2016-02-03T14:08:06.751] [186018.2] done with job
comp015: [2016-02-03T14:08:06.840] [186018.3] done with job
comp015: [2016-02-03T14:08:06.851] [186018.5] done with job
comp015: [2016-02-03T14:08:25.070] [186018.9] done with job
comp015: [2016-02-03T14:08:25.076] [186018.6] done with job
comp015: [2016-02-03T14:08:25.100] [186018.10] done with job
comp015: [2016-02-03T14:08:25.102] [186018.12] done with job
comp015: [2016-02-03T14:08:25.106] [186018.8] done with job
comp015: [2016-02-03T14:08:25.107] [186018.11] done with job
comp015: [2016-02-03T14:08:25.107] [186018.7] done with job
comp015: [2016-02-05T15:17:16.310] Warning: revoke on job 186018 has no expiration
comp015: [2016-02-05T15:18:24.720] error: Error reading step 186018.0 memory limits
comp015: [2016-02-05T15:26:34.770] Warning: revoke on job 186018 has no expiration
comp015: [2016-02-05T15:26:36.970] error: Error reading step 186018.0 memory limits
comp016: [2016-02-03T14:08:05.546] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp016: [2016-02-03T14:08:05.631] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 13277)
comp016: [2016-02-03T14:08:05.631] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp016: [2016-02-03T14:08:05.631] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x800
comp016: [2016-02-03T14:08:05.674] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 59354)
comp016: [2016-02-03T14:08:05.780] launch task 186018.2 request from 715780.10076@192.168.209.26 (port 18389)
comp016: [2016-02-03T14:08:05.795] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.808] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 64650)
comp016: [2016-02-03T14:08:05.824] [186018.2] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.837] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 13759)
comp016: [2016-02-03T14:08:05.851] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.865] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 40646)
comp016: [2016-02-03T14:08:05.880] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.894] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 45255)
comp016: [2016-02-03T14:08:05.909] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.922] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 52171)
comp016: [2016-02-03T14:08:05.936] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.951] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 56971)
comp016: [2016-02-03T14:08:05.966] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:05.980] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 60060)
comp016: [2016-02-03T14:08:05.995] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:06.008] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 43670)
comp016: [2016-02-03T14:08:06.024] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:06.041] launch task 186018.12 request from 715780.10076@192.168.209.26 (port 21717)
comp016: [2016-02-03T14:08:06.056] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:06.085] [186018.12] task_p_pre_launch: Using sched_affinity for tasks
comp016: [2016-02-03T14:08:06.664] [186018.1] done with job
comp016: [2016-02-03T14:08:06.760] [186018.2] done with job
comp016: [2016-02-03T14:08:06.846] [186018.5] done with job
comp016: [2016-02-03T14:08:06.885] [186018.3] done with job
comp016: [2016-02-03T14:08:25.073] [186018.7] done with job
comp016: [2016-02-03T14:08:25.074] [186018.8] done with job
comp016: [2016-02-03T14:08:25.086] [186018.4] done with job
comp016: [2016-02-03T14:08:25.088] [186018.6] done with job
comp016: [2016-02-03T14:08:25.089] [186018.11] done with job
comp016: [2016-02-03T14:08:25.091] [186018.12] done with job
comp016: [2016-02-03T14:08:25.096] [186018.9] done with job
comp016: [2016-02-05T15:17:15.634] Warning: revoke on job 186018 has no expiration
comp016: [2016-02-05T15:18:24.721] error: Error reading step 186018.0 memory limits
comp016: [2016-02-05T15:26:34.771] Warning: revoke on job 186018 has no expiration
comp016: [2016-02-05T15:26:36.991] error: Error reading step 186018.0 memory limits
comp125: [2016-02-03T14:08:05.545] _run_prolog: prolog with lock for job 186018 ran for 0 seconds
comp125: [2016-02-03T14:08:05.630] launch task 186018.0 request from 715780.10076@192.168.223.241 (port 24192)
comp125: [2016-02-03T14:08:05.630] lllp_distribution jobid [186018] implicit auto binding: sockets,one_thread, dist 2
comp125: [2016-02-03T14:08:05.630] _lllp_generate_cpu_bind jobid [186018]: mask_cpu,one_thread, 0x200,0x001
comp125: [2016-02-03T14:08:05.675] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.676] [186018.0] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.751] launch task 186018.1 request from 715780.10076@192.168.208.137 (port 14236)
comp125: [2016-02-03T14:08:05.780] launch task 186018.3 request from 715780.10076@192.168.208.135 (port 27342)
comp125: [2016-02-03T14:08:05.798] [186018.1] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.811] launch task 186018.4 request from 715780.10076@192.168.208.141 (port 28321)
comp125: [2016-02-03T14:08:05.826] [186018.3] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.840] launch task 186018.5 request from 715780.10076@192.168.208.127 (port 28828)
comp125: [2016-02-03T14:08:05.855] [186018.4] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.869] launch task 186018.6 request from 715780.10076@192.168.208.136 (port 31143)
comp125: [2016-02-03T14:08:05.884] [186018.5] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.898] launch task 186018.7 request from 715780.10076@192.168.208.138 (port 19585)
comp125: [2016-02-03T14:08:05.913] [186018.6] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.927] launch task 186018.9 request from 715780.10076@192.168.208.139 (port 41391)
comp125: [2016-02-03T14:08:05.942] [186018.7] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.956] launch task 186018.8 request from 715780.10076@192.168.208.140 (port 62366)
comp125: [2016-02-03T14:08:05.971] [186018.9] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.977] launch task 186018.10 request from 715780.10076@192.168.208.142 (port 28586)
comp125: [2016-02-03T14:08:05.995] [186018.8] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:05.999] launch task 186018.11 request from 715780.10076@192.168.208.128 (port 31163)
comp125: [2016-02-03T14:08:06.014] [186018.10] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:06.038] [186018.11] task_p_pre_launch: Using sched_affinity for tasks
comp125: [2016-02-03T14:08:06.625] [186018.1] done with job
comp125: [2016-02-03T14:08:06.852] [186018.5] done with job
comp125: [2016-02-03T14:08:06.944] [186018.3] done with job
comp125: [2016-02-03T14:08:25.082] [186018.11] done with job
comp125: [2016-02-03T14:08:25.085] [186018.6] done with job
comp125: [2016-02-03T14:08:25.098] [186018.9] done with job
comp125: [2016-02-03T14:08:25.101] [186018.10] done with job
comp125: [2016-02-03T14:08:25.108] [186018.4] done with job
comp125: [2016-02-03T14:08:25.108] [186018.7] done with job
comp125: [2016-02-03T14:08:25.118] [186018.8] done with job
comp125: [2016-02-05T15:17:15.931] Warning: revoke on job 186018 has no expiration
comp125: [2016-02-05T15:18:24.744] error: Error reading step 186018.0 memory limits
comp125: [2016-02-05T15:26:34.804] Warning: revoke on job 186018 has no expiration
comp125: [2016-02-05T15:26:36.990] error: Error reading step 186018.0 memory limits
comp125: [2016-02-11T10:35:47.088] [186018.0] Failed to send MESSAGE_TASK_EXIT: Connection refused
comp125: [2016-02-11T10:35:47.095] [186018.0] done with job

That last line there is suspicious:

comp125: [2016-02-11T10:35:47.088] [186018.0] Failed to send MESSAGE_TASK_EXIT: Connection refused

Do you happen to have the slurm-186018.out file from the user? It may have some debug output in it from the srun that ran into problems. It'd also be nice to have a chunk of slurmctld.log from around that time - just searching on the job number doesn't always pick up everything that may have been related. The repeated "error: Error reading step 186018.0 memory limits" warnings are also curious - do you see them for other jobs, or are they isolated to this specific one? Can you also attach a copy of your slurm.conf file for reference?
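As an aside, one way to pull the requested slurmctld.log window is to filter on the timestamp prefix rather than grepping only on the job id. This is a sketch, not a verified recipe: the slurmctld log path and the bracketed "[YYYY-MM-DDTHH:MM:SS.mmm]" timestamp format are assumptions based on the log excerpts in this ticket.

```
# Hypothetical sketch: capture everything slurmctld logged around the job's
# start (2016-02-03T14:08), not just the lines that mention job 186018.
# Adjust the assumed log path for your site.
awk '$1 >= "[2016-02-03T14:07" && $1 <= "[2016-02-03T14:15"' \
    /var/log/slurm/slurmctld.log > ctld-186018-window.log
```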
Created attachment 2747 [details]
slurm.conf

comp125, the node that reports the error, is not in the list of nodes where the job ran. The user had /dev/null as the output. Also, I see the following errors on one of the compute nodes (comp002):

[2016-02-05T15:17:15.267] Warning: revoke on job 186018 has no expiration
[2016-02-05T15:18:24.726] error: Error reading step 186018.0 memory limits
[2016-02-05T15:26:34.994] Warning: revoke on job 186018 has no expiration
[2016-02-05T15:26:36.980] error: Error reading step 186018.0 memory limits

Is there a way to forcefully remove the job from the queue?

Restarting slurmctld should clear it up, and it should be safe to do while the system is running. I'm concerned as to how comp125 wound up running that job - have you made any changes to the network or Slurm configuration lately?

We tried restarting slurmctld but the job is still there. The only change made to slurm.conf is that the TaskProlog flag was uncommented. The node names/parameters were not changed. The other observation is that the job id actually shows up in the logs of all the compute nodes even though the user requested only 7 nodes.

We've been talking this over internally, and there are a few odd things happening here that we're trying to sort through. It sounds like some step - possibly errant - is still out there, or it should have been marked as completed. It also looks like comp001 and comp125 may have been mixed up at some point - is it possible that there's an address conflict or some other issue between those two nodes? We're also curious about the comp001t vs comp001 distinction - can you explain why you have it set up that way? Can you run:

scontrol show steps 186018
sacct -j 186018

Getting logs from comp001t would also be helpful, and if you can increase the SlurmdDebug level to debug2, that may shed some light on why that job refuses to go away.

comp001 and comp001t differ in the interfaces (xxt is on 1g - used for provisioning and DRAC; the other is on 10g). The addresses for comp001 and comp125 have been the way they are for a long time. Please find the output of scontrol show steps 186018:

StepId=186018.0 UserId=715780 StartTime=2016-02-03T14:08:05 TimeLimit=UNLIMITED
   State=RUNNING Partition=batch
   NodeList=comp001t,comp002t,comp009t,comp010t,comp011t,comp012t,comp013t,comp014t,comp015t,comp016t,comp125t
   Gres=(null) Nodes=11 CPUs=12 Tasks=12 Name=mpirun Network=(null)
   ResvPorts=(null) Checkpoint=0 CheckpointDir=/home/sam238/test/mpi
   CPUFreqReq=Default

and sacct -j 186018:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
186018           mpirun      batch  arc_staff         12     FAILED     11:0
186018.0         mpirun             arc_staff         12     FAILED      0:0
186018.1          orted             arc_staff         10  COMPLETED      0:0
186018.2          orted             arc_staff         10  COMPLETED      0:0
186018.3          orted             arc_staff         10  COMPLETED      0:0
186018.4          orted             arc_staff         10     FAILED      1:0
186018.5          orted             arc_staff         10  COMPLETED      0:0
186018.6          orted             arc_staff         10  COMPLETED      0:0
186018.7          orted             arc_staff         10  COMPLETED      0:0
186018.8          orted             arc_staff         10     FAILED      1:0
186018.9          orted             arc_staff         10     FAILED      1:0
186018.10         orted             arc_staff         10     FAILED      1:0
186018.11         orted             arc_staff         10     FAILED      1:0
186018.12         orted             arc_staff         10     FAILED      1:0

We have the debug level set at 3. I have also added comp001's log file below.

Created attachment 2749 [details]
comp001 log
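For reference, a minimal sketch of how the SlurmdDebug bump Tim asked for might be rolled out. The config path, and the assumption that `scontrol reconfigure` makes the slurmd daemons re-read the file, are site-dependent details, not a verified procedure.

```
# Hypothetical sketch: raise slurmd logging to debug2 cluster-wide.
# Assumes a shared /etc/slurm/slurm.conf; adjust the path for your site.
sed -i 's/^SlurmdDebug=.*/SlurmdDebug=debug2/' /etc/slurm/slurm.conf
scontrol reconfigure   # ask the daemons to re-read the updated config
# Lower the level again once the extra logging has been captured.
```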
Can you try cancelling the orphaned step? ("scancel 186018.0")
I'd also like to see the output of
sacct --format=jobid,state,exitcode,start,end -j 186018
I get this when I do scancel 186018.0:

scancel: error: slurm_kill_job2() failed Invalid job id specified

and when I do sacct --format=jobid,state,exitcode,start,end -j 186018, I get the following:

       JobID      State ExitCode               Start                 End
------------ ---------- -------- ------------------- -------------------
186018           FAILED     11:0 2016-02-03T14:08:05 2016-02-03T14:08:25
186018.0         FAILED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:25
186018.1      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:06
186018.2      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:06
186018.3      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:06
186018.4         FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25
186018.5      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:06
186018.6      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:25
186018.7      COMPLETED      0:0 2016-02-03T14:08:05 2016-02-03T14:08:25
186018.8         FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25
186018.9         FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25
186018.10        FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25
186018.11        FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25
186018.12        FAILED      1:0 2016-02-03T14:08:05 2016-02-03T14:08:25

Any updates, Tim?

It looks like scancel won't cooperate here. It's still unclear how that job got into such a state, and we're trying to sort that out further. Can you pull some additional logs from each of the nodes for us, and check that no slurmstepd processes are still running for this job on any of the originally allocated nodes (comp001t,comp002t,comp009t,comp010t,comp011t,comp012t,comp013t,comp014t,comp015t,comp016t,comp125t)? We believe that at least one stepd must still be running somewhere. If you find one and are able to connect to it with `gdb -p (PID for stepd)`, the output of 'thread apply all bt full' would give us a better picture of the problem.

One thing I noticed from a possibly related error message for comp001: does /tmp/slurm exist on each node? Is there anything, such as RHEL's tmpwatch, that may be automatically removing files under /tmp on the nodes?

You are right, Tim. All the nodes had a zombie slurmstepd process running. I killed them manually, and the job was eventually removed. /tmp/slurm is manually cleared, but /tmp has plenty of space. I shall configure tmpwatch as you suggested. Thank you again.

(In reply to Hadrian from comment #16)
> You are right, Tim. All the nodes had a zombie slurmstepd process running. I
> killed them manually, and the job was eventually removed.

If you see another job in this state in the future, we'd like to see the backtrace before you kill it. Without the backtrace I don't have a way to find out what the underlying problem is.

> /tmp/slurm is manually cleared, but /tmp has plenty of space. I shall
> configure tmpwatch as you suggested. Thank you again.

Just to check - I didn't mean to suggest that you use it or not; I just wanted to make sure it wasn't causing problems for /tmp/slurm. Removing anything under there automatically could cause some odd problems, although I don't think it was a factor here.

I'm going to go ahead and mark this as resolved for now. If it recurs, please try to get a backtrace from one of the slurmstepd's before killing them and reopen this bug.

- Tim
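If a job gets stuck like this again, the hunt Tim describes above might look roughly like the following sketch. It assumes xdsh is available (as used earlier in this ticket), that the stray slurmstepd's process title contains the job id, and that gdb is installed on the nodes; none of that is guaranteed.

```
# Hypothetical sketch: look for leftover slurmstepd processes for job 186018
# on the nodes the step was allocated to.
xdsh comp001t,comp002t,comp009t,comp010t,comp011t,comp012t,comp013t,comp014t,comp015t,comp016t,comp125t \
    "pgrep -fl 'slurmstepd.*186018'"

# On any node that reports a PID, capture the backtrace FIRST - it is the
# evidence that identifies the underlying bug:
gdb -p <stepd PID> -batch -ex 'thread apply all bt full' > stepd-186018-bt.txt

# Only after the backtrace is saved, remove the stuck step daemon:
kill <stepd PID>
```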
Original bug description:

An MPI job has been stuck in CG (completing) state for a while, and none of its processes are running on the compute nodes. Is there a way to forcefully remove the job from the queue?

$ squeue
 JOBID PARTITION   NAME   USER ST TIME NODES CPU NODELIST(REASON)
186018     batch mpirun sam238 CG 0:20     7  12 comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t

$ scontrol show job 186018
JobId=186018 JobName=mpirun
   UserId=sam238(715780) GroupId=hpcadmin(10076)
   Priority=5557 Nice=0 Account=arc_staff QOS=normal
   JobState=COMPLETING Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=139:0
   RunTime=00:00:20 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2016-02-03T14:08:05 EligibleTime=2016-02-03T14:08:05
   StartTime=2016-02-03T14:08:05 EndTime=2016-02-03T14:08:25
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=hpctest:43576
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t
   BatchHost=comp001t
   NumNodes=7 NumCPUs=12 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1900M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=mpirun
   WorkDir=/home/sam238/test/mpi
   Power= SICP=0

slurm log:

[2016-02-11T10:33:45.147] Resending TERMINATE_JOB request JobId=186018 Nodelist=comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t,comp125t
[2016-02-11T10:34:45.250] Resending TERMINATE_JOB request JobId=186018 Nodelist=comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t,comp125t
[2016-02-11T10:35:56.077] error: job_update_cpu_cnt: cpu_cnt underflow on job_id 186018
[2016-02-15T12:33:40.314] _sync_nodes_to_comp_job: Job 186018 in completing state
[2016-02-15T12:57:02.041] Resending TERMINATE_JOB request JobId=186018 Nodelist=comp002t,comp010t,comp012t,comp013t,comp014t,comp015t,comp016t
[2016-02-15T13:44:12.004] _sync_nodes_to_comp_job: Job 186018 in completing state
[2016-02-16T16:15:41.990] error: Security violation, JOB_CANCEL RPC for jobID 186018 from uid 661646
[2016-02-16T16:15:41.990] error: _slurm_rpc_kill_job2: job_str_signal() job 186018 sig 9 returned Access/permission denied