Created attachment 6636 [details]
slurmd (b060) - batch job complete failure
Comment on attachment 6635 [details]
logs - slurmctld and slurmd
Duplicate job id
Further to my notes below, we are also experiencing "batch job complete failure" (and drain) on a couple of nodes. For example, see the slurmd log from node b060.

Hi Ahmed,

I think you have the limits for the slurmctld and slurmd daemons, or for the entire system, set too low for this volume of ~1500 concurrent jobs and 500 nodes. Your slurmctld is crashing continuously, which would explain why you are having issues. Please fix this urgently:

[2018-04-16T01:58:00.128] fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable

Check the limits for the Slurm daemons, the slurm user and the system. Set at least:

/proc/sys/fs/file-max: 32.832

Follow these guidelines: https://slurm.schedmd.com/high_throughput.html

If you have systemd, a starting point would be (note the TasksMax and Limit* settings):

[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmctld.pid
TasksMax=infinity
LimitNOFILE=1048576
LimitNPROC=1541404
LimitMEMLOCK=infinity
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target

Fix also this:

[2018-04-16T01:58:09.371] error: chdir(/var/log): Permission denied

And check that node galaxy-bio is set correctly in both slurm.conf and gres.conf (if you have one):

[2018-04-16T01:56:14.622] error: _slurm_rpc_node_registration node=galaxy-bio: Invalid argument

You can see many errors in the log due to the daemon failure:

[2018-04-16T01:58:09.381] error: _shutdown_backup_controller:send/recv: Connection refused

After fixing this, check everything again. Remember that you have to *restart* the daemons to apply some limits; otherwise you have to modify /proc/<pid>/limits manually.

Regarding your second comment and the nodes draining: if a job is requeued due to node failure, it is normal for that node to be set to drain. This can happen in your situation if a socket cannot be opened to communicate with some nodes. I wouldn't worry about that until you fix the problem mentioned above.

It would also be nice if you could fix this, seen in slurmd:

[2018-03-21T10:06:04.778] error: gres/mic unable to set OFFLOAD_DEVICES, no device files configured

Tell me how it goes.

I just realized that you are managing this with Alex in bug 5064 from James Powell. Let's track the resolution of that issue in the other bug, and after that, if duplicate job ids keep showing up, let's diagnose from scratch again here. Please keep me posted on the progress.

Hi Ahmed,

I see that the issue from 5064 has been solved. Are you still experiencing this issue?

Yes, the duplicate job id issue has been resolved following your input (LimitNOFILE=262144 and TasksMax=infinity), however we are still experiencing the 'batch job complete failure' issue from time to time. Please see the b060 logs. Any thoughts?

(In reply to Ahmed Arefin from comment #7)
> Yes, the duplicate job id issue has been resolved following your
> input (LimitNOFILE=262144 and TasksMax=infinity),

Good, glad it helped.

> however we are still
> experiencing the 'batch job complete failure' issue from time to time. Please see
> the b060 logs.
>
> Any thoughts?

Hm, I checked the logs again and I identified these situations:

1st. It is possible that the job reached its memory limit. I see you are using cgroups; are you enforcing memory limits in cgroup.conf (ConstrainRAMSpace=yes)?
If this is the case, ensure that in slurm.conf you have the following set:

MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

This will disable the internal memory limit enforcement mechanism and the job accounting gather memory enforcement mechanism, keeping only one mechanism, the cgroup one, enabled for memory limit identification. Having these three mechanisms enabled together can cause some issues.

[2018-04-13T13:28:12.972] [15132188.batch] task/cgroup: /slurm/uid_296431/job_15132188/step_batch: alloc=81920MB mem.limit=81920MB memsw.limit=unlimited
[2018-04-13T13:38:43.538] [15116233.batch] error: Step 15116233.4294967294 hit memory limit at least once during execution. This may or may not result in some failure.
[2018-04-13T13:38:43.539] [15116233.batch] error: Job 15116233 hit memory limit at least once during execution. This may or may not result in some failure.

2nd. There are some jobs cancelled due to time limit; check that the ones where you see batch failures are not these:

[2018-04-15T16:26:44.193] [15162432.batch] error: *** JOB 15162432 ON b060 CANCELLED AT 2018-04-15T16:26:44 DUE TO TIME LIMIT ***

3rd. I am not sure if this is related, but it would be good if you could fix it:

[2018-03-14T12:01:38.473] [14459833.0] error: gres/mic unable to set OFFLOAD_DEVICES, no device files configured

4th. This was probably caused by the already fixed problem:

[2018-03-14T12:01:38.715] [14459833.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2018-03-14T12:01:38.716] [14459833.0] done with job

If none of this makes sense to you, I would need the new and complete slurmctld logs, slurmd logs, and the 'scontrol show job' output of a failing job. Is it reproducible, or does it happen sporadically?

The following lines were added to the slurm.conf file:

# SchedMD suggested changes Apr18
MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

I have also 'resumed' the nodes that were facing the 'batch job complete failure'; we will now wait and see if the error comes back.

Created attachment 6723 [details]
logs - slurmctld and slurmd 30-APR-2018
"Batch job complete failure"
slurmd log from b027.
Still not resolved:

batch job complete f  root   2018-04-29T01:23:26  b[027,038,043]
batch job complete f  root   2018-04-29T01:23:55  b[053,055,089]
batch job complete f  slurm  2018-04-30T00:12:45  b078

Logs added - slurmctld and slurmd retrieved on 30-APR-2018.
Error: "Batch job complete failure"
Slurmd log from the hostname b027.

Created attachment 6724 [details]
b025 - slurmd duplicate job id
[2018-04-28T01:07:46.075] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job15651535/slurm_script
[2018-04-28T01:09:41.763] error: Job 15651535 already running, do not launch second copy

1. Is the spool dir of slurmd set to a local filesystem?

2. Please send me the latest slurm.conf, cgroup.conf and gres.conf.

3. What do you have in job_submit.lua?

4. GresPlugins seems to be inconsistently configured, i.e. messages like:

[2018-04-30T01:45:57.918] error: gres_plugin_node_config_unpack: no plugin configured to unpack data type ap-southeast-2 from node galaxy-bio

5. Is cm02, your backup controller, reachable? Up? Configured? (A couple of quick checks are sketched after this list.)

6. Is b[101-108] in your DNS or hosts.conf? I see messages like:

error: _find_node_record(751): lookup failure for b101

7. Is it normal that your jobs last for just 3 seconds?

8. What happened to b025 from 04-27-2018@22:03 to 04-28-2018@00:02? Was it frozen or rebooted?

[2018-04-27T22:03:28.050] [15648753.batch] done with job
[2018-04-28T00:02:40.400] Message aggregation disabled

After that I see some restarts of the slurmd daemon.

9. In b027 I see a restart with an error. What is this about?

[2018-04-30T00:02:01.605] Message aggregation disabled
[2018-04-30T00:02:01.605] CPU frequency setting not configured for this node
[2018-04-30T00:02:01.605] error: GresPlugins changed from ap-southeast-2,gpu,memdir,mic,one to gpu,memdir,mic,one ignored
[2018-04-30T00:02:01.605] error: Restart the slurmctld daemon to change GresPlugins

10. I cannot correlate the slurmd log on b027 with the slurmctld log, since it starts at 2018-04-30 and the event happened on 2018-04-29.

11. Regarding what I think is your failed job, I see:

[2018-04-29T01:22:26.137] [15657370.batch] error: *** JOB 15657370 ON b027 CANCELLED AT 2018-04-29T01:22:26 DUE TO TIME LIMIT ***
[2018-04-29T01:23:27.000] [15657370.batch] error: *** JOB 15657370 STEPD TERMINATED ON b027 AT 2018-04-29T01:23:26 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2018-04-29T01:23:27.000] [15657370.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
[2018-04-29T01:23:27.001] [15657370.batch] done with job

This seems to be related to bug 3941 and may indeed be the cause of the nodes being drained.

Please clarify the previous points for me; I will investigate more about 11.
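For points 5 and 6, a minimal sketch of quick checks you could run from cm01 (or any node) is below. It assumes the default SlurmctldPort of 6817, so adjust the port if your slurm.conf sets a different one, and nc/getent are only generic examples:

# Does Slurm itself currently see the primary and backup controllers?
scontrol ping

# Basic reachability of the backup controller
ping -c 3 cm02
nc -zv cm02 6817

# Can the node names slurmctld complains about still be resolved via DNS or /etc/hosts?
getent hosts b101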
> Please clarify the previous points for me; I will investigate more about 11.

12. One more thing, would it be possible to get the system log (/var/log/messages) of b027 starting at 2018-04-28 and ending at 2018-04-30?
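In case it helps: assuming b027 runs systemd-journald, something like the line below should extract just that window (the output file name is only illustrative); otherwise, simply copying the /var/log/messages files covering those dates is fine.

journalctl --since "2018-04-28" --until "2018-04-30" > b027-syslog-2018-04-28_30.txt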
Created attachment 6730 [details]
Slurm.conf, cgroup.conf and gres
Created attachment 6731 [details]
b027 messages
1. Is the spool dir of slurmd set to a local filesystem?
Yes. SlurmdSpoolDir=/cm/local/apps/slurm/var/spool

2. Please send me the latest slurm.conf, cgroup.conf and gres.conf.
Attached.

3. What do you have in job_submit.lua?
Where is this file (location)?

4. GresPlugins seems to be inconsistently configured, i.e. messages like:
[2018-04-30T01:45:57.918] error: gres_plugin_node_config_unpack: no plugin configured to unpack data type ap-southeast-2 from node galaxy-bio
Is that a problem?

5. Is cm02, your backup controller, reachable? Up? Configured?
Yes.

6. Is b[101-108] in your DNS or hosts.conf? I see messages like:
error: _find_node_record(751): lookup failure for b101
Taken away from Slurm for a Windows deployment. Do we need to do something on the BCM to let Slurm know about them?

7. Is it normal that your jobs last for just 3 seconds?
?

8. What happened to b025 from 04-27-2018@22:03 to 04-28-2018@00:02? Was it frozen or rebooted?
[2018-04-27T22:03:28.050] [15648753.batch] done with job
[2018-04-28T00:02:40.400] Message aggregation disabled
After that I see some restarts of the slurmd daemon.
It wasn't frozen or rebooted.

9. In b027 I see a restart with an error. What is this about?
[2018-04-30T00:02:01.605] Message aggregation disabled
[2018-04-30T00:02:01.605] CPU frequency setting not configured for this node
[2018-04-30T00:02:01.605] error: GresPlugins changed from ap-southeast-2,gpu,memdir,mic,one to gpu,memdir,mic,one ignored
[2018-04-30T00:02:01.605] error: Restart the slurmctld daemon to change GresPlugins

10. I cannot correlate the slurmd log on b027 with the slurmctld log, since it starts at 2018-04-30 and the event happened on 2018-04-29.

11. Regarding what I think is your failed job, I see:
[2018-04-29T01:22:26.137] [15657370.batch] error: *** JOB 15657370 ON b027 CANCELLED AT 2018-04-29T01:22:26 DUE TO TIME LIMIT ***
[2018-04-29T01:23:27.000] [15657370.batch] error: *** JOB 15657370 STEPD TERMINATED ON b027 AT 2018-04-29T01:23:26 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2018-04-29T01:23:27.000] [15657370.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
[2018-04-29T01:23:27.001] [15657370.batch] done with job
This seems to be related to bug 3941 and may indeed be the cause of the nodes being drained. Please clarify the previous points for me; I will investigate more about 11.
-Yes please.

12. One more thing, would it be possible to get the system log (/var/log/messages) of b027 starting at 2018-04-28 and ending at 2018-04-30?
Attached.

(In reply to Ahmed Arefin from comment #17)
> 3. What do you have in job_submit.lua?
> Where is this file (location)?

In your slurmctld log file, it is referenced:

[2018-04-30T00:02:33.490] job_submit.lua: uid=334466, name='sbatch_production_script', alloc_node='b033': set partition=h2gpu,h24gpu,gpu

job_submit.lua should be in the same directory as the slurm.conf file, and it modifies your jobs at submission time. See 'man slurm.conf' and grep for JobSubmitPlugins.

> 4. GresPlugins seems to be inconsistently configured, i.e. messages like:
>
> [2018-04-30T01:45:57.918] error: gres_plugin_node_config_unpack: no plugin
> configured to unpack data type ap-southeast-2 from node galaxy-bio
>
> Is that a problem?

Well, yes, indeed it is... this is the same as point 9. This message indicates that node galaxy-bio lacks information about a gres plugin called... "ap-southeast-2". As far as I know, "ap-southeast-2" is an Amazon AWS availability zone name, not a GRES plugin... what?
I guess somebody in your organization modified slurm.conf, did some CTRL+C, CTRL+V, and messed something up.

if (i >= gres_context_cnt) {
	error("gres_plugin_node_state_unpack: no plugin "
	      "configured to unpack data type %u from node %s",
	      plugin_id, node_name);
	/* A likely sign that GresPlugins has changed.
	 * Not a fatal error, skip over the data. */
	continue;
}

The error messages confirm that:

> [2018-04-30T00:02:01.605] error: GresPlugins changed from ap-southeast-2,gpu,memdir,mic,one to gpu,memdir,mic,one ignored
> [2018-04-30T00:02:01.605] error: Restart the slurmctld daemon to change GresPlugins

You have to do some deeper checks of your config and daemon status; I cannot help with such inconsistencies.

More on gres... your gres.conf looks like:

Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=mic Count=0
Name=one Count=1
Name=memdir Count=64

Remove "Name=mic Count=0"; it makes no sense, and removing it will get rid of errors in the log files. Also change your slurm.conf from 'GresTypes=gpu,memdir,mic,one' to 'GresTypes=gpu,memdir,one'.

> 5. Is cm02, your backup controller, reachable? Up? Configured?
> Yes.

Are you sure? Can you run all the network tests (ping, port scan, and whatever else is necessary) to ensure that it is *really* reachable?

[2018-04-30T07:53:37.174] error: _shutdown_backup_controller:send/recv: Connection refused

Please show me that it is OK. I also want a 'ps aux | grep -i slurm' and a 'netstat -anlp | grep -i slurm' from cm02. Also, the state save location must be the same between cm02 and cm01.

> 6. Is b[101-108] in your DNS or hosts.conf? I see messages like:
>
> error: _find_node_record(751): lookup failure for b101
>
> Taken away from Slurm for a Windows deployment. Do we need to do something on
> the BCM to let Slurm know about them?

All nodes defined in slurm.conf must be resolvable or have their address explicitly set. It seems you first removed the nodes from DNS and then from Slurm, generating these errors. This is not the correct order. Before deleting a node, the usual procedure is to drain the nodes, then remove them from slurm.conf, restart the slurmctld daemon, issue an 'scontrol reconfig', and finally remove them from DNS or do whatever else you want with those nodes.

The slurmctld daemon has a multitude of bitmaps to track the state of nodes and cores in the system. Removing nodes from a running system would require the slurmctld daemon to rebuild all of those bitmaps, which the developers feel is safer to do by restarting the daemon. You also want to run an "scontrol reconfig" to make all nodes re-read slurm.conf.

> 7. Is it normal that your jobs last for just 3 seconds?
> ?

I see jobs that start and end in about 3 seconds. If you have very short jobs but hundreds of them, Slurm requires special tuning:

https://slurm.schedmd.com/high_throughput.html

I am expecting some explanation of your general cluster use case and whether what I observed is expected or not. You just have to look at the b027 slurm logs to see what I mean.

> 12. One more thing, would it be possible to get the system log
> (/var/log/messages) of b027 starting at 2018-04-28 and ending at 2018-04-30?
>
> Attached.

Good, nothing strange.

---------

Some advice:

- Change your cgroup configuration.

Remove this line in cgroup.conf; it is no longer needed in 17.11:

CgroupReleaseAgentDir="/etc/slurm/cgroup"

- A recommendation for your task/affinity setup:

It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity plugin for setting the affinity of the tasks (which is better and different than task/cgroup) and uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best of both pieces.

slurm.conf:
TaskPlugin=task/affinity,task/cgroup

cgroup.conf:
ConstrainCores=yes
TaskAffinity=no

Created attachment 6744 [details]
Job submit lua
I have uploaded the job_submit.lua, feel free to have a look. We are going to apply the following suggested changes and will let you know the outcome.

- Change your cgroup configuration.

Remove this line in cgroup.conf; it is no longer needed in 17.11:

CgroupReleaseAgentDir="/etc/slurm/cgroup"

- A recommendation for your task/affinity setup:

It is recommended to stack task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity plugin for setting the affinity of the tasks (which is better and different than task/cgroup) and uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best of both pieces.

slurm.conf:
TaskPlugin=task/affinity,task/cgroup

cgroup.conf:
ConstrainCores=yes
TaskAffinity=no

Hi Ahmed, I just wanted to know if you are still experiencing the problem after the changes I proposed. There may still be a problem with jobs not ending with signals, but I would like to know the situation in your specific case. Thanks.

Yes, we are still experiencing this issue. We applied the suggested changes, but we are waiting for a cluster-wide drain and reboot to propagate the changes, which has been delayed until next week due to a bug in the Bright Cluster Manager. More news soon.

(In reply to Ahmed Arefin from comment #22)
> Yes, we are still experiencing this issue. We applied the suggested changes,
> but we are waiting for a cluster-wide drain and reboot to propagate the
> changes, which has been delayed until next week due to a bug in the Bright
> Cluster Manager. More news soon.

Hi Ahmed, have you finally applied the changes and rebooted to propagate them? Please keep me informed, thanks.

Hi Ahmed, any info on this matter?

Hello,

We think the issue has been resolved. Please kindly wait a couple of days for us to observe further, then close this case. Thanks for your help.

Note: We have also applied:

cm01:~ # scontrol show config | grep UnkillableStepTimeout
UnkillableStepTimeout   = 180 sec

* This gives slurmd 3 minutes to clean up after forcing a job to quit, rather than the default 60 seconds, which can be pushing it on a busy file system (especially when large core files are included).

(In reply to Ahmed Arefin from comment #25)
> Hello,
>
> We think the issue has been resolved. Please kindly wait a couple of days
> for us to observe further, then close this case. Thanks for your help.
>
> Note: We have also applied:
> cm01:~ # scontrol show config | grep UnkillableStepTimeout
> UnkillableStepTimeout = 180 sec
>
> * This gives slurmd 3 minutes to clean up after forcing a job to quit,
> rather than the default 60 seconds, which can be pushing it on a busy file
> system (especially when large core files are included)

Good, I will wait until next week, and if there is no more input here I will close the issue. Glad it is better now.

Thanks,
Felip

Hi, I am closing this bug, assuming that after the config cleanup the original errors (duplicate job id and batch job complete failures) have disappeared.

If you keep seeing a lot of continuous errors like:

[2018-04-29T01:23:27.000] [15657370.batch] error: *** JOB 15657370 STEPD TERMINATED ON b027 AT 2018-04-29T01:23:26 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2018-04-29T01:23:27.000] [15657370.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15

don't hesitate to re-open this bug or attach yourself to bug 5262, which deals specifically with this error.

Regards,
Felip
Created attachment 6635 [details]
logs - slurmctld and slurmd

Hello Team,

Following our recent Slurm upgrade to version 17.11.2, we are experiencing a "duplicate job id" issue on nodes, which also drains the machines. The duplicate job id issue has not been solved by turning off the 'job preemption' parameter in the slurm.conf file. Here is an example log from the affected node b036:

[2018-04-15T21:37:58.624] [15165425.batch] task/cgroup: /slurm/uid_581585/job_15165425/step_batch: alloc=24576MB mem.limit=24576MB memsw.limit=unlimited
[2018-04-15T21:38:11.996] error: Job 15165425 already running, do not launch second copy
[2018-04-15T21:38:11.999] [15165425.batch] error: *** JOB 15165425 ON b036 CANCELLED AT 2018-04-15T21:38:11 DUE TO JOB REQUEUE ***
[2018-04-15T21:38:13.087] [15165425.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15

are018@cm01:~> sacct -j 15165425
       JobID    JobName  Partition      User  AllocCPUS  NNodes    Elapsed   TotalCPU      State  MaxVMSize     MaxRSS  ReqMem        NodeList
------------ ---------- ---------- --------- ---------- ------- ---------- ---------- ---------- ---------- ---------- ------- ---------------
    15165425 FMD6-LIII+ h2gpu,h24+    xxx000          7       1   00:00:00   00:00:00    PENDING                          24Gn   None assigned
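For reference, a couple of follow-up queries that might expose more of this job's state (the sacct format fields are standard ones, and 'scontrol show job' only returns data while slurmctld still knows about the job):

sacct -j 15165425 --format=JobID,State,ExitCode,Start,End,NodeList
scontrol show job 15165425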