Ticket 1402 - srun fails "forcing job termination". Log says "No cgroup.conf file"
Summary: srun fails "forcing job termination". Log says "No cgroup.conf file"
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 14.11.0
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: David Bigagli
 
Reported: 2015-01-26 01:12 MST by Marios Hadjieleftheriou
Modified: 2015-02-06 07:15 MST
Site: Lion Cave Capital
Version Fixed: 14.11.4


Attachments
slurm.conf (3.20 KB, text/plain)
2015-01-26 01:12 MST, Marios Hadjieleftheriou

Description Marios Hadjieleftheriou 2015-01-26 01:12:34 MST
Created attachment 1585
slurm.conf

Occasionally, srun fails with the error "forcing job termination". The slurmd log contains the error "[980336.0] No cgroup.conf file (/opt/apps/etc/cgroup.conf)". We have jobacct_gather/linux enabled, but not cgroups. Our slurm.conf is attached.
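
One way to double-check that nothing in slurm.conf asks for a cgroup plugin is to scan the file for values that mention cgroup. The sketch below is only an illustration: the helper name `cgroup_settings` is hypothetical, and the sample text is made up for the example, not copied from the attached file.

```python
def cgroup_settings(conf_text):
    """Return (key, value) pairs from slurm.conf text whose value mentions cgroup."""
    hits = []
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if "cgroup" in value.lower():
            hits.append((key.strip(), value.strip()))
    return hits


# Illustrative input only (assumed values, not the reporter's configuration).
sample = (
    "JobAcctGatherType=jobacct_gather/linux\n"
    "# TaskPlugin=task/cgroup\n"
    "ProctrackType=proctrack/cgroup\n"
)
print(cgroup_settings(sample))
```

An empty result for the real file would support the statement that cgroups are not enabled in the configuration.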
Comment 1 David Bigagli 2015-01-26 06:48:33 MST
The message "forcing job termination" is printed by srun when it receives SIGINT. Did the user type Ctrl-C?
The message in the slurmd log seems unrelated to srun terminating the job; it is printed when slurmd starts or gets reconfigured. Could you please send us your node configuration?

David
Comment 2 David Bigagli 2015-01-30 08:24:06 MST
Do you have more information about this issue?

David
Comment 3 Marios Hadjieleftheriou 2015-01-30 11:32:43 MST
The user did not type Ctrl-C. The script was running under nohup, and the server it ran from has an uptime of 253 days.

I did not see anything useful in the slurm logs. I enabled debug logging and I am waiting for this to happen again.

The same user has seen this error 3 times, but no other user has complained about it. Also, it happens very rarely.

If I can get useful debug info from the logs, I will update.
Comment 4 Marios Hadjieleftheriou 2015-01-30 11:36:04 MST
I forgot to mention that, looking at the logs of the job itself post-mortem, there were no errors or anything unusual. The job simply terminates.
Comment 5 David Bigagli 2015-02-02 04:31:56 MST
Hi,
   I can reproduce this if I send SIGINT to srun, but I have to do this
in a loop. srun is designed so that if a single SIGINT is received
it prints the status of its tasks; when it receives multiple SIGINTs within one second
it terminates the job. Is it possible the job received SIGINT?
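
The behaviour described above can be modelled roughly as follows. This is a toy sketch in Python, not srun's actual code: the class name `SigintPolicy` and the return values are hypothetical, and the one-second window is taken from the description above.

```python
import time


class SigintPolicy:
    """Toy model of the single vs. repeated SIGINT behaviour (not srun's code)."""

    def __init__(self, window_seconds=1.0):
        # Assumed window, based on "one second" in the description above.
        self.window_seconds = window_seconds
        self.last_sigint = None

    def on_sigint(self, now=None):
        """Return "status" for an isolated SIGINT, "terminate" for a quick repeat."""
        now = time.monotonic() if now is None else now
        previous, self.last_sigint = self.last_sigint, now
        if previous is not None and now - previous <= self.window_seconds:
            return "terminate"
        return "status"


policy = SigintPolicy()
print(policy.on_sigint(now=0.0))  # isolated signal
print(policy.on_sigint(now=5.0))  # still isolated, 5 s after the first
print(policy.on_sigint(now=5.5))  # second signal inside the window
```

Under this model a single stray SIGINT is harmless, which is why the reproduction needs signals sent in a loop.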

David
Comment 6 Marios Hadjieleftheriou 2015-02-03 14:51:48 MST
We run a script on the client that runs a bunch of sruns in the background. The script traps int, kill, etc. and scancels all sruns and prints an error message.
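
For reference, the general shape of such a launcher is sketched below. It is a simplified stand-in, not our actual script: `run_all` and the `cancel` hook are hypothetical names, and the default hook just terminates the local child process where a real script would call scancel. Note that SIGKILL cannot be trapped, so only catchable signals are registered.

```python
import signal
import subprocess
import sys


def run_all(commands, cancel=lambda proc: proc.terminate()):
    """Start every command in the background and wait for all of them.

    `cancel` is a hypothetical hook; a real script might call scancel here.
    """
    children = [subprocess.Popen(cmd) for cmd in commands]

    def on_signal(signum, frame):
        # Log which signal arrived so there is a record of why jobs were cancelled.
        print(f"caught signal {signum}, cancelling running jobs", file=sys.stderr)
        for proc in children:
            if proc.poll() is None:
                cancel(proc)

    # SIGKILL cannot be caught, so it is not listed here.
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, on_signal)

    # Collect exit codes in launch order.
    return [proc.wait() for proc in children]


if __name__ == "__main__":
    # Placeholder commands; a real launcher would start srun here.
    print(run_all([[sys.executable, "-c", "pass"]]))
```

Logging the signal number in the handler means the absence of that message is evidence that the handler never ran.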

In this case, I don't see any indication that the sigint error handler was called by the script. There is also no other output by our job indicating that something went wrong. Moreover, the script was run using nohup. The user also claims that they did not call scancel on this or any other job.

We got another similar failure today (this time the error message was "srun: Job allocation 1037720 has been revoked"). The job never even ran. It got cancelled while waiting in the queue for resources. I will go through the debug logs tomorrow and post some more details.
Comment 7 David Bigagli 2015-02-06 07:15:58 MST
The message about the cgroup.conf file was fixed in commit a26eaa643d7f7,
available in 14.11.4. For the other issue, I suggest reopening this ticket when you have more data.

David