| Summary: | slurmctld on sequoia became unresponsive | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 1 - System not usable | ||
| Priority: | --- | ||
| Version: | 2.4.x | ||
| Hardware: | IBM BlueGene | ||
| OS: | Linux | ||
| Site: | LLNL | Alineos Sites: | --- |
|
Description
Don Lipari
2012-11-13 09:20:40 MST
So in this core I can see thread 15 is in the block_state_mutex at this point in the log (as mentioned in the previous comment)...

[2012-11-13T00:54:22] Queue start of job 56271 in BG block RMP12No173755691
[2012-11-13T00:54:22] debug: block RMP12No173755691 is already ready.
[2012-11-13T00:54:22] backfill: Started JobId=56271 on seq2210
[2012-11-13T00:54:22] debug: adding user crs to block RMP12No173755691

calling a function to add the user to the database, bt...

#0 0x000000808f94bbbc in .semop () from /lib64/libc.so.6
#1 0x0000040009398838 in .sqloSSemP () from /opt/ibm/db2/V9.7/lib64/libdb2o.so
#2 0x0000040008f09bb8 in ._Z12sqlccipcrecvP17SQLCC_COMHANDLE_TP12SQLCC_COND_T () from /opt/ibm/db2/V9.7/lib64/libdb2o.so
#3 0x0000040008f0e0bc in .sqlccrecv () from /opt/ibm/db2/V9.7/lib64/libdb2o.so
#4 0x000004000918e130 in ._Z12sqljcReceiveP10sqljCmnMgr () from /opt/ibm/db2/V9.7/lib64/libdb2o.so
#5 0x000004000927a610 in ._Z18sqljrDrdaArExecuteP14db2UCinterfaceP9UCstpInfo () from /opt/ibm/db2/V9.7/lib64/libdb2o.so
#6 0x00000400088ad874 in ._Z14CLI_sqlExecuteP17CLI_STATEMENTINFOP19CLI_ERRORHEADERINFO () from /opt/ibm/db2/V9.7/lib64/libdb2o.so
#7 0x0000040008989e88 in ._Z11SQLExecute2P17CLI_STATEMENTINFOP19CLI_ERRORHEADERINFO () from /opt/ibm/db2/V9.7/lib64/libdb2o.so
#8 0x0000040008985ef4 in .SQLExecute () from /opt/ibm/db2/V9.7/lib64/libdb2o.so
#9 0x0000008096df97fc in .SQLExecute () from /usr/lib64/libodbc.so.2
#10 0x000004000153e0b8 in cxxdb::StatementHandle::execute (this=0x400b8091520, no_data_out=0x0) at cxxdb/StatementHandle.cc:296
#11 0x000004000153ffac in cxxdb::UpdateStatement::execute (this=0x400b802d030, affected_row_count_out=0x400cb9fcc18) at cxxdb/UpdateStatement.cc:55
#12 0x0000040001768ff8 in hlcs::security::db::grant (object=..., authority=..., user=...) at db/grant.cc:168
#13 0x0000040001745a50 in hlcs::security::grant (object=<value optimized out>, authority=<value optimized out>, user=<value optimized out>) at privileges.cc:49
#14 0x0000040000654954 in bgsched::Block::addUser (blockName="RMP12No173755691", user="crs") at Block.cc:1238
#15 0x00000400003eb778 in bridge_block_add_user (bg_record=0x40044214698, user_name=0x400681676d8 "crs") at bridge_linker.cc:994
#16 0x00000400003ec140 in bridge_block_sync_users (bg_record=0x40044214698) at bridge_linker.cc:1120
#17 0x00000400003d5298 in _start_agent (args=0x4004414e0b8) at bg_job_run.c:573
#18 _block_agent (args=0x4004414e0b8) at bg_job_run.c:606
#19 0x000000808fa4c2bc in .start_thread () from /lib64/libpthread.so.0
#20 0x000000808f94866c in .__clone () from /lib64/libc.so.6

From this it appears to be hung on some write to the database. Perhaps IBM might be able to shed some light on the subject. Was there any disk error during that time, or a full filesystem?

The block_state_mutex is locked here until this call returns. That hoses up thread 4, which also holds a job_write lock, which in turn hoses up all the other threads waiting for that lock.

I looked through the other logs in /bgsys/logs/BGQ.sn but couldn't find anything that appeared to be related to this. I copied slurmctld.log-20121113.gz to my home directory so we would have the log if it was needed in the future, but at the moment I don't think this is a direct Slurm issue, since it appears the lockup is out of our control. If whatever was happening in the IBM code in thread 15 would clean up, then all would be just fine. This code path is called very often, so this is somewhat surprising.

(In reply to comment #1)
> So in this core I can see thread 15 is in the block_state_mutex at this
> point in the log (as mentioned in the previous comment)...
>
> [2012-11-13T00:54:22] Queue start of job 56271 in BG block RMP12No173755691
> [2012-11-13T00:54:22] debug: block RMP12No173755691 is already ready.
> [2012-11-13T00:54:22] backfill: Started JobId=56271 on seq2210
> [2012-11-13T00:54:22] debug: adding user crs to block RMP12No173755691
>
> calling a function to add the user to the database, bt...
[...]
> From this it appears to be hung on some write to the database.
[...]

Would it make sense to install a timer before the db call? Then the next time this happens, after the timeout period, the controller can log the error and keep going (draining the associated partitions if appropriate). Having all those threads build up, followed by the hang, is not a robust response to a flaky db2.

This would be a huge undertaking that would severely affect performance and decrease stability. It would require spawning a thread every time a call to the API is initiated, which isn't very desirable since calls to the API happen extremely often. Then, when a timeout does happen, what should the policy be? I am guessing it would be different for each call to the API. I am voting against this; it would probably take weeks to implement and test, and I would probably still feel nervous about running this way. Obviously the fallout of not handling this isn't very attractive, but this is the only time I am aware of anything like this happening. Also, if a timeout on the db happens, I would expect the end result would be to exit anyway, since this would probably mean future calls to the database would be wedged as well. So I don't think this would buy you much. It still isn't clear whether this is in the API or the database. Right now it appears to be in the database and not in the API, but it would be interesting to see what IBM can find on the matter.

We (SchedMD & IBM) believe this issue will go away with Slurm being more careful with certain IBM APIs. We have put mutexes around the critical ones so that at most one can run at a time. If this shows up again, please consult IBM. If there still appears to be a bug in Slurm, reopen the ticket. |
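For readers following the thread, the two ideas discussed above (a timer around the db call, and serializing the critical API calls behind a mutex) can be sketched roughly as follows. This is an illustration in Python, not Slurm's C code or the actual fix; `add_user_to_block`, `call_with_timeout`, `serialized`, and `api_lock` are hypothetical names, and the sleep merely stands in for a slow database write.

```python
import threading
import time


class CallTimedOut(Exception):
    """Raised when a wrapped call does not finish within the timeout."""


def add_user_to_block(block, user, delay=0.0):
    # Hypothetical stand-in for an API call that writes to the database.
    time.sleep(delay)
    return f"{user} added to {block}"


def call_with_timeout(func, timeout, *args, **kwargs):
    """Run func in a worker thread and give up waiting after `timeout` seconds.

    This is the cost raised in the thread: one extra thread per call.
    The worker is NOT cancelled on timeout; it stays blocked, and any
    lock it holds stays held.
    """
    result = {}

    def worker():
        try:
            result["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the error back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        raise CallTimedOut(f"{func.__name__} exceeded {timeout}s")
    if "error" in result:
        raise result["error"]
    return result["value"]


# Assumed single lock guarding the "critical" API calls, so that at most
# one of them runs at a time.
api_lock = threading.Lock()


def serialized(func, *args, **kwargs):
    """Run func while holding api_lock, one caller at a time."""
    with api_lock:
        return func(*args, **kwargs)
```

Note that the timeout wrapper only frees the caller. A call that is truly wedged inside the database still holds whatever it acquired, which matches the objection above that a timeout alone "would not buy you much".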