1. 6

  2. 2

    This reminds me of something we figured out when dealing with some ropey multi-threaded C++ code that no one understood and that repeatedly deadlocked. Of course, as these things do, it fell to the sysadmin team to handle, as apparently it was our fault that developer code locked up at 3am, and again our fault that we were the ones who got the pager alerts…but I digress. :)

    v0 used strace to figure out whether all the threads were either idle or stuck in ‘D’ state, alongside an internal application-side watchdog that touched a file on the filesystem so we could trivially see whether the main event loop was still moving. If things had stalled, the shell script pulled out the 9’bore and we left runit to mop up.

    v1 popped out, IIRC, when we saw we could sidestep strace and find out whether the application was stuck in a futex syscall via /proc/.../syscall.
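
For anyone wanting to try the same trick, here is a rough sketch of the two checks described above, in Python rather than the original shell. It is not the script we ran. Everything named here is an assumption for illustration: the function names, the per-thread path `/proc/<pid>/task/<tid>/syscall`, and the futex syscall number 202, which is only valid on x86_64 Linux.

```python
import os
import time
from pathlib import Path

# Assumption: x86_64 Linux, where futex is syscall 202.
# Other architectures use different numbers.
FUTEX_SYSCALL_NR = 202


def parse_syscall_nr(text):
    """Return the syscall number from one /proc/<pid>/task/<tid>/syscall line.

    Returns None when the thread is not blocked inside a syscall, which the
    kernel reports as "running" or as a line starting with -1.
    """
    fields = text.split()
    if not fields or fields[0] == "running":
        return None
    try:
        nr = int(fields[0])
    except ValueError:
        return None
    return nr if nr >= 0 else None


def all_threads_in_futex(pid, proc_root="/proc", futex_nr=FUTEX_SYSCALL_NR):
    """True if every readable thread of pid is currently blocked in futex."""
    task_dir = Path(proc_root) / str(pid) / "task"
    saw_thread = False
    for tid_dir in task_dir.iterdir():
        try:
            text = (tid_dir / "syscall").read_text()
        except OSError:
            # Thread exited mid-scan, or we lack permission to read it: skip.
            continue
        saw_thread = True
        if parse_syscall_nr(text) != futex_nr:
            return False
    return saw_thread


def heartbeat_is_stale(path, max_age_seconds, now=None):
    """True if the watchdog file is missing or was last touched too long ago."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return True
    return now - mtime > max_age_seconds
```

Every thread sitting in futex is not proof of a deadlock on its own, since idle worker threads waiting on a condition variable look the same. Pairing it with the stale heartbeat file is what makes it a usable signal, e.g. `all_threads_in_futex(pid) and heartbeat_is_stale("/path/to/heartbeat", 60)` (hypothetical path and threshold). Reading another process's `syscall` file generally needs the same privileges as attaching a debugger to it.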