I recently fixed an issue where I wanted my Monit monitoring process to restart a daemon that was segfaulting and causing 100% CPU usage according to top and most other system tools. I had seen configuration examples where Monit could detect that and restart the process, so I figured that adding a configuration like the one below would fix it easily enough:
check process foo with pidfile /var/run/foo.pid
  start program = "/etc/init.d/foo start" timeout 10 seconds
  stop program = "/etc/init.d/foo stop"
  if cpu usage > 90% for 8 cycles then restart
After letting that run for a bunch of cycles, the process remained running, and Monit didn’t do anything to acknowledge it, even in the log files. (FYI, a “cycle” is defined in the monitrc config file on the “set daemon” line and defaults to 120 seconds.)
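To put the cycle length in perspective, here is a rough sketch of how long a "for 8 cycles" condition has to hold before Monit would act, assuming the 120-second cycle mentioned above. The function name is my own, not anything from Monit, and the result is only approximate since it ignores when in a cycle the problem starts.

```python
def seconds_until_action(cycles: int, cycle_seconds: int = 120) -> int:
    """Approximate time a condition must persist before Monit acts.

    cycle_seconds is the poll interval from the "set daemon" line
    (120 is the value assumed in this post, not read from Monit).
    """
    return cycles * cycle_seconds


delay = seconds_until_action(8)
print(f"{delay} seconds (~{delay / 60:.0f} minutes)")
```

So with these settings the CPU has to stay over the limit for around a quarter of an hour before a restart happens.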
After some research, I finally came upon this post on the Monit mailing list where somebody explains that the CPU usage Monit bases its numbers on is a percentage of the CPU available across all processors. My machine had 4 processors, so what I was seeing as 100% CPU usage in top, Monit would see as only 25%.
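The arithmetic is simple enough to sketch. This assumes Monit divides by the processor count exactly as the mailing list post describes (which may differ between Monit versions); the helper name is hypothetical and only there for illustration.

```python
import os


def monit_cpu_percent(top_percent: float, cores: int) -> float:
    """Convert a per-core percentage (as top shows it) into a share of
    total CPU capacity, which is what Monit is assumed to compare against."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return top_percent / cores


# One core fully pegged on a 4-processor machine:
print(monit_cpu_percent(100.0, 4))  # 25.0

# The 90% threshold from the config above, translated the same way:
print(monit_cpu_percent(90.0, 4))   # 22.5

# The same conversion for whatever machine this runs on
# (os.cpu_count() can return None, so fall back to 1):
cores = os.cpu_count() or 1
print(monit_cpu_percent(100.0, cores))
```

In other words, a "cpu usage > 90%" rule on this box could only ever fire if nearly all four processors were saturated at once.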
I quickly changed my Monit config to check for CPU usage > 22%, as seen in the following. That now works perfectly, even acknowledging in the log each of the 8 times that the CPU was over the limit before restarting it:
check process foo with pidfile /var/run/foo.pid
  start program = "/etc/init.d/foo start" timeout 10 seconds
  stop program = "/etc/init.d/foo stop"
  if cpu usage > 22% for 8 cycles then restart
Now I need to solve the real problem and see why the latest Mongo PHP PECL module is segfaulting…