Passenger processes stuck maxing CPU after hitting 100%


The Setup:

* Ubuntu 18.04 LTS
* Apache 2.4.29
* Passenger 6.0.16
* Ruby 2.3.8
* Rails 4.2.x

I have both staging and prod servers with the same setup on AWS EC2; they are both running the same kernel/build. I upgraded the Ruby/Rails version of my app from Ruby 2.1.x -> 2.3.8, and Rails 4.0 -> 4.2, first on staging then on production.

On staging, everything worked fine; pages loaded quickly and without issue. On prod, pages would load quickly at first but soon degrade. User CPU would max out at 99%+, eventually making the app unresponsive. The only workaround was to restart Apache, roughly every 30 minutes.

After a LOT of digging and testing, top -c showed that each Passenger RubyApp process would hit 100% CPU and then stay "locked" at max CPU, even when no one was using the site. I've tried changing various settings in both Apache and Passenger, but nothing seems to work. Effectively, as soon as a few people hit the site in a particular way, ANY of the spawned Passenger processes that reach 100% stay high: they neither idle down nor exit, and just keep burning CPU, as if there were some IO issue.
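To make the "locked" behaviour easier to capture than by watching top, here is a minimal sketch that filters ps output for hot Passenger processes. It assumes a procps ps that supports the etimes column (elapsed seconds) and that the workers show up with "Passenger RubyApp" in their command line, as they do in top -c above. The function names and the 90% threshold are hypothetical. Note that ps reports %CPU averaged over the process lifetime, not the instantaneous value top shows, so a worker only crosses the threshold after it has been pinned for most of its life.

```python
import subprocess


def find_hot_processes(ps_output, threshold=90.0, marker="Passenger RubyApp"):
    """Return (pid, cpu_percent, elapsed_seconds, command) for each process
    whose command contains `marker` and whose %CPU is >= `threshold`.

    `ps_output` is the text printed by `ps -eo pid,pcpu,etimes,args`.
    The threshold and marker defaults are illustrative assumptions.
    """
    hot = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        # Skip the header row and anything that is not a full data row.
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        pid, cpu, elapsed, command = parts
        if marker in command and float(cpu) >= threshold:
            hot.append((int(pid), float(cpu), int(elapsed), command.rstrip()))
    return hot


def sample_ps():
    """Run ps once and return its output as text."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,etimes,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling find_hot_processes(sample_ps()) once a minute and logging the result would show whether the same PIDs stay pinned after traffic stops, which is the symptom described above.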

Right now Passenger and Apache configs are exactly the same on staging/prod and are the defaults.
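A minimal sketch of how the "exactly the same" claim could be double-checked, assuming the relevant Apache/Passenger config files from both servers have been copied somewhere and read in as text. The function names are hypothetical; the comparison ignores blank lines, # comments, and whitespace differences so that only effective directives are compared.

```python
import difflib


def normalize(config_text):
    """Return the effective lines of an Apache-style config:
    whitespace collapsed, blank lines and # comments dropped."""
    lines = []
    for raw in config_text.splitlines():
        line = " ".join(raw.split())
        if line and not line.startswith("#"):
            lines.append(line)
    return lines


def config_diff(staging_text, prod_text):
    """Return unified-diff lines between the two normalized configs.
    An empty list means the effective directives are identical."""
    return list(difflib.unified_diff(
        normalize(staging_text),
        normalize(prod_text),
        fromfile="staging",
        tofile="prod",
        lineterm="",
    ))
```

An empty result from config_diff for every file pair would confirm that the difference between the two servers lies outside the Apache and Passenger configuration.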

Here is a screenshot of top on prod with a few users on the site:

[Screenshot: top on prod]

And here is staging with roughly the same number of users:

[Screenshot: top on staging]

Staging looks far more like what I'd expect from a Rails app: higher memory use than CPU. AWS Support was also baffled, since prod is on an XL instance and staging is on a Micro, and the AWS kernel versions are the same. Here's the AWS monitoring of CPU usage; prod was updated on the 20th, but not many people used it over the weekend, so it only really became a problem on Monday during working hours.

[Screenshot: AWS monitoring graph of CPU usage]

Any ideas why this is happening on one server but not the other? No particular request causes it; literally any request (or 2-3 requests arriving together) will cause the CPU to spike to 100% and get stuck.

TIA.
