ELB generating 504 GATEWAY_TIMEOUTS w/ 2 EC2 instances - Packets not reaching Servers

1.8k views Asked by At

I have one ELB facing internet and connecting to 2 EC2 instances serving a web page. The infraestructure has been deployed using AWS Opsworks.

The configuration is as follows:

  1. Internet facing ELB:

    • Present in two availability zones
    • Cross-zone load balancing enabled
    • Idle timeout : 60 segs
    • Stickiness: LBCookieStickinessPolicy, expirationPeriod='0'
  2. EC2 instances serving a web page (demo1.virtualtoptraining.com)

    • Configuration is exactly the same on both
    • Running Apache 2 and PHP5 as module
  3. DNS Configuration:

demo1.virtualtoptraining.com --> CNAME to DNS Name of loadbalancer

When I run testing agains demo1.virtualtoptraining.com using JMETER, I see around 20% of requests getting errors. The error reported is 504 GATEWAY_TIMEOUT. The errors do not depend on load conditions as I can get this error navigating with a single user on a web browser. The error is 100% reproduceable.

Tried the following:

  • I have been playing with apache timers as I though this could be related to the web server not responding to the load balancer in time, but I got no improvements.

  • I have been using servers on the same availability zone. The error is still there.

After some investigation, I came to the conclusion that the requests where I get the 504 GATEWAY_TIMEOUT reach the load balancer, but they never get forwarded to the web servers as I can not see them in any of the web servers access logs.

I am showing below an example of the issue with all the logs. I got this example just navigating to the web page on my browser:

Example Sequence

  1. Logout from the web page (forwarded to front page)
  2. Get front page
  3. Get login page
  4. Post login credentials
  5. Follow redirection after credentials validation
  6. Get main page. ERROR --> this GET operation is not seen on any of the server access logs

I am listing now, the access logs as well as the ELB log, I am highlighting the messages with ***(message #)

SERVER 1 ACCESS LOG:

ubuntu@php-app1:/var/log/apache2$ tail -f moodle281-access.log
172.31.40.186 - - [09/Jun/2015:22:44:52 +0000] "POST /login/index.php HTTP/1.1" 303 1018 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:53 +0000] "GET / HTTP/1.1" 200 39933 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:54 +0000] "GET /course/view.php?id=2 HTTP/1.1" 200 55453 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:55 +0000] "GET /course/view.php?id=2 HTTP/1.1" 200 55453 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:55 +0000] "GET /mod/forum/discuss.php?d=1 HTTP/1.1" 200 47600 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:56 +0000] "POST /mod/forum/post.php HTTP/1.1" 404 26456 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:56 +0000] "GET /user/index.php?id=2 HTTP/1.1" 200 82175 "-" "Apache-HttpClient/4.2.6 (java 1.5)"

****(1) 172.31.19.167 - - [09/Jun/2015:22:48:52 +0000] "GET /login/logout.php?sesskey=fXjerOYZEo HTTP/1.1" 303 862 "http://demo1.virtualtoptraining.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"
****(4) 172.31.19.167 - - [09/Jun/2015:22:48:55 +0000] "GET /login/index.php HTTP/1.1" 200 6088 "http://demo1.virtualtoptraining.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"
****(5) 172.31.19.167 - - [09/Jun/2015:22:49:00 +0000] "GET /login/index.php?testsession=2 HTTP/1.1" 303 800 "http://demo1.virtualtoptraining.com/login/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"

SERVER 2 ACCESS LOG:

....

****  (2) 172.31.19.167 - - [09/Jun/2015:22:48:52 +0000] "GET / HTTP/1.1" 200 7078 "http://demo1.virtualtoptraining.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"
****  (3) 172.31.19.167 - - [09/Jun/2015:22:49:00 +0000] "POST /login/index.php HTTP/1.1" 303 1010 "http://demo1.virtualtoptraining.com/login/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"

....

LOAD BALANCER ACCESS LOG:

015-06-09T22:48:52.128826Z itoptest2 151.182.141.178:52554 172.31.25.91:80 0.000092 0.204629 0.000058 303 303 0 434 "GET http://demo1.virtualtoptraining.com:80/login/logout.php?sesskey=fXjerOYZEo HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:48:52.382853Z itoptest2 151.182.141.178:52554 172.31.34.25:80 0.000082 0.456022 0.00007 200 200 0 6500 "GET http://demo1.virtualtoptraining.com:80/ HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:48:55.853509Z itoptest2 151.182.141.178:52554 172.31.25.91:80 0.000105 0.292976 0.000046 200 200 0 5617 "GET http://demo1.virtualtoptraining.com:80/login/index.php HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:49:00.382245Z itoptest2 151.182.141.178:52554 172.31.34.25:80 0.000105 0.545512 0.000068 303 303 36 463 "POST http://demo1.virtualtoptraining.com:80/login/index.php HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:49:00.980174Z itoptest2 151.182.141.178:52554 172.31.25.91:80 0.00007 0.219837 0.000044 303 303 0 434 "GET http://demo1.virtualtoptraining.com:80/login/index.php?testsession=2 HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:49:01.248300Z itoptest2 151.182.141.178:52554 - -1 -1 -1 504 0 0 0 "GET http://demo1.virtualtoptraining.com:80/ HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -

As you can see the request generating the 504 GATEWAY_TIMEOUT is never seen on any of the web servers access log.

Any idea on of what the issue could be?

3

There are 3 answers

0
BData On BEST ANSWER

We did further investigation in the topic:

  • Requests were arriving to the server (found by running a tcpdump) but not answered by Apache

  • Apache error logs were showing that for the "not answered requests" Apache was crashing with a core dump. The following error was seen on the log:

    AH00051: child pid 11419 exit signal Bus error (7)

  • Core dump analysis was showing the following:

    warning: Could not load shared library symbols for 4 libraries, e.g. /usr/lib/php5/20121212/sasl.so. Use the "info sharedlibrary" command to see the complete listing. Do you need "set solib-search-path" or "set sysroot"? [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/sbin/apache2 -k start'. Program terminated with signal SIGBUS, Bus error. 0 lex_scan (zendlval=) at Zend/zend_language_scanner.c:1091 1091 Zend/zend_language_scanner.c: No such file or directory. (gdb)

After analyzing all this info we concluded that the issue was related to a PHP error as there was a concurrent access to a file from two different instances of the application. This access was not properly coordinated therefore, causing a crash.

1
AA Ron On

Is it possible you are not giving the ELB enough of a ramp-up period during a load test scenario? Make sure you monitor the ELB Surge Queues on CloudWatch. Surge requests that are dropped return a 5xx error (More on CloudWatch ELB Metrics here).

Also, when testing against Amazon EC2 ELB, it is important to configure JMeter's stalecheck. This is because Amazon will hot-swap your ELB with a larger one instead of actively scaling the one you're using. This means the underlying addressing changes, so JMeter has to check for stale connections. It could be JMeter is sending requests to the old ELB, which then in turn has a gateway timeout because it can no longer send to your app server.

Check out the in-depth ELB testing guide from AWS: AWS Best Practices for Load Testing ELB

Also, this is where the stalecheck property is for JMeter.

../bin/hc.parameters:

# Default value since JMeter 2.11, 
# also uncomment hc.parameters.file=hc.parameters to enable this check:
#http.connection.stalecheck$Boolean=false

You also need to edit the jmeter.properties file so it actually uses the hc.parameters settings, as noted in the comment above.

1
Dmitri T On

Do you have DNS Cache Manager added and enabled? If not I would suggest to try it out, it should resolve your issue.