I have one ELB facing internet and connecting to 2 EC2 instances serving a web page. The infraestructure has been deployed using AWS Opsworks.
The configuration is as follows:
Internet facing ELB:
- Present in two availability zones
- Cross-zone load balancing enabled
- Idle timeout : 60 segs
- Stickiness: LBCookieStickinessPolicy, expirationPeriod='0'
EC2 instances serving a web page (demo1.virtualtoptraining.com)
- Configuration is exactly the same on both
- Running Apache 2 and PHP5 as module
DNS Configuration:
demo1.virtualtoptraining.com --> CNAME to DNS Name of loadbalancer
When I run testing agains demo1.virtualtoptraining.com using JMETER, I see around 20% of requests getting errors. The error reported is 504 GATEWAY_TIMEOUT. The errors do not depend on load conditions as I can get this error navigating with a single user on a web browser. The error is 100% reproduceable.
Tried the following:
I have been playing with apache timers as I though this could be related to the web server not responding to the load balancer in time, but I got no improvements.
I have been using servers on the same availability zone. The error is still there.
After some investigation, I came to the conclusion that the requests where I get the 504 GATEWAY_TIMEOUT reach the load balancer, but they never get forwarded to the web servers as I can not see them in any of the web servers access logs.
I am showing below an example of the issue with all the logs. I got this example just navigating to the web page on my browser:
Example Sequence
- Logout from the web page (forwarded to front page)
- Get front page
- Get login page
- Post login credentials
- Follow redirection after credentials validation
- Get main page. ERROR --> this GET operation is not seen on any of the server access logs
I am listing now, the access logs as well as the ELB log, I am highlighting the messages with ***(message #)
SERVER 1 ACCESS LOG:
ubuntu@php-app1:/var/log/apache2$ tail -f moodle281-access.log
172.31.40.186 - - [09/Jun/2015:22:44:52 +0000] "POST /login/index.php HTTP/1.1" 303 1018 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:53 +0000] "GET / HTTP/1.1" 200 39933 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:54 +0000] "GET /course/view.php?id=2 HTTP/1.1" 200 55453 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:55 +0000] "GET /course/view.php?id=2 HTTP/1.1" 200 55453 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:55 +0000] "GET /mod/forum/discuss.php?d=1 HTTP/1.1" 200 47600 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:56 +0000] "POST /mod/forum/post.php HTTP/1.1" 404 26456 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
172.31.40.186 - - [09/Jun/2015:22:44:56 +0000] "GET /user/index.php?id=2 HTTP/1.1" 200 82175 "-" "Apache-HttpClient/4.2.6 (java 1.5)"
****(1) 172.31.19.167 - - [09/Jun/2015:22:48:52 +0000] "GET /login/logout.php?sesskey=fXjerOYZEo HTTP/1.1" 303 862 "http://demo1.virtualtoptraining.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"
****(4) 172.31.19.167 - - [09/Jun/2015:22:48:55 +0000] "GET /login/index.php HTTP/1.1" 200 6088 "http://demo1.virtualtoptraining.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"
****(5) 172.31.19.167 - - [09/Jun/2015:22:49:00 +0000] "GET /login/index.php?testsession=2 HTTP/1.1" 303 800 "http://demo1.virtualtoptraining.com/login/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"
SERVER 2 ACCESS LOG:
....
**** (2) 172.31.19.167 - - [09/Jun/2015:22:48:52 +0000] "GET / HTTP/1.1" 200 7078 "http://demo1.virtualtoptraining.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"
**** (3) 172.31.19.167 - - [09/Jun/2015:22:49:00 +0000] "POST /login/index.php HTTP/1.1" 303 1010 "http://demo1.virtualtoptraining.com/login/index.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5"
....
LOAD BALANCER ACCESS LOG:
015-06-09T22:48:52.128826Z itoptest2 151.182.141.178:52554 172.31.25.91:80 0.000092 0.204629 0.000058 303 303 0 434 "GET http://demo1.virtualtoptraining.com:80/login/logout.php?sesskey=fXjerOYZEo HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:48:52.382853Z itoptest2 151.182.141.178:52554 172.31.34.25:80 0.000082 0.456022 0.00007 200 200 0 6500 "GET http://demo1.virtualtoptraining.com:80/ HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:48:55.853509Z itoptest2 151.182.141.178:52554 172.31.25.91:80 0.000105 0.292976 0.000046 200 200 0 5617 "GET http://demo1.virtualtoptraining.com:80/login/index.php HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:49:00.382245Z itoptest2 151.182.141.178:52554 172.31.34.25:80 0.000105 0.545512 0.000068 303 303 36 463 "POST http://demo1.virtualtoptraining.com:80/login/index.php HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:49:00.980174Z itoptest2 151.182.141.178:52554 172.31.25.91:80 0.00007 0.219837 0.000044 303 303 0 434 "GET http://demo1.virtualtoptraining.com:80/login/index.php?testsession=2 HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
2015-06-09T22:49:01.248300Z itoptest2 151.182.141.178:52554 - -1 -1 -1 504 0 0 0 "GET http://demo1.virtualtoptraining.com:80/ HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.5 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.5" - -
As you can see the request generating the 504 GATEWAY_TIMEOUT is never seen on any of the web servers access log.
Any idea on of what the issue could be?
We did further investigation in the topic:
Requests were arriving to the server (found by running a tcpdump) but not answered by Apache
Apache error logs were showing that for the "not answered requests" Apache was crashing with a core dump. The following error was seen on the log:
AH00051: child pid 11419 exit signal Bus error (7)
Core dump analysis was showing the following:
warning: Could not load shared library symbols for 4 libraries, e.g. /usr/lib/php5/20121212/sasl.so. Use the "info sharedlibrary" command to see the complete listing. Do you need "set solib-search-path" or "set sysroot"? [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/sbin/apache2 -k start'. Program terminated with signal SIGBUS, Bus error. 0 lex_scan (zendlval=) at Zend/zend_language_scanner.c:1091 1091 Zend/zend_language_scanner.c: No such file or directory. (gdb)
After analyzing all this info we concluded that the issue was related to a PHP error as there was a concurrent access to a file from two different instances of the application. This access was not properly coordinated therefore, causing a crash.