Some context: in our old data pipeline system we run MySQL 5.6 or Aurora on Amazon RDS. The bad thing about the old pipeline is that it runs a lot of heavy computation on the database servers, because we are handcuffed by the original design: transactional databases are treated as a data warehouse, and our backend API "fishes" directly and heavily in those databases. We are currently patching this old pipeline while we redesign the new data warehouse in Snowflake.

In the old system, the pipeline calculation is a series of sequential MySQL queries. As our data grows bigger and bigger, the problem is that the calculation can hang forever at, for example, the step-3 MySQL query, while all of the metrics we monitor in Amazon CloudWatch/Grafana (CPU, database connections, freeable memory, network throughput, swap usage, read latency, write latency, available storage, etc.) look normal. The MySQL slow query log is not much help here, because every query in the pipeline is slow by design anyway (a single query can take hours, since the old pipeline pushes heavy computation onto the database servers). The way we usually "solve" these problems is to blindly upgrade the MySQL/Aurora RDS instance and hope that fixes it. I am wondering:

(1) What database metrics in MySQL 5.6 or Aurora on Amazon RDS should we monitor in real time to help us identify why a query freezes forever? Something like innodb_buffer_pool_size?

(2) Is there any existing tool and/or in-house approach that can predict how much hardware we need before we execute a query, so we can be confident it will succeed? Could someone share their two cents?

One thought: since Amazon RDS can be a bit of a black box, one option is to host our own MySQL server on an Amazon EC2 instance in parallel with our Amazon MySQL 5.6/Aurora RDS production server, so we can SSH into the MySQL server and run command-line tools like mytop (https://www.tecmint.com/mysql-performance-monitoring/) to gather more real-time MySQL metrics that could help us triage the issue. Open to any two cents from the gurus. Thank you!
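To make the kind of visibility we are after concrete, here is a minimal sketch of the real-time checks we have in mind (standard MySQL 5.6 / information_schema statements, runnable from any ordinary client connection; nothing here is specific to our schema):

    -- What is every connection currently executing, and for how long?
    SHOW FULL PROCESSLIST;

    -- Long-running InnoDB transactions: when they started, what they are running,
    -- and how many rows they have locked/modified so far.
    SELECT trx_mysql_thread_id, trx_state, trx_started,
           trx_rows_locked, trx_rows_modified, trx_query
      FROM information_schema.INNODB_TRX
     ORDER BY trx_started;

    -- Is anything actually blocked on a lock, or just grinding through work?
    SELECT * FROM information_schema.INNODB_LOCK_WAITS;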

1 Answer

Michael - sqlbot (accepted answer)

None of the tools mentioned at that link needs to run on the database server itself, and there should be no difference in their behavior when they don't. Run them on any Linux server, giving the appropriate --host, --user, and --password arguments (in whatever form each tool expects). Even mysqladmin works remotely, as do most of the MySQL command-line tools (the mysql CLI, mysqldump, mysqlbinlog, even mysqlcheck).
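A minimal sketch of what that looks like in practice (the endpoint, user, and database names below are placeholders, and flag spellings vary slightly from tool to tool):

    # Point the standard client tools at the RDS/Aurora endpoint instead of localhost.

    # Server status counters and the live process list, over an ordinary client connection:
    mysqladmin --host=mydb.cluster-xyz.us-east-1.rds.amazonaws.com \
               --user=monitor --password extended-status

    mysqladmin --host=mydb.cluster-xyz.us-east-1.rds.amazonaws.com \
               --user=monitor --password processlist

    # mytop works the same way -- it is just another client:
    mytop -h mydb.cluster-xyz.us-east-1.rds.amazonaws.com -u monitor -p 'secret' -d mydb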

There is no magic coupling that administrative utilities gain by running on the same server as MySQL Server itself -- this is a common misconception. Even when running on the same machine, they still have to make a connection to the server, just like any other client. They may connect to the local unix socket rather than using TCP, but it is still an ordinary client connection and provides no extra capabilities.

It is also possible to run an external replica of an RDS/MySQL or Aurora/MySQL server on your own EC2 instance (or even in your own data center). But this isn't likely to tell you much that you can't already learn from the RDS metrics, particularly in light of the above. (Note also that even replica servers acquire their replication stream over an ordinary client connection back to the master server.)

Avoid the temptation to tweak server parameters. On RDS, most of the defaults are quite sane, and unless you know specifically and precisely why you want to adjust a parameter... don't do it.

The most likely explanation for slow queries... is poorly written queries and/or poorly designed indexes.

If you are not familiar with EXPLAIN SELECT, then you need to learn it, live it, and love it. SQL is declarative, not procedural: SQL tells the server what you want, not specifically how to obtain it internally. For example, SELECT ... FROM x JOIN y tells the server to match up the rows from tables x and y on certain criteria, but it does not tell the server whether to read from x and then find the matching rows in y, or to read from y and find the matching rows in x. The net result is the same either way -- it doesn't matter which table the server examines first internally -- but if the query or the indexes don't allow the server to correctly deduce the optimum path to the results you've requested, it can spend countless hours churning through unnecessary effort.
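As a sketch, with hypothetical table and column names, inspecting a join with EXPLAIN looks like this:

    -- Hypothetical tables: orders (millions of rows) joined to customers (thousands of rows).
    EXPLAIN
    SELECT c.name, SUM(o.amount)
      FROM orders o
      JOIN customers c ON c.id = o.customer_id
     WHERE o.created_at >= '2019-01-01'
     GROUP BY c.name;

    -- Key columns to read in the output:
    --   table   the order in which tables are actually accessed
    --   type    the access strategy (const, ref, range ... ALL means a full table scan)
    --   key     which index, if any, the server chose
    --   rows    the optimizer's row estimate for that step
    --   Extra   warnings such as "Using temporary" or "Using filesort"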

Take, as an extreme and oversimplified example, a table with millions of rows and a table with 1 row. It makes sense to read the small table first, so you know the single value you are trying to join against in the large table. It would make no sense to read through each row in the large table and then check the small table for a match, millions of times over. The order in which you write the tables in a join is not necessarily the order in which the server actually joins them.

And that's where EXPLAIN comes in. It lets you inspect the query plan -- the strategy the internal query optimizer has concluded will get it to the answer you need with the least amount of effort. This is the core of the magic of relational database systems: finding the correct solution in the optimal time, based on what the server knows about the data. EXPLAIN shows you the order in which the tables are being accessed, how they're being joined, which indexes are being used, and an estimate of how many rows from each table are involved -- and those estimates multiply together to give you the number of permutations involved in resolving your query. Two small tables, each with 50,000 rows, joined without a proper index, mean an entirely unreasonable 2,500,000,000 unique combinations that must be evaluated; every row must be compared to every other row. In short, if that turns out to be the kind of thing you are (unknowingly) asking the server to do, then you are definitely doing something wrong. Inspecting your query plan should be second nature any time you write a complex query, to ensure that the server is using a sensible strategy to resolve it.
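To make the arithmetic concrete, a sketch with two hypothetical tables a and b of 50,000 rows each:

    -- Without an index on b.a_id, the plan shows a full scan (type=ALL) on b for every
    -- row of a: roughly 50,000 x 50,000 = 2,500,000,000 row combinations to evaluate.
    EXPLAIN SELECT * FROM a JOIN b ON b.a_id = a.id;

    -- Adding the index lets the optimizer look up matching rows in b directly (type=ref),
    -- so the work is closer to 50,000 index lookups instead of 2.5 billion comparisons.
    ALTER TABLE b ADD INDEX idx_a_id (a_id);
    EXPLAIN SELECT * FROM a JOIN b ON b.a_id = a.id;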

The output is cryptic, but secret decoder rings are available.

https://dev.mysql.com/doc/refman/5.7/en/explain.html#explain-execution-plan