In my decades-long (sigh!) experience in programming, one of the main problems is the gradual slowdown of systems caused by the growth of their databases over time.
Often this is due to inexperience: in the first years we are too focused on getting a working system, and we ignore performance until we have worked on the same project for enough years that we have to maintain it.
At that point we understand first-hand (or rather, by bumping into it) the importance of the database. When we first start writing queries, we just try to get the correct data. As long as we work with test data, or in the early stage of a production project, this is enough; but when tables grow to tens of thousands of records or more, systems start to slow down. It is only after the first server crashes, or the first emergency interventions on applications, that we finally start to think about these problems in advance.
Unfortunately, the projects that gave us this expertise don't behave in a database-friendly way and – even worse – they are already in production. In these cases we need to optimize them with a performance review. In this article I try to give some useful pointers on this topic.
This topic belongs to the code optimization series.
Database and performance
Before starting with the main topic, let's recall some useful notions related to good database performance. We assume basic definitions such as relational database, table, column, and primary key are already known.
Indexes
A table is a list of records, usually ordered by its primary key. An index is a separate data structure that contains a small subset of the original columns, ordered in a different way, to allow faster lookups.
For example, consider a table containing sales orders with fields such as date, total amount, and customer. The primary key is usually an integer ID for performance reasons. However, in real-world scenarios, we often need to retrieve orders by date, customer, status, or other fields. Without an index, the database would need to scan the entire table each time.
Indexes solve this problem by allowing the database engine to locate relevant rows efficiently, avoiding full table scans.
Note that JOIN and WHERE conditions are more efficient if they use fields covered by an index. Conversely, functions like DATE(), LOWER(), etc. can prevent the use of an index even when they are applied to indexed columns.
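As a minimal sketch (the table and column names here are only for illustration), creating and using an index on the order date could look like this:
CREATE INDEX idx_orders_date ON orders (order_date);

-- The index can be used: the indexed column is compared directly
SELECT id, total_amount
FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01';

-- The index usually cannot be used: the column is wrapped in a function
SELECT id, total_amount
FROM orders
WHERE YEAR(order_date) = 2024;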
Precompute frequent aggregations
Continuing with the previous sales example, a typical use case is to determine the total amount paid by each customer. Often this is a statistic shown in the initial dashboard, but computing it dynamically can be heavy, especially as the orders table grows. This is why precomputing specific data can sometimes be a clever idea.
It is probably better to spend 100 ms on each order to update the customer's running total than to spend minutes recomputing all the totals every day.
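As a sketch of this idea (the total_spent column and the literal values are hypothetical), the running total can be updated together with each new order instead of being recomputed with a GROUP BY over the whole orders table:
-- Keep a precomputed total on the customer (hypothetical column)
ALTER TABLE customers ADD COLUMN total_spent DECIMAL(12,2) NOT NULL DEFAULT 0;

-- Executed by the application (or a trigger) when a new order is saved
UPDATE customers
SET total_spent = total_spent + 123.45  -- amount of the new order
WHERE id = 42;                          -- customer of the new order

-- The dashboard then reads the precomputed value with no aggregation at all
SELECT name, total_spent FROM customers ORDER BY total_spent DESC LIMIT 10;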
Early filtering
Compare this query:
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'NL';
with this one:
SELECT *
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.country = 'NL';
They look the same, and a modern optimizer will often end up with a similar plan for both, but conceptually the former computes the whole orders JOIN customers result and then filters it, while the latter first filters the Dutch customers and only then joins them with their orders, potentially working with far fewer records. The principle becomes even more important with subqueries and derived tables, where filtering early really does reduce the number of rows the engine has to handle.
Selecting only needed fields
Some tables have a lot of fields; orders and customers are usually among them. If we always select data with SELECT * "just in case we need other fields in the future", we are potentially slowing down execution by a large factor.
Also consider that the raw data is transformed into objects of the programming language you are using (generic objects or dedicated data entities), and this transformation slows down the computation.
In my experience I have seen cases where the DISTINCT computation for a website menu was done in PHP on the raw records returned by a query. That query extracted 4000 records instead of the 80 needed, and the overhead of creating 4000 PHP objects and working on them caused both memory and timing problems. Just adding DISTINCT to the query solved the issue (one of those typical bugs where you need a couple of days to change one word and fix the problem).
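In a case like that one, the fix is roughly of this form (table and column names are invented for the example):
-- Before: thousands of rows fetched and deduplicated in PHP
SELECT * FROM products WHERE visible = 1;

-- After: only the ~80 distinct values needed by the menu
SELECT DISTINCT category FROM products WHERE visible = 1;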
The method
Since there are so many development frameworks, and we want to provide general ideas, we will not focus on specific tools, but will present a method for manually analyzing and optimizing the database. The method is linear:
- enable slow query logging for the database, to detect problematic queries
- put the log in a table where we can analyze the data
- analyze and optimize the individual queries.
If after a round of optimization the system still presents problems, we can repeat it on all queries instead of only slow queries. Let’s see the method in action.
Slow query logging
To enable the MySQL slow query log, you can do it either temporarily at runtime or permanently via configuration.
Enable (execute these queries, no restart needed):
SET GLOBAL slow_query_log = ON;
SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';
SET GLOBAL long_query_time = 2;
Disable:
SET GLOBAL slow_query_log = OFF;
This logs all queries that take longer than long_query_time seconds. The runtime method does not survive a MySQL restart.
To enable it permanently, add this to my.cnf (or my.ini on Windows) and restart MySQL:
[mysqld]
slow_query_log = 1
slow_query_log_file = mysql-slow.log
long_query_time = 2
Remove or set slow_query_log = 0 to disable it permanently.
Reading the slow query log
The MySQL slow query log has a standard, well-defined format, although small variations exist depending on the MySQL version and options.
The format is text-based and is officially described in the MySQL documentation.
Here is a possible sample of the log:
# Time: <date>
# User@Host: user[user] @ host [ip] Id: <i>
# Query_time: <sec> Lock_time: <sec> Rows_sent: <count> Rows_examined: <count>
SET timestamp=<timestamp>;
SELECT ...
We can parse it with a script that tracks, for each query, its timing and other info. Here is the script I use to analyze my logs. The parsing part needs to be adjusted from time to time if the log follows a different format (e.g. for other DBMSs like Postgres or MS SQL Server).
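If you prefer to load the parsed entries into MySQL itself, as suggested in the second step of the method, a table along these lines is enough (this is just a possible schema, not necessarily the one produced by the script):
CREATE TABLE slow_log_entries (
  id            INT AUTO_INCREMENT PRIMARY KEY,
  logged_at     DATETIME,
  user_host     VARCHAR(255),
  query_time    DECIMAL(10,3),
  lock_time     DECIMAL(10,3),
  rows_sent     INT,
  rows_examined INT,
  query_text    TEXT
);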
Analysis
OK, now the interesting part. The above script gives us a list of queries ordered by cumulative execution time. Some of them run many times, each lasting a few seconds; others run only a few times but last many seconds or minutes.
We need to analyze each of them and look for possible optimizations. My script can export the data in two useful formats:
- queries aggregated per user and time, useful to see the database load during the day
- time and count of each query, ordered by decreasing load.
The first mode is very useful to produce temporal charts that highlight the most critical hours of the day and the users involved. Here is an example of the resulting data and a possible visualization.

This chart tells us to pay attention to the processes running at midnight: they are probably blocking each other. Moreover, user1 and user3 use a lot of resources during their night executions. If our log spans several days, a typical chart shows periodic peaks during the nightly import/export crons and in the busiest hours of the day.
Very long queries
In the list of the top 50 aggregated queries, we'll find queries with a low count but a high sum_query_time. These queries usually present one of the following problems.
Missing indexes
Queries are not using indexes correctly: some ON or WHERE conditions reference columns that are either not indexed or are indexed separately, so the optimizer cannot use them efficiently together.
This can be verified by analyzing the execution plan (most database clients provide it, such as DBeaver or SQL Server Management Studio). In particular, you should look for the ALL access type on a table. That means the database engine is performing a full table scan instead of using an index for the join or filter condition.
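For example, you can prepend EXPLAIN to the slow query and check the type column of the plan; the composite index below is only an illustration of how a join plus a filter condition can be covered:
EXPLAIN
SELECT o.id, o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'NL' AND o.status = 'paid';

-- If the plan shows type = ALL on orders, an index on the join and filter columns may help
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);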
Too many aggregations
Queries use too many aggregation functions or subqueries. This slows down query execution because aggregation operations require additional computation, and subqueries often generate derived tables that cannot be properly indexed. When these derived results are joined with other tables, performance can degrade significantly.
If a table with certain aggregations or computed values is frequently needed, it may be worth maintaining a precomputed (flattened) table containing the processed data. This table can then be joined directly instead of recalculating the same aggregations dynamically on every query execution.
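A minimal sketch of such a flattened table, assuming we only need per-customer daily totals (the created_at column is an assumption):
CREATE TABLE customer_daily_totals (
  customer_id  INT NOT NULL,
  order_date   DATE NOT NULL,
  total_amount DECIMAL(12,2) NOT NULL,
  PRIMARY KEY (customer_id, order_date)
);

-- Refreshed by a scheduled job, then joined directly by the reporting queries
INSERT INTO customer_daily_totals (customer_id, order_date, total_amount)
SELECT customer_id, DATE(created_at), SUM(total_amount)
FROM orders
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
GROUP BY customer_id, DATE(created_at)
ON DUPLICATE KEY UPDATE total_amount = VALUES(total_amount);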
Large joins or large tables
When performing queries with joins, the database may consider a huge number of records if we don't make sure to filter the data as early as possible (see the dedicated paragraph above).
Other times, the tables used in the queries are simply too large. Even if the query correctly uses the available indexes, performance can still be slow due to the volume of data, since the selected result set may be large enough to require significant processing time. Moreover, large tables require additional time for insert, update, and delete operations because indexes must be maintained.
In such cases, it may be appropriate to reduce the active table size by archiving or partitioning older data into a dedicated historical table. This allows queries to operate on a smaller dataset, improving efficiency and overall performance.
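For example (assuming a created_at column and a nightly maintenance job; in a real system you would run this in batches and inside a transaction), old orders can be moved to a historical table like this:
-- One-time creation of the archive table with the same structure
CREATE TABLE orders_archive LIKE orders;

-- Move orders older than two years
INSERT INTO orders_archive
SELECT * FROM orders
WHERE created_at < NOW() - INTERVAL 2 YEAR;

DELETE FROM orders
WHERE created_at < NOW() - INTERVAL 2 YEAR;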
Short but frequent queries
These queries usually suffer from the same problems as the long queries, but we can only detect them by aggregating the total execution time of all occurrences of a given query. In my script I approximate this by grouping by a prefix of the query, so that occurrences are not split apart just because of different parameters. The script lists them in the same Top 50 list as the long queries.
Moreover, if the code repeats a query very often with the same parameters, we can consider caching the result to avoid the query entirely.
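If the parsed log is stored in a table like the one sketched earlier, this grouping by prefix can also be approximated directly in SQL:
SELECT LEFT(query_text, 80) AS query_prefix,
       COUNT(*)        AS executions,
       SUM(query_time) AS sum_query_time
FROM slow_log_entries
GROUP BY query_prefix
ORDER BY sum_query_time DESC
LIMIT 50;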
Size of tables
As already pointed out, another important factor in database optimization is the size of tables. If they have too many rows, the DBMS needs a lot of extra work to update indexes when we insert/update/delete records. Also, if the rows themselves are too big (maybe because they contain JSON or files), memory usage is less efficient (less of the data fits in cache).
So it is useful to find the biggest tables in our database and decide whether we can delete older data, or at least move part of it into a historical table. Here is the query to get the size and number of records of the tables in our MySQL database:
SELECT table_schema AS db_name, table_name,
ROUND((data_length + index_length) / 1024 / 1024, 2) AS size_mb,
ROUND(data_length / 1024 / 1024, 2) AS data_mb,
ROUND(index_length / 1024 / 1024, 2) AS index_mb,
table_rows
FROM information_schema.tables
WHERE table_schema = 'mydb'
ORDER BY (data_length + index_length) DESC;
Conclusion
In this post we focused on MySQL, but the procedure is similar for other DBMSs: it is sufficient to change the way we obtain the slow query log and to adapt the parsing of its format.
I hope this article gives you some good ideas to help with the optimization of your databases.
I put here again the link to the script to parse the MySQL slow query log: https://github.com/zagonico86/mysql-slow-query-analysis/tree/main