When long queries are fast - an SQL optimization

As the amount of data Enectiva.cz handles grows, we need to find better ways to manage it. After auditing our database, we discovered a pattern which allowed us to reduce the amount of data stored by tens of percent.

In order to take advantage of these potential savings, we had to change the SQL queries used to access the data. This entailed a set of performance tests for various prototypical scenarios. One such scenario: when was the consumption data last updated? In the case of a single meter, the last update is almost¹ the same as the latest reading, but the question becomes harder for a building with several meters: there, the answer is whenever any of its meters was last updated.

There is an obvious SQL query to answer this question:

SELECT MAX(time) AS time FROM consumptions WHERE meter_id IN (100, 200);

It returns the correct result but runs surprisingly slowly on our test data. The times range from 80 ms to over a second for two meters, and multiple seconds for five meters. That is way too much for such a simple question, so let's EXPLAIN it.

Aggregate  (cost=524741.22..524741.23 rows=1 width=8)
  ->  Bitmap Heap Scan on consumptions  (cost=5950.92..524105.81
                                         rows=254166 width=8)
        Recheck Cond: (meter_id = ANY ('{100,200}'::integer[]))
        ->  Bitmap Index Scan on consumptions_meter_id_time  (
                                        rows=254166 width=0)
              Index Cond: (meter_id = ANY ('{100,200}'::integer[]))

Reading from back to front, we see that a bitmap index scan occurs using the index over the (meter_id, time) columns. Its expected cost is roughly 1 % of the whole query, and the scan yields over 250K rows. That seems like quite a lot when we only want one of them at the end! The result is then rechecked and consolidated in a bitmap heap scan, which makes up 99 % of the cost estimate. Finally, the maximum is picked, reducing the result set to a single row (with a single column).

So what is going on here? Postgres uses the index to pick out the pages which will contain some records matching the criteria (Bitmap Index Scan; it builds a bitmap of matching pages to do that), then reads all those pages and goes through all their records to actually select the ones satisfying the conditions (Bitmap Heap Scan). The selection of the maximum at the end is the cherry on top (Aggregate).

OK, now we understand what is going on during the query execution, but how to improve it? We drew on our previous experiments and put together alternative queries yielding the same result.

One possible candidate (while we still hoped for a succinct and readable query) replaced the IN condition (which leads to an array comparison) with an explicit OR. In some cases this proved to be faster, but as the previous query plan shows, the bottleneck here is the final heap scan.
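For reference, the OR variant we tried looked roughly like this (a sketch of the rewrite, not the final solution):

```sql
-- Same result as the IN query, with an explicit OR instead of
-- an array comparison; the bitmap heap scan still dominates the cost.
SELECT MAX(time) AS time
FROM consumptions
WHERE meter_id = 100 OR meter_id = 200;
```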

The winning approach tries to reduce the number of rows as fast as possible:

SELECT MAX(sub.ident) AS ident FROM (
    SELECT MAX(ident) AS ident FROM consumptions WHERE meter_id = 100
    UNION ALL
    SELECT MAX(ident) AS ident FROM consumptions WHERE meter_id = 200
) AS sub;

The core idea is to split the query into sub-queries, one per meter, take the maximum of each, and then compare only those few values with each other.

Aggregate  (cost=7.57..7.58 rows=1 width=8)
  ->  Append  (cost=3.76..7.56 rows=2 width=8)
        ->  Result  (cost=3.76..3.77 rows=1 width=0)
              InitPlan 5 (returns $4)
                ->  Limit  (cost=0.57..3.76 rows=1 width=8)
                      ->  Index Only Scan Backward using
                            consumptions_meter_id_time on consumptions  (
                            Index Cond: ((meter_id = 100) AND
                                         (ident IS NOT NULL))
        ->  Result  (cost=3.76..3.77 rows=1 width=0)
              InitPlan 6 (returns $5)
                ->  Limit  (cost=0.57..3.76 rows=1 width=8)
                      ->  Index Only Scan Backward using
                            consumptions_meter_id_time on consumptions
                            consumptions_1  (
                            Index Cond: ((meter_id = 200) AND
                                         (ident IS NOT NULL))

The query plan demonstrates this process: the sub-queries appear on the same level and each consists of an Index Only Scan (all the data needed to answer the query are present in the index, so no page access is necessary). Even though the upper bound of the cost estimate in these scans is pretty high, the query is very fast. Indexes are sorted; in this case consumptions are sorted by meter id and then time, so Postgres only needs to find the section of the index for any given meter and then the last row in that section. The word Backward indicates that the index is traversed in reverse.
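The sorting the plan relies on comes from the composite index named in it. Assuming it was defined over (meter_id, time), its creation would look something like this sketch:

```sql
-- Hypothetical definition of the index the planner uses: rows for one
-- meter are contiguous and sorted by time, so the latest row for a
-- meter sits at the end of that meter's section of the index.
CREATE INDEX consumptions_meter_id_time ON consumptions (meter_id, time);
```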

Then the maxima for each meter are put together (Append) and compared (Aggregate) to find the overall maximum.
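Each "Limit, Index Only Scan Backward" pair in the plan is Postgres transforming a MAX sub-query into an ordered scan with a limit. A sketch of what each InitPlan effectively computes (using the ident column from the plan above) might look like:

```sql
-- The last non-NULL ident for a single meter, read straight from the
-- end of that meter's section of the index: one probe, one row.
SELECT ident FROM consumptions
WHERE meter_id = 100 AND ident IS NOT NULL
ORDER BY ident DESC
LIMIT 1;
```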

In practice, this query performs on the order of milliseconds (even for more meters) and beats the original query by two or three orders of magnitude. The downside is that the query is less readable, but the performance boost greatly outweighs the few lines of commentary explaining its goal.

  1. While a reading’s time is a moment specified down to the second, consumption is defined over a period of time — time gets discretized, in our case into 5-minute intervals. ↩︎

We're looking for developers to help us save energy

If you're interested in what we do and you would like to help us save energy, drop us a line at jobs@enectiva.cz.
