Window functions in Postgres with Rails time

Back when Ruby on Rails was moving to use Postgres as its default database, there were a lot of comparison articles and videos. From one of them, I remember a simple query which, surprisingly, works in MySQL:

SELECT 
  category_id, 
  COUNT(*), 
  email 
FROM users 
GROUP BY category_id

OK, we want to get the number of users in each category (identified by its ID), plus an e-mail. Which e-mail? I don’t remember whether MySQL returns the first or the last one in each group, but it does not complain and it returns something. Postgres will complain because it sticks to the SQL specification.

Following the spec is great, but sometimes it gets in the way when you need to get things done. Let’s imagine you need the user counts per group with one representative e-mail. The classic solution is to be explicit and select MIN(email) or MAX(email).
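Something along these lines (a sketch against the same hypothetical users table; MAX(email) would work equally well):

SELECT 
  category_id, 
  COUNT(*), 
  MIN(email) AS email
FROM users 
GROUP BY category_id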

Now comes a new requirement: you also need a representative username. You add MIN(username) and you’re done. Until someone notices that the username and e-mail don’t go together: they belong to different users!

Huh, it makes sense, because each column is independent, but here you need them to be dependent. Some head scratching and StackOverflow browsing later, you might come up with:

WITH counts AS (
  SELECT 
    category_id, 
    COUNT(*) AS cnt, 
    MIN(id) AS id
  FROM users 
  GROUP BY category_id
)

SELECT 
  counts.category_id,
  counts.cnt,
  users.email,
  users.username
FROM counts
JOIN users ON (counts.id = users.id)

Select one user ID per category in the CTE and then join back to the users table to fetch the extra columns. Yay, you made it.

Then someone comes up with a small tweak: “having just one sample user is confusing, let’s show three and … so that users understand it’s just a sample, that’s easy, right?” This is the point where you start ripping your hair out.

Lucky for your hairline, Postgres has offered window functions since version 8.4 (2009). In short, they are a set of functions which perform a calculation across multiple rows (like aggregate functions) but don’t reduce them to one output row (unlike aggregate functions). With these functions, you can for example get the row number, rank, percentile, first, last, or n-th value within a group. If this doesn’t make much sense, go ahead and watch a great Egghead tutorial about window functions.
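To get a feel for the difference, here is a minimal sketch against the same hypothetical users table: the aggregate collapses rows, the window function keeps them.

-- aggregate: one output row per category
SELECT category_id, COUNT(*)
FROM users
GROUP BY category_id

-- window function: every user row is kept, the per-category count is attached to each of them
SELECT category_id, email, COUNT(*) OVER (PARTITION BY category_id)
FROM users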

Once you understand window functions, the last requirement becomes manageable:

WITH indexed AS (
  SELECT
    category_id,
    email,
    username,
    row_number() OVER (PARTITION BY category_id ORDER BY email)
  FROM users
),
counts AS (
  SELECT
    category_id,
    COUNT(*)
  FROM indexed
  GROUP BY category_id
)

SELECT 
  indexed.category_id,
  email,
  username,
  counts.count
FROM indexed
JOIN counts ON (counts.category_id = indexed.category_id)
WHERE row_number <= 3

The query is split into three steps. First, it creates the indexed relation by selecting the desired columns plus a row number from users. row_number() is one of the window functions; it returns the sequential number of the current row (starting from 1) within the group specified by PARTITION BY, in the order given by ORDER BY. It will go 1, 2, 3, …, 1, 2, 3, …, resetting every time the group changes.
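A tiny self-contained illustration of that resetting behaviour (made-up rows, runnable as-is in psql):

SELECT
  category_id,
  email,
  row_number() OVER (PARTITION BY category_id ORDER BY email)
FROM (VALUES
  (100, 'a@example.com'),
  (100, 'l@example.com'),
  (200, 'c@example.com')
) AS t(category_id, email)
-- row_number: 1, 2, 1; it starts over when category_id changes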

The second query simply calculates counts in each category.

Finally, the last SELECT puts these two relations together and filters the result to at most three rows per category. The result might look like:

| category_id | email         | username | count |
|-------------|---------------|----------|-------|
| 100         | a@example.com | Anton    | 54    |
| 100         | l@example.com | Liza     | 54    |
| 100         | p@example.com | Pauline  | 54    |
| 200         | c@example.com | Charles  | 2     |
| 200         | m@example.com | Maggie   | 2     |
| 300         | d@example.com | Diana    | 42    |
| 300         | k@example.com | Kelsey   | 42    |
| 300         | l@example.com | Lauren   | 42    |

Category 200 has only two users, so both are returned; the others have more (admittedly with a strange distribution of e-mail first letters). The count per category is returned alongside every user. In real life, I’d probably separate the counts query, but that might be just my prejudice; test it on your data.

Window functions are wonderful and there are many situations where they come in handy, if you remember they exist. However, if you are working with a database backing a Rails application, you will probably run into one additional complication: time. In case you haven’t noticed yet, time is hard.

Last week, I had a task involving both window functions and time. The assignment was to export the first (or last) reading of each month, day, or hour. Ignoring the time problem for a moment, the query to fetch the first reading of every day is simple:

WITH indexed AS (
  SELECT
    meter_id,
    state,
    time,
    row_number() OVER (PARTITION BY meter_id, date_trunc('day', time) ORDER BY time)
  FROM readings
  WHERE meter_id IN (...)
)

SELECT 
  meter_id,
  state,
  time
FROM indexed
WHERE row_number = 1
ORDER BY time

You create the query in Rails (probably using Arel), try it, and the result looks strange. All your first readings of the day are actually after two in the morning. What went wrong?

Rails stores datetime columns in Postgres as timestamp without time zone. Before insertion, every time value is converted to UTC by Rails automatically. After retrieval, every value is converted back to your application’s time zone. When you store Time.zone.parse('2017-09-17 00:00:00') from Rails with CET/CEST as the time zone, the literal you’ll see in the database will be 2017-09-16 22:00:00. When you fetch it again, it won’t be a problem because Rails will convert it back to CET/CEST as needed. If you use a time condition in a query via the ActiveRecord API, it will be converted as well. Rails takes care of everything!
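A quick Rails console sketch of that round trip (assuming config.time_zone is set to 'Prague'):

t = Time.zone.parse('2017-09-17 00:00:00')
t      # => Sun, 17 Sep 2017 00:00:00 CEST +02:00
t.utc  # => 2017-09-16 22:00:00 UTC, the literal Postgres ends up storing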

But when you tell Postgres to group readings by day, it groups them by the day it can derive from the timestamp (without time zone), which is actually the UTC day. Rails then behaves nicely and converts the UTC time to CET/CEST time.

So the problem is to get Postgres to actually group readings by CET/CEST day: interpret the timestamp as UTC, convert it to the corresponding moment in CET/CEST, and group by that.

The first piece of the solution is the AT TIME ZONE construct. Among other things, it allows converting a timestamp to a timestamp with time zone. This means that if a reading has time 2017-09-17 02:00:00 and we retrieve it as time AT TIME ZONE 'UTC', the returned value will be 2017-09-17 02:00:00+00:00: the literal is taken and interpreted as being in the given time zone. So we added time zone information to the raw value.
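You can see the construct in isolation in psql (a minimal sketch; the displayed offset depends on your session’s time zone, assumed here to be UTC):

SELECT TIMESTAMP '2017-09-17 02:00:00' AT TIME ZONE 'UTC';
-- => 2017-09-17 02:00:00+00

Putting this into the query: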

-- ...
    row_number() OVER (PARTITION BY meter_id, date_trunc('day', time AT TIME ZONE 'UTC'))
-- ...

It will look nice when you inspect it in psql or some admin tool: every day will have its first reading a few minutes after midnight. But when you run it in Ruby (on Rails) and check the values, all the readings will again be right after two in the morning. Well, of course, we cut the readings into UTC days but we wanted CE(S)T days! We can’t say AT TIME ZONE 'Europe/Prague' because that would interpret the zone-less times as being CEST already, group them as such, and once fetched and converted, the results would be off again¹.

It is possible to convert a timestamp with time zone to a time in another time zone, but the process is surprisingly user-unfriendly (the time needs to be decomposed into components and a new time created). A working solution is to change the internal time zone of Postgres. It can be done at the session level by SET TIME ZONE 'Europe/Prague';. The zone-less time will still be interpreted as UTC, but it will be grouped by CE(S)T days. We don’t need to modify the returned value and can let Rails do its thing as usual.

The complete query will therefore be:

SET TIME ZONE 'Europe/Prague';
WITH indexed AS (
  SELECT
    meter_id,
    state,
    time,
    row_number() OVER (PARTITION BY meter_id, date_trunc('day', time AT TIME ZONE 'UTC') ORDER BY time)
  FROM readings
  WHERE meter_id IN (...)
)

SELECT 
  meter_id,
  state,
  time
FROM indexed
WHERE row_number = 1
ORDER BY time
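From Rails, the session-level SET TIME ZONE can be issued on the same connection right before the query itself; a rough sketch, assuming the readings table and columns from above (the meter IDs are placeholders):

# Both statements go through the same ActiveRecord connection, so the
# session setting applies to the query that follows. The setting stays
# on this connection until changed; reset it afterwards if that matters.
conn = ActiveRecord::Base.connection
conn.execute("SET TIME ZONE 'Europe/Prague'")
rows = conn.select_all(<<~SQL)
  WITH indexed AS (
    SELECT
      meter_id,
      state,
      time,
      row_number() OVER (PARTITION BY meter_id, date_trunc('day', time AT TIME ZONE 'UTC') ORDER BY time)
    FROM readings
    WHERE meter_id IN (1, 2, 3)  -- example IDs, substitute your own
  )
  SELECT meter_id, state, time
  FROM indexed
  WHERE row_number = 1
  ORDER BY time
SQL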

There are two more caveats worth mentioning. First, this query assumes that all returned readings belong to the same time zone. If that’s not the case, you’d need to do more conversions. The second point is about time zone names. It is clearly described in the Postgres documentation: UTC, CET, EEST and the like are just aliases for fixed offsets; if you need Postgres to be DST-aware, use IANA time zone names like Europe/Prague or America/New_York.

The final takeaway is: time is hard, but it can be worked with (if you really, really need to).


  1. This assumes that your database runs with UTC as its default time zone; you can check this simply with SELECT NOW(). If your database uses another time zone, the results will be shifted accordingly.

