Postgres column tetris

Enectiva deals with non-trivial amounts of data on daily basis. That’s not uncommon in any way, in the grand scheme of things it’s still nothing. However, it’s in the region where you need to think about the data and can’t just slap it into the database without performance suffering.

We are not Postgres experts but we take time and effort to learn about it so that we can leverage it efficient. Some say “storage is cheap, don’t waste your time”. They’re correct in a sense but throwing storage at the problem isn’t a magic bullet as noted by Michael Nygard in Release It!. If you take the time to learn about the database you can easily tweak configuration options, access patterns, or data structure and gain huge benefits. One of the changes with absolutely no effect on application1 is column tetris. Skip to tl;dr.

Column tetris

The idea is simple: every bit of data in Postgres has a data type and each data type occupies some chunk of disk space (it also limits the values that can be stored). The size of most types is fixed (for example smallint uses 2 B, integer 4 B, bigint 8 B, docs), but there are some with variable storage requirements as well(text, and bytea are the obvious examples). When the data is stored, the order of the columns matter. Considerate ordering can thus save space.

Each row of a table is stored in a sequence of bytes. It starts with a header containing information about the row and then comes the actual data. Values of columns with fixed size data type are laid out in themselves, for larger types only a pointer to the data itself somewhere else is stored (this allows the database to skip around easily while knowing how many bytes it can jump over regardless of the values themselves, it also means that retrieval a string requires another seek).

The rules

Three facts are important when playing column tetris:

  1. The row header is 23 B long2.
  2. Values have to be aligned; each data type has an alignment requirement, i.e. what multiple of bytes it can start at.
  3. When a NULL-able column exists in a table, NULL mask is created with 1 b per column (even the non-NULL-able ones).

Row header is pretty clear and you cannot do anything about it, 23 B it is.

Alignment

Alignment is trickier, but for the built-in types it’s simple: a value must fit within a word of length MAXALIGN = 8 B (on 64bit machines), it cannot be split between two3. To illustrate, let’s see some possible layouts of columns a (2 B), b (8 B), c (4 B), and d (8 B):

  +-------+-------+-------+-------
A a-b-------c---d-------
B a-b-------c---d-------__
C a-______b-------c---____d-------
D b-------d-------c---a-__

a, b, c, d - values of corresponding columns
+ - marks every eighth byte
_ - padding

Layout A shows the columns tightly packed without padding: a beginner might sum up the sizes of the columns and think that each row will occupy 22 B like this. Layout B shows what a beginner who read about alignment might think if she only understood that each row has to be aligned at the end (24 B). Layout C shows how the data will be actually laid out, occupying 32 B. Layout D is a result of playing column tetris, size is back on 24 B.

Columns b and d are 8 B long and because of the alignment requirements they have to start at positions 0, 8, 16, 24 etc. Layouts A and B would have column b start at position 2 violating this requirement (the value would cross the boundary of 8 B words).

NULL masks

Knowing about NULL mask doesn’t give you much leverage in column tetris but it can save you some time when trying to understand your database. When all columns of a table are required (NOT NULL) NULL mask is not used, row header has 23 B followed by 1 B of padding and then the data itself. If, however, at least one column can be NULL, a NULL mask gets attached to each record.

NULL mask is a bit string where each bit corresponds to one column (in order). 1 means the column has a value, 0 means it’s NULL (thus it doesn’t need to be stored!). If the table has up to eight columns, the NULL mask will fit within a byte and this byte can be stored right after the 23 B of header, basically for free. If the number of columns exceeds eight, the NULL mask grows accordingly. The complication is that the data can start only on multiples of MAXALIGN meaning that padding is inserted after the header. This nightmare scenario can therefore be concocted: a table with 9 smallint (2 B) columns4, one of which can be NULL:

+-------+-------+-------+-------+-------+-------+-------
H----------------------N-_______a-b-c-d-e-f-g-h-i-______

H - Row header
N - NULL mask

23 B header is followed by NULL mask occupying 9 bits. Mandatory data alignment means 7 bytes and 7 bits of padding before data, which itself is 9 * 2 B, leaving 6 B of padding at the end. The total length is 56 B with 15 B of padding (nearly 27 %). In this scenario it would be very tempting to get rid of the one NULL column (and save 16 B) but the column probably has a meaning which differs from 0.

In our situation one of the tables needed 9 columns but the four NULL-able columns were bigints meaning that NULL mask costs 8 B per row while four bigints with 0 (which in our case has the same meaning as NULL) would cost 32 B. These columns are NULL 75+ % of the time so the NULL mask is cheaper.

tl;dr

Playing column tetris is worth only on tables with millions of rows where few bytes per row result savings in order of gigabytes. As a rule of thumb: order the columns by storage size in descending order and remember that adding a column does not mean each row will grow just by its size.

This article is a compilation of Postgres documentation, many very detailed answers on DBA StackExchange by Erwin Brandstetter and a bit of our experience.


  1. This assumes that the application always lists SELECTed columns by name and does not rely on column order. ↩︎

  2. 23 B is normal size which can differ. Moreover, the header might contain OID as well. I’m not sure when one would need it and how to use it, but it would throw the calculations in this post off. ↩︎

  3. This is not a definition, but I find it helpful to think in these terms. Also, I have no idea what happens with 8 B long values on 32bit machines with MAXALIGN = 4 B. ↩︎

  4. The smallest data type is boolean with 1 B, but it’s hard to come up with sensible NULL-able boolean columns. ↩︎

We're looking for developers to help us save energy

If you're interested in what we do and you would like to help us save energy, drop us a line at jobs@enectiva.cz.

comments powered by Disqus