JonBlog
Thoughts on website ideas, PHP and other tech topics, plus going car-free
Relational database versioning strategies
Categories: Meshing, Projects
Update, June 2013: I should love to continue with Meshing, and I still regard the basic idea as having great potential. However I have been rather pressed for time to work upon it, so it is effectively on hold. Do still get in touch if you are interested in collaborating - your interest may well re-spark mine.

Introduction

One of the challenges in Meshing is to determine how to store previous data in a database. How many reads and writes need to be carried out is important, as this will have a direct impact upon performance. I’ll document some of my ideas here, mainly to solidify my understanding of it – and also because it may be of interest to others.

The first design decision is: one table or two? Let us consider a trivial table for Person, which has columns (id, name, email). To store versions within one table, we can take the primary key and add a version integer into it. We should also add a soft-delete timestamp (so records are only marked as deleted, rather than actually removed) and some sensible update metadata. Hence our table ends up thus:

person table
id integer (pk)
version integer (pk)
name varchar
email varchar
created_at timestamp
updated_at timestamp
deleted_at timestamp

The biggest problem with this is performance: if records are updated several times each, all copies are stored in the one table. Ergo, a table ostensibly with 100K rows would in fact contain 1M rows, if each current record had nine other versions. Unsurprisingly, joining to such a table can get quite expensive when compared to the performance of a unversioned table. It also adds a layer of complexity to all queries on the table (see here). The Propel ORM team has also noted several problems with the approach, and recommends an archive table instead (i.e. when a record is updated or deleted, its current state should first be copied to another table). Performance-wise, this means the “current” table will have the same performance as an unversioned table, and while the archive table will still be relatively slow, in most cases it is required much less frequently.

Having two tables gives rise to a new schema, which could look like the diagram below:

person_current table
id integer (pk)
name varchar
email varchar
person_version table
id integer (pk)
version integer (pk)
name varchar
email varchar
updated_at varchar
deleted_at varchar

There is no longer a created_at column, since this can be derived by looking up person_version.updated_at where person_version.version = 1. You’ll also notice the original table doesn’t have any metadata columns – this is an interesting topic that I’ll return to later. But first, I’ll describe a couple of strategies for storing versioned data, together with their advantages and disadvantages.

‘Version current’ strategy

Version Current is probably the most obvious approach to record versioning, if the current table is to have no metadata. For the initial insert, we simply insert a record into the current table, and duplicate it into the version table:

Stage Table contents
Insert person_current person_version
id name email
1 Jon jon@
example.com
id version name email updated_at deleted_at
1 1 Jon jon@
example.com
2011-11-26 15:52 NULL
Update
id name email
1 Jon jonny@
example.com
id version name email updated_at deleted_at
1 1 Jon jon@
example.com
2011-11-26 15:52 NULL
1 2 Jon jonny@
example.com
2011-11-26 15:55 NULL

This permits a previous version’s data to be stored in the same record as its metadata, but requires the first copy of each record to be redundantly duplicated. That’s fine for small records, but if we add in some BLOBs, the storage inefficiency may increase quite a bit.

‘Version previous’ strategy

This strategy requires the version table to store the previous state of the row. This means that if we insert a new current row, we also insert blank values into the version table. This approach allows us to capture metadata for our initial insert in the same way as updates and deletes, given that metadata is not being stored in the current table.

So, inserting a record and then updating the email address could look like this:

Stage Table contents
Insert person_current person_version
id name email
1 Jon jon@
example.com
id version name email updated_at deleted_at
1 1 NULL NULL 2011-11-26 15:52 NULL
Update
id name email
1 Jon jonny@
example.com
id version name email updated_at deleted_at
1 1 NULL NULL 2011-11-26 15:52 NULL
1 2 Jon jon@
example.com
2011-11-26 15:55 NULL

This doesn’t have the redundancy on initial inserts that Version Current suffers from, but introduces some interesting effects. Firstly, if we want to retrieve version 1, we need to read the metadata from person_version.version = 1 and the data from person_version.version = 2. Most previous version retrievals will therefore require a single select returning two rows, using a clause like WHERE id = i AND version IN (x, x+1). However, a minor inefficiency is introduced, since reading version 2 (with metadata) will require two selects: one on the version table, and then upon discovering that is the last versioned copy, a second one on the current table. Reading the current record (with metadata) also requires two selects.

It is worth noting also that columns defined as non-null in the original schema must be made nullable in the version schema, to permit the nulls in the first record. Whilst this may be invalid from an application perspective, in practice it is enforced by the not-nulls in the current table.

So, this strategy is more space-efficient than Version Current, but less time-efficient. Whilst here we may need to read from both tables in some cases, Version Current allows us to just read from the version table for versionable operations (leaving the current table exclusively for the use of the application layer).

Metadata approach

So, should version metadata be added to current records? I originally omitted them because I felt it might present problems to pre-existing applications that access tables directly, but if it can be made to work, doing this might garner small overall performance improvements. Let us consider the Version Previous strategy, with the addition of metadata to the current table. Our schema now looks like this:

person_current table
id integer (pk)
name varchar
email varchar
version integer (pk)
updated_at timestamp
person_version table
id integer (pk)
version integer (pk)
name varchar
email varchar
updated_at varchar
deleted_at varchar

Note that the current table doesn’t contain deleted_at. This is because during row deletion, it is hard-deleted from the current table in order to avoid the soft-delete issues referred to earlier. Instead, we add a new row to the version table, with a new version number and with the deleted_at timestamp populated.

Let us do the same insert and update as before. Since we now have metadata columns in the current table, it would be interesting to see if it works without a version for the initial insert:

Stage Table contents
Insert person_current person_version
id name email version updated_at
1 Jon jon@
example.com
1 2011-11-26 15:52
No rows
Update
id name email version updated_at
1 Jon jonny@
example.com
2 2011-11-26 15:55
id version name email updated_at deleted_at
1 1 Jon jon@
example.com
2011-11-26 15:52 NULL

With this design, retrieving the current row will obtain the related metadata without extra cost. But, getting a specific version may still need two selects – we would arguably check the version table first (since there will usually be more old versions than current ones) and if it’s not there, we’d check the current table.

However, doing an update requires up to four separate SQL statements. We would start by selecting a current record matching a primary key. If this exists, we can simply add one to the version number from that row, insert a new row in the version table, and update the data and metadata in the current row. However, if it doesn’t exist, we also need to check the version table, since it may have been hard-deleted from the current table.

Also, this approach doesn’t store a creation date for the record, so let us try a new schema that includes it. If it is added to both person_current and person_version, it will be available in every record at no extra operation cost, even though this incurs a little data redundancy:

person_current table
id integer (pk)
name varchar
email varchar
version integer
created_at timestamp
updated_at timestamp
person_version table
id integer (pk)
version integer (pk)
name varchar
email varchar
created_at varchar
updated_at varchar
deleted_at varcha

Furthermore, the update inefficiency can be fixed by creating a version record upon current insert again:

Stage Table contents
Insert person_current person_version
id name email version created_at updated_at
1 Jon jon@
example.com
1 2011-11-26 15:52 2011-11-26 15:52
id version name email created_at updated_at deleted_at
1 1 NULL NULL 2011-11-26 15:52 NULL NULL
Update
id name email version created_at updated_at
1 Jon jonny@
example.com
2 2011-11-26 15:52 2011-11-26 15:55
id version name email created_at updated_at deleted_at
1 1 Jon jon@
example.com
2011-11-26 15:52  NULL NULL
1 2 Jon jon@
example.com
2011-11-26 15:52 2011-11-26 15:52 NULL

In fact, it turns out that storing metadata in the current table obtains only limited advantages. True, it does mean that reads on the current table get some metadata at no extra cost. But in the above example, we still need the extra select to read some rows; that is to say, Version Current offers reasonable performance for versioned operations without the complexity of metadata in the current table.

Let us take as an example the proposal to replicate a WordPress site inside a peer-to-peer mesh, perhaps to create a ‘superblog’ for citizen journalists. If I was implementing the project, I’d rather not create metadata columns into WP’s database, even if it could be shown to work. Would plugins work nicely with it? Would it survive a WP upgrade process?

For this reason, my preference for this project is to omit metadata from current records. There’s less columns to worry about, and no metadata redundancy that could lead to internal inconsistencies. As it happens, the code is presently heading towards Version Previous, because I’d initially balked at the data redundancy in Version Current. But the latter offers a performance win for versioned ops, and disk storage is cheap, so I may change my mind. In fact, Meshing could offer a storage strategy choice in the same way as I’m doing for hashing. This would give us a simple mechanism to unit-test each type, and to compare performance with real-world testing.

Leave a Reply