I was talking to
Johan some time ago and he mentioned working on
Temporal Databases. At the time it sounded interesting, but I didn't give it much more thought. Lately I've started thinking more about it, and it's such a simple yet ingenious concept. Instead of deleting or modifying data you'll mark the old data as invalid and insert the new valid data. Of course, in a true temporal database, this is largely transparent to the application. It's useful in pretty much any context. Of course there are exceptions, where it's just unneeded overhead, but usually there's no reason to not keep historic data around. Especially these days when disk space is so cheap. (BTW, did you see the new
2 TB 3.5" disk from Western Digital?)
Many applications (though far from all or even most) keep logs over when things change, so if there is some problem you can backtrack through the logs and if you are lucky see what changed, and if you are even more lucky you might even be able to see what or who caused the change, and why. The data most likely isn't machine readable any more though, and it takes a lot of manual labour to figure out the history of any given object.
Temporal Database solve this at the database layer, and can be implemented in an existing SQL database using a couple of views, triggers, rules and PLSQL functions. What you need is a couple of extra columns. A
start column that marks the first moment in time when the record became valid, and an
end column that marks when the record became invalid.
The
start column is always set and defaults to
current_time. The
end column can be
null though, as would be the case when you first insert a new record in the database. When you do an UPDATE a trigger catches your query and changes it into two queries: The existing row only gets one row changed, the
end column, which is set to
current_time - then a new row is inserted with all the data from the existing row, after your changes have been applied, and
start set to
current_time and
end set to
null. The procedure for a DELETE is a bit more simple. A rule is created that traps the DELETE statement and replaces it with an UPDATE that sets
end to
current_time.
Another thing you'll need to keep this logic completely contained in the database is a view for the currently valid data in the table, which simply selects all the rows where
end is
null. To make things even more transparent you could probably add some rules and triggers to this view which acts on the real table, and then only use this view for everything in the application, though I have yet to try this out.
The above setup is pretty sweet since everything happens automatically. It does have a couple of drawbacks though.
- You have to manually set up all the views, triggers, rules and functions for each and every table.
- When you make changes to your tables, you will need to make the corresponding changes in the views etc.
- If you ever need to purge historic data things get complicated.
The last point is something I personally like. I wouldn't want it to be easy to permanently delete data from a database set up this way. It's however a very important point here in Sweden when dealing with information about living persons since "
Personuppgiftslagen" ("The Personal Data Act") applies. One paragraph of the law states that personal information may not be stored longer than necessary. This is obviously an issue if the database system intercepts any DELETE statements and blocks them. One way of working around this would be to disable the delete rule while removing the old data, which could probably be implemented as a PLSQL function. However, there are probably other parts of the law that applies too, e.g. keeping historic records of a persons addresses as they move around may be storing too much information. It'd be interesting to have a lawyer look at that.
While the advantages in keeping all the temporal logic inside the database are numerous I'm considering the alternative of implementing it all in the application layer. Since I started looking at temporal databases I've experimented with both versions, and I'm torn between them, and I think that will always be the case.
I'm currently working on a new member database for
ULUG which will employ the temporal database concepts, and I've decided to go with the application layer approach, and I'll be extending the concept a bit. I've added two columns for
editor and
logmessage. The
editor column will contain the user id of the person who performed the last record update, and the
logmessage will allow the editor to comment on his changes for future reference. It can also be used to specify things automatically, such as "User changed their password". I've also added a column named
revision. With every record update it will be incremented by one. This will make it easier to refer to a specific update of an object, for example "User ID 1, revision 5" instead of "User ID 1, as it looked at 2007-04-06 15:45:32". I also decided to use this in combination with the record ID as the primary key, instead of a more complex key using the
start and
end timestamp columns.
Every page in the system displaying data that come from a temporal source will automatically display a small clock that when clicked allows the user to pick a date and time and show the object as it looked at that point in time. When an old revision is being displayed, the background will change and a warning text will appear at the top of the page to make it clear without a doubt that this data is no longer valid. There should also be some kind of way to show a log of all the revisions of an object, though I'm not completely done with the thought process on the specifics of this feature.
My goal is to implement the temporal aspects of this as a special class that holds all the temporal logic. Any other objects, e.g. Member, WikiPage, etc, should be able to inherit from this object and automatically get the benefits of temporal data without having to manually add the extra code as would be the case when it is implemented in the database layer. The only thing needed are the extra columns in their respective table.
If I can get that to work nicely I'm sure I'll end up using the application layer approach much more than the database layer approach. The drawback of course is that an application error could potentially erase data, but that's a problem with regular databases as well and I've managed to avoid any such bugs in my code so far. And if the worst case would happen, that's where my point-in-time-recovery backup system enters the picture!