I’m currently working with the Ordnance Survey Code-Point Open data, which contains the 1.7M UK postcodes and their geo-coordinates. To import it into MySQL, I convert the Easting and Northing units into latitude and longitude,
TRUNCATE the database table, and
INSERT them in.
This is rather slow, so I’m trying to speed it up by batching rows together (MySQL permits multi-row insertions) and by wrapping several database calls in a transaction. Nevertheless, it still takes 40 minutes to complete, and the conversion calculations put a great deal of load onto the processor (admittedly I put in a few microseconds of delay deliberately, so as not to swamp the import machine entirely). Since this data is republished quarterly, it would be good to have a quicker way to do it. What would be really good is a differencing import: we determine only what has changed, and add or delete those differences.
I’m thinking something along these lines:
- Unzip the supplied zip file into a folder named using the current date
- Do a normal (slow) import here
- Some months later, unzip the new zip file into a folder using the new date
- Run a Unix
sorton each of the old and new CSV files (we may be able to skip this if they are already sorted)
- For each sorted file, do a Unix
diffbetween the old and new files
- For each deleted line, remove the postcode
- For each added line, add the postcode
This should bring the import process down to twenty seconds or so. I expect others will find this very useful, so if I write this, I will — of course! — open source it.
Updated 25 April
The above would work if the current postcode table is an exact reflection of the previous import. However, for cases where it may have deviated (perhaps by manual editing) I have another algorithm in mind:
- Run a hash summing algorithm over the data to be imported, broken down into buckets based on the first X characters. A very rough version would be to use a bucket for each first letter and create a sum hash for each block (in practise I’d use the first three or four letters/digits)
- Run the same over the live data
- Find hashes that do not match, and inside a transaction delete the live data and replace it with the new import data)
Some experimentation would be required to see what the optimal number of buckets would be. The first three or four characters before the space (the “outcode”) would be a good start: this is around 3,000 buckets. This approach is slow in a read-only phase, and reasonably fast in a write phase.