Since the beginning of the year, I’ve on-and-off been writing a paper for a software idea under the above title. I thought it might be useful to bounce some ideas around in a public forum, and to have a lightning conductor from the search engines, in case anyone is working on anything similar. At 31 pages presently – and still with plenty of highlighted placeholders – it might be a while yet before the paper is in a reasonably presentable form. So, here’s a taster.
My proposal is an open-source package that permits individuals to create and maintain a cheap, distributed, de-centralised, crowd-sourced#, scalable, relational database (they’re not buzzwords; they all mean something, honest!). Whether several technical hurdles can be overcome, whether it would be useful in practice, and whether it will generate sufficient interest to be viable are not all answered at this point.
Here’s how it might work. An admin sets up the software on their internet-connected server, and fires up the web-based user interface. They design a database with the usual features – tables, columns, simple data types and constraints – and then start entering data. Data could be entered via a generic data editor or the owner can develop their own software to talk to the database. When the dataset grows and becomes interesting to an unrelated group, they in turn can set up the software on their own server, and create an instance of the same database by specifying its URL. Data in the first system will then start replicating to the second, and thereafter any data entered into one will be replicated in the other, all on a safe, versioned basis. As the dataset scales, so in theory does the distributed infrastructure, and hence also the number of contributors. Each “node” contributes to the dataset in varying amounts – from nothing at all (i.e. just replicating) to highly active – and their interconnectedness forms a “mesh” that has no central point and no single point of failure. Underlying each installation would be a popular database like MySQL, with a DBAL employed to avoid platform lock-in.
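As a rough illustration of the replication described above, here is a minimal in-memory sketch in Python. The names `Node`, `RecordVersion`, `write` and `pull_from` are hypothetical and not taken from the paper. It also assumes that the same row is never edited on two nodes at once, so conflict resolution is left out entirely, as are the network transport and the underlying MySQL storage.

```python
from dataclasses import dataclass, field


@dataclass
class RecordVersion:
    key: str        # primary key of the row
    version: int    # increases by one with every edit to this row
    data: dict      # column values at this version
    origin: str     # name of the node where the edit was made


@dataclass
class Node:
    name: str
    # history[key] holds every version of that row, oldest first
    history: dict = field(default_factory=dict)

    def write(self, key, data):
        """Record a local edit as a new version; older versions are kept."""
        versions = self.history.setdefault(key, [])
        new = RecordVersion(key, len(versions) + 1, dict(data), self.name)
        versions.append(new)
        return new

    def current(self, key):
        """Return the latest column values for a row."""
        return self.history[key][-1].data

    def pull_from(self, peer):
        """Copy any versions the peer holds that this node lacks.

        Assumes no concurrent edits to the same row (no conflict handling).
        """
        copied = 0
        for key, peer_versions in peer.history.items():
            mine = self.history.setdefault(key, [])
            for version in peer_versions[len(mine):]:
                mine.append(version)
                copied += 1
        return copied


first = Node("first")
second = Node("second")
first.write("job-1", {"title": "Developer"})
first.write("job-1", {"title": "Senior Developer"})
second.pull_from(first)                      # second now holds both versions
second.write("job-2", {"title": "Tester"})
first.pull_from(second)                      # the new row flows back
```

Because every version is kept rather than overwritten, either node can show the full update history of a row, including which node each edit came from.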
There are several reasons why I think this would be useful. Firstly, keeping large datasets up-to-date with data is expensive, and so organisations that aren’t well funded are unable to undertake projects that would be beneficial to them. Note that setting up an online database in itself is not prohibitively expensive – MySQL or PostgreSQL are free, and will hold millions of records even on a cheap VPS. But populating large tables is expensive, whichever way one chooses to do it; either one does it in-house, which is onerous in terms of paid labour, or one crowd-sources the data input within a single server infrastructure, which requires costly amounts of compute capacity to sustain the usual contributor/lurker ratio.
Secondly, when crowd-sourcing data input for small organisations, contributors want to know whether the combined efforts will reach the critical mass required to make the data useful. For example, if Wikipedia were to try to run with a small group of volunteer writers, it would be wasting its time; it only works because it has 86,000 or so editors. So, there is a certain “I will if you will” reticence, which is fair enough: no-one can be expected to undertake a lengthy on-going task if the effort is likely ultimately to be fruitless. In a similar way, no group interested in the maintenance of a particular dataset will contribute if they are uncertain of its long-term availability. If it is being hosted by an unconnected party, it may disappear without warning, again making any prior contributions a wasted effort.
Thirdly, there is often a range of competing groups who would benefit from a particular dataset, but the problem is that they all want to be its primary custodian. It could be supposed that this is human nature at work; everyone wants to be in charge, or the “go-to person” for a particular thing. It is much less common to find groups that are willing to be contributors to the cachet of their competitors, which is sometimes even true in the not-for-profit sector. The resulting fragmented and duplicated effort is less likely to near completeness than if all efforts were to be pooled and the results shared.
My proposal, I believe, solves all of these problems. Each installation is cheap to run; in theory, at least, any dataset worth maintaining will be separately hosted by groups who benefit from the data, and supporters who wish to help. This makes it ultimately scalable; if any one installation becomes overloaded (i.e. too many contributors or lurkers) its critical mass should encourage more installations elsewhere, and visitors who form the over-capacity will move there. Since every installation will usually* mirror the whole dataset, its long term availability is guaranteed even if some hosts drop out. Each node will determine which inserts and updates are required by the nodes connected directly to it, and will propagate them as required; every record is versioned, so we can examine its update history; and each node will have a web-based interface, so that peers can easily be added and removed.
(* For very large datasets, some nodes may choose to replicate just a subset of the data. Peer nodes would be automatically notified that they also need to peer with some full nodes if they want a full copy of the dataset.)
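One possible way for a node to work out which changes each directly connected peer still needs is to keep a local change log, plus a per-peer marker of how far through that log the peer has got. The sketch below assumes exactly that design. The names (`Change`, `Peer`, `pending_for`, `mark_sent`) and the idea of a per-peer `wants` filter for partial mirrors are illustrative assumptions, not something the paper specifies.

```python
from dataclasses import dataclass, field
from typing import Callable


def wants_everything(row: dict) -> bool:
    """Filter for a full mirror: every row is wanted."""
    return True


@dataclass
class Change:
    seq: int        # position in this node's local change log
    key: str        # primary key of the affected row
    version: int    # version number of the row after the change
    row: dict       # column values at that version


@dataclass
class Peer:
    wants: Callable[[dict], bool]   # subset filter chosen by the peer
    last_seq_sent: int = 0          # highest log entry already offered


@dataclass
class Node:
    log: list = field(default_factory=list)
    peers: dict = field(default_factory=dict)

    def record(self, key, version, row):
        """Append an insert or update to the local change log."""
        self.log.append(Change(len(self.log) + 1, key, version, dict(row)))

    def add_peer(self, name, wants=wants_everything):
        self.peers[name] = Peer(wants)

    def remove_peer(self, name):
        del self.peers[name]

    def pending_for(self, name):
        """Changes the named peer has not yet been sent and has asked for."""
        peer = self.peers[name]
        return [c for c in self.log
                if c.seq > peer.last_seq_sent and peer.wants(c.row)]

    def mark_sent(self, name):
        """Note that the peer has been offered everything logged so far."""
        self.peers[name].last_seq_sent = len(self.log)


node = Node()
node.add_peer("full-mirror")
node.add_peer("it-jobs", wants=lambda row: row["sector"] == "IT")
node.record("job-1", 1, {"sector": "IT", "title": "Developer"})
node.record("job-2", 1, {"sector": "Nursing", "title": "Staff Nurse"})
it_only = node.pending_for("it-jobs")        # just the job-1 change
everything = node.pending_for("full-mirror") # both changes
```

A real system would need more than this: for instance, a row that moves out of a peer's subset would have to be signalled to that peer somehow, which the filter above does not attempt.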
I think supporting cheap, shared hosts is key to this project, even if it will run much better and/or enable more features on a dedicated/VPS host. This has given exposure to open source software in the past, such as WordPress, and may serve to drive initial interest, system adoption and development involvement.
Some potential use-cases now follow:
- The example I use in my paper is a job advert database, say for “Professional Jobs in the UK”. This is of course a large and frequently changing dataset that is difficult to keep complete, even for large corporate owners. Presently such datasets are held incompletely and separately by competing recruitment agencies and third-party aggregator sites. The fragmentation that comes from the large number of silos results, unsurprisingly, in high levels of duplicated and stale data. Additionally, with no single reference repository, there are limited opportunities for third parties to improve current search tools (in particular, most of the current marketplace has a strong disincentive to offer a “no agencies” filter, for obvious reasons). Specialist nodes could just mirror, say, “IT Jobs” or “Jobs in the West Midlands”, if they wish.
- The system might also lend itself well to constantly-updated scientific data, to be edited by scientists and shared freely and publicly between centres of research. This may provide better statistical results than current efforts (due to having larger datasets), or might assist underfunded areas of research at reduced costs. In particular, medical case-study data could be shared on this system, as long as it is anonymised.
- Structured data that is subject to censorship would be a third suitable usage category for this software. Information from political or corporate whistleblowers, for example, could be added into any node and replicated quickly across a large number of redundant sites with minimal technical effort. This would make an ongoing dataset available to journalists and dissidents whilst also making it resistant to suppression, much as WikiLeaks initially did with their (unstructured) cable data.
- Buying research papers is presently very expensive, which creates a barrier to academic learning and public policy verification. Additionally, the closed nature of the most prestigious journals may be responsible for slowing efforts to synchronise the plethora of formatting requirements, which at present is an additional burden for authors. If the proposal set out here were to be employed, each paper could be connected to plenty of structured metadata, and the schema design could grow organically as more metadata for each paper is demanded. The most interesting item of metadata for a paper is arguably its citations, which in my system would just be a many:many relation from table “paper” to itself, via a joining table “paper_citation”. Each node might belong to a university or private research centre, with records inserted at the node whose name is on the paper. I would expect all nodes to mirror the full dataset, since the cost of maintaining the necessary amount of disk storage is (in my guesstimate) likely to be cheaper than all the necessary journal subscriptions. Finally, the open nature of the system should produce some very novel approaches to searching for papers, which currently needs to be carried out over several closed systems.
- It would be very interesting to see how marketplaces could be built using this technique. I believe B2B (business-to-business) systems already do this, but the dynamics are significantly changed by the low cost of the infrastructure, and by the fact that individuals can get involved in an existing trusted system with very low barriers to entry. This might assist with the problem of non-clearing markets, i.e. the fact that not every seller necessarily finds a buyer immediately (or, in some cases, ever). In particular, this approach might be of use in slow-moving markets, such as selling or renting property.
- Databases of public transport routes. I like this idea a lot, as public transport (certainly in the UK) is often run by competing private companies, each of which has a vested interest in not publicising the services of its competitors.
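To make the research-paper use-case above a little more concrete, the self-referencing many:many relation between “paper” and itself via “paper_citation” could look roughly like the following. This is only a sketch: it uses Python's built-in `sqlite3` module with an in-memory database purely so that it runs without any setup (the proposal itself mentions MySQL behind a DBAL), and the column names, sample titles and the `cited_by` helper are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves these off by default

conn.executescript("""
CREATE TABLE paper (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);

-- Joining table: each row says "citing paper cites cited paper"
CREATE TABLE paper_citation (
    citing_paper_id INTEGER NOT NULL REFERENCES paper(id),
    cited_paper_id  INTEGER NOT NULL REFERENCES paper(id),
    PRIMARY KEY (citing_paper_id, cited_paper_id),
    CHECK (citing_paper_id <> cited_paper_id)
);
""")

# Hypothetical sample data
conn.executemany(
    "INSERT INTO paper (id, title) VALUES (?, ?)",
    [(1, "Paper A"), (2, "Paper B"), (3, "Paper C")],
)
conn.executemany(
    "INSERT INTO paper_citation (citing_paper_id, cited_paper_id) VALUES (?, ?)",
    [(2, 1), (3, 1), (3, 2)],
)


def cited_by(conn, paper_id):
    """Titles of the papers that cite the given paper, in title order."""
    rows = conn.execute(
        """
        SELECT p.title
        FROM paper_citation AS c
        JOIN paper AS p ON p.id = c.citing_paper_id
        WHERE c.cited_paper_id = ?
        ORDER BY p.title
        """,
        (paper_id,),
    )
    return [title for (title,) in rows]


print(cited_by(conn, 1))
```

With the sample rows above, papers B and C both cite paper A, so the query for paper 1 returns those two titles.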
Replicating static datasets (such as the ones offered by AWS) would certainly work with this software, but given that row versioning is not required, rsync or native replication systems would usually be better suited to the task.
It is tempting to suggest that the online auction marketplace is ripe for this approach. However, sites like eBay need two things: instantaneous synchronisation between their servers, and secrecy between buyers – and this project specifically has neither.
The paper and/or a prototype will be presented here, though since it is a pet project, I am not sure when! In the interim, comments on desirability, feasibility etc. are very welcome indeed.
30 Aug: I came across an interesting article today about how academic papers are distributed; this becomes my fourth potential use-case.
31 Aug: added note about supporting shared hosts to encourage take-up.
2 Sep: added new notes on using this idea to create online notices for slow-moving markets, and for public transportation routes. Also commented on how the system would not be well-suited to running auctions, given its design.
12 Sep: # an email respondent suggests that crowd-sourcing is unavoidably a buzz-word, perhaps indicating that it might be helpful to define it. What I meant here was that members of a decentralised group all contribute in a way that is relevant to themselves; in so doing, they give that particular project a critical mass that it might not otherwise receive. Contributors take part because of some perceived pay-off, such as access to the full dataset for their own purposes, or in order to support the aims of the project.