Social Reconstruction of Public Transportation Information

The UK’s local public transport data is effectively a closed dataset. The situation in the US seems similar: In spite of the benefits, only a handful of agencies have released raw data freely (such as BART and TriMet on the west coast of America).

That hasn’t stopped “screen-scraping” of data or simply typing in paper timetables (from Urban Mapping to many listed here). Unfortunately, the legal basis for scraping is complex, which creates significant risks for anyone building a business. For example, earlier this year, airline Ryanair requested the removal of all their data from Skyscanner, a flight price comparison site that gathers data by scraping airlines’ websites. How many airlines would need to object to their data being scraped before a “price comparison” service becomes unusable?

User-generated mapping content is evolving, often to circumvent restrictive distribution of national mapping. Services include OpenStreetMap and the recently announced Google Map Maker.

Micro-blogging, primarily through Twitter, has started to show the potential of individual travellers to report information about their journeys: Ron Whitman’s Commuter Feed is a good example. Tom Morris has also experimented with London Twitter feeds.

This article outlines why the “social web”/tech-entrepreneur sector may wish to stop trying to use official sources of data, and instead apply the technology it understands best: People.

The Big Picture

I will use the example of UK local bus data to summarise the strategic issues for data providers. I can only presume the issues are similar elsewhere (comments welcome).

Explaining exactly who the data providers are is one of the many problems of trying to extract and use the data. I would provide more detail, but the topic is somewhat sensitive. The most critical link in the chain that constructs and distributes the data is the local authority: a sub-regional public body, typically one responsible for a large city, conurbation or county. Local authorities process the data, but are not under any statutory requirement to do so (no national government legislation requires it).

There are a number of issues for the existing data providers:

  1. Mindset of centralised control: Most operators, public authorities, and other agencies still have a mindset of centralised control of information, delivered to users via the method the agency believes is appropriate. This is heavily driven by the belief that only the agency can be accountable or impartial, and that incorrect information supplied by an uncontrolled third party is likely to damage the image of the local transport service and generally reflect badly on the agency.
  2. Mindset of local: Most agencies are locally focused, locally orientated. It seems logical for them to commission a fully-functioning website or piece of information delivery software that is specific to their city, because their target market is local. There’s a lack of global perspective: An agency will typically commission a system that is specific to their city, even when 95% of the features would work for any city, and 90% are already in existing global products.
  3. Not appreciating trends in delivery channels: There is still an attitude of “we’ll provide a website”, without a comprehension that the number of channels for delivering information (mobile devices, integration into social software) is growing far faster than any one agency can hope to build bespoke user interfaces for. There would probably be a market for a “WiFi-enabled” alarm clock that would ring later if your morning train had been delayed: We simply can’t define the limits for how this information might be used.
  4. Not appreciating trends in cost: Even large, well-funded agencies are starting to fall behind the technology. The cost of systems (many millions of dollars invested year on year in some cases) is starting to hurt. Logically the global system should win out, because one city is very much like another: There is considerable scope for sharing systems costs.

What It Means

Long term we are heading for global providers of information that pool data from local sources. That will be forced by the cost of technology. Technology costs have driven similar consolidation elsewhere: Agglomeration in the groceries sector (such as Walmart) over the last 30 years, and the move from customised mainframe computing to shared operating systems and platforms (such as Windows). The pressure on transport information will be greater, because the number of systems is exploding at the same time as the complexity of those systems.

As these issues become progressively better understood, data will become more centralised. Even in agencies where (in my opinion) uniqueness and absolute control are culturally inbred, such as London Transport/TfL, cost will eventually win the argument.

However, centralised data handling does not automatically make the data open. Quite the opposite.

Contracted Provision

Currently, effective control of data is with local government. Many individuals within local government will naturally attempt to block any change that might leverage power away from them and their organisation. “Job protection” is an over-simplification, but helps explain the underlying position. But by contracting data handling and presentation to a third-party contractor, local government would gain the technological “economies of scale” (assuming the contractor won many contracts from different authorities) and notionally maintain control.

Use of third-party contractors is already common within the local government sector, particularly for Information Technology.

An example can be seen in Edinburgh City Council’s Traffic Map. In spite of how it appears, the information isn’t powered directly by Edinburgh City Council or Google. Instead it is part of Mott MacDonald’s Common Data Management Facility, providing services under contract to many different local authorities.

In the UK public transport arena, Trapeze is a good example of the gradual agglomeration of data handling within a few large businesses, where historically many small software providers could be found.

The example above provides key driver information, and is somewhat useful, but is it the best outcome? I suspect not. Contracts tend to be priced highly, because local government clients are high risk: Their political control means that they can change their strategic direction and requirements unexpectedly. At best, customer feedback loops through local authorities are slow and politicised. At worst the design of the system will reflect the arbitrary views of a self-proclaimed expert (such as myself). Even if you think it is perfect, there is no scope for choice or creativity. Choice is good and need not be expensive.

Social Provision

Instead of using official data, why not let users reconstruct it? User-generated content is cheaper to create than information from professionally staffed sources: Since very many contributors do so little work, no individual expects payment. User-generated content can be just as accurate too, although this is not automatic: For example, a strong community will subject everything to peer review, weeding out poor information and contributors.
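As a rough illustration of how peer review might weed out poor information, the sketch below accepts a reported delay only once several different contributors have reported on the same service, and uses the median so that a single wild report has little effect. This is an assumption about one possible approach, not a description of any existing service: the `Report` fields, the `corroborated_delays` function and the threshold of three contributors are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Report:
    contributor: str    # hypothetical user identifier
    service: str        # hypothetical service label, e.g. "R12"
    delay_minutes: int  # delay observed by the contributor


def corroborated_delays(reports, min_contributors=3):
    """Return {service: median delay} for services reported by enough distinct people."""
    by_service = defaultdict(dict)
    for r in reports:
        # Keep only the latest report per contributor, so one person
        # cannot outvote everyone else by reporting repeatedly.
        by_service[r.service][r.contributor] = r.delay_minutes
    return {
        service: median(delays.values())
        for service, delays in by_service.items()
        if len(delays) >= min_contributors  # assumed corroboration threshold
    }


reports = [
    Report("a", "R12", 10),
    Report("b", "R12", 12),
    Report("c", "R12", 60),  # superseded by c's later report below
    Report("c", "R12", 11),
    Report("a", "R7", 5),    # only one contributor, so not accepted
]
print(corroborated_delays(reports))  # {'R12': 11}
```

A real community would need more than this (reputation, moderation, handling of stale reports), but the principle is the same: no single contributor is trusted, and agreement between contributors is.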

This is not an entirely theoretical position. There is a largely untapped human resource, just waiting to help.

The transport enthusiasts (transit fans, “spotters”) already collate and produce some extremely high quality information about certain technical aspects of operations and services. For example, sites such as LondonBusRoutes.net contain detail on bus route timing and vehicle allocation (type and number of buses), which turns out to be difficult to extract from official sources. While it may be argued that these sites simply repackage official information, their very existence is a testament to the strength of the underlying community.

Casual observation of people delayed on trains or in traffic suggests they derive some comfort from picking up their mobile (cell) phone and telling someone about it. Something they can do, in a scenario they otherwise have no control over. Their desire to communicate the same information to drivers or users 10 miles behind them (who might be able to re-plan their route, should they know) is untested. But the potential is intriguing.

Nobody has entirely worked out how to use these people. Yet.

Battle Lines

If the social web/tech-entrepreneur sector chooses to fight the “status quo” head on, it does so against large multi-national IT providers who support clients with historically entrenched positions. Not a contest that favours the underdog.

If the tech “upstarts” can find a way to use this human resource effectively, they will ultimately provide a more cost-effective solution than the traditional “government IT” sector can offer. Integrate that user-generated information into the wider consumer internet, and the machinery of government simply won’t be able to justify its historic position of pouring millions into systems it controls. The “social web”/tech-entrepreneur sector wins.

The upstarts do not need perfect source data, if users consider the presentation of results to be better. The early Xephos vs TransportDirect comparisons provide some evidence. The success or failure of the social web/tech-entrepreneur sector is ultimately dependent on whether it can provide better information than official sources, using the resources and skills available to it.

Disclaimer: The contents of this article reflect my own personal analysis of the situation. This does not directly reflect advice to, or views of, government or anyone else involved in the handling and provision of public transportation data.

9 comments on "Social Reconstruction of Public Transportation Information"

  1. On July 9th, 2008 at 1:06 am Joe Hughes wrote:

    Thanks for another thought-provoking piece on transit data–it’s good to see people writing about these issues! I posted a link on the Transit Developers group.

  2. On July 9th, 2008 at 5:17 am Jehiah wrote:

    Google’s new Map Maker is a good example. Another would be allowing users to refine information, like how Google allows you to update an address that isn’t quite displaying in the proper spot when you do a map search.

    Part of your argument is exactly what is happening today. Developers are working on solutions to make transit information more accessible. Whether the data comes from GTFS feeds, scraped info, or user contributions, it all has added value when it leads towards openness and more solutions.

    As you mentioned it would be nice to have means to let people in a train behind mine know that they might be delayed, but for now I’ll settle for putting my endeavors towards helping them know when the train will be there if it’s on time.

  3. On July 9th, 2008 at 8:43 am ian wrote:

    thoughtful piece (and the more of your site i read the more comments i have!).

    The intersection of spatial data and public policy is quite different in the US v. UK v. ROE (rest of Europe). We know about the basics: TIGER v. OS, or rather, public good v. national (monopolistic) resource. The legal basis for scraping a transit agency isn’t “complex” in that there really is no basis whatsoever–it is an area that lacks a regulatory framework, hence courts are where much will be decided (in the US) until/if legislators act. [Furthermore, scraping isn't the most reliable/accurate means of capturing this kind of data, raising questions about its underlying accuracy, and that's the whole point...]. Public agencies (financed via taxpayers, indirectly through bonds or other arrangements) generally fall under the purview of public record laws, so there is no basis to withhold information unless it poses a threat to public safety, etc…

    The Ryanair example isn’t accurate as it’s a question of distribution/channel control for a *for profit* company. Fare data is competitive, bus timetables for a transit agency are not. I’d also caution about generalizing based on UK bus operators–I believe your point about data being centralized is that the _collection_ of data will be centralized? Certainly it will be (more) broadly distributed, per your points.

    I think Trapeze saw the Google Transit writing on the wall and decided it had better get in front of the issue or they’d face a slow leakage of customers.

    Also, another point to underscore–this post focuses around _data_ but (for the most part) Google Transit is an application–very few of the participating agencies make their data available (Google has declined to do this, but I’ll let Joe speak for his company ;)).

    Many good points and I look forward to reading your archives.

  4. On July 9th, 2008 at 12:53 pm Tim Howgego wrote:

    Interesting comments.

    Keep in mind that about 9/10 (depending on how the market is defined) of UK bus operations outside of Greater London and Northern Ireland are purely commercial, not under contract, and as such “for profit”. London could be operated entirely commercially, but remains regulated. Rail-based modes are fundamentally more expensive, regulated and franchised. You might ask why the public sector is so heavily involved in providing information about a primarily commercial sector? Impartiality is typically cited, but still amounts to a pre-emptive government intervention that isn’t needed in most local markets. Operators could take the lead, but consider that their first steps (the original Great Britain National Bus Timetable) ceased to evolve partly because the operating groups were encouraged to fund Traveline.

    Having said that, fare data certainly is more sensitive than timetable data. While there is a trend towards simple zoning and flat fares, the spatial representation of fares information raises uncomfortable equality questions politically in many areas, and particularly between areas. Historically there has been a lot of market differentiation by geography, which conflicts directly with the common political perception of public transport networks as a unified undifferentiated product. As you may have detected by now, there is a lot of underlying tension between private and public objectives.

    I suspect there are some cultural differences between US and UK/EU regarding poorly defined areas of the law. We tend to panic and try to find or rearrange the spirit of the nearest bit of legislation. As an outsider, I’d characterise the US approach as building a business that is big enough to sustain lawyers and lobbyists when the time comes. Perhaps that’s a misinterpretation?

    I don’t particularly want to focus on Trapeze, because they are not the only business out there. Their entry into the UK market about 8 years ago involved buying up many of the small specialist software companies, so I don’t believe we are now seeing a direct competitive response to Google. Without going into too much detail, there will be some instances in the UK where entirely legitimately (through fair market competition), they provide the scheduling software to operators and hold the contract for processing and delivering electronic information to users. But between those two processes are several public bodies, multiple levels of information filtering and reconstruction, and a lot of paper. You do the maths.

  5. On July 9th, 2008 at 7:54 pm Aaron Antrim wrote:

    I enjoyed reading your thoughtful article. For transit agencies that need help understanding the advantages of opening up their data, I find it is useful to refer them to this excellent presentation to come out of TriMet in Portland, Oregon, USA.

    Leveraging Resources for Customer Information by Exposing Transit Data

  6. On July 10th, 2008 at 1:53 am Joachim Pfeiffer wrote:

    Tim, great writeup. You wrote: “Instead of using official data, why not let users reconstruct it?”. Let’s see if we can push this along.

    I’m going to venture to say that regular (e.g. semi-annual) as well as “ad hoc” network and schedule changes will challenge the sanity of anybody relying on reconstructed transit data after the second, perhaps third, major schedule change. Weeks pass by quickly as existing data needs to be re-validated after the fact, changes gathered and integrated, based on observations, manual comparison with published timetables, or other methods outside established relationships with the originator of the data (i.e. transit agency here in the US). In such a scenario, data leveraged through web services and other means now become a leveraged liability, adding pressure to develop accurate updates in a timely fashion.

    Question: Can a practice of using reconstructed transit data, leveraged in web apps, replace a relationship with the originator of the data that allows access to this data in a timely fashion, i.e. ahead of schedule changes? I don’t see it.

    It’s not like there weren’t models for such relationships: The framework for public GTFS feeds and TriMet’s web service (follow link posted by Aaron) are great models for allowing third parties access to transit data in a controlled way, i.e. ideally ahead of schedule changes with assured quality through data validation using FeedValidator.

  7. On January 8th, 2010 at 3:47 pm Travelaide wrote:

    This is news to me. I have been developing a large UK travel site called http://www.travelaide.co.uk and we have spent hundreds of hours searching for APIs and XML feeds for sites such as Expedia and Active Hotels. I’m sure the larger companies have access to generic API calls to pull the info directly from the travel companies’ databases, so why would they need to scrape the sites?

  8. On January 15th, 2010 at 8:07 am Tim Howgego wrote:

    Travelaide:

    1. Transport operating companies often don’t want other people using their data. Or at the very least want control over re-use. While concerns are legitimate, it’s deeply wrapped up in the sector’s culture – lack of a perceived “need for marketing”, mixed control of systems between public and private sector, emphasis on cost minimization.
    2. Operating businesses are built around skills in mechanical engineering or people management, with traditionally minimal use of computing technology. “WTF is an API?” It’s not something that’s likely to feature in any decision making/specification process. It’s part of a different world.
  9. On January 28th, 2010 at 7:20 pm Ian McCaig’s History of Lastminute.com - Tim Howgego wrote:

    [...] never be profitable, and will be the preserve of hobbyists and naive startups. Or maybe it will all morph into one huge contract and never become truly public data. The outcome is [...]