Maintaining Privacy and Security in Open Data | TfNSW Open Data Hub and Developer Portal

alejandro.felman's Blog

Making data as open and transparent as possible is one of Transport for NSW's main goals and part of the overall data policy set out by the government in the Future Transport strategy. However, making data open doesn't come without risk. The Transport for NSW Open Data Team must ensure that the data it publishes doesn't put people's lives at risk, doesn't compromise the city's security and doesn't violate any privacy laws. Read on to find out more about the risks involved in releasing new data and some of the measures we take to protect the public.

Risks When Releasing Open Data

Most of the data we release is information about a service, mode or vehicle, such as timetables or real-time vehicle positions. This kind of data is fairly standardised and carries a low security or privacy risk. All data, including data we have identified as low risk, still goes through a review process before being published on the Open Data Hub. However, some datasets carry a higher risk of security or privacy issues and require a more thorough review and extra safety measures before they are released to the public on our website. Datasets such as peak loads, bus occupancy and Opal tap-on/tap-off data can be much more sensitive. Some of the risks that could arise if we released this kind of data in its raw form include:

Identification of individuals
Tracking of individuals through the transport network
Breach of privacy laws
Privacy attacks
Reconstructing past trips of specific individuals
General safety of city infrastructure, vehicles, property and TfNSW personnel

Why Can't I Get More Granular, Detailed Data?

We get many requests for more detailed or granular data, usually for research projects, but datasets are provided at a certain level of detail for a reason: to avoid the issues and risks outlined above. For example, peak train load figures are provided as aggregated figures grouped by train line, while Opal tap-on/tap-off data was run through an algorithm and grouped by train station in regular intervals. If you require more detailed data there is no harm in asking. The best way to request or inquire about existing data is through our Open Data Forum.
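The idea of grouping tap-on records by station and regular interval can be sketched in a few lines of Python. This is a minimal illustration only, not the algorithm actually applied to Opal data: the record layout (card id, station, timestamp), the function name aggregate_taps and the 15-minute interval are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def aggregate_taps(records, interval_minutes=15):
    """Count tap-ons per (station, interval start).

    `records` is an iterable of (card_id, station, timestamp) tuples.
    This layout is hypothetical. `interval_minutes` is assumed to divide 60.
    The card id is never copied into the output.
    """
    counts = Counter()
    for _card_id, station, ts in records:
        # Floor the timestamp to the start of its interval.
        floored_minute = ts.minute - ts.minute % interval_minutes
        bucket = ts.replace(minute=floored_minute, second=0, microsecond=0)
        counts[(station, bucket)] += 1
    return dict(counts)


# Made-up sample records, for illustration only.
sample = [
    ("card-1", "Central", datetime(2017, 3, 1, 8, 3)),
    ("card-2", "Central", datetime(2017, 3, 1, 8, 14)),
    ("card-1", "Town Hall", datetime(2017, 3, 1, 8, 20)),
]
totals = aggregate_taps(sample)
```

The output holds only a count per station and interval, so the two taps made by "card-1" can no longer be linked into a single trip. Aggregation alone does not rule out every privacy attack, which is why more sensitive datasets go through further processing.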

Safety Measures When Publishing Data

Many of our datasets need to be put through various processes to ensure no harm comes from publishing the data. The more sensitive the data, the more thorough the processes. These include applying a differential privacy algorithm, which protects against privacy attacks and re-identification. We also anonymize data that contains personal information, removing that information from the dataset so that individuals cannot be identified and end-to-end trips cannot be reconstructed. Other datasets that are not as sensitive are simply provided as aggregate figures, which are useful enough without the risk of privacy or security issues. Last but not least, no personal information is ever published as open data.
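To give a feel for how an algorithm of this kind can protect aggregate figures, here is a minimal sketch of one common technique, the Laplace mechanism, which adds random noise to each count before release. It is an assumption that this resembles the processing used on our data; the function names, the epsilon parameter and the sensitivity of 1 (each person contributing at most one record to one count) are all illustrative.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatise_counts(counts, epsilon, rng=None):
    """Return noisy copies of `counts` (a dict of key -> integer count).

    Assumes a sensitivity of 1, so the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger protection.
    """
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    noisy = {}
    for key, value in counts.items():
        # Rounding and clamping at zero are post-processing steps that
        # keep the published figures looking like ordinary counts.
        noisy[key] = max(0, round(value + laplace_noise(scale, rng)))
    return noisy


# Hypothetical station counts, for illustration only.
released = privatise_counts({"Central": 120, "Town Hall": 4}, epsilon=1.0)
```

Because the noise is random, the released figures differ slightly on every run. Large counts stay useful, while very small counts, which are the ones most likely to expose an individual, are obscured.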

Use Case: Opal Data

An initial set of Opal data was released on the Open Data Hub in March 2017. Opal data was among the data most requested by our users, and it took us longer than expected to release because of sensitivity, privacy and security risks. The following Opal datasets have been released; you can find them in our data catalogue:

Opal Trips - Bus / Train / Light Rail / Ferry
Peak Train Load Estimates
Fare Compliance Survey Results Data
Opal Tap On and Tap Off
Opal Tap On and Tap Off Release 2

Opal data is easily the most sensitive data we have. Raw Opal data contains personal information as well as trip data, and cannot be posted publicly as that would breach privacy laws. Opal data went through a very thorough review as well as various data processes to make it suitable for publishing. Algorithms were developed and applied to the raw data to produce a privacy-protected dataset that guarantees no information that can identify an individual was released. This process also anonymized the data so that no individual end-to-end trips can be identified. For more information on the Open Opal Data, make sure to read the Opal FAQ document.

As you can see, we take the utmost care with the data we release and put the safety of NSW citizens first. We also do our best to release as much data as possible on the Open Data Hub and stay true to the open data philosophy. We encourage you to request data via our forum, Twitter or email.

- The TfNSW Open Data Team