While discussing data extraction methods for KItinerary earlier I briefly mentioned barcodes as one source of information. It’s a subject that deserves a few more details though, as it’s generally good to know what information you are sharing when your ticket barcode gets scanned.
Barcodes on booking confirmations or tickets serve multiple purposes:
- Carrying some form of token used for validation. This can be a simple number or actual cryptographic signatures. That token typically does not contain any direct information about you or your booking, but it can act as a key for online lookup of such information, in which case it is even relevant to protect just that token from a privacy point of view.
- Information about you or your booking. Often this is a machine-readable version of what’s also printed in human readable form on a ticket, such as your name, booking number or details about what you booked. From a privacy point of view even more problematic are cases where the barcode contains additional information not visible on the human readable part.
For data extraction we of course benefit from a machine readable format that doesn’t require fragile text parsing in PDF or HTML files. Additionally, barcodes tend to use systematic identifiers instead of ambiguous and/or localized human readable names, for example for airports or stations. The most well-known such identifier is probably the 3 letter IATA airport code. Such identifiers allow us to easily retrieve additional information about that location from sources like Wikidata.
KDE Itinerary’s nightly Flatpak builds therefore recently got the ZXing-C++ dependency added to make full use of that, and we are working on getting that into the nightly Android builds too. If you are deploying or packaging KDE Itinerary or the KMail integration plug-in by other means you probably want to make sure to have ZXing-C++ available too.
While we are mainly interested in itinerary related information, we also come in touch with whatever else is in the barcodes. Besides general privacy insights this also has a very practical impact on how we sanitize our test data. While it’s fairly straightforward to replace your credit card number in a simple ASCII-based code, doing this in partially understood binary codes with cryptographic security features is next to impossible.
There’s a number of different aspects of the barcodes that are relevant for understanding what is (or can be) encoded in them:
- The size of the encoded data. That’s a very good indicator of whether there is only a ticket token or also additional booking information. One-dimensional codes can only store short alpha-numeric payloads, which is usually a strong indicator of a token-only code. Two-dimensional codes like QR or Aztec on the other hand can store anywhere from a few hundred bytes to a few kilobytes.
- ASCII or binary payloads. Many of the barcode codecs are optimized for alpha-numeric content rather than arbitrary binary data, so this doesn’t necessarily say anything about the amount of data in there. Textual content is however much easier to analyze, as any barcode scanning app can show you the content. Many of those scanners however choke on e.g. zero bytes in binary content, so even capturing the full binary payload isn’t straightforward.
- Standardized or proprietary content. In some areas barcode content is standardized to achieve inter-operator compatibility, airline boarding passes being the extreme with a single international standard. Unfortunately, there are few other standards, let alone some with even remotely such a wide coverage. So in many cases we encounter vendor-specific codes with little or no public documentation. Those however are often a bit simpler in their structure, while standards tend to be modular and offer support for extensions. Standardization also doesn’t necessarily imply the specification is publicly available, but it makes it at least more likely that it’s findable somewhere on the Internet ;-)
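To make the ASCII versus binary distinction above a bit more tangible, here is a minimal Python sketch for a first look at a captured payload. It is not taken from KItinerary; the function names `classify_payload` and `hex_dump` are made up for this illustration, and "text" is simply assumed to mean "printable ASCII plus whitespace".

```python
import string

# Assumption for this sketch: "text" means printable ASCII plus common whitespace.
_TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def classify_payload(payload: bytes) -> str:
    """Return 'text' if every byte is printable ASCII, otherwise 'binary'."""
    return "text" if all(b in _TEXT_BYTES for b in payload) else "binary"


def hex_dump(payload: bytes, width: int = 16) -> str:
    """Render offset, hex bytes and an ASCII column, keeping zero bytes visible."""
    lines = []
    for offset in range(0, len(payload), width):
        chunk = payload[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as '.' in the ASCII column.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:04x}  {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return "\n".join(lines)
```

Working on `bytes` rather than strings throughout is the important part here, as that is what keeps zero bytes from getting lost along the way.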
As mentioned above there is only one relevant barcode type for flights, “IATA Bar Coded Boarding Passes (BCBP)”. It’s a fairly old standard, containing a modular ASCII payload for one to four legs. The set of mandatory fields is very small:
- Passenger name (as 6bit ASCII and truncated to 20 characters).
- Booking reference.
- Start and destination IATA airport codes.
- Flight number.
- Day of flight. This is the day of the year, counting from January 1st of the year of the trip. The year however is not encoded at all.
- Seat number and class.
- Passenger sequence number (part of the unique identification of a passenger).
Privacy-wise, this is already enough to be problematic, as was shown at 33C3 in 2016. For KItinerary’s data extraction this is almost all the useful information in here. Particularly annoying is the lack of a full date, requiring us to guess the year from context.
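As a rough illustration of the mandatory fields listed above and of the year guessing, here is a small Python sketch. It is not KItinerary’s actual code. The field offsets follow the commonly published fixed-width layout of the mandatory block and should be treated as an assumption, as should the policy of picking the first matching date on or after some context date (for example the date of the booking email). The names `BcbpLeg`, `parse_mandatory` and `resolve_date` are made up for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BcbpLeg:
    passenger_name: str
    booking_reference: str
    from_airport: str
    to_airport: str
    carrier: str
    flight_number: str
    day_of_year: int
    compartment: str
    seat: str
    sequence_number: str


def parse_mandatory(code: str) -> BcbpLeg:
    """Parse the fixed-width mandatory block of the first leg (assumed layout)."""
    if len(code) < 60 or code[0] != "M":
        raise ValueError("not a BCBP 'M' format code")
    return BcbpLeg(
        passenger_name=code[2:22].strip(),      # 20 characters, space padded
        booking_reference=code[23:30].strip(),  # 7 characters
        from_airport=code[30:33],               # IATA airport code
        to_airport=code[33:36],                 # IATA airport code
        carrier=code[36:39].strip(),
        flight_number=code[39:44].strip(),
        day_of_year=int(code[44:47]),           # January 1st is day 1
        compartment=code[47],
        seat=code[48:52].strip(),
        sequence_number=code[52:57].strip(),
    )


def resolve_date(day_of_year: int, context: date) -> date:
    """Guess the full date: first matching day on or after the context date."""
    # Looking a few years ahead is only needed for day 366, which exists
    # in leap years only.
    for year in range(context.year, context.year + 9):
        candidate = date(year, 1, 1) + timedelta(days=day_of_year - 1)
        if candidate.year == year and candidate >= context:
            return candidate
    raise ValueError("day of year out of range")
```

With a context date of 2019-09-01, day 326 resolves to 2019-11-22, while a context date of 2019-12-01 pushes the same value to 2020-11-21, which shows how much the result depends on having a sensible context.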
However, there’s plenty of optional fields that are populated based on the airline and the travel destination. A few noteworthy examples are:
- Frequent flyer number (which sometimes doubles as a credit card).
- Baggage tag numbers.
- A “document id”, which has been seen containing the passport number for flights to the UK for example.
- A variable length vendor-specific field. This is often seen to be used by Lufthansa-associated airlines, with unknown content.
- Fields specific to US security requirements (and only used for flights in or to the US).
- A cryptographic signature of the content, to be specified by “local authorities”. This so far has also only been observed for US destinations.
It would be interesting to explore if a “privacy mode” for boarding passes in KDE Itinerary would work in practice, that is only presenting the mandatory fields of the boarding pass and see how far you get with that at the airport. It’s unlikely to work for security-related fields or with signatures as used in the US, but fields primarily of commercial interest are probably avoidable in other parts of the world.
For train tickets the situation is a lot more diverse. The closest thing to an international standard is UIC 918.3, which is the big 50x50mm Aztec code found on European international tickets, as well as on domestic tickets in at least Austria, Denmark, Germany and Switzerland. UIC 918.3 however only defines a container format with a minimal header, cryptographic signatures and a zlib compressed payload.
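To give an idea of how thin this container is, here is a minimal Python sketch for unwrapping it. The exact layout used below (a `#UT` marker, a two digit version, a four character carrier code, a five character key id, a 50 or 64 byte signature depending on the version, and a four digit length in front of the zlib stream) is based on publicly circulating descriptions of the format and should be read as an assumption rather than as a quote from the specification. The names `Uic9183Container` and `parse_container` are made up for this example, and the signature is only extracted, not verified.

```python
import zlib
from dataclasses import dataclass

# Assumed signature sizes per container version.
_SIGNATURE_SIZE = {1: 50, 2: 64}


@dataclass
class Uic9183Container:
    version: int
    carrier_code: str
    key_id: str
    signature: bytes
    payload: bytes  # decompressed content, format depends on the operator


def parse_container(data: bytes) -> Uic9183Container:
    """Split the container into header fields and the decompressed payload."""
    if not data.startswith(b"#UT"):
        raise ValueError("missing #UT marker")
    version = int(data[3:5].decode("ascii"))
    if version not in _SIGNATURE_SIZE:
        raise ValueError(f"unsupported container version {version}")
    signature_end = 14 + _SIGNATURE_SIZE[version]
    # The compressed size is stored as ASCII digits, not as a binary integer.
    length = int(data[signature_end:signature_end + 4].decode("ascii"))
    compressed = data[signature_end + 4:signature_end + 4 + length]
    return Uic9183Container(
        version=version,
        carrier_code=data[5:9].decode("ascii"),
        key_id=data[9:14].decode("ascii"),
        signature=data[14:signature_end],
        payload=zlib.decompress(compressed),
    )
```

Everything interesting then happens inside `payload`, which is where the operator-specific formats come in.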
To get an idea of the variety of payloads we find on train tickets, here’s an overview of what KItinerary supports so far, ordered roughly by complexity and usefulness of the content:
- Koleje Małopolskie (a local Polish provider): a simple JSON structure containing almost all relevant trip information, even exact times in UNIX format. Very useful and very easy to extract. Contains the passenger name, but at least nothing beyond what’s on the paper ticket. Uses human readable station names rather than station identifiers though.
- SNCF (French national railway): a simple fixed-length ASCII format encoding one or two legs of a trip. Easy to extract and useful too. Privacy-wise this contains the passenger birth date beyond what’s on the paper ticket. There are still 4 bytes in there with unknown meaning.
- Trenitalia (Italian national railway): A 67 byte binary blob encoding one leg of a trip. It seems very optimized for size, with numeric values having no alignment at all, so it needs to be looked at as a bit array rather than a byte array. Being entirely undocumented, we had to decode this ourselves. This is ongoing work, and the current state can be found in the wiki: about half of the content can be attributed to a meaning or is always 0. The data we got out of this so far is quite useful, but it’s still incomplete (date/time values for example are suspected to be in there, but haven’t been decoded successfully yet). With parts of the content still being unknown it’s too early to assess this for privacy concerns.
- RCT2 (the standard UIC 918.3 payload for European international tickets, and also used by DSB, ÖBB and SBB): There’s at least decent documentation about this. Unfortunately it’s of very limited use for data extraction. RCT2 is essentially an ASCII art representation of the upper part of the corresponding paper ticket, being designed for display to a human reader rather than for machine reading. The limited space in there conflicts with the realities of multi-lingual tickets, leading to rather flexible interpretations of the standard. Relevant information for us like the exact train a ticket is valid for is not part of the specified fields in many cases, but encoded in an operator-specific format in a free text description field. Therefore KItinerary is only using this as fallback if no other information is available. Being designed as an exact representation of the paper ticket, it has not been seen containing any additional information.
- Deutsche Bahn (vendor-specific payload for UIC 918.3): That’s another modular hybrid binary/textual structure, wrapped inside the UIC 918.3 container, relatively complicated to decode and unfortunately containing very little useful information for KItinerary. Many fields are related to tariff details, but there’s also the passenger name and in older versions also full or partial numbers of the (credit) card used for payment and/or identification. This has meanwhile been fixed though. Tickets with an option for local public transport at the destination contain additional operator-specific payloads, it’s unknown whether those contain useful/sensitive information.
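Looking at a payload as a bit array rather than a byte array, as mentioned for the Trenitalia codes above, can be sketched in a few lines of Python. The helper `read_bits` is made up for this example, and both the bit numbering (most significant bit first) and the big-endian interpretation are assumptions; for an undocumented code that is exactly the kind of thing you have to find out by trial and error.

```python
def read_bits(data: bytes, bit_offset: int, bit_count: int) -> int:
    """Read an unsigned integer of bit_count bits starting at bit_offset.

    Bits are numbered from the most significant bit of the first byte,
    and values may freely cross byte boundaries.
    """
    total_bits = len(data) * 8
    if bit_offset < 0 or bit_count < 0 or bit_offset + bit_count > total_bits:
        raise ValueError("bit range outside of payload")
    # Treat the whole payload as one big integer and cut out the requested bits.
    as_int = int.from_bytes(data, "big")
    shift = total_bits - bit_offset - bit_count
    return (as_int >> shift) & ((1 << bit_count) - 1)
```

A field that is, say, 11 bits wide and starts at bit 37 is then just `read_bits(payload, 37, 11)`; those two numbers are purely hypothetical here and not actual Trenitalia field positions.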
There’s also a few operators we know use barcodes with trip-related content, but that we don’t support yet due to not having enough information or sample data to properly decode their barcodes:
- VIA Rail (Canadian railway): ASCII payload, structurally probably comparable to SNCF, so this might be fairly easy to support given a sufficient amount of samples.
- VR (Finnish national railway): A 108 byte binary code with entirely unknown content so far. It looks more complex than the Trenitalia one, with the larger size and more parts of the code changing between even adjacent tickets, but not entirely random which suggests there is no encryption, compression or other sophisticated encoding.
Other transport operators like SNCB (Belgian national railway) or Flixbus are also using barcodes, but those seem merely to contain ticket tokens. The same is true for all event ticket samples we have so far.
One of the easiest ways to help with decoding such barcodes is looking for prior work or documents on that subject in your local language. For Deutsche Bahn I found numerous useful sources online, all in German though. For SNCF some material exists as well, but it required French language skills to find that.
While obviously conflicting with striving for privacy, another very helpful way to help is donating test samples, especially for barcodes that are not fully understood yet. Decoding an entirely undocumented binary code requires enough samples so you can look at meaningful differences between partially differing tickets, and enough samples to verify your theories on the semantics of certain bits with sufficient certainty. We are not talking about machine-learning scale amounts here though: for the current understanding of the Trenitalia codes it took about 30 barcodes from about a dozen different bookings.
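The "look at differences between samples" step can be illustrated with a small Python sketch that finds all bit positions that are not constant across a set of equally sized payloads. This is only a simplified illustration of the idea, not the tooling actually used for the Trenitalia analysis, and the function name `varying_bits` is made up.

```python
def varying_bits(samples: list[bytes]) -> list[int]:
    """Return the bit positions (most significant bit first) that differ
    in at least one sample."""
    if not samples:
        return []
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("samples must all have the same length")
    first = int.from_bytes(samples[0], "big")
    diff = 0
    for sample in samples[1:]:
        # XOR marks every bit in which this sample deviates from the first one.
        diff |= first ^ int.from_bytes(sample, "big")
    total_bits = length * 8
    return [i for i in range(total_bits) if (diff >> (total_bits - 1 - i)) & 1]
```

Comparing two tickets from the same booking that differ only in, say, the seat then narrows down where that value is stored, while bits that never change across all samples are candidates for padding or constants.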
And of course if you like solving binary puzzles, there are some nice challenges here too ;-)