In the previous post on writing custom data extractors for the KItinerary framework, I mentioned we are augmenting extracted data with knowledge from Wikidata. This post will cover this aspect in more detail.
Static knowledge refers to information that, with near certainty, doesn’t change for the duration of your trip or during a release cycle of our software. That covers things like the name, location and timezone of an airport, or the country it belongs to, as opposed to dynamic knowledge like departure gates or platforms, delays, etc.
There are two main use cases for static knowledge here:
- Detecting and translating human-readable identifiers of, for example, countries or airports into something more suited for machine use. As a human I know ‘Charles de Gaulle’ refers to the main airport of Paris, but for the software the IATA airport code ‘CDG’ is much more useful. Similarly, we want ISO 3166-1 codes for countries rather than possibly localized human-readable names, and so on.
- Augmenting or completing partial information. Usually the booking confirmation data we extract doesn’t contain geo coordinates (useful for navigation to/from the airport/station), the timezone (essential for correctly converting arrival/departure times) or the country (needed for our power plug compatibility check), so we need to obtain that from elsewhere.
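To make the first use case a bit more concrete, here is a minimal Python sketch of mapping human-readable names to machine-friendly codes. It is not how KItinerary implements this; the tiny tables, the `normalize` helper and the function names are all made up for illustration, and a real table would be generated from Wikidata.

```python
import unicodedata

# Illustrative sample data only; the real mappings are generated from Wikidata.
AIRPORT_NAMES = {
    "charles de gaulle": "CDG",
    "paris charles de gaulle": "CDG",
}
COUNTRY_NAMES = {
    "france": "FR",
    "frankreich": "FR",   # localized (German) name
    "germany": "DE",
    "deutschland": "DE",  # localized (German) name
}


def normalize(name: str) -> str:
    """Strip diacritics, fold case and collapse punctuation/whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in stripped.casefold())
    return " ".join(cleaned.split())


def airport_code(name: str):
    """Return the IATA code for a human-readable airport name, or None."""
    return AIRPORT_NAMES.get(normalize(name))


def country_code(name: str):
    """Return the ISO 3166-1 alpha-2 code for a country name, or None."""
    return COUNTRY_NAMES.get(normalize(name))
```

The normalization step is what lets spelling variants such as "Paris-Charles-de-Gaulle" end up at the same entry as "Paris Charles de Gaulle".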
Currently we are basing this on the following free and open data sources:
Wikidata for everything else, such as airport and train station names, locations and identifiers, as well as information about countries such as driving direction or used power plugs.
One aspect that might be a bit counter-intuitive at first is that we don’t query this information online, but bake it into the program code. This has a number of advantages:
- Privacy. Your email client would otherwise produce a predictable online access when opening an email containing a travel booking. Even with transport encryption this activity would still be observable on e.g. an open WiFi network, which is not ideal.
- Speed. The data is encoded in the shared read-only data section of the KItinerary library, in a way that is directly usable without any parsing or additional memory allocations.
- Offline support. That’s something quite handy when traveling, as you might be forced into flight mode or are outside of data roaming coverage.
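The "directly usable without any parsing" idea can be sketched in Python as a sorted table of fixed-size records that is searched in place. This is only an analogue under assumptions: the record layout, the `RECORD`, `TABLE` and `lookup` names and the rounded sample coordinates are invented here and do not reflect the actual data layout in KItinerary.

```python
import struct

# Hypothetical record layout: 3-byte IATA code, latitude, longitude (float32).
RECORD = struct.Struct("<3sff")

# Rounded sample coordinates for illustration; sorted by code so that the
# packed table can be binary-searched.
_ENTRIES = sorted([
    (b"FRA", 50.03, 8.57),
    (b"CDG", 49.01, 2.55),
    (b"ZRH", 47.46, 8.55),
])

# One immutable bytes object, comparable to a blob in a read-only data section.
TABLE = b"".join(RECORD.pack(*entry) for entry in _ENTRIES)


def lookup(code: str):
    """Binary-search the packed table; return (lat, lon) or None."""
    key = code.upper().encode("ascii")
    lo, hi = 0, len(TABLE) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        offset = mid * RECORD.size
        current = TABLE[offset:offset + 3]
        if current == key:
            # Only the matching record is decoded, nothing else is parsed.
            _, lat, lon = RECORD.unpack_from(TABLE, offset)
            return lat, lon
        if current < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because every record has the same size, the position of the n-th entry is just `n * RECORD.size`, so no index or up-front deserialization is needed.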
Obviously there are some downsides to this approach as well, but they are comparatively small:
- Outdated data. You’d need a software update to roll out data changes. This is however why we limit this to “static” data, i.e. information that should be stable within one release cycle (3-4 months) with very high certainty.
- Size. Obviously this data needs space. It’s however surprisingly little, about 600kB for the localized country name to ISO 3166-1 mapping in KContacts, and about 400kB in KItinerary for the airport and train station databases.
There’s an easy way to help in this area too, by improving the Wikidata content. Our code extracting the relevant knowledge from Wikidata warns about some issues such as missing or conflicting data. These issues are usually easy to research and fix, so that might be a nice entry point into Wikidata editing (it definitely was for me).
Since the data quality is actually very good, the below is the complete list of remaining issues at this point, for the few tens of thousands of objects we are looking at.
The following airports have two geo coordinates assigned to them that differ by more than 20 kilometers. Few airports are that large; more commonly it’s a simple typo. A quick look at a map is often enough to determine which is the right coordinate.
- Mar de Cortés International Airport (PPE)
- Abbs Airport (EAB)
- Albuq Airport (BUK)
- Chiquimula Airport (CIQ)
- El Naranjo Airport (ENJ)
- Leribe Airport (LRB)
- Laucala Airport (LUC)
- Ascencion De Guarayos Airport (ASC)
- Horizontina Airport (HRZ)
- Nuevo Casas Grandes Airport (NCG)
- Tunta Airport (KBN)
- Nakhon Ratchasima Airport (NAK)
- Lancang Airport (JMJ)
- Dourados Airport (DOU)
- Presidencia Roque Sáenz Peña Airport (PRQ)
- San Ignacio de Moxos Airport (SNM)
- Rukumkot Airport (RUK)
- Songyuan Chaganhu Airport (YSQ)
- Santana do Livramento Airport (LVB)
- Dolisie Airport (DIS)
- Sikasso Airport (KSS)
- Auguste George Airport (NGD)
- Ángel Albino Corzo International Airport (TGZ)
- Gedaref Airport (GSU)
- Gobernador Edgardo Castello Airport (VDM)
- Goodland Municipal Airport (GLD)
- Aripuanã Airport (AIR)
- Kakamega Airport (GGM)
- Tazadit Airport (OUZ)
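A check like the one that produced the list above can be sketched as follows. This is an assumption about how such a check might look, not the actual extraction code: it uses the haversine great-circle distance on a spherical Earth, and the function names and the default threshold parameter are chosen for illustration (the 20 km value is the one mentioned above).

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def has_coordinate_conflict(coords, threshold_km=20.0):
    """True if any two coordinates of one object are further apart than the threshold."""
    return any(distance_km(a, b) > threshold_km
               for a, b in combinations(coords, 2))
```

A flipped sign on the longitude, a typical typo, moves a point by far more than 20 km and is caught immediately, while two slightly different measurements of the same airfield are not reported.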
Similarly, the following airports have no geo coordinate specified:
The list of train stations missing geo coordinates is a bit longer and therefore in a separate file.
A bit more elaborate are the following train stations with conflicting IBNRs (the IBNR being an ideally unique station identifier). In the easiest case it’s just two duplicate objects that can be merged; in other cases this might require some understanding of which part of a larger station is actually addressed by the identifier.
- IBNR 8089107: Q464094 vs Q22284853
- IBNR 8012583: Q50884707 vs Q46954645
- IBNR 8001978: Q54878964 vs Q2657895
- IBNR 8011167 and 8089100: Q567079 vs Q22284875
- IBNR 8089045: Q664036 vs Q22291719
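Detecting such conflicts boils down to grouping objects by identifier and reporting every identifier claimed by more than one object. The following Python sketch shows that idea; the `ibnr_conflicts` function and its input format are assumptions for illustration, and the second station in the sample is a made-up placeholder.

```python
from collections import defaultdict


def ibnr_conflicts(claims):
    """Group (Wikidata item id, IBNR) pairs and return IBNRs used by several items."""
    items_by_ibnr = defaultdict(set)
    for item, ibnr in claims:
        items_by_ibnr[ibnr].add(item)
    return {ibnr: sorted(items)
            for ibnr, items in items_by_ibnr.items()
            if len(items) > 1}


# First conflict taken from the list above; the last entry is a hypothetical
# non-conflicting station used only to show that it is not reported.
sample = [
    ("Q464094", "8089107"),
    ("Q22284853", "8089107"),
    ("Q_HYPOTHETICAL", "0000000"),
]
conflicts = ibnr_conflicts(sample)
```

Whether a reported pair is a duplicate to merge or two parts of a larger station still needs a human to decide.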