It’s been a year since I last wrote about KItinerary custom extractor development, and things have changed a bit since then. While I have mentioned smaller improvements, such as new API available to custom extractors, in the regular summary posts, there’s a recent change that deserves a bit more detail.
The extractor engine of KItinerary distinguishes between two types of data extraction: generic and custom. Generic extractors are vendor-neutral and typically consume some kind of standardized data format. We only have a few of those; they tend to be relatively complex and are written in C++. On the other hand we have the custom extractors, consisting of vendor-specific JS scripts that can each consume one specific type of document only. As we have many of those, the aim is to keep them as simple and easy to write and maintain as possible.
How and when KItinerary called these extractors had organically grown over time and varied between different document types (PDF, HTML, PkPass, etc). That got the job done and performed well, but had a few limitations:
- Results of the generic HTML extractor were considered final and could not be amended by a custom extractor, something necessary for example to fix Flixbus barcodes.
- Custom extractors for PDF documents could not reuse work of a previously run generic extractor (e.g. for IATA boarding passes or UIC 918.3 train tickets), but had to do the entire work again, including the search for and decoding of the corresponding barcode. That’s particularly annoying if all you wanted to do is obtain one missing bit of information not encoded in a generic way, such as the boarding time.
- Custom extractors for PkPass didn’t have that limitation, at the expense of a lot of hand-written result merging code instead of using existing infrastructure for this, causing unnecessary maintenance cost.
Over the last few weeks this all got unified to follow the same process for all input types, which not only makes this easier to maintain but also makes features previously only available for some input types available everywhere. Independent of the input type, any applicable generic extractor is run first, and collects the following information:
- A JSON-LD structure of reservation information that could be extracted. If no custom extractor is applicable later, this will also be the overall result of the extraction.
- Barcode content, regardless of whether it is in an unknown format or in one supported by a generic extractor. This is particularly useful for custom extractors as they no longer have to search a document again for a barcode, but have the content available right away to work with. It also allows custom extractors for all input types to trigger based on patterns in the barcode content.
- Information about where this data was found in a multi-page document. This information was previously not propagated at all, so custom extractors had to handle the complexities of multi-page documents themselves.
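To make the three items above more concrete, here is a rough sketch of what a single intermediate result might carry. The field names (`jsonLd`, `barcode`, `pageNumber`), the `decodeKnownBarcode` helper and the `M1` prefix check are invented for illustration; they are not KItinerary's actual data model or decoding logic.

```javascript
// Hypothetical stand-in for a generic barcode decoder: for this sketch it only
// "understands" content that starts with "M1" (an assumption, not real logic).
function decodeKnownBarcode(content) {
    if (content.startsWith("M1")) {
        return { "@type": "FlightReservation", reservationNumber: content.slice(2, 8) };
    }
    return null; // unknown format, no generic extractor applies
}

// Collect one intermediate result per barcode found, page by page.
function collectGenericResults(pages) {
    const results = [];
    pages.forEach((page, pageNumber) => {
        for (const content of page.barcodes) {
            results.push({
                jsonLd: decodeKnownBarcode(content), // reservation data, if any
                barcode: content,                    // kept even if the format is unknown
                pageNumber: pageNumber               // where in the document it was found
            });
        }
    });
    return results;
}

// Made-up three-page document: a known barcode, an empty page, an unknown barcode.
const sampleResults = collectGenericResults([
    { barcodes: ["M1ABC123 more data"] },
    { barcodes: [] },
    { barcodes: ["VENDOR-XYZ-42"] }
]);
console.log(sampleResults);
```

Note that the unknown barcode on the third page still ends up in the results, just without any JSON-LD attached, which is what allows a custom extractor to pick it up later.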
With the generic extraction complete, we now check if there are custom extractors that can deal with the input data (e.g. based on context information like email addresses), or with the intermediate results produced by the generic extractors (e.g. by matching values in a given path in the JSON-LD structures). In the latter case the extractor script now has access to the results from the generic extraction, and it is executed once per generic extraction result. This means it’s now very straightforward to have custom extractors that amend generic extraction results, and it is in many cases no longer required to manually deal with multi-page documents.
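The control flow described here can be sketched roughly as follows. This is a simplified model and not the engine's real code: the extractor registry, the `applies` and `run` callbacks and the shape of the result objects are all assumptions made up for this example.

```javascript
// Hypothetical registry of custom extractors; `applies` and `run` are invented names.
const customExtractorRegistry = [
    {
        name: "example-bus-operator",
        // Trigger on a value found in the generic JSON-LD result.
        applies: (result) => result.jsonLd !== null
            && result.jsonLd.provider === "Example Bus",
        // Amend the generic result instead of redoing the extraction work.
        run: (result) => ({ ...result.jsonLd, amendedBy: "example-bus-operator" })
    }
];

function runExtraction(genericResults) {
    const output = [];
    for (const result of genericResults) {
        const extractor = customExtractorRegistry.find((e) => e.applies(result));
        if (extractor) {
            // The custom extractor is executed once per generic result.
            output.push(extractor.run(result));
        } else if (result.jsonLd !== null) {
            // No custom extractor applies: the generic result is the final result.
            output.push(result.jsonLd);
        }
    }
    return output;
}

console.log(runExtraction([
    { jsonLd: { provider: "Example Bus", reservationNumber: "A1" }, barcode: "x", pageNumber: 0 },
    { jsonLd: { provider: "Other Line", reservationNumber: "B2" }, barcode: "y", pageNumber: 1 }
]));
```

Because the loop runs per generic result, a two-page document with one ticket per page needs no special handling in the custom extractor itself.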
For custom extractor authors the main changes are in the Context object exposed to the script; that’s where you find the intermediate results of the generic extraction.
With this a simple script to extract the boarding time for a PDF boarding pass could look like this:
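A minimal sketch of such a script follows. To keep it runnable outside of KItinerary, the context and the PDF document are passed in as plain stand-in objects. The property names used here (`data`, `pdfPageNumber`, `pages`, `text`) and the "Boarding time" label in the document text are assumptions for illustration and should be checked against the actual extractor API and the actual documents.

```javascript
// `pdf` and `context` are plain stand-ins for the objects the engine would provide.
function extractBoardingTime(pdf, context) {
    // Start from the reservation the generic boarding pass extractor produced.
    const reservation = context.data[0];
    // Only look at the page the barcode was found on.
    const page = pdf.pages[context.pdfPageNumber];
    // The label and time format are assumptions about the vendor's layout.
    const match = page.text.match(/Boarding time:?\s*(\d{1,2}:\d{2})/i);
    if (match) {
        reservation.reservationFor.boardingTime = match[1];
    }
    return reservation;
}

// Made-up input data for trying the function out.
const samplePdf = {
    pages: [
        { text: "Terms and conditions" },
        { text: "Gate B12\nBoarding time: 09:35\nSeat 14C" }
    ]
};
const sampleContext = {
    pdfPageNumber: 1,
    data: [{ "@type": "FlightReservation", reservationFor: { "@type": "Flight", flightNumber: "123" } }]
};
console.log(extractBoardingTime(samplePdf, sampleContext));
```

The script neither searches for nor decodes the barcode, and it does not iterate over pages; it only adds the one field the generic result is missing.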
If you compare this with how it had to be done previously, it’s less than half the code.
The other relevant change for custom extractor authors is the ability to trigger their scripts on arbitrary elements in the generic extraction results, as shown by the following example from the Flixbus extractors:
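As a sketch of the idea (this is not the actual Flixbus filter), a trigger can be thought of as a path into the generic result plus a pattern that the value at that path has to match. The `field` and `match` property names, the example path and the `lookupPath` helper below are assumptions made for this illustration.

```javascript
// Hypothetical trigger declaration: a dotted path and a regular expression.
const exampleTrigger = { field: "reservationFor.provider.name", match: "^Example ?Bus" };

// Walk a dotted path through nested objects; yields undefined if any step is missing.
function lookupPath(obj, path) {
    return path.split(".").reduce(
        (node, key) => (node !== null && typeof node === "object" ? node[key] : undefined),
        obj
    );
}

// A trigger applies if the value at its path is a string matching its pattern.
function triggerMatches(trigger, jsonLd) {
    const value = lookupPath(jsonLd, trigger.field);
    return typeof value === "string" && new RegExp(trigger.match, "i").test(value);
}

console.log(triggerMatches(exampleTrigger, {
    "@type": "BusReservation",
    reservationFor: { provider: { name: "ExampleBus GmbH" } }
}));
```

Matching on a path rather than on the raw document keeps the trigger independent of the input type, since PDF, HTML and PkPass input all end up as the same kind of structure after the generic extraction.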
With this it’s now also possible to use the custom extractor infrastructure to prototype and experiment with new approaches for generic extraction, by simply defining a broad enough pattern there, e.g. matching all PDF boarding passes.
All good? Not quite. While the extraction results matched those of the previous implementation exactly after all these changes, it took 25-30% longer to run the tests. This was caused by the loss of the ability to shortcut generic extraction if a custom extractor could be found by other means. Custom extractors can apply specific knowledge to speed up the expensive barcode search in PDF documents, e.g. by knowing where to look for them rather than doing an exhaustive search. However, with a few more shortcuts in the PDF image loading we managed to get this back to its old speed without sacrificing the new flexibility:
- Discard color images during loading (commit).
- Convert color images to grayscale during loading (commit).
- Deduplicate barcodes (commit).
This work requires access to sufficient amounts of sample documents of course, so a big thanks to everyone who has donated data so far! We always need more though. Of particular interest would be Renfe tickets with at least one international station, and Trenitalia tickets with a coach number bigger than eight. Anything else very likely helps too of course :)
Feedback about observations of working or failing data extraction for providers not yet listed in the wiki is also very valuable. And if you want to improve or add new custom extractors, please feel free to get in touch via the KDE PIM mailing list or the #kontact channel on Matrix or Freenode.