KItinerary Custom Extractor Improvements

It’s been a year since I last wrote about KItinerary custom extractor development, and things have changed a bit since then. While I have mentioned smaller improvements,
such as new API available for custom extractors, in the regular summary posts, there’s a recent change that
deserves a bit more detail.

Background

The extractor engine of KItinerary distinguishes between two types of data extraction: generic and custom. Generic extractors
are vendor neutral and typically consume some kind of standardized data format. We only have a few of those; they tend to be relatively
complex and are written in C++. On the other hand we have the custom extractors, consisting of vendor-specific JS scripts that can
each consume one specific kind of document only. As we have many of those, the aim is to keep them as simple and easy to write and maintain as possible.

How and when KItinerary called these extractors had organically grown over time and varied between different
document types (PDF, HTML, PkPass, etc). That got the job done and performed well, but had a few limitations:

  Results of the generic HTML extractor were considered final and could not be amended by a custom extractor, something necessary
for example to fix Flixbus barcodes.
  Custom extractors for PDF documents could not reuse work of a previously run generic extractor (e.g. for IATA boarding passes or
UIC 918.3 train tickets), but had to do the entire work again, including the search for and decoding of the corresponding barcode.
That’s particularly annoying if all you wanted to do is obtain one missing bit of information not encoded in a generic way, such as the
boarding time.
  Custom extractors for PkPass didn’t have that limitation, at the expense of a lot of hand-written result merging code instead
of using existing infrastructure for this, causing unnecessary maintenance cost.

New Processing

Over the last few weeks this all got unified to follow the same process for all input types, which not only makes this easier to
maintain but also makes features previously only available for some input types available everywhere. Independent of the input type,
any applicable generic extractor is run first, and collects the following information:

  A JSON-LD structure of reservation information that could be extracted. If no custom extractor is applicable later, this will
also be the overall result of the extraction.
  Barcode content, regardless of whether its format is supported by a generic extractor or unknown to it.
This is particularly useful for custom extractors as they no longer have to search a document again for a barcode, but have the content
available right away to work with. It also allows custom extractors for all input types to trigger based on patterns in the barcode content.
  Information about where this data was found in a multi-page document. This information was previously not propagated at all, so
custom extractors had to handle the complexities of multi-page documents themselves.

With the generic extraction complete, we now check if there are custom extractors that can deal with the input data (e.g. based on
context information like email addresses), or with the intermediate results produced by the generic extractors (e.g. by matching
values in a given path in the JSON-LD structures). In the latter case the extractor script now has access to the results from
the generic extraction, and it is executed once per generic extraction result. This means it’s now very straightforward to have
custom extractors that amend generic extraction results, and it is in many cases no longer required to manually deal with multi-page
documents.
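
To make the order of operations more concrete, here is a rough model of that flow in plain JavaScript. It is a sketch only: extract(), canHandle(), matches(), run() and the shape of the result objects are all made up for illustration and are not KItinerary API. It also leaves out custom extractors triggered by the input itself, and it assumes that a generic result without a matching custom extractor is passed through unchanged.

```javascript
// Illustrative model only: every name below is made up for this sketch.
function extract(input, genericExtractors, customExtractors) {
    // Step 1: run all applicable generic extractors first. Each result bundles
    // the JSON-LD data, the raw barcode content and the page it was found on.
    const genericResults = genericExtractors
        .filter((extractor) => extractor.canHandle(input))
        .flatMap((extractor) => extractor.extract(input));

    // Step 2: run a matching custom extractor once per generic result.
    // Step 3: where none matches, the generic JSON-LD is used as-is.
    return genericResults.map((result) => {
        const custom = customExtractors.find((c) => c.matches(result));
        return custom ? custom.run(input, result) : result.jsonLd;
    });
}

// Toy input: a two-page "PDF" with one boarding pass per page.
const input = {
    type: "pdf",
    pages: ["Boarding 10:15 CODE:AAA", "Boarding 18:40 CODE:BBB"],
};

const genericExtractors = [{
    canHandle: (doc) => doc.type === "pdf",
    extract: (doc) => doc.pages.map((text, page) => ({
        jsonLd: { "@type": "FlightReservation", reservationFor: {} },
        barcode: text.match(/CODE:(\w+)/)[1],
        page: page,
    })),
}];

const customExtractors = [{
    // Triggers on a pattern in the barcode content.
    matches: (result) => /^[AB]+$/.test(result.barcode),
    // Amends the generic result with text from the page it was found on.
    run: (doc, result) => {
        const time = doc.pages[result.page].match(/Boarding\s+(\d{2}:\d{2})/);
        const res = result.jsonLd;
        if (time) {
            res.reservationFor.boardingTime = time[1];
        }
        return res;
    },
}];

const results = extract(input, genericExtractors, customExtractors);
console.log(results.map((r) => r.reservationFor.boardingTime));
```

The point of the model is that the custom step never has to find the barcode or the right page itself; both arrive with the generic result it is asked to amend.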

New API

For custom extractor authors the main changes are in the Context
object exposed to the script; that’s where you find the intermediate results of the generic extraction.

With this a simple script to extract the boarding time for a PDF boarding pass could look like this:

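function main(pdf) {
    // Start from the result the generic extraction already produced.
    var res = Context.data[0];
    // Only look at the page that result was found on.
    var page = pdf.pages[Context.pdfPageNumber];
    var time = page.text.match(/Boarding\s+(\d{2}:\d{2})/);
    // Amend the result only if a boarding time was actually found.
    if (time) {
        res.reservationFor.boardingTime = JsonLd.toDateTime(time[1], "hh:mm", "en");
    }
    return res;
}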

If you compare this with how it had to be done previously,
it’s less than half the code.
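
If you want to try such a script outside of KItinerary, one option is to stub the objects the engine normally provides. The following is a minimal sketch under that assumption: the Context and JsonLd objects here are hand-written stand-ins, the toDateTime stub simply returns the matched text instead of a parsed time, and fakePdf is invented sample data. The copy of main() includes a guard for pages without a boarding time.

```javascript
// Stand-ins for the objects KItinerary injects into extractor scripts.
// They only mimic the fields this script touches.
const Context = {
    data: [{ "@type": "FlightReservation", reservationFor: { "@type": "Flight" } }],
    pdfPageNumber: 0,
};
const JsonLd = {
    // Stub: returns the text unchanged rather than a parsed date/time value.
    toDateTime: (text, format, locale) => text,
};

// The script under test.
function main(pdf) {
    var res = Context.data[0];
    var page = pdf.pages[Context.pdfPageNumber];
    var time = page.text.match(/Boarding\s+(\d{2}:\d{2})/);
    if (time) {
        res.reservationFor.boardingTime = JsonLd.toDateTime(time[1], "hh:mm", "en");
    }
    return res;
}

// Invented sample page text for a dry run.
const fakePdf = { pages: [{ text: "Gate A12   Boarding 09:45   Seat 14C" }] };
const result = main(fakePdf);
console.log(result.reservationFor.boardingTime);
```

This only checks the regular expression and the way the result is amended; how the real JsonLd.toDateTime parses the value still has to be verified inside KItinerary.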

The other relevant change for custom extractor authors is the ability to trigger their scripts on arbitrary
elements in the generic extraction results, as shown by the following example from the Flixbus extractors:

{
    "type": "JsonLd",
    "property": "reservationFor.busCompany.name",
    "match": "FlixMobility"
}

With this it’s now also possible to use the custom extractor infrastructure to prototype and experiment with new approaches
for generic extraction, by simply defining a broad enough pattern there, e.g. matching all PDF boarding passes.
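
For illustration, this is roughly how a filter like the Flixbus one above could be evaluated against a generic extraction result. It is a sketch, not the actual KItinerary implementation: valueAtPath() and filterMatches() are hypothetical helpers, and treating the "match" value as a regular expression is an assumption of this example.

```javascript
// Hypothetical helper: resolves a dotted property path, yielding undefined
// as soon as a step along the path is missing.
function valueAtPath(obj, path) {
    return path.split(".").reduce(
        (node, key) => (node !== null && typeof node === "object" ? node[key] : undefined),
        obj
    );
}

// Hypothetical helper: true if the value found at filter.property matches
// filter.match, interpreted here as a regular expression.
function filterMatches(filter, jsonLd) {
    if (filter.type !== "JsonLd") {
        return false;
    }
    const value = valueAtPath(jsonLd, filter.property);
    return typeof value === "string" && new RegExp(filter.match).test(value);
}

const filter = {
    type: "JsonLd",
    property: "reservationFor.busCompany.name",
    match: "FlixMobility",
};

// Invented sample results to run the filter against.
const flixbus = { reservationFor: { busCompany: { name: "FlixMobility GmbH" } } };
const other = { reservationFor: { busCompany: { name: "Some Other Bus Ltd" } } };
const incomplete = { reservationFor: {} };

console.log(filterMatches(filter, flixbus));
console.log(filterMatches(filter, other));
console.log(filterMatches(filter, incomplete));
```

A result with a missing path element simply does not match, so a broad filter does not need every generic result to be complete.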

Performance

All good? Not quite. While the extraction results matched those of the previous implementation exactly after all these
changes, it took 25-30% longer to run the tests. This was caused by the loss of the ability to shortcut generic extraction if a custom
extractor could be found by other means. Custom extractors can apply specific knowledge to speed up the expensive barcode search in PDF
documents, e.g. by knowing where to look for them rather than doing an exhaustive search. However, with a few more shortcuts in the PDF
image loading we managed to get this back to its old speed without sacrificing the new flexibility.

Relevant changes:

  Discard color images during loading (commit).
  Convert color images to grayscale during loading (commit).
  Deduplicate barcodes (commit).

Contribute

This work of course requires access to a sufficient amount of sample documents. A big thanks to everyone who has donated data so far! We always
need more though; of particular interest would be Renfe tickets with at least one international station, and Trenitalia tickets with a coach
number bigger than eight. Anything else very likely helps too of course :)

Feedback about observations of working or failing data extraction for providers not yet listed in the wiki
is also very valuable. And if you want to improve or add new custom extractors, please feel free to get in touch
via the KDE PIM mailing list or the #kontact channel on Matrix or Freenode.