Following the look at how KDE Itinerary does data extraction, this post will cover custom data extractors in a bit more detail. Custom extractors are needed where we are unable to obtain the information we are interested in from structured annotations, or where we need to add information to incomplete structured data (such as boarding pass barcodes).
Data extraction is usually performed on incoming emails, and roughly follows these steps:
1. Find all suitable extractors for a given document. Typically this is done based on the context, not the content; practically all existing extractors do this by wildcard matching on the From header of the email. This makes the step very fast, and allows us to avoid expensive processing of the content when it is not necessary.

2. Run all matching extractors on the document; each produces its results as one or more JSON-LD objects.

3. The resulting JSON-LD structures are normalized, augmented with information from Wikidata, and merged with data from e.g. structured annotations or barcodes. Elements still invalid at the end of this processing are discarded. This means that custom extractors have some flexibility when it comes to the completeness of their results: there is no need to extract information that can be deduced by other means. See the post-processing docs for details.
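As a hedged illustration of that flexibility, an extractor could return a minimal lodging reservation like the sketch below; fields such as the full address or geo coordinates can then be filled in by the post-processing stage. All concrete values here are invented:

```json
{
    "@type": "LodgingReservation",
    "reservationNumber": "ABC123",
    "checkinDate": "2019-09-06T15:00:00",
    "reservationFor": {
        "@type": "LodgingBusiness",
        "name": "Example Hostel Milan"
    }
}
```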
To create a custom extractor you first need to define on which documents it is applicable. This is done in the extractor meta data, a simple JSON file specifying where the code for your extractor can be found, and when to trigger it.
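Such a meta data file might look like the following sketch. The exact key names should be checked against the KItinerary::Extractor documentation, and the filter value is an assumption based on the prose:

```json
{
    "type": "text",
    "filter": [
        {
            "header": "From",
            "match": "@aohostels.com"
        }
    ],
    "script": "aohostels.js"
}
```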
This is fairly straightforward: it defines that the script aohostels.js is run on plain text parts of emails received from that vendor.
The KItinerary::Extractor documentation explains what else can be specified here.
The interesting part is of course the extractor script itself. The entry point is the function named in the meta data (main is the default). Its argument depends on the document type: for plain text it is just a string; for HTML, PDF or Apple Pass files you get an object representing that type. The function is expected to return a JSON object (or an array of JSON objects) containing the extracted data in JSON-LD encoding of the schema.org ontology.
The following (slightly simplified) excerpt from the extractor for the booking confirmations of the Akademy accommodation shows the common patterns of a simple text-based extractor:
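The original excerpt is not reproduced here, so the following is a hedged, self-contained sketch of what such a text extractor typically looks like. The booking-number, hotel, and date patterns are invented for illustration; real extractors match the vendor's exact document wording and can use helper APIs provided by the extractor engine:

```javascript
// Sketch of a simple text-based extractor (illustrative, not the original code).
// The engine calls main() with the plain text content of the email part.
function main(text) {
    var res = { "@type": "LodgingReservation" };

    // Hypothetical patterns; a real extractor matches the vendor's exact wording.
    var booking = text.match(/Booking number:\s*(\S+)/);
    if (!booking)
        return null; // not a document we can handle after all
    res.reservationNumber = booking[1];

    var hotel = text.match(/Hotel:\s*(.*)/);
    if (hotel)
        res.reservationFor = { "@type": "LodgingBusiness", "name": hotel[1] };

    // ISO dates can be passed through as-is; localized date strings would
    // need the engine's locale-aware date parsing helpers instead.
    var checkin = text.match(/Check-in:\s*([0-9]{4}-[0-9]{2}-[0-9]{2})/);
    if (checkin)
        res.checkinDate = checkin[1];

    return res;
}
```

Returning null (or an incomplete object) is fine: as described above, invalid elements are discarded and missing fields can be recovered during post-processing.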
The full source code is here. Next to that you’ll also find more elaborate examples covering multi-leg trips, support for multiple localized variants, other document types, multi-column layouts, or documents that don’t specify the year in any mentioned date.
The KItinerary::ExtractorEngine documentation describes the API exposed to extractor scripts, both for working with more complex input types and for performing common operations like locale-specific date/time parsing or barcode decoding.
To make custom extractor development easier, we have an interactive data inspection and test tool called KItinerary Workbench. With it you can inspect the input data as seen by the extractor script, that is, look at the extracted text and images of a PDF file or the DOM tree of an HTML document, and decode barcodes in various formats.
Most importantly though, you can re-run your freshly edited extractor scripts without recompiling or restarting anything. See the documentation on how to set that up.
This post should contain enough pointers to get started with custom extractor development; more details can be found in the KItinerary documentation.
If you have questions, join us on the KDE PIM mailing list or in the
#kontact channel on Freenode.