KItinerary Command Line Extractor
The KItinerary data extraction engine recently got a command line interface. It can be pointed at any file KItinerary can consume (HTML, PDF, plain text, iCal, Apple Wallet passes, etc.) and then outputs whatever information it could find there as JSON-LD, following the schema.org data model. Adding this has been motivated by two separate goals: increasing extractor robustness, and easing integration into third-party applications.
Robustness
Regarding robustness, the problem is that the extractor, when used in an email application, is essentially exposed directly to remotely provided (and potentially hostile) content. When running in-process, accidentally or intentionally corrupt documents might trigger hangs or crashes of the email application. That’s especially a problem for some of the more complex document types we deal with, such as PDF.
An effective way to mitigate this is moving the dangerous operations to a separate process. This not only isolates the host application from possible crashes, it also allows us to sandbox the extractor using tools like Bubblewrap or seccomp, reducing the impact of possible security issues in the extractor or the underlying parsing libraries.
For users of the KItinerary API, such as the KMail plug-in, this is straightforward to use:
KItinerary::ExtractorEngine engine;
// run the actual extraction in a separate process
engine.setUseSeparateProcess(true);
// pass the raw, not yet parsed document data
engine.setData(...);
auto result = engine.extract();
Besides enabling the out-of-process mode, it’s important to feed in raw data rather than pre-parsed documents; otherwise the dangerous part of the process happens in the host application after all.
Integration
Another motivation for having an external process for the data extraction is that it provides an easy way to integrate with applications that cannot or do not want to link against KItinerary and its dependencies, or for which linking would bring in additional complications at this point. The browser integration work is one example benefiting from this: there we have to deal with unaligned release cycles and two very different technology stacks.
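To illustrate what such an integration could look like from a different technology stack, here is a minimal Python sketch. It assumes the extractor is reachable through the nightly Flatpak build described in the Prototyping section, that it reads the document from stdin and that it prints a JSON array on stdout. The names `extract_reservations` and `EXTRACTOR_CMD` and the 30 second timeout are made up for this example and are not part of KItinerary.

```python
import json
import subprocess

# Assumption: the extractor is installed via the nightly Flatpak build.
# Replace this with the path of a native binary if you have one.
EXTRACTOR_CMD = ["flatpak", "run", "org.kde.kitinerary-extractor"]


def extract_reservations(raw, cmd=EXTRACTOR_CMD, timeout=30.0):
    """Feed raw document bytes to the extractor and return the parsed JSON-LD list.

    Any failure of the external process (not installed, crash, hang,
    unparsable output) results in an empty list rather than an exception,
    so a broken document cannot take the calling application down.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=raw,            # raw bytes are written to the child's stdin
            capture_output=True,  # collect stdout and stderr
            timeout=timeout,      # guard against hangs on corrupt input
            check=True,           # a non-zero exit status counts as failure
        )
    except (OSError, subprocess.SubprocessError):
        return []

    try:
        result = json.loads(proc.stdout)
    except ValueError:
        return []
    return result if isinstance(result, list) else []
```

A caller would read the file in binary mode, for example `extract_reservations(open("trainticket.pdf", "rb").read())`, and pass the bytes on untouched, so that all parsing happens in the separate process.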
Using the out-of-process extractor is of course not free. The entire test suite currently needs about 8.5 seconds to extract almost 600 samples; in out-of-process mode this roughly doubles. Per document that is about 15 and 30 milliseconds respectively, which isn’t all that bad. That is the average over all test samples though: PDFs, for example, tend to be more expensive to process than plain text files, while the overhead of spawning a new process is largely constant.
Prototyping
Even if the final integration ends up linking directly against the KItinerary stack, the command line interface can be useful for the initial evaluation and for building prototypes. To support this, there’s a nightly Flatpak build available in KDE’s Flatpak repo. This is obviously not meant for production deployments; maintained distribution packages based on official releases are better for that. It does however allow very fast turnaround times for receiving the latest improvements in the extractor engine, while still offering the convenience of pre-built packages and co-installability.
Trying this is fairly straightforward:
$ flatpak remote-add --if-not-exists flathub https://flathub.org/repo/flathub.flatpakrepo
$ flatpak remote-add --if-not-exists kdeapps --from https://distribute.kde.org/kdeapps.flatpakrepo
$ flatpak install kdeapps org.kde.kitinerary-extractor
$ cat trainticket.pdf | flatpak run org.kde.kitinerary-extractor
[
    {
        "@context": "http://schema.org",
        "@type": "TrainReservation",
        ...
    }
]
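For a first look at what the extractor found in a prototype, the output can be processed with any JSON library. The following Python sketch relies only on what is visible in the output above, namely a top-level array of objects with an `@type` field. The function name `count_types` is hypothetical, and all other fields are deliberately left untouched.

```python
import json
from collections import Counter


def count_types(jsonld_text):
    """Count the extracted elements per schema.org type.

    Expects the extractor output as text: a JSON array of objects.
    Objects without an "@type" field are counted as "unknown",
    array entries that are not objects are skipped.
    """
    items = json.loads(jsonld_text)
    return Counter(
        item.get("@type", "unknown")
        for item in items
        if isinstance(item, dict)
    )


# Reduced stand-in for real extractor output
sample = '[{"@context": "http://schema.org", "@type": "TrainReservation"}]'
print(count_types(sample))  # Counter({'TrainReservation': 1})
```

This only summarizes the result. Which properties a given reservation carries depends on the document and on the schema.org type, so a real consumer would have to inspect the full objects.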
Contribute
As always, sample data donations are of invaluable help with this work! Of particular interest at the moment would be multi-leg Renfe tickets, as well as those with at least one international destination :)
For contributing in other ways than donating test data please see our Phabricator workboard for what’s on the todo list, for coordinating work and for collecting ideas. For questions and suggestions, please feel free to join us on the KDE PIM mailing list or in the #kontact channel on Freenode or Matrix.