KItinerary Command Line Extractor

The KItinerary data extraction engine recently got a command line interface, which can be pointed at any file KItinerary can consume
(HTML, PDF, plain text, iCal, Apple Wallet passes, etc.) and which then outputs JSON-LD according to the schema.org data model with the
information that could be found in there. Adding this has been motivated by two separate goals: increasing extractor robustness
and easing integration into 3rd party applications.

Robustness

Regarding robustness, the problem is that the extractor, when used in an email application, is essentially exposed
directly to remotely provided (and potentially hostile) content. When running in-process, accidentally or intentionally corrupt documents might
trigger hangs or crashes of the email application. That’s especially a problem for some of the more complex document types we deal with,
such as PDF.

An effective way to mitigate this is moving the dangerous operations to a separate process. This not only isolates the host application
from possible crashes, it also allows us to sandbox the extractor using tools like Bubblewrap or seccomp, reducing the impact of
possible security issues in the extractor or the underlying parsing libraries.

For users of the KItinerary API, such as the KMail plug-in, this is straightforward to use:

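KItinerary::ExtractorEngine engine;
// Run the extraction in a separate process rather than in the host application.
engine.setUseSeparateProcess(true);
// Pass in the raw, unparsed document data.
engine.setData(...);
auto result = engine.extract();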

Besides enabling the out-of-process mode, it’s important to feed in raw data rather than pre-parsed documents; otherwise the dangerous part
of the process happens in the host application after all.

Integration

Another motivation for having an external process for the data extraction is that it provides an easy way to integrate with applications that
cannot or do not want to link against KItinerary and its dependencies, or for which linking would bring in additional complications at this point.
The browser integration work is one example benefiting from this: there we have to deal with unaligned release cycles and two very different
technology stacks.

Using the out-of-process extractor is of course not free. The entire test suite currently needs about 8.5 seconds to extract almost 600 samples;
with out-of-process mode this roughly doubles. Per document that is about 15 and 30 milliseconds respectively, which isn’t all that bad. That is
the average over all test samples though: PDFs tend to be more expensive to process than plain text files, for example, while the overhead of
spawning a new process is largely constant.
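
The per-document figures follow directly from the totals. A minimal sketch of that arithmetic, assuming exactly 600 samples (the suite has
slightly fewer) and an exact doubling to 17 seconds in out-of-process mode; per_doc_ms is a hypothetical helper name:

```shell
# Average per-document cost in milliseconds: total seconds / sample count * 1000.
# LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
per_doc_ms() {
    LC_ALL=C awk -v total="$1" -v count="$2" 'BEGIN { printf "%.1f\n", total / count * 1000 }'
}

per_doc_ms 8.5 600   # in-process: prints 14.2
per_doc_ms 17 600    # out-of-process, assumed to be exactly double: prints 28.3
```

Both values are a little below the rounded 15 and 30 milliseconds because the real sample count is somewhat under 600.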

Prototyping

But even if the final integration might involve directly linking against the KItinerary stack, the command line interface can be useful for the initial
evaluation and for building prototypes. To support this there’s a nightly Flatpak build available in KDE’s Flatpak repo.
This is obviously not meant for production deployments; maintained distribution packages based on official releases are better for that. It does however
allow very fast turnaround times to receive the latest improvements in the extractor engine while still having the convenience of pre-built packages
and co-installability.

Trying this is fairly straightforward:

$ flatpak remote-add --if-not-exists flathub https://flathub.org/repo/flathub.flatpakrepo
$ flatpak remote-add --if-not-exists kdeapps --from https://distribute.kde.org/kdeapps.flatpakrepo
$ flatpak install kdeapps org.kde.kitinerary-extractor

$ cat trainticket.pdf | flatpak run org.kde.kitinerary-extractor
[
    {
        "@context": "http://schema.org",
        "@type": "TrainReservation",
        ...
    }
]
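
For a quick look at what the extractor found, or for a first prototype in a script, the JSON-LD output can be filtered with standard tools.
A rough sketch, assuming only grep (with the widely supported -o option) and sed are available; list_types is a hypothetical helper name,
and a real integration would be better served by a proper JSON parser:

```shell
# Print the value of every "@type" key found in the JSON read from stdin,
# one per line, including the types of nested objects.
list_types() {
    grep -o '"@type"[[:space:]]*:[[:space:]]*"[^"]*"' \
        | sed 's/.*:[[:space:]]*"\([^"]*\)"$/\1/'
}

# Intended use (not run here, since it needs the Flatpak installed above):
#   flatpak run org.kde.kitinerary-extractor < trainticket.pdf | list_types
```

Since this is purely textual matching, it would also pick up an "@type" key that appears inside a string value, which is fine for
exploration but not for production use.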

Contribute

As always, sample data donations are of invaluable help with this work! Of particular interest at the moment would be multi-leg Renfe tickets,
as well as those with at least one international destination :)

For contributing in other ways than donating test data, please see our Phabricator workboard for what’s on the todo list, for coordinating
work and for collecting ideas. For questions and suggestions, please feel free to join us on the KDE PIM mailing list or in the #kontact
channel on Freenode or Matrix.