I have previously written about why we are interested in barcodes for the KItinerary extractor. This time it’s more about the how, specifically how we find and decode vector graphic barcodes in PDF files, something KItinerary wasn’t able to do until very recently.
While PDF is a vector graphics format, most barcodes we encounter in there are actually stored as images. Technically this might not be the cleanest or most efficient way, but it makes KItinerary’s life very easy: We just iterate over all images found in the PDF, and feed them into the barcode decoder.
There are also providers that use vector graphics to represent barcodes in their PDF documents, for example Iberia, easyJet, Ryanair and Aer Lingus, enough to make this a relevant problem for KItinerary. The basic idea would be to render the relevant area of the document into an image and feed that into the barcode decoder. The rendering part is straightforward since Poppler has an API for that, but how do we know where to look for a vector graphics barcode?
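To make the "render the relevant area" step a bit more concrete, here is a small Python sketch of the unit conversion involved. It assumes the default PDF user space unit of 1/72 inch; the function name, the choice of 300 DPI and the rounding strategy are my own illustration, not what KItinerary or Poppler actually do.

```python
import math

# Default PDF user space: one unit is 1/72 inch.
POINTS_PER_INCH = 72.0


def render_size_px(x0, y0, x1, y1, dpi=300.0):
    """Pixel size needed to render a region given in PDF points.

    The DPI default is an assumption for illustration; rounding up
    ensures a thin region still gets at least one pixel.
    """
    # Multiply before dividing so common cases stay exact in floating point.
    width = math.ceil(abs(x1 - x0) * dpi / POINTS_PER_INCH)
    height = math.ceil(abs(y1 - y0) * dpi / POINTS_PER_INCH)
    return max(width, 1), max(height, 1)


if __name__ == "__main__":
    # A region of 1 inch by 0.5 inch at 300 DPI.
    print(render_size_px(0.0, 0.0, 72.0, 36.0))
```

Rendering only the candidate region rather than the whole page keeps the raster image small, which matters because the decoder then has less to scan.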
Answering that required a bit of digging into the PDF files, to understand how the barcodes are actually represented. Lacking a “GammaRay for PDF”, Inkscape turned out to be of great help. Importing PDF files there gives you both a graphical and a “textual” (via the generated SVG) representation of the PDF content. This showed three different variants:
1. A single complex filled path for the entire barcode.
2. A set of small filled paths (typically quads), one for each line or dot of the barcode.
3. A set of interrupted line strokes with a sufficiently wide pen, to draw the barcode as “scanlines”.
Case (1) is the easiest one: path fill operations with a solid black brush and hundreds or more path elements within a bounding box of just a few centimeters are very rare for anything else, even more so when filtering out paths with curve elements.
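As a rough illustration of the case (1) heuristic, here is a minimal Python sketch. The `FilledPath` type, the element kinds and the thresholds (200 elements, 200 points per side, roughly 7 cm) are assumptions made up for this example; they are not Poppler types or the values KItinerary uses.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical element kinds, loosely modelled on move/line/curve/close.
MOVE, LINE, CURVE, CLOSE = "M", "L", "C", "Z"


@dataclass
class FilledPath:
    """Illustrative stand-in for one path fill operation."""
    elements: List[Tuple[str, float, float]]  # (kind, x, y) in PDF points
    fill_rgb: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    fill_opacity: float = 1.0


def bounding_box(path):
    xs = [x for _, x, _ in path.elements]
    ys = [y for _, _, y in path.elements]
    return min(xs), min(ys), max(xs), max(ys)


def looks_like_single_path_barcode(path, min_elements=200, max_side_pt=200.0):
    """Apply the cheap checks in turn; thresholds are assumed values."""
    if not path.elements:
        return False
    # Solid black brush only.
    if path.fill_rgb != (0.0, 0.0, 0.0) or path.fill_opacity < 1.0:
        return False
    # Barcodes consist of hundreds of path elements or more.
    if len(path.elements) < min_elements:
        return False
    # Bars and dots are straight-edged, so any curve rules the path out.
    if any(kind == CURVE for kind, _, _ in path.elements):
        return False
    # The whole thing has to fit into a few centimeters.
    x0, y0, x1, y1 = bounding_box(path)
    return (x1 - x0) <= max_side_pt and (y1 - y0) <= max_side_pt


def bars_as_single_path(bar_count, bar_width=1.0, pitch=2.0, height=50.0):
    """Build a toy 1D barcode: many rectangles inside one filled path."""
    elements = []
    for i in range(bar_count):
        x = i * pitch
        elements += [(MOVE, x, 0.0), (LINE, x + bar_width, 0.0),
                     (LINE, x + bar_width, height), (LINE, x, height),
                     (CLOSE, x, 0.0)]
    return FilledPath(elements)


if __name__ == "__main__":
    print(looks_like_single_path_barcode(bars_as_single_path(60)))
```

The checks are ordered from cheapest to most involved, so most non-barcode paths are rejected after a single comparison.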
The other two cases are much harder to detect without properly grouping all the involved drawing operations though. Here again Inkscape helped, as in all cases the barcodes were represented as an SVG group there, and Inkscape’s PDF import code contained the necessary hints on how to replicate that grouping in KItinerary.
So in the end we iterate over groups of path fill and line stroke operations found in the document, check them for being plausible barcodes by looking at brush or pen properties, path complexity, output size, etc., and then render them to a raster image. The last two steps are expensive, so it’s important we discard as many false positives as possible before we get there.
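The overall flow can be sketched roughly like this in Python. Everything here is hypothetical: `DrawingGroup` and its fields, the thresholds, and the `render` and `decode` callables merely stand in for the real grouping, rendering and barcode decoding code. The point is only that the cheap plausibility checks run before anything expensive does.

```python
from dataclasses import dataclass


@dataclass
class DrawingGroup:
    """Hypothetical summary of a group of fill/stroke operations."""
    solid_black: bool     # brush or pen is opaque black
    has_curves: bool      # any curve element in any path of the group
    element_count: int    # total path elements in the group
    width_pt: float       # output size in PDF points
    height_pt: float


def plausible_barcode(group, min_elements=50, max_side_pt=200.0):
    """Cheap checks only: no rendering, no decoding. Thresholds are assumed."""
    return (group.solid_black
            and not group.has_curves
            and group.element_count >= min_elements
            and 0 < group.width_pt <= max_side_pt
            and 0 < group.height_pt <= max_side_pt)


def find_barcodes(groups, render, decode):
    """Render and decode only the groups that survive the cheap checks.

    `render` turns a group into an image, `decode` returns the barcode
    content or None; both are supplied by the caller.
    """
    results = []
    for group in groups:
        if not plausible_barcode(group):
            continue  # discarded before the expensive steps
        content = decode(render(group))
        if content is not None:
            results.append(content)
    return results
```

With a structure like this, a page full of text glyphs and decorative shapes costs a handful of comparisons per group, and only the few remaining candidates are ever rasterized.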
As a result all remaining PDF documents with previously undetected barcodes in my sample collection now work, with minimal extra runtime cost.
Poppler’s Private API
While I’m quite happy with the result, it unfortunately comes at a cost, in the form of a much stronger dependency on Poppler’s private API. KItinerary is already using Poppler’s private API for iterating over the images in a document, which makes distributors understandably very unhappy. For this dependency we had a plan on how to address it by adding the necessary features to Poppler’s public API (at the cost of processing the same document twice, once for text and once for images).
The new code, however, relies heavily on access to the low-level stream of drawing operations, which is a much larger API surface to expose from Poppler than just iterating over image assets. Seeing that Inkscape has the same problem, maybe that is actually necessary though?
This work heavily relies on access to a large variety of sample documents, to make sure we support all relevant cases. So if you encounter an airline boarding pass PDF file that isn’t detected as such with the current master branch or the upcoming 19.08 release, I’d be very interested in that test case :)