For KDE Itinerary it’s crucial we know the correct timezone for each element in the timeline, precisely enough to also handle complex situations like daylight saving time changes during an international flight. How can we reliably determine the timezone though, e.g. given a geographic coordinate, offline and on a resource-constraint mobile device?
There’s more than 400 timezones defined in the IANA timezone database,
which is also what’s behind
QTimeZone. The geographic shapes associated with them
are available here, extracted
from OSM. This is about 100 MB of high precision vector data.
That’s a bit excessive for a desktop application, and out of the question for mobile or embedded use. Worse, doing hit detection on complex polygons of that size is also quite expensive at runtime.
As IANA timezones largely follow country or regional borders, determining those based on a geographic coordinate is a very similar problem. To illustrate just how complex this can get, see the below map of the village of Baarle, which is stuck in a quantum superposition of being in Belgium and the Netherlands at the same time.
So we want a more space-efficient encoding for this data, optimized for fast lookup, while trading in some of the spatial precision. What are acceptable trade-offs here depends largely on the application, but I’d guess few need sub-meter precision, or anything below the usual consumer GPS accuracy for that matter. For KDE Itinerary even two or three orders of magnitudes less are sufficient.
You could of course query an online service for this, but that’s coming at a high privacy cost, considering this requires sharing fairly high-precision location data, so we want something that works offline.
One possible way to implement this was developed by a former colleague of mine, and is actually surprisingly simple and effective. The idea is to store the timezone map as a color image, with a different color for each timezone. For lookup you just map the coordinate to the corresponding pixel, and translate the color back to the timezone with a simple lookup table.
The key to make this efficient is in the selection of the image encoding format. Ideal is a format that allows independent access to individual scanlines, and that uses a run-length encoding for each scanline, such as TGA. With that you can access an individual pixel with constant memory cost, independent of the image size.
The use of an image format has the advantage that precision/cost trade-offs are pretty obvious, it’s very easy to create using the above mentioned timezone shapefiles and QGIS, and debugging can be done visually with an image viewer.
This approach has been in use for the offline preparation of KDE Itinerary’s extractor engine knowledge base so far. Not so much for it’s runtime efficiency though (as we are using a gigantic 27942 x 13968 map), but for its ease of use.
The efficiency of this comes from the run-length encoding of scanlines, which is very good at leveraging one-dimensional spatial proximity of the encoded features, ie. a typical scanline only contains few continuous regions, independent of the resolution. It however doesn’t use the same property in the second dimension at all. Image formats that exploit this like e.g. PNG achieve an even better compression, but at the cost of constant memory decoding.
After the good results with using Z-order curves for the work on public transport line metadata recently, I tried to apply this to the timezone lookup problem as well. Z-order curves provide a one-dimensional representation of a multi-dimensional space while preserving spatial proximity. Applying a run-length encoding to this is similarly efficient as with the image approach, but applies to both dimensions.
This is easier to imagine as an efficient representation of a quadtree, where a tile is further subdivided if it covers multiple timezones, up to a certain depth limit. The depth limit defines the precision you can achieve.
As generating such a data structure requires computing intersections with the timezone geometry, this is best done within a tool made for such things, such as QGIS’ Python scripting. The generation script can be found here.
Compared to the image approach this brings in a bit of extra complexity, despite the actual z-order curve part being fairly straightforward:
Generation takes a lot of time. For the parameters picked for KDE Itinerary right now it’s 1-2 hours on 8 cores, a parallel implementation is therefore pretty much mandatory. The image approach hardly ever needs more than a few seconds for comparison.
There’s various parameters you can tweak (which is actually good), but unlike with the image approach their exact impact can be hard to predict. That is particularly annoying with experimental evaluation needing hours of computing time.
To be fair, parts of the generation cost comes from the somewhat more advanced conflict handling stage on quadtree entries at the maximum depth still covering multiple timezones.
If you consider that timezones are mostly hour-aligned (with a few notable exceptions),
many of them have to be practically equivalent when looking at a reasonably small time window,
say the next year, and in many use-cases having an equivalent timezone is actually good enough.
That is, for showing the right time in 2020 it doesn’t matter if you are using
In the end this is worth it though I think. With about 300kB of static data we can correctly resolve the timezone from a given coordinate on 99.2% of the covered area, and for 99.6% we at least get a correct equivalent timezone. The maximum error distance is set to about 300m (varies with the latitude). The covered area isn’t the entire earth though, as we cut off the polar regions below -65°S and above 80°N. This gives us about 20% more “space” on the z-order curve to increase precision in areas more relevant for the vast majority of users.
Precision turned out good enough to entirely replace the timezone information we previously carried for all airports and train stations explicitly, which allows us to recoup about 74kB of disk space. This wasn’t entirely expected, as there are some tricky locations very close to a border in there (e.g. Geneva (GVA)).
One aspect motivating this work was its applicability to other discrete features distributed with a strong spatial proximity,
such as countries or regions. However, halfway through this I realized that an IANA timezones actually implies a country. There is
only one exception to this, since the 2019a update northern Vietnam is assigned the
Asia/Bangkok timezone otherwise only used in Thailand.
That would be fixable though by assigning the northern Vietnam area a separate internal timezone identifier, and resolving this
differently depending on whether we are looking for country or timezone information.
After adding a slightly different strategy for handling ambiguous areas, we got a geo coordinate to country lookup almost for free as well, without even needing an extra index.
It’s worth keeping in mind though that all of this is of course an approximation, in places like Baarle where the country changes several times within our error margin this will not get you far.
As this might be useful for other applications as well, is there interest in having such location-related API in KDE Frameworks? Thinking about lookup, conversion and localization of ISO 3166-1/2 codes, IANA timezones, etc.