Automatically finding and fixing insecure HTTP links
At the end of last month I attended the KDE privacy goal sprint in Leipzig. Together with Sandro I continued to look into tooling for identifying and fixing insecure HTTP links, an issue I have written about before. The result of this can be found in D19996.
Identifying insecure links
The first tool we built is httpcheck, a scanner for http: URLs in whatever files you point it to. It is optimized for high speed and therefore doesn't do any online validation.
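The core of such a scanner can be sketched as follows; the regular expression and all names here are illustrative stand-ins, not the actual httpcheck implementation:

```python
import re
import sys

# Plain-text pattern for insecure links; purely offline, no network access.
HTTP_URL = re.compile(r'\bhttp://[^\s"\'<>)]+')

def scan_file(path):
    """Yield (line number, URL) pairs for every http: link in the file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            for match in HTTP_URL.finditer(line):
                yield lineno, match.group(0)

if __name__ == "__main__":
    found = False
    for path in sys.argv[1:]:
        for lineno, url in scan_file(path):
            found = True
            print(f"{path}:{lineno}: insecure link {url}")
    sys.exit(1 if found else 0)
```

Keeping the check purely textual, with no network round trips, is what makes it fast enough to run over an entire source tree or inside a unit test.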
Obviously something like this can never be perfect, so it has a few features to deal with the common problems we encountered:
- There is a global exclusion list for known URIs, e.g. those used in XML namespaces, where the http: is part of the identifier and not resolved as a network address (see also the last post on this issue).
- There is an exclusion list for services known to not support transport encryption (yes, those still exist in 2019), as well as for URLs that would just produce an unmanageable amount of warning noise for now (that’s mainly the gnu.org addresses commonly found in license headers).
- Like other code checkers, this supports inline and per-module overrides to suppress warnings. It is for example quite important to not touch code that deals with adjusting https: URLs and therefore might validly contain fragments of what looks like an insecure URL.
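The exclusion and override handling could be modelled along these lines; the list entries and the marker string are made-up stand-ins, not the actual httpcheck configuration:

```python
# Illustrative exclusions; real entries live in httpcheck's own lists.
EXCLUDED_PREFIXES = (
    "http://www.w3.org/",            # XML namespace URIs: identifiers, not links
    "http://www.gnu.org/licenses/",  # license headers, excluded to reduce noise
)

# Hypothetical inline override marker, analogous to e.g. "NOLINT" comments.
OVERRIDE_MARKER = "@ignore:insecure-urls"

def is_flagged(url, line):
    """Return True if the URL found on this line should be reported."""
    if OVERRIDE_MARKER in line:
        return False  # explicitly marked as intentional
    return not any(url.startswith(prefix) for prefix in EXCLUDED_PREFIXES)
```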
A tool like this is mainly useful to prevent new issues from being introduced, and there are two ideas on how to deploy it:
- As a unit test injected by ECM into all projects (as it’s currently done for the appstream test, and that’s also why the code for this is in ECM).
- As a commit hook, similar to the license checks run at commit time.
Before rolling this out we need to fix the current code base, though, so we don't drown in warnings and test failures.
Automatically fixing insecure links
And that brings us to the second tool, httpupdate, which is meant to automate the migration to https:. It consumes the same overrides and exclusion lists as httpcheck, so it won't touch anything explicitly marked as intentionally using http:. It also doesn't simply replace http: with https:, but first validates that the corresponding service actually supports secure connections.
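That validation step can be sketched like this, with the probe made injectable so the upgrade logic stays testable offline; none of these names come from the actual httpupdate code:

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen

def https_supported(url, timeout=10):
    """Probe whether the https: variant of a URL actually answers."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status < 400
    except OSError:
        return False  # connection refused, TLS failure, dead host, ...

def upgrade_url(url, probe=https_supported):
    """Return the https: version of an http: URL if the service supports
    it, otherwise the URL unchanged. `probe` is injectable for testing."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url
    secure = urlunsplit(("https",) + tuple(parts[1:]))
    return secure if probe(secure) else url
```

With the default probe this performs a real HEAD request against the https: variant, which is also where dead hosts and vanished services show up as failures.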
A side-effect of this is that it also identifies dead links or no longer existing services, and therefore helps to maintain e.g. our documentation.
Of course this is also imperfect and the result always needs manual review, but it nevertheless massively speeds up the process compared to doing everything by hand.
How much does this help with the overall privacy of our users though? How often do you click on links in the documentation or CMake output, let alone in license headers? And even then, don't HSTS-enabled browsers and properly configured web servers redirect to secure connections anyway? In most cases this is probably true, and the practical impact is limited.
However, during the test runs of the tools at the sprint we found two possible data leaks this way (one when using a URL shortening service, one for a pastebin service), among hundreds of probably less impactful insecure links. So I think this is worth it, even if it just helps us spot a potential high-impact issue among the many harmless ones.
As mentioned above, before it makes sense to roll out the continuous checks we need to fix the current state. That means going through all repositories to see what these tools find, fixing things, and improving the tools and their exclusion lists along the way. So there's plenty of opportunity to help :)