[tz] [PROPOSED] Prefer https: URLs and bring some URLs up to date
gilmoreorless at gmail.com
Mon Sep 25 21:59:10 UTC 2017
> On 24 Sep 2017, at 11:23, Paul Eggert <eggert at cs.ucla.edu> wrote:
> At this point I'd guess most of the http: URLs are broken; however, fixing them all is beyond the scope of this patch.
How important is it to the tz project that the URLs stay valid? (No facetiousness intended, it’s a serious question.) I’ve been assuming it’s important to maintain the references given the data files’ use as a historical document, but it’s always best to double-check assumptions.
The reason I ask is that a while ago I wrote a prototype script that would try to verify and update all the reference URLs in the data files. It was a quick Sunday afternoon project that didn’t get much further than checking HTTP status codes, but could easily be revived. I didn’t do much with it at the time because:
a) I was meant to be doing something else (isn’t that always the case?)
b) I wasn’t entirely sure it was a valuable endeavour to begin with (i.e. the question I posed at the start)
The basic process was planned to be along these lines:
1. Find all reference URLs in data source comments.
2. For each link, do a HEAD request to get the HTTP headers of the URL:
* If the URL still works, move on.
* If the status code is a 301 permanent redirect, update the reference to the new URL.
* If the domain no longer resolves, or the URL returns a 404 not found, make a request to the Web Archive API to find the latest cached version of that page, and update the reference URL accordingly.
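A minimal sketch of the process above, in Python using only the standard library, might look like the following. It is not the prototype script itself. The function names are hypothetical, and it assumes that reference URLs appear after a "#" comment marker and that the Web Archive lookup uses the public availability endpoint at archive.org/wayback/available. Status codes the list does not mention (302, 500 and so on) are simply left alone here, which is also an assumption.

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request

# Rough URL pattern; real comments may need something stricter.
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


def find_comment_urls(text):
    """Step 1: collect URLs that appear after a '#' on any line."""
    urls = []
    for line in text.splitlines():
        _, sep, comment = line.partition("#")
        if sep:
            # Drop sentence punctuation that trails the URL.
            urls.extend(u.rstrip(".,;") for u in URL_RE.findall(comment))
    return urls


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Make urllib raise HTTPError on 3xx instead of following it."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def head_status(url, timeout=10):
    """Step 2: HEAD request. Returns (status, Location header or None).

    A status of None means the request failed outright
    (for example, the domain no longer resolves).
    """
    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(req, timeout=timeout) as resp:
            return resp.status, None
    except urllib.error.HTTPError as e:
        return e.code, e.headers.get("Location")
    except (urllib.error.URLError, OSError):
        return None, None


def classify(status, location=None):
    """Map a HEAD result onto the three cases in the list above."""
    if status is None or status == 404:
        return "archive"  # look for an archived copy
    if status == 301 and location:
        return "update"   # permanent redirect: use the new URL
    return "keep"         # works, or a case the list does not cover


def wayback_lookup_url(url):
    """Build the availability query (assumed endpoint, see lead-in)."""
    query = urllib.parse.urlencode({"url": url})
    return "https://archive.org/wayback/available?" + query


def closest_snapshot(payload):
    """Pull the snapshot URL out of a decoded availability response."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def suggest_replacement(url):
    """Return a replacement URL, or None to leave the reference as is."""
    status, location = head_status(url)
    action = classify(status, location)
    if action == "update":
        return urllib.parse.urljoin(url, location)
    if action == "archive":
        with urllib.request.urlopen(wayback_lookup_url(url), timeout=10) as r:
            return closest_snapshot(json.load(r))
    return None
```

Running find_comment_urls over each data file and passing every result to suggest_replacement would give a list of candidate changes to review by hand before turning them into a patch.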
It obviously wouldn’t catch all the changes that happen to links on the web (like Paul’s https upgrade patch), but it could find a good chunk of them.
If this kind of work would be valuable I can restart the project and send through some patches (when I next find the spare time, of course).