[tz] Comments

Sat Apr 22 08:28:01 UTC 2023

Guy Harris via tz <tz at iana.org> wrote on Fri, 21 Apr 2023 at 17:29:33 EDT in <59840028-62AD-4567-9D90-DAD3CBEAB025 at sonic.net>:
> On Apr 21, 2023, at 1:05 PM, Michael Douglass via tz <tz at iana.org> wrote:
> > On 4/21/23 15:13, Arthur David Olson via tz wrote:
> > > Fun fact: the time zone database is 80% comments by volume.
> > What's that by weight?
> If stored on paper, probably a much smaller percentage, as most of
...

I would say, rather, that the proper metaphorical analogue of percent-by-mass (as opposed to percent-by-volume) is not actual mass but effectiveness or efficacy. (Although maybe that is more like molarity or density than straight-up %-by-mass? If so, it's a mere arithmetic conversion.)

And so, one might argue, comments are almost exactly equivalent to code here, and everything captured in code is also captured in comments, so tz is 100% "mass" of comments. Or 0% "mass" of comments, because they are fully duplicative of the coded data, just in another form. Or, perhaps, 50% since that's the mean of 0% and 100%.

However, that assumption isn't right. There are plenty of places where the text of the comments explains the sourcing but doesn't actually enumerate the specifics of the transitions, which are clear enough from the coded data that follows in tabular form. So measuring how many bytes of coded data contain information not found in the comments would require a line-by-line inquiry and semantic evaluation.
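The by-volume figure, by contrast, is purely mechanical. A minimal sketch of how one might compute it, assuming (simplistically) that every "#" starts a comment running to the end of the line and ignoring any quoting rules; the function name comment_fraction and the sample text are my own inventions, not anything in the tz distribution:

```python
def comment_fraction(text: str) -> float:
    """Return the fraction of bytes (newlines excluded) that sit inside '#' comments.

    Assumption: every '#' begins a comment that runs to end of line;
    quoted '#' characters are not handled.
    """
    comment_bytes = 0
    total_bytes = 0
    for line in text.splitlines():
        encoded = line.encode("utf-8")
        total_bytes += len(encoded)
        pos = encoded.find(b"#")
        if pos != -1:
            # Everything from the '#' to end of line counts as comment.
            comment_bytes += len(encoded) - pos
    return comment_bytes / total_bytes if total_bytes else 0.0


# Hypothetical sample text, not real tz data.
sample = "# A story about a sundial\nZone Example 0 - UTC\n"
print(f"{comment_fraction(sample):.0%} comments by volume")
```

This says nothing about "mass," of course; that part still needs a human reading each line.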

And at some point, there is the question of the value of history. Is an entertaining story of a "derisive offer to erect a sundial" in Detroit "mere dicta," or is it valuable content that is fully itself a part of the database, or somewhere in between? How should it be measured?

Still another question might be the ratio of "tzdata" to "tzcode," which is, at least, easy to calculate.

When we're done answering all those questions, someone can produce some nifty graphs of how this has changed over time. Beyond pure growth, it may be that discussion of time zone abbreviations, and of their local invention in the database, was considered meaty content in prior years and is viewed differently now. So if the semantic standards change over time, that task might become even tougher.

I suppose there's probably some prior art here in the discipline of philosophy of science (or philosophy of history), but I don't know it. One could also imagine tackling this problem with machine learning.

--
jhawk at alum.mit.edu
John Hawkinson