[tz] [PROPOSED PATCH 2/2] Use lz format for new tarball

Random832 random832 at fastmail.com
Sat Aug 27 19:43:02 UTC 2016


On Fri, Aug 26, 2016, at 19:10, Antonio Diaz Diaz wrote:
> Paul Eggert wrote:
> >> please use a compression format that can be handled easily by
> >> Windows users as well. For instance, choose a format from the list
> >> that 7Zip can handle: http://www.7-zip.org/
> >
> > Thanks for mentioning the problem. xz format is on 7-Zip's list; it's
> > a tiny bit larger than lzip format for our data (0.3% larger for the
> > draft tzdb tarball) but I suppose portability trumps this minor
> > advantage.
> 
> Please, do not use xz for a new distribution format. The xz format is 
> defective. See for example http://www.nongnu.org/lzip/xz_inadequate.html 

Seems like a lot of fear, uncertainty, and doubt.

" Xz was designed as a fragmented format. Xz implementations may choose
what subset of the format they support. For example the xz-embedded
decompressor does not support the optional CRC64 check, and is therefore
unable to verify the integrity of the files produced by default by
xz-utils. Xz files must be produced specially for the xz-embedded
decompressor. " - Is this last sentence even true? Does xz-embedded fail
to open such files, or does it merely skip the integrity check? Someone
could write an lzip extractor that ignores the CRC; would that be an
indictment of your format?

"It has room for 2^63 filters, which can then be combined to make an
even larger number of algorithms. Xz reserves less than 0.8% of filter
IDs for custom filters, but even this small range provides about 8
million custom filter IDs for each human inhabitant on earth. There is
not the slightest justification for such egregious level of
extensibility. " - This seems like a criticism of the choice of data
type for the filter ID field; I'm not sure what the point is.
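For what it's worth, the quoted "8 million per person" figure is easy to
check with back-of-the-envelope arithmetic (a rough sketch, assuming a
2^63 ID space, the quoted "less than 0.8%" custom range, and a 2016
world population of roughly 7.5 billion):

```python
# Back-of-the-envelope check of the quoted "about 8 million custom
# filter IDs for each human inhabitant" figure.
# Assumptions: 2**63 possible filter IDs, 0.8% of them reserved for
# custom filters, and a world population of about 7.5e9.
total_ids = 2 ** 63
custom_ids = total_ids * 0.008      # "less than 0.8%" of the ID space
population = 7.5e9
per_person = custom_ids / population
print(f"{per_person:.2e} custom filter IDs per person")  # on the order of 1e7
```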

"The 'file' utility does not provide any help:" "Xz-utils can report the
minimum version of xz-utils required to decompress a given file, but it
must examine the file contents to find it out," - how does 'file' work
if not by examining the file content?
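As a concrete illustration (a minimal sketch; the magic values below are
taken from the published xz, lzip, gzip, and bzip2 format
specifications), identifying a compressed file means nothing more than
reading its first few bytes, which is exactly what 'file' does with its
magic database:

```python
# Minimal sketch of magic-number detection, the same technique the
# 'file' utility uses: read the first few bytes of a file and compare
# them against known signatures from each format's specification.
MAGICS = [
    (b"\xfd7zXZ\x00", "xz"),     # xz stream header magic
    (b"LZIP",         "lzip"),   # lzip member magic
    (b"\x1f\x8b",     "gzip"),   # gzip magic
    (b"BZh",          "bzip2"),  # bzip2 magic
]

def identify(path):
    """Return a format name based on the file's leading magic bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, name in MAGICS:
        if head.startswith(magic):
            return name
    return "unknown"
```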

"Not only data at a random position are interpreted as the CRC. Whatever
data that follow the bogus CRC will be interpreted as the beginning of
the following field, preventing the successful decoding of any remaining
data in the stream. "

What are the odds that the bytes found there will coincidentally match
the CRC of the short data? And won't a corrupted length field always
prevent the successful decoding of any remaining data, regardless of how
the CRC is stored relative to it?
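To put a number on it: for a 32-bit CRC, the chance that arbitrary bytes
at the wrong offset happen to equal the CRC of the truncated data is
about 1 in 2^32, assuming the corruption is effectively uniform. A quick
empirical sketch of that assumption:

```python
# Rough empirical estimate of how often random 4-byte values happen to
# match the CRC32 of some (also random) short payload. Under the
# uniform-corruption assumption the expected match rate is 1 / 2**32,
# i.e. essentially never at this sample size.
import os
import zlib

trials = 100_000
matches = 0
for _ in range(trials):
    payload = os.urandom(16)   # a short "truncated" payload
    bogus = os.urandom(4)      # bytes misread as the CRC field
    if bogus == zlib.crc32(payload).to_bytes(4, "little"):
        matches += 1
print(matches, "matches in", trials, "trials")
```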

----

Anyway, why even use a compressed format? Is the data large enough for
it to matter?

