Welcome to mirror list, hosted at ThFree Co, Russian Federation.

cygwin.com/git/newlib-cygwin.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
Diffstat (limited to 'winsup/bz2lib/manual_4.html')
-rw-r--r--winsup/bz2lib/manual_4.html528
1 files changed, 528 insertions, 0 deletions
diff --git a/winsup/bz2lib/manual_4.html b/winsup/bz2lib/manual_4.html
new file mode 100644
index 000000000..9ab7fb24f
--- /dev/null
+++ b/winsup/bz2lib/manual_4.html
@@ -0,0 +1,528 @@
+<HTML>
+<HEAD>
+<!-- This HTML file has been created by texi2html 1.54
+ from manual.texi on 23 March 2000 -->
+
+<TITLE>bzip2 and libbzip2 - Miscellanea</TITLE>
+<link href="manual_3.html" rel=Previous>
+<link href="manual_toc.html" rel=ToC>
+
+</HEAD>
+<BODY>
+<p>Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>.
+<P><HR><P>
+
+
+<H1><A NAME="SEC43" HREF="manual_toc.html#TOC43">Miscellanea</A></H1>
+
+<P>
+These are just some random thoughts of mine. Your mileage may
+vary.
+
+</P>
+
+
+<H2><A NAME="SEC44" HREF="manual_toc.html#TOC44">Limitations of the compressed file format</A></H2>
+<P>
+<CODE>bzip2-1.0</CODE>, <CODE>0.9.5</CODE> and <CODE>0.9.0</CODE>
+use exactly the same file format as the previous
+version, <CODE>bzip2-0.1</CODE>. This decision was made in the interests of
+stability. Creating yet another incompatible compressed file format
+would create further confusion and disruption for users.
+
+</P>
+<P>
+Nevertheless, this is not a painless decision. Development
+work since the release of <CODE>bzip2-0.1</CODE> in August 1997
+has shown complexities in the file format which slow down
+decompression and, in retrospect, are unnecessary. These are:
+
+<UL>
+<LI>The run-length encoder, which is the first of the
+
+ compression transformations, is entirely irrelevant.
+ The original purpose was to protect the sorting algorithm
+ from the very worst case input: a string of repeated
+ symbols. But algorithm steps Q6a and Q6b in the original
+ Burrows-Wheeler technical report (SRC-124) show how
+ repeats can be handled without difficulty in block
+ sorting.
+<LI>The randomisation mechanism doesn't really need to be
+
+ there. Udi Manber and Gene Myers published a suffix
+ array construction algorithm a few years back, which
+ can be employed to sort any block, no matter how
+ repetitive, in O(N log N) time. Subsequent work by
+ Kunihiko Sadakane has produced a derivative O(N (log N)^2)
+ algorithm which usually outperforms the Manber-Myers
+ algorithm.
+
+ I could have changed to Sadakane's algorithm, but I find
+ it to be slower than <CODE>bzip2</CODE>'s existing algorithm for
+ most inputs, and the randomisation mechanism protects
+ adequately against bad cases. I didn't think it was
+ a good tradeoff to make. Partly this is due to the fact
+ that I was not flooded with email complaints about
+ <CODE>bzip2-0.1</CODE>'s performance on repetitive data, so
+ perhaps it isn't a problem for real inputs.
+
+ Probably the best long-term solution,
+ and the one I have incorporated into 0.9.5 and above,
+ is to use the existing sorting
+ algorithm initially, and fall back to a O(N (log N)^2)
+ algorithm if the standard algorithm gets into difficulties.
+<LI>The compressed file format was never designed to be
+
+ handled by a library, and I have had to jump though
+ some hoops to produce an efficient implementation of
+ decompression. It's a bit hairy. Try passing
+ <CODE>decompress.c</CODE> through the C preprocessor
+ and you'll see what I mean. Much of this complexity
+ could have been avoided if the compressed size of
+ each block of data was recorded in the data stream.
+<LI>An Adler-32 checksum, rather than a CRC32 checksum,
+
+ would be faster to compute.
+</UL>
+
+<P>
+It would be fair to say that the <CODE>bzip2</CODE> format was frozen
+before I properly and fully understood the performance
+consequences of doing so.
+
+</P>
+<P>
+Improvements which I was able to incorporate into
+0.9.0, despite using the same file format, are:
+
+<UL>
+<LI>Single array implementation of the inverse BWT. This
+
+ significantly speeds up decompression, presumably
+ because it reduces the number of cache misses.
+<LI>Faster inverse MTF transform for large MTF values. The
+
+ new implementation is based on the notion of sliding blocks
+ of values.
+<LI><CODE>bzip2-0.9.0</CODE> now reads and writes files with <CODE>fread</CODE>
+
+ and <CODE>fwrite</CODE>; version 0.1 used <CODE>putc</CODE> and <CODE>getc</CODE>.
+ Duh! Well, you live and learn.
+
+</UL>
+
+<P>
+Further ahead, it would be nice
+to be able to do random access into files. This will
+require some careful design of compressed file formats.
+
+</P>
+
+
+
+<H2><A NAME="SEC45" HREF="manual_toc.html#TOC45">Portability issues</A></H2>
+<P>
+After some consideration, I have decided not to use
+GNU <CODE>autoconf</CODE> to configure 0.9.5 or 1.0.
+
+</P>
+<P>
+<CODE>autoconf</CODE>, admirable and wonderful though it is,
+mainly assists with portability problems between Unix-like
+platforms. But <CODE>bzip2</CODE> doesn't have much in the way
+of portability problems on Unix; most of the difficulties appear
+when porting to the Mac, or to Microsoft's operating systems.
+<CODE>autoconf</CODE> doesn't help in those cases, and brings in a
+whole load of new complexity.
+
+</P>
+<P>
+Most people should be able to compile the library and program
+under Unix straight out-of-the-box, so to speak, especially
+if you have a version of GNU C available.
+
+</P>
+<P>
+There are a couple of <CODE>__inline__</CODE> directives in the code. GNU C
+(<CODE>gcc</CODE>) should be able to handle them. If you're not using
+GNU C, your C compiler shouldn't see them at all.
+If your compiler does, for some reason, see them and doesn't
+like them, just <CODE>#define</CODE> <CODE>__inline__</CODE> to be <CODE>/* */</CODE>. One
+easy way to do this is to compile with the flag <CODE>-D__inline__=</CODE>,
+which should be understood by most Unix compilers.
+
+</P>
+<P>
+If you still have difficulties, try compiling with the macro
+<CODE>BZ_STRICT_ANSI</CODE> defined. This should enable you to build the
+library in a strictly ANSI compliant environment. Building the program
+itself like this is dangerous and not supported, since you remove
+<CODE>bzip2</CODE>'s checks against compressing directories, symbolic links,
+devices, and other not-really-a-file entities. This could cause
+filesystem corruption!
+
+</P>
+<P>
+One other thing: if you create a <CODE>bzip2</CODE> binary for public
+distribution, please try and link it statically (<CODE>gcc -s</CODE>). This
+avoids all sorts of library-version issues that others may encounter
+later on.
+
+</P>
+<P>
+If you build <CODE>bzip2</CODE> on Win32, you must set <CODE>BZ_UNIX</CODE> to 0 and
+<CODE>BZ_LCCWIN32</CODE> to 1, in the file <CODE>bzip2.c</CODE>, before compiling.
+Otherwise the resulting binary won't work correctly.
+
+</P>
+
+
+
+<H2><A NAME="SEC46" HREF="manual_toc.html#TOC46">Reporting bugs</A></H2>
+<P>
+I tried pretty hard to make sure <CODE>bzip2</CODE> is
+bug free, both by design and by testing. Hopefully
+you'll never need to read this section for real.
+
+</P>
+<P>
+Nevertheless, if <CODE>bzip2</CODE> dies with a segmentation
+fault, a bus error or an internal assertion failure, it
+will ask you to email me a bug report. Experience with
+version 0.1 shows that almost all these problems can
+be traced to either compiler bugs or hardware problems.
+
+<UL>
+<LI>
+
+Recompile the program with no optimisation, and see if it
+works. And/or try a different compiler.
+I heard all sorts of stories about various flavours
+of GNU C (and other compilers) generating bad code for
+<CODE>bzip2</CODE>, and I've run across two such examples myself.
+
+2.7.X versions of GNU C are known to generate bad code from
+time to time, at high optimisation levels.
+If you get problems, try using the flags
+<CODE>-O2</CODE> <CODE>-fomit-frame-pointer</CODE> <CODE>-fno-strength-reduce</CODE>.
+You should specifically <EM>not</EM> use <CODE>-funroll-loops</CODE>.
+
+You may notice that the Makefile runs six tests as part of
+the build process. If the program passes all of these, it's
+a pretty good (but not 100%) indication that the compiler has
+done its job correctly.
+<LI>
+
+If <CODE>bzip2</CODE> crashes randomly, and the crashes are not
+repeatable, you may have a flaky memory subsystem. <CODE>bzip2</CODE>
+really hammers your memory hierarchy, and if it's a bit marginal,
+you may get these problems. Ditto if your disk or I/O subsystem
+is slowly failing. Yup, this really does happen.
+
+Try using a different machine of the same type, and see if
+you can repeat the problem.
+<LI>This isn't really a bug, but ... If <CODE>bzip2</CODE> tells
+
+you your file is corrupted on decompression, and you
+obtained the file via FTP, there is a possibility that you
+forgot to tell FTP to do a binary mode transfer. That absolutely
+will cause the file to be non-decompressible. You'll have to transfer
+it again.
+</UL>
+
+<P>
+If you've incorporated <CODE>libbzip2</CODE> into your own program
+and are getting problems, please, please, please, check that the
+parameters you are passing in calls to the library, are
+correct, and in accordance with what the documentation says
+is allowable. I have tried to make the library robust against
+such problems, but I'm sure I haven't succeeded.
+
+</P>
+<P>
+Finally, if the above comments don't help, you'll have to send
+me a bug report. Now, it's just amazing how many people will
+send me a bug report saying something like
+
+<PRE>
+ bzip2 crashed with segmentation fault on my machine
+</PRE>
+
+<P>
+and absolutely nothing else. Needless to say, a such a report
+is <EM>totally, utterly, completely and comprehensively 100% useless;
+a waste of your time, my time, and net bandwidth</EM>.
+With no details at all, there's no way I can possibly begin
+to figure out what the problem is.
+
+</P>
+<P>
+The rules of the game are: facts, facts, facts. Don't omit
+them because "oh, they won't be relevant". At the bare
+minimum:
+
+<PRE>
+ Machine type. Operating system version.
+ Exact version of <CODE>bzip2</CODE> (do <CODE>bzip2 -V</CODE>).
+ Exact version of the compiler used.
+ Flags passed to the compiler.
+</PRE>
+
+<P>
+However, the most important single thing that will help me is
+the file that you were trying to compress or decompress at the
+time the problem happened. Without that, my ability to do anything
+more than speculate about the cause, is limited.
+
+</P>
+<P>
+Please remember that I connect to the Internet with a modem, so
+you should contact me before mailing me huge files.
+
+</P>
+
+
+
+<H2><A NAME="SEC47" HREF="manual_toc.html#TOC47">Did you get the right package?</A></H2>
+
+<P>
+<CODE>bzip2</CODE> is a resource hog. It soaks up large amounts of CPU cycles
+and memory. Also, it gives very large latencies. In the worst case, you
+can feed many megabytes of uncompressed data into the library before
+getting any compressed output, so this probably rules out applications
+requiring interactive behaviour.
+
+</P>
+<P>
+These aren't faults of my implementation, I hope, but more
+an intrinsic property of the Burrows-Wheeler transform (unfortunately).
+Maybe this isn't what you want.
+
+</P>
+<P>
+If you want a compressor and/or library which is faster, uses less
+memory but gets pretty good compression, and has minimal latency,
+consider Jean-loup
+Gailly's and Mark Adler's work, <CODE>zlib-1.1.2</CODE> and
+<CODE>gzip-1.2.4</CODE>. Look for them at
+
+</P>
+<P>
+<CODE>http://www.cdrom.com/pub/infozip/zlib</CODE> and
+<CODE>http://www.gzip.org</CODE> respectively.
+
+</P>
+<P>
+For something faster and lighter still, you might try Markus F X J
+Oberhumer's <CODE>LZO</CODE> real-time compression/decompression library, at
+<BR> <CODE>http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html</CODE>.
+
+</P>
+<P>
+If you want to use the <CODE>bzip2</CODE> algorithms to compress small blocks
+of data, 64k bytes or smaller, for example on an on-the-fly disk
+compressor, you'd be well advised not to use this library. Instead,
+I've made a special library tuned for that kind of use. It's part of
+<CODE>e2compr-0.40</CODE>, an on-the-fly disk compressor for the Linux
+<CODE>ext2</CODE> filesystem. Look at
+<CODE>http://www.netspace.net.au/~reiter/e2compr</CODE>.
+
+</P>
+
+
+
+<H2><A NAME="SEC48" HREF="manual_toc.html#TOC48">Testing</A></H2>
+
+<P>
+A record of the tests I've done.
+
+</P>
+<P>
+First, some data sets:
+
+<UL>
+<LI>B: a directory containing 6001 files, one for every length in the
+
+ range 0 to 6000 bytes. The files contain random lowercase
+ letters. 18.7 megabytes.
+<LI>H: my home directory tree. Documents, source code, mail files,
+
+ compressed data. H contains B, and also a directory of
+ files designed as boundary cases for the sorting; mostly very
+ repetitive, nasty files. 565 megabytes.
+<LI>A: directory tree holding various applications built from source:
+
+ <CODE>egcs</CODE>, <CODE>gcc-2.8.1</CODE>, KDE, GTK, Octave, etc.
+ 2200 megabytes.
+</UL>
+
+<P>
+The tests conducted are as follows. Each test means compressing
+(a copy of) each file in the data set, decompressing it and
+comparing it against the original.
+
+</P>
+<P>
+First, a bunch of tests with block sizes and internal buffer
+sizes set very small,
+to detect any problems with the
+blocking and buffering mechanisms.
+This required modifying the source code so as to try to
+break it.
+
+<OL>
+<LI>Data set H, with
+
+ buffer size of 1 byte, and block size of 23 bytes.
+<LI>Data set B, buffer sizes 1 byte, block size 1 byte.
+
+<LI>As (2) but small-mode decompression.
+
+<LI>As (2) with block size 2 bytes.
+
+<LI>As (2) with block size 3 bytes.
+
+<LI>As (2) with block size 4 bytes.
+
+<LI>As (2) with block size 5 bytes.
+
+<LI>As (2) with block size 6 bytes and small-mode decompression.
+
+<LI>H with buffer size of 1 byte, but normal block
+
+ size (up to 900000 bytes).
+</OL>
+
+<P>
+Then some tests with unmodified source code.
+
+<OL>
+<LI>H, all settings normal.
+
+<LI>As (1), with small-mode decompress.
+
+<LI>H, compress with flag <CODE>-1</CODE>.
+
+<LI>H, compress with flag <CODE>-s</CODE>, decompress with flag <CODE>-s</CODE>.
+
+<LI>Forwards compatibility: H, <CODE>bzip2-0.1pl2</CODE> compressing,
+
+ <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
+<LI>Backwards compatibility: H, <CODE>bzip2-0.9.5</CODE> compressing,
+
+ <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
+<LI>Bigger tests: A, all settings normal.
+
+<LI>As (7), using the fallback (Sadakane-like) sorting algorithm.
+
+<LI>As (8), compress with flag <CODE>-1</CODE>, decompress with flag
+
+ <CODE>-s</CODE>.
+<LI>H, using the fallback sorting algorithm.
+
+<LI>Forwards compatibility: A, <CODE>bzip2-0.1pl2</CODE> compressing,
+
+ <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal.
+<LI>Backwards compatibility: A, <CODE>bzip2-0.9.5</CODE> compressing,
+
+ <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal.
+<LI>Misc test: about 400 megabytes of <CODE>.tar</CODE> files with
+
+ <CODE>bzip2</CODE> compiled with Checker (a memory access error
+ detector, like Purify).
+<LI>Misc tests to make sure it builds and runs ok on non-Linux/x86
+
+ platforms.
+</OL>
+
+<P>
+These tests were conducted on a 225 MHz IDT WinChip machine, running
+Linux 2.0.36. They represent nearly a week of continuous computation.
+All tests completed successfully.
+
+</P>
+
+
+
+<H2><A NAME="SEC49" HREF="manual_toc.html#TOC49">Further reading</A></H2>
+<P>
+<CODE>bzip2</CODE> is not research work, in the sense that it doesn't present
+any new ideas. Rather, it's an engineering exercise based on existing
+ideas.
+
+</P>
+<P>
+Four documents describe essentially all the ideas behind <CODE>bzip2</CODE>:
+
+<PRE>
+Michael Burrows and D. J. Wheeler:
+ "A block-sorting lossless data compression algorithm"
+ 10th May 1994.
+ Digital SRC Research Report 124.
+ ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
+ If you have trouble finding it, try searching at the
+ New Zealand Digital Library, http://www.nzdl.org.
+
+Daniel S. Hirschberg and Debra A. LeLewer
+ "Efficient Decoding of Prefix Codes"
+ Communications of the ACM, April 1990, Vol 33, Number 4.
+ You might be able to get an electronic copy of this
+ from the ACM Digital Library.
+
+David J. Wheeler
+ Program bred3.c and accompanying document bred3.ps.
+ This contains the idea behind the multi-table Huffman
+ coding scheme.
+ ftp://ftp.cl.cam.ac.uk/users/djw3/
+
+Jon L. Bentley and Robert Sedgewick
+ "Fast Algorithms for Sorting and Searching Strings"
+ Available from Sedgewick's web page,
+ www.cs.princeton.edu/~rs
+</PRE>
+
+<P>
+The following paper gives valuable additional insights into the
+algorithm, but is not immediately the basis of any code
+used in bzip2.
+
+<PRE>
+Peter Fenwick:
+ Block Sorting Text Compression
+ Proceedings of the 19th Australasian Computer Science Conference,
+ Melbourne, Australia. Jan 31 - Feb 2, 1996.
+ ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
+</PRE>
+
+<P>
+Kunihiko Sadakane's sorting algorithm, mentioned above,
+is available from:
+
+<PRE>
+http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
+</PRE>
+
+<P>
+The Manber-Myers suffix array construction
+algorithm is described in a paper
+available from:
+
+<PRE>
+http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
+</PRE>
+
+<P>
+Finally, the following paper documents some recent investigations
+I made into the performance of sorting algorithms:
+
+<PRE>
+Julian Seward:
+ On the Performance of BWT Sorting Algorithms
+ Proceedings of the IEEE Data Compression Conference 2000
+ Snowbird, Utah. 28-30 March 2000.
+</PRE>
+
+<P><HR><P>
+<p>Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>.
+</BODY>
+</HTML>