diff options
Diffstat (limited to 'winsup/bz2lib/manual_4.html')
-rw-r--r-- | winsup/bz2lib/manual_4.html | 528 |
1 files changed, 0 insertions, 528 deletions
diff --git a/winsup/bz2lib/manual_4.html b/winsup/bz2lib/manual_4.html deleted file mode 100644 index 9ab7fb24f..000000000 --- a/winsup/bz2lib/manual_4.html +++ /dev/null @@ -1,528 +0,0 @@ -<HTML> -<HEAD> -<!-- This HTML file has been created by texi2html 1.54 - from manual.texi on 23 March 2000 --> - -<TITLE>bzip2 and libbzip2 - Miscellanea</TITLE> -<link href="manual_3.html" rel=Previous> -<link href="manual_toc.html" rel=ToC> - -</HEAD> -<BODY> -<p>Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>. -<P><HR><P> - - -<H1><A NAME="SEC43" HREF="manual_toc.html#TOC43">Miscellanea</A></H1> - -<P> -These are just some random thoughts of mine. Your mileage may -vary. - -</P> - - -<H2><A NAME="SEC44" HREF="manual_toc.html#TOC44">Limitations of the compressed file format</A></H2> -<P> -<CODE>bzip2-1.0</CODE>, <CODE>0.9.5</CODE> and <CODE>0.9.0</CODE> -use exactly the same file format as the previous -version, <CODE>bzip2-0.1</CODE>. This decision was made in the interests of -stability. Creating yet another incompatible compressed file format -would create further confusion and disruption for users. - -</P> -<P> -Nevertheless, this is not a painless decision. Development -work since the release of <CODE>bzip2-0.1</CODE> in August 1997 -has shown complexities in the file format which slow down -decompression and, in retrospect, are unnecessary. These are: - -<UL> -<LI>The run-length encoder, which is the first of the - - compression transformations, is entirely irrelevant. - The original purpose was to protect the sorting algorithm - from the very worst case input: a string of repeated - symbols. But algorithm steps Q6a and Q6b in the original - Burrows-Wheeler technical report (SRC-124) show how - repeats can be handled without difficulty in block - sorting. -<LI>The randomisation mechanism doesn't really need to be - - there. Udi Manber and Gene Myers published a suffix - array construction algorithm a few years back, which - can be employed to sort any block, no matter how - repetitive, in O(N log N) time. Subsequent work by - Kunihiko Sadakane has produced a derivative O(N (log N)^2) - algorithm which usually outperforms the Manber-Myers - algorithm. - - I could have changed to Sadakane's algorithm, but I find - it to be slower than <CODE>bzip2</CODE>'s existing algorithm for - most inputs, and the randomisation mechanism protects - adequately against bad cases. I didn't think it was - a good tradeoff to make. Partly this is due to the fact - that I was not flooded with email complaints about - <CODE>bzip2-0.1</CODE>'s performance on repetitive data, so - perhaps it isn't a problem for real inputs. - - Probably the best long-term solution, - and the one I have incorporated into 0.9.5 and above, - is to use the existing sorting - algorithm initially, and fall back to a O(N (log N)^2) - algorithm if the standard algorithm gets into difficulties. -<LI>The compressed file format was never designed to be - - handled by a library, and I have had to jump though - some hoops to produce an efficient implementation of - decompression. It's a bit hairy. Try passing - <CODE>decompress.c</CODE> through the C preprocessor - and you'll see what I mean. Much of this complexity - could have been avoided if the compressed size of - each block of data was recorded in the data stream. -<LI>An Adler-32 checksum, rather than a CRC32 checksum, - - would be faster to compute. -</UL> - -<P> -It would be fair to say that the <CODE>bzip2</CODE> format was frozen -before I properly and fully understood the performance -consequences of doing so. - -</P> -<P> -Improvements which I was able to incorporate into -0.9.0, despite using the same file format, are: - -<UL> -<LI>Single array implementation of the inverse BWT. This - - significantly speeds up decompression, presumably - because it reduces the number of cache misses. -<LI>Faster inverse MTF transform for large MTF values. The - - new implementation is based on the notion of sliding blocks - of values. -<LI><CODE>bzip2-0.9.0</CODE> now reads and writes files with <CODE>fread</CODE> - - and <CODE>fwrite</CODE>; version 0.1 used <CODE>putc</CODE> and <CODE>getc</CODE>. - Duh! Well, you live and learn. - -</UL> - -<P> -Further ahead, it would be nice -to be able to do random access into files. This will -require some careful design of compressed file formats. - -</P> - - - -<H2><A NAME="SEC45" HREF="manual_toc.html#TOC45">Portability issues</A></H2> -<P> -After some consideration, I have decided not to use -GNU <CODE>autoconf</CODE> to configure 0.9.5 or 1.0. - -</P> -<P> -<CODE>autoconf</CODE>, admirable and wonderful though it is, -mainly assists with portability problems between Unix-like -platforms. But <CODE>bzip2</CODE> doesn't have much in the way -of portability problems on Unix; most of the difficulties appear -when porting to the Mac, or to Microsoft's operating systems. -<CODE>autoconf</CODE> doesn't help in those cases, and brings in a -whole load of new complexity. - -</P> -<P> -Most people should be able to compile the library and program -under Unix straight out-of-the-box, so to speak, especially -if you have a version of GNU C available. - -</P> -<P> -There are a couple of <CODE>__inline__</CODE> directives in the code. GNU C -(<CODE>gcc</CODE>) should be able to handle them. If you're not using -GNU C, your C compiler shouldn't see them at all. -If your compiler does, for some reason, see them and doesn't -like them, just <CODE>#define</CODE> <CODE>__inline__</CODE> to be <CODE>/* */</CODE>. One -easy way to do this is to compile with the flag <CODE>-D__inline__=</CODE>, -which should be understood by most Unix compilers. - -</P> -<P> -If you still have difficulties, try compiling with the macro -<CODE>BZ_STRICT_ANSI</CODE> defined. This should enable you to build the -library in a strictly ANSI compliant environment. Building the program -itself like this is dangerous and not supported, since you remove -<CODE>bzip2</CODE>'s checks against compressing directories, symbolic links, -devices, and other not-really-a-file entities. This could cause -filesystem corruption! - -</P> -<P> -One other thing: if you create a <CODE>bzip2</CODE> binary for public -distribution, please try and link it statically (<CODE>gcc -s</CODE>). This -avoids all sorts of library-version issues that others may encounter -later on. - -</P> -<P> -If you build <CODE>bzip2</CODE> on Win32, you must set <CODE>BZ_UNIX</CODE> to 0 and -<CODE>BZ_LCCWIN32</CODE> to 1, in the file <CODE>bzip2.c</CODE>, before compiling. -Otherwise the resulting binary won't work correctly. - -</P> - - - -<H2><A NAME="SEC46" HREF="manual_toc.html#TOC46">Reporting bugs</A></H2> -<P> -I tried pretty hard to make sure <CODE>bzip2</CODE> is -bug free, both by design and by testing. Hopefully -you'll never need to read this section for real. - -</P> -<P> -Nevertheless, if <CODE>bzip2</CODE> dies with a segmentation -fault, a bus error or an internal assertion failure, it -will ask you to email me a bug report. Experience with -version 0.1 shows that almost all these problems can -be traced to either compiler bugs or hardware problems. - -<UL> -<LI> - -Recompile the program with no optimisation, and see if it -works. And/or try a different compiler. -I heard all sorts of stories about various flavours -of GNU C (and other compilers) generating bad code for -<CODE>bzip2</CODE>, and I've run across two such examples myself. - -2.7.X versions of GNU C are known to generate bad code from -time to time, at high optimisation levels. -If you get problems, try using the flags -<CODE>-O2</CODE> <CODE>-fomit-frame-pointer</CODE> <CODE>-fno-strength-reduce</CODE>. -You should specifically <EM>not</EM> use <CODE>-funroll-loops</CODE>. - -You may notice that the Makefile runs six tests as part of -the build process. If the program passes all of these, it's -a pretty good (but not 100%) indication that the compiler has -done its job correctly. -<LI> - -If <CODE>bzip2</CODE> crashes randomly, and the crashes are not -repeatable, you may have a flaky memory subsystem. <CODE>bzip2</CODE> -really hammers your memory hierarchy, and if it's a bit marginal, -you may get these problems. Ditto if your disk or I/O subsystem -is slowly failing. Yup, this really does happen. - -Try using a different machine of the same type, and see if -you can repeat the problem. -<LI>This isn't really a bug, but ... If <CODE>bzip2</CODE> tells - -you your file is corrupted on decompression, and you -obtained the file via FTP, there is a possibility that you -forgot to tell FTP to do a binary mode transfer. That absolutely -will cause the file to be non-decompressible. You'll have to transfer -it again. -</UL> - -<P> -If you've incorporated <CODE>libbzip2</CODE> into your own program -and are getting problems, please, please, please, check that the -parameters you are passing in calls to the library, are -correct, and in accordance with what the documentation says -is allowable. I have tried to make the library robust against -such problems, but I'm sure I haven't succeeded. - -</P> -<P> -Finally, if the above comments don't help, you'll have to send -me a bug report. Now, it's just amazing how many people will -send me a bug report saying something like - -<PRE> - bzip2 crashed with segmentation fault on my machine -</PRE> - -<P> -and absolutely nothing else. Needless to say, a such a report -is <EM>totally, utterly, completely and comprehensively 100% useless; -a waste of your time, my time, and net bandwidth</EM>. -With no details at all, there's no way I can possibly begin -to figure out what the problem is. - -</P> -<P> -The rules of the game are: facts, facts, facts. Don't omit -them because "oh, they won't be relevant". At the bare -minimum: - -<PRE> - Machine type. Operating system version. - Exact version of <CODE>bzip2</CODE> (do <CODE>bzip2 -V</CODE>). - Exact version of the compiler used. - Flags passed to the compiler. -</PRE> - -<P> -However, the most important single thing that will help me is -the file that you were trying to compress or decompress at the -time the problem happened. Without that, my ability to do anything -more than speculate about the cause, is limited. - -</P> -<P> -Please remember that I connect to the Internet with a modem, so -you should contact me before mailing me huge files. - -</P> - - - -<H2><A NAME="SEC47" HREF="manual_toc.html#TOC47">Did you get the right package?</A></H2> - -<P> -<CODE>bzip2</CODE> is a resource hog. It soaks up large amounts of CPU cycles -and memory. Also, it gives very large latencies. In the worst case, you -can feed many megabytes of uncompressed data into the library before -getting any compressed output, so this probably rules out applications -requiring interactive behaviour. - -</P> -<P> -These aren't faults of my implementation, I hope, but more -an intrinsic property of the Burrows-Wheeler transform (unfortunately). -Maybe this isn't what you want. - -</P> -<P> -If you want a compressor and/or library which is faster, uses less -memory but gets pretty good compression, and has minimal latency, -consider Jean-loup -Gailly's and Mark Adler's work, <CODE>zlib-1.1.2</CODE> and -<CODE>gzip-1.2.4</CODE>. Look for them at - -</P> -<P> -<CODE>http://www.cdrom.com/pub/infozip/zlib</CODE> and -<CODE>http://www.gzip.org</CODE> respectively. - -</P> -<P> -For something faster and lighter still, you might try Markus F X J -Oberhumer's <CODE>LZO</CODE> real-time compression/decompression library, at -<BR> <CODE>http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html</CODE>. - -</P> -<P> -If you want to use the <CODE>bzip2</CODE> algorithms to compress small blocks -of data, 64k bytes or smaller, for example on an on-the-fly disk -compressor, you'd be well advised not to use this library. Instead, -I've made a special library tuned for that kind of use. It's part of -<CODE>e2compr-0.40</CODE>, an on-the-fly disk compressor for the Linux -<CODE>ext2</CODE> filesystem. Look at -<CODE>http://www.netspace.net.au/~reiter/e2compr</CODE>. - -</P> - - - -<H2><A NAME="SEC48" HREF="manual_toc.html#TOC48">Testing</A></H2> - -<P> -A record of the tests I've done. - -</P> -<P> -First, some data sets: - -<UL> -<LI>B: a directory containing 6001 files, one for every length in the - - range 0 to 6000 bytes. The files contain random lowercase - letters. 18.7 megabytes. -<LI>H: my home directory tree. Documents, source code, mail files, - - compressed data. H contains B, and also a directory of - files designed as boundary cases for the sorting; mostly very - repetitive, nasty files. 565 megabytes. -<LI>A: directory tree holding various applications built from source: - - <CODE>egcs</CODE>, <CODE>gcc-2.8.1</CODE>, KDE, GTK, Octave, etc. - 2200 megabytes. -</UL> - -<P> -The tests conducted are as follows. Each test means compressing -(a copy of) each file in the data set, decompressing it and -comparing it against the original. - -</P> -<P> -First, a bunch of tests with block sizes and internal buffer -sizes set very small, -to detect any problems with the -blocking and buffering mechanisms. -This required modifying the source code so as to try to -break it. - -<OL> -<LI>Data set H, with - - buffer size of 1 byte, and block size of 23 bytes. -<LI>Data set B, buffer sizes 1 byte, block size 1 byte. - -<LI>As (2) but small-mode decompression. - -<LI>As (2) with block size 2 bytes. - -<LI>As (2) with block size 3 bytes. - -<LI>As (2) with block size 4 bytes. - -<LI>As (2) with block size 5 bytes. - -<LI>As (2) with block size 6 bytes and small-mode decompression. - -<LI>H with buffer size of 1 byte, but normal block - - size (up to 900000 bytes). -</OL> - -<P> -Then some tests with unmodified source code. - -<OL> -<LI>H, all settings normal. - -<LI>As (1), with small-mode decompress. - -<LI>H, compress with flag <CODE>-1</CODE>. - -<LI>H, compress with flag <CODE>-s</CODE>, decompress with flag <CODE>-s</CODE>. - -<LI>Forwards compatibility: H, <CODE>bzip2-0.1pl2</CODE> compressing, - - <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal. -<LI>Backwards compatibility: H, <CODE>bzip2-0.9.5</CODE> compressing, - - <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal. -<LI>Bigger tests: A, all settings normal. - -<LI>As (7), using the fallback (Sadakane-like) sorting algorithm. - -<LI>As (8), compress with flag <CODE>-1</CODE>, decompress with flag - - <CODE>-s</CODE>. -<LI>H, using the fallback sorting algorithm. - -<LI>Forwards compatibility: A, <CODE>bzip2-0.1pl2</CODE> compressing, - - <CODE>bzip2-0.9.5</CODE> decompressing, all settings normal. -<LI>Backwards compatibility: A, <CODE>bzip2-0.9.5</CODE> compressing, - - <CODE>bzip2-0.1pl2</CODE> decompressing, all settings normal. -<LI>Misc test: about 400 megabytes of <CODE>.tar</CODE> files with - - <CODE>bzip2</CODE> compiled with Checker (a memory access error - detector, like Purify). -<LI>Misc tests to make sure it builds and runs ok on non-Linux/x86 - - platforms. -</OL> - -<P> -These tests were conducted on a 225 MHz IDT WinChip machine, running -Linux 2.0.36. They represent nearly a week of continuous computation. -All tests completed successfully. - -</P> - - - -<H2><A NAME="SEC49" HREF="manual_toc.html#TOC49">Further reading</A></H2> -<P> -<CODE>bzip2</CODE> is not research work, in the sense that it doesn't present -any new ideas. Rather, it's an engineering exercise based on existing -ideas. - -</P> -<P> -Four documents describe essentially all the ideas behind <CODE>bzip2</CODE>: - -<PRE> -Michael Burrows and D. J. Wheeler: - "A block-sorting lossless data compression algorithm" - 10th May 1994. - Digital SRC Research Report 124. - ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz - If you have trouble finding it, try searching at the - New Zealand Digital Library, http://www.nzdl.org. - -Daniel S. Hirschberg and Debra A. LeLewer - "Efficient Decoding of Prefix Codes" - Communications of the ACM, April 1990, Vol 33, Number 4. - You might be able to get an electronic copy of this - from the ACM Digital Library. - -David J. Wheeler - Program bred3.c and accompanying document bred3.ps. - This contains the idea behind the multi-table Huffman - coding scheme. - ftp://ftp.cl.cam.ac.uk/users/djw3/ - -Jon L. Bentley and Robert Sedgewick - "Fast Algorithms for Sorting and Searching Strings" - Available from Sedgewick's web page, - www.cs.princeton.edu/~rs -</PRE> - -<P> -The following paper gives valuable additional insights into the -algorithm, but is not immediately the basis of any code -used in bzip2. - -<PRE> -Peter Fenwick: - Block Sorting Text Compression - Proceedings of the 19th Australasian Computer Science Conference, - Melbourne, Australia. Jan 31 - Feb 2, 1996. - ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps -</PRE> - -<P> -Kunihiko Sadakane's sorting algorithm, mentioned above, -is available from: - -<PRE> -http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz -</PRE> - -<P> -The Manber-Myers suffix array construction -algorithm is described in a paper -available from: - -<PRE> -http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps -</PRE> - -<P> -Finally, the following paper documents some recent investigations -I made into the performance of sorting algorithms: - -<PRE> -Julian Seward: - On the Performance of BWT Sorting Algorithms - Proceedings of the IEEE Data Compression Conference 2000 - Snowbird, Utah. 28-30 March 2000. -</PRE> - -<P><HR><P> -<p>Go to the <A HREF="manual_1.html">first</A>, <A HREF="manual_3.html">previous</A>, next, last section, <A HREF="manual_toc.html">table of contents</A>. -</BODY> -</HTML> |