0.1

nemequ · Aug 6, 1997 · d2eaa71
commit d2eaa71

Showing 23 changed files with 6,550 additions and 0 deletions.
diff --git a/ALGORITHMS b/ALGORITHMS
@@ -0,0 +1,47 @@
+
+Bzip2 is not research work, in the sense that it doesn't present any
+new ideas.  Rather, it's an engineering exercise based on existing
+ideas.
+
+Four documents describe essentially all the ideas behind bzip2:
+
+   Michael Burrows and D. J. Wheeler:
+     "A block-sorting lossless data compression algorithm"
+      10th May 1994. 
+      Digital SRC Research Report 124.
+      ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
+
+   Daniel S. Hirschberg and Debra A. Lelewer
+     "Efficient Decoding of Prefix Codes"
+      Communications of the ACM, April 1990, Vol 33, Number 4.
+      You might be able to get an electronic copy of this
+         from the ACM Digital Library.
+
+   David J. Wheeler
+      Program bred3.c and accompanying document bred3.ps.
+      This contains the idea behind the multi-table Huffman
+      coding scheme.
+      ftp://ftp.cl.cam.ac.uk/pub/user/djw3/
+
+   Jon L. Bentley and Robert Sedgewick
+     "Fast Algorithms for Sorting and Searching Strings"
+      Available from Sedgewick's web page,
+      www.cs.princeton.edu/~rs
+
+The following paper gives valuable additional insights into the
+algorithm, but is not immediately the basis of any code
+used in bzip2.
+
+   Peter Fenwick:
+      Block Sorting Text Compression
+      Proceedings of the 19th Australasian Computer Science Conference,
+        Melbourne, Australia.  Jan 31 - Feb 2, 1996.
+      ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
+
+All of these are well written, and make fascinating reading.  If you want
+to modify bzip2 in any non-trivial way, I strongly suggest you obtain,
+read and understand these papers.
+
+I am much indebted to the various authors for their help, support and
+advice.
+
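To give a feel for the block-sorting idea that the Burrows and Wheeler report describes, here is a minimal sketch in C. It is not the sorting code used in bzip2.c: it sorts all rotations of a block naively with qsort(), which is far too slow for real block sizes. The names bwt_naive, cmp_rotations, g_block and g_len are invented for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Context for the comparator: the block being sorted and its length.
   (Hypothetical names, used only in this sketch.) */
static const unsigned char *g_block;
static size_t g_len;

/* Compare two rotations of g_block, identified by their start offsets. */
static int cmp_rotations(const void *a, const void *b)
{
    size_t i = *(const size_t *)a;
    size_t j = *(const size_t *)b;
    size_t k;

    for (k = 0; k < g_len; k++) {
        unsigned char ci = g_block[(i + k) % g_len];
        unsigned char cj = g_block[(j + k) % g_len];
        if (ci != cj)
            return ci < cj ? -1 : 1;
    }
    return 0;
}

/* Naive Burrows-Wheeler transform of in[0..len-1].
   Writes the last column of the sorted rotation matrix to out (len bytes)
   and returns the row at which the unrotated block appears, or
   (size_t)-1 if memory could not be allocated. */
size_t bwt_naive(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t *rot;
    size_t i, primary = 0;

    assert(in != NULL && out != NULL && len > 0);

    rot = malloc(len * sizeof *rot);
    if (rot == NULL)
        return (size_t)-1;

    for (i = 0; i < len; i++)
        rot[i] = i;

    g_block = in;
    g_len = len;
    qsort(rot, len, sizeof *rot, cmp_rotations);

    for (i = 0; i < len; i++) {
        if (rot[i] == 0)
            primary = i;
        /* The last character of a rotation precedes its start offset. */
        out[i] = in[(rot[i] + len - 1) % len];
    }

    free(rot);
    return primary;
}
```

For the six-byte block "banana" this sketch groups equal characters together in the output, which is what makes the later coding stages effective. A decoder needs both the output bytes and the returned row index to recover the original block.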
diff --git a/LICENSE b/LICENSE
diff --git a/Makefile b/Makefile
@@ -0,0 +1,30 @@
+
+CC = gcc
+SH = /bin/sh
+
+CFLAGS = -O3 -fomit-frame-pointer -funroll-loops -Wall -Winline -W
+
+
+
+all:
+	cat words0
+	$(CC) $(CFLAGS) -o bzip2 bzip2.c
+	$(CC) $(CFLAGS) -o bzip2recover bzip2recover.c
+	rm -f bunzip2
+	ln -s ./bzip2 ./bunzip2
+	cat words1
+	./bzip2 -1 < sample1.ref > sample1.rb2
+	./bzip2 -2 < sample2.ref > sample2.rb2
+	./bunzip2 < sample1.bz2 > sample1.tst
+	./bunzip2 < sample2.bz2 > sample2.tst
+	cat words2
+	cmp sample1.bz2 sample1.rb2 
+	cmp sample2.bz2 sample2.rb2
+	cmp sample1.tst sample1.ref
+	cmp sample2.tst sample2.ref
+	cat words3
+
+
+clean:
+	rm -f bzip2 bunzip2 bzip2recover sample*.tst sample*.rb2
+
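The CFLAGS above pass -Winline, which asks gcc to warn when a function declared inline cannot be inlined, and the README notes that routines marked INLINE need to be inlined for good performance. As a rough sketch of how such a marker could be defined portably in C, one might write something like the following. This is an assumption for illustration, not the set of #defines in bzip2.c; the helper low_bits is hypothetical and assumes unsigned int is at least 32 bits wide.

```c
#include <assert.h>

/* Hypothetical portability macro: pick an inlining keyword the
   compiler understands, falling back to a plain static function. */
#if defined(__GNUC__)
#  define INLINE static __inline__
#elif defined(_MSC_VER)
#  define INLINE static __inline
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  define INLINE static inline
#else
#  define INLINE static
#endif

/* A small hot-path helper of the kind worth inlining:
   keep only the low nbits bits of value (nbits must be below 32). */
INLINE unsigned int low_bits(unsigned int value, unsigned int nbits)
{
    assert(nbits < 32u);
    return value & ((1u << nbits) - 1u);
}
```

With the fallback branch the code still compiles on a compiler with no inline support; it simply loses the speed benefit.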
diff --git a/README b/README
@@ -0,0 +1,243 @@
+
+GREETINGS!
+
+   This is the README for bzip2, my block-sorting file compressor,
+   version 0.1.  
+
+   bzip2 is distributed under the GNU General Public License version 2;
+   for details, see the file LICENSE.  Pointers to the algorithms used
+   are in ALGORITHMS.  Instructions for use are in bzip2.1.preformatted.
+
+   Please read this file carefully.
+
+
+
+HOW TO BUILD
+
+   -- for UNIX:
+
+        Type `make'.     (tough, huh? :-)
+
+        This creates the binaries "bzip2" and "bzip2recover", plus
+        "bunzip2", which is a symbolic link to "bzip2".
+
+        It also runs four compress-decompress tests to make sure
+        things are working properly.  If all goes well, you should be up &
+        running.  Please read the output from `make' to confirm
+        that the tests went ok.
+
+        To install bzip2 properly:
+
+           -- Copy the binary "bzip2" to a publicly visible place,
+              possibly /usr/bin, /usr/common/bin or /usr/local/bin.
+
+           -- In that directory, make "bunzip2" be a symbolic link
+              to "bzip2".
+
+           -- Copy the manual page, bzip2.1, to the relevant place.
+              Probably the right place is /usr/man/man1/.
+
+   -- for Windows 95 and NT: 
+
+        For a start, do you *really* want to recompile bzip2?  
+        The standard distribution includes a pre-compiled version
+        for Windows 95 and NT, `bzip2.exe'.
+
+        This executable was created with Jacob Navia's excellent
+        port to Win32 of Chris Fraser & David Hanson's excellent
+        ANSI C compiler, "lcc".  You can get to it at the pages
+        of the CS department of Princeton University, 
+        www.cs.princeton.edu.  
+        I have not tried to compile this version of bzip2 with
+        a commercial C compiler such as MS Visual C, as I don't
+        have one available.
+
+        Note that lcc is designed primarily to be portable and
+        fast.  Code quality is a secondary aim, so bzip2.exe
+        runs perhaps 40% slower than it could if compiled with
+        a good optimising compiler.
+
+        I compiled a previous version of bzip (0.21) with Borland
+        C 5.0, which worked fine, and with MS VC++ 2.0, which
+        didn't.  Here is a comment from the README for bzip-0.21:
+
+           MS VC++ 2.0's optimising compiler has a bug which, at 
+           maximum optimisation, gives an executable which produces 
+           garbage compressed files.  Proceed with caution. 
+           I do not know whether or not this happens with later 
+           versions of VC++.
+
+           Edit the defines starting at line 86 of bzip.c to 
+           select your platform/compiler combination, and then compile.
+           Then check that the resulting executable (assumed to be 
+           called bzip.exe) works correctly, using the SELFTEST.BAT file.  
+           Bearing in mind the previous paragraph, the self-test is
+           important.
+
+        Note that the defines which bzip-0.21 had, to support 
+        compilation with VC 2.0 and BC 5.0, are gone.  Windows
+        is not my preferred operating system, and I am, for the
+        moment, content with the modestly fast executable created
+        by lcc-win32.
+
+   A manual page is supplied, unformatted (bzip2.1),
+   preformatted (bzip2.1.preformatted), and preformatted
+   and sanitised for MS-DOS (bzip2.txt).
+
+
+
+COMPILATION NOTES
+
+   bzip2 should work on any 32 or 64-bit machine.  It is known to work
+   [meaning: it has compiled and passed self-tests] on the 
+   following platform-os combinations:
+
+      Intel i386/i486        running Linux 2.0.21
+      Sun Sparcs (various)   running SunOS 4.1.4 and Solaris 2.5
+      Intel i386/i486        running Windows 95 and NT
+      DEC Alpha              running Digital Unix 4.0
+
+   Following the release of bzip-0.21, many people mailed me
+   from around the world to say they had made it work on all sorts
+   of weird and wonderful machines.  Chances are, if you have
+   a reasonable ANSI C compiler and a 32-bit machine, you can
+   get it to work.
+
+   The #defines starting at around line 82 of bzip2.c supply some
+   degree of platform-independence.  If you configure bzip2 for some
+   new far-out platform which is not covered by the existing definitions,
+   please send me the relevant definitions.
+
+   I recommend GNU C for compilation.  The code is standard ANSI C,
+   except for the Unix-specific file handling, so any ANSI C compiler
+   should work.  Note however that the many routines marked INLINE
+   should be inlined by your compiler, else performance will be very
+   poor.  Asking your compiler to unroll loops gives some
+   small improvement too; for gcc, the relevant flag is
+   -funroll-loops.
+
+   On 386/486 machines, I'd recommend giving gcc the
+   -fomit-frame-pointer flag; this liberates another register for
+   allocation, which measurably improves performance.
+
+   I used the abovementioned lcc compiler to develop bzip2.
+   I would highly recommend this compiler for day-to-day development;
+   it is fast, reliable, lightweight, has an excellent profiler,
+   and is generally excellent.  And it's fun to retarget, if you're
+   into that kind of thing.
+
+   If you compile bzip2 on a new platform or with a new compiler,
+   please be sure to run the four compress-decompress tests, either
+   using the Makefile, or with the test.bat (MSDOS) or test.cmd (OS/2)
+   files.  Some compilers have been seen to introduce subtle bugs
+   when optimising, so this check is important.  Ideally you should
+   then go on to test bzip2 on a file several megabytes or even
+   tens of megabytes long, just to be 110% sure.  ``Professional
+   programmers are paranoid programmers.'' (anon).
+
+
+
+VALIDATION
+
+   Correct operation, in the sense that a compressed file can always be
+   decompressed to reproduce the original, is obviously of paramount
+   importance.  To validate bzip2, I used a modified version of 
+   Mark Nelson's churn program.  Churn is an automated test driver
+   which recursively traverses a directory structure, using bzip2 to
+   compress and then decompress each file it encounters, and checking
+   that the decompressed data is the same as the original.  As test 
+   material, I used several runs over several filesystems of differing
+   sizes.
+
+   One set of tests was done on my base Linux filesystem,
+   410 megabytes in 23,000 files.  There were several runs over
+   this filesystem, in various configurations designed to break bzip2.
+   That filesystem also contained some specially constructed test
+   files designed to exercise boundary cases in the code.
+   This included files of zero length, various long, highly repetitive 
+   files, and some files which generate blocks with all values the same.
+
+   The other set of tests was done just with the "normal" configuration,
+   but on a much larger quantity of data.
+
+      Tests are:
+
+         Linux FS, 410M, 23000 files
+
+         As above, with --repetitive-fast
+
+         As above, with -1
+
+         Low level disk image of a disk containing
+            Windows NT4.0; 420M in a single huge file
+
+         Linux distribution, incl Slackware, 
+            all GNU sources.   1900M in 2300 files.
+
+         Approx 100M compiler sources and related
+            programming tools, running under Purify.
+
+         About 500M of data in 120 files of around
+            4 M each.  This is raw data from a 
+            biomagnetometer (SQUID-based thing).
+
+      Overall, total volume of test data is about
+         3300 megabytes in 25000 files.
+
+   The distribution does four tests after building bzip2.  These tests
+   include test decompressions of pre-supplied compressed files, so
+   they test not only that bzip2 works correctly on the machine it was
+   built on, but also that it can decompress files compressed on a
+   different machine.  This guards against unforeseen interoperability
+   problems.
+
+
+Please read and be aware of the following:
+
+WARNING:
+
+   This program (attempts to) compress data by performing several
+   non-trivial transformations on it.  Unless you are 100% familiar
+   with *all* the algorithms contained herein, and with the
+   consequences of modifying them, you should NOT meddle with the
+   compression or decompression machinery.  Incorrect changes can and
+   very likely *will* lead to disastrous loss of data.
+
+
+DISCLAIMER:
+
+   I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE
+   USE OF THIS PROGRAM, HOWSOEVER CAUSED.
+
+   Every compression of a file implies an assumption that the
+   compressed file can be decompressed to reproduce the original.
+   Great efforts in design, coding and testing have been made to
+   ensure that this program works correctly.  However, the complexity
+   of the algorithms, and, in particular, the presence of various
+   special cases in the code which occur with very low but non-zero
+   probability make it impossible to rule out the possibility of bugs
+   remaining in the program.  DO NOT COMPRESS ANY DATA WITH THIS
+   PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER
+   SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.
+
+   That is not to say this program is inherently unreliable.  Indeed,
+   I very much hope the opposite is true.  bzip2 has been carefully
+   constructed and extensively tested.
+
+End of nasty legalities.
+
+
+I hope you find bzip2 useful.  Feel free to contact me at
+   [email protected]
+if you have any suggestions or queries.  Many people mailed me with
+comments, suggestions and patches after the releases of 0.15 and 0.21, 
+and the changes in bzip2 are largely a result of this feedback.
+I thank you for your comments.
+
+Julian Seward
+
+Manchester, UK
+18 July 1996 (version 0.15)
+25 August 1996 (version 0.21)
+
+Guildford, Surrey, UK
+7 August 1997 (bzip2, version 0.1)
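The VALIDATION section of the README above relies on one basic check: after compressing and decompressing a file, the result must be byte-for-byte identical to the original. A minimal sketch of that comparison step in C might look like the following. It is not code from churn or from bzip2; the function name files_identical and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Compare two files byte by byte.
   Returns 1 if they are identical, 0 if they differ,
   and -1 if either file cannot be opened or read. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *fa, *fb;
    int ca, cb, err;

    assert(path_a != NULL && path_b != NULL);

    fa = fopen(path_a, "rb");
    if (fa == NULL)
        return -1;
    fb = fopen(path_b, "rb");
    if (fb == NULL) {
        fclose(fa);
        return -1;
    }

    /* Read both files in step until they differ or one ends. */
    do {
        ca = getc(fa);
        cb = getc(fb);
    } while (ca == cb && ca != EOF);

    /* EOF can also mean a read error, so check before trusting it. */
    err = ferror(fa) || ferror(fb);
    fclose(fa);
    fclose(fb);

    if (err)
        return -1;
    return ca == cb;
}
```

This plays the same role as the `cmp sample1.tst sample1.ref` steps in the Makefile: a test driver would call it on the original file and on the output of a compress-then-decompress run, and treat any result other than 1 as a failure.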
diff --git a/README.DOS b/README.DOS
@@ -0,0 +1,20 @@
+
+Windows 95 & Windows NT users:
+
+1.  There's a pre-built executable, bzip2.exe, which
+    should work.  You don't need to compile anything.
+    You can run the `test.bat' batch file to check
+    the executable is working ok, if you want.
+
+2.  The control-C signal catcher seems pretty dodgy
+    under Windows, at least for the executable supplied.
+    When it catches a control-C, bzip2 tries to delete
+    its output file, so you don't get left with a half-
+    baked file.  But this sometimes seems to fail
+    under Windows.  Caveat Emptor!  I think I am doing
+    something not-quite-right in the signal catching.
+    Windows-&-C gurus got any suggestions?
+
+    Control-C handling all seems to work fine under Unix.
+
+7 Aug 97
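Item 2 above describes catching control-C and deleting the half-written output file. One conservative way to structure this in standard C is sketched below: the handler only sets a flag, and the main code closes and removes the output once it notices the flag. This is not the handler in bzip2.c, and the names interrupted, install_sigint_flag and cleanup_if_interrupted are invented for this example. Closing the stream before calling remove() is a guess at one possible cause of the Windows failures, since removing a file that is still open commonly fails there; it is not a confirmed diagnosis.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Set by the handler; sig_atomic_t is the only object type the
   C standard lets a signal handler write to safely. */
static volatile sig_atomic_t interrupted = 0;

/* Handler: do nothing except record that control-C arrived. */
static void on_sigint(int signo)
{
    (void)signo;
    interrupted = 1;
}

/* Install the handler.  Returns 0 on success, -1 on failure. */
int install_sigint_flag(void)
{
    return signal(SIGINT, on_sigint) == SIG_ERR ? -1 : 0;
}

/* Call this between chunks of work.  If an interrupt was seen,
   close the output stream (if any), delete the partial file and
   return 1; otherwise leave everything alone and return 0. */
int cleanup_if_interrupted(FILE *out, const char *out_path)
{
    assert(out_path != NULL);

    if (!interrupted)
        return 0;
    if (out != NULL)
        fclose(out);
    remove(out_path);
    return 1;
}
```

The trade-off is latency: the file is only removed when the main loop next checks the flag, rather than at the instant the signal arrives.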