This repository has been archived by the owner on Jun 6, 2019. It is now read-only.

0.1
Julian Seward committed Aug 6, 1997
0 parents commit d2eaa71
Showing 23 changed files with 6,550 additions and 0 deletions.
47 changes: 47 additions & 0 deletions ALGORITHMS
@@ -0,0 +1,47 @@

Bzip2 is not research work, in the sense that it doesn't present any
new ideas. Rather, it's an engineering exercise based on existing
ideas.

Four documents describe essentially all the ideas behind bzip2:

Michael Burrows and D. J. Wheeler:
"A block-sorting lossless data compression algorithm"
10th May 1994.
Digital SRC Research Report 124.
ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz

Daniel S. Hirschberg and Debra A. Lelewer
"Efficient Decoding of Prefix Codes"
Communications of the ACM, April 1990, Vol 33, Number 4.
You might be able to get an electronic copy of this
from the ACM Digital Library.

David J. Wheeler
Program bred3.c and accompanying document bred3.ps.
This contains the idea behind the multi-table Huffman
coding scheme.
ftp://ftp.cl.cam.ac.uk/pub/user/djw3/

Jon L. Bentley and Robert Sedgewick
"Fast Algorithms for Sorting and Searching Strings"
Available from Sedgewick's web page,
www.cs.princeton.edu/~rs

The following paper gives valuable additional insights into the
algorithm, but is not directly the basis of any code
used in bzip2.

Peter Fenwick:
Block Sorting Text Compression
Proceedings of the 19th Australasian Computer Science Conference,
Melbourne, Australia. Jan 31 - Feb 2, 1996.
ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps

All of these are well written, and make fascinating reading. If you want
to modify bzip2 in any non-trivial way, I strongly suggest you obtain,
read and understand these papers.
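
As a taste of the first reference, the block-sorting transform can be
sketched naively as below. This is an illustration only, not the code
in bzip2.c: it sorts whole rotations, which is far too slow for real
blocks, and the names bwt and inverse_bwt are invented for this example.

```python
def bwt(block):
    """Naive Burrows-Wheeler transform of a bytes object.

    Returns (last_column, primary), where primary is the row of the
    sorted rotation matrix that holds the original block.
    """
    n = len(block)
    if n == 0:
        return b"", 0
    # Sort the start positions of all rotations by rotation content.
    rows = sorted(range(n), key=lambda i: block[i:] + block[:i])
    # The last byte of the rotation starting at i is block[i - 1].
    last = bytes(block[(i - 1) % n] for i in rows)
    return last, rows.index(0)


def inverse_bwt(last, primary):
    """Rebuild the original block from the last column and primary row."""
    n = len(last)
    if n == 0:
        return b""
    # A stable sort of positions by byte value links each row of the
    # first column to the matching position in the last column.
    order = sorted(range(n), key=lambda i: last[i])
    out = bytearray()
    idx = order[primary]
    for _ in range(n):
        out.append(last[idx])
        idx = order[idx]
    return bytes(out)
```

The transform tends to group equal bytes together in the last column,
which is what the later coding stages exploit. The Bentley and
Sedgewick paper is the place to look for how to do the sorting quickly.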

I am much indebted to the various authors for their help, support and
advice.

339 changes: 339 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

30 changes: 30 additions & 0 deletions Makefile
@@ -0,0 +1,30 @@

CC = gcc
SH = /bin/sh

CFLAGS = -O3 -fomit-frame-pointer -funroll-loops -Wall -Winline -W



all:
	cat words0
	$(CC) $(CFLAGS) -o bzip2 bzip2.c
	$(CC) $(CFLAGS) -o bzip2recover bzip2recover.c
	rm -f bunzip2
	ln -s ./bzip2 ./bunzip2
	cat words1
	./bzip2 -1 < sample1.ref > sample1.rb2
	./bzip2 -2 < sample2.ref > sample2.rb2
	./bunzip2 < sample1.bz2 > sample1.tst
	./bunzip2 < sample2.bz2 > sample2.tst
	cat words2
	cmp sample1.bz2 sample1.rb2
	cmp sample2.bz2 sample2.rb2
	cmp sample1.tst sample1.ref
	cmp sample2.tst sample2.ref
	cat words3


clean:
	rm -f bzip2 bunzip2 bzip2recover sample*.tst sample*.rb2

243 changes: 243 additions & 0 deletions README
@@ -0,0 +1,243 @@

GREETINGS!

This is the README for bzip2, my block-sorting file compressor,
version 0.1.

bzip2 is distributed under the GNU General Public License version 2;
for details, see the file LICENSE. Pointers to the algorithms used
are in ALGORITHMS. Instructions for use are in bzip2.1.preformatted.

Please read this file carefully.



HOW TO BUILD

-- for UNIX:

Type `make'. (tough, huh? :-)

This creates the binaries "bzip2" and "bzip2recover", and "bunzip2",
which is a symbolic link to "bzip2".

It also runs four compress-decompress tests to make sure
things are working properly. If all goes well, you should be up &
running. Please read the output from `make' to confirm
that the tests went ok.

To install bzip2 properly:

-- Copy the binary "bzip2" to a publicly visible place,
possibly /usr/bin, /usr/common/bin or /usr/local/bin.

-- In that directory, make "bunzip2" be a symbolic link
to "bzip2".

-- Copy the manual page, bzip2.1, to the relevant place.
Probably the right place is /usr/man/man1/.

-- for Windows 95 and NT:

For a start, do you *really* want to recompile bzip2?
The standard distribution includes a pre-compiled version
for Windows 95 and NT, `bzip2.exe'.

This executable was created with Jacob Navia's excellent
port to Win32 of Chris Fraser & David Hanson's excellent
ANSI C compiler, "lcc". You can get to it at the pages
of the CS department of Princeton University,
www.cs.princeton.edu.
I have not tried to compile this version of bzip2 with
a commercial C compiler such as MS Visual C, as I don't
have one available.

Note that lcc is designed primarily to be portable and
fast. Code quality is a secondary aim, so bzip2.exe
runs perhaps 40% slower than it could if compiled with
a good optimising compiler.

I compiled a previous version of bzip (0.21) with Borland
C 5.0, which worked fine, and with MS VC++ 2.0, which
didn't. Here is a comment from the README for bzip-0.21.

MS VC++ 2.0's optimising compiler has a bug which, at
maximum optimisation, gives an executable which produces
garbage compressed files. Proceed with caution.
I do not know whether or not this happens with later
versions of VC++.

Edit the defines starting at line 86 of bzip.c to
select your platform/compiler combination, and then compile.
Then check that the resulting executable (assumed to be
called bzip.exe) works correctly, using the SELFTEST.BAT file.
Bearing in mind the previous paragraph, the self-test is
important.

Note that the defines which bzip-0.21 had, to support
compilation with VC 2.0 and BC 5.0, are gone. Windows
is not my preferred operating system, and I am, for the
moment, content with the modestly fast executable created
by lcc-win32.

A manual page is supplied, unformatted (bzip2.1),
preformatted (bzip2.1.preformatted), and preformatted
and sanitised for MS-DOS (bzip2.txt).



COMPILATION NOTES

bzip2 should work on any 32- or 64-bit machine. It is known to work
[meaning: it has compiled and passed self-tests] on the
following platform-os combinations:

Intel i386/i486 running Linux 2.0.21
Sun Sparcs (various) running SunOS 4.1.4 and Solaris 2.5
Intel i386/i486 running Windows 95 and NT
DEC Alpha running Digital Unix 4.0

Following the release of bzip-0.21, many people mailed me
from around the world to say they had made it work on all sorts
of weird and wonderful machines. Chances are, if you have
a reasonable ANSI C compiler and a 32-bit machine, you can
get it to work.

The #defines starting at around line 82 of bzip2.c supply some
degree of platform-independence. If you configure bzip2 for some
new far-out platform which is not covered by the existing definitions,
please send me the relevant definitions.

I recommend GNU C for compilation. The code is standard ANSI C,
except for the Unix-specific file handling, so any ANSI C compiler
should work. Note however that the many routines marked INLINE
should be inlined by your compiler, else performance will be very
poor. Asking your compiler to unroll loops gives some
small improvement too; for gcc, the relevant flag is
-funroll-loops.

On 386/486 machines, I'd recommend giving gcc the
-fomit-frame-pointer flag; this liberates another register for
allocation, which measurably improves performance.

I used the abovementioned lcc compiler to develop bzip2.
I would highly recommend this compiler for day-to-day development;
it is fast, reliable, lightweight, has an excellent profiler,
and is generally excellent. And it's fun to retarget, if you're
into that kind of thing.

If you compile bzip2 on a new platform or with a new compiler,
please be sure to run the four compress-decompress tests, either
using the Makefile, or with the test.bat (MSDOS) or test.cmd (OS/2)
files. Some compilers have been seen to introduce subtle bugs
when optimising, so this check is important. Ideally you should
then go on to test bzip2 on a file several megabytes or even
tens of megabytes long, just to be 110% sure. ``Professional
programmers are paranoid programmers.'' (anon).



VALIDATION

Correct operation, in the sense that a compressed file can always be
decompressed to reproduce the original, is obviously of paramount
importance. To validate bzip2, I used a modified version of
Mark Nelson's churn program. Churn is an automated test driver
which recursively traverses a directory structure, using bzip2 to
compress and then decompress each file it encounters, and checking
that the decompressed data is the same as the original. As test
material, I used several runs over several filesystems of differing
sizes.
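
The shape of such a test driver can be sketched as follows. This is
not Mark Nelson's churn program, nor the modified version used here.
It is a minimal stand-in which assumes Python's standard bz2 module
in place of the bzip2 binary, and the function name churn is reused
only for illustration.

```python
import bz2
import os


def churn(root, compresslevel=9):
    """Round-trip every regular file under root through bz2.

    Returns (files_checked, failures), where failures lists the paths
    whose decompressed data differed from the original.
    """
    checked = 0
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links and anything that is not a plain file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                original = handle.read()
            restored = bz2.decompress(bz2.compress(original, compresslevel))
            checked += 1
            if restored != original:
                failures.append(path)
    return checked, failures
```

Calling churn("/some/tree") returns the number of files checked and
the list of failures, which should be empty. The sketch reads each
file into memory whole, so it is not suited to very large files.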

One set of tests was done on my base Linux filesystem,
410 megabytes in 23,000 files. There were several runs over
this filesystem, in various configurations designed to break bzip2.
That filesystem also contained some specially constructed test
files designed to exercise boundary cases in the code.
This included files of zero length, various long, highly repetitive
files, and some files which generate blocks with all values the same.
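
A few inputs in the spirit of these boundary cases can be built and
round-tripped as below. This is a hedged sketch using Python's
standard bz2 module, not the original test files. The sizes are
arbitrary, the function names are invented, and the single-value
input is only a stand-in: whether it yields a block with all values
the same inside the compressor depends on internal stages which this
sketch does not inspect.

```python
import bz2


def boundary_cases(size=300000):
    """Build a few awkward inputs; size is an arbitrary choice."""
    return {
        "zero length": b"",
        "long and highly repetitive": b"abc" * (size // 3),
        "single byte value": b"\x00" * size,
    }


def check_boundary_cases():
    """Round-trip each case at the smallest and largest level.

    Returns a dict mapping (label, level) to True if the decompressed
    data matched the original.
    """
    results = {}
    for label, data in boundary_cases().items():
        for level in (1, 9):
            packed = bz2.compress(data, level)
            results[(label, level)] = bz2.decompress(packed) == data
    return results
```

Every value in the returned dict should be True. Any False entry
names the case and level that failed to round-trip.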

The other set of tests was done just with the "normal" configuration,
but on a much larger quantity of data.

Tests are:

Linux FS, 410M, 23000 files

As above, with --repetitive-fast

As above, with -1

Low level disk image of a disk containing
Windows NT4.0; 420M in a single huge file

Linux distribution, incl Slackware,
all GNU sources. 1900M in 2300 files.

Approx 100M compiler sources and related
programming tools, running under Purify.

About 500M of data in 120 files of around
4 M each. This is raw data from a
biomagnetometer (SQUID-based thing).

Overall, total volume of test data is about
3300 megabytes in 25000 files.

The distribution does four tests after building bzip2. These tests
include test decompressions of pre-supplied compressed files, so
they not only test that bzip2 works correctly on the machine it was
built on, but also that it can decompress files compressed on a
different machine. This guards against unforeseen interoperability
problems.


Please read and be aware of the following:

WARNING:

This program (attempts to) compress data by performing several
non-trivial transformations on it. Unless you are 100% familiar
with *all* the algorithms contained herein, and with the
consequences of modifying them, you should NOT meddle with the
compression or decompression machinery. Incorrect changes can and
very likely *will* lead to disastrous loss of data.


DISCLAIMER:

I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE
USE OF THIS PROGRAM, HOWSOEVER CAUSED.

Every compression of a file implies an assumption that the
compressed file can be decompressed to reproduce the original.
Great efforts in design, coding and testing have been made to
ensure that this program works correctly. However, the complexity
of the algorithms, and, in particular, the presence of various
special cases in the code which occur with very low but non-zero
probability make it impossible to rule out the possibility of bugs
remaining in the program. DO NOT COMPRESS ANY DATA WITH THIS
PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER
SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.

That is not to say this program is inherently unreliable. Indeed,
I very much hope the opposite is true. bzip2 has been carefully
constructed and extensively tested.

End of nasty legalities.


I hope you find bzip2 useful. Feel free to contact me at
[email protected]
if you have any suggestions or queries. Many people mailed me with
comments, suggestions and patches after the releases of 0.15 and 0.21,
and the changes in bzip2 are largely a result of this feedback.
I thank you for your comments.

Julian Seward

Manchester, UK
18 July 1996 (version 0.15)
25 August 1996 (version 0.21)

Guildford, Surrey, UK
7 August 1997 (bzip2, version 0.1)
20 changes: 20 additions & 0 deletions README.DOS
@@ -0,0 +1,20 @@

Windows 95 & Windows NT users:

1. There's a pre-built executable, bzip2.exe, which
should work. You don't need to compile anything.
You can run the `test.bat' batch file to check
the executable is working ok, if you want.

2. The control-C signal catcher seems pretty dodgy
under Windows, at least for the executable supplied.
When it catches a control-C, bzip2 tries to delete
its output file, so you don't get left with a half-
baked file. But this sometimes seems to fail
under Windows. Caveat Emptor! I think I am doing
something not-quite-right in the signal catching.
Windows-&-C gurus got any suggestions?

Control-C handling all seems to work fine under Unix.

7 Aug 97