diff --git a/ALGORITHMS b/ALGORITHMS new file mode 100644 index 0000000..7c7d2ca --- /dev/null +++ b/ALGORITHMS @@ -0,0 +1,47 @@ + +Bzip2 is not research work, in the sense that it doesn't present any +new ideas. Rather, it's an engineering exercise based on existing +ideas. + +Four documents describe essentially all the ideas behind bzip2: + + Michael Burrows and D. J. Wheeler: + "A block-sorting lossless data compression algorithm" + 10th May 1994. + Digital SRC Research Report 124. + ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz + + Daniel S. Hirschberg and Debra A. LeLewer + "Efficient Decoding of Prefix Codes" + Communications of the ACM, April 1990, Vol 33, Number 4. + You might be able to get an electronic copy of this + from the ACM Digital Library. + + David J. Wheeler + Program bred3.c and accompanying document bred3.ps. + This contains the idea behind the multi-table Huffman + coding scheme. + ftp://ftp.cl.cam.ac.uk/pub/user/djw3/ + + Jon L. Bentley and Robert Sedgewick + "Fast Algorithms for Sorting and Searching Strings" + Available from Sedgewick's web page, + www.cs.princeton.edu/~rs + +The following paper gives valuable additional insights into the +algorithm, but is not immediately the basis of any code +used in bzip2. + + Peter Fenwick: + Block Sorting Text Compression + Proceedings of the 19th Australasian Computer Science Conference, + Melbourne, Australia. Jan 31 - Feb 2, 1996. + ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps + +All three are well written, and make fascinating reading. If you want +to modify bzip2 in any non-trivial way, I strongly suggest you obtain, +read and understand these papers. + +I am much indebted to the various authors for their help, support and +advice. + diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..a43ea21 --- /dev/null +++ b/LICENSE @@ -0,0 +1,339 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 675 Mass Ave, Cambridge, MA 02139, USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + Appendix: How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) 19yy + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) 19yy name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + , 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..d124743 --- /dev/null +++ b/Makefile @@ -0,0 +1,30 @@ + +CC = gcc +SH = /bin/sh + +CFLAGS = -O3 -fomit-frame-pointer -funroll-loops -Wall -Winline -W + + + +all: + cat words0 + $(CC) $(CFLAGS) -o bzip2 bzip2.c + $(CC) $(CFLAGS) -o bzip2recover bzip2recover.c + rm -f bunzip2 + ln -s ./bzip2 ./bunzip2 + cat words1 + ./bzip2 -1 < sample1.ref > sample1.rb2 + ./bzip2 -2 < sample2.ref > sample2.rb2 + ./bunzip2 < sample1.bz2 > sample1.tst + ./bunzip2 < sample2.bz2 > sample2.tst + cat words2 + cmp sample1.bz2 sample1.rb2 + cmp sample2.bz2 sample2.rb2 + cmp sample1.tst sample1.ref + cmp sample2.tst sample2.ref + cat words3 + + +clean: + rm -f bzip2 bunzip2 bzip2recover sample*.tst sample*.rb2 + diff --git a/README b/README new file mode 100644 index 0000000..d77830f --- /dev/null +++ b/README @@ -0,0 +1,243 @@ + +GREETINGS! + + This is the README for bzip2, my block-sorting file compressor, + version 0.1. + + bzip2 is distributed under the GNU General Public License version 2; + for details, see the file LICENSE. Pointers to the algorithms used + are in ALGORITHMS. Instructions for use are in bzip2.1.preformatted. + + Please read this file carefully. + + + +HOW TO BUILD + + -- for UNIX: + + Type `make'. (tough, huh? :-) + + This creates binaries "bzip2", and "bunzip2", + which is a symbolic link to "bzip2". + + It also runs four compress-decompress tests to make sure + things are working properly. If all goes well, you should be up & + running. Please be sure to read the output from `make' + just to be sure that the tests went ok. + + To install bzip2 properly: + + -- Copy the binary "bzip2" to a publically visible place, + possibly /usr/bin, /usr/common/bin or /usr/local/bin. + + -- In that directory, make "bunzip2" be a symbolic link + to "bzip2". + + -- Copy the manual page, bzip2.1, to the relevant place. + Probably the right place is /usr/man/man1/. + + -- for Windows 95 and NT: + + For a start, do you *really* want to recompile bzip2? + The standard distribution includes a pre-compiled version + for Windows 95 and NT, `bzip2.exe'. + + This executable was created with Jacob Navia's excellent + port to Win32 of Chris Fraser & David Hanson's excellent + ANSI C compiler, "lcc". You can get to it at the pages + of the CS department of Princeton University, + www.cs.princeton.edu. + I have not tried to compile this version of bzip2 with + a commercial C compiler such as MS Visual C, as I don't + have one available. + + Note that lcc is designed primarily to be portable and + fast. Code quality is a secondary aim, so bzip2.exe + runs perhaps 40% slower than it could if compiled with + a good optimising compiler. + + I compiled a previous version of bzip (0.21) with Borland + C 5.0, which worked fine, and with MS VC++ 2.0, which + didn't. Here is an comment from the README for bzip-0.21. + + MS VC++ 2.0's optimising compiler has a bug which, at + maximum optimisation, gives an executable which produces + garbage compressed files. Proceed with caution. + I do not know whether or not this happens with later + versions of VC++. + + Edit the defines starting at line 86 of bzip.c to + select your platform/compiler combination, and then compile. + Then check that the resulting executable (assumed to be + called bzip.exe) works correctly, using the SELFTEST.BAT file. + Bearing in mind the previous paragraph, the self-test is + important. + + Note that the defines which bzip-0.21 had, to support + compilation with VC 2.0 and BC 5.0, are gone. Windows + is not my preferred operating system, and I am, for the + moment, content with the modestly fast executable created + by lcc-win32. + + A manual page is supplied, unformatted (bzip2.1), + preformatted (bzip2.1.preformatted), and preformatted + and sanitised for MS-DOS (bzip2.txt). + + + +COMPILATION NOTES + + bzip2 should work on any 32 or 64-bit machine. It is known to work + [meaning: it has compiled and passed self-tests] on the + following platform-os combinations: + + Intel i386/i486 running Linux 2.0.21 + Sun Sparcs (various) running SunOS 4.1.4 and Solaris 2.5 + Intel i386/i486 running Windows 95 and NT + DEC Alpha running Digital Unix 4.0 + + Following the release of bzip-0.21, many people mailed me + from around the world to say they had made it work on all sorts + of weird and wonderful machines. Chances are, if you have + a reasonable ANSI C compiler and a 32-bit machine, you can + get it to work. + + The #defines starting at around line 82 of bzip2.c supply some + degree of platform-independance. If you configure bzip2 for some + new far-out platform which is not covered by the existing definitions, + please send me the relevant definitions. + + I recommend GNU C for compilation. The code is standard ANSI C, + except for the Unix-specific file handling, so any ANSI C compiler + should work. Note however that the many routines marked INLINE + should be inlined by your compiler, else performance will be very + poor. Asking your compiler to unroll loops gives some + small improvement too; for gcc, the relevant flag is + -funroll-loops. + + On a 386/486 machines, I'd recommend giving gcc the + -fomit-frame-pointer flag; this liberates another register for + allocation, which measurably improves performance. + + I used the abovementioned lcc compiler to develop bzip2. + I would highly recommend this compiler for day-to-day development; + it is fast, reliable, lightweight, has an excellent profiler, + and is generally excellent. And it's fun to retarget, if you're + into that kind of thing. + + If you compile bzip2 on a new platform or with a new compiler, + please be sure to run the four compress-decompress tests, either + using the Makefile, or with the test.bat (MSDOS) or test.cmd (OS/2) + files. Some compilers have been seen to introduce subtle bugs + when optimising, so this check is important. Ideally you should + then go on to test bzip2 on a file several megabytes or even + tens of megabytes long, just to be 110% sure. ``Professional + programmers are paranoid programmers.'' (anon). + + + +VALIDATION + + Correct operation, in the sense that a compressed file can always be + decompressed to reproduce the original, is obviously of paramount + importance. To validate bzip2, I used a modified version of + Mark Nelson's churn program. Churn is an automated test driver + which recursively traverses a directory structure, using bzip2 to + compress and then decompress each file it encounters, and checking + that the decompressed data is the same as the original. As test + material, I used several runs over several filesystems of differing + sizes. + + One set of tests was done on my base Linux filesystem, + 410 megabytes in 23,000 files. There were several runs over + this filesystem, in various configurations designed to break bzip2. + That filesystem also contained some specially constructed test + files designed to exercise boundary cases in the code. + This included files of zero length, various long, highly repetitive + files, and some files which generate blocks with all values the same. + + The other set of tests was done just with the "normal" configuration, + but on a much larger quantity of data. + + Tests are: + + Linux FS, 410M, 23000 files + + As above, with --repetitive-fast + + As above, with -1 + + Low level disk image of a disk containing + Windows NT4.0; 420M in a single huge file + + Linux distribution, incl Slackware, + all GNU sources. 1900M in 2300 files. + + Approx ~100M compiler sources and related + programming tools, running under Purify. + + About 500M of data in 120 files of around + 4 M each. This is raw data from a + biomagnetometer (SQUID-based thing). + + Overall, total volume of test data is about + 3300 megabytes in 25000 files. + + The distribution does four tests after building bzip. These tests + include test decompressions of pre-supplied compressed files, so + they not only test that bzip works correctly on the machine it was + built on, but can also decompress files compressed on a different + machine. This guards against unforseen interoperability problems. + + +Please read and be aware of the following: + +WARNING: + + This program (attempts to) compress data by performing several + non-trivial transformations on it. Unless you are 100% familiar + with *all* the algorithms contained herein, and with the + consequences of modifying them, you should NOT meddle with the + compression or decompression machinery. Incorrect changes can and + very likely *will* lead to disastrous loss of data. + + +DISCLAIMER: + + I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE + USE OF THIS PROGRAM, HOWSOEVER CAUSED. + + Every compression of a file implies an assumption that the + compressed file can be decompressed to reproduce the original. + Great efforts in design, coding and testing have been made to + ensure that this program works correctly. However, the complexity + of the algorithms, and, in particular, the presence of various + special cases in the code which occur with very low but non-zero + probability make it impossible to rule out the possibility of bugs + remaining in the program. DO NOT COMPRESS ANY DATA WITH THIS + PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER + SMALL, THAT THE DATA WILL NOT BE RECOVERABLE. + + That is not to say this program is inherently unreliable. Indeed, + I very much hope the opposite is true. bzip2 has been carefully + constructed and extensively tested. + +End of nasty legalities. + + +I hope you find bzip2 useful. Feel free to contact me at + jseward@acm.org +if you have any suggestions or queries. Many people mailed me with +comments, suggestions and patches after the releases of 0.15 and 0.21, +and the changes in bzip2 are largely a result of this feedback. +I thank you for your comments. + +Julian Seward + +Manchester, UK +18 July 1996 (version 0.15) +25 August 1996 (version 0.21) + +Guildford, Surrey, UK +7 August 1997 (bzip2, version 0.0) \ No newline at end of file diff --git a/README.DOS b/README.DOS new file mode 100644 index 0000000..d522b81 --- /dev/null +++ b/README.DOS @@ -0,0 +1,20 @@ + +Windows 95 & Windows NT users: + +1. There's a pre-built executable, bzip2.exe, which + should work. You don't need to compile anything. + You can run the `test.bat' batch file to check + the executable is working ok, if you want. + +2. The control-C signal catcher seems pretty dodgy + under Windows, at least for the executable supplied. + When it catches a control-C, bzip2 tries to delete + its output file, so you don't get left with a half- + baked file. But this sometimes seems to fail + under Windows. Caveat Emptor! I think I am doing + something not-quite-right in the signal catching. + Windows-&-C gurus got any suggestions? + + Control-C handling all seems to work fine under Unix. + +7 Aug 97 diff --git a/bzip2.1 b/bzip2.1 new file mode 100644 index 0000000..9094c7c --- /dev/null +++ b/bzip2.1 @@ -0,0 +1,441 @@ +.PU +.TH bzip2 1 +.SH NAME +bzip2, bunzip2 \- a block-sorting file compressor, v0.1 +.br +bzip2recover \- recovers data from damaged bzip2 files + +.SH SYNOPSIS +.ll +8 +.B bzip2 +.RB [ " \-cdfkstvVL123456789 " ] +[ +.I "filenames \&..." +] +.ll -8 +.br +.B bunzip2 +.RB [ " \-kvsVL " ] +[ +.I "filenames \&..." +] +.br +.B bzip2recover +.I "filename" + +.SH DESCRIPTION +.I Bzip2 +compresses files using the Burrows-Wheeler block-sorting +text compression algorithm, and Huffman coding. +Compression is generally considerably +better than that +achieved by more conventional LZ77/LZ78-based compressors, +and approaches the performance of the PPM family of statistical +compressors. + +The command-line options are deliberately very similar to +those of +.I GNU Gzip, +but they are not identical. + +.I Bzip2 +expects a list of file names to accompany the command-line flags. +Each file is replaced by a compressed version of itself, +with the name "original_name.bz2". +Each compressed file has the same modification date and permissions +as the corresponding original, so that these properties can be +correctly restored at decompression time. File name handling is +naive in the sense that there is no mechanism for preserving +original file names, permissions and dates in filesystems +which lack these concepts, or have serious file name length +restrictions, such as MS-DOS. + +.I Bzip2 +and +.I bunzip2 +will not overwrite existing files; if you want this to happen, +you should delete them first. + +If no file names are specified, +.I bzip2 +compresses from standard input to standard output. +In this case, +.I bzip2 +will decline to write compressed output to a terminal, as +this would be entirely incomprehensible and therefore pointless. + +.I Bunzip2 +(or +.I bzip2 \-d +) decompresses and restores all specified files whose names +end in ".bz2". +Files without this suffix are ignored. +Again, supplying no filenames +causes decompression from standard input to standard output. + +You can also compress or decompress files to +the standard output by giving the \-c flag. +You can decompress multiple files like this, but you may +only compress a single file this way, since it would otherwise +be difficult to separate out the compressed representations of +the original files. + +Compression is always performed, even if the compressed file is +slightly larger than the original. Files of less than about +one hundred bytes tend to get larger, since the compression +mechanism has a constant overhead in the region of 50 bytes. +Random data (including the output of most file compressors) +is coded at about 8.05 bits per byte, giving an expansion of +around 0.5%. + +As a self-check for your protection, +.I bzip2 +uses 32-bit CRCs to make sure that the decompressed +version of a file is identical to the original. +This guards against corruption of the compressed data, +and against undetected bugs in +.I bzip2 +(hopefully very unlikely). +The chances of data corruption going undetected is +microscopic, about one chance in four billion +for each file processed. Be aware, though, that the check +occurs upon decompression, so it can only tell you that +that something is wrong. It can't help you recover the +original uncompressed data. +You can use +.I bzip2recover +to try to recover data from damaged files. + +Return values: +0 for a normal exit, +1 for environmental +problems (file not found, invalid flags, I/O errors, &c), +2 to indicate a corrupt compressed file, +3 for an internal consistency error (eg, bug) which caused +.I bzip2 +to panic. + +.SH MEMORY MANAGEMENT +.I Bzip2 +compresses large files in blocks. The block size affects both the +compression ratio achieved, and the amount of memory needed both for +compression and decompression. The flags \-1 through \-9 +specify the block size to be 100,000 bytes through 900,000 bytes +(the default) respectively. At decompression-time, the block size used for +compression is read from the header of the compressed file, and +.I bunzip2 +then allocates itself just enough memory to decompress the file. +Since block sizes are stored in compressed files, it follows that the flags +\-1 to \-9 +are irrelevant to and so ignored during decompression. +Compression and decompression requirements, in bytes, can be estimated as: + + Compression: 400k + ( 7 x block size ) + + Decompression: 100k + ( 5 x block size ), or +.br + 100k + ( 2.5 x block size ) + +Larger block sizes give rapidly diminishing marginal returns; most +of the +compression comes from the first two or three hundred k of block size, +a fact worth bearing in mind when using +.I bzip2 +on small machines. It is also important to appreciate that the +decompression memory requirement is set at compression-time by the +choice of block size. + +For files compressed with the default 900k block size, +.I bunzip2 +will require about 4600 kbytes to decompress. +To support decompression of any file on a 4 megabyte machine, +.I bunzip2 +has an option to decompress using approximately half this +amount of memory, about 2300 kbytes. Decompression speed is +also halved, so you should use this option only where necessary. +The relevant flag is \-s. + +In general, try and use the largest block size +memory constraints allow, since that maximises the compression +achieved. Compression and decompression +speed are virtually unaffected by block size. + +Another significant point applies to files which fit in a single +block -- that means most files you'd encounter using a large +block size. The amount of real memory touched is proportional +to the size of the file, since the file is smaller than a block. +For example, compressing a file 20,000 bytes long with the flag +\-9 +will cause the compressor to allocate around +6700k of memory, but only touch 400k + 20000 * 7 = 540 +kbytes of it. Similarly, the decompressor will allocate 4600k but +only touch 100k + 20000 * 5 = 200 kbytes. + +Here is a table which summarises the maximum memory usage for +different block sizes. Also recorded is the total compressed +size for 14 files of the Calgary Text Compression Corpus +totalling 3,141,622 bytes. This column gives some feel for how +compression varies with block size. These figures tend to understate +the advantage of larger block sizes for larger files, since the +Corpus is dominated by smaller files. + + Compress Decompress Decompress Corpus + Flag usage usage -s usage Size + + -1 1100k 600k 350k 914704 + -2 1800k 1100k 600k 877703 + -3 2500k 1600k 850k 860338 + -4 3200k 2100k 1100k 846899 + -5 3900k 2600k 1350k 845160 + -6 4600k 3100k 1600k 838626 + -7 5400k 3600k 1850k 834096 + -8 6000k 4100k 2100k 828642 + -9 6700k 4600k 2350k 828642 + +.SH OPTIONS +.TP +.B \-c --stdout +Compress or decompress to standard output. \-c will decompress +multiple files to stdout, but will only compress a single file to +stdout. +.TP +.B \-d --decompress +Force decompression. +.I Bzip2 +and +.I bunzip2 +are really the same program, and the decision about whether to +compress or decompress is done on the basis of which name is +used. This flag overrides that mechanism, and forces +.I bzip2 +to decompress. +.TP +.B \-f --compress +The complement to \-d: forces compression, regardless of the invokation +name. +.TP +.B \-t --test +Check integrity of the specified file(s), but don't decompress them. +This really performs a trial decompression and throws away the result, +using the low-memory decompression algorithm (see \-s). +.TP +.B \-k --keep +Keep (don't delete) input files during compression or decompression. +.TP +.B \-s --small +Reduce memory usage, both for compression and decompression. +Files are decompressed using a modified algorithm which only +requires 2.5 bytes per block byte. This means any file can be +decompressed in 2300k of memory, albeit somewhat more slowly than +usual. + +During compression, -s selects a block size of 200k, which limits +memory use to around the same figure, at the expense of your +compression ratio. In short, if your machine is low on memory +(8 megabytes or less), use -s for everything. See +MEMORY MANAGEMENT above. + +.TP +.B \-v --verbose +Verbose mode -- show the compression ratio for each file processed. +Further \-v's increase the verbosity level, spewing out lots of +information which is primarily of interest for diagnostic purposes. +.TP +.B \-L --license +Display the software version, license terms and conditions. +.TP +.B \-V --version +Same as \-L. +.TP +.B \-1 to \-9 +Set the block size to 100 k, 200 k .. 900 k when +compressing. Has no effect when decompressing. +See MEMORY MANAGEMENT above. +.TP +.B \--repetitive-fast +.I bzip2 +injects some small pseudo-random variations +into very repetitive blocks to limit +worst-case performance during compression. +If sorting runs into difficulties, the block +is randomised, and sorting is restarted. +Very roughly, +.I bzip2 +persists for three times as long as a well-behaved input +would take before resorting to randomisation. +This flag makes it give up much sooner. + +.TP +.B \--repetitive-best +Opposite of \--repetitive-fast; try a lot harder before +resorting to randomisation. + +.SH RECOVERING DATA FROM DAMAGED FILES +.I bzip2 +compresses files in blocks, usually 900kbytes long. +Each block is handled independently. If a media or +transmission error causes a multi-block .bz2 +file to become damaged, +it may be possible to recover data from the undamaged blocks +in the file. + +The compressed representation of each block is delimited by +a 48-bit pattern, which makes it possible to find the block +boundaries with reasonable certainty. Each block also carries +its own 32-bit CRC, so damaged blocks can be +distinguished from undamaged ones. + +.I bzip2recover +is a simple program whose purpose is to search for +blocks in .bz2 files, and write each block out into +its own .bz2 file. You can then use +.I bzip2 -t +to test the integrity of the resulting files, +and decompress those which are undamaged. + +.I bzip2recover +takes a single argument, the name of the damaged file, +and writes a number of files "rec0001file.bz2", "rec0002file.bz2", +etc, containing the extracted blocks. The output filenames +are designed so that the use of wildcards in subsequent processing +-- for example, "bzip2 -dc rec*file.bz2 > recovered_data" -- +lists the files in the "right" order. + +.I bzip2recover +should be of most use dealing with large .bz2 files, as +these will contain many blocks. It is clearly futile to +use it on damaged single-block files, since a damaged +block cannot be recovered. If you wish to minimise +any potential data loss through media or transmission +errors, you might consider compressing with a smaller +block size. + +.SH PERFORMANCE NOTES +The sorting phase of compression gathers together similar strings +in the file. Because of this, files containing very long +runs of repeated symbols, like "aabaabaabaab ..." (repeated +several hundred times) may compress extraordinarily slowly. +You can use the +\-vvvvv +option to monitor progress in great detail, if you want. +Decompression speed is unaffected. + +Such pathological cases +seem rare in practice, appearing mostly in artificially-constructed +test files, and in low-level disk images. It may be inadvisable to +use +.I bzip2 +to compress the latter. +If you do get a file which causes severe slowness in compression, +try making the block size as small as possible, with flag \-1. + +Incompressible or virtually-incompressible data may decompress +rather more slowly than one would hope. This is due to +a naive implementation of the move-to-front coder. + +.I bzip2 +usually allocates several megabytes of memory to operate in, +and then charges all over it in a fairly random fashion. This +means that performance, both for compressing and decompressing, +is largely determined by the speed +at which your machine can service cache misses. +Because of this, small changes +to the code to reduce the miss rate have been observed to give +disproportionately large performance improvements. +I imagine +.I bzip2 +will perform best on machines with very large caches. + +Test mode (\-t) uses the low-memory decompression algorithm +(\-s). This means test mode does not run as fast as it could; +it could run as fast as the normal decompression machinery. +This could easily be fixed at the cost of some code bloat. + +.SH CAVEATS +I/O error messages are not as helpful as they could be. +.I Bzip2 +tries hard to detect I/O errors and exit cleanly, but the +details of what the problem is sometimes seem rather misleading. + +This manual page pertains to version 0.1 of +.I bzip2. +It may well happen that some future version will +use a different compressed file format. If you try to +decompress, using 0.1, a .bz2 file created with some +future version which uses a different compressed file format, +0.1 will complain that your file "is not a bzip2 file". +If that happens, you should obtain a more recent version +of +.I bzip2 +and use that to decompress the file. + +Wildcard expansion for Windows 95 and NT +is flaky. + +.I bzip2recover +uses 32-bit integers to represent bit positions in +compressed files, so it cannot handle compressed files +more than 512 megabytes long. This could easily be fixed. + +.I bzip2recover +sometimes reports a very small, incomplete final block. +This is spurious and can be safely ignored. + +.SH RELATIONSHIP TO bzip-0.21 +This program is a descendant of the +.I bzip +program, version 0.21, which I released in August 1996. +The primary difference of +.I bzip2 +is its avoidance of the possibly patented algorithms +which were used in 0.21. +.I bzip2 +also brings various useful refinements (\-s, \-t), +uses less memory, decompresses significantly faster, and +has support for recovering data from damaged files. + +Because +.I bzip2 +uses Huffman coding to construct the compressed bitstream, +rather than the arithmetic coding used in 0.21, +the compressed representations generated by the two programs +are incompatible, and they will not interoperate. The change +in suffix from .bz to .bz2 reflects this. It would have been +helpful to at least allow +.I bzip2 +to decompress files created by 0.21, but this would +defeat the primary aim of having a patent-free compressor. + +Huffman coding necessarily involves some coding inefficiency +compared to arithmetic coding. This means that +.I bzip2 +compresses about 1% worse than 0.21, an unfortunate but +unavoidable fact-of-life. On the other hand, decompression +is approximately 50% faster for the same reason, and the +change in file format gave an opportunity to add data-recovery +features. So it is not all bad. + +.SH AUTHOR +Julian Seward, jseward@acm.org. + +The ideas embodied in +.I bzip +and +.I bzip2 +are due to (at least) the following people: +Michael Burrows and David Wheeler (for the block sorting +transformation), David Wheeler (again, for the Huffman coder), +Peter Fenwick (for the structured coding model in 0.21, +and many refinements), +and +Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic +coder in 0.21). I am much indebted for their help, support and advice. +See the file ALGORITHMS in the source distribution for pointers to +sources of documentation. +Christian von Roques encouraged me to look for faster +sorting algorithms, so as to speed up compression. +Bela Lubkin encouraged me to improve the worst-case +compression performance. +Many people sent patches, helped with portability problems, +lent machines, gave advice and were generally helpful. + diff --git a/bzip2.1.preformatted b/bzip2.1.preformatted new file mode 100644 index 0000000..947dc97 --- /dev/null +++ b/bzip2.1.preformatted @@ -0,0 +1,462 @@ + + + +bzip2(1) bzip2(1) + + +NNAAMMEE + bzip2, bunzip2 - a block-sorting file compressor, v0.1 + bzip2recover - recovers data from damaged bzip2 files + + +SSYYNNOOPPSSIISS + bbzziipp22 [ --ccddffkkssttvvVVLL112233445566778899 ] [ _f_i_l_e_n_a_m_e_s _._._. ] + bbuunnzziipp22 [ --kkvvssVVLL ] [ _f_i_l_e_n_a_m_e_s _._._. ] + bbzziipp22rreeccoovveerr _f_i_l_e_n_a_m_e + + +DDEESSCCRRIIPPTTIIOONN + _B_z_i_p_2 compresses files using the Burrows-Wheeler block- + sorting text compression algorithm, and Huffman coding. + Compression is generally considerably better than that + achieved by more conventional LZ77/LZ78-based compressors, + and approaches the performance of the PPM family of sta- + tistical compressors. + + The command-line options are deliberately very similar to + those of _G_N_U _G_z_i_p_, but they are not identical. + + _B_z_i_p_2 expects a list of file names to accompany the com- + mand-line flags. Each file is replaced by a compressed + version of itself, with the name "original_name.bz2". + Each compressed file has the same modification date and + permissions as the corresponding original, so that these + properties can be correctly restored at decompression + time. File name handling is naive in the sense that there + is no mechanism for preserving original file names, per- + missions and dates in filesystems which lack these con- + cepts, or have serious file name length restrictions, such + as MS-DOS. + + _B_z_i_p_2 and _b_u_n_z_i_p_2 will not overwrite existing files; if + you want this to happen, you should delete them first. + + If no file names are specified, _b_z_i_p_2 compresses from + standard input to standard output. In this case, _b_z_i_p_2 + will decline to write compressed output to a terminal, as + this would be entirely incomprehensible and therefore + pointless. + + _B_u_n_z_i_p_2 (or _b_z_i_p_2 _-_d ) decompresses and restores all spec- + ified files whose names end in ".bz2". Files without this + suffix are ignored. Again, supplying no filenames causes + decompression from standard input to standard output. + + You can also compress or decompress files to the standard + output by giving the -c flag. You can decompress multiple + files like this, but you may only compress a single file + this way, since it would otherwise be difficult to sepa- + rate out the compressed representations of the original + files. + + + + 1 + + + + + +bzip2(1) bzip2(1) + + + Compression is always performed, even if the compressed + file is slightly larger than the original. Files of less + than about one hundred bytes tend to get larger, since the + compression mechanism has a constant overhead in the + region of 50 bytes. Random data (including the output of + most file compressors) is coded at about 8.05 bits per + byte, giving an expansion of around 0.5%. + + As a self-check for your protection, _b_z_i_p_2 uses 32-bit + CRCs to make sure that the decompressed version of a file + is identical to the original. This guards against corrup- + tion of the compressed data, and against undetected bugs + in _b_z_i_p_2 (hopefully very unlikely). The chances of data + corruption going undetected is microscopic, about one + chance in four billion for each file processed. Be aware, + though, that the check occurs upon decompression, so it + can only tell you that that something is wrong. It can't + help you recover the original uncompressed data. You can + use _b_z_i_p_2_r_e_c_o_v_e_r to try to recover data from damaged + files. + + Return values: 0 for a normal exit, 1 for environmental + problems (file not found, invalid flags, I/O errors, &c), + 2 to indicate a corrupt compressed file, 3 for an internal + consistency error (eg, bug) which caused _b_z_i_p_2 to panic. + + +MMEEMMOORRYY MMAANNAAGGEEMMEENNTT + _B_z_i_p_2 compresses large files in blocks. The block size + affects both the compression ratio achieved, and the + amount of memory needed both for compression and decom- + pression. The flags -1 through -9 specify the block size + to be 100,000 bytes through 900,000 bytes (the default) + respectively. At decompression-time, the block size used + for compression is read from the header of the compressed + file, and _b_u_n_z_i_p_2 then allocates itself just enough memory + to decompress the file. Since block sizes are stored in + compressed files, it follows that the flags -1 to -9 are + irrelevant to and so ignored during decompression. Com- + pression and decompression requirements, in bytes, can be + estimated as: + + Compression: 400k + ( 7 x block size ) + + Decompression: 100k + ( 5 x block size ), or + 100k + ( 2.5 x block size ) + + Larger block sizes give rapidly diminishing marginal + returns; most of the compression comes from the first two + or three hundred k of block size, a fact worth bearing in + mind when using _b_z_i_p_2 on small machines. It is also + important to appreciate that the decompression memory + requirement is set at compression-time by the choice of + block size. + + + + 2 + + + + + +bzip2(1) bzip2(1) + + + For files compressed with the default 900k block size, + _b_u_n_z_i_p_2 will require about 4600 kbytes to decompress. To + support decompression of any file on a 4 megabyte machine, + _b_u_n_z_i_p_2 has an option to decompress using approximately + half this amount of memory, about 2300 kbytes. Decompres- + sion speed is also halved, so you should use this option + only where necessary. The relevant flag is -s. + + In general, try and use the largest block size memory con- + straints allow, since that maximises the compression + achieved. Compression and decompression speed are virtu- + ally unaffected by block size. + + Another significant point applies to files which fit in a + single block -- that means most files you'd encounter + using a large block size. The amount of real memory + touched is proportional to the size of the file, since the + file is smaller than a block. For example, compressing a + file 20,000 bytes long with the flag -9 will cause the + compressor to allocate around 6700k of memory, but only + touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the + decompressor will allocate 4600k but only touch 100k + + 20000 * 5 = 200 kbytes. + + Here is a table which summarises the maximum memory usage + for different block sizes. Also recorded is the total + compressed size for 14 files of the Calgary Text Compres- + sion Corpus totalling 3,141,622 bytes. This column gives + some feel for how compression varies with block size. + These figures tend to understate the advantage of larger + block sizes for larger files, since the Corpus is domi- + nated by smaller files. + + Compress Decompress Decompress Corpus + Flag usage usage -s usage Size + + -1 1100k 600k 350k 914704 + -2 1800k 1100k 600k 877703 + -3 2500k 1600k 850k 860338 + -4 3200k 2100k 1100k 846899 + -5 3900k 2600k 1350k 845160 + -6 4600k 3100k 1600k 838626 + -7 5400k 3600k 1850k 834096 + -8 6000k 4100k 2100k 828642 + -9 6700k 4600k 2350k 828642 + + +OOPPTTIIOONNSS + --cc ----ssttddoouutt + Compress or decompress to standard output. -c will + decompress multiple files to stdout, but will only + compress a single file to stdout. + + + + + + 3 + + + + + +bzip2(1) bzip2(1) + + + --dd ----ddeeccoommpprreessss + Force decompression. _B_z_i_p_2 and _b_u_n_z_i_p_2 are really + the same program, and the decision about whether to + compress or decompress is done on the basis of + which name is used. This flag overrides that mech- + anism, and forces _b_z_i_p_2 to decompress. + + --ff ----ccoommpprreessss + The complement to -d: forces compression, regard- + less of the invokation name. + + --tt ----tteesstt + Check integrity of the specified file(s), but don't + decompress them. This really performs a trial + decompression and throws away the result, using the + low-memory decompression algorithm (see -s). + + --kk ----kkeeeepp + Keep (don't delete) input files during compression + or decompression. + + --ss ----ssmmaallll + Reduce memory usage, both for compression and + decompression. Files are decompressed using a mod- + ified algorithm which only requires 2.5 bytes per + block byte. This means any file can be decom- + pressed in 2300k of memory, albeit somewhat more + slowly than usual. + + During compression, -s selects a block size of + 200k, which limits memory use to around the same + figure, at the expense of your compression ratio. + In short, if your machine is low on memory (8 + megabytes or less), use -s for everything. See + MEMORY MANAGEMENT above. + + + --vv ----vveerrbboossee + Verbose mode -- show the compression ratio for each + file processed. Further -v's increase the ver- + bosity level, spewing out lots of information which + is primarily of interest for diagnostic purposes. + + --LL ----lliicceennssee + Display the software version, license terms and + conditions. + + --VV ----vveerrssiioonn + Same as -L. + + --11 ttoo --99 + Set the block size to 100 k, 200 k .. 900 k when + compressing. Has no effect when decompressing. + See MEMORY MANAGEMENT above. + + + + 4 + + + + + +bzip2(1) bzip2(1) + + + ----rreeppeettiittiivvee--ffaasstt + _b_z_i_p_2 injects some small pseudo-random variations + into very repetitive blocks to limit worst-case + performance during compression. If sorting runs + into difficulties, the block is randomised, and + sorting is restarted. Very roughly, _b_z_i_p_2 persists + for three times as long as a well-behaved input + would take before resorting to randomisation. This + flag makes it give up much sooner. + + + ----rreeppeettiittiivvee--bbeesstt + Opposite of --repetitive-fast; try a lot harder + before resorting to randomisation. + + +RREECCOOVVEERRIINNGG DDAATTAA FFRROOMM DDAAMMAAGGEEDD FFIILLEESS + _b_z_i_p_2 compresses files in blocks, usually 900kbytes long. + Each block is handled independently. If a media or trans- + mission error causes a multi-block .bz2 file to become + damaged, it may be possible to recover data from the + undamaged blocks in the file. + + The compressed representation of each block is delimited + by a 48-bit pattern, which makes it possible to find the + block boundaries with reasonable certainty. Each block + also carries its own 32-bit CRC, so damaged blocks can be + distinguished from undamaged ones. + + _b_z_i_p_2_r_e_c_o_v_e_r is a simple program whose purpose is to + search for blocks in .bz2 files, and write each block out + into its own .bz2 file. You can then use _b_z_i_p_2 _-_t to test + the integrity of the resulting files, and decompress those + which are undamaged. + + _b_z_i_p_2_r_e_c_o_v_e_r takes a single argument, the name of the dam- + aged file, and writes a number of files "rec0001file.bz2", + "rec0002file.bz2", etc, containing the extracted blocks. + The output filenames are designed so that the use of wild- + cards in subsequent processing -- for example, "bzip2 -dc + rec*file.bz2 > recovered_data" -- lists the files in the + "right" order. + + _b_z_i_p_2_r_e_c_o_v_e_r should be of most use dealing with large .bz2 + files, as these will contain many blocks. It is clearly + futile to use it on damaged single-block files, since a + damaged block cannot be recovered. If you wish to min- + imise any potential data loss through media or transmis- + sion errors, you might consider compressing with a smaller + block size. + + +PPEERRFFOORRMMAANNCCEE NNOOTTEESS + The sorting phase of compression gathers together similar + + + + 5 + + + + + +bzip2(1) bzip2(1) + + + strings in the file. Because of this, files containing + very long runs of repeated symbols, like "aabaabaabaab + ..." (repeated several hundred times) may compress + extraordinarily slowly. You can use the -vvvvv option to + monitor progress in great detail, if you want. Decompres- + sion speed is unaffected. + + Such pathological cases seem rare in practice, appearing + mostly in artificially-constructed test files, and in low- + level disk images. It may be inadvisable to use _b_z_i_p_2 to + compress the latter. If you do get a file which causes + severe slowness in compression, try making the block size + as small as possible, with flag -1. + + Incompressible or virtually-incompressible data may decom- + press rather more slowly than one would hope. This is due + to a naive implementation of the move-to-front coder. + + _b_z_i_p_2 usually allocates several megabytes of memory to + operate in, and then charges all over it in a fairly ran- + dom fashion. This means that performance, both for com- + pressing and decompressing, is largely determined by the + speed at which your machine can service cache misses. + Because of this, small changes to the code to reduce the + miss rate have been observed to give disproportionately + large performance improvements. I imagine _b_z_i_p_2 will per- + form best on machines with very large caches. + + Test mode (-t) uses the low-memory decompression algorithm + (-s). This means test mode does not run as fast as it + could; it could run as fast as the normal decompression + machinery. This could easily be fixed at the cost of some + code bloat. + + +CCAAVVEEAATTSS + I/O error messages are not as helpful as they could be. + _B_z_i_p_2 tries hard to detect I/O errors and exit cleanly, + but the details of what the problem is sometimes seem + rather misleading. + + This manual page pertains to version 0.1 of _b_z_i_p_2_. It may + well happen that some future version will use a different + compressed file format. If you try to decompress, using + 0.1, a .bz2 file created with some future version which + uses a different compressed file format, 0.1 will complain + that your file "is not a bzip2 file". If that happens, + you should obtain a more recent version of _b_z_i_p_2 and use + that to decompress the file. + + Wildcard expansion for Windows 95 and NT is flaky. + + _b_z_i_p_2_r_e_c_o_v_e_r uses 32-bit integers to represent bit posi- + tions in compressed files, so it cannot handle compressed + + + + 6 + + + + + +bzip2(1) bzip2(1) + + + files more than 512 megabytes long. This could easily be + fixed. + + _b_z_i_p_2_r_e_c_o_v_e_r sometimes reports a very small, incomplete + final block. This is spurious and can be safely ignored. + + +RREELLAATTIIOONNSSHHIIPP TTOO bbzziipp--00..2211 + This program is a descendant of the _b_z_i_p program, version + 0.21, which I released in August 1996. The primary dif- + ference of _b_z_i_p_2 is its avoidance of the possibly patented + algorithms which were used in 0.21. _b_z_i_p_2 also brings + various useful refinements (-s, -t), uses less memory, + decompresses significantly faster, and has support for + recovering data from damaged files. + + Because _b_z_i_p_2 uses Huffman coding to construct the com- + pressed bitstream, rather than the arithmetic coding used + in 0.21, the compressed representations generated by the + two programs are incompatible, and they will not interop- + erate. The change in suffix from .bz to .bz2 reflects + this. It would have been helpful to at least allow _b_z_i_p_2 + to decompress files created by 0.21, but this would defeat + the primary aim of having a patent-free compressor. + + Huffman coding necessarily involves some coding ineffi- + ciency compared to arithmetic coding. This means that + _b_z_i_p_2 compresses about 1% worse than 0.21, an unfortunate + but unavoidable fact-of-life. On the other hand, decom- + pression is approximately 50% faster for the same reason, + and the change in file format gave an opportunity to add + data-recovery features. So it is not all bad. + + +AAUUTTHHOORR + Julian Seward, jseward@acm.org. + + The ideas embodied in _b_z_i_p and _b_z_i_p_2 are due to (at least) + the following people: Michael Burrows and David Wheeler + (for the block sorting transformation), David Wheeler + (again, for the Huffman coder), Peter Fenwick (for the + structured coding model in 0.21, and many refinements), + and Alistair Moffat, Radford Neal and Ian Witten (for the + arithmetic coder in 0.21). I am much indebted for their + help, support and advice. See the file ALGORITHMS in the + source distribution for pointers to sources of documenta- + tion. Christian von Roques encouraged me to look for + faster sorting algorithms, so as to speed up compression. + Bela Lubkin encouraged me to improve the worst-case com- + pression performance. Many people sent patches, helped + with portability problems, lent machines, gave advice and + were generally helpful. + + + + + + 7 + + diff --git a/bzip2.c b/bzip2.c new file mode 100644 index 0000000..0fb45fb --- /dev/null +++ b/bzip2.c @@ -0,0 +1,4036 @@ + +/*-----------------------------------------------------------*/ +/*--- A block-sorting, lossless compressor bzip2.c ---*/ +/*-----------------------------------------------------------*/ + +/*-- + This program is bzip2, a lossless, block-sorting data compressor, + version 0.1pl0, dated 17-Aug-1997. + + Copyright (C) 1996, 1997 by Julian Seward. + Guildford, Surrey, UK + email: jseward@acm.org + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + The GNU General Public License is contained in the file LICENSE. + + This program is based on (at least) the work of: + Mike Burrows + David Wheeler + Peter Fenwick + Alistair Moffat + Radford Neal + Ian H. Witten + Robert Sedgewick + Jon L. Bentley + + For more information on these sources, see the file ALGORITHMS. +--*/ + +/*----------------------------------------------------*/ +/*--- IMPORTANT ---*/ +/*----------------------------------------------------*/ + +/*-- + WARNING: + This program (attempts to) compress data by performing several + non-trivial transformations on it. Unless you are 100% familiar + with *all* the algorithms contained herein, and with the + consequences of modifying them, you should NOT meddle with the + compression or decompression machinery. Incorrect changes can + and very likely *will* lead to disasterous loss of data. + + DISCLAIMER: + I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE + USE OF THIS PROGRAM, HOWSOEVER CAUSED. + + Every compression of a file implies an assumption that the + compressed file can be decompressed to reproduce the original. + Great efforts in design, coding and testing have been made to + ensure that this program works correctly. However, the + complexity of the algorithms, and, in particular, the presence + of various special cases in the code which occur with very low + but non-zero probability make it impossible to rule out the + possibility of bugs remaining in the program. DO NOT COMPRESS + ANY DATA WITH THIS PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE + POSSIBILITY, HOWEVER SMALL, THAT THE DATA WILL NOT BE RECOVERABLE. + + That is not to say this program is inherently unreliable. + Indeed, I very much hope the opposite is true. bzip2 has been + carefully constructed and extensively tested. +--*/ + + + +/*----------------------------------------------------*/ +/*--- and now for something much more pleasant :-) ---*/ +/*----------------------------------------------------*/ + +/*---------------------------------------------*/ +/*-- + Place a 1 beside your platform, and 0 elsewhere. +--*/ + +/*-- + Generic 32-bit Unix. + Also works on 64-bit Unix boxes. +--*/ +#define BZ_UNIX 1 + +/*-- + Win32, as seen by Jacob Navia's excellent + port of (Chris Fraser & David Hanson)'s excellent + lcc compiler. +--*/ +#define BZ_LCCWIN32 0 + + + +/*---------------------------------------------*/ +/*-- + Some stuff for all platforms. +--*/ + +#include +#include +#if DEBUG + #include +#endif +#include +#include +#include +#include + +#define ERROR_IF_EOF(i) { if ((i) == EOF) ioError(); } +#define ERROR_IF_NOT_ZERO(i) { if ((i) != 0) ioError(); } +#define ERROR_IF_MINUS_ONE(i) { if ((i) == (-1)) ioError(); } + + +/*---------------------------------------------*/ +/*-- + Platform-specific stuff. +--*/ + +#if BZ_UNIX + #include + #include + #include + #include + #include + + #define Int32 int + #define UInt32 unsigned int + #define Char char + #define UChar unsigned char + #define Int16 short + #define UInt16 unsigned short + + #define PATH_SEP '/' + #define MY_LSTAT lstat + #define MY_S_IFREG S_ISREG + #define MY_STAT stat + + #define APPEND_FILESPEC(root, name) \ + root=snocString((root), (name)) + + #define SET_BINARY_MODE(fd) /**/ + + /*-- + You should try very hard to persuade your C compiler + to inline the bits marked INLINE. Otherwise bzip2 will + run rather slowly. gcc version 2.x is recommended. + --*/ + #ifdef __GNUC__ + #define INLINE inline + #define NORETURN __attribute__ ((noreturn)) + #else + #define INLINE /**/ + #define NORETURN /**/ + #endif +#endif + + + +#if BZ_LCCWIN32 + #include + #include + #include + + #define Int32 int + #define UInt32 unsigned int + #define Int16 short + #define UInt16 unsigned short + #define Char char + #define UChar unsigned char + + #define INLINE /**/ + #define NORETURN /**/ + #define PATH_SEP '\\' + #define MY_LSTAT _stat + #define MY_STAT _stat + #define MY_S_IFREG(x) ((x) & _S_IFREG) + + #if 0 + /*-- lcc-win32 seems to expand wildcards itself --*/ + #define APPEND_FILESPEC(root, spec) \ + do { \ + if ((spec)[0] == '-') { \ + root = snocString((root), (spec)); \ + } else { \ + struct _finddata_t c_file; \ + long hFile; \ + hFile = _findfirst((spec), &c_file); \ + if ( hFile == -1L ) { \ + root = snocString ((root), (spec)); \ + } else { \ + int anInt = 0; \ + while ( anInt == 0 ) { \ + root = snocString((root), \ + &c_file.name[0]); \ + anInt = _findnext(hFile, &c_file); \ + } \ + } \ + } \ + } while ( 0 ) + #else + #define APPEND_FILESPEC(root, name) \ + root = snocString ((root), (name)) + #endif + + #define SET_BINARY_MODE(fd) \ + do { \ + int retVal = setmode ( fileno ( fd ), \ + O_BINARY ); \ + ERROR_IF_MINUS_ONE ( retVal ); \ + } while ( 0 ) + +#endif + + +/*---------------------------------------------*/ +/*-- + Some more stuff for all platforms :-) +--*/ + +#define Bool unsigned char +#define True 1 +#define False 0 + +/*-- + IntNative is your platform's `native' int size. + Only here to avoid probs with 64-bit platforms. +--*/ +#define IntNative int + + +/*-- + change to 1, or compile with -DDEBUG=1 to debug +--*/ +#ifndef DEBUG +#define DEBUG 0 +#endif + + +/*---------------------------------------------------*/ +/*--- ---*/ +/*---------------------------------------------------*/ + +/*-- + Implementation notes, July 1997 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + + Memory allocation + ~~~~~~~~~~~~~~~~~ + All large data structures are allocated on the C heap, + for better or for worse. That includes the various + arrays of pointers, striped words, bytes, frequency + tables and buffers for compression and decompression. + + bzip2 can operate at various block-sizes, ranging from + 100k to 900k in 100k steps, and it allocates only as + much as it needs to. When compressing, we know from the + command-line options what the block-size is going to be, + so all allocation can be done at start-up; if that + succeeds, there can be no further allocation problems. + + Decompression is more complicated. Each compressed file + contains, in its header, a byte indicating the block + size used for compression. This means bzip2 potentially + needs to reallocate memory for each file it deals with, + which in turn opens the possibility for a memory allocation + failure part way through a run of files, by encountering + a file requiring a much larger block size than all the + ones preceding it. + + The policy is to simply give up if a memory allocation + failure occurs. During decompression, it would be + possible to move on to subsequent files in the hope that + some might ask for a smaller block size, but the + complications for doing this seem more trouble than they + are worth. + + + Compressed file formats + ~~~~~~~~~~~~~~~~~~~~~~~ + [This is now entirely different from both 0.21, and from + any previous Huffman-coded variant of bzip. + See the associated file bzip2.txt for details.] + + + Error conditions + ~~~~~~~~~~~~~~~~ + Dealing with error conditions is the least satisfactory + aspect of bzip2. The policy is to try and leave the + filesystem in a consistent state, then quit, even if it + means not processing some of the files mentioned in the + command line. `A consistent state' means that a file + exists either in its compressed or uncompressed form, + but not both. This boils down to the rule `delete the + output file if an error condition occurs, leaving the + input intact'. Input files are only deleted when we can + be pretty sure the output file has been written and + closed successfully. + + Errors are a dog because there's so many things to + deal with. The following can happen mid-file, and + require cleaning up. + + internal `panics' -- indicating a bug + corrupted or inconsistent compressed file + can't allocate enough memory to decompress this file + I/O error reading/writing/opening/closing + signal catches -- Control-C, SIGTERM, SIGHUP. + + Other conditions, primarily pertaining to file names, + can be checked in-between files, which makes dealing + with them easier. +--*/ + + + +/*---------------------------------------------------*/ +/*--- Misc (file handling) data decls ---*/ +/*---------------------------------------------------*/ + +UInt32 bytesIn, bytesOut; +Int32 verbosity; +Bool keepInputFiles, smallMode, testFailsExist; +UInt32 globalCrc; +Int32 numFileNames, numFilesProcessed; + + +/*-- source modes; F==file, I==stdin, O==stdout --*/ +#define SM_I2O 1 +#define SM_F2O 2 +#define SM_F2F 3 + +/*-- operation modes --*/ +#define OM_Z 1 +#define OM_UNZ 2 +#define OM_TEST 3 + +Int32 opMode; +Int32 srcMode; + + +Int32 longestFileName; +Char inName[1024]; +Char outName[1024]; +Char *progName; +Char progNameReally[1024]; +FILE *outputHandleJustInCase; + +void panic ( Char* ) NORETURN; +void ioError ( void ) NORETURN; +void compressOutOfMemory ( Int32, Int32 ) NORETURN; +void uncompressOutOfMemory ( Int32, Int32 ) NORETURN; +void blockOverrun ( void ) NORETURN; +void badBlockHeader ( void ) NORETURN; +void badBGLengths ( void ) NORETURN; +void crcError ( UInt32, UInt32 ) NORETURN; +void bitStreamEOF ( void ) NORETURN; +void cleanUpAndFail ( Int32 ) NORETURN; +void compressedStreamEOF ( void ) NORETURN; + +void* myMalloc ( Int32 ); + + + +/*---------------------------------------------------*/ +/*--- Data decls for the front end ---*/ +/*---------------------------------------------------*/ + +/*-- + The overshoot bytes allow us to avoid most of + the cost of pointer renormalisation during + comparison of rotations in sorting. + The figure of 20 is derived as follows: + qSort3 allows an overshoot of up to 10. + It then calls simpleSort, which calls + fullGtU, also with max overshoot 10. + fullGtU does up to 10 comparisons without + renormalising, giving 10+10 == 20. +--*/ +#define NUM_OVERSHOOT_BYTES 20 + +/*-- + These are the main data structures for + the Burrows-Wheeler transform. +--*/ + +/*-- + Pointers to compression and decompression + structures. Set by + allocateCompressStructures and + setDecompressStructureSizes + + The structures are always set to be suitable + for a block of size 100000 * blockSize100k. +--*/ +UChar *block; /*-- compress --*/ +UInt16 *quadrant; /*-- compress --*/ +Int32 *zptr; /*-- compress --*/ +UInt16 *szptr; /*-- overlays zptr ---*/ +Int32 *ftab; /*-- compress --*/ + +UInt16 *ll16; /*-- small decompress --*/ +UChar *ll4; /*-- small decompress --*/ + +Int32 *tt; /*-- fast decompress --*/ +UChar *ll8; /*-- fast decompress --*/ + + +/*-- + freq table collected to save a pass over the data + during decompression. +--*/ +Int32 unzftab[256]; + + +/*-- + index of the last char in the block, so + the block size == last + 1. +--*/ +Int32 last; + + +/*-- + index in zptr[] of original string after sorting. +--*/ +Int32 origPtr; + + +/*-- + always: in the range 0 .. 9. + The current block size is 100000 * this number. +--*/ +Int32 blockSize100k; + + +/*-- + Used when sorting. If too many long comparisons + happen, we stop sorting, randomise the block + slightly, and try again. +--*/ + +Int32 workFactor; +Int32 workDone; +Int32 workLimit; +Bool blockRandomised; +Bool firstAttempt; +Int32 nBlocksRandomised; + + + +/*---------------------------------------------------*/ +/*--- Data decls for the back end ---*/ +/*---------------------------------------------------*/ + +#define MAX_ALPHA_SIZE 258 +#define MAX_CODE_LEN 23 + +#define RUNA 0 +#define RUNB 1 + +#define N_GROUPS 6 +#define G_SIZE 50 +#define N_ITERS 4 + +#define MAX_SELECTORS (2 + (900000 / G_SIZE)) + +Bool inUse[256]; +Int32 nInUse; + +UChar seqToUnseq[256]; +UChar unseqToSeq[256]; + +UChar selector [MAX_SELECTORS]; +UChar selectorMtf[MAX_SELECTORS]; + +Int32 nMTF; + +Int32 mtfFreq[MAX_ALPHA_SIZE]; + +UChar len [N_GROUPS][MAX_ALPHA_SIZE]; + +/*-- decompress only --*/ +Int32 limit [N_GROUPS][MAX_ALPHA_SIZE]; +Int32 base [N_GROUPS][MAX_ALPHA_SIZE]; +Int32 perm [N_GROUPS][MAX_ALPHA_SIZE]; +Int32 minLens[N_GROUPS]; + +/*-- compress only --*/ +Int32 code [N_GROUPS][MAX_ALPHA_SIZE]; +Int32 rfreq[N_GROUPS][MAX_ALPHA_SIZE]; + + +/*---------------------------------------------------*/ +/*--- 32-bit CRC grunge ---*/ +/*---------------------------------------------------*/ + +/*-- + I think this is an implementation of the AUTODIN-II, + Ethernet & FDDI 32-bit CRC standard. Vaguely derived + from code by Rob Warnock, in Section 51 of the + comp.compression FAQ. +--*/ + +UInt32 crc32Table[256] = { + + /*-- Ugly, innit? --*/ + + 0x00000000L, 0x04c11db7L, 0x09823b6eL, 0x0d4326d9L, + 0x130476dcL, 0x17c56b6bL, 0x1a864db2L, 0x1e475005L, + 0x2608edb8L, 0x22c9f00fL, 0x2f8ad6d6L, 0x2b4bcb61L, + 0x350c9b64L, 0x31cd86d3L, 0x3c8ea00aL, 0x384fbdbdL, + 0x4c11db70L, 0x48d0c6c7L, 0x4593e01eL, 0x4152fda9L, + 0x5f15adacL, 0x5bd4b01bL, 0x569796c2L, 0x52568b75L, + 0x6a1936c8L, 0x6ed82b7fL, 0x639b0da6L, 0x675a1011L, + 0x791d4014L, 0x7ddc5da3L, 0x709f7b7aL, 0x745e66cdL, + 0x9823b6e0L, 0x9ce2ab57L, 0x91a18d8eL, 0x95609039L, + 0x8b27c03cL, 0x8fe6dd8bL, 0x82a5fb52L, 0x8664e6e5L, + 0xbe2b5b58L, 0xbaea46efL, 0xb7a96036L, 0xb3687d81L, + 0xad2f2d84L, 0xa9ee3033L, 0xa4ad16eaL, 0xa06c0b5dL, + 0xd4326d90L, 0xd0f37027L, 0xddb056feL, 0xd9714b49L, + 0xc7361b4cL, 0xc3f706fbL, 0xceb42022L, 0xca753d95L, + 0xf23a8028L, 0xf6fb9d9fL, 0xfbb8bb46L, 0xff79a6f1L, + 0xe13ef6f4L, 0xe5ffeb43L, 0xe8bccd9aL, 0xec7dd02dL, + 0x34867077L, 0x30476dc0L, 0x3d044b19L, 0x39c556aeL, + 0x278206abL, 0x23431b1cL, 0x2e003dc5L, 0x2ac12072L, + 0x128e9dcfL, 0x164f8078L, 0x1b0ca6a1L, 0x1fcdbb16L, + 0x018aeb13L, 0x054bf6a4L, 0x0808d07dL, 0x0cc9cdcaL, + 0x7897ab07L, 0x7c56b6b0L, 0x71159069L, 0x75d48ddeL, + 0x6b93dddbL, 0x6f52c06cL, 0x6211e6b5L, 0x66d0fb02L, + 0x5e9f46bfL, 0x5a5e5b08L, 0x571d7dd1L, 0x53dc6066L, + 0x4d9b3063L, 0x495a2dd4L, 0x44190b0dL, 0x40d816baL, + 0xaca5c697L, 0xa864db20L, 0xa527fdf9L, 0xa1e6e04eL, + 0xbfa1b04bL, 0xbb60adfcL, 0xb6238b25L, 0xb2e29692L, + 0x8aad2b2fL, 0x8e6c3698L, 0x832f1041L, 0x87ee0df6L, + 0x99a95df3L, 0x9d684044L, 0x902b669dL, 0x94ea7b2aL, + 0xe0b41de7L, 0xe4750050L, 0xe9362689L, 0xedf73b3eL, + 0xf3b06b3bL, 0xf771768cL, 0xfa325055L, 0xfef34de2L, + 0xc6bcf05fL, 0xc27dede8L, 0xcf3ecb31L, 0xcbffd686L, + 0xd5b88683L, 0xd1799b34L, 0xdc3abdedL, 0xd8fba05aL, + 0x690ce0eeL, 0x6dcdfd59L, 0x608edb80L, 0x644fc637L, + 0x7a089632L, 0x7ec98b85L, 0x738aad5cL, 0x774bb0ebL, + 0x4f040d56L, 0x4bc510e1L, 0x46863638L, 0x42472b8fL, + 0x5c007b8aL, 0x58c1663dL, 0x558240e4L, 0x51435d53L, + 0x251d3b9eL, 0x21dc2629L, 0x2c9f00f0L, 0x285e1d47L, + 0x36194d42L, 0x32d850f5L, 0x3f9b762cL, 0x3b5a6b9bL, + 0x0315d626L, 0x07d4cb91L, 0x0a97ed48L, 0x0e56f0ffL, + 0x1011a0faL, 0x14d0bd4dL, 0x19939b94L, 0x1d528623L, + 0xf12f560eL, 0xf5ee4bb9L, 0xf8ad6d60L, 0xfc6c70d7L, + 0xe22b20d2L, 0xe6ea3d65L, 0xeba91bbcL, 0xef68060bL, + 0xd727bbb6L, 0xd3e6a601L, 0xdea580d8L, 0xda649d6fL, + 0xc423cd6aL, 0xc0e2d0ddL, 0xcda1f604L, 0xc960ebb3L, + 0xbd3e8d7eL, 0xb9ff90c9L, 0xb4bcb610L, 0xb07daba7L, + 0xae3afba2L, 0xaafbe615L, 0xa7b8c0ccL, 0xa379dd7bL, + 0x9b3660c6L, 0x9ff77d71L, 0x92b45ba8L, 0x9675461fL, + 0x8832161aL, 0x8cf30badL, 0x81b02d74L, 0x857130c3L, + 0x5d8a9099L, 0x594b8d2eL, 0x5408abf7L, 0x50c9b640L, + 0x4e8ee645L, 0x4a4ffbf2L, 0x470cdd2bL, 0x43cdc09cL, + 0x7b827d21L, 0x7f436096L, 0x7200464fL, 0x76c15bf8L, + 0x68860bfdL, 0x6c47164aL, 0x61043093L, 0x65c52d24L, + 0x119b4be9L, 0x155a565eL, 0x18197087L, 0x1cd86d30L, + 0x029f3d35L, 0x065e2082L, 0x0b1d065bL, 0x0fdc1becL, + 0x3793a651L, 0x3352bbe6L, 0x3e119d3fL, 0x3ad08088L, + 0x2497d08dL, 0x2056cd3aL, 0x2d15ebe3L, 0x29d4f654L, + 0xc5a92679L, 0xc1683bceL, 0xcc2b1d17L, 0xc8ea00a0L, + 0xd6ad50a5L, 0xd26c4d12L, 0xdf2f6bcbL, 0xdbee767cL, + 0xe3a1cbc1L, 0xe760d676L, 0xea23f0afL, 0xeee2ed18L, + 0xf0a5bd1dL, 0xf464a0aaL, 0xf9278673L, 0xfde69bc4L, + 0x89b8fd09L, 0x8d79e0beL, 0x803ac667L, 0x84fbdbd0L, + 0x9abc8bd5L, 0x9e7d9662L, 0x933eb0bbL, 0x97ffad0cL, + 0xafb010b1L, 0xab710d06L, 0xa6322bdfL, 0xa2f33668L, + 0xbcb4666dL, 0xb8757bdaL, 0xb5365d03L, 0xb1f740b4L +}; + + +/*---------------------------------------------*/ +void initialiseCRC ( void ) +{ + globalCrc = 0xffffffffL; +} + + +/*---------------------------------------------*/ +UInt32 getFinalCRC ( void ) +{ + return ~globalCrc; +} + + +/*---------------------------------------------*/ +UInt32 getGlobalCRC ( void ) +{ + return globalCrc; +} + + +/*---------------------------------------------*/ +void setGlobalCRC ( UInt32 newCrc ) +{ + globalCrc = newCrc; +} + + +/*---------------------------------------------*/ +#define UPDATE_CRC(crcVar,cha) \ +{ \ + crcVar = (crcVar << 8) ^ \ + crc32Table[(crcVar >> 24) ^ \ + ((UChar)cha)]; \ +} + + +/*---------------------------------------------------*/ +/*--- Bit stream I/O ---*/ +/*---------------------------------------------------*/ + + +UInt32 bsBuff; +Int32 bsLive; +FILE* bsStream; +Bool bsWriting; + + +/*---------------------------------------------*/ +void bsSetStream ( FILE* f, Bool wr ) +{ + if (bsStream != NULL) panic ( "bsSetStream" ); + bsStream = f; + bsLive = 0; + bsBuff = 0; + bytesOut = 0; + bytesIn = 0; + bsWriting = wr; +} + + +/*---------------------------------------------*/ +void bsFinishedWithStream ( void ) +{ + if (bsWriting) + while (bsLive > 0) { + fputc ( (UChar)(bsBuff >> 24), bsStream ); + bsBuff <<= 8; + bsLive -= 8; + bytesOut++; + } + bsStream = NULL; +} + + +/*---------------------------------------------*/ +#define bsNEEDR(nz) \ +{ \ + while (bsLive < nz) { \ + Int32 zzi = fgetc ( bsStream ); \ + if (zzi == EOF) compressedStreamEOF(); \ + bsBuff = (bsBuff << 8) | (zzi & 0xffL); \ + bsLive += 8; \ + } \ +} + + +/*---------------------------------------------*/ +#define bsNEEDW(nz) \ +{ \ + while (bsLive >= 8) { \ + fputc ( (UChar)(bsBuff >> 24), \ + bsStream ); \ + bsBuff <<= 8; \ + bsLive -= 8; \ + bytesOut++; \ + } \ +} + + +/*---------------------------------------------*/ +#define bsR1(vz) \ +{ \ + bsNEEDR(1); \ + vz = (bsBuff >> (bsLive-1)) & 1; \ + bsLive--; \ +} + + +/*---------------------------------------------*/ +INLINE UInt32 bsR ( Int32 n ) +{ + UInt32 v; + bsNEEDR ( n ); + v = (bsBuff >> (bsLive-n)) & ((1 << n)-1); + bsLive -= n; + return v; +} + + +/*---------------------------------------------*/ +INLINE void bsW ( Int32 n, UInt32 v ) +{ + bsNEEDW ( n ); + bsBuff |= (v << (32 - bsLive - n)); + bsLive += n; +} + + +/*---------------------------------------------*/ +UChar bsGetUChar ( void ) +{ + return (UChar)bsR(8); +} + + +/*---------------------------------------------*/ +void bsPutUChar ( UChar c ) +{ + bsW(8, (UInt32)c ); +} + + +/*---------------------------------------------*/ +Int32 bsGetUInt32 ( void ) +{ + UInt32 u; + u = 0; + u = (u << 8) | bsR(8); + u = (u << 8) | bsR(8); + u = (u << 8) | bsR(8); + u = (u << 8) | bsR(8); + return u; +} + + +/*---------------------------------------------*/ +UInt32 bsGetIntVS ( UInt32 numBits ) +{ + return (UInt32)bsR(numBits); +} + + +/*---------------------------------------------*/ +UInt32 bsGetInt32 ( void ) +{ + return (Int32)bsGetUInt32(); +} + + +/*---------------------------------------------*/ +void bsPutUInt32 ( UInt32 u ) +{ + bsW ( 8, (u >> 24) & 0xffL ); + bsW ( 8, (u >> 16) & 0xffL ); + bsW ( 8, (u >> 8) & 0xffL ); + bsW ( 8, u & 0xffL ); +} + + +/*---------------------------------------------*/ +void bsPutInt32 ( Int32 c ) +{ + bsPutUInt32 ( (UInt32)c ); +} + + +/*---------------------------------------------*/ +void bsPutIntVS ( Int32 numBits, UInt32 c ) +{ + bsW ( numBits, c ); +} + + +/*---------------------------------------------------*/ +/*--- Huffman coding low-level stuff ---*/ +/*---------------------------------------------------*/ + +#define WEIGHTOF(zz0) ((zz0) & 0xffffff00) +#define DEPTHOF(zz1) ((zz1) & 0x000000ff) +#define MYMAX(zz2,zz3) ((zz2) > (zz3) ? (zz2) : (zz3)) + +#define ADDWEIGHTS(zw1,zw2) \ + (WEIGHTOF(zw1)+WEIGHTOF(zw2)) | \ + (1 + MYMAX(DEPTHOF(zw1),DEPTHOF(zw2))) + +#define UPHEAP(z) \ +{ \ + Int32 zz, tmp; \ + zz = z; tmp = heap[zz]; \ + while (weight[tmp] < weight[heap[zz >> 1]]) { \ + heap[zz] = heap[zz >> 1]; \ + zz >>= 1; \ + } \ + heap[zz] = tmp; \ +} + +#define DOWNHEAP(z) \ +{ \ + Int32 zz, yy, tmp; \ + zz = z; tmp = heap[zz]; \ + while (True) { \ + yy = zz << 1; \ + if (yy > nHeap) break; \ + if (yy < nHeap && \ + weight[heap[yy+1]] < weight[heap[yy]]) \ + yy++; \ + if (weight[tmp] < weight[heap[yy]]) break; \ + heap[zz] = heap[yy]; \ + zz = yy; \ + } \ + heap[zz] = tmp; \ +} + + +/*---------------------------------------------*/ +void hbMakeCodeLengths ( UChar *len, + Int32 *freq, + Int32 alphaSize, + Int32 maxLen ) +{ + /*-- + Nodes and heap entries run from 1. Entry 0 + for both the heap and nodes is a sentinel. + --*/ + Int32 nNodes, nHeap, n1, n2, i, j, k; + Bool tooLong; + + Int32 heap [ MAX_ALPHA_SIZE + 2 ]; + Int32 weight [ MAX_ALPHA_SIZE * 2 ]; + Int32 parent [ MAX_ALPHA_SIZE * 2 ]; + + for (i = 0; i < alphaSize; i++) + weight[i+1] = (freq[i] == 0 ? 1 : freq[i]) << 8; + + while (True) { + + nNodes = alphaSize; + nHeap = 0; + + heap[0] = 0; + weight[0] = 0; + parent[0] = -2; + + for (i = 1; i <= alphaSize; i++) { + parent[i] = -1; + nHeap++; + heap[nHeap] = i; + UPHEAP(nHeap); + } + if (!(nHeap < (MAX_ALPHA_SIZE+2))) + panic ( "hbMakeCodeLengths(1)" ); + + while (nHeap > 1) { + n1 = heap[1]; heap[1] = heap[nHeap]; nHeap--; DOWNHEAP(1); + n2 = heap[1]; heap[1] = heap[nHeap]; nHeap--; DOWNHEAP(1); + nNodes++; + parent[n1] = parent[n2] = nNodes; + weight[nNodes] = ADDWEIGHTS(weight[n1], weight[n2]); + parent[nNodes] = -1; + nHeap++; + heap[nHeap] = nNodes; + UPHEAP(nHeap); + } + if (!(nNodes < (MAX_ALPHA_SIZE * 2))) + panic ( "hbMakeCodeLengths(2)" ); + + tooLong = False; + for (i = 1; i <= alphaSize; i++) { + j = 0; + k = i; + while (parent[k] >= 0) { k = parent[k]; j++; } + len[i-1] = j; + if (j > maxLen) tooLong = True; + } + + if (! tooLong) break; + + for (i = 1; i < alphaSize; i++) { + j = weight[i] >> 8; + j = 1 + (j / 2); + weight[i] = j << 8; + } + } +} + + +/*---------------------------------------------*/ +void hbAssignCodes ( Int32 *code, + UChar *length, + Int32 minLen, + Int32 maxLen, + Int32 alphaSize ) +{ + Int32 n, vec, i; + + vec = 0; + for (n = minLen; n <= maxLen; n++) { + for (i = 0; i < alphaSize; i++) + if (length[i] == n) { code[i] = vec; vec++; }; + vec <<= 1; + } +} + + +/*---------------------------------------------*/ +void hbCreateDecodeTables ( Int32 *limit, + Int32 *base, + Int32 *perm, + UChar *length, + Int32 minLen, + Int32 maxLen, + Int32 alphaSize ) +{ + Int32 pp, i, j, vec; + + pp = 0; + for (i = minLen; i <= maxLen; i++) + for (j = 0; j < alphaSize; j++) + if (length[j] == i) { perm[pp] = j; pp++; }; + + for (i = 0; i < MAX_CODE_LEN; i++) base[i] = 0; + for (i = 0; i < alphaSize; i++) base[length[i]+1]++; + + for (i = 1; i < MAX_CODE_LEN; i++) base[i] += base[i-1]; + + for (i = 0; i < MAX_CODE_LEN; i++) limit[i] = 0; + vec = 0; + + for (i = minLen; i <= maxLen; i++) { + vec += (base[i+1] - base[i]); + limit[i] = vec-1; + vec <<= 1; + } + for (i = minLen + 1; i <= maxLen; i++) + base[i] = ((limit[i-1] + 1) << 1) - base[i]; +} + + + +/*---------------------------------------------------*/ +/*--- Undoing the reversible transformation ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +#define SET_LL4(i,n) \ + { if (((i) & 0x1) == 0) \ + ll4[(i) >> 1] = (ll4[(i) >> 1] & 0xf0) | (n); else \ + ll4[(i) >> 1] = (ll4[(i) >> 1] & 0x0f) | ((n) << 4); \ + } + +#define GET_LL4(i) \ + (((UInt32)(ll4[(i) >> 1])) >> (((i) << 2) & 0x4) & 0xF) + +#define SET_LL(i,n) \ + { ll16[i] = (UInt16)(n & 0x0000ffff); \ + SET_LL4(i, n >> 16); \ + } + +#define GET_LL(i) \ + (((UInt32)ll16[i]) | (GET_LL4(i) << 16)) + + +/*---------------------------------------------*/ +/*-- + Manage memory for compression/decompression. + When compressing, a single block size applies to + all files processed, and that's set when the + program starts. But when decompressing, each file + processed could have been compressed with a + different block size, so we may have to free + and reallocate on a per-file basis. + + A call with argument of zero means + `free up everything.' And a value of zero for + blockSize100k means no memory is currently allocated. +--*/ + + +/*---------------------------------------------*/ +void allocateCompressStructures ( void ) +{ + Int32 n = 100000 * blockSize100k; + block = malloc ( (n + 1 + NUM_OVERSHOOT_BYTES) * sizeof(UChar) ); + quadrant = malloc ( (n + NUM_OVERSHOOT_BYTES) * sizeof(Int16) ); + zptr = malloc ( n * sizeof(Int32) ); + ftab = malloc ( 65537 * sizeof(Int32) ); + + if (block == NULL || quadrant == NULL || + zptr == NULL || ftab == NULL) { + Int32 totalDraw + = (n + 1 + NUM_OVERSHOOT_BYTES) * sizeof(UChar) + + (n + NUM_OVERSHOOT_BYTES) * sizeof(Int16) + + n * sizeof(Int32) + + 65537 * sizeof(Int32); + + compressOutOfMemory ( totalDraw, n ); + } + + /*-- + Since we want valid indexes for block of + -1 to n + NUM_OVERSHOOT_BYTES - 1 + inclusive. + --*/ + block++; + + /*-- + The back end needs a place to store the MTF values + whilst it calculates the coding tables. We could + put them in the zptr array. However, these values + will fit in a short, so we overlay szptr at the + start of zptr, in the hope of reducing the number + of cache misses induced by the multiple traversals + of the MTF values when calculating coding tables. + Seems to improve compression speed by about 1%. + --*/ + szptr = (UInt16*)zptr; +} + + +/*---------------------------------------------*/ +void setDecompressStructureSizes ( Int32 newSize100k ) +{ + if (! (0 <= newSize100k && newSize100k <= 9 && + 0 <= blockSize100k && blockSize100k <= 9)) + panic ( "setDecompressStructureSizes" ); + + if (newSize100k == blockSize100k) return; + + blockSize100k = newSize100k; + + if (ll16 != NULL) free ( ll16 ); + if (ll4 != NULL) free ( ll4 ); + if (ll8 != NULL) free ( ll8 ); + if (tt != NULL) free ( tt ); + + if (newSize100k == 0) return; + + if (smallMode) { + + Int32 n = 100000 * newSize100k; + ll16 = malloc ( n * sizeof(UInt16) ); + ll4 = malloc ( ((n+1) >> 1) * sizeof(UChar) ); + + if (ll4 == NULL || ll16 == NULL) { + Int32 totalDraw + = n * sizeof(Int16) + ((n+1) >> 1) * sizeof(UChar); + uncompressOutOfMemory ( totalDraw, n ); + } + + } else { + + Int32 n = 100000 * newSize100k; + ll8 = malloc ( n * sizeof(UChar) ); + tt = malloc ( n * sizeof(Int32) ); + + if (ll8 == NULL || tt == NULL) { + Int32 totalDraw + = n * sizeof(UChar) + n * sizeof(UInt32); + uncompressOutOfMemory ( totalDraw, n ); + } + + } +} + + + +/*---------------------------------------------------*/ +/*--- The new back end ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +void makeMaps ( void ) +{ + Int32 i; + nInUse = 0; + for (i = 0; i < 256; i++) + if (inUse[i]) { + seqToUnseq[nInUse] = i; + unseqToSeq[i] = nInUse; + nInUse++; + } +} + + +/*---------------------------------------------*/ +void generateMTFValues ( void ) +{ + UChar yy[256]; + Int32 i, j; + UChar tmp; + UChar tmp2; + Int32 zPend; + Int32 wr; + Int32 EOB; + + makeMaps(); + EOB = nInUse+1; + + for (i = 0; i <= EOB; i++) mtfFreq[i] = 0; + + wr = 0; + zPend = 0; + for (i = 0; i < nInUse; i++) yy[i] = (UChar) i; + + + for (i = 0; i <= last; i++) { + UChar ll_i; + + #if DEBUG + assert (wr <= i); + #endif + + ll_i = unseqToSeq[block[zptr[i] - 1]]; + #if DEBUG + assert (ll_i < nInUse); + #endif + + j = 0; + tmp = yy[j]; + while ( ll_i != tmp ) { + j++; + tmp2 = tmp; + tmp = yy[j]; + yy[j] = tmp2; + }; + yy[0] = tmp; + + if (j == 0) { + zPend++; + } else { + if (zPend > 0) { + zPend--; + while (True) { + switch (zPend % 2) { + case 0: szptr[wr] = RUNA; wr++; mtfFreq[RUNA]++; break; + case 1: szptr[wr] = RUNB; wr++; mtfFreq[RUNB]++; break; + }; + if (zPend < 2) break; + zPend = (zPend - 2) / 2; + }; + zPend = 0; + } + szptr[wr] = j+1; wr++; mtfFreq[j+1]++; + } + } + + if (zPend > 0) { + zPend--; + while (True) { + switch (zPend % 2) { + case 0: szptr[wr] = RUNA; wr++; mtfFreq[RUNA]++; break; + case 1: szptr[wr] = RUNB; wr++; mtfFreq[RUNB]++; break; + }; + if (zPend < 2) break; + zPend = (zPend - 2) / 2; + }; + } + + szptr[wr] = EOB; wr++; mtfFreq[EOB]++; + + nMTF = wr; +} + + +/*---------------------------------------------*/ +#define LESSER_ICOST 0 +#define GREATER_ICOST 15 + +void sendMTFValues ( void ) +{ + Int32 v, t, i, j, gs, ge, totc, bt, bc, iter; + Int32 nSelectors, alphaSize, minLen, maxLen, selCtr; + Int32 nGroups, nBytes; + + /*-- + UChar len [N_GROUPS][MAX_ALPHA_SIZE]; + is a global since the decoder also needs it. + + Int32 code[N_GROUPS][MAX_ALPHA_SIZE]; + Int32 rfreq[N_GROUPS][MAX_ALPHA_SIZE]; + are also globals only used in this proc. + Made global to keep stack frame size small. + --*/ + + + UInt16 cost[N_GROUPS]; + Int32 fave[N_GROUPS]; + + if (verbosity >= 3) + fprintf ( stderr, + " %d in block, %d after MTF & 1-2 coding, %d+2 syms in use\n", + last+1, nMTF, nInUse ); + + alphaSize = nInUse+2; + for (t = 0; t < N_GROUPS; t++) + for (v = 0; v < alphaSize; v++) + len[t][v] = GREATER_ICOST; + + /*--- Decide how many coding tables to use ---*/ + if (nMTF <= 0) panic ( "sendMTFValues(0)" ); + if (nMTF < 200) nGroups = 2; else + if (nMTF < 800) nGroups = 4; else + nGroups = 6; + + /*--- Generate an initial set of coding tables ---*/ + { + Int32 nPart, remF, tFreq, aFreq; + + nPart = nGroups; + remF = nMTF; + gs = 0; + while (nPart > 0) { + tFreq = remF / nPart; + ge = gs-1; + aFreq = 0; + while (aFreq < tFreq && ge < alphaSize-1) { + ge++; + aFreq += mtfFreq[ge]; + } + + if (ge > gs + && nPart != nGroups && nPart != 1 + && ((nGroups-nPart) % 2 == 1)) { + aFreq -= mtfFreq[ge]; + ge--; + } + + if (verbosity >= 3) + fprintf ( stderr, + " initial group %d, [%d .. %d], has %d syms (%4.1f%%)\n", + nPart, gs, ge, aFreq, + (100.0 * (float)aFreq) / (float)nMTF ); + + for (v = 0; v < alphaSize; v++) + if (v >= gs && v <= ge) + len[nPart-1][v] = LESSER_ICOST; else + len[nPart-1][v] = GREATER_ICOST; + + nPart--; + gs = ge+1; + remF -= aFreq; + } + } + + /*--- + Iterate up to N_ITERS times to improve the tables. + ---*/ + for (iter = 0; iter < N_ITERS; iter++) { + + for (t = 0; t < nGroups; t++) fave[t] = 0; + + for (t = 0; t < nGroups; t++) + for (v = 0; v < alphaSize; v++) + rfreq[t][v] = 0; + + nSelectors = 0; + totc = 0; + gs = 0; + while (True) { + + /*--- Set group start & end marks. --*/ + if (gs >= nMTF) break; + ge = gs + G_SIZE - 1; + if (ge >= nMTF) ge = nMTF-1; + + /*-- + Calculate the cost of this group as coded + by each of the coding tables. + --*/ + for (t = 0; t < nGroups; t++) cost[t] = 0; + + if (nGroups == 6) { + register UInt16 cost0, cost1, cost2, cost3, cost4, cost5; + cost0 = cost1 = cost2 = cost3 = cost4 = cost5 = 0; + for (i = gs; i <= ge; i++) { + UInt16 icv = szptr[i]; + cost0 += len[0][icv]; + cost1 += len[1][icv]; + cost2 += len[2][icv]; + cost3 += len[3][icv]; + cost4 += len[4][icv]; + cost5 += len[5][icv]; + } + cost[0] = cost0; cost[1] = cost1; cost[2] = cost2; + cost[3] = cost3; cost[4] = cost4; cost[5] = cost5; + } else { + for (i = gs; i <= ge; i++) { + UInt16 icv = szptr[i]; + for (t = 0; t < nGroups; t++) cost[t] += len[t][icv]; + } + } + + /*-- + Find the coding table which is best for this group, + and record its identity in the selector table. + --*/ + bc = 999999999; bt = -1; + for (t = 0; t < nGroups; t++) + if (cost[t] < bc) { bc = cost[t]; bt = t; }; + totc += bc; + fave[bt]++; + selector[nSelectors] = bt; + nSelectors++; + + /*-- + Increment the symbol frequencies for the selected table. + --*/ + for (i = gs; i <= ge; i++) + rfreq[bt][ szptr[i] ]++; + + gs = ge+1; + } + if (verbosity >= 3) { + fprintf ( stderr, + " pass %d: size is %d, grp uses are ", + iter+1, totc/8 ); + for (t = 0; t < nGroups; t++) + fprintf ( stderr, "%d ", fave[t] ); + fprintf ( stderr, "\n" ); + } + + /*-- + Recompute the tables based on the accumulated frequencies. + --*/ + for (t = 0; t < nGroups; t++) + hbMakeCodeLengths ( &len[t][0], &rfreq[t][0], alphaSize, 20 ); + } + + + if (!(nGroups < 8)) panic ( "sendMTFValues(1)" ); + if (!(nSelectors < 32768 && + nSelectors <= (2 + (900000 / G_SIZE)))) + panic ( "sendMTFValues(2)" ); + + + /*--- Compute MTF values for the selectors. ---*/ + { + UChar pos[N_GROUPS], ll_i, tmp2, tmp; + for (i = 0; i < nGroups; i++) pos[i] = i; + for (i = 0; i < nSelectors; i++) { + ll_i = selector[i]; + j = 0; + tmp = pos[j]; + while ( ll_i != tmp ) { + j++; + tmp2 = tmp; + tmp = pos[j]; + pos[j] = tmp2; + }; + pos[0] = tmp; + selectorMtf[i] = j; + } + }; + + /*--- Assign actual codes for the tables. --*/ + for (t = 0; t < nGroups; t++) { + minLen = 32; + maxLen = 0; + for (i = 0; i < alphaSize; i++) { + if (len[t][i] > maxLen) maxLen = len[t][i]; + if (len[t][i] < minLen) minLen = len[t][i]; + } + if (maxLen > 20) panic ( "sendMTFValues(3)" ); + if (minLen < 1) panic ( "sendMTFValues(4)" ); + hbAssignCodes ( &code[t][0], &len[t][0], + minLen, maxLen, alphaSize ); + } + + /*--- Transmit the mapping table. ---*/ + { + Bool inUse16[16]; + for (i = 0; i < 16; i++) { + inUse16[i] = False; + for (j = 0; j < 16; j++) + if (inUse[i * 16 + j]) inUse16[i] = True; + } + + nBytes = bytesOut; + for (i = 0; i < 16; i++) + if (inUse16[i]) bsW(1,1); else bsW(1,0); + + for (i = 0; i < 16; i++) + if (inUse16[i]) + for (j = 0; j < 16; j++) + if (inUse[i * 16 + j]) bsW(1,1); else bsW(1,0); + + if (verbosity >= 3) + fprintf ( stderr, " bytes: mapping %d, ", bytesOut-nBytes ); + } + + /*--- Now the selectors. ---*/ + nBytes = bytesOut; + bsW ( 3, nGroups ); + bsW ( 15, nSelectors ); + for (i = 0; i < nSelectors; i++) { + for (j = 0; j < selectorMtf[i]; j++) bsW(1,1); + bsW(1,0); + } + if (verbosity >= 3) + fprintf ( stderr, "selectors %d, ", bytesOut-nBytes ); + + /*--- Now the coding tables. ---*/ + nBytes = bytesOut; + + for (t = 0; t < nGroups; t++) { + Int32 curr = len[t][0]; + bsW ( 5, curr ); + for (i = 0; i < alphaSize; i++) { + while (curr < len[t][i]) { bsW(2,2); curr++; /* 10 */ }; + while (curr > len[t][i]) { bsW(2,3); curr--; /* 11 */ }; + bsW ( 1, 0 ); + } + } + + if (verbosity >= 3) + fprintf ( stderr, "code lengths %d, ", bytesOut-nBytes ); + + /*--- And finally, the block data proper ---*/ + nBytes = bytesOut; + selCtr = 0; + gs = 0; + while (True) { + if (gs >= nMTF) break; + ge = gs + G_SIZE - 1; + if (ge >= nMTF) ge = nMTF-1; + for (i = gs; i <= ge; i++) { + #if DEBUG + assert (selector[selCtr] < nGroups); + #endif + bsW ( len [selector[selCtr]] [szptr[i]], + code [selector[selCtr]] [szptr[i]] ); + } + + gs = ge+1; + selCtr++; + } + if (!(selCtr == nSelectors)) panic ( "sendMTFValues(5)" ); + + if (verbosity >= 3) + fprintf ( stderr, "codes %d\n", bytesOut-nBytes ); +} + + +/*---------------------------------------------*/ +void moveToFrontCodeAndSend ( void ) +{ + bsPutIntVS ( 24, origPtr ); + generateMTFValues(); + sendMTFValues(); +} + + +/*---------------------------------------------*/ +void recvDecodingTables ( void ) +{ + Int32 i, j, t, nGroups, nSelectors, alphaSize; + Int32 minLen, maxLen; + Bool inUse16[16]; + + /*--- Receive the mapping table ---*/ + for (i = 0; i < 16; i++) + if (bsR(1) == 1) + inUse16[i] = True; else + inUse16[i] = False; + + for (i = 0; i < 256; i++) inUse[i] = False; + + for (i = 0; i < 16; i++) + if (inUse16[i]) + for (j = 0; j < 16; j++) + if (bsR(1) == 1) inUse[i * 16 + j] = True; + + makeMaps(); + alphaSize = nInUse+2; + + /*--- Now the selectors ---*/ + nGroups = bsR ( 3 ); + nSelectors = bsR ( 15 ); + for (i = 0; i < nSelectors; i++) { + j = 0; + while (bsR(1) == 1) j++; + selectorMtf[i] = j; + } + + /*--- Undo the MTF values for the selectors. ---*/ + { + UChar pos[N_GROUPS], tmp, v; + for (v = 0; v < nGroups; v++) pos[v] = v; + + for (i = 0; i < nSelectors; i++) { + v = selectorMtf[i]; + tmp = pos[v]; + while (v > 0) { pos[v] = pos[v-1]; v--; } + pos[0] = tmp; + selector[i] = tmp; + } + } + + /*--- Now the coding tables ---*/ + for (t = 0; t < nGroups; t++) { + Int32 curr = bsR ( 5 ); + for (i = 0; i < alphaSize; i++) { + while (bsR(1) == 1) { + if (bsR(1) == 0) curr++; else curr--; + } + len[t][i] = curr; + } + } + + /*--- Create the Huffman decoding tables ---*/ + for (t = 0; t < nGroups; t++) { + minLen = 32; + maxLen = 0; + for (i = 0; i < alphaSize; i++) { + if (len[t][i] > maxLen) maxLen = len[t][i]; + if (len[t][i] < minLen) minLen = len[t][i]; + } + hbCreateDecodeTables ( + &limit[t][0], &base[t][0], &perm[t][0], &len[t][0], + minLen, maxLen, alphaSize + ); + minLens[t] = minLen; + } +} + + +/*---------------------------------------------*/ +#define GET_MTF_VAL(lval) \ +{ \ + Int32 zt, zn, zvec, zj; \ + if (groupPos == 0) { \ + groupNo++; \ + groupPos = G_SIZE; \ + } \ + groupPos--; \ + zt = selector[groupNo]; \ + zn = minLens[zt]; \ + zvec = bsR ( zn ); \ + while (zvec > limit[zt][zn]) { \ + zn++; bsR1(zj); \ + zvec = (zvec << 1) | zj; \ + }; \ + lval = perm[zt][zvec - base[zt][zn]]; \ +} + + +/*---------------------------------------------*/ +void getAndMoveToFrontDecode ( void ) +{ + UChar yy[256]; + Int32 i, j, nextSym, limitLast; + Int32 EOB, groupNo, groupPos; + + limitLast = 100000 * blockSize100k; + origPtr = bsGetIntVS ( 24 ); + + recvDecodingTables(); + EOB = nInUse+1; + groupNo = -1; + groupPos = 0; + + /*-- + Setting up the unzftab entries here is not strictly + necessary, but it does save having to do it later + in a separate pass, and so saves a block's worth of + cache misses. + --*/ + for (i = 0; i <= 255; i++) unzftab[i] = 0; + + for (i = 0; i <= 255; i++) yy[i] = (UChar) i; + + last = -1; + + GET_MTF_VAL(nextSym); + + while (True) { + + if (nextSym == EOB) break; + + if (nextSym == RUNA || nextSym == RUNB) { + UChar ch; + Int32 s = -1; + Int32 N = 1; + do { + if (nextSym == RUNA) s = s + (0+1) * N; else + if (nextSym == RUNB) s = s + (1+1) * N; + N = N * 2; + GET_MTF_VAL(nextSym); + } + while (nextSym == RUNA || nextSym == RUNB); + + s++; + ch = seqToUnseq[yy[0]]; + unzftab[ch] += s; + + if (smallMode) + while (s > 0) { + last++; + ll16[last] = ch; + s--; + } + else + while (s > 0) { + last++; + ll8[last] = ch; + s--; + }; + + if (last >= limitLast) blockOverrun(); + continue; + + } else { + + UChar tmp; + last++; if (last >= limitLast) blockOverrun(); + + tmp = yy[nextSym-1]; + unzftab[seqToUnseq[tmp]]++; + if (smallMode) + ll16[last] = seqToUnseq[tmp]; else + ll8[last] = seqToUnseq[tmp]; + + /*-- + This loop is hammered during decompression, + hence the unrolling. + + for (j = nextSym-1; j > 0; j--) yy[j] = yy[j-1]; + --*/ + + j = nextSym-1; + for (; j > 3; j -= 4) { + yy[j] = yy[j-1]; + yy[j-1] = yy[j-2]; + yy[j-2] = yy[j-3]; + yy[j-3] = yy[j-4]; + } + for (; j > 0; j--) yy[j] = yy[j-1]; + + yy[0] = tmp; + GET_MTF_VAL(nextSym); + continue; + } + } +} + + +/*---------------------------------------------------*/ +/*--- Block-sorting machinery ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +/*-- + Compare two strings in block. We assume (see + discussion above) that i1 and i2 have a max + offset of 10 on entry, and that the first + bytes of both block and quadrant have been + copied into the "overshoot area", ie + into the subscript range + [last+1 .. last+NUM_OVERSHOOT_BYTES]. +--*/ +INLINE Bool fullGtU ( Int32 i1, Int32 i2 ) +{ + Int32 k; + UChar c1, c2; + UInt16 s1, s2; + + #if DEBUG + /*-- + shellsort shouldn't ask to compare + something with itself. + --*/ + assert (i1 != i2); + #endif + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + i1++; i2++; + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + i1++; i2++; + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + i1++; i2++; + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + i1++; i2++; + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + i1++; i2++; + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + i1++; i2++; + + k = last + 1; + + do { + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + s1 = quadrant[i1]; + s2 = quadrant[i2]; + if (s1 != s2) return (s1 > s2); + i1++; i2++; + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + s1 = quadrant[i1]; + s2 = quadrant[i2]; + if (s1 != s2) return (s1 > s2); + i1++; i2++; + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + s1 = quadrant[i1]; + s2 = quadrant[i2]; + if (s1 != s2) return (s1 > s2); + i1++; i2++; + + c1 = block[i1]; + c2 = block[i2]; + if (c1 != c2) return (c1 > c2); + s1 = quadrant[i1]; + s2 = quadrant[i2]; + if (s1 != s2) return (s1 > s2); + i1++; i2++; + + if (i1 > last) { i1 -= last; i1--; }; + if (i2 > last) { i2 -= last; i2--; }; + + k -= 4; + workDone++; + } + while (k >= 0); + + return False; +} + +/*---------------------------------------------*/ +/*-- + Knuth's increments seem to work better + than Incerpi-Sedgewick here. Possibly + because the number of elems to sort is + usually small, typically <= 20. +--*/ +Int32 incs[14] = { 1, 4, 13, 40, 121, 364, 1093, 3280, + 9841, 29524, 88573, 265720, + 797161, 2391484 }; + +void simpleSort ( Int32 lo, Int32 hi, Int32 d ) +{ + Int32 i, j, h, bigN, hp; + Int32 v; + + bigN = hi - lo + 1; + if (bigN < 2) return; + + hp = 0; + while (incs[hp] < bigN) hp++; + hp--; + + for (; hp >= 0; hp--) { + h = incs[hp]; + if (verbosity >= 5) + fprintf ( stderr, " shell increment %d\n", h ); + + i = lo + h; + while (True) { + + /*-- copy 1 --*/ + if (i > hi) break; + v = zptr[i]; + j = i; + while ( fullGtU ( zptr[j-h]+d, v+d ) ) { + zptr[j] = zptr[j-h]; + j = j - h; + if (j <= (lo + h - 1)) break; + } + zptr[j] = v; + i++; + + /*-- copy 2 --*/ + if (i > hi) break; + v = zptr[i]; + j = i; + while ( fullGtU ( zptr[j-h]+d, v+d ) ) { + zptr[j] = zptr[j-h]; + j = j - h; + if (j <= (lo + h - 1)) break; + } + zptr[j] = v; + i++; + + /*-- copy 3 --*/ + if (i > hi) break; + v = zptr[i]; + j = i; + while ( fullGtU ( zptr[j-h]+d, v+d ) ) { + zptr[j] = zptr[j-h]; + j = j - h; + if (j <= (lo + h - 1)) break; + } + zptr[j] = v; + i++; + + if (workDone > workLimit && firstAttempt) return; + } + } +} + + +/*---------------------------------------------*/ +/*-- + The following is an implementation of + an elegant 3-way quicksort for strings, + described in a paper "Fast Algorithms for + Sorting and Searching Strings", by Robert + Sedgewick and Jon L. Bentley. +--*/ + +#define swap(lv1, lv2) \ + { Int32 tmp = lv1; lv1 = lv2; lv2 = tmp; } + +INLINE void vswap ( Int32 p1, Int32 p2, Int32 n ) +{ + while (n > 0) { + swap(zptr[p1], zptr[p2]); + p1++; p2++; n--; + } +} + +INLINE UChar med3 ( UChar a, UChar b, UChar c ) +{ + UChar t; + if (a > b) { t = a; a = b; b = t; }; + if (b > c) { t = b; b = c; c = t; }; + if (a > b) b = a; + return b; +} + + +#define min(a,b) ((a) < (b)) ? (a) : (b) + +typedef + struct { Int32 ll; Int32 hh; Int32 dd; } + StackElem; + +#define push(lz,hz,dz) { stack[sp].ll = lz; \ + stack[sp].hh = hz; \ + stack[sp].dd = dz; \ + sp++; } + +#define pop(lz,hz,dz) { sp--; \ + lz = stack[sp].ll; \ + hz = stack[sp].hh; \ + dz = stack[sp].dd; } + +#define SMALL_THRESH 20 +#define DEPTH_THRESH 10 + +/*-- + If you are ever unlucky/improbable enough + to get a stack overflow whilst sorting, + increase the following constant and try + again. In practice I have never seen the + stack go above 27 elems, so the following + limit seems very generous. +--*/ +#define QSORT_STACK_SIZE 1000 + + +void qSort3 ( Int32 loSt, Int32 hiSt, Int32 dSt ) +{ + Int32 unLo, unHi, ltLo, gtHi, med, n, m; + Int32 sp, lo, hi, d; + StackElem stack[QSORT_STACK_SIZE]; + + sp = 0; + push ( loSt, hiSt, dSt ); + + while (sp > 0) { + + if (sp >= QSORT_STACK_SIZE) panic ( "stack overflow in qSort3" ); + + pop ( lo, hi, d ); + + if (hi - lo < SMALL_THRESH || d > DEPTH_THRESH) { + simpleSort ( lo, hi, d ); + if (workDone > workLimit && firstAttempt) return; + continue; + } + + med = med3 ( block[zptr[ lo ]+d], + block[zptr[ hi ]+d], + block[zptr[ (lo+hi)>>1 ]+d] ); + + unLo = ltLo = lo; + unHi = gtHi = hi; + + while (True) { + while (True) { + if (unLo > unHi) break; + n = ((Int32)block[zptr[unLo]+d]) - med; + if (n == 0) { swap(zptr[unLo], zptr[ltLo]); ltLo++; unLo++; continue; }; + if (n > 0) break; + unLo++; + } + while (True) { + if (unLo > unHi) break; + n = ((Int32)block[zptr[unHi]+d]) - med; + if (n == 0) { swap(zptr[unHi], zptr[gtHi]); gtHi--; unHi--; continue; }; + if (n < 0) break; + unHi--; + } + if (unLo > unHi) break; + swap(zptr[unLo], zptr[unHi]); unLo++; unHi--; + } + #if DEBUG + assert (unHi == unLo-1); + #endif + + if (gtHi < ltLo) { + push(lo, hi, d+1 ); + continue; + } + + n = min(ltLo-lo, unLo-ltLo); vswap(lo, unLo-n, n); + m = min(hi-gtHi, gtHi-unHi); vswap(unLo, hi-m+1, m); + + n = lo + unLo - ltLo - 1; + m = hi - (gtHi - unHi) + 1; + + push ( lo, n, d ); + push ( n+1, m-1, d+1 ); + push ( m, hi, d ); + } +} + + +/*---------------------------------------------*/ + +#define BIGFREQ(b) (ftab[((b)+1) << 8] - ftab[(b) << 8]) + +#define SETMASK (1 << 21) +#define CLEARMASK (~(SETMASK)) + +void sortIt ( void ) +{ + Int32 i, j, ss, sb; + Int32 runningOrder[256]; + Int32 copy[256]; + Bool bigDone[256]; + UChar c1, c2; + Int32 numQSorted; + + /*-- + In the various block-sized structures, live data runs + from 0 to last+NUM_OVERSHOOT_BYTES inclusive. First, + set up the overshoot area for block. + --*/ + + if (verbosity >= 4) fprintf ( stderr, " sort initialise ...\n" ); + for (i = 0; i < NUM_OVERSHOOT_BYTES; i++) + block[last+i+1] = block[i % (last+1)]; + for (i = 0; i <= last+NUM_OVERSHOOT_BYTES; i++) + quadrant[i] = 0; + + block[-1] = block[last]; + + if (last < 4000) { + + /*-- + Use simpleSort(), since the full sorting mechanism + has quite a large constant overhead. + --*/ + if (verbosity >= 4) fprintf ( stderr, " simpleSort ...\n" ); + for (i = 0; i <= last; i++) zptr[i] = i; + firstAttempt = False; + workDone = workLimit = 0; + simpleSort ( 0, last, 0 ); + if (verbosity >= 4) fprintf ( stderr, " simpleSort done.\n" ); + + } else { + + numQSorted = 0; + for (i = 0; i <= 255; i++) bigDone[i] = False; + + if (verbosity >= 4) fprintf ( stderr, " bucket sorting ...\n" ); + + for (i = 0; i <= 65536; i++) ftab[i] = 0; + + c1 = block[-1]; + for (i = 0; i <= last; i++) { + c2 = block[i]; + ftab[(c1 << 8) + c2]++; + c1 = c2; + } + + for (i = 1; i <= 65536; i++) ftab[i] += ftab[i-1]; + + c1 = block[0]; + for (i = 0; i < last; i++) { + c2 = block[i+1]; + j = (c1 << 8) + c2; + c1 = c2; + ftab[j]--; + zptr[ftab[j]] = i; + } + j = (block[last] << 8) + block[0]; + ftab[j]--; + zptr[ftab[j]] = last; + + /*-- + Now ftab contains the first loc of every small bucket. + Calculate the running order, from smallest to largest + big bucket. + --*/ + + for (i = 0; i <= 255; i++) runningOrder[i] = i; + + { + Int32 vv; + Int32 h = 1; + do h = 3 * h + 1; while (h <= 256); + do { + h = h / 3; + for (i = h; i <= 255; i++) { + vv = runningOrder[i]; + j = i; + while ( BIGFREQ(runningOrder[j-h]) > BIGFREQ(vv) ) { + runningOrder[j] = runningOrder[j-h]; + j = j - h; + if (j <= (h - 1)) goto zero; + } + zero: + runningOrder[j] = vv; + } + } while (h != 1); + } + + /*-- + The main sorting loop. + --*/ + + for (i = 0; i <= 255; i++) { + + /*-- + Process big buckets, starting with the least full. + --*/ + ss = runningOrder[i]; + + /*-- + Complete the big bucket [ss] by quicksorting + any unsorted small buckets [ss, j]. Hopefully + previous pointer-scanning phases have already + completed many of the small buckets [ss, j], so + we don't have to sort them at all. + --*/ + for (j = 0; j <= 255; j++) { + sb = (ss << 8) + j; + if ( ! (ftab[sb] & SETMASK) ) { + Int32 lo = ftab[sb] & CLEARMASK; + Int32 hi = (ftab[sb+1] & CLEARMASK) - 1; + if (hi > lo) { + if (verbosity >= 4) + fprintf ( stderr, + " qsort [0x%x, 0x%x] done %d this %d\n", + ss, j, numQSorted, hi - lo + 1 ); + qSort3 ( lo, hi, 2 ); + numQSorted += ( hi - lo + 1 ); + if (workDone > workLimit && firstAttempt) return; + } + ftab[sb] |= SETMASK; + } + } + + /*-- + The ss big bucket is now done. Record this fact, + and update the quadrant descriptors. Remember to + update quadrants in the overshoot area too, if + necessary. The "if (i < 255)" test merely skips + this updating for the last bucket processed, since + updating for the last bucket is pointless. + --*/ + bigDone[ss] = True; + + if (i < 255) { + Int32 bbStart = ftab[ss << 8] & CLEARMASK; + Int32 bbSize = (ftab[(ss+1) << 8] & CLEARMASK) - bbStart; + Int32 shifts = 0; + + while ((bbSize >> shifts) > 65534) shifts++; + + for (j = 0; j < bbSize; j++) { + Int32 a2update = zptr[bbStart + j]; + UInt16 qVal = (UInt16)(j >> shifts); + quadrant[a2update] = qVal; + if (a2update < NUM_OVERSHOOT_BYTES) + quadrant[a2update + last + 1] = qVal; + } + + if (! ( ((bbSize-1) >> shifts) <= 65535 )) panic ( "sortIt" ); + } + + /*-- + Now scan this big bucket so as to synthesise the + sorted order for small buckets [t, ss] for all t != ss. + --*/ + for (j = 0; j <= 255; j++) + copy[j] = ftab[(j << 8) + ss] & CLEARMASK; + + for (j = ftab[ss << 8] & CLEARMASK; + j < (ftab[(ss+1) << 8] & CLEARMASK); + j++) { + c1 = block[zptr[j]-1]; + if ( ! bigDone[c1] ) { + zptr[copy[c1]] = zptr[j] == 0 ? last : zptr[j] - 1; + copy[c1] ++; + } + } + + for (j = 0; j <= 255; j++) ftab[(j << 8) + ss] |= SETMASK; + } + if (verbosity >= 4) + fprintf ( stderr, " %d pointers, %d sorted, %d scanned\n", + last+1, numQSorted, (last+1) - numQSorted ); + } +} + + +/*---------------------------------------------------*/ +/*--- Stuff for randomising repetitive blocks ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +Int32 rNums[512] = { + 619, 720, 127, 481, 931, 816, 813, 233, 566, 247, + 985, 724, 205, 454, 863, 491, 741, 242, 949, 214, + 733, 859, 335, 708, 621, 574, 73, 654, 730, 472, + 419, 436, 278, 496, 867, 210, 399, 680, 480, 51, + 878, 465, 811, 169, 869, 675, 611, 697, 867, 561, + 862, 687, 507, 283, 482, 129, 807, 591, 733, 623, + 150, 238, 59, 379, 684, 877, 625, 169, 643, 105, + 170, 607, 520, 932, 727, 476, 693, 425, 174, 647, + 73, 122, 335, 530, 442, 853, 695, 249, 445, 515, + 909, 545, 703, 919, 874, 474, 882, 500, 594, 612, + 641, 801, 220, 162, 819, 984, 589, 513, 495, 799, + 161, 604, 958, 533, 221, 400, 386, 867, 600, 782, + 382, 596, 414, 171, 516, 375, 682, 485, 911, 276, + 98, 553, 163, 354, 666, 933, 424, 341, 533, 870, + 227, 730, 475, 186, 263, 647, 537, 686, 600, 224, + 469, 68, 770, 919, 190, 373, 294, 822, 808, 206, + 184, 943, 795, 384, 383, 461, 404, 758, 839, 887, + 715, 67, 618, 276, 204, 918, 873, 777, 604, 560, + 951, 160, 578, 722, 79, 804, 96, 409, 713, 940, + 652, 934, 970, 447, 318, 353, 859, 672, 112, 785, + 645, 863, 803, 350, 139, 93, 354, 99, 820, 908, + 609, 772, 154, 274, 580, 184, 79, 626, 630, 742, + 653, 282, 762, 623, 680, 81, 927, 626, 789, 125, + 411, 521, 938, 300, 821, 78, 343, 175, 128, 250, + 170, 774, 972, 275, 999, 639, 495, 78, 352, 126, + 857, 956, 358, 619, 580, 124, 737, 594, 701, 612, + 669, 112, 134, 694, 363, 992, 809, 743, 168, 974, + 944, 375, 748, 52, 600, 747, 642, 182, 862, 81, + 344, 805, 988, 739, 511, 655, 814, 334, 249, 515, + 897, 955, 664, 981, 649, 113, 974, 459, 893, 228, + 433, 837, 553, 268, 926, 240, 102, 654, 459, 51, + 686, 754, 806, 760, 493, 403, 415, 394, 687, 700, + 946, 670, 656, 610, 738, 392, 760, 799, 887, 653, + 978, 321, 576, 617, 626, 502, 894, 679, 243, 440, + 680, 879, 194, 572, 640, 724, 926, 56, 204, 700, + 707, 151, 457, 449, 797, 195, 791, 558, 945, 679, + 297, 59, 87, 824, 713, 663, 412, 693, 342, 606, + 134, 108, 571, 364, 631, 212, 174, 643, 304, 329, + 343, 97, 430, 751, 497, 314, 983, 374, 822, 928, + 140, 206, 73, 263, 980, 736, 876, 478, 430, 305, + 170, 514, 364, 692, 829, 82, 855, 953, 676, 246, + 369, 970, 294, 750, 807, 827, 150, 790, 288, 923, + 804, 378, 215, 828, 592, 281, 565, 555, 710, 82, + 896, 831, 547, 261, 524, 462, 293, 465, 502, 56, + 661, 821, 976, 991, 658, 869, 905, 758, 745, 193, + 768, 550, 608, 933, 378, 286, 215, 979, 792, 961, + 61, 688, 793, 644, 986, 403, 106, 366, 905, 644, + 372, 567, 466, 434, 645, 210, 389, 550, 919, 135, + 780, 773, 635, 389, 707, 100, 626, 958, 165, 504, + 920, 176, 193, 713, 857, 265, 203, 50, 668, 108, + 645, 990, 626, 197, 510, 357, 358, 850, 858, 364, + 936, 638 +}; + + +#define RAND_DECLS \ + Int32 rNToGo = 0; \ + Int32 rTPos = 0; \ + +#define RAND_MASK ((rNToGo == 1) ? 1 : 0) + +#define RAND_UPD_MASK \ + if (rNToGo == 0) { \ + rNToGo = rNums[rTPos]; \ + rTPos++; if (rTPos == 512) rTPos = 0; \ + } \ + rNToGo--; + + + +/*---------------------------------------------------*/ +/*--- The Reversible Transformation (tm) ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +void randomiseBlock ( void ) +{ + Int32 i; + RAND_DECLS; + for (i = 0; i < 256; i++) inUse[i] = False; + + for (i = 0; i <= last; i++) { + RAND_UPD_MASK; + block[i] ^= RAND_MASK; + inUse[block[i]] = True; + } +} + + +/*---------------------------------------------*/ +void doReversibleTransformation ( void ) +{ + Int32 i; + + if (verbosity >= 2) fprintf ( stderr, "\n" ); + + workLimit = workFactor * last; + workDone = 0; + blockRandomised = False; + firstAttempt = True; + + sortIt (); + + if (verbosity >= 3) + fprintf ( stderr, " %d work, %d block, ratio %5.2f\n", + workDone, last, (float)workDone / (float)(last) ); + + if (workDone > workLimit && firstAttempt) { + if (verbosity >= 2) + fprintf ( stderr, " sorting aborted; randomising block\n" ); + randomiseBlock (); + workLimit = workDone = 0; + blockRandomised = True; + firstAttempt = False; + sortIt(); + if (verbosity >= 3) + fprintf ( stderr, " %d work, %d block, ratio %f\n", + workDone, last, (float)workDone / (float)(last) ); + } + + origPtr = -1; + for (i = 0; i <= last; i++) + if (zptr[i] == 0) + { origPtr = i; break; }; + + if (origPtr == -1) panic ( "doReversibleTransformation" ); +} + + +/*---------------------------------------------*/ + +INLINE Int32 indexIntoF ( Int32 indx, Int32 *cftab ) +{ + Int32 nb, na, mid; + nb = 0; + na = 256; + do { + mid = (nb + na) >> 1; + if (indx >= cftab[mid]) nb = mid; else na = mid; + } + while (na - nb != 1); + return nb; +} + + +#define GET_SMALL(cccc) \ + \ + cccc = indexIntoF ( tPos, cftab ); \ + tPos = GET_LL(tPos); + + +void undoReversibleTransformation_small ( FILE* dst ) +{ + Int32 cftab[257], cftabAlso[257]; + Int32 i, j, tmp, tPos; + UChar ch; + + /*-- + We assume here that the global array unzftab will + already be holding the frequency counts for + ll8[0 .. last]. + --*/ + + /*-- Set up cftab to facilitate generation of indexIntoF --*/ + cftab[0] = 0; + for (i = 1; i <= 256; i++) cftab[i] = unzftab[i-1]; + for (i = 1; i <= 256; i++) cftab[i] += cftab[i-1]; + + /*-- Make a copy of it, used in generation of T --*/ + for (i = 0; i <= 256; i++) cftabAlso[i] = cftab[i]; + + /*-- compute the T vector --*/ + for (i = 0; i <= last; i++) { + ch = (UChar)ll16[i]; + SET_LL(i, cftabAlso[ch]); + cftabAlso[ch]++; + } + + /*-- + Compute T^(-1) by pointer reversal on T. This is rather + subtle, in that, if the original block was two or more + (in general, N) concatenated copies of the same thing, + the T vector will consist of N cycles, each of length + blocksize / N, and decoding will involve traversing one + of these cycles N times. Which particular cycle doesn't + matter -- they are all equivalent. The tricky part is to + make sure that the pointer reversal creates a correct + reversed cycle for us to traverse. So, the code below + simply reverses whatever cycle origPtr happens to fall into, + without regard to the cycle length. That gives one reversed + cycle, which for normal blocks, is the entire block-size long. + For repeated blocks, it will be interspersed with the other + N-1 non-reversed cycles. Providing that the F-subscripting + phase which follows starts at origPtr, all then works ok. + --*/ + i = origPtr; + j = GET_LL(i); + do { + tmp = GET_LL(j); + SET_LL(j, i); + i = j; + j = tmp; + } + while (i != origPtr); + + /*-- + We recreate the original by subscripting F through T^(-1). + The run-length-decoder below requires characters incrementally, + so tPos is set to a starting value, and is updated by + the GET_SMALL macro. + --*/ + tPos = origPtr; + + /*-------------------------------------------------*/ + /*-- + This is pretty much a verbatim copy of the + run-length decoder present in the distribution + bzip-0.21; it has to be here to avoid creating + block[] as an intermediary structure. As in 0.21, + this code derives from some sent to me by + Christian von Roques. + + It allows dst==NULL, so as to support the test (-t) + option without slowing down the fast decompression + code. + --*/ + { + IntNative retVal; + Int32 i2, count, chPrev, ch2; + UInt32 localCrc; + + count = 0; + i2 = 0; + ch2 = 256; /*-- not a char and not EOF --*/ + localCrc = getGlobalCRC(); + + { + RAND_DECLS; + while ( i2 <= last ) { + chPrev = ch2; + GET_SMALL(ch2); + if (blockRandomised) { + RAND_UPD_MASK; + ch2 ^= (UInt32)RAND_MASK; + } + i2++; + + if (dst) + retVal = putc ( ch2, dst ); + + UPDATE_CRC ( localCrc, (UChar)ch2 ); + + if (ch2 != chPrev) { + count = 1; + } else { + count++; + if (count >= 4) { + Int32 j2; + UChar z; + GET_SMALL(z); + if (blockRandomised) { + RAND_UPD_MASK; + z ^= RAND_MASK; + } + for (j2 = 0; j2 < (Int32)z; j2++) { + if (dst) retVal = putc (ch2, dst); + UPDATE_CRC ( localCrc, (UChar)ch2 ); + } + i2++; + count = 0; + } + } + } + } + + setGlobalCRC ( localCrc ); + } + /*-- end of the in-line run-length-decoder. --*/ +} +#undef GET_SMALL + + +/*---------------------------------------------*/ + +#define GET_FAST(cccc) \ + \ + cccc = ll8[tPos]; \ + tPos = tt[tPos]; + + +void undoReversibleTransformation_fast ( FILE* dst ) +{ + Int32 cftab[257]; + Int32 i, tPos; + UChar ch; + + /*-- + We assume here that the global array unzftab will + already be holding the frequency counts for + ll8[0 .. last]. + --*/ + + /*-- Set up cftab to facilitate generation of T^(-1) --*/ + cftab[0] = 0; + for (i = 1; i <= 256; i++) cftab[i] = unzftab[i-1]; + for (i = 1; i <= 256; i++) cftab[i] += cftab[i-1]; + + /*-- compute the T^(-1) vector --*/ + for (i = 0; i <= last; i++) { + ch = (UChar)ll8[i]; + tt[cftab[ch]] = i; + cftab[ch]++; + } + + /*-- + We recreate the original by subscripting L through T^(-1). + The run-length-decoder below requires characters incrementally, + so tPos is set to a starting value, and is updated by + the GET_FAST macro. + --*/ + tPos = tt[origPtr]; + + /*-------------------------------------------------*/ + /*-- + This is pretty much a verbatim copy of the + run-length decoder present in the distribution + bzip-0.21; it has to be here to avoid creating + block[] as an intermediary structure. As in 0.21, + this code derives from some sent to me by + Christian von Roques. + --*/ + { + IntNative retVal; + Int32 i2, count, chPrev, ch2; + UInt32 localCrc; + + count = 0; + i2 = 0; + ch2 = 256; /*-- not a char and not EOF --*/ + localCrc = getGlobalCRC(); + + if (blockRandomised) { + RAND_DECLS; + while ( i2 <= last ) { + chPrev = ch2; + GET_FAST(ch2); + RAND_UPD_MASK; + ch2 ^= (UInt32)RAND_MASK; + i2++; + + retVal = putc ( ch2, dst ); + UPDATE_CRC ( localCrc, (UChar)ch2 ); + + if (ch2 != chPrev) { + count = 1; + } else { + count++; + if (count >= 4) { + Int32 j2; + UChar z; + GET_FAST(z); + RAND_UPD_MASK; + z ^= RAND_MASK; + for (j2 = 0; j2 < (Int32)z; j2++) { + retVal = putc (ch2, dst); + UPDATE_CRC ( localCrc, (UChar)ch2 ); + } + i2++; + count = 0; + } + } + } + + } else { + + while ( i2 <= last ) { + chPrev = ch2; + GET_FAST(ch2); + i2++; + + retVal = putc ( ch2, dst ); + UPDATE_CRC ( localCrc, (UChar)ch2 ); + + if (ch2 != chPrev) { + count = 1; + } else { + count++; + if (count >= 4) { + Int32 j2; + UChar z; + GET_FAST(z); + for (j2 = 0; j2 < (Int32)z; j2++) { + retVal = putc (ch2, dst); + UPDATE_CRC ( localCrc, (UChar)ch2 ); + } + i2++; + count = 0; + } + } + } + + } /*-- if (blockRandomised) --*/ + + setGlobalCRC ( localCrc ); + } + /*-- end of the in-line run-length-decoder. --*/ +} +#undef GET_FAST + + +/*---------------------------------------------------*/ +/*--- The block loader and RLEr ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +/* Top 16: run length, 1 to 255. +* Lower 16: the char, or MY_EOF for EOF. +*/ + +#define MY_EOF 257 + +INLINE Int32 getRLEpair ( FILE* src ) +{ + Int32 runLength; + IntNative ch, chLatest; + + ch = getc ( src ); + + /*--- Because I have no idea what kind of a value EOF is. ---*/ + if (ch == EOF) { + ERROR_IF_NOT_ZERO ( errno ); + return (1 << 16) | MY_EOF; + } + + runLength = 0; + do { + chLatest = getc ( src ); + runLength++; + bytesIn++; + } + while (ch == chLatest && runLength < 255); + + if ( chLatest != EOF ) { + if ( ungetc ( chLatest, src ) == EOF ) + panic ( "getRLEpair: ungetc failed" ); + } else { + ERROR_IF_NOT_ZERO ( errno ); + } + + /*--- Conditional is just a speedup hack. ---*/ + if (runLength == 1) { + UPDATE_CRC ( globalCrc, (UChar)ch ); + return (1 << 16) | ch; + } else { + Int32 i; + for (i = 1; i <= runLength; i++) + UPDATE_CRC ( globalCrc, (UChar)ch ); + return (runLength << 16) | ch; + } +} + + +/*---------------------------------------------*/ +void loadAndRLEsource ( FILE* src ) +{ + Int32 ch, allowableBlockSize, i; + + last = -1; + ch = 0; + + for (i = 0; i < 256; i++) inUse[i] = False; + + /*--- 20 is just a paranoia constant ---*/ + allowableBlockSize = 100000 * blockSize100k - 20; + + while (last < allowableBlockSize && ch != MY_EOF) { + Int32 rlePair, runLen; + rlePair = getRLEpair ( src ); + ch = rlePair & 0xFFFF; + runLen = (UInt32)rlePair >> 16; + + #if DEBUG + assert (runLen >= 1 && runLen <= 255); + #endif + + if (ch != MY_EOF) { + inUse[ch] = True; + switch (runLen) { + case 1: + last++; block[last] = (UChar)ch; break; + case 2: + last++; block[last] = (UChar)ch; + last++; block[last] = (UChar)ch; break; + case 3: + last++; block[last] = (UChar)ch; + last++; block[last] = (UChar)ch; + last++; block[last] = (UChar)ch; break; + default: + inUse[runLen-4] = True; + last++; block[last] = (UChar)ch; + last++; block[last] = (UChar)ch; + last++; block[last] = (UChar)ch; + last++; block[last] = (UChar)ch; + last++; block[last] = (UChar)(runLen-4); break; + } + } + } +} + + +/*---------------------------------------------------*/ +/*--- Processing of complete files and streams ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +void compressStream ( FILE *stream, FILE *zStream ) +{ + IntNative retVal; + UInt32 blockCRC, combinedCRC; + Int32 blockNo; + + blockNo = 0; + bytesIn = 0; + bytesOut = 0; + nBlocksRandomised = 0; + + SET_BINARY_MODE(stream); + SET_BINARY_MODE(zStream); + + ERROR_IF_NOT_ZERO ( ferror(stream) ); + ERROR_IF_NOT_ZERO ( ferror(zStream) ); + + bsSetStream ( zStream, True ); + + /*--- Write `magic' bytes B and Z, + then h indicating file-format == huffmanised, + followed by a digit indicating blockSize100k. + ---*/ + bsPutUChar ( 'B' ); + bsPutUChar ( 'Z' ); + bsPutUChar ( 'h' ); + bsPutUChar ( '0' + blockSize100k ); + + combinedCRC = 0; + + if (verbosity >= 2) fprintf ( stderr, "\n" ); + + while (True) { + + blockNo++; + initialiseCRC (); + loadAndRLEsource ( stream ); + ERROR_IF_NOT_ZERO ( ferror(stream) ); + if (last == -1) break; + + blockCRC = getFinalCRC (); + combinedCRC = (combinedCRC << 1) | (combinedCRC >> 31); + combinedCRC ^= blockCRC; + + if (verbosity >= 2) + fprintf ( stderr, " block %d: crc = 0x%8x, combined CRC = 0x%8x, size = %d", + blockNo, blockCRC, combinedCRC, last+1 ); + + /*-- sort the block and establish posn of original string --*/ + doReversibleTransformation (); + + /*-- + A 6-byte block header, the value chosen arbitrarily + as 0x314159265359 :-). A 32 bit value does not really + give a strong enough guarantee that the value will not + appear by chance in the compressed datastream. Worst-case + probability of this event, for a 900k block, is about + 2.0e-3 for 32 bits, 1.0e-5 for 40 bits and 4.0e-8 for 48 bits. + For a compressed file of size 100Gb -- about 100000 blocks -- + only a 48-bit marker will do. NB: normal compression/ + decompression do *not* rely on these statistical properties. + They are only important when trying to recover blocks from + damaged files. + --*/ + bsPutUChar ( 0x31 ); bsPutUChar ( 0x41 ); + bsPutUChar ( 0x59 ); bsPutUChar ( 0x26 ); + bsPutUChar ( 0x53 ); bsPutUChar ( 0x59 ); + + /*-- Now the block's CRC, so it is in a known place. --*/ + bsPutUInt32 ( blockCRC ); + + /*-- Now a single bit indicating randomisation. --*/ + if (blockRandomised) { + bsW(1,1); nBlocksRandomised++; + } else + bsW(1,0); + + /*-- Finally, block's contents proper. --*/ + moveToFrontCodeAndSend (); + + ERROR_IF_NOT_ZERO ( ferror(zStream) ); + } + + if (verbosity >= 2 && nBlocksRandomised > 0) + fprintf ( stderr, " %d block%s needed randomisation\n", + nBlocksRandomised, + nBlocksRandomised == 1 ? "" : "s" ); + + /*-- + Now another magic 48-bit number, 0x177245385090, to + indicate the end of the last block. (sqrt(pi), if + you want to know. I did want to use e, but it contains + too much repetition -- 27 18 28 18 28 46 -- for me + to feel statistically comfortable. Call me paranoid.) + --*/ + + bsPutUChar ( 0x17 ); bsPutUChar ( 0x72 ); + bsPutUChar ( 0x45 ); bsPutUChar ( 0x38 ); + bsPutUChar ( 0x50 ); bsPutUChar ( 0x90 ); + + bsPutUInt32 ( combinedCRC ); + if (verbosity >= 2) + fprintf ( stderr, " final combined CRC = 0x%x\n ", combinedCRC ); + + /*-- Close the files in an utterly paranoid way. --*/ + bsFinishedWithStream (); + + ERROR_IF_NOT_ZERO ( ferror(zStream) ); + retVal = fflush ( zStream ); + ERROR_IF_EOF ( retVal ); + retVal = fclose ( zStream ); + ERROR_IF_EOF ( retVal ); + + ERROR_IF_NOT_ZERO ( ferror(stream) ); + retVal = fclose ( stream ); + ERROR_IF_EOF ( retVal ); + + if (bytesIn == 0) bytesIn = 1; + if (bytesOut == 0) bytesOut = 1; + + if (verbosity >= 1) + fprintf ( stderr, "%6.3f:1, %6.3f bits/byte, " + "%5.2f%% saved, %d in, %d out.\n", + (float)bytesIn / (float)bytesOut, + (8.0 * (float)bytesOut) / (float)bytesIn, + 100.0 * (1.0 - (float)bytesOut / (float)bytesIn), + bytesIn, + bytesOut + ); +} + + +/*---------------------------------------------*/ +Bool uncompressStream ( FILE *zStream, FILE *stream ) +{ + UChar magic1, magic2, magic3, magic4; + UChar magic5, magic6; + UInt32 storedBlockCRC, storedCombinedCRC; + UInt32 computedBlockCRC, computedCombinedCRC; + Int32 currBlockNo; + IntNative retVal; + + SET_BINARY_MODE(stream); + SET_BINARY_MODE(zStream); + + ERROR_IF_NOT_ZERO ( ferror(stream) ); + ERROR_IF_NOT_ZERO ( ferror(zStream) ); + + bsSetStream ( zStream, False ); + + /*-- + A bad magic number is `recoverable from'; + return with False so the caller skips the file. + --*/ + magic1 = bsGetUChar (); + magic2 = bsGetUChar (); + magic3 = bsGetUChar (); + magic4 = bsGetUChar (); + if (magic1 != 'B' || + magic2 != 'Z' || + magic3 != 'h' || + magic4 < '1' || + magic4 > '9') { + bsFinishedWithStream(); + retVal = fclose ( stream ); + ERROR_IF_EOF ( retVal ); + return False; + } + + setDecompressStructureSizes ( magic4 - '0' ); + computedCombinedCRC = 0; + + if (verbosity >= 2) fprintf ( stderr, "\n " ); + currBlockNo = 0; + + while (True) { + magic1 = bsGetUChar (); + magic2 = bsGetUChar (); + magic3 = bsGetUChar (); + magic4 = bsGetUChar (); + magic5 = bsGetUChar (); + magic6 = bsGetUChar (); + if (magic1 == 0x17 && magic2 == 0x72 && + magic3 == 0x45 && magic4 == 0x38 && + magic5 == 0x50 && magic6 == 0x90) break; + + if (magic1 != 0x31 || magic2 != 0x41 || + magic3 != 0x59 || magic4 != 0x26 || + magic5 != 0x53 || magic6 != 0x59) badBlockHeader(); + + storedBlockCRC = bsGetUInt32 (); + + if (bsR(1) == 1) + blockRandomised = True; else + blockRandomised = False; + + currBlockNo++; + if (verbosity >= 2) + fprintf ( stderr, "[%d: huff+mtf ", currBlockNo ); + getAndMoveToFrontDecode (); + ERROR_IF_NOT_ZERO ( ferror(zStream) ); + + initialiseCRC(); + if (verbosity >= 2) fprintf ( stderr, "rt+rld" ); + if (smallMode) + undoReversibleTransformation_small ( stream ); + else + undoReversibleTransformation_fast ( stream ); + + ERROR_IF_NOT_ZERO ( ferror(stream) ); + + computedBlockCRC = getFinalCRC(); + if (verbosity >= 3) + fprintf ( stderr, " {0x%x, 0x%x}", storedBlockCRC, computedBlockCRC ); + if (verbosity >= 2) fprintf ( stderr, "] " ); + + /*-- A bad CRC is considered a fatal error. --*/ + if (storedBlockCRC != computedBlockCRC) + crcError ( storedBlockCRC, computedBlockCRC ); + + computedCombinedCRC = (computedCombinedCRC << 1) | (computedCombinedCRC >> 31); + computedCombinedCRC ^= computedBlockCRC; + }; + + if (verbosity >= 2) fprintf ( stderr, "\n " ); + + storedCombinedCRC = bsGetUInt32 (); + if (verbosity >= 2) + fprintf ( stderr, + "combined CRCs: stored = 0x%x, computed = 0x%x\n ", + storedCombinedCRC, computedCombinedCRC ); + if (storedCombinedCRC != computedCombinedCRC) + crcError ( storedCombinedCRC, computedCombinedCRC ); + + + bsFinishedWithStream (); + ERROR_IF_NOT_ZERO ( ferror(zStream) ); + retVal = fclose ( zStream ); + ERROR_IF_EOF ( retVal ); + + ERROR_IF_NOT_ZERO ( ferror(stream) ); + retVal = fflush ( stream ); + ERROR_IF_NOT_ZERO ( retVal ); + if (stream != stdout) { + retVal = fclose ( stream ); + ERROR_IF_EOF ( retVal ); + } + return True; +} + + +/*---------------------------------------------*/ +Bool testStream ( FILE *zStream ) +{ + UChar magic1, magic2, magic3, magic4; + UChar magic5, magic6; + UInt32 storedBlockCRC, storedCombinedCRC; + UInt32 computedBlockCRC, computedCombinedCRC; + Int32 currBlockNo; + IntNative retVal; + + SET_BINARY_MODE(zStream); + ERROR_IF_NOT_ZERO ( ferror(zStream) ); + + bsSetStream ( zStream, False ); + + magic1 = bsGetUChar (); + magic2 = bsGetUChar (); + magic3 = bsGetUChar (); + magic4 = bsGetUChar (); + if (magic1 != 'B' || + magic2 != 'Z' || + magic3 != 'h' || + magic4 < '1' || + magic4 > '9') { + bsFinishedWithStream(); + fclose ( zStream ); + fprintf ( stderr, "\n%s: bad magic number (ie, not created by bzip2)\n", + inName ); + return False; + } + + smallMode = True; + setDecompressStructureSizes ( magic4 - '0' ); + computedCombinedCRC = 0; + + if (verbosity >= 2) fprintf ( stderr, "\n" ); + currBlockNo = 0; + + while (True) { + magic1 = bsGetUChar (); + magic2 = bsGetUChar (); + magic3 = bsGetUChar (); + magic4 = bsGetUChar (); + magic5 = bsGetUChar (); + magic6 = bsGetUChar (); + if (magic1 == 0x17 && magic2 == 0x72 && + magic3 == 0x45 && magic4 == 0x38 && + magic5 == 0x50 && magic6 == 0x90) break; + + currBlockNo++; + if (magic1 != 0x31 || magic2 != 0x41 || + magic3 != 0x59 || magic4 != 0x26 || + magic5 != 0x53 || magic6 != 0x59) { + bsFinishedWithStream(); + fclose ( zStream ); + fprintf ( stderr, + "\n%s, block %d: bad header (not == 0x314159265359)\n", + inName, currBlockNo ); + return False; + } + storedBlockCRC = bsGetUInt32 (); + + if (bsR(1) == 1) + blockRandomised = True; else + blockRandomised = False; + + if (verbosity >= 2) + fprintf ( stderr, " block [%d: huff+mtf ", currBlockNo ); + getAndMoveToFrontDecode (); + ERROR_IF_NOT_ZERO ( ferror(zStream) ); + + initialiseCRC(); + if (verbosity >= 2) fprintf ( stderr, "rt+rld" ); + undoReversibleTransformation_small ( NULL ); + + computedBlockCRC = getFinalCRC(); + if (verbosity >= 3) + fprintf ( stderr, " {0x%x, 0x%x}", storedBlockCRC, computedBlockCRC ); + if (verbosity >= 2) fprintf ( stderr, "] " ); + + if (storedBlockCRC != computedBlockCRC) { + bsFinishedWithStream(); + fclose ( zStream ); + fprintf ( stderr, "\n%s, block %d: computed CRC does not match stored one\n", + inName, currBlockNo ); + return False; + } + + if (verbosity >= 2) fprintf ( stderr, "ok\n" ); + computedCombinedCRC = (computedCombinedCRC << 1) | (computedCombinedCRC >> 31); + computedCombinedCRC ^= computedBlockCRC; + }; + + storedCombinedCRC = bsGetUInt32 (); + if (verbosity >= 2) + fprintf ( stderr, + " combined CRCs: stored = 0x%x, computed = 0x%x\n ", + storedCombinedCRC, computedCombinedCRC ); + if (storedCombinedCRC != computedCombinedCRC) { + bsFinishedWithStream(); + fclose ( zStream ); + fprintf ( stderr, "\n%s: computed CRC does not match stored one\n", + inName ); + return False; + } + + bsFinishedWithStream (); + ERROR_IF_NOT_ZERO ( ferror(zStream) ); + retVal = fclose ( zStream ); + ERROR_IF_EOF ( retVal ); + return True; +} + + + +/*---------------------------------------------------*/ +/*--- Error [non-] handling grunge ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +void cadvise ( void ) +{ + fprintf ( + stderr, + "\nIt is possible that the compressed file(s) have become corrupted.\n" + "You can use the -tvv option to test integrity of such files.\n\n" + "You can use the `bzip2recover' program to *attempt* to recover\n" + "data from undamaged sections of corrupted files.\n\n" + ); +} + + +/*---------------------------------------------*/ +void showFileNames ( void ) +{ + fprintf ( + stderr, + "\tInput file = %s, output file = %s\n", + inName==NULL ? "(null)" : inName, + outName==NULL ? "(null)" : outName + ); +} + + +/*---------------------------------------------*/ +void cleanUpAndFail ( Int32 ec ) +{ + IntNative retVal; + + if ( srcMode == SM_F2F && opMode != OM_TEST ) { + fprintf ( stderr, "%s: Deleting output file %s, if it exists.\n", + progName, + outName==NULL ? "(null)" : outName ); + if (outputHandleJustInCase != NULL) + fclose ( outputHandleJustInCase ); + retVal = remove ( outName ); + if (retVal != 0) + fprintf ( stderr, + "%s: WARNING: deletion of output file (apparently) failed.\n", + progName ); + } + if (numFileNames > 0 && numFilesProcessed < numFileNames) { + fprintf ( stderr, + "%s: WARNING: some files have not been processed:\n" + "\t%d specified on command line, %d not processed yet.\n\n", + progName, numFileNames, + numFileNames - numFilesProcessed ); + } + exit ( ec ); +} + + +/*---------------------------------------------*/ +void panic ( Char* s ) +{ + fprintf ( stderr, + "\n%s: PANIC -- internal consistency error:\n" + "\t%s\n" + "\tThis is a BUG. Please report it to me at:\n" + "\tjseward@acm.org\n", + progName, s ); + showFileNames(); + cleanUpAndFail( 3 ); +} + + +/*---------------------------------------------*/ +void badBGLengths ( void ) +{ + fprintf ( stderr, + "\n%s: error when reading background model code lengths,\n" + "\twhich probably means the compressed file is corrupted.\n", + progName ); + showFileNames(); + cadvise(); + cleanUpAndFail( 2 ); +} + + +/*---------------------------------------------*/ +void crcError ( UInt32 crcStored, UInt32 crcComputed ) +{ + fprintf ( stderr, + "\n%s: Data integrity error when decompressing.\n" + "\tStored CRC = 0x%x, computed CRC = 0x%x\n", + progName, crcStored, crcComputed ); + showFileNames(); + cadvise(); + cleanUpAndFail( 2 ); +} + + +/*---------------------------------------------*/ +void compressedStreamEOF ( void ) +{ + fprintf ( stderr, + "\n%s: Compressed file ends unexpectedly;\n\t" + "perhaps it is corrupted? *Possible* reason follows.\n", + progName ); + perror ( progName ); + showFileNames(); + cadvise(); + cleanUpAndFail( 2 ); +} + + +/*---------------------------------------------*/ +void ioError ( ) +{ + fprintf ( stderr, + "\n%s: I/O or other error, bailing out. Possible reason follows.\n", + progName ); + perror ( progName ); + showFileNames(); + cleanUpAndFail( 1 ); +} + + +/*---------------------------------------------*/ +void blockOverrun () +{ + fprintf ( stderr, + "\n%s: block overrun during decompression,\n" + "\twhich probably means the compressed file\n" + "\tis corrupted.\n", + progName ); + showFileNames(); + cadvise(); + cleanUpAndFail( 2 ); +} + + +/*---------------------------------------------*/ +void badBlockHeader () +{ + fprintf ( stderr, + "\n%s: bad block header in the compressed file,\n" + "\twhich probably means it is corrupted.\n", + progName ); + showFileNames(); + cadvise(); + cleanUpAndFail( 2 ); +} + + +/*---------------------------------------------*/ +void bitStreamEOF () +{ + fprintf ( stderr, + "\n%s: read past the end of compressed data,\n" + "\twhich probably means it is corrupted.\n", + progName ); + showFileNames(); + cadvise(); + cleanUpAndFail( 2 ); +} + + +/*---------------------------------------------*/ +void mySignalCatcher ( IntNative n ) +{ + fprintf ( stderr, + "\n%s: Control-C (or similar) caught, quitting.\n", + progName ); + cleanUpAndFail(1); +} + + +/*---------------------------------------------*/ +void mySIGSEGVorSIGBUScatcher ( IntNative n ) +{ + if (opMode == OM_Z) + fprintf ( stderr, + "\n%s: Caught a SIGSEGV or SIGBUS whilst compressing,\n" + "\twhich probably indicates a bug in bzip2. Please\n" + "\treport it to me at: jseward@acm.org\n", + progName ); + else + fprintf ( stderr, + "\n%s: Caught a SIGSEGV or SIGBUS whilst decompressing,\n" + "\twhich probably indicates that the compressed data\n" + "\tis corrupted.\n", + progName ); + + showFileNames(); + if (opMode == OM_Z) + cleanUpAndFail( 3 ); else + { cadvise(); cleanUpAndFail( 2 ); } +} + + +/*---------------------------------------------*/ +void uncompressOutOfMemory ( Int32 draw, Int32 blockSize ) +{ + fprintf ( stderr, + "\n%s: Can't allocate enough memory for decompression.\n" + "\tRequested %d bytes for a block size of %d.\n" + "\tTry selecting space-economic decompress (with flag -s)\n" + "\tand failing that, find a machine with more memory.\n", + progName, draw, blockSize ); + showFileNames(); + cleanUpAndFail(1); +} + + +/*---------------------------------------------*/ +void compressOutOfMemory ( Int32 draw, Int32 blockSize ) +{ + fprintf ( stderr, + "\n%s: Can't allocate enough memory for compression.\n" + "\tRequested %d bytes for a block size of %d.\n" + "\tTry selecting a small block size (with flag -s).\n", + progName, draw, blockSize ); + showFileNames(); + cleanUpAndFail(1); +} + + +/*---------------------------------------------------*/ +/*--- The main driver machinery ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +void pad ( Char *s ) +{ + Int32 i; + if ( (Int32)strlen(s) >= longestFileName ) return; + for (i = 1; i <= longestFileName - (Int32)strlen(s); i++) + fprintf ( stderr, " " ); +} + + +/*---------------------------------------------*/ +Bool fileExists ( Char* name ) +{ + FILE *tmp = fopen ( name, "rb" ); + Bool exists = (tmp != NULL); + if (tmp != NULL) fclose ( tmp ); + return exists; +} + + +/*---------------------------------------------*/ +/*-- + if in doubt, return True +--*/ +Bool notABogStandardFile ( Char* name ) +{ + IntNative i; + struct MY_STAT statBuf; + + i = MY_LSTAT ( name, &statBuf ); + if (i != 0) return True; + if (MY_S_IFREG(statBuf.st_mode)) return False; + return True; +} + + +/*---------------------------------------------*/ +void copyDateAndPermissions ( Char *srcName, Char *dstName ) +{ + #if BZ_UNIX + IntNative retVal; + struct MY_STAT statBuf; + struct utimbuf uTimBuf; + + retVal = MY_LSTAT ( srcName, &statBuf ); + ERROR_IF_NOT_ZERO ( retVal ); + uTimBuf.actime = statBuf.st_atime; + uTimBuf.modtime = statBuf.st_mtime; + + retVal = chmod ( dstName, statBuf.st_mode ); + ERROR_IF_NOT_ZERO ( retVal ); + retVal = utime ( dstName, &uTimBuf ); + ERROR_IF_NOT_ZERO ( retVal ); + #endif +} + + +/*---------------------------------------------*/ +Bool endsInBz2 ( Char* name ) +{ + Int32 n = strlen ( name ); + if (n <= 4) return False; + return + (name[n-4] == '.' && + name[n-3] == 'b' && + name[n-2] == 'z' && + name[n-1] == '2'); +} + + +/*---------------------------------------------*/ +Bool containsDubiousChars ( Char* name ) +{ + Bool cdc = False; + for (; *name != '\0'; name++) + if (*name == '?' || *name == '*') cdc = True; + return cdc; +} + + +/*---------------------------------------------*/ +void compress ( Char *name ) +{ + FILE *inStr; + FILE *outStr; + + if (name == NULL && srcMode != SM_I2O) + panic ( "compress: bad modes\n" ); + + switch (srcMode) { + case SM_I2O: strcpy ( inName, "(stdin)" ); + strcpy ( outName, "(stdout)" ); break; + case SM_F2F: strcpy ( inName, name ); + strcpy ( outName, name ); + strcat ( outName, ".bz2" ); break; + case SM_F2O: strcpy ( inName, name ); + strcpy ( outName, "(stdout)" ); break; + } + + if ( srcMode != SM_I2O && containsDubiousChars ( inName ) ) { + fprintf ( stderr, "%s: There are no files matching `%s'.\n", + progName, inName ); + return; + } + if ( srcMode != SM_I2O && !fileExists ( inName ) ) { + fprintf ( stderr, "%s: Input file %s doesn't exist, skipping.\n", + progName, inName ); + return; + } + if ( srcMode != SM_I2O && endsInBz2 ( inName )) { + fprintf ( stderr, "%s: Input file name %s ends in `.bz2', skipping.\n", + progName, inName ); + return; + } + if ( srcMode != SM_I2O && notABogStandardFile ( inName )) { + fprintf ( stderr, "%s: Input file %s is not a normal file, skipping.\n", + progName, inName ); + return; + } + if ( srcMode == SM_F2F && fileExists ( outName ) ) { + fprintf ( stderr, "%s: Output file %s already exists, skipping.\n", + progName, outName ); + return; + } + + switch ( srcMode ) { + + case SM_I2O: + inStr = stdin; + outStr = stdout; + if ( isatty ( fileno ( stdout ) ) ) { + fprintf ( stderr, + "%s: I won't write compressed data to a terminal.\n", + progName ); + fprintf ( stderr, "%s: For help, type: `%s --help'.\n", + progName, progName ); + return; + }; + break; + + case SM_F2O: + inStr = fopen ( inName, "rb" ); + outStr = stdout; + if ( isatty ( fileno ( stdout ) ) ) { + fprintf ( stderr, + "%s: I won't write compressed data to a terminal.\n", + progName ); + fprintf ( stderr, "%s: For help, type: `%s --help'.\n", + progName, progName ); + return; + }; + if ( inStr == NULL ) { + fprintf ( stderr, "%s: Can't open input file %s, skipping.\n", + progName, inName ); + return; + }; + break; + + case SM_F2F: + inStr = fopen ( inName, "rb" ); + outStr = fopen ( outName, "wb" ); + if ( outStr == NULL) { + fprintf ( stderr, "%s: Can't create output file %s, skipping.\n", + progName, outName ); + return; + } + if ( inStr == NULL ) { + fprintf ( stderr, "%s: Can't open input file %s, skipping.\n", + progName, inName ); + return; + }; + break; + + default: + panic ( "compress: bad srcMode" ); + break; + } + + if (verbosity >= 1) { + fprintf ( stderr, " %s: ", inName ); + pad ( inName ); + fflush ( stderr ); + } + + /*--- Now the input and output handles are sane. Do the Biz. ---*/ + errno = 0; + outputHandleJustInCase = outStr; + compressStream ( inStr, outStr ); + outputHandleJustInCase = NULL; + + /*--- If there was an I/O error, we won't get here. ---*/ + if ( srcMode == SM_F2F ) { + copyDateAndPermissions ( inName, outName ); + if ( !keepInputFiles ) { + IntNative retVal = remove ( inName ); + ERROR_IF_NOT_ZERO ( retVal ); + } + } +} + + +/*---------------------------------------------*/ +void uncompress ( Char *name ) +{ + FILE *inStr; + FILE *outStr; + Bool magicNumberOK; + + if (name == NULL && srcMode != SM_I2O) + panic ( "uncompress: bad modes\n" ); + + switch (srcMode) { + case SM_I2O: strcpy ( inName, "(stdin)" ); + strcpy ( outName, "(stdout)" ); break; + case SM_F2F: strcpy ( inName, name ); + strcpy ( outName, name ); + if (endsInBz2 ( outName )) + outName [ strlen ( outName ) - 4 ] = '\0'; + break; + case SM_F2O: strcpy ( inName, name ); + strcpy ( outName, "(stdout)" ); break; + } + + if ( srcMode != SM_I2O && containsDubiousChars ( inName ) ) { + fprintf ( stderr, "%s: There are no files matching `%s'.\n", + progName, inName ); + return; + } + if ( srcMode != SM_I2O && !fileExists ( inName ) ) { + fprintf ( stderr, "%s: Input file %s doesn't exist, skipping.\n", + progName, inName ); + return; + } + if ( srcMode != SM_I2O && !endsInBz2 ( inName )) { + fprintf ( stderr, + "%s: Input file name %s doesn't end in `.bz2', skipping.\n", + progName, inName ); + return; + } + if ( srcMode != SM_I2O && notABogStandardFile ( inName )) { + fprintf ( stderr, "%s: Input file %s is not a normal file, skipping.\n", + progName, inName ); + return; + } + if ( srcMode == SM_F2F && fileExists ( outName ) ) { + fprintf ( stderr, "%s: Output file %s already exists, skipping.\n", + progName, outName ); + return; + } + + switch ( srcMode ) { + + case SM_I2O: + inStr = stdin; + outStr = stdout; + if ( isatty ( fileno ( stdin ) ) ) { + fprintf ( stderr, + "%s: I won't read compressed data from a terminal.\n", + progName ); + fprintf ( stderr, "%s: For help, type: `%s --help'.\n", + progName, progName ); + return; + }; + break; + + case SM_F2O: + inStr = fopen ( inName, "rb" ); + outStr = stdout; + if ( inStr == NULL ) { + fprintf ( stderr, "%s: Can't open input file %s, skipping.\n", + progName, inName ); + return; + }; + break; + + case SM_F2F: + inStr = fopen ( inName, "rb" ); + outStr = fopen ( outName, "wb" ); + if ( outStr == NULL) { + fprintf ( stderr, "%s: Can't create output file %s, skipping.\n", + progName, outName ); + return; + } + if ( inStr == NULL ) { + fprintf ( stderr, "%s: Can't open input file %s, skipping.\n", + progName, inName ); + return; + }; + break; + + default: + panic ( "uncompress: bad srcMode" ); + break; + } + + if (verbosity >= 1) { + fprintf ( stderr, " %s: ", inName ); + pad ( inName ); + fflush ( stderr ); + } + + /*--- Now the input and output handles are sane. Do the Biz. ---*/ + errno = 0; + outputHandleJustInCase = outStr; + magicNumberOK = uncompressStream ( inStr, outStr ); + outputHandleJustInCase = NULL; + + /*--- If there was an I/O error, we won't get here. ---*/ + if ( magicNumberOK ) { + if ( srcMode == SM_F2F ) { + copyDateAndPermissions ( inName, outName ); + if ( !keepInputFiles ) { + IntNative retVal = remove ( inName ); + ERROR_IF_NOT_ZERO ( retVal ); + } + } + } else { + if ( srcMode == SM_F2F ) { + IntNative retVal = remove ( outName ); + ERROR_IF_NOT_ZERO ( retVal ); + } + } + + if ( magicNumberOK ) { + if (verbosity >= 1) + fprintf ( stderr, "done\n" ); + } else { + if (verbosity >= 1) + fprintf ( stderr, "not a bzip2 file, skipping.\n" ); else + fprintf ( stderr, + "%s: %s is not a bzip2 file, skipping.\n", + progName, inName ); + } + +} + + +/*---------------------------------------------*/ +void testf ( Char *name ) +{ + FILE *inStr; + Bool allOK; + + if (name == NULL && srcMode != SM_I2O) + panic ( "testf: bad modes\n" ); + + strcpy ( outName, "(none)" ); + switch (srcMode) { + case SM_I2O: strcpy ( inName, "(stdin)" ); break; + case SM_F2F: strcpy ( inName, name ); break; + case SM_F2O: strcpy ( inName, name ); break; + } + + if ( srcMode != SM_I2O && containsDubiousChars ( inName ) ) { + fprintf ( stderr, "%s: There are no files matching `%s'.\n", + progName, inName ); + return; + } + if ( srcMode != SM_I2O && !fileExists ( inName ) ) { + fprintf ( stderr, "%s: Input file %s doesn't exist, skipping.\n", + progName, inName ); + return; + } + if ( srcMode != SM_I2O && !endsInBz2 ( inName )) { + fprintf ( stderr, + "%s: Input file name %s doesn't end in `.bz2', skipping.\n", + progName, inName ); + return; + } + if ( srcMode != SM_I2O && notABogStandardFile ( inName )) { + fprintf ( stderr, "%s: Input file %s is not a normal file, skipping.\n", + progName, inName ); + return; + } + + switch ( srcMode ) { + + case SM_I2O: + if ( isatty ( fileno ( stdin ) ) ) { + fprintf ( stderr, + "%s: I won't read compressed data from a terminal.\n", + progName ); + fprintf ( stderr, "%s: For help, type: `%s --help'.\n", + progName, progName ); + return; + }; + inStr = stdin; + break; + + case SM_F2O: case SM_F2F: + inStr = fopen ( inName, "rb" ); + if ( inStr == NULL ) { + fprintf ( stderr, "%s: Can't open input file %s, skipping.\n", + progName, inName ); + return; + }; + break; + + default: + panic ( "testf: bad srcMode" ); + break; + } + + if (verbosity >= 1) { + fprintf ( stderr, " %s: ", inName ); + pad ( inName ); + fflush ( stderr ); + } + + /*--- Now the input handle is sane. Do the Biz. ---*/ + errno = 0; + allOK = testStream ( inStr ); + + if (allOK && verbosity >= 1) fprintf ( stderr, "ok\n" ); + if (!allOK) testFailsExist = True; +} + + +/*---------------------------------------------*/ +void license ( void ) +{ + fprintf ( stderr, + + "bzip2, a block-sorting file compressor. " + "Version 0.1pl0, 17-Aug-97.\n" + " \n" + " Copyright (C) 1996, 1997 by Julian Seward.\n" + " \n" + " This program is free software; you can redistribute it and/or modify\n" + " it under the terms of the GNU General Public License as published by\n" + " the Free Software Foundation; either version 2 of the License, or\n" + " (at your option) any later version.\n" + " \n" + " This program is distributed in the hope that it will be useful,\n" + " but WITHOUT ANY WARRANTY; without even the implied warranty of\n" + " MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n" + " GNU General Public License for more details.\n" + " \n" + " You should have received a copy of the GNU General Public License\n" + " along with this program; if not, write to the Free Software\n" + " Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.\n" + " \n" + " The GNU General Public License is contained in the file LICENSE.\n" + " \n" + ); +} + + +/*---------------------------------------------*/ +void usage ( Char *fullProgName ) +{ + fprintf ( + stderr, + "bzip2, a block-sorting file compressor. " + "Version 0.1pl0, 17-Aug-97.\n" + "\n usage: %s [flags and input files in any order]\n" + "\n" + " -h --help print this message\n" + " -d --decompress force decompression\n" + " -f --compress force compression\n" + " -t --test test compressed file integrity\n" + " -c --stdout output to standard out\n" + " -v --verbose be verbose (a 2nd -v gives more)\n" + " -k --keep keep (don't delete) input files\n" + " -L --license display software version & license\n" + " -V --version display software version & license\n" + " -s --small use less memory (at most 2500k)\n" + " -1 .. -9 set block size to 100k .. 900k\n" + " --repetitive-fast compress repetitive blocks faster\n" + " --repetitive-best compress repetitive blocks better\n" + "\n" + " If invoked as `bzip2', the default action is to compress.\n" + " as `bunzip2', the default action is to decompress.\n" + "\n" + " If no file names are given, bzip2 compresses or decompresses\n" + " from standard input to standard output. You can combine\n" + " flags, so `-v -4' means the same as -v4 or -4v, &c.\n" + #if BZ_UNIX + "\n" + #endif + , + + fullProgName + ); +} + + +/*---------------------------------------------*/ +/*-- + All the garbage from here to main() is purely to + implement a linked list of command-line arguments, + into which main() copies argv[1 .. argc-1]. + + The purpose of this ridiculous exercise is to + facilitate the expansion of wildcard characters + * and ? in filenames for halfwitted OSs like + MSDOS, Windows 95 and NT. + + The actual Dirty Work is done by the platform-specific + macro APPEND_FILESPEC. +--*/ + +typedef + struct zzzz { + Char *name; + struct zzzz *link; + } + Cell; + + +/*---------------------------------------------*/ +void *myMalloc ( Int32 n ) +{ + void* p; + + p = malloc ( (size_t)n ); + if (p == NULL) { + fprintf ( + stderr, + "%s: `malloc' failed on request for %d bytes.\n", + progName, n + ); + exit ( 1 ); + } + return p; +} + + +/*---------------------------------------------*/ +Cell *mkCell ( void ) +{ + Cell *c; + + c = (Cell*) myMalloc ( sizeof ( Cell ) ); + c->name = NULL; + c->link = NULL; + return c; +} + + +/*---------------------------------------------*/ +Cell *snocString ( Cell *root, Char *name ) +{ + if (root == NULL) { + Cell *tmp = mkCell(); + tmp->name = (Char*) myMalloc ( 5 + strlen(name) ); + strcpy ( tmp->name, name ); + return tmp; + } else { + Cell *tmp = root; + while (tmp->link != NULL) tmp = tmp->link; + tmp->link = snocString ( tmp->link, name ); + return root; + } +} + + + +/*---------------------------------------------*/ +#define ISFLAG(s) (strcmp(aa->name, (s))==0) + + +IntNative main ( IntNative argc, Char *argv[] ) +{ + Int32 i, j; + Char *tmp; + Cell *argList; + Cell *aa; + + + #if DEBUG + fprintf ( stderr, "bzip2: *** compiled with debugging ON ***\n" ); + #endif + + /*-- Be really really really paranoid :-) --*/ + if (sizeof(Int32) != 4 || sizeof(UInt32) != 4 || + sizeof(Int16) != 2 || sizeof(UInt16) != 2 || + sizeof(Char) != 1 || sizeof(UChar) != 1) { + fprintf ( stderr, + "bzip2: I'm not configured correctly for this platform!\n" + "\tI require Int32, Int16 and Char to have sizes\n" + "\tof 4, 2 and 1 bytes to run properly, and they don't.\n" + "\tProbably you can fix this by defining them correctly,\n" + "\tand recompiling. Bye!\n" ); + exit(1); + } + + + /*-- Set up signal handlers --*/ + signal (SIGINT, mySignalCatcher); + signal (SIGTERM, mySignalCatcher); + signal (SIGSEGV, mySIGSEGVorSIGBUScatcher); + #if BZ_UNIX + signal (SIGHUP, mySignalCatcher); + signal (SIGBUS, mySIGSEGVorSIGBUScatcher); + #endif + + + /*-- Initialise --*/ + outputHandleJustInCase = NULL; + ftab = NULL; + ll4 = NULL; + ll16 = NULL; + ll8 = NULL; + tt = NULL; + block = NULL; + zptr = NULL; + errno = 0; + smallMode = False; + keepInputFiles = False; + verbosity = 0; + blockSize100k = 9; + testFailsExist = False; + bsStream = NULL; + numFileNames = 0; + numFilesProcessed = 0; + workFactor = 30; + + strcpy ( inName, "(none)" ); + strcpy ( outName, "(none)" ); + + strcpy ( progNameReally, argv[0] ); + progName = &progNameReally[0]; + for (tmp = &progNameReally[0]; *tmp != '\0'; tmp++) + if (*tmp == PATH_SEP) progName = tmp + 1; + + + /*-- Expand filename wildcards in arg list --*/ + argList = NULL; + for (i = 1; i <= argc-1; i++) + APPEND_FILESPEC(argList, argv[i]); + + + /*-- Find the length of the longest filename --*/ + longestFileName = 7; + numFileNames = 0; + for (aa = argList; aa != NULL; aa = aa->link) + if (aa->name[0] != '-') { + numFileNames++; + if (longestFileName < (Int32)strlen(aa->name) ) + longestFileName = (Int32)strlen(aa->name); + } + + + /*-- Determine what to do (compress/uncompress/test). --*/ + /*-- Note that subsequent flag handling may change this. --*/ + opMode = OM_Z; + + if ( (strcmp ( "bunzip2", progName ) == 0) || + (strcmp ( "BUNZIP2", progName ) == 0) || + (strcmp ( "bunzip2.exe", progName ) == 0) || + (strcmp ( "BUNZIP2.EXE", progName ) == 0) ) + opMode = OM_UNZ; + + + /*-- Determine source modes; flag handling may change this too. --*/ + if (numFileNames == 0) + srcMode = SM_I2O; else srcMode = SM_F2F; + + + /*-- Look at the flags. --*/ + for (aa = argList; aa != NULL; aa = aa->link) + if (aa->name[0] == '-' && aa->name[1] != '-') + for (j = 1; aa->name[j] != '\0'; j++) + switch (aa->name[j]) { + case 'c': srcMode = SM_F2O; break; + case 'd': opMode = OM_UNZ; break; + case 'f': opMode = OM_Z; break; + case 't': opMode = OM_TEST; break; + case 'k': keepInputFiles = True; break; + case 's': smallMode = True; break; + case '1': blockSize100k = 1; break; + case '2': blockSize100k = 2; break; + case '3': blockSize100k = 3; break; + case '4': blockSize100k = 4; break; + case '5': blockSize100k = 5; break; + case '6': blockSize100k = 6; break; + case '7': blockSize100k = 7; break; + case '8': blockSize100k = 8; break; + case '9': blockSize100k = 9; break; + case 'V': + case 'L': license(); break; + case 'v': verbosity++; break; + case 'h': usage ( progName ); + exit ( 1 ); + break; + default: fprintf ( stderr, "%s: Bad flag `%s'\n", + progName, aa->name ); + usage ( progName ); + exit ( 1 ); + break; + } + + /*-- And again ... --*/ + for (aa = argList; aa != NULL; aa = aa->link) { + if (ISFLAG("--stdout")) srcMode = SM_F2O; else + if (ISFLAG("--decompress")) opMode = OM_UNZ; else + if (ISFLAG("--compress")) opMode = OM_Z; else + if (ISFLAG("--test")) opMode = OM_TEST; else + if (ISFLAG("--keep")) keepInputFiles = True; else + if (ISFLAG("--small")) smallMode = True; else + if (ISFLAG("--version")) license(); else + if (ISFLAG("--license")) license(); else + if (ISFLAG("--repetitive-fast")) workFactor = 5; else + if (ISFLAG("--repetitive-best")) workFactor = 150; else + if (ISFLAG("--verbose")) verbosity++; else + if (ISFLAG("--help")) { usage ( progName ); exit ( 1 ); } + else + if (strncmp ( aa->name, "--", 2) == 0) { + fprintf ( stderr, "%s: Bad flag `%s'\n", progName, aa->name ); + usage ( progName ); + exit ( 1 ); + } + } + + if (opMode == OM_Z && smallMode) blockSize100k = 2; + + if (opMode == OM_Z && srcMode == SM_F2O && numFileNames > 1) { + fprintf ( stderr, "%s: I won't compress multiple files to stdout.\n", + progName ); + exit ( 1 ); + } + + if (opMode == OM_TEST && srcMode == SM_F2O) { + fprintf ( stderr, "%s: -c and -t cannot be used together.\n", + progName ); + exit ( 1 ); + } + + if (opMode != OM_Z) blockSize100k = 0; + + if (opMode == OM_Z) { + allocateCompressStructures(); + if (srcMode == SM_I2O) + compress ( NULL ); + else + for (aa = argList; aa != NULL; aa = aa->link) + if (aa->name[0] != '-') { + numFilesProcessed++; + compress ( aa->name ); + } + } else + if (opMode == OM_UNZ) { + if (srcMode == SM_I2O) + uncompress ( NULL ); + else + for (aa = argList; aa != NULL; aa = aa->link) + if (aa->name[0] != '-') { + numFilesProcessed++; + uncompress ( aa->name ); + } + } else { + testFailsExist = False; + if (srcMode == SM_I2O) + testf ( NULL ); + else + for (aa = argList; aa != NULL; aa = aa->link) + if (aa->name[0] != '-') { + numFilesProcessed++; + testf ( aa->name ); + } + if (testFailsExist) { + fprintf ( stderr, + "\n" + "You can use the `bzip2recover' program to *attempt* to recover\n" + "data from undamaged sections of corrupted files.\n\n" + ); + exit(2); + } + } + return 0; +} + + +/*-----------------------------------------------------------*/ +/*--- end bzip2.c ---*/ +/*-----------------------------------------------------------*/ diff --git a/bzip2.exe b/bzip2.exe new file mode 100644 index 0000000..4b3c4c1 Binary files /dev/null and b/bzip2.exe differ diff --git a/bzip2.txt b/bzip2.txt new file mode 100644 index 0000000..83366bc --- /dev/null +++ b/bzip2.txt @@ -0,0 +1,462 @@ + + + +bzip2(1) bzip2(1) + + +NAME + bzip2, bunzip2 - a block-sorting file compressor, v0.1 + bzip2recover - recovers data from damaged bzip2 files + + +SYNOPSIS + bzip2 [ -cdfkstvVL123456789 ] [ filenames ... ] + bunzip2 [ -kvsVL ] [ filenames ... ] + bzip2recover filename + + +DESCRIPTION + Bzip2 compresses files using the Burrows-Wheeler block- + sorting text compression algorithm, and Huffman coding. + Compression is generally considerably better than that + achieved by more conventional LZ77/LZ78-based compressors, + and approaches the performance of the PPM family of sta- + tistical compressors. + + The command-line options are deliberately very similar to + those of GNU Gzip, but they are not identical. + + Bzip2 expects a list of file names to accompany the com- + mand-line flags. Each file is replaced by a compressed + version of itself, with the name "original_name.bz2". + Each compressed file has the same modification date and + permissions as the corresponding original, so that these + properties can be correctly restored at decompression + time. File name handling is naive in the sense that there + is no mechanism for preserving original file names, per- + missions and dates in filesystems which lack these con- + cepts, or have serious file name length restrictions, such + as MS-DOS. + + Bzip2 and bunzip2 will not overwrite existing files; if + you want this to happen, you should delete them first. + + If no file names are specified, bzip2 compresses from + standard input to standard output. In this case, bzip2 + will decline to write compressed output to a terminal, as + this would be entirely incomprehensible and therefore + pointless. + + Bunzip2 (or bzip2 -d ) decompresses and restores all spec- + ified files whose names end in ".bz2". Files without this + suffix are ignored. Again, supplying no filenames causes + decompression from standard input to standard output. + + You can also compress or decompress files to the standard + output by giving the -c flag. You can decompress multiple + files like this, but you may only compress a single file + this way, since it would otherwise be difficult to sepa- + rate out the compressed representations of the original + files. + + + + 1 + + + + + +bzip2(1) bzip2(1) + + + Compression is always performed, even if the compressed + file is slightly larger than the original. Files of less + than about one hundred bytes tend to get larger, since the + compression mechanism has a constant overhead in the + region of 50 bytes. Random data (including the output of + most file compressors) is coded at about 8.05 bits per + byte, giving an expansion of around 0.5%. + + As a self-check for your protection, bzip2 uses 32-bit + CRCs to make sure that the decompressed version of a file + is identical to the original. This guards against corrup- + tion of the compressed data, and against undetected bugs + in bzip2 (hopefully very unlikely). The chances of data + corruption going undetected is microscopic, about one + chance in four billion for each file processed. Be aware, + though, that the check occurs upon decompression, so it + can only tell you that that something is wrong. It can't + help you recover the original uncompressed data. You can + use bzip2recover to try to recover data from damaged + files. + + Return values: 0 for a normal exit, 1 for environmental + problems (file not found, invalid flags, I/O errors, &c), + 2 to indicate a corrupt compressed file, 3 for an internal + consistency error (eg, bug) which caused bzip2 to panic. + + +MEMORY MANAGEMENT + Bzip2 compresses large files in blocks. The block size + affects both the compression ratio achieved, and the + amount of memory needed both for compression and decom- + pression. The flags -1 through -9 specify the block size + to be 100,000 bytes through 900,000 bytes (the default) + respectively. At decompression-time, the block size used + for compression is read from the header of the compressed + file, and bunzip2 then allocates itself just enough memory + to decompress the file. Since block sizes are stored in + compressed files, it follows that the flags -1 to -9 are + irrelevant to and so ignored during decompression. Com- + pression and decompression requirements, in bytes, can be + estimated as: + + Compression: 400k + ( 7 x block size ) + + Decompression: 100k + ( 5 x block size ), or + 100k + ( 2.5 x block size ) + + Larger block sizes give rapidly diminishing marginal + returns; most of the compression comes from the first two + or three hundred k of block size, a fact worth bearing in + mind when using bzip2 on small machines. It is also + important to appreciate that the decompression memory + requirement is set at compression-time by the choice of + block size. + + + + 2 + + + + + +bzip2(1) bzip2(1) + + + For files compressed with the default 900k block size, + bunzip2 will require about 4600 kbytes to decompress. To + support decompression of any file on a 4 megabyte machine, + bunzip2 has an option to decompress using approximately + half this amount of memory, about 2300 kbytes. Decompres- + sion speed is also halved, so you should use this option + only where necessary. The relevant flag is -s. + + In general, try and use the largest block size memory con- + straints allow, since that maximises the compression + achieved. Compression and decompression speed are virtu- + ally unaffected by block size. + + Another significant point applies to files which fit in a + single block -- that means most files you'd encounter + using a large block size. The amount of real memory + touched is proportional to the size of the file, since the + file is smaller than a block. For example, compressing a + file 20,000 bytes long with the flag -9 will cause the + compressor to allocate around 6700k of memory, but only + touch 400k + 20000 * 7 = 540 kbytes of it. Similarly, the + decompressor will allocate 4600k but only touch 100k + + 20000 * 5 = 200 kbytes. + + Here is a table which summarises the maximum memory usage + for different block sizes. Also recorded is the total + compressed size for 14 files of the Calgary Text Compres- + sion Corpus totalling 3,141,622 bytes. This column gives + some feel for how compression varies with block size. + These figures tend to understate the advantage of larger + block sizes for larger files, since the Corpus is domi- + nated by smaller files. + + Compress Decompress Decompress Corpus + Flag usage usage -s usage Size + + -1 1100k 600k 350k 914704 + -2 1800k 1100k 600k 877703 + -3 2500k 1600k 850k 860338 + -4 3200k 2100k 1100k 846899 + -5 3900k 2600k 1350k 845160 + -6 4600k 3100k 1600k 838626 + -7 5400k 3600k 1850k 834096 + -8 6000k 4100k 2100k 828642 + -9 6700k 4600k 2350k 828642 + + +OPTIONS + -c --stdout + Compress or decompress to standard output. -c will + decompress multiple files to stdout, but will only + compress a single file to stdout. + + + + + + 3 + + + + + +bzip2(1) bzip2(1) + + + -d --decompress + Force decompression. Bzip2 and bunzip2 are really + the same program, and the decision about whether to + compress or decompress is done on the basis of + which name is used. This flag overrides that mech- + anism, and forces bzip2 to decompress. + + -f --compress + The complement to -d: forces compression, regard- + less of the invokation name. + + -t --test + Check integrity of the specified file(s), but don't + decompress them. This really performs a trial + decompression and throws away the result, using the + low-memory decompression algorithm (see -s). + + -k --keep + Keep (don't delete) input files during compression + or decompression. + + -s --small + Reduce memory usage, both for compression and + decompression. Files are decompressed using a mod- + ified algorithm which only requires 2.5 bytes per + block byte. This means any file can be decom- + pressed in 2300k of memory, albeit somewhat more + slowly than usual. + + During compression, -s selects a block size of + 200k, which limits memory use to around the same + figure, at the expense of your compression ratio. + In short, if your machine is low on memory (8 + megabytes or less), use -s for everything. See + MEMORY MANAGEMENT above. + + + -v --verbose + Verbose mode -- show the compression ratio for each + file processed. Further -v's increase the ver- + bosity level, spewing out lots of information which + is primarily of interest for diagnostic purposes. + + -L --license + Display the software version, license terms and + conditions. + + -V --version + Same as -L. + + -1 to -9 + Set the block size to 100 k, 200 k .. 900 k when + compressing. Has no effect when decompressing. + See MEMORY MANAGEMENT above. + + + + 4 + + + + + +bzip2(1) bzip2(1) + + + --repetitive-fast + bzip2 injects some small pseudo-random variations + into very repetitive blocks to limit worst-case + performance during compression. If sorting runs + into difficulties, the block is randomised, and + sorting is restarted. Very roughly, bzip2 persists + for three times as long as a well-behaved input + would take before resorting to randomisation. This + flag makes it give up much sooner. + + + --repetitive-best + Opposite of --repetitive-fast; try a lot harder + before resorting to randomisation. + + +RECOVERING DATA FROM DAMAGED FILES + bzip2 compresses files in blocks, usually 900kbytes long. + Each block is handled independently. If a media or trans- + mission error causes a multi-block .bz2 file to become + damaged, it may be possible to recover data from the + undamaged blocks in the file. + + The compressed representation of each block is delimited + by a 48-bit pattern, which makes it possible to find the + block boundaries with reasonable certainty. Each block + also carries its own 32-bit CRC, so damaged blocks can be + distinguished from undamaged ones. + + bzip2recover is a simple program whose purpose is to + search for blocks in .bz2 files, and write each block out + into its own .bz2 file. You can then use bzip2 -t to test + the integrity of the resulting files, and decompress those + which are undamaged. + + bzip2recover takes a single argument, the name of the dam- + aged file, and writes a number of files "rec0001file.bz2", + "rec0002file.bz2", etc, containing the extracted blocks. + The output filenames are designed so that the use of wild- + cards in subsequent processing -- for example, "bzip2 -dc + rec*file.bz2 > recovered_data" -- lists the files in the + "right" order. + + bzip2recover should be of most use dealing with large .bz2 + files, as these will contain many blocks. It is clearly + futile to use it on damaged single-block files, since a + damaged block cannot be recovered. If you wish to min- + imise any potential data loss through media or transmis- + sion errors, you might consider compressing with a smaller + block size. + + +PERFORMANCE NOTES + The sorting phase of compression gathers together similar + + + + 5 + + + + + +bzip2(1) bzip2(1) + + + strings in the file. Because of this, files containing + very long runs of repeated symbols, like "aabaabaabaab + ..." (repeated several hundred times) may compress + extraordinarily slowly. You can use the -vvvvv option to + monitor progress in great detail, if you want. Decompres- + sion speed is unaffected. + + Such pathological cases seem rare in practice, appearing + mostly in artificially-constructed test files, and in low- + level disk images. It may be inadvisable to use bzip2 to + compress the latter. If you do get a file which causes + severe slowness in compression, try making the block size + as small as possible, with flag -1. + + Incompressible or virtually-incompressible data may decom- + press rather more slowly than one would hope. This is due + to a naive implementation of the move-to-front coder. + + bzip2 usually allocates several megabytes of memory to + operate in, and then charges all over it in a fairly ran- + dom fashion. This means that performance, both for com- + pressing and decompressing, is largely determined by the + speed at which your machine can service cache misses. + Because of this, small changes to the code to reduce the + miss rate have been observed to give disproportionately + large performance improvements. I imagine bzip2 will per- + form best on machines with very large caches. + + Test mode (-t) uses the low-memory decompression algorithm + (-s). This means test mode does not run as fast as it + could; it could run as fast as the normal decompression + machinery. This could easily be fixed at the cost of some + code bloat. + + +CAVEATS + I/O error messages are not as helpful as they could be. + Bzip2 tries hard to detect I/O errors and exit cleanly, + but the details of what the problem is sometimes seem + rather misleading. + + This manual page pertains to version 0.1 of bzip2. It may + well happen that some future version will use a different + compressed file format. If you try to decompress, using + 0.1, a .bz2 file created with some future version which + uses a different compressed file format, 0.1 will complain + that your file "is not a bzip2 file". If that happens, + you should obtain a more recent version of bzip2 and use + that to decompress the file. + + Wildcard expansion for Windows 95 and NT is flaky. + + bzip2recover uses 32-bit integers to represent bit posi- + tions in compressed files, so it cannot handle compressed + + + + 6 + + + + + +bzip2(1) bzip2(1) + + + files more than 512 megabytes long. This could easily be + fixed. + + bzip2recover sometimes reports a very small, incomplete + final block. This is spurious and can be safely ignored. + + +RELATIONSHIP TO bzip-0.21 + This program is a descendant of the bzip program, version + 0.21, which I released in August 1996. The primary dif- + ference of bzip2 is its avoidance of the possibly patented + algorithms which were used in 0.21. bzip2 also brings + various useful refinements (-s, -t), uses less memory, + decompresses significantly faster, and has support for + recovering data from damaged files. + + Because bzip2 uses Huffman coding to construct the com- + pressed bitstream, rather than the arithmetic coding used + in 0.21, the compressed representations generated by the + two programs are incompatible, and they will not interop- + erate. The change in suffix from .bz to .bz2 reflects + this. It would have been helpful to at least allow bzip2 + to decompress files created by 0.21, but this would defeat + the primary aim of having a patent-free compressor. + + Huffman coding necessarily involves some coding ineffi- + ciency compared to arithmetic coding. This means that + bzip2 compresses about 1% worse than 0.21, an unfortunate + but unavoidable fact-of-life. On the other hand, decom- + pression is approximately 50% faster for the same reason, + and the change in file format gave an opportunity to add + data-recovery features. So it is not all bad. + + +AUTHOR + Julian Seward, jseward@acm.org. + + The ideas embodied in bzip and bzip2 are due to (at least) + the following people: Michael Burrows and David Wheeler + (for the block sorting transformation), David Wheeler + (again, for the Huffman coder), Peter Fenwick (for the + structured coding model in 0.21, and many refinements), + and Alistair Moffat, Radford Neal and Ian Witten (for the + arithmetic coder in 0.21). I am much indebted for their + help, support and advice. See the file ALGORITHMS in the + source distribution for pointers to sources of documenta- + tion. Christian von Roques encouraged me to look for + faster sorting algorithms, so as to speed up compression. + Bela Lubkin encouraged me to improve the worst-case com- + pression performance. Many people sent patches, helped + with portability problems, lent machines, gave advice and + were generally helpful. + + + + + + 7 + + diff --git a/bzip2recover.c b/bzip2recover.c new file mode 100644 index 0000000..efdfb3c --- /dev/null +++ b/bzip2recover.c @@ -0,0 +1,399 @@ + +/*-----------------------------------------------------------*/ +/*--- Block recoverer program for bzip2 ---*/ +/*--- bzip2recover.c ---*/ +/*-----------------------------------------------------------*/ + +/*-- + This program is bzip2recover, a program to attempt data + salvage from damaged files created by the accompanying + bzip2 program. + + Copyright (C) 1996, 1997 by Julian Seward. + Guildford, Surrey, UK + email: jseward@acm.org + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + The GNU General Public License is contained in the file LICENSE. +--*/ + + +#include +#include +#include +#include +#include /*-- or try string.h --*/ + +#define UInt32 unsigned int +#define Int32 int +#define UChar unsigned char +#define Char char +#define Bool unsigned char +#define True 1 +#define False 0 + + +Char inFileName[2000]; +Char outFileName[2000]; +Char progName[2000]; + +UInt32 bytesOut = 0; +UInt32 bytesIn = 0; + + +/*---------------------------------------------------*/ +/*--- I/O errors ---*/ +/*---------------------------------------------------*/ + +/*---------------------------------------------*/ +void readError ( void ) +{ + fprintf ( stderr, + "%s: I/O error reading `%s', possible reason follows.\n", + progName, inFileName ); + perror ( progName ); + fprintf ( stderr, "%s: warning: output file(s) may be incomplete.\n", + progName ); + exit ( 1 ); +} + + +/*---------------------------------------------*/ +void writeError ( void ) +{ + fprintf ( stderr, + "%s: I/O error reading `%s', possible reason follows.\n", + progName, inFileName ); + perror ( progName ); + fprintf ( stderr, "%s: warning: output file(s) may be incomplete.\n", + progName ); + exit ( 1 ); +} + + +/*---------------------------------------------*/ +void mallocFail ( Int32 n ) +{ + fprintf ( stderr, + "%s: malloc failed on request for %d bytes.\n", + progName, n ); + fprintf ( stderr, "%s: warning: output file(s) may be incomplete.\n", + progName ); + exit ( 1 ); +} + + +/*---------------------------------------------------*/ +/*--- Bit stream I/O ---*/ +/*---------------------------------------------------*/ + +typedef + struct { + FILE* handle; + Int32 buffer; + Int32 buffLive; + Char mode; + } + BitStream; + + +/*---------------------------------------------*/ +BitStream* bsOpenReadStream ( FILE* stream ) +{ + BitStream *bs = malloc ( sizeof(BitStream) ); + if (bs == NULL) mallocFail ( sizeof(BitStream) ); + bs->handle = stream; + bs->buffer = 0; + bs->buffLive = 0; + bs->mode = 'r'; + return bs; +} + + +/*---------------------------------------------*/ +BitStream* bsOpenWriteStream ( FILE* stream ) +{ + BitStream *bs = malloc ( sizeof(BitStream) ); + if (bs == NULL) mallocFail ( sizeof(BitStream) ); + bs->handle = stream; + bs->buffer = 0; + bs->buffLive = 0; + bs->mode = 'w'; + return bs; +} + + +/*---------------------------------------------*/ +void bsPutBit ( BitStream* bs, Int32 bit ) +{ + if (bs->buffLive == 8) { + Int32 retVal = putc ( (UChar) bs->buffer, bs->handle ); + if (retVal == EOF) writeError(); + bytesOut++; + bs->buffLive = 1; + bs->buffer = bit & 0x1; + } else { + bs->buffer = ( (bs->buffer << 1) | (bit & 0x1) ); + bs->buffLive++; + }; +} + + +/*---------------------------------------------*/ +/*-- + Returns 0 or 1, or 2 to indicate EOF. +--*/ +Int32 bsGetBit ( BitStream* bs ) +{ + if (bs->buffLive > 0) { + bs->buffLive --; + return ( ((bs->buffer) >> (bs->buffLive)) & 0x1 ); + } else { + Int32 retVal = getc ( bs->handle ); + if ( retVal == EOF ) { + if (errno != 0) readError(); + return 2; + } + bs->buffLive = 7; + bs->buffer = retVal; + return ( ((bs->buffer) >> 7) & 0x1 ); + } +} + + +/*---------------------------------------------*/ +void bsClose ( BitStream* bs ) +{ + Int32 retVal; + + if ( bs->mode == 'w' ) { + while ( bs->buffLive < 8 ) { + bs->buffLive++; + bs->buffer <<= 1; + }; + retVal = putc ( (UChar) (bs->buffer), bs->handle ); + if (retVal == EOF) writeError(); + bytesOut++; + retVal = fflush ( bs->handle ); + if (retVal == EOF) writeError(); + } + retVal = fclose ( bs->handle ); + if (retVal == EOF) + if (bs->mode == 'w') writeError(); else readError(); + free ( bs ); +} + + +/*---------------------------------------------*/ +void bsPutUChar ( BitStream* bs, UChar c ) +{ + Int32 i; + for (i = 7; i >= 0; i--) + bsPutBit ( bs, (((UInt32) c) >> i) & 0x1 ); +} + + +/*---------------------------------------------*/ +void bsPutUInt32 ( BitStream* bs, UInt32 c ) +{ + Int32 i; + + for (i = 31; i >= 0; i--) + bsPutBit ( bs, (c >> i) & 0x1 ); +} + + +/*---------------------------------------------*/ +Bool endsInBz2 ( Char* name ) +{ + Int32 n = strlen ( name ); + if (n <= 4) return False; + return + (name[n-4] == '.' && + name[n-3] == 'b' && + name[n-2] == 'z' && + name[n-1] == '2'); +} + + +/*---------------------------------------------------*/ +/*--- ---*/ +/*---------------------------------------------------*/ + +#define BLOCK_HEADER_HI 0x00003141UL +#define BLOCK_HEADER_LO 0x59265359UL + +#define BLOCK_ENDMARK_HI 0x00001772UL +#define BLOCK_ENDMARK_LO 0x45385090UL + +Int32 main ( Int32 argc, Char** argv ) +{ + FILE* inFile; + FILE* outFile; + BitStream* bsIn, *bsWr; + Int32 currBlock, b, wrBlock; + UInt32 bitsRead; + UInt32 bStart[20000]; + UInt32 bEnd[20000]; + UInt32 buffHi, buffLo, blockCRC; + Char* p; + + strcpy ( progName, argv[0] ); + inFileName[0] = outFileName[0] = 0; + + fprintf ( stderr, "bzip2recover: extracts blocks from damaged .bz2 files.\n" ); + + if (argc != 2) { + fprintf ( stderr, "%s: usage is `%s damaged_file_name'.\n", + progName, progName ); + exit(1); + } + + strcpy ( inFileName, argv[1] ); + + inFile = fopen ( inFileName, "rb" ); + if (inFile == NULL) { + fprintf ( stderr, "%s: can't read `%s'\n", progName, inFileName ); + exit(1); + } + + bsIn = bsOpenReadStream ( inFile ); + fprintf ( stderr, "%s: searching for block boundaries ...\n", progName ); + + bitsRead = 0; + buffHi = buffLo = 0; + currBlock = 0; + bStart[currBlock] = 0; + + while (True) { + b = bsGetBit ( bsIn ); + bitsRead++; + if (b == 2) { + if (bitsRead >= bStart[currBlock] && + (bitsRead - bStart[currBlock]) >= 40) { + bEnd[currBlock] = bitsRead-1; + if (currBlock > 0) + fprintf ( stderr, " block %d runs from %d to %d (incomplete)\n", + currBlock, bStart[currBlock], bEnd[currBlock] ); + } else + currBlock--; + break; + } + buffHi = (buffHi << 1) | (buffLo >> 31); + buffLo = (buffLo << 1) | (b & 1); + if ( ( (buffHi & 0x0000ffff) == BLOCK_HEADER_HI + && buffLo == BLOCK_HEADER_LO) + || + ( (buffHi & 0x0000ffff) == BLOCK_ENDMARK_HI + && buffLo == BLOCK_ENDMARK_LO) + ) { + if (bitsRead > 49) + bEnd[currBlock] = bitsRead-49; else + bEnd[currBlock] = 0; + if (currBlock > 0) + fprintf ( stderr, " block %d runs from %d to %d\n", + currBlock, bStart[currBlock], bEnd[currBlock] ); + currBlock++; + bStart[currBlock] = bitsRead; + } + } + + bsClose ( bsIn ); + + /*-- identified blocks run from 1 to currBlock inclusive. --*/ + + if (currBlock < 1) { + fprintf ( stderr, + "%s: sorry, I couldn't find any block boundaries.\n", + progName ); + exit(1); + }; + + fprintf ( stderr, "%s: splitting into blocks\n", progName ); + + inFile = fopen ( inFileName, "rb" ); + if (inFile == NULL) { + fprintf ( stderr, "%s: can't open `%s'\n", progName, inFileName ); + exit(1); + } + bsIn = bsOpenReadStream ( inFile ); + + /*-- placate gcc's dataflow analyser --*/ + blockCRC = 0; bsWr = 0; + + bitsRead = 0; + outFile = NULL; + wrBlock = 1; + while (True) { + b = bsGetBit(bsIn); + if (b == 2) break; + buffHi = (buffHi << 1) | (buffLo >> 31); + buffLo = (buffLo << 1) | (b & 1); + if (bitsRead == 47+bStart[wrBlock]) + blockCRC = (buffHi << 16) | (buffLo >> 16); + + if (outFile != NULL && bitsRead >= bStart[wrBlock] + && bitsRead <= bEnd[wrBlock]) { + bsPutBit ( bsWr, b ); + } + + bitsRead++; + + if (bitsRead == bEnd[wrBlock]+1) { + if (outFile != NULL) { + bsPutUChar ( bsWr, 0x17 ); bsPutUChar ( bsWr, 0x72 ); + bsPutUChar ( bsWr, 0x45 ); bsPutUChar ( bsWr, 0x38 ); + bsPutUChar ( bsWr, 0x50 ); bsPutUChar ( bsWr, 0x90 ); + bsPutUInt32 ( bsWr, blockCRC ); + bsClose ( bsWr ); + } + if (wrBlock >= currBlock) break; + wrBlock++; + } else + if (bitsRead == bStart[wrBlock]) { + outFileName[0] = 0; + sprintf ( outFileName, "rec%4d", wrBlock ); + for (p = outFileName; *p != 0; p++) if (*p == ' ') *p = '0'; + strcat ( outFileName, inFileName ); + if ( !endsInBz2(outFileName)) strcat ( outFileName, ".bz2" ); + + fprintf ( stderr, " writing block %d to `%s' ...\n", + wrBlock, outFileName ); + + outFile = fopen ( outFileName, "wb" ); + if (outFile == NULL) { + fprintf ( stderr, "%s: can't write `%s'\n", + progName, outFileName ); + exit(1); + } + bsWr = bsOpenWriteStream ( outFile ); + bsPutUChar ( bsWr, 'B' ); bsPutUChar ( bsWr, 'Z' ); + bsPutUChar ( bsWr, 'h' ); bsPutUChar ( bsWr, '9' ); + bsPutUChar ( bsWr, 0x31 ); bsPutUChar ( bsWr, 0x41 ); + bsPutUChar ( bsWr, 0x59 ); bsPutUChar ( bsWr, 0x26 ); + bsPutUChar ( bsWr, 0x53 ); bsPutUChar ( bsWr, 0x59 ); + } + } + + fprintf ( stderr, "%s: finished\n", progName ); + return 0; +} + + + +/*-----------------------------------------------------------*/ +/*--- end bzip2recover.c ---*/ +/*-----------------------------------------------------------*/ diff --git a/bzip2recover.exe b/bzip2recover.exe new file mode 100644 index 0000000..61daf5d Binary files /dev/null and b/bzip2recover.exe differ diff --git a/sample1.bz2 b/sample1.bz2 new file mode 100644 index 0000000..18dea60 Binary files /dev/null and b/sample1.bz2 differ diff --git a/sample1.ref b/sample1.ref new file mode 100644 index 0000000..a56e52b Binary files /dev/null and b/sample1.ref differ diff --git a/sample2.bz2 b/sample2.bz2 new file mode 100644 index 0000000..d5a6160 Binary files /dev/null and b/sample2.bz2 differ diff --git a/sample2.ref b/sample2.ref new file mode 100644 index 0000000..34af958 Binary files /dev/null and b/sample2.ref differ diff --git a/test.bat b/test.bat new file mode 100644 index 0000000..30b747d --- /dev/null +++ b/test.bat @@ -0,0 +1,9 @@ +@rem +@rem MSDOS test driver for bzip2 +@rem +type words1 +.\bzip2 -1 < sample1.ref > sample1.rbz +.\bzip2 -2 < sample2.ref > sample2.rbz +.\bzip2 -dvv < sample1.bz2 > sample1.tst +.\bzip2 -dvv < sample2.bz2 > sample2.tst +type words3sh \ No newline at end of file diff --git a/test.cmd b/test.cmd new file mode 100644 index 0000000..f7bc866 --- /dev/null +++ b/test.cmd @@ -0,0 +1,9 @@ +@rem +@rem OS/2 test driver for bzip2 +@rem +type words1 +.\bzip2 -1 < sample1.ref > sample1.rbz +.\bzip2 -2 < sample2.ref > sample2.rbz +.\bzip2 -dvv < sample1.bz2 > sample1.tst +.\bzip2 -dvv < sample2.bz2 > sample2.tst +type words3sh \ No newline at end of file diff --git a/words0 b/words0 new file mode 100644 index 0000000..527fb43 --- /dev/null +++ b/words0 @@ -0,0 +1,7 @@ +***-------------------------------------------------*** +***--------- IMPORTANT: READ WHAT FOLLOWS! ---------*** +***--------- viz: pay attention :-) ---------*** +***-------------------------------------------------*** + +Compiling bzip2 ... + diff --git a/words1 b/words1 new file mode 100644 index 0000000..c75293b --- /dev/null +++ b/words1 @@ -0,0 +1,5 @@ + + +Doing 4 tests (2 compress, 2 uncompress) ... +If there's a problem, things might stop at this point. + diff --git a/words2 b/words2 new file mode 100644 index 0000000..d3cafb9 --- /dev/null +++ b/words2 @@ -0,0 +1,6 @@ + + +Checking test results. If any of the four "cmp"s which follow +report any differences, something is wrong. If you can't easily +figure out what, please let me know (jseward@acm.org). + diff --git a/words3 b/words3 new file mode 100644 index 0000000..5739d18 --- /dev/null +++ b/words3 @@ -0,0 +1,23 @@ + + +If you got this far and the "cmp"s didn't find anything amiss, looks +like you're in business. You should install bzip2 and bunzip2: + + copy bzip2 to a public place, maybe /usr/bin. + In that public place, make bunzip2 a symbolic link + to the bzip2 you just copied there. + Put the manual page, bzip2.1, somewhere appropriate; + perhaps in /usr/man/man1. + +Complete instructions for use are in the preformatted +manual page, in the file bzip2.1.preformatted. + +You can also do "bzip2 --help" to see some helpful information. + +"bzip2 -L" displays the software license. + +Please read the README file carefully. +Finally, note that bzip2 comes with ABSOLUTELY NO WARRANTY. + +Happy compressing! + diff --git a/words3sh b/words3sh new file mode 100644 index 0000000..1139177 --- /dev/null +++ b/words3sh @@ -0,0 +1,12 @@ +If you got this far and the "bzip2 -dvv"s give identical +stored vs computed CRCs, you're probably in business. +Complete instructions for use are in the preformatted manual page, +in the file bzip2.txt. + +You can also do "bzip2 --help" to see some helpful information. +"bzip2 -L" displays the software license. + +Please read the README file carefully. +Finally, note that bzip2 comes with ABSOLUTELY NO WARRANTY. + +Happy compressing! \ No newline at end of file