Real world: corruption

Franco Corbelli edited this page Nov 22, 2024 · 11 revisions

Let's now talk about archive corruption, whether monolithic, multipart, chunked, or encrypted.

The zpaq file format contains almost no redundant information, so there is very little that would allow a corrupted archive to be repaired.

An unintended modification of an archive can easily make part of its content impossible to extract; in the worst case, the entire archive is lost.

It is not possible to "strengthen" the archive format without losing backward compatibility, that is, without preventing the original zpaq from extracting the resulting archives. The practical goal is therefore to minimize potential issues with what is actually available, rather than with what would be ideal to implement.

Let’s go through the various possibilities step by step.

The first scenario is the so-called incomplete transaction.

During archive updates, zpaq begins by writing a specific portion (the so-called jidacheader) initially filled with placeholder data. It then adds the compressed files one by one, in strictly sequential order. At the end, it rewinds to the temporary jidacheader and writes the final version.

Practically all zpaq functions (such as adding files, listing files, and so on) rely on the read_archive function, which reads the archive by identifying these specific data blocks (marking the boundaries of transactions). When it encounters a block written with placeholder data, it concludes that the transaction was not completed correctly and therefore considers the archive corrupted.
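The write-then-finalize scheme and the reader-side check can be illustrated with a toy model. This is only a sketch: the 12-byte header layout, the `TXN1` magic, the placeholder value, and the function names are all invented for illustration and are not the real zpaq format or the real read_archive code.

```python
import io
import struct

# Hypothetical toy header: 4-byte magic + 8-byte payload length.
HEADER = struct.Struct("<4sQ")
MAGIC = b"TXN1"
PLACEHOLDER = 0xFFFFFFFFFFFFFFFF  # stands in for "transaction not finalized"


def append_transaction(buf, payload, crash_before_commit=False):
    """Write a placeholder header, then the data, then rewind and finalize."""
    buf.seek(0, io.SEEK_END)
    start = buf.tell()
    buf.write(HEADER.pack(MAGIC, PLACEHOLDER))
    buf.write(payload)
    if crash_before_commit:
        return  # simulates a power loss or kill: header is never finalized
    buf.seek(start)
    buf.write(HEADER.pack(MAGIC, len(payload)))


def scan(buf):
    """Return (complete_transactions, offset_of_first_bad_header_or_None)."""
    buf.seek(0, io.SEEK_END)
    size = buf.tell()
    pos, complete = 0, 0
    while pos < size:
        buf.seek(pos)
        raw = buf.read(HEADER.size)
        if len(raw) < HEADER.size:
            return complete, pos
        magic, length = HEADER.unpack(raw)
        if magic != MAGIC or length == PLACEHOLDER:
            return complete, pos  # placeholder still in place: incomplete
        pos += HEADER.size + length
        complete += 1
    return complete, None


archive = io.BytesIO()
append_transaction(archive, b"version 1")
append_transaction(archive, b"version 2", crash_before_commit=True)
print(scan(archive))  # (1, 21)
```

The reader stops at the first header that still holds the placeholder, so only the first transaction counts as valid.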

This type of problem is serious and usually occurs because zpaq was interrupted mid-operation for some reason. Common causes include running out of disk space, an unexpected power outage, or a kill command that abruptly terminates the process. Whatever the case, it’s not something that zpaq can generally control. It happens, and there’s nothing that can be done about it.

With other archivers (e.g., 7z), a temporary file is initially created; if the compression is successful, the file is renamed with a .7z extension. Otherwise, the temporary file remains orphaned on the storage device. With RAR, a .rar file is created, but if there is an abrupt interruption, the file remains empty, and so on.

zpaq, for historical reasons, gives only a very understated warning about an incomplete transaction. It simply prints "Incomplete transaction ignored" and moves on.

The problem arises when new data is added to an archive containing an incomplete transaction. The new data is (almost always) lost. The archive remains extractable up to the incomplete transaction (excluded), but in general, everything after that point is corrupted. Understandably, incomplete transactions should NOT be underestimated; they need to be addressed as soon as possible.

zpaqfranz, following zpaq’s style, did not initially place much emphasis on this condition (up to version 60.10). Now, however, the warnings are significantly more noticeable and alarming.

The issue becomes potentially even worse in the case of multipart archives, where transactions are written to separate files (e.g., file0001.zpaq, file0002.zpaq, and so on). In this scenario, an incomplete transaction at, for example, point 27 will result in all subsequent versions (27, 28, 29 ... 100) appearing to be intact but actually being corrupted. Only the data up to version 26 (in our example) will be extractable.
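The rule in this example (everything from the first interrupted part onward is lost) can be written down as a tiny function. This is a minimal sketch: the function name and the list of booleans are hypothetical, and how to decide whether a given part is complete is outside its scope.

```python
def last_extractable_version(part_is_complete):
    """part_is_complete[i] tells whether part i+1 holds a complete transaction.

    Returns the highest version that can still be extracted: parts are only
    usable up to (and excluding) the first incomplete one.
    """
    extractable = 0
    for complete in part_is_complete:
        if not complete:
            break
        extractable += 1
    return extractable


# 100 parts, with the transaction in part 27 interrupted.
parts = [True] * 26 + [False] + [True] * 73
print(last_extractable_version(parts))  # 26
```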

It is, therefore, up to the operator to pay close attention to this type of error and handle it with particular care.

The new versions of zpaqfranz include a special switch (-trim) that activates a range of mitigations for this issue, both for monolithic and multipart archives. At present, it does NOT work for encrypted files or chunked files (those created with the -chunk switch, dividing data into fixed-size pieces).

For monolithic files, after issuing a prominent warning, it attempts to directly cut away the corrupted portion. This can be done (and has been possible for some time) with the trim command in zpaqfranz. Essentially, in our example, the file is truncated to version 26, with the rest being deleted. This approach works as long as the interrupted transaction is the last one. If it’s not, the file is in very poor condition.
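Conceptually, the cut is a plain file truncation at the offset where the last complete transaction ends. The sketch below shows only that step, with a safety copy taken first. It is not zpaqfranz's implementation: the helper name and the `.bak` suffix are assumptions, and the offset is assumed to be already known.

```python
import os
import shutil
import tempfile


def truncate_with_backup(path, keep_bytes):
    """Copy `path` to `path + '.bak'`, then cut `path` down to keep_bytes."""
    size = os.path.getsize(path)
    if not 0 <= keep_bytes <= size:
        raise ValueError("keep_bytes must lie within the current file size")
    backup = path + ".bak"
    shutil.copy2(path, backup)    # keep the damaged original, just in case
    os.truncate(path, keep_bytes)  # drop everything after the good data
    return backup


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "demo.bin")
    with open(archive, "wb") as f:
        # Pretend the last 40 bytes belong to the interrupted transaction.
        f.write(b"A" * 60 + b"B" * 40)
    backup = truncate_with_backup(archive, 60)
    print(os.path.getsize(archive), os.path.getsize(backup))  # 60 100
```

Taking the copy first is a deliberate choice here: truncation cannot be undone, so the damaged file is kept until the trimmed archive has been verified.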

For multipart files, zpaqfranz (using the -trim switch) refuses to operate if the interrupted transaction is not the very last one. In such cases, it renames the corrupted file (e.g., file0027.zpaq to file0027.spaz) and proceeds with the process.

When dealing with multipart files containing incomplete transactions (possibly multiple), it is possible to attempt a reconstruction—within limits—using the trim command. This command renames the corrupted parts and replaces them with empty files (i.e., files with zero length). This may allow, if you’re lucky, the extraction of the archive’s contents (albeit accompanied by a barrage of warning messages).

At that point, you can recreate a "clean" archive from the extracted data. In short, it’s the last resort before losing everything.
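The rename-and-replace step described above can be sketched by hand as follows. This is an illustration of the idea only, not the code of the trim command: the helper name is hypothetical, and the `.spaz` suffix is taken from the example earlier on this page.

```python
from pathlib import Path
import tempfile


def sideline_part(part, new_suffix=".spaz"):
    """Rename a damaged part and leave a zero-length file in its place."""
    part = Path(part)
    sidelined = part.with_suffix(new_suffix)
    if sidelined.exists():
        raise FileExistsError(sidelined)  # never overwrite an earlier rescue
    part.rename(sidelined)  # e.g. file0027.zpaq -> file0027.spaz
    part.touch()            # empty placeholder with the original name
    return sidelined


with tempfile.TemporaryDirectory() as tmp:
    part = Path(tmp) / "file0027.zpaq"
    part.write_bytes(b"damaged part")
    moved = sideline_part(part)
    print(moved.name, part.stat().st_size)  # file0027.spaz 0
```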

Here is the short version:

If you see a message saying "incomplete transaction," DO NOT ignore it.

Your archive is corrupted, and you need to address it. If you're unsure how to proceed, ask me for help, and I'll guide you.

Inconsistent Archives

An archive is considered inconsistent when its internal structure or content does not match what is expected. This can happen for various reasons, such as:

  • Incomplete transactions (as discussed earlier).
  • Modifications or corruption of the archive file due to hardware failures, software bugs, or improper handling.
  • Combining incompatible or mismatched portions of multipart archives.

Inconsistent archives may still allow partial data extraction, but the results are unreliable and often lead to further issues. It’s crucial to identify and resolve inconsistencies promptly to avoid permanent data loss. If you encounter an inconsistency message, don't hesitate to ask for assistance—I can help you troubleshoot and attempt a recovery.

zpaq offers numerous parameters and a wide range of compression and deduplication algorithms. It’s not guaranteed that every combination of these will produce an extractable file. The default settings are safe, meaning that if the process completes successfully, you can be reasonably confident that the archive is extractable.

However, certain combinations, especially modifications to the -fragment parameter for fine-tuning compression, can make extraction difficult and, in some cases, impossible, potentially resulting in data loss in the worst-case scenario.

Strictly speaking, it is incorrect to call such an archive "corrupted" (it was written exactly as requested), but I won't go into the details.

Short version: Pay special attention if you modify the default preset parameters. It’s assumed that you know exactly what you're doing. If you're unsure, leave the defaults as they are, as they work well in nearly all circumstances.

Inability to Extract Files Due to Filesystem Limitations

This is another case that, in reality, is not related to corruption but rather to improper usage. zpaqfranz is extremely cross-platform. Archives created on systems like Macintosh can be easily extracted on Windows machines and vice versa. However, Windows and non-Windows systems have different filesystem limits.

For example, the most well-known limitation on Windows is the 255-character maximum file name length (this opens up a complex discussion, which we won’t delve into here). Additionally, Windows is case-insensitive. PIPPO and pippo can coexist on Linux, but not on Windows. There are also reserved file names (e.g., lpt1) and other system-specific restrictions.

Therefore, if you try to extract the contents of a zpaq archive on a different system, you may encounter issues. zpaqfranz includes several internal mitigation mechanisms, as well as specific switches to force data extraction.

In general, "strange" file names (a common example is email filenames in .eml format extracted from Thunderbird and copied from the email subject) or very long file names can cause problems. This is not my fault, but it’s something to be aware of.

Of course, there is an option to simulate an extraction (on zpaqfranz) to test for the existence of potential issues like this.
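To get a rough idea in advance, a list of archived paths can be screened for names that are likely to cause trouble on Windows. The sketch below is not a zpaqfranz feature: the function name and problem labels are invented, the reserved-name and illegal-character sets are the commonly documented Windows ones, and lowercase comparison only approximates Windows case-insensitivity.

```python
# Device names Windows reserves regardless of extension (lpt1.txt included).
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL"}
WINDOWS_RESERVED |= {f"COM{i}" for i in range(1, 10)}
WINDOWS_RESERVED |= {f"LPT{i}" for i in range(1, 10)}
ILLEGAL_CHARS = set('<>:"|?*\\')
MAX_COMPONENT = 255  # per-component limit mentioned above


def check_for_windows(paths):
    """Return (path, problem) pairs for '/'-separated archive paths."""
    problems = []
    seen_lowercase = {}
    for path in paths:
        for component in path.split("/"):
            if not component:
                continue
            if len(component) > MAX_COMPONENT:
                problems.append((path, "too-long"))
            if component.split(".")[0].upper() in WINDOWS_RESERVED:
                problems.append((path, "reserved-name"))
            if ILLEGAL_CHARS & set(component):
                problems.append((path, "illegal-character"))
        # Two paths differing only by case cannot coexist on Windows.
        first = seen_lowercase.setdefault(path.lower(), path)
        if first != path:
            problems.append((path, "case-collision"))
    return problems


names = ["docs/PIPPO", "docs/pippo", "ports/lpt1.txt", "mail/Re: hello?.eml"]
for path, problem in check_for_windows(names):
    print(f"{problem}: {path}")
```

The last name mimics an .eml file named after an email subject, the typical "strange" case mentioned above.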

"Real" Archive Corruption

This is, paradoxically, a very rare case: one where, for example, a hardware issue causes part of the archive to be modified externally. The likelihood of being able to "fix" this type of damage is extremely low, because the zpaq format carries no error-correction data and practically no redundancy. The damage could result in a total loss (meaning you can't extract anything) or in a more modest issue (you may only be unable to extract a few files). There's no simple way to determine the extent of the damage.

In practice, this type of corruption is less likely to occur during storage and more likely during transmission. For instance, you may prepare a local file and then send it to a remote server, perhaps using rsync. In this case, the risk of true corruption does exist.

zpaqfranz mitigates this problem with various precautions and switches (and the backup command). Essentially, it calculates integrity codes in separate files, which can then be compared to quickly confirm that two different copies of the same archive (perhaps one on an HDD and another on a NAS, or a USB drive, etc.) are truly identical.
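The general idea of a separate integrity file can be sketched as follows. This does not reproduce zpaqfranz's own file format or hash choice: SHA-256, the `.sha256` sidecar name, and the function names are all assumptions made for the example.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_sidecar(path):
    """Store the digest next to the archive, in a separate small file."""
    sidecar = path + ".sha256"
    with open(sidecar, "w", encoding="ascii") as f:
        f.write(file_digest(path) + "\n")
    return sidecar


def matches_sidecar(copy_path, sidecar_path):
    """Check a (possibly remote or transferred) copy against the sidecar."""
    with open(sidecar_path, encoding="ascii") as f:
        expected = f.read().strip()
    return file_digest(copy_path) == expected


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "backup.bin")
    with open(original, "wb") as f:
        f.write(b"archive bytes" * 1000)
    sidecar = write_sidecar(original)

    copy = os.path.join(tmp, "copy.bin")
    shutil.copyfile(original, copy)
    print(matches_sidecar(copy, sidecar))  # True

    with open(copy, "r+b") as f:  # flip one byte, as a bad transfer might
        f.write(b"X")
    print(matches_sidecar(copy, sidecar))  # False
```

Only the small sidecar file has to travel between machines for the comparison, not a second full copy of the archive.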

Recommendations

Be careful NOT to ignore any unusual messages. Avoid using "strange" combinations of switches unless you know exactly what they do. For truly important data, consider making more than one copy (e.g., two copies) to safeguard against such problems. Personally, I make up to seven copies.

zpaqfranz offers a variety of mechanisms for testing the integrity of your archives, ranging from quick checks to paranoid-level inspections. Of course, the time required varies. In some cases, a quick check is sufficient, while in others, you can schedule in-depth overnight verifications.

Another tip: be cautious with scheduled procedures that use batch files. They are often "stupid," meaning that in the event of a problem, they continue without notifying the user. Periodically verify manually that everything is in order.
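A scheduled job can be made less "stupid" by wrapping each command so that a failure stops the run instead of passing silently. The sketch below is a generic wrapper, not part of zpaqfranz: the function name is hypothetical, a Python one-liner stands in for the real backup command, and since this page does not say which exit code accompanies the warning, both the exit code and the output text are checked.

```python
import subprocess
import sys

# Message quoted earlier on this page; matched case-insensitively.
SUSPICIOUS = ("incomplete transaction",)


def run_and_verify(cmd):
    """Run cmd; raise if it fails or prints a message that must not be ignored."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = (result.stdout + result.stderr).lower()
    hits = [msg for msg in SUSPICIOUS if msg in output]
    if result.returncode != 0 or hits:
        raise RuntimeError(
            f"{cmd[0]} exited with {result.returncode}; suspicious messages: {hits}"
        )
    return result.stdout


# Stand-in for a real backup command that emits the warning but exits with 0.
fake_backup = [sys.executable, "-c", "print('Incomplete transaction ignored')"]
try:
    run_and_verify(fake_backup)
except RuntimeError as err:
    print("backup needs attention:", err)
```

In a real scheduled job, the `except` branch is where a notification (mail, log entry, non-zero exit) would go.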

Encrypted, chunked

ongoing
