Skip to content

Latest commit

 

History

History

RDChecksum

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 

OVERVIEW

RDChecksum version 0.87

Creates, updates or verifies a list of file checksums (hashes), for data corruption or offline file change detection purposes.

This tool can save a lot of time, because 1) it can update the checksums only for those files that have changed in the meantime (according to their 'last modified' timestamps), and 2) it can resume an interrupted checksum verification, instead of having to start from the first file again.

RATIONALE

In an ideal world, all filesystems would have integrated data checksums like ZFS does. This tool is a poor man's substitute for such filesystem-level checksums, in order to detect data corruption. It can also help detect data corruption when transferring files over a network.

Storage devices are supposed to use checksums to detect or even correct data errors, and most data transport protocols should do the same. However, computer systems are becoming more complex and more brittle at the same time, often due to economic pressure.

As a result, I have very rarely performed a large backup or file copy operation without hitting some data integrity issue. For example, interrupting with Ctrl+C an rsync transfer to an SMB network share tends to corrupt the destination files, and resuming the transfer will not fix such corruption.

Another usage scenario is to just detect offline file changes, see further below for more information.

There are many alternative checksum/hash tools around, but I decided to write a new one out of frustration with the existing software. Advantages of this tool are:

  • Checksum list file update

    If only a few files have changed, there is no need to checksum all of them again. None of the other checksum tools I know of have this feature. I actually wrote this tool because I found that very irritating.

    See option --update .

  • Resumable verification.

    I often verify large amounts of data on conventional hard disks, and it can take hours to complete. But sometimes I need to interrupt the process, or perhaps I inadvertently close the wrong terminal window. All other checksum tools I know will then restart from scratch, which is very annoying, as it is an unnecessary waste of my precious time.

    See option --resume-from-line .

  • It is possible to automate the processing of files that failed verification.

    The verification report file has a legible but strict data format. An automated tool can parse it and, for example, copy all failed files again.

Instead of using a checksum/hash tool, you could use an intrusion detection tool like Tripwire, AIDE, Samhain, afick, OSSEC or integrit, as most of them can also checksum files and test for changes on demand. However, I tried or evaluated a few such tools and found them rather user unfriendly and not flexible enough.

For disadvantages and other issues of RDChecksum see the CAVEATS section below.

USAGE

rdchecksum.pl --create [options] [--] [directory]

rdchecksum.pl --update [options] [--] [directory]

rdchecksum.pl --verify [options]

Argument 'directory' is optional and defaults to the current directory ('.').

If 'directory' is the current directory ('.'), then the filenames in the checksum list file will be like 'file.txt'. Otherwise, the filenames will be like 'directory/file.txt'.

The directory path is not normalised, except for removing any trailing slashes. For example, "file.txt" and "././file.txt" will be considered different files.

The specified directory will be scanned for files, and then each subdirectory will be recursively scanned. The resulting order looks like this:

dir1/file1.txt
dir1/file2.txt
dir1/subdir1/file.txt
dir1/subdir1/subdir/file.txt
dir1/subdir2/file.txt

This tool only processes files. Empty directories will not be recorded.

The checksum list file itself (FileChecksums.txt by default), the verification report file and any other related files with those basenames will be automatically skipped from the checksum list file, provided that their filenames' basedirs mach the 'directory' argument. Filenames are not normalised: if you specify an absolute directory path to scan, and the checksum list file can be encountered during scanning, then you should specify an absolute checksum list filename with option '--checksum-file' for this automatic exclusion to work.

Usage examples:

cd somewhere && rdchecksum.pl --create

cd somewhere && rdchecksum.pl --verify

OPTIONS

Command-line options are read from environment variable RDCHECKSUM_OPTIONS first, and then from the command line.

  • -h, --help

    Print this help text.

  • --help-pod

    Preprocess and print the POD section. Useful to generate the README.pod file.

  • --version

    Print this tool's name and version number (0.87).

  • --license

    Print the license.

  • --self-test

    Run the built-in self-tests.

  • --

    Terminate options processing. Useful to avoid confusion between options and filenames that begin with a hyphen ('-'). Recommended when calling this script from another script, where the filename comes from a variable or from user input.

  • --create

    Creates a checksum list file.

    Non-regular files, such as FIFOs (named pipes), will be automatically skipped.

    When creating a checksum list file named FileChecksums.txt , a temporary file named FileChecksums.txt.inProgress will also be created. If this script is interrupted, the temporary file will remain behind.

    Checksum list files FileChecksums.txt , FileChecksums.txt.inProgress and FileChecksums.txt.previous will be automatically excluded from the checksum list file. Any eventual verification report files FileChecksums.txt.verification.report , FileChecksums.txt.verification.report.resume and FileChecksums.txt.verification.report.resume.tmp will be automatically excluded too. However, there is a limitation due to the lack of filename normalisation, see the related note further above.

  • --update

    Updates a checksum list file.

    Files that do not exist anymore on disk will be deleted from the checksum list file, and new files will be added.

    For those files that are still on disk, their checksums will only be updated if their sizes or last modified timestamps have changed, unless option --always-checksum is specified.

    Make sure you pass the same directory as you did when creating the checksum list file. Otherwise, all files will be checksummed again.

    When updating a checksum list file named FileChecksums.txt , a temporary file named FileChecksums.txt.inProgress will also be created. If this script is interrupted, the temporary file will remain behind. If the update succeeds, the previous file will be renamed to FileChecksums.txt.previous .

  • --verify

    Verifies the files listed in the checksum list file.

    A report file named FileChecksums.txt.verification.report will be created. Only failed files will show up in the report.

    It is possible to parse the report in order to automatically process the files that failed verification.

    Temporary files FileChecksums.txt.verification.report.resume and FileChecksums.txt.verification.report.resume.tmp will be created and may remain behind if the script gets killed. See --resume-from-line for more information.

    You may want to flush the disk cache before verifying. Otherwise, you may be testing some of the files from memory, so you wouldn't notice if they are actually corrupt on disk.

  • --checksum-file=filename

    Specifies the checksum list file to create, update or verify.

    The default filename is FileChecksums.txt .

    The checksum list file itself and the verification report file are automatically excluded from the checksum list file if encountered during directory scanning, but there is a limitation due to the lack of filename normalisation, see the related note further above.

  • --report-file=filename

    Specifies the verification report filename which option '--verify' will create.

    The default is the checksum file name plus suffix '.verification.report', so the verification report filename is FileChecksums.txt.verification.report by default.

    An eventual verification resume file will have extension '.resume' appended, so it will be named FileChecksums.txt.verification.report.resume by default.

    Operations '--create' and '--update' automatically exclude the verification report file and the eventual verification resume file from the checksum list file if encountered during directory scanning, but there is a limitation due to the lack of filename normalisation, see the related note further above.

  • --checksum-type=xxx

    Supported checksum types are:

    • CRC-32

      Unsafe for protecting against intentional modification.

    • Adler-32

      Faster than CRC-32, but weak for small files. Unsafe for protecting against intentional modification.

    • none

      No checksum will be calculated or verified. Only the file metadata (size and last modified timestamp) will be used.

    The default checksum type is CRC-32. This option's argument is case insensitive.

    Changes in the checksum type only take effect when a checksum is recalculated for a file. Existing entries in the checksum list file will not be modified. Therefore, you may want to recreate the checksum list file from scratch if you select a different checksum type.

  • --always-checksum

    When updating a checksum list file, do not skip files that have apparently not changed, according the file size and last modified timestamp, but always recalculate the checksum.

    This way, the indication whether a file has changed becomes reliable, at the cost of disk performance.

  • --resume-from-line=n

    Before starting verification, skip (n - 1) text lines at the beginning of FileChecksums.txt . This option allows you to manually resume a previous, unfinished verification.

    During verification, file FileChecksums.txt.verification.report.resume is created and periodically updated with the latest verified line number. The update period is one minute. In order to guarantee an atomic update, temporary file FileChecksums.txt.verification.report.resume.tmp will be created and then moved to the final filename.

    If verification is completed, FileChecksums.txt.verification.report.resume is automatically deleted, but if verification is interrupted, FileChecksums.txt.verification.report.resume will remain behind and will contain the line number you can resume from. It the script gets suddenly killed and cannot gracefully stop, the line number in that file will lag up to the update period, and the temporary file might also be left behind.

    Before resuming, remember to rename or copy the last report file (should you need it), because it will be overwritten, so you will then lose the list of files that had failed the previous verification attempt.

  • --verbose

    Print the name of each file and directory found on disk.

    Without this option, operations --create, --update and --verify display a progress indicator every few seconds.

  • --no-progress-messages

    Suppress the progress messages.

  • --no-update-messages

    During an update, do not display which files in the checksum list file are new, missing or have changed.

  • --include=regex

    See section FILTERING FILENAMES WITH REGULAR EXPRESSIONS below.

  • --exclude=regex

    See section FILTERING FILENAMES WITH REGULAR EXPRESSIONS below.

DETECTING OFFLINE FILE CHANGES

RDChecksum can be used just to detect offline file changes.

You would normally install some sort of online file monitor or intrusion detection tool like Tripwire for that purpose, but installing such software often requires administrative privileges and is usually not trivial to configure and maintain.

Besides, sometimes immediate notification is not so important, so that regular file scans at non-busy times suffices. Or you may need such a service only for forensic analysis. Your needs may be as trivial as getting best-effort early warnings if some colleague changes some shared documents.

RDChecksum is of course rather simple in comparison to purpose-built tools. It does not scale as much, and it does not store and check most file attributes and file permissions. But it may still be enough for many situations.

In order to detect simple file changes quickly, create the checksum list file with option --checksum-type=none . Afterwards, run RDChecksum at regular intervals with options --update and --checksum-type=none , and check its exit code.

If a simple change detection based on filenames and 'last modified' timestamps is not safe enough for you, drop option --checksum-type=none when creating and updating the checksum list file, and specify option --always-checksum when updating.

If you specify option --no-progress-messages when updating, and you redirect RDChecksum's stdout to a file, you can then e-mail the generated output as a human-readable file change notification. See script EmailFileChanges.sh for an example.

FILTERING FILENAMES WITH REGULAR EXPRESSIONS

Filename filtering only applies to operations --create and --update .

Say you run this tool as follows:

rdchecksum.pl --create dir1

The specified directory dir1 is not subject to any filtering.

Let us say that the first filename found on disk is dir1/fileA.txt .

All --include and --exclude options are applied as Perl regular expressions against "dir1/fileA.txt", in the same order as these options are given on the command line. The first expression that matches wins.

If no regular expression matches:

  • If all filtering expressions are --include , the file will be excluded.

  • If at least one --exclude option is present, the file will be included.

Let us say that a subdirectory named dir1/dir2 is also found. Before descending into it, all --include and --exclude options are applied against "dir1/dir2/". Note the trailing slash.

If the directory is excluded, nothing else underneath will be considered anymore.

The regular expression itself is internally converted into UTF-8 first, and is applied against pathnames that are coded in UTF-8 too, so there should not be any issues with international characters.

FILTERING EXAMPLES

  • Skip files that end with '.jpg':

    --exclude='\.jpg\z'

    We are using \z instead of the usual $ to match the end of the filename, in case the filename contains new-line characters.

    Directory names end with a slash ('/'), so they will not match.

    The single quoting (') is there just to minimise interference from the shell. Otherwise, we would have to quote special shell characters such as '\' and '$'.

  • Skip files that end with either '.jpg' or '.jpeg':

    --exclude='\.(jpg|jpeg)\z'
  • Skip files that end with either '.jpg' or '.jpeg', but case insensitively, so that .JPG and .jpEg would also match:

    --exclude='(?i)\.(jpg|jpeg)\z'
  • Include only files that end with '.txt':

    --include='/\z'  --include='\.txt\z'

    The /\z rule means "anything that ends with a slash" and allows descending into all subdirectories. Otherwise, directory names that do not match the .txt rule will be filtered out too.

  • Do not descend into any subdirectories:

    --exclude='/\z'

    Anything that ends with a slash is a directory.

    Matching just a slash anywhere may not work properly, because if there is a starting directory like "my-dir", then filenames at top-level will be like "my-dir/file.txt". Therefore, the slash needs to match only at the end, in order to prevent descending into a directory like "my-dir/subdir/".

  • Skip all subdirectories named 'tmp':

    --exclude='(\A|/)tmp/\z'

    Including a trailing slash after 'tmp' makes sure it only matches directories, and not files.

    Expression (\A|/) will match either from the beginning, so "tmp/" will match, or after a slash, so that a subdirectory like "abc/tmp/" will match too. A name like "/xtmp" will not match.

    The \z is there to match the slash only at the end. Otherwise, if the starting directory itself is called "tmp", then a top-level file like "tmp/file.txt" will be filtered out too. Matching the slash only at the end avoids the risk of filtering out the starting directory specified on the command line.

  • Skip all files which names start with a digit or an ASCII lowecase letter:

    --exclude='(?xx)  (\A|/)  [0-9 a-z]  [^/]*  \z'

    (?xx) makes the expression more readable by allowing spaces between components (unquoted spaces will be discarded).

    (\A|/) indicates that matching must start right at the beginning of the path, or after a slash. Otherwise, filenames such as !abc and dir/!abc will also match, even though they start with '!'.

    0-9 matches digits, and a-z matches lowercase ASCII letters (there are 26 of them, so letters with diacritical marks are not included). The brackets [0-9 a-z] mean that one such character must exist.

    [^/]* allows for any characters afterwards, as long as it is not a slash. We want to match only filenames, so any slashes to the right are to be avoided.

    \z specifies that the end of the path must come at this point. Without it, other slashes to the right of the match would still be tolerated.

EXIT CODE

An exit code of 0 means success, but see exit code 1 below.

An exit code of 1 means that --update was specified and the checksum list file has changed, because at least one file was new, no longer there, or has changed. A change in the checksum type can also trigger such an indication.

Any other exit code means that there was an error, or the script was interrupted by a signal.

SIGNALS

Reception of signals SIGTERM, SIGINT (usually Ctrl+C) and SIGHUP (usually closing the terminal window) make this script gracefully stop the current operation. The script then kills itself with the same signal. Most other signals will kill the script straight away.

SIGHUP will probably not be handled as a normal request to stop if you close the terminal, because writing to sdtout or stderr will fail immediately and will make this script quit beforehand.

USING background.sh

It is probably most convenient to run this tool with another script of mine called background.sh , so that it runs with low priority and you get a visual notification when finished. For example:

background.sh rdchecksum.pl --verify

CHECKSUM LIST FILE FORMAT

The generated file with the list of checksums looks like this:

2019-12-31T20:15:01.200  CRC-32  12345678  1,234,567  subdir/file1.txt
2019-12-31T20:15:01.300  CRC-32  90ABCDEF  2,345,678  subdir/file2.txt

The file begins with the UTF-8 BOM. The column separator is the tab character.

Some characters in the filenames are escaped with URL encoding. For example, the tab character is escaped to "%09".

VERIFICATION REPORT FILE FORMAT

A verification report file looks like this:

subdir/file1.txt  No such file or directory
subdir/file2.txt  Some other error

The file format and character escaping are similar to the checksum list files.

The following shell commands align and reorder the columns for convenience when manually inspecting such files:

column  --table  --separator $'\t'                     FileChecksums.txt.verification.report
column  --table  --separator $'\t'  --table-order 2,1  FileChecksums.txt.verification.report

CAVEATS

  • This tools is rather young and could benefit from more options.

    If you need more features, drop me a line.

  • You can only specify one directory as the starting point.

    If you want to process multiple directories on different locations at once, you can work-around it by placing symlinks to all those directories into a single directory, and passing that single directory to RDChecksum.

  • When updating a checksum list file, the logic that detects whether a file has changed can be fooled if you move and rename files around, so that previous filenames and their file sizes still match. The reason ist that move and rename operations do not change the file's last modified timestamp.

    This shortcoming is not unique to RDChecksum, you can fool GNU Make this way too.

    If preventing this scenario is important to you, use option --always-checksum.

  • If you move or rename files or directories, this tool will neither detect it nor update the checksum list file accordingly. The affected files will be processed again from scratch on the next checksum list file update, as if they were new or missing files.

  • The granularity level is one file.

    If you data consists of a single huge file, you will not be able to resume an interrupted verification.

  • Processing is single threaded.

    If you have a very fast SSD and a multicore processor, you will probably be waiting longer than necessary.

    When using conventional disks (HDDs), reading several files in parallel would probably only decrease performance, due to the permanent seeks. So implementing multiprocessing would only really help with SSDs.

    This script could also benefit from asynchronous I/O, but that would not bring a lot, because checksumming is usually pretty fast compared to disk I/O, and because operating systems normally implement a pretty good "read ahead" optimisation when your program mainly issues sequential disk reads.

  • There is no symbolic link loop detection (protection against circular links).

    In such a situation, this tool will run forever.

  • Memory usage will be higher than with alternative tools.

    The main reason is that this script needs to sort the list of filenames inside each directory traversed, in order to support the --update operation later on. All filenames in the current directory are loaded into memory, and part of them will remain in memory while descending to its subdirectories.

    Therefore, if you have directories with a huge number of files or subdirectories, you will see increased memory usage. CPU usage may also be higher than expected due to the sorting operation.

  • UTF-8 assumption

    This tool assumes that all filenames returned from syscalls like readdir are encoded in UTF-8.

    It should be the case in all modern operating systems. There is no other encoding that will actually work in practice anyway. Note that the Linux kernel does not enforce any particular encoding nor provides encoding information to userspace.

  • Running on a native Perl under Microsoft Windows may cause problems.

    Support for drive letters like C: when dealing with absolute pathnames is missing. But this limitation is probably not hard to fix.

    There is also the issue of UTF-8 in filenames, but you can probably configure the way Windows interacts with non-Unicode applications to overcome this problem.

    I would be willing to help if you volunteer testing the changes under Windows.

    In the meantime, you can always use Cygwin. Windows 10 even has a Linux subsystem nowadays.

  • Using RDChecksum on large amounts of data may flush the Linux filesystem cache and impact the performance of other processes.

    This shortcoming is not unique to RDChecksum, but to any tool processing lots of file data. I have written a small summary about this cache flushing issue:

    https://rdiez.miraheze.org/wiki/The_Linux_Filesystem_Cache_is_Braindead

    One way to overcome this issue is to use syscall posix_fadvise . Unfortunately, Perl provides no easy access to it. I have raised a GitHub issue about this:

    Title: Provide access to posix_fadvise

    Link: Perl/perl5#17899

  • There is a file format compatibility break between script versions 0.76 and 0.77 (released in September 2022).

    The checksum list file format version increased from 1 to 2, as the possible values in the checksum type column had changed. Empty files now have "none" as both checksum type and value, and support for the "Adler-32" type was introduced.

    No backwards compatibility is implemented. If you want to use old checksum list files with format version 1, you will have to stay with script version 0.76 .

    If you update the file format version number manually from 1 to 2, all empty files will report a checksum mismatch. Delete them all from the checksum list file and run an update in order to reintroduce the empty files.

FEEDBACK

Please send feedback to rdiezmail-tools at yahoo.de

LICENSE

Copyright (C) 2020-2023 R. Diez

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License version 3 as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License version 3 for more details.

You should have received a copy of the GNU Affero General Public License version 3 along with this program. If not, see http://www.gnu.org/licenses/.