Split PDF files into many based on barcode separators.
This is useful if scanning a large number of documents in a batch (e.g. via an automated office scanner) which then need to be split up again.
PDF-Dicer takes a single PDF file made up of multiple scanned documents. Each sub-document has a starting and ending barcode.
PDF-Dicer takes this file, splits on each barcode set, validates the barcodes and outputs back into individual files.
This module requires ImageMagick, GhostScript and Poppler.
You can install them as follows:
- Ubuntu Linux -
sudo apt-get install imagemagick ghostscript poppler-utils pdftk
- NOTE: Some versions of Ubuntu do not stock PDFTK, use
sudo add-apt-repository -y ppa:mike-quisido/pdftk && sudo apt-get update && sudo apt -y install pdftk-java
- PDFTK also requires permissions to be setup before it can process local files
- NOTE: Some versions of Ubuntu do not stock PDFTK, use
- OSX (Yosemite) -
brew install imagemagick ghostscript poppler
- Install PDFTK from website.
var pdfDicer = require('pdf-dicer');
var dicer = new pdfDicer();
dicer
.on('split', (data, buffer) => {
fs.writeFile('output.pdf', buffer);
})
.split('input.pdf', function(err, output) {
if (err) console.log(`Something went wrong: ${err}`);
});
There is an also an example of reading a directory of PDF's and saving based on extracted barcodes here
The main class of this module.
The constructor takes an optional settings object which is used to populate the initial setup.
var dicer = new pdfDicer({driver: 'quagga'});
An object of the instance settings. These can be set either on construction, via a call to set()
or directly.
The following settings are supported:
Setting | Type | Default | Profile | Description |
---|---|---|---|---|
areas |
Array | {top:'0%',right:'0%',left:'0%',bottom:'0%'} |
Quagga | The areas of the input pages that Quagga should scan |
imageFormat |
String | png (Quagga), tif (Bardecode) |
All | The intermediate image format to use before processing the barcode |
magickOptions |
Object | Various (Quagga), {} (Bardecode) |
All | Additional options to pass to ImageMagick when converting the PDF to images |
bardecode |
Object | See below | Bardecode | Options specific to Bardecode |
bardecode.bin |
String | /opt/bardecoder/bin/bardecode |
Bardecode | Path to the bardecode binary |
bardecode.checkEvaluation |
Boolean | true |
Bardecode | Check that the barcode doesn't end in ??? and raise a warning if it does |
bardecode.serial |
String | "" |
Bardecode | Your Bardecode serial number |
filter |
Function | (page) => true |
All | Optional filter to discard pages before calculating ranges |
quagga |
Object | See below | Quagga | Options specific to Quagga |
quagga.locate |
Boolean | false |
Quagga | Indicates if Quagga should try to detect the barcode or we should use areas |
quagga.decoder |
Object | {readers:['code_128_reader'],multiple: false} |
Quagga | Options passed to the Quagga decoder |
temp |
Object | See below | All | Options passed to Temp when generating a temporary directory |
tempClean |
Boolean | true |
All | Automatically erase the temporary directory when done |
temp.prefix |
String | pdfdicer- |
All | The prefix used when generating a temporary directory |
threads |
Object | See below | All | Options used for async threading |
threads.pages |
Number | 1 |
All | The number of threads allowed to run simultaneously when processing pages |
threads.areas |
Number | 1 |
Quagga | The number of threads allowed to run simultaneously when processing page areas |
Convenience function to quickly set a setting. Dotted notation is allowed for setting
.
Convenience function to configure the module with optimal settings for the supported barcode readers.
Supported profiles are:
quagga
bardecode
Process the inputPath (usually a PDF) and split it into multiple PDF files.
Hook into the output of this function by trapping events.
The following events are fired by this module:
Event | Arguments | Description |
---|---|---|
stage |
(stageName) |
Fired for each stage of operation. ENUM: 'init', 'readPDF', 'readPages', 'extracted', 'filtering', 'loadRange', 'preSplit' |
tempDir |
(path) |
Fired when a temp directory has been allocated |
pageConverted |
(page, pageOffset) |
Fired for each page that is converted |
pagesConverted |
(pages) |
Fired when all pages have been converted |
pageAnalyze |
(page) |
Fired before an individual page is analyzed |
barcodeFiltered |
(page) |
Fired if a page is filtered out |
barcodePassed |
(page) |
Fired if a page passes filtering and is not filtered out |
pageAnalyzed |
(page) |
Fired after a page has been analyzed |
pagesAnalyzed |
(pages) |
Fired when all pages have been analyzed |
split |
(range, buffer) |
Fired when a range has been detected and a buffer is ready |