MomsFriendlyDevCo/pdf-dicer

PDF-Dicer

Split PDF files into many based on barcode separators.

This is useful when a large number of documents are scanned in a single batch (e.g. via an automated office scanner) and then need to be split up again.

PDF-Dicer takes a single PDF file made up of multiple scanned documents. Each sub-document has a starting and ending barcode.

Input file

PDF-Dicer takes this file, splits on each barcode set, validates the barcodes and outputs back into individual files.

Output process
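
The splitting idea described above can be sketched as a small standalone function. This is an illustration only, not PDF-Dicer's implementation: the findRanges() name, the {number, barcode} page shape and the rule that a sub-document's starting and ending barcodes carry the same value are all assumptions made for the example.

```javascript
// Illustrative only: NOT pdf-dicer's implementation.
// Assumption: each page is {number, barcode}, where barcode is a string
// on separator pages and null on ordinary pages.
function findRanges(pages) {
	var ranges = [];
	var open = null; // The range currently being collected, if any

	pages.forEach(page => {
		if (!page.barcode) return; // Ordinary page: nothing to open or close
		if (!open) {
			// No range is open, so this barcode starts a new one
			open = {barcode: page.barcode, from: page.number};
		} else if (page.barcode === open.barcode) {
			// Matching barcode (assumed validation rule): close the range
			ranges.push({barcode: open.barcode, from: open.from, to: page.number});
			open = null;
		} else {
			throw new Error(`Page ${page.number}: expected barcode "${open.barcode}" but found "${page.barcode}"`);
		}
	});

	if (open) throw new Error(`Barcode "${open.barcode}" opened on page ${open.from} was never closed`);
	return ranges;
}

var ranges = findRanges([
	{number: 1, barcode: 'A1'}, {number: 2, barcode: null}, {number: 3, barcode: 'A1'},
	{number: 4, barcode: 'B2'}, {number: 5, barcode: 'B2'},
]);
console.log(ranges);
```

Each returned range identifies one sub-document by its barcode and its first and last page, which is the information needed to cut the batch back into separate files.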

Installing

This module requires ImageMagick, GhostScript, Poppler and PDFTK.

You can install them as follows:

  • Ubuntu Linux - sudo apt-get install imagemagick ghostscript poppler-utils pdftk
    • NOTE: Some versions of Ubuntu do not ship PDFTK. In that case use sudo add-apt-repository -y ppa:mike-quisido/pdftk && sudo apt-get update && sudo apt -y install pdftk-java
    • PDFTK also requires permissions to be set up before it can process local files
  • OSX (Yosemite) - brew install imagemagick ghostscript poppler
    • Install PDFTK from its website.

Example

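var fs = require('fs');
var pdfDicer = require('pdf-dicer');

var dicer = new pdfDicer();

dicer
	.on('split', (data, buffer) => {
		// Fired once per detected range; buffer holds that range as a PDF.
		// NOTE: every range is written to the same file name in this minimal example.
		fs.writeFile('output.pdf', buffer, err => {
			if (err) console.log(`Could not write output: ${err}`);
		});
	})
	.split('input.pdf', function(err, output) {
		if (err) console.log(`Something went wrong: ${err}`);
	});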

There is also an example of reading a directory of PDFs and saving the output based on the extracted barcodes here

API

dicer (class)

The main class of this module.

The constructor takes an optional settings object which is used to populate the initial setup.

var dicer = new pdfDicer({driver: 'quagga'});

dicer.settings (object)

An object holding the instance settings. These can be set on construction, via a call to set(), or by assigning to the object directly.

The following settings are supported:

| Setting | Type | Default | Profile | Description |
|---|---|---|---|---|
| areas | Array | {top:'0%',right:'0%',left:'0%',bottom:'0%'} | Quagga | The areas of the input pages that Quagga should scan |
| imageFormat | String | png (Quagga), tif (Bardecode) | All | The intermediate image format to use before processing the barcode |
| magickOptions | Object | Various (Quagga), {} (Bardecode) | All | Additional options to pass to ImageMagick when converting the PDF to images |
| bardecode | Object | See below | Bardecode | Options specific to Bardecode |
| bardecode.bin | String | /opt/bardecoder/bin/bardecode | Bardecode | Path to the bardecode binary |
| bardecode.checkEvaluation | Boolean | true | Bardecode | Check that the barcode doesn't end in ??? and raise a warning if it does |
| bardecode.serial | String | "" | Bardecode | Your Bardecode serial number |
| filter | Function | (page) => true | All | Optional filter to discard pages before calculating ranges |
| quagga | Object | See below | Quagga | Options specific to Quagga |
| quagga.locate | Boolean | false | Quagga | Indicates if Quagga should try to detect the barcode or we should use areas |
| quagga.decoder | Object | {readers:['code_128_reader'],multiple: false} | Quagga | Options passed to the Quagga decoder |
| temp | Object | See below | All | Options passed to Temp when generating a temporary directory |
| tempClean | Boolean | true | All | Automatically erase the temporary directory when done |
| temp.prefix | String | pdfdicer- | All | The prefix used when generating a temporary directory |
| threads | Object | See below | All | Options used for async threading |
| threads.pages | Number | 1 | All | The number of threads allowed to run simultaneously when processing pages |
| threads.areas | Number | 1 | Quagga | The number of threads allowed to run simultaneously when processing page areas |

dicer.set(setting, value)

Convenience function to quickly change a single setting. Dotted notation is allowed for the setting name (e.g. threads.pages).
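
To make dotted notation concrete, here is a minimal standalone sketch of how a dotted path can be resolved against a nested settings object. The setByPath() helper is hypothetical and exists only for illustration; PDF-Dicer's own set() may be implemented differently.

```javascript
// Hypothetical helper, shown only to illustrate dotted notation;
// pdf-dicer's own set() may be implemented differently.
function setByPath(target, path, value) {
	var keys = path.split('.');
	var last = keys.pop(); // Final segment names the property to assign
	var node = target;
	keys.forEach(key => {
		// Create intermediate objects as needed while walking down the path
		if (typeof node[key] !== 'object' || node[key] === null) node[key] = {};
		node = node[key];
	});
	node[last] = value;
	return target;
}

// A cut-down settings object using names from the settings table
var settings = {temp: {prefix: 'pdfdicer-'}, threads: {pages: 1, areas: 1}};
setByPath(settings, 'threads.pages', 4);
setByPath(settings, 'temp.prefix', 'scan-');
console.log(settings);
```

On a real instance the equivalent call would be dicer.set('threads.pages', 4), which changes only the nested pages value and leaves the rest of the threads object as it was.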

dicer.profile(profile)

Convenience function to configure the module with optimal settings for the supported barcode readers.

Supported profiles are:

  • quagga
  • bardecode

dicer.split(inputPath, callback)

Process the inputPath (usually a PDF) and split it into multiple PDF files.

Hook into the output of this function by trapping events.

Events

The following events are fired by this module:

| Event | Arguments | Description |
|---|---|---|
| stage | (stageName) | Fired for each stage of operation. ENUM: 'init', 'readPDF', 'readPages', 'extracted', 'filtering', 'loadRange', 'preSplit' |
| tempDir | (path) | Fired when a temp directory has been allocated |
| pageConverted | (page, pageOffset) | Fired for each page that is converted |
| pagesConverted | (pages) | Fired when all pages have been converted |
| pageAnalyze | (page) | Fired before an individual page is analyzed |
| barcodeFiltered | (page) | Fired if a page is filtered out |
| barcodePassed | (page) | Fired if a page passes filtering and is not filtered out |
| pageAnalyzed | (page) | Fired after a page has been analyzed |
| pagesAnalyzed | (pages) | Fired when all pages have been analyzed |
| split | (range, buffer) | Fired when a range has been detected and a buffer is ready |
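
Because split fires once per detected range, a handler usually needs to give each buffer its own file name. The sketch below shows one way to do that. The makeSplitWriter() helper and the split-N.pdf naming are assumptions for illustration, and a plain Node EventEmitter stands in for the dicer so the example runs without the module or its native dependencies installed.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');
var EventEmitter = require('events');

// Returns a 'split' handler that writes each buffer to its own numbered file.
// The shape of `range` is not documented here, so the handler does not inspect it.
function makeSplitWriter(outputDir) {
	var count = 0;
	var handler = (range, buffer) => {
		count++;
		var file = path.join(outputDir, `split-${count}.pdf`);
		fs.writeFileSync(file, buffer);
		handler.written.push(file);
	};
	handler.written = []; // Paths written so far, in order
	return handler;
}

// Stand-in emitter (an assumption for this sketch); with the real module,
// attach the same handlers to a pdfDicer instance and then call split().
var fakeDicer = new EventEmitter();
var outputDir = fs.mkdtempSync(path.join(os.tmpdir(), 'pdfdicer-example-'));
var onSplit = makeSplitWriter(outputDir);

fakeDicer
	.on('stage', stageName => console.log(`Stage: ${stageName}`))
	.on('split', onSplit);

// Simulate what a run might emit
fakeDicer.emit('stage', 'init');
fakeDicer.emit('split', {}, Buffer.from('first'));
fakeDicer.emit('split', {}, Buffer.from('second'));
console.log(onSplit.written);
```

Numbering the files avoids relying on any particular property of the range argument; if the range exposes the decoded barcode, that would make a more meaningful file name.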
