Plugin: File filter
This article is a work in progress. You can help Searchdaimon by expanding it with information you know.
File filters are plugins to the document manager that extract text from files so that the text can be added to the index. The goal is to be able to use any text-based program on Linux as a file filter.
If you make any changes, you will need to restart the document manager for the changes to take effect.
The file filters reside in the fileFilter folder. Each file filter has its own folder with some of the following files and folders:
- runinfo – Configuration file that tells the ES what to run for this file filter and how to run it
- src – Folder with the source code of the file filter if available
- test – Folder with example files useful for testing
- [binary] – A binary that the file filter needs, if any
The document manager reads the runinfo files at startup and then runs the correct file filter for each file that gets crawled.
We try to collect at least one test file for each format the ES can support. If there are any test files, they will be in the test folder for that file filter.
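As a rough illustration of the folder layout described above, the following Python sketch lists the file filter folders that contain a runinfo file. It is not the document manager's actual startup code; the function name find_runinfo_files and the throwaway folder tree are hypothetical and only serve the demonstration.

```python
import tempfile
from pathlib import Path


def find_runinfo_files(filter_root):
    """Return the runinfo file of every filter folder directly under filter_root."""
    root = Path(filter_root)
    # One folder per file filter, each optionally holding a "runinfo" file.
    return sorted(p for p in root.glob("*/runinfo") if p.is_file())


# Demo on a temporary tree that mimics the fileFilter folder (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("pdftotext", "identify"):
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "runinfo").write_text("documentstype: example\n")
    (Path(tmp) / "unfinished").mkdir()  # a folder without a runinfo file is skipped
    found = [p.parent.name for p in find_runinfo_files(tmp)]

print(found)
```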
The runinfo file is the main configuration file. It lists one section for each file format it knows how to convert. The runinfo configuration file supports several different types of file filters.
command describes what the document manager should do when it encounters a file. Normally it will execute a script or binary, located either in the file filter folder or somewhere else on the system.
For example:
/usr/bin/binary --to=txt #file
This will execute /usr/bin/binary with the path of the file substituted for #file.
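The #file substitution can be sketched in Python as follows. This is only an illustration of the idea: how the document manager really performs the substitution (for example, whether it goes through a shell) is not described here, and the function name build_command and the path /tmp/report.doc are hypothetical.

```python
import shlex


def build_command(template, file_path):
    """Split a command template into arguments and replace the #file placeholder."""
    # comments=False (the default) keeps "#file" as a normal token
    # instead of treating "#" as the start of a comment.
    args = shlex.split(template, comments=False)
    return [file_path if arg == "#file" else arg for arg in args]


argv = build_command("/usr/bin/binary --to=txt #file", "/tmp/report.doc")
print(argv)
```

Passing the result as an argument list (rather than one string to a shell) keeps file names with spaces or quotes intact.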
outputformat describes what format the file filter will use to output its result. It can be one of:
- html - The output will be formatted as html
- text - The output will be formatted as plain text
- htmlfile - The output will be a new file that is formatted as html
- textfile - The output will be a new file that is formatted as text
- dir - The output will be a new directory containing new files. The document manager will then go through this new directory and call other file filters on each file. This is normally used for files that contain other files. For example, a .zip file may contain other archived files, and an email file may contain attachments.
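The dir case can be pictured with the Python sketch below, which walks an extracted directory and decides which file filter would handle each file. It assumes, purely for illustration, that filters are chosen by file extension; the function name plan_filters, the filters_by_extension mapping, and the sample file names are all hypothetical.

```python
import os
import tempfile


def plan_filters(extracted_dir, filters_by_extension):
    """Return (path, filter name) pairs for every file with a known extension."""
    plan = []
    for root, _dirs, files in os.walk(extracted_dir):
        for name in sorted(files):
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            if ext in filters_by_extension:
                plan.append((os.path.join(root, name), filters_by_extension[ext]))
    return plan


# Demo: pretend an archive was unpacked into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("manual.pdf", "logo.png", "notes.xyz"):
        open(os.path.join(tmp, name), "w").close()
    plan = plan_filters(tmp, {"pdf": "pdftotext", "png": "identify"})
    chosen = [(os.path.basename(path), flt) for path, flt in plan]

print(chosen)
```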
outputtype describes where the file filter will output its results. It is currently only used for html and text, and must then be stdio to indicate that the file filter will use stdout and stderr for output.
filtertype is used to run a specially crafted Perl module as a filter. It is an undocumented feature for now.
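The rules for outputformat and outputtype given above can be written down as a small consistency check. The Python sketch below assumes a section has already been read into a plain dictionary, and it reads "must then be stdio" as a requirement for html and text; the function name check_section is hypothetical, and the document manager may validate sections differently or not at all.

```python
VALID_OUTPUT_FORMATS = {"html", "text", "htmlfile", "textfile", "dir"}
STDIO_FORMATS = {"html", "text"}  # formats whose result is read from the filter's output


def check_section(section):
    """Return a list of problems found in one runinfo section (a plain dict)."""
    problems = []
    fmt = section.get("outputformat")
    if fmt not in VALID_OUTPUT_FORMATS:
        problems.append(f"unknown outputformat: {fmt!r}")
    elif fmt in STDIO_FORMATS and section.get("outputtype") != "stdio":
        problems.append(f"outputformat {fmt!r} expects outputtype: stdio")
    return problems


good = {"command": "/usr/bin/binary --to=txt #file",
        "outputformat": "text", "outputtype": "stdio"}
bad = {"command": "/usr/bin/binary --to=txt #file", "outputformat": "text"}

print(check_section(good))
print(check_section(bad))
```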
For example, the file fileFilter/pdftotext/runinfo is the runinfo file for pdftotext, a program for converting PDF files to text. It therefore has a section for PDF files like this:
documentstype: pdf
command: ./pdftotext #file
outputformat: textfile
It tells the document manager that for each PDF file it shall run the pdftotext binary located in the file filter's own folder. The pdftotext program will then create a new text file with the extracted text.
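To make the structure of such a section concrete, here is a minimal Python sketch that reads it into a dictionary. It assumes each setting sits on its own line in the form "key: value"; the real runinfo parser may accept more than this, and the function name parse_section is hypothetical.

```python
def parse_section(text):
    """Parse "key: value" lines of one runinfo section into a dict."""
    section = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so colons inside a command survive.
        key, sep, value = line.partition(":")
        if sep:
            section[key.strip()] = value.strip()
    return section


pdf_section = parse_section(
    "documentstype: pdf\n"
    "command: ./pdftotext #file\n"
    "outputformat: textfile\n"
)
print(pdf_section)
```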
This file filter has some test files, so you can try this out directly from the console by typing this from the boithoTools folder:
fileFilter/pdftotext/pdftotext fileFilter/pdftotext/test/dmca.pdf
This will give you a new file, fileFilter/pdftotext/test/dmca.txt, with the text from the dmca.pdf file.
Files that don't contain any text, like images, may also benefit from file filters. For example, for images the ES will extract the name of the file format and the geometry.
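The naming seen here (dmca.pdf becomes dmca.txt in the same folder) can be expressed as a small Python helper. This only mirrors the pdftotext behaviour shown in the example above; how the document manager locates the output of a textfile filter is not described on this page, and the function name expected_text_output is hypothetical.

```python
from pathlib import PurePosixPath


def expected_text_output(input_path):
    """Return the input path with its extension replaced by .txt."""
    return str(PurePosixPath(input_path).with_suffix(".txt"))


print(expected_text_output("fileFilter/pdftotext/test/dmca.pdf"))
# prints fileFilter/pdftotext/test/dmca.txt
```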
For example, the file fileFilter/identify/runinfo has a section for PNG images like this:
documentstype: png
command: identify -ping -format "Geometry: %wx%h, Format: %m" #file
outputformat: text
outputtype: stdio
It tells the document manager that for each PNG file it shall run ImageMagick's identify program with some parameters and the file name. The identify program will then write its text to standard output.
This file filter also has some test files, so you can try this out directly from the console by typing this from the boithoTools folder:
identify -ping -format "Geometry: %wx%h, Format: %m" fileFilter/identify/test/png_sample.png
This will try to extract information from the png_sample.png file.
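The text produced by this format string is a single line such as the one in the Python sketch below, which splits it back into its parts. The sample line and its values (640x480, PNG) are made up for illustration and are not the actual output for png_sample.png; the function name parse_identify_line is hypothetical as well.

```python
import re

# Matches the layout produced by -format "Geometry: %wx%h, Format: %m".
IDENTIFY_LINE = re.compile(r"Geometry: (\d+)x(\d+), Format: (\S+)")


def parse_identify_line(line):
    """Return width, height and format from one identify output line, or None."""
    match = IDENTIFY_LINE.fullmatch(line.strip())
    if match is None:
        return None
    width, height, fmt = match.groups()
    return {"width": int(width), "height": int(height), "format": fmt}


sample = "Geometry: 640x480, Format: PNG"  # hypothetical output line
info = parse_identify_line(sample)
print(info)
```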