Home
phiresky edited this page May 25, 2023
Please see https://github.com/phiresky/ripgrep-all/issues/146 for the current state of the project
Since version 1.0, you can specify custom adapters that invoke external preprocessing scripts in the config file.
For example, the integrated PDF-to-text adapter would look like the following in the config file:
"custom_adapters": [
    {
        "name": "poppler",
        "version": 1,
        "description": "Uses pdftotext (from poppler-utils) to extract plain text from PDF files",
        "extensions": ["pdf"],
        "mimetypes": ["application/pdf"],
        "binary": "pdftotext",
        "args": ["-", "-"],
        "disabled_by_default": false,
        "match_only_by_mime": false,
        "output_path_hint": "${input_virtual_path}.txt.asciipagebreaks"
    }
]
More information about the custom adapter config can be found on docs.rs (CustomAdapterConfig).
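The external program behind a custom adapter can be any executable. Below is a minimal sketch of one in Python, assuming that rga hands the input file to the program on stdin and reads the extracted text from stdout (which is what the "-", "-" arguments to pdftotext above suggest). The script name strings_adapter.py, the function extract_text and the 4-character threshold are made up for illustration and are not part of rga.

```python
#!/usr/bin/env python3
"""strings_adapter.py (hypothetical): turn a binary file into searchable text.

Assumption: the file arrives on stdin and the text is expected on stdout.
"""
import re
import sys

# Runs of at least 4 printable ASCII characters, similar in spirit to `strings`.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_text(data: bytes) -> str:
    """Return every printable run found in `data`, one per line."""
    runs = PRINTABLE_RUN.findall(data)
    return "".join(run.decode("ascii") + "\n" for run in runs)


def main() -> None:
    data = sys.stdin.buffer.read()        # whole input file, as raw bytes
    sys.stdout.write(extract_text(data))  # plain text for the search to run on


# Only run when input is piped in, so importing the file has no side effects.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

Under the same assumption, a config entry for such a script would probably set "binary" to the script's path and leave "args" empty, since the script already reads stdin and writes stdout without being told to.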
With custom adapters, there are now three ways to search custom file types. Here are the advantages and disadvantages of each.
- rg --pre and rg --search-zip: rg has built-in support for custom preprocessors and for searching some compressed files. The disadvantages are:
  - Simplicity. '--pre' applies one and the same script to all file types, so you have to write the decision logic yourself.
  - Caching. What makes adapters in rga fast is the caching mechanism, which allows fast search even when the preprocessor is slow (which is often the case). With '--pre' you'd have to implement this caching yourself, which isn't trivial. That's how rga got started ;).
  - Recursion. rga can recurse into archives and return contents at any depth as a binary stream. The same can be implemented for other things that aren't strictly archives, like a PDF file that contains images, where the images may be searched by a different extractor.
- Custom adapters. Custom adapters are great because they allow you to write an adapter in non-Rust code and use external libraries. You could even hook lesspipe into it. They are limited in that they can only output a single file per input file though, so they cannot handle archives like zip.
- Integrated adapters. Integrated adapters are fastest and most flexible because they are written in Rust and don't need to spawn external processes.
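To give an idea of what the "decision logic" and "caching" points mean for the '--pre' route, here is a rough sketch of a preprocessor script in Python. It relies on rg calling the '--pre' command with the file path as its first argument and searching whatever the command prints to stdout. The cache location, the EXTRACTORS table and the function names are hypothetical, and this is far simpler than what rga does internally (no compression, no eviction, no recursion into archives).

```python
#!/usr/bin/env python3
"""Sketch of a caching `rg --pre` preprocessor (illustrative, not rga's code)."""
import hashlib
import os
import subprocess
import sys
from pathlib import Path

# Hypothetical cache location.
CACHE_DIR = Path(os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))) / "rg-pre-sketch"

# Decision logic: file extension -> command that reads stdin and writes text to stdout.
EXTRACTORS = {
    ".pdf": ["pdftotext", "-", "-"],
}


def cache_key(path: Path) -> str:
    """Key on path, size and mtime so that an edited file is extracted again."""
    st = path.stat()
    raw = f"{path.resolve()}\0{st.st_size}\0{st.st_mtime_ns}"
    return hashlib.sha256(raw.encode()).hexdigest()


def preprocess(path: Path, cache_dir: Path = CACHE_DIR, extractors=EXTRACTORS) -> bytes:
    data = path.read_bytes()
    command = extractors.get(path.suffix.lower())
    if command is None:
        return data  # unknown type: pass the file through unchanged
    cache_file = cache_dir / cache_key(path)
    if cache_file.exists():
        return cache_file.read_bytes()  # cache hit: skip the slow extractor
    output = subprocess.run(command, input=data, stdout=subprocess.PIPE, check=True).stdout
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(output)
    return output


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.buffer.write(preprocess(Path(sys.argv[1])))
```

Saved as an executable file, for example pre_cache.py (a made-up name), it could be tried with something like rg --pre ./pre_cache.py PATTERN. Even this small version has to answer questions about cache invalidation and storage, which is the part rga takes care of for you.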
If you think your adapter config is useful, you can share it by adding it to the wiki.
[Todo: tesseract OCR adapter]