Using Datafari to extract text for academic research on NLU and NLP #61

bloubi · 2021-06-25T07:58:37Z

Collaborator

Extracting raw text for Natural Language Understanding (NLU) or Natural Language Processing (NLP) is often a tedious and time-consuming task. Any student or researcher who has had to prepare a pipeline for it knows what we are talking about: first assess the available open source technologies (very often Apache Tika), then understand how they work, put documents in a folder, and get it all working through trial and error, probably with a Python script.
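For readers who have not built one before, here is a minimal sketch of the kind of hand-rolled script described above. It is not part of Datafari. It assumes the third-party `tika` Python package (which needs Java available) for the default extractor, and the names `tika_extract` and `extract_folder` are made up for this illustration.

```python
from pathlib import Path
from typing import Callable, List


def tika_extract(path: Path) -> str:
    """Extract text with the third-party `tika` package (assumed to be installed)."""
    # Imported lazily so the rest of the script can be used with another extractor.
    from tika import parser

    parsed = parser.from_file(str(path))
    # "content" can be None when Tika finds no text in the document.
    return parsed.get("content") or ""


def extract_folder(
    src: Path,
    dst: Path,
    extract: Callable[[Path], str] = tika_extract,
) -> List[Path]:
    """Extract text from every file under `src` and write one .txt per document into `dst`."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(p for p in src.rglob("*") if p.is_file()):
        try:
            text = extract(doc)
        except Exception as exc:
            # One unreadable document should not stop the whole batch.
            print(f"skipped {doc}: {exc}")
            continue
        # Flatten the relative path so same-named files in different subfolders do not collide.
        out_name = doc.relative_to(src).as_posix().replace("/", "__") + ".txt"
        out = dst / out_name
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written
```

Calling `extract_folder(Path("documents"), Path("extracted_text"))` would process a hypothetical `documents` folder. Keep the output folder outside the input folder, otherwise a second run would pick up the extracted files too. Passing the extractor as a parameter is only a convenience for swapping Tika out or testing without it.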

This is what we had in mind when preparing documentation on how to use Datafari Community Edition for exactly that. After all, Datafari is an enterprise search solution, so it already handles these tasks as part of its overall mission to index documents and make them searchable.

With the documentation we provide, researchers will be able to set up a fully operational pipeline that watches a specific shared folder, extracts the text (via Apache Tika), and outputs it to a dedicated folder. With a bit more motivation, researchers can go further and use connectors other than the file share, as the pipeline can work with any data source.

Discover now how to use Datafari to extract text from any document and feed it to your favorite ML tools.

ANDRERIW · 2021-08-02T17:00:07Z


Thank you! I will be testing it and referencing it here in Brazil. Downloading it now.


bloubi · 2021-08-03T14:19:47Z

Collaborator Author

Glad to see it can be of help. Let us know if you have trouble with our documentation.

