Using Datafari to extract text for academic research on NLU and NLP #61
bloubi
started this conversation in
Show and tell
Replies: 2 comments
-
Thank you! I will be testing it and referencing it here in Brazil. Downloading it now.
-
Glad to see it can be of help. Let us know if you have trouble with our documentation.
-
Extracting raw text for Natural Language Understanding (NLU) or Natural Language Processing (NLP) is often a boring and time-consuming task. Any student or researcher who has had to prepare a pipeline for this knows what we are talking about. First you assess the available open source technologies (very often Apache Tika), then you work out how they function, put documents in a folder, and get everything running through trial and error, probably with a Python script.
This is what we had in mind when preparing documentation on how to use Datafari Community Edition for exactly that purpose. After all, Datafari is an enterprise search solution, which means it already handles these tasks as part of its overall mission to index documents and make them searchable.
With the documentation we provide, researchers will have a fully operational pipeline that looks in a specific shared folder, extracts the text (via Apache Tika), and outputs it to a dedicated folder. With a bit more motivation, researchers can go further and use connectors other than the file share, as the pipeline can work with any data source.
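To give an idea of the step that comes after the pipeline, here is a minimal Python sketch that reads the extracted text back from the dedicated output folder so it can be passed to an ML tool. It assumes the output is one UTF-8 plain-text file per document with a `.txt` extension. That layout, the function names, and the example path are assumptions made for illustration, so check the Datafari documentation for the actual output format.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_extracted_texts(output_dir: str, pattern: str = "*.txt") -> Iterator[Tuple[str, str]]:
    """Yield (file name, text) for every non-empty extracted file.

    Assumption: the pipeline wrote one UTF-8 text file per source document
    somewhere under output_dir. Adjust `pattern` if the real layout differs.
    """
    for path in sorted(Path(output_dir).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the loop going if a file has stray bytes.
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if text:  # skip documents for which no text was extracted
            yield path.name, text


def corpus_stats(output_dir: str) -> dict:
    """Count documents and whitespace-separated tokens as a quick sanity check."""
    documents = 0
    tokens = 0
    for _, text in iter_extracted_texts(output_dir):
        documents += 1
        tokens += len(text.split())
    return {"documents": documents, "tokens": tokens}
```

For example, `corpus_stats("/path/to/output")` (a hypothetical path) gives a quick check that the extraction produced text before the files are handed to a tokenizer or a training script.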
Discover now how to extract text from any document with Datafari and feed it to your favorite ML tools.