A simple introduction to docanalysis
docanalysis is an integrated suite of open-source command-line tools ("programs") that affords users, such as citizens, scientists, students, and teachers, an unprecedented and seamless ability to...
-
query using phrases, exact text, fuzzy matching, synonyms, abbreviations, and regular expressions (up to 80,000 characters in length);
-
download a corpus of literature (tens, thousands, or tens of thousands of scientific papers) directly from europepmc.org according to the user’s exact query specifications;
-
import downloaded or existing files from local computers or networks
-
select specific “sections” of the papers to include or exclude from analysis
- including, but not limited to: front matter (document ID, author(s), journal, year, etc.), body (introduction, abstract, materials and methods, results, discussions, conclusions), and references;
-
use generic or topic-specific dictionaries of vocabulary, which can be manually curated or automatically generated through NLP/text-mining, and which can be used to…
-
analyze individual text documents or whole corpora (.html, .xml, .pdf, .json) for...
-
word frequencies
-
perform Named Entity Recognition (“NER”) of labels or spaCy named entities to automatically recognize general text such as companies, locations, organizations, products, and more 🔔Insert anchor link to table below🔔.
-
The spaCy model is pre-trained to recognize these entities. However, we can also add user-specified classes to the entity recognition system, update the model with new examples, and optionally install scispaCy or other libraries [somewhere], thereby expanding docanalysis’ NER capabilities.
-
-
figures such as ….. for …?
-
-
extract, for display or output, organized data from text-embedded images (.png, .jpg) containing “flowchart-type” biosynthetic pathways and legends, as well as caption text;
-
annotate recognized entities
-
enrich document text automatically by adding links to Wikidata pages for each named entity found in every converted document. Thus, …?
-
convert PDFs to editable text,
-
enrich the keywords automatically with hyperlinks to Wikidata and Wikipedia information, and more
-
export data in HTML, JSON, CSV, and TSV formats that are portable and can be imported into your project
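To make the dictionary, word-frequency, and export items above more concrete, here is a minimal sketch in plain Python. It is not how docanalysis is implemented and does not use its command-line interface. The dictionary contents, the sample sentence, and the function names `term_frequencies` and `to_csv` are all hypothetical, chosen only for illustration.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical topic-specific dictionary; a real one would be curated
# by hand or generated by text-mining.
DICTIONARY = {"terpene", "limonene", "essential oil"}


def term_frequencies(text, terms):
    """Count case-insensitive, whole-word occurrences of each term."""
    counts = Counter()
    for term in terms:
        # re.escape keeps any punctuation in a term from being read as regex syntax.
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts


def to_csv(counts):
    """Serialize the counts as CSV text with a header row, sorted by term."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["term", "count"])
    for term, count in sorted(counts.items()):
        writer.writerow([term, count])
    return buffer.getvalue()


# Hypothetical sample text standing in for one section of a paper.
sample = "Limonene is a terpene. The essential oil is rich in limonene."
counts = term_frequencies(sample, DICTIONARY)
print(to_csv(counts))
```

The same counts could be written out as JSON or TSV with the standard `json` module or `csv.writer(..., delimiter="\t")`; CSV is used here only to keep the sketch short.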
🔔🔔 [examples with workflow images (and perhaps a link to a video) goes here] 🔔🔔
-
The 10-chapter, 10,000-page CLIMATE REPORT from its uneditable PDF format, ...
-
EssOildb is using it to...
-
Verriclear Natural Skin Essentials Ltd. is using it to...
🔔🔔 [FORMAT: “Do… X, SO YOU CAN… Y”]
-
automatically _____ so you can invest your time and effort [doing x, not y]
-
automatically enrich documents with Wikidata definitions and information, so you can...
-
x
-
y
-
z
-
-
transform the PDF documents already residing on your machine to facilitate/take the monotony out of....