A simple introduction to docanalysis

what is it?

docanalysis is an integrated suite of open-source command-line tools ("programs") that enables users such as citizens, scientists, students, and teachers to...

query using phrases, exact text, fuzzy matching, synonyms, abbreviations, and regular expressions (up to 80,000 characters long)
download a corpus of literature (tens, thousands, or tens of thousands of scientific papers) directly from europepmc.org according to your exact query specifications
import downloaded or existing files from local computers or networks
select specific “sections” of the papers to include in or exclude from analysis
- including, but not limited to: front matter (document ID, authors, journal, year, etc.), body (introduction, abstract, materials and methods, results, discussion, conclusions), and references
use generic or specialized topic-specific dictionaries of vocabulary, which can be manually curated or automatically generated by NLP/text mining, and which can be used to…
analyze individual text documents or whole corpora (.html, .xml, .pdf, .json) for...
- word frequencies
- named entities, using Named Entity Recognition (“NER”) of labels or spaCy named entities to automatically recognize general entities such as companies, locations, organizations, products, and more 🔔Insert anchor link to table below🔔
  - The spaCy model is pre-trained to recognize these entities. However, user-specified classes can also be added to the entity recognition system, and the model can be updated with new examples. docanalysis’ NER capabilities can also be expanded by optionally installing scispaCy or other libraries [somewhere]
- figures such as ….. for …?
extract organized data, for display or output, from images with embedded text (.png, .jpg) containing “flowchart-type” biosynthetic pathways and legends, as well as caption text
annotate recognized entities
enrich document text automatically by adding links to Wikidata pages for each named entity found in every converted document. Thus, …?
convert PDFs to editable text
enrich the keywords automatically with hyperlinks to Wikidata and Wikipedia information, and more
export data in HTML, JSON, CSV, and TSV formats that are portable and can be imported into your project
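
As a rough illustration of the dictionary-based word-frequency analysis listed above, the sketch below counts how often each term from a small dictionary appears in a piece of text. It is plain Python written for this page, not docanalysis code and not its API; the function name `count_dictionary_terms`, the sample text, and the sample terms are all hypothetical.

```python
import re
from collections import Counter


def count_dictionary_terms(text, terms):
    """Count case-insensitive, whole-word occurrences of each dictionary term.

    Hypothetical helper for illustration only; docanalysis may work differently.
    """
    counts = Counter()
    for term in terms:
        # \b anchors keep "essential oil" from matching inside "essential oils".
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts


# Made-up sample text and dictionary, standing in for a paper and a curated vocabulary.
sample_text = (
    "Lavender essential oil contains linalool. "
    "Linalool and limonene are terpenes found in many essential oils."
)
sample_terms = ["linalool", "limonene", "essential oil"]

term_counts = count_dictionary_terms(sample_text, sample_terms)
for term, count in term_counts.most_common():
    print(f"{term}\t{count}")
```

Matching on whole words is a deliberate simplification here: plurals and other inflected forms are not counted, which is one reason a real text-mining tool typically adds stemming, synonyms, or fuzzy matching on top of exact lookup.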

how are people using it?

🔔🔔 [examples with workflow images (and perhaps a link to a video) go here] 🔔🔔

The 10-chapter, 10,000-page CLIMATE REPORT from its uneditable PDF format, ...
EssOildb is using it to...
Verriclear Natural Skin Essentials Ltd. is using it to...

what can YOU do with it?

🔔🔔 [FORMAT: **"Do… X, SO YOU CAN… Y"**]

automatically _____ so you can invest your time and effort [doing x, not y]
automatically enrich documents with wikidata definitions and information, so you can...
- x
- y
- z
transform the PDF documents already residing on your machine to facilitate/take the monotony out of...
