
A simple introduction to docanalysis

Emanuel Faria edited this page Jul 27, 2022 · 2 revisions

what is it?

docanalysis is an integrated suite of open-source command-line tools (“programs”) that gives users, such as citizens, scientists, students and teachers, the ability to...

  • query using phrases, exact text, fuzzy matching, synonyms, abbreviations, and REGEX (up to 80,000 characters in length)

  • download a corpus of literature (anywhere from tens to tens of thousands of scientific papers) directly from europepmc.org according to the user’s exact query specifications;

  • import downloaded or existing files from local computers or networks

  • select specific “sections” of the papers to include or exclude from analysis

    • including, but not limited to: front matter (document ID, author(s), journal, year, etc.), body (introduction, abstract, materials and methods, results, discussions, conclusions), and references;
  • use generic or topic-specific dictionaries of vocabulary, either manually curated or automatically generated by NLP/text mining, which can be used to…

  • analyze individual text documents or whole corpora (.html, .xml, .pdf, .json) for...

    • word frequencies

      • perform Named Entity Recognition (“NER”) of labels or spaCy named entities to automatically recognize general entities in text, such as companies, locations, organizations, products, and more 🔔Insert anchor link to table below🔔.

      • The spaCy model is pre-trained to recognize these entities. However, we can also add user-specified classes to the entity recognition system and update the model with new examples, and thereby expand docanalysis’ NER capabilities, by optionally installing scispaCy or other libraries [somewhere]

    • figures such as ….. for …?

  • extract, for display or output, organized data from text-embedded images (.png, .jpg) containing “flowchart-type” biosynthetic pathways and legends, as well as caption text;

  • annotate recognized entities

  • enrich document text automatically by adding links to Wikidata pages for each named entity found in every converted document. Thus, …?

  • convert PDFs to editable text;

  • enrich the keywords automatically with hyperlinks to Wikidata and Wikipedia information, and more

  • export data in HTML, JSON, CSV, and TSV formats that are portable and can be imported into your project
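
To make the dictionary and word-frequency ideas above more concrete, here is a minimal Python sketch of the general technique: counting how often each term from a small dictionary appears across a corpus, then writing the result as CSV. This is not docanalysis code and does not use its API or command-line options. The names `TERMS`, `DOCS`, `count_terms`, `corpus_frequencies` and `to_csv` are all hypothetical, and the two sample sentences are made up for illustration.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical dictionary of terms (an assumption, not a docanalysis dictionary).
TERMS = ["terpene", "essential oil", "limonene"]

# Made-up sample "documents" standing in for a downloaded corpus.
DOCS = {
    "doc1": "Limonene is a terpene found in citrus essential oil.",
    "doc2": "The essential oil contained limonene and other terpenes.",
}


def count_terms(text, terms):
    """Count case-insensitive, whole-word occurrences of each term in text."""
    counts = Counter()
    for term in terms:
        # \b anchors keep "terpene" from matching inside "terpenes".
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def corpus_frequencies(docs, terms):
    """Sum the per-document term counts over every document in the corpus."""
    total = Counter()
    for text in docs.values():
        total.update(count_terms(text, terms))
    return total


def to_csv(counts):
    """Serialize term counts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["term", "count"])
    for term, n in counts.most_common():
        writer.writerow([term, n])
    return buf.getvalue()


freqs = corpus_frequencies(DOCS, TERMS)
print(to_csv(freqs))
```

Exact whole-word matching is the simplest choice here; the fuzzy matching, synonyms and abbreviations listed above would need a more forgiving matcher than this sketch provides.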

 

how are people using it?

🔔🔔 [examples with workflow images (and perhaps a link to a video) go here] 🔔🔔

 

what can YOU do with it?

🔔🔔 [FORMAT: “Do… X, SO YOU CAN… Y”]

  • automatically _____ so you can invest your time and effort [doing x, not y]

  • automatically enrich documents with wikidata definitions and information, so you can...

    • x

    • y

    • z

  • transform the PDF documents already residing on your machine to facilitate/take the monotony out of....