
Activities Summary: Anuv

Anubhab Chakraborty edited this page Sep 14, 2021 · 8 revisions

Software Installation

System Information

  • Operating System: Ubuntu 20.04
  • python3 --version: 3.9
  • pip --version: 20.0.2

pygetpapers

To install pygetpapers, run: pip install pygetpapers

Check if pygetpapers is properly installed: pygetpapers --help

Adding pygetpapers to path

On Ubuntu, the binaries are installed in ~/.local/bin by default. We can add this directory to our system path and then run pygetpapers from the console. To add the directory to the system path for the current shell session, execute: export PATH="$HOME/.local/bin:$PATH"

Dependencies

  • Java: sudo apt install default-jre

To check if the software is successfully installed, run java --version

  • Maven: sudo apt install maven

After Java and Maven are installed, we clone the ami3 repository and build it.

git clone https://github.com/petermr/ami3.git
cd ami3
mvn install -Dmaven.test.skip=true

To add ami to the system path, execute the following command: export PATH="$HOME/ami3/target/appassembler/bin:$PATH"


20210914

Project Idea

Chemical reactions in scientific literature are commonly described in paragraph form. The reaction information encoded in these unstructured paragraphs would be more useful in a machine-readable, structured format. Chemical Markup Language (CML) is an application of XML that provides a tagset for encoding chemical information, which makes it a candidate for representing reactions found in the literature. Machines cannot simply read and understand a paragraph of plain text the way humans do, but with NLP we might be able to identify the chemically relevant information in a paragraph and encode it as CML.

Why it would be useful:

There is a vast repository of chemical information locked away in paragraphs of reaction description in scientific literature. The information can be easily deciphered by a chemist, but such a process cannot scale in time and cost when analysing large amounts of scientific literature. Having such information in CML would make analysis and use of chemistry and biochemistry literature scalable.

Goal:

To identify the components of a paragraph rich in chemical reaction information and correctly encode that information in CML.

Initial Plan (Plan 0):

  • We can get a sense of the structure of a reaction by looking for certain words or word groups.
    • Look for words or phrases such as ‘reacts with’, ‘undergoes reaction’, ‘undergoes elimination’, ‘combusts’, etc. These might indicate the presence of a chemical reaction and also tell us about the products and the type of reaction.
    • A number followed by M, such as 0.5M, indicates a concentration.
    • ‘Catalysed by’ and ‘in presence of’ indicate catalysts and reaction conditions.
    • ‘At K’ (a temperature in kelvin), ‘atm’, ‘temperature’, ‘pressure’, ‘NTP’, etc. indicate reaction conditions.
    • ‘Gives’ and ‘to form’ are usually followed by the reaction product.
  • We can match words against a dictionary of chemical names to check whether each one is a valid compound or element.
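
The cue-based plan above can be sketched in a few lines of Python. This is only a minimal illustration, not a working extractor: the cue lists are copied from the bullets (plus ‘to give’ from the usage example), and the regular expressions, the tiny KNOWN_COMPOUNDS set and the function names find_cues and is_known_compound are all assumptions made for this sketch.

```python
import re

# Cue phrases taken from the plan; a real system would need far more.
REACTION_CUES = ("reacts with", "undergoes reaction", "undergoes elimination", "combusts")
PRODUCT_CUES = ("to give", "gives", "to form")

# Hypothetical stand-in for a real dictionary of chemical names.
KNOWN_COMPOUNDS = {"phenol", "naoh", "co2", "sodium salicylate"}

# Illustrative patterns: a number (or range, for pressure) followed by a unit.
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*K\b")
PRESSURE = re.compile(r"(\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)\s*atm\b")
CONCENTRATION = re.compile(r"(\d+(?:\.\d+)?)\s*M\b")


def find_cues(sentence):
    """Return the cue phrases and numeric conditions found in one sentence."""
    lowered = sentence.lower()
    return {
        "reaction": [cue for cue in REACTION_CUES if cue in lowered],
        "product": [cue for cue in PRODUCT_CUES if cue in lowered],
        "temperature": TEMPERATURE.findall(sentence),
        "pressure": PRESSURE.findall(sentence),
        "concentration": CONCENTRATION.findall(sentence),
    }


def is_known_compound(name):
    """Case-insensitive lookup against the (hypothetical) name dictionary."""
    return name.strip().lower() in KNOWN_COMPOUNDS


cues = find_cues(
    "Phenol reacts with NaOH and CO2 at 400K and 2-7atm to give Sodium Salicylate."
)
print(cues)
print(is_known_compound("Phenol"))
```

Plain substring and regex matching is used here because it maps directly onto the bullets; it says nothing about which compound plays which role, which is where the NLP step would have to come in.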

Proposed usage:

Text:

Phenol reacts with NaOH and CO2 at 400K and 2-7atm to give Sodium Salicylate.

XML:

<reaction>
   <reactant>
      <formula>C6 H6 O</formula>
      <name>Phenol</name>
   </reactant>
   <reactant>
      <formula>Na O H</formula>
      <name>Sodium Hydroxide</name>
   </reactant>
   <reactant>
      <formula>C O2</formula>
      <name>Carbon Dioxide</name>
   </reactant>
   <product>
      <formula>C7 H5 Na O3</formula>
      <name>Sodium Salicylate</name>
   </product>
   <reaction-conditions>
      <temperature>400K</temperature>
      <pressure>2-7atm</pressure>
   </reaction-conditions>
</reaction>
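
Once the components of a reaction have been identified, writing them out as XML is the easy part. The sketch below builds the same element layout with Python's standard xml.etree.ElementTree module, using the values from the example sentence. The function name build_reaction is hypothetical, and the tag names simply mirror the proposed output above; they have not been checked against the actual CML schema.

```python
import xml.etree.ElementTree as ET


def build_reaction(reactants, products, conditions):
    """Build a <reaction> element from (name, formula) pairs and a conditions dict."""
    reaction = ET.Element("reaction")
    for tag, species in (("reactant", reactants), ("product", products)):
        for name, formula in species:
            node = ET.SubElement(reaction, tag)
            ET.SubElement(node, "formula").text = formula
            ET.SubElement(node, "name").text = name
    cond = ET.SubElement(reaction, "reaction-conditions")
    for key, value in conditions.items():
        ET.SubElement(cond, key).text = value
    return reaction


reaction = build_reaction(
    reactants=[
        ("Phenol", "C6 H6 O"),
        ("Sodium Hydroxide", "Na O H"),
        ("Carbon Dioxide", "C O2"),
    ],
    products=[("Sodium Salicylate", "C7 H5 Na O3")],
    conditions={"temperature": "400K", "pressure": "2-7atm"},
)

# ET.indent is available from Python 3.9 onwards; skip pretty-printing otherwise.
if hasattr(ET, "indent"):
    ET.indent(reaction, space="   ")
print(ET.tostring(reaction, encoding="unicode"))
```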

Related projects:

  • Identify passages containing descriptions of chemical reactions
  • Convert molecule descriptions into CML
  • Identify images depicting chemical molecules and reactions
  • Convert chemical molecules or reactions presented as images into CML
  • Encoding metabolic pathways as XML