Import new data set from PubChem #158

Open
Tracked by #229
Denz1994 opened this issue Feb 7, 2020 · 2 comments

@Denz1994
Contributor

Denz1994 commented Feb 7, 2020

This sim requires that all possible molecules and molecule structures be defined before anything is built. This data is stored in js/data and was derived from PubChem. The current data set in js/data/ consists of:

  • collectionMoleculesData.js: Shortlist of PubChem molecules used for collection boxes.
  • otherMoleculesData.js: Holds all PubChem-related data; entries can be read as described in More examples of incorrect nomenclature #153 (comment).
  • structuresData.js: Holds all possible structures. These structures may or may not have a corresponding entry in collectionMoleculesData.js.

The tools used to generate this data set have not yet been completely ported from Java and would require additional documentation. This includes the filtering that removes any molecules not desired for this sim. During the design meeting on 01/31/20, it was decided to postpone this work until after publication of this sim.

Assigning to @ariel-phet for prioritization and assignment.

@Denz1994
Contributor Author

Additional Details:

The approach to importing the data set in the legacy version required a parser, some filters, and a post-processor.

The parser (MoleculeSDFCombinedParser) is responsible for importing all the data from PubChem using an SDF file. It generates two text files of molecule data: collection-molecules.txt and other-molecules.txt. The first contains molecule data for the collection boxes, while the second holds data for other molecules that can be built in the sim. See #153 (comment) for details on how to read these entries.
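
As a rough illustration of the parsing step, the sketch below splits SDF text into records on the `$$$$` delimiter and collects each record's tagged data items (the `> <TAG>` lines that follow `M  END`). It is not a port of MoleculeSDFCombinedParser: the function names, the returned object shape, and the abbreviated sample record (with illustrative coordinates) are assumptions made for this example, and it does not reproduce the legacy text-file output format.

```javascript
// Sketch only: parseSdfRecords and parseRecord are hypothetical names,
// not part of the legacy Java tools.

// Splits SDF text into records. Each record ends with a line containing '$$$$';
// trailing text without that terminator is ignored.
function parseSdfRecords( sdfText ) {
  const records = [];
  let lines = [];
  sdfText.split( /\r?\n/ ).forEach( line => {
    if ( line.trim() === '$$$$' ) {
      records.push( parseRecord( lines ) );
      lines = [];
    }
    else {
      lines.push( line );
    }
  } );
  return records;
}

// Separates the molfile block (up to 'M  END') from the tagged data items.
function parseRecord( lines ) {
  const endIndex = lines.findIndex( line => line.startsWith( 'M  END' ) );
  const molfile = lines.slice( 0, endIndex + 1 ).join( '\n' );
  const fields = {};
  let tag = null;
  lines.slice( endIndex + 1 ).forEach( line => {
    const header = line.match( /^>.*<([^>]+)>/ );
    if ( header ) {
      tag = header[ 1 ];
      fields[ tag ] = [];
    }
    else if ( line.trim() === '' ) {
      tag = null; // a blank line ends the current data item
    }
    else if ( tag !== null ) {
      fields[ tag ].push( line );
    }
  } );
  Object.keys( fields ).forEach( key => {
    fields[ key ] = fields[ key ].join( '\n' );
  } );
  return { molfile: molfile, fields: fields };
}

// Abbreviated sample record; coordinates are illustrative only.
const sampleSdf = [
  '962',
  '  illustrative header',
  '',
  '  3  2  0  0  0  0  0  0  0  0999 V2000',
  '    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0',
  '    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0',
  '   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0',
  '  1  2  1  0  0  0  0',
  '  1  3  1  0  0  0  0',
  'M  END',
  '> <PUBCHEM_COMPOUND_CID>',
  '962',
  '',
  '> <PUBCHEM_MOLECULAR_FORMULA>',
  'H2O',
  '',
  '$$$$'
].join( '\n' );

const records = parseSdfRecords( sampleSdf );
console.log( records[ 0 ].fields.PUBCHEM_MOLECULAR_FORMULA );
```

Keeping the molfile block as raw text leaves atom and bond parsing to a later step, which may make it easier to compare against whatever the legacy parser writes out.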

At this point, we need to filter out molecules that we don't want to build (for either pedagogical or memory reasons). MoleculeKitFilterer and MoleculeDuplicateNameFilter handle this for us.
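
The filtering step could look something like the sketch below in a JavaScript port. This is a guess based only on the filter names: the issue does not describe what MoleculeKitFilterer or MoleculeDuplicateNameFilter actually check, so the molecule shape (`{ name, atoms }`), both function names, the allowed-element list, and the case-insensitive name comparison are all assumptions.

```javascript
// Sketch only. Hypothetical molecule shape: { name: string, atoms: string[] }

// Keeps molecules built solely from the allowed element symbols.
function filterByAllowedElements( molecules, allowedElements ) {
  const allowed = new Set( allowedElements );
  return molecules.filter( molecule => molecule.atoms.every( symbol => allowed.has( symbol ) ) );
}

// Keeps the first molecule seen for each name (compared case-insensitively)
// and drops any later molecule with the same name.
function filterDuplicateNames( molecules ) {
  const seen = new Set();
  return molecules.filter( molecule => {
    const key = molecule.name.trim().toLowerCase();
    if ( seen.has( key ) ) {
      return false;
    }
    seen.add( key );
    return true;
  } );
}

// Illustrative input, not real sim data.
const candidates = [
  { name: 'Water', atoms: [ 'H', 'H', 'O' ] },
  { name: 'water', atoms: [ 'H', 'H', 'O' ] },
  { name: 'Silane', atoms: [ 'Si', 'H', 'H', 'H', 'H' ] }
];

const kept = filterDuplicateNames(
  filterByAllowedElements( candidates, [ 'H', 'C', 'N', 'O' ] )
);
console.log( kept.map( molecule => molecule.name ) );
```

Writing each filter as a function from a list to a list keeps them composable, so additional classifications identified by the design team could be added as further steps in the chain.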

The last step involves MoleculePreprocessing, which generates the structural format for our molecules in a serialized form. See Structure.txt.

Action Items:

  • Familiarize yourself with the intended formatting and expected input/output for each component mentioned above. Docs are provided in molecule-data-readme.txt

  • Determine if the steps provided in molecule-data-readme.txt are still accurate. Will these steps generate a usable data set for the ported sim?

  • If the legacy steps for data generation don't work as intended, then investigate the PubChem website for a modernized approach to importing the data set. This may require a new set of filters or a new post-processor. A good place to start would be here and, more generally, the PubChem site.

  • Confirm with the design team to ensure the filters are filtering out the correct data. There may be additional molecule classifications we don't want to feature. They should be identified and filtered as needed.

  • Work on porting the parser, filters, and post-processor tools into HTML5 code with support for ES6 modules.

@Denz1994
Contributor Author

Denz1994 commented May 15, 2020

Here is a zip file of the BAM legacy source code with the relevant content described above: build-a-molecule-java.zip
