Add cui2vec embeddings #25

Open
souravsingh opened this issue Apr 6, 2018 · 22 comments

Comments

@souravsingh

The embeddings for over 100k medical concepts, using data from 60 million patients, 1.7 million journal articles and 20 million notes, are up and available here: https://figshare.com/s/00d69861786cd0156d81

Explorer available here: http://ec2-52-14-191-192.us-east-2.compute.amazonaws.com:1234/

@piskvorky
Owner

Nice find!

@menshikh-iv
Contributor

Additional information:

@beamandrew

Hey this is my paper, how cool! I'd be happy to contribute these, let me know if they need any clean up first.

@menshikh-iv
Contributor

Oh, hi @beamandrew, glad to see you here! Please follow the instructions at https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model

@beamandrew

Will do! It might be a couple weeks until I can get it together. I'm teaching a deep learning class right now that won't end until May which keeps me pretty busy.

I'm actually having them use the embeddings from this repo in class to build an RNN (which is how I ended up finding this issue).

You can check it out here if you're interested:
https://colab.research.google.com/drive/1JsdhsiJQP5JPEEGWWFtOMpQajBj4w1KA

@menshikh-iv
Contributor

@beamandrew could you please give read access to [email protected]? I can't open your link (lack of permissions).

@beamandrew

Oops, try this link which should let you view: https://drive.google.com/file/d/1WuoHWf1KyFsNiilbVa7qnKkSDALfch01/view?usp=sharing

@matanox

matanox commented May 22, 2018

Last I checked, the actual concept names aren't included in this dataset and/or aren't under the same license, but they are available from a different source which looks legitimately released. I have, in fact, a task to correlate them. Without this correlation, the embeddings discussed here contain arbitrary codes instead of the original (concept) words that you see in the online demo.

@hscells

hscells commented May 23, 2018

I currently have some data, from the author of this publication (Section 2), that allows for the mapping @matanster describes.

If anyone is interested, I can upload a link to it, as I sit next to the author and he has given his permission (@jimmyoentung).

@piskvorky
Owner

piskvorky commented May 23, 2018

Thanks guys.

What we want is for users who download this dataset to be able to use it easily.

If the dataset requires users to jump through hoops, it's not a good fit for gensim-data. The experience of applying / using a dataset has to be streamlined and intuitive, including access and code (not just data). That is why we created this repo, and it's a mandatory part of each new contribution.

@hscells and @matanster what does this extra step mean for users? Can we somehow integrate it directly, so it's transparent to people who want to use cui2vec? Is it necessary?

@hscells

hscells commented May 23, 2018

The CUI in cui2vec stands for Concept Unique Identifier. A CUI is a single identifier shared by all of the synonymous strings for a particular medical concept.

The dataset which I described in my comment is a mapping of CUI to the most commonly used string in the UMLS Metathesaurus. One may simply replace the CUIs in the pre-trained vector file with terms from this mapping file (although I believe not all CUIs are mapped, because the semantic types of the strings were filtered in this particular dataset).

One may use QuickUMLS or MetaMap to map a term to a CUI, then use the method described above to map the CUI to the most commonly used term in UMLS or MetaMap.

I'm not exactly sure how the demo in the OP maps CUIs to strings, but I believe this is most likely how it is done. As for how it could be integrated, @piskvorky: the original data could be modified, or the mapping could be performed in a separate step. However, as I said, because the relationship between a CUI and the strings associated with that concept is one-to-many, the mapping would preferably be performed as two separate steps.
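
To make the relabelling idea concrete, here is a minimal sketch under stated assumptions. The CUIs, labels, and vectors are made-up toy data, and `label_for` and `relabel` are hypothetical helper names; a real mapping would have to come from a UMLS-derived file, which is not part of the cui2vec download. CUIs missing from the mapping keep their code, which is one way to handle the unmapped CUIs mentioned above.

```python
# Hypothetical toy mapping of CUI -> most commonly used term.
# Real entries would come from a UMLS-derived mapping file.
cui_to_term = {
    "C0000001": "example concept a",
    "C0000002": "example concept b",
}

# Toy embeddings keyed by CUI, standing in for rows of the cui2vec file.
cui_vectors = {
    "C0000001": [0.1, 0.2, 0.3],
    "C0000002": [0.4, 0.5, 0.6],
    "C0000003": [0.7, 0.8, 0.9],  # deliberately absent from the mapping
}


def label_for(cui, mapping):
    """Return the mapped term for a CUI, or the CUI itself if unmapped."""
    return mapping.get(cui, cui)


def relabel(vectors, mapping):
    """Build a new dict keyed by readable label instead of CUI."""
    relabelled = {}
    for cui, vector in vectors.items():
        label = label_for(cui, mapping)
        # If two CUIs map to the same label, fall back to the CUI
        # so that no vector is silently overwritten.
        if label in relabelled:
            label = cui
        relabelled[label] = vector
    return relabelled


labelled_vectors = relabel(cui_vectors, cui_to_term)
```

Keeping the original CUI-keyed data untouched and producing the labelled view as a separate step matches the two-step approach suggested above.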

@piskvorky
Owner

No problem, as long as the process is clearly described to users, and the dataset is ready to use out of the box.

@juancq

juancq commented Aug 6, 2018

Just curious, any progress on this issue?

@andresrosso

Hi, does anybody know if the 'cui2vec' dataset is available?
@souravsingh shared the vectors as a CSV, but I don't know how to load that in gensim and start using it.
Can anyone help me, or tell me when the dataset will be ready?

@andresrosso

> The embeddings for over 100k medical concepts, using data from 60 million patients, 1.7 million journal articles and 20 million notes, are up and available here: https://figshare.com/s/00d69861786cd0156d81
>
> Explorer available here: http://ec2-52-14-191-192.us-east-2.compute.amazonaws.com:1234/

@souravsingh can I load the CSV in gensim?

Can you tell me how to do that?

@beamandrew

Hi everyone,

I am lead author on this paper. Apologies for the radio silence on this request. We are currently working on a revision to the paper/approach that we hope to release this month. I will check back in and try to make it gensim compatible at that time.

@menshikh-iv
Contributor

menshikh-iv commented Dec 14, 2018

@juancq @andresrosso sorry for the wait, I can't say when this will be added.
BTW you can always load it manually (without api.load, just read the file from disk or S3).
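
As a rough illustration of loading the file manually, here is a minimal sketch that uses only the Python standard library. It assumes the CSV has a header row and that the first column holds the CUI, neither of which is confirmed at this point in the thread. The toy data and the names `read_embeddings`, `cosine`, and `most_similar` are hypothetical, made up for this example, and are not part of gensim.

```python
import csv
import io
import math


def read_embeddings(handle):
    """Read rows of 'key,v1,v2,...' (after a header row) into a dict."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row (assumed to be present)
    return {row[0]: [float(x) for x in row[1:]] for row in reader if row}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, key, topn=3):
    """Return the topn (key, similarity) pairs closest to the given key."""
    query = vectors[key]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != key
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy stand-in for the real file; with the real data this would be
# an open file handle instead of a StringIO.
toy_csv = io.StringIO(
    '"",V1,V2\n'
    "C0000001,1.0,0.0\n"
    "C0000002,0.9,0.1\n"
    "C0000003,0.0,1.0\n"
)
vectors = read_embeddings(toy_csv)
neighbours = most_similar(vectors, "C0000001", topn=2)
```

This brute-force search is only meant to show the idea; for the full set of vectors a vectorised library would be a more practical choice.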

@menshikh-iv
Contributor

@beamandrew great, thanks!

@prabhatM

Is there any model using SNOMED CT data?

@Dhanachandra

Dhanachandra commented Mar 18, 2019

> Hi everyone,
>
> I am lead author on this paper. Apologies for the radio silence on this request. We are currently working on a revision to the paper/approach that we hope to release this month. I will check back in and try to make it gensim compatible at that time.

Please share the source code for the evaluation metrics used in this work. I would like to evaluate my own embeddings trained on EHRs. Thanks in advance.

@kaushikacharya

kaushikacharya commented Sep 25, 2019

> Hi, does anybody know if the 'cui2vec' dataset is available?
> @souravsingh shared the vectors as a CSV, but I don't know how to load that in gensim and start using it.
> Can anyone help me, or tell me when the dataset will be ready?

@andresrosso
Here are the steps for loading cui2vec in gensim:

  1. Download the pre-trained embeddings from the download URL given at http://cui2vec.dbmi.hms.harvard.edu/

  2. Convert the embeddings to a text file in word2vec format, in two steps:

  • Load the CSV into a pandas DataFrame.

    import pandas as pd
    import numpy as np

    # The first CSV column holds the CUIs; use it as the index.
    cui2vec_df = pd.read_csv('cui2vec_pretrained.csv', index_col=0)

  • Write the embeddings (now in the DataFrame) to a text file.

    # word2vec text format: a "<rows> <dimensions>" header line, then one
    # line per CUI holding the CUI followed by its vector components.
    np.savetxt(
        'cui2vec_pretrained.txt',
        cui2vec_df.reset_index().values,
        delimiter=" ",
        header="{} {}".format(len(cui2vec_df), len(cui2vec_df.columns)),
        comments="",
        fmt=["%s"] + ["%.18e"] * len(cui2vec_df.columns),
    )

  3. Load the word vectors using gensim.models.keyedvectors.KeyedVectors.

    from gensim.models.keyedvectors import KeyedVectors

    word_vectors = KeyedVectors.load_word2vec_format('cui2vec_pretrained.txt', binary=False)

    # An example: list the CUIs most similar to C0034079
    word_vectors.most_similar('C0034079')

Source: https://stackoverflow.com/questions/46297740/how-to-turn-embeddings-loaded-in-a-pandas-dataframe-into-a-gensim-model (Ken Syme's answer)
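
For readers who want to see what the intermediate text file looks like, here is a small standard-library sketch of the word2vec text layout: a header line with the number of vectors and their dimensionality, followed by one space-separated line per key. The function name `write_word2vec_text` and the toy CUIs and values are hypothetical, chosen only for illustration.

```python
import io


def write_word2vec_text(vectors, handle):
    """Write {key: [floats]} to an open text handle in word2vec text format."""
    dims = {len(vec) for vec in vectors.values()}
    if len(dims) != 1:
        raise ValueError("all vectors must have the same length")
    # Header line: "<number of vectors> <dimensions>"
    handle.write("{} {}\n".format(len(vectors), dims.pop()))
    for key, vec in vectors.items():
        # The format is space-delimited, so a key containing spaces
        # would be misread as part of the vector.
        if " " in key:
            raise ValueError("keys must not contain spaces: {!r}".format(key))
        handle.write(key + " " + " ".join(repr(float(x)) for x in vec) + "\n")


# Toy data standing in for the real embeddings.
toy_vectors = {
    "C0000001": [0.1, 0.2, 0.3],
    "C0000002": [0.4, 0.5, 0.6],
}

buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
lines = buffer.getvalue().splitlines()
```

One consequence of the space-delimited layout is that CUIs work as keys unchanged, whereas multi-word concept names would need their spaces replaced (for example with underscores) before being written this way.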

@andresrosso

Great work, thanks a lot.
