really incorrect match_affil results #11
Comments
Hi @simonatdrg, yeah, totally agree: I was hard-coding Harvard when parsing it. I'll take a look and fix it in the coming weeks! (Sorry, I'm a little busy this week.)
Great – I look forward to it.
Here’s some background. We’re trying to use PubMed affiliation data as an aid to researcher name disambiguation (not just across PubMed author names, but also incorporating other sources such as physician lists and clinical trial data). We will also use things like the MeSH terms associated with a publication to refine the matching (e.g. a name associated with cardiac disease will most likely not match someone with the same name who is associated with dermatology publications, even if both work at the same organization).
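To make the MeSH idea concrete, here is a minimal sketch, assuming each author record carries a set of MeSH terms collected from their publications. The function name `mesh_overlap` and the example term sets are illustrative assumptions, not part of affiliation_parser or of our actual pipeline.

```python
def mesh_overlap(terms_a, terms_b):
    """Jaccard overlap between two collections of MeSH terms (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    # Two empty term sets carry no evidence either way, so score them 0.0.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical term sets for three author records sharing the same name.
cardiology_1 = {"Heart Failure", "Myocardial Infarction"}
cardiology_2 = {"Myocardial Infarction", "Hypertension"}
dermatology = {"Psoriasis", "Dermatitis, Atopic"}

print(mesh_overlap(cardiology_1, cardiology_2))  # some shared terms
print(mesh_overlap(cardiology_1, dermatology))   # no shared terms
```

A low overlap would count against merging two records even when the affiliation strings resolve to the same organization.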
Harvard is a good (and extreme) test case, as authors may have multiple affiliations (university / medical school / institute / teaching hospital), and one or more of these can occur in an affiliation string. I was attracted to the organization hierarchy present in the GRID dataset as a way to handle these.
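As a rough illustration of how a hierarchy could help, here is a minimal sketch that walks child-to-parent links up to a top-level organization. The `PARENT_OF` mapping, the "Example Institute" entry, and the function names are assumptions made up for this example; they are not read from grid.csv and do not reflect its actual column layout.

```python
# Hypothetical child -> parent links, for illustration only.
PARENT_OF = {
    "Harvard Medical School": "Harvard University",
    "Example Institute": "Harvard Medical School",
}


def lineage(org, parent_of):
    """Return [org, parent, grandparent, ...] by following parent links."""
    chain = [org]
    seen = {org}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:  # guard against cycles in bad data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


def same_top_level(a, b, parent_of):
    """True if both organizations roll up to the same top-level parent."""
    return lineage(a, parent_of)[-1] == lineage(b, parent_of)[-1]


print(lineage("Example Institute", PARENT_OF))
print(same_top_level("Harvard Medical School", "Harvard University", PARENT_OF))
```

With something like this, two affiliation strings that match different units could still be treated as the same top-level organization.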
Regards
…-Simon
From: Titipat Achakulvisut <[email protected]>
Reply-To: titipata/affiliation_parser <[email protected]>
Date: Thursday, March 15, 2018 at 10:54 AM
To: titipata/affiliation_parser <[email protected]>
Cc: "Rosenthal, Simon" <[email protected]>, Mention <[email protected]>
Subject: Re: [titipata/affiliation_parser] really incorrect match_affil results (#11)
Hi @simonatdrg, yeah, totally agree I was hard coding Harvard when parsing it.
I'll take a look at it and fix it by following weeks! (sorry, I'm a little busy for this week)
Hi @simonatdrg, sorry for the really late reply. I will work on this issue over the weekend. Hopefully that will solve most of the issues here.
Hi, I currently have a similar question. For example, when matching "Stanford University", it shows:
I have tried some other well-known universities (Harvard, Princeton, NYU, Columbia, Caltech, etc.) but the … Thank you!
Hi @mark-fangzhou-xie, yeah, I wrote this library a while ago and I really need to update the code in this repo. Currently, the matching is based on a nearest-neighbor algorithm. I will try to improve it over the next month if I have time.
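For readers unfamiliar with the idea, nearest-neighbor matching in general can be sketched as below. This is not the code in this repo: it assumes character trigram counts compared by cosine similarity, and the names `char_ngrams`, `nearest_institution`, `min_score` and the `CANDIDATES` list are all hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def nearest_institution(affiliation, institutions, min_score=0.0):
    """Return (best_name, score), or (None, score) if the best score is below min_score."""
    query = char_ngrams(affiliation)
    score, name = max((cosine(query, char_ngrams(inst)), inst) for inst in institutions)
    return (name, score) if score >= min_score else (None, score)


# Hypothetical candidate names, not loaded from grid.csv.
CANDIDATES = ["Stanford University", "Harvard University", "Harvard Medical School"]

print(nearest_institution("Harvard Medical School, Boston, MA", CANDIDATES))
```

One point the sketch makes visible: a nearest-neighbor search always returns some candidate, so without a cutoff such as `min_score` an unrelated string still gets "matched" to whatever happens to be closest.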
I created a set of affiliation strings from PubMed abstracts which all include 'Harvard' (University, Medical School, etc.) and ran them through match_affil after downloading the most recent grid.csv dataset (which has many entries for Harvard, including 'Harvard medical school'). The script, input file and results are attached.
You'll see that there are very few matches, and of those, quite a few are incorrect. I'm not an expert on the machine learning techniques involved. Can you explain and possibly suggest ways to improve these results?
Zip file with script, input data and outputs attached.
matchaffil_test.zip