really incorrect match_affil results #11

simonatdrg · 2018-03-15T14:06:21Z

I created a set of affiliation strings from Pubmed abstracts which all include 'Harvard' (University, medical school, etc.) and ran them through match_affil after downloading the most recent grid.csv dataset (which has many entries for Harvard, including 'Harvard medical school'). The script, input file and results are attached.

You'll see that there are very few matches, and of those quite a few are incorrect. I'm not an expert on the machine learning techniques involved. Can you explain, and possibly suggest ways to improve these results?
Zip file with script, input data and outputs attached.

matchaffil_test.zip

titipata · 2018-03-15T14:54:27Z

Hi @simonatdrg, yeah, totally agree. I was hard-coding Harvard when parsing it.

I'll take a look at it and fix it in the following weeks! (Sorry, I'm a little busy this week.)

simonatdrg · 2018-03-15T15:59:01Z

Great – I look forward to it. Here’s some background. We’re trying to use Pubmed affiliation data as an aid to researcher name disambiguation (not just across Pubmed author names, but also incorporating other sources such as physician lists and clinical trial data). We will also use things like the MeSH terms associated with a publication to refine the matching (e.g. a name associated with cardiac disease will most likely not be a match to someone with the same name but associated with dermatology publications, even if they both work at the same organization). Harvard is a good (and extreme) test case, as authors may have multiple affiliations (university / medical school / institute / teaching hospital), and one or more of these can occur in an affiliation string. I was attracted to the organization hierarchy present in the GRID dataset as a way to handle these. Regards

-Simon

titipata · 2018-07-27T04:36:42Z

Hi @simonatdrg, sorry for the really late reply. I will work on this issue over the weekend. Hopefully that will resolve most of the issues here.

fangzhou-xie · 2019-12-16T15:47:25Z

Hi, I currently have a similar question. For example, when matching "Stanford University", it returns:

(Pdb++) match_affil('Stanford University')
OrderedDict([('ID', 'grid.440952.e'), ('Name', 'University of Belize'), ('City', 'Belmopan'), ('State', ''), ('Country', 'Belize')])

I have tried some other well-known universities (Harvard, Princeton, NYU, Columbia, Caltech, etc.), and match_affil works just fine for those. I wonder if this result could be improved somehow?

Thank you!

titipata · 2019-12-16T16:21:54Z

Hi @mark-fangzhou-xie, yeah, I wrote this library a while ago and I really need to update the code in this repo. Currently, the matching uses a nearest-neighbor algorithm. I will try to improve it over the next month if I have time.
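
For readers unfamiliar with nearest-neighbor matching, here is a minimal sketch of the general idea and of one way it can return a wrong record. It is not the code in this repo: the character-trigram features, the Jaccard similarity, the toy candidate list, and the names `trigrams`, `jaccard`, `nearest_match` and `min_score` are all assumptions made for illustration. The point is that a nearest-neighbor lookup always has some "closest" record, so without a minimum-score cutoff a weak match can be returned as if it were a good one.

```python
def trigrams(text):
    """Return the set of character trigrams of a lowercased, padded string."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_match(query, candidates, min_score=0.5):
    """Return (best_name, score); best_name is None if score < min_score.

    Hypothetical helper for illustration, not part of affiliation_parser.
    """
    query_grams = trigrams(query)
    best_name, best_score = None, 0.0
    for name in candidates:
        score = jaccard(query_grams, trigrams(name))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score
    return best_name, best_score


# Toy candidate list for illustration only; not taken from grid.csv.
CANDIDATES = ["Stanford University", "University of Belize", "Harvard Medical School"]

# The right record is present, so it wins outright.
print(nearest_match("Stanford University", CANDIDATES))

# The right record is missing: with no cutoff, the closest remaining
# name is still returned because it shares the "university" trigrams.
incomplete = ["University of Belize", "Harvard Medical School"]
print(nearest_match("Stanford University", incomplete, min_score=0.0))

# With the default cutoff, the same weak match is rejected instead.
print(nearest_match("Stanford University", incomplete))
```

In this toy setup the wrong answer comes from the shared word "University" dominating the similarity. Whether that is what happens inside match_affil is not established in this thread, but returning no match below a score cutoff is a cheap guard either way.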
