Titipat Achakulvisut edited this page Feb 3, 2017 · 16 revisions

Affiliation Parser

This page documents the library and collects examples of how to run it.

Example

from affiliation_parser import parse_affil
parse_affil("Department of Health Science, Kochi Women's University, Kochi 780-8515, Japan. [email protected]")

Output is as follows

{'full_text': "Department of Health Science, Kochi Women's University, Kochi , Japan. ",
 'department': 'Department of Health Science',
 'institution': "Kochi Women's University",
 'location': 'Kochi , Japan',
 'country': 'japan',
 'zipcode': '780-8515',
 'email': '[email protected]'}
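
The result is a plain dictionary, so it can be post-processed with ordinary Python. The sketch below is only an illustration: the dictionary is copied from the output above (with the email left blank), and `non_empty_fields` and `tidy_location` are hypothetical helpers, not part of affiliation_parser.

```python
# Copied from the example output above; in practice this would be the
# return value of parse_affil(...). The email is left blank here.
parsed = {
    "full_text": "Department of Health Science, Kochi Women's University, Kochi , Japan. ",
    "department": "Department of Health Science",
    "institution": "Kochi Women's University",
    "location": "Kochi , Japan",
    "country": "japan",
    "zipcode": "780-8515",
    "email": "",
}


def non_empty_fields(result):
    """Hypothetical helper: keep only fields holding a non-blank string."""
    return {key: value for key, value in result.items() if value and value.strip()}


def tidy_location(location):
    """Hypothetical helper: remove stray spaces around commas in a location."""
    return ", ".join(part.strip() for part in location.split(",") if part.strip())


found = non_empty_fields(parsed)
print(sorted(found))
print(tidy_location(parsed["location"]))
```

The stray space in `'Kochi , Japan'` appears to be left behind where the zip code was removed, which is why a small clean-up step like `tidy_location` can be handy before displaying the value.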

Example with PySpark

Here is an example of using parse_affil with PySpark 2.1. I parsed MEDLINE article details using pubmed_parser and saved the resulting table to medline_lastview.parquet.

from affiliation_parser import parse_affil
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType

# Rename the existing `country` column to `country_medline` so that it does not clash with the `country` field returned by `parse_affil`
medline_df = sqlContext.read.parquet('medline_lastview.parquet').withColumnRenamed('country', 'country_medline')

schema = StructType([
    StructField("full_text", StringType(), False),
    StructField("department", StringType(), False), 
    StructField("institution", StringType(), False),
    StructField("location", StringType(), False),
    StructField("country", StringType(), False),
    StructField("zipcode", StringType(), False),
    StructField("email", StringType(), False)
])
udf_parse_affil = udf(parse_affil, schema)

medline_parsed_affil = medline_df.select('*', udf_parse_affil("affiliation").alias("affil_parsed"))
medline_parsed_affil = medline_parsed_affil.select("*" , "affil_parsed.full_text", "affil_parsed.department", 
                                                   "affil_parsed.institution", "affil_parsed.location", 
                                                   "affil_parsed.country", "affil_parsed.zipcode", "affil_parsed.email")
medline_parsed_affil.write.parquet('medline_parsed_affiliation.parquet')
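
The schema above declares every field as non-nullable, so the UDF should never hand Spark a `None` or a dictionary with missing keys. Whether `parse_affil` can actually do that for empty or unusual input is an assumption I have not verified; the sketch below shows one defensive wrapper. `FIELDS`, `make_safe_parser` and `fake_parse` are hypothetical names used for illustration only.

```python
# Field names taken from the schema above.
FIELDS = ("full_text", "department", "institution", "location",
          "country", "zipcode", "email")


def make_safe_parser(parse_fn):
    """Wrap parse_fn so it always returns all seven fields as strings."""
    def safe_parse(text):
        if not text:
            # Missing or empty affiliation: return empty strings, not None.
            return {name: "" for name in FIELDS}
        result = parse_fn(text) or {}
        # Replace missing keys and None values with empty strings.
        return {name: str(result.get(name) or "") for name in FIELDS}
    return safe_parse


# Stand-in parser for demonstration only; this is NOT the real parse_affil.
def fake_parse(text):
    return {"full_text": text, "country": None}


safe = make_safe_parser(fake_parse)
print(safe(None))
print(safe("Kochi, Japan"))
```

With the real library, the wrapped function would be registered in place of the bare one, for example `udf(make_safe_parser(parse_affil), schema)`.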

See more on UDFs with multiple outputs here

Summarize number of publications per country

Here we count the total number of publications per country and export the result for plotting.

df = spark.read.parquet('medline_parsed_affiliation.parquet')
df.createOrReplaceTempView('df_country')  # registerTempTable is deprecated since Spark 2.0

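# `n_publications` is the column name the R script below expects;
# `n_affiliations` (rows with a non-empty affiliation) is a name chosen here
count_country = spark.sql("""
select count(distinct pmid) as n_publications, count(if(full_text = '', NULL, 1)) as n_affiliations, country 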
from df_country 
group by country 
order by count(*) desc
""")

count_country_df = count_country.toPandas()
count_country_df.to_csv('publications_per_country.csv', index=False)
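
To make the two counts in the query above concrete, here is a minimal pure-Python sketch of the same aggregation on a handful of made-up rows. The rows and the `count_per_country` helper are hypothetical; they only mimic the `pmid`, `full_text` and `country` columns used by the query.

```python
from collections import defaultdict

# Made-up rows standing in for medline_parsed_affiliation.parquet.
rows = [
    {"pmid": "1", "full_text": "Dept A, Univ X, Japan", "country": "japan"},
    {"pmid": "1", "full_text": "Dept A, Univ X, Japan", "country": "japan"},  # repeated pmid
    {"pmid": "2", "full_text": "", "country": "japan"},                       # empty affiliation
    {"pmid": "3", "full_text": "Dept B, Univ Y, France", "country": "france"},
]


def count_per_country(records):
    """Return {country: (distinct pmids, rows with non-empty full_text)}."""
    pmids = defaultdict(set)
    with_text = defaultdict(int)
    for record in records:
        country = record["country"]
        pmids[country].add(record["pmid"])      # count(distinct pmid)
        if record["full_text"] != "":
            with_text[country] += 1             # count(if(full_text = '', NULL, 1))
    return {country: (len(ids), with_text[country]) for country, ids in pmids.items()}


counts = count_per_country(rows)
print(counts)
```

The first number ignores repeated `pmid` values, while the second counts every row whose parsed affiliation text is not empty, so the two can differ for the same country.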

We can then produce the plot using ggplot2 in R as follows

library(ggplot2)
library(ggthemes)
library(scales)

df = read.csv('publications_per_country.csv')

scientific_10 <- function(x) {
  parse(text=gsub("e", " %*% 10^", scientific_format()(x)))
}

pdf('medline_number_publications.pdf', width=6, height=8)
ggplot(df, aes(x = n_publications, y = reorder(country, n_publications))) + 
  geom_point() +
  scale_x_log10(labels=scientific_10, breaks=c(100, 1000, 10000, 100000, 1000000)) +
  ylab('') +
  xlab('Number of Publications') +
  theme_classic() +
  theme(axis.line.x = element_line(color="black", size = 0.5), 
        axis.line.y = element_line(color="black", size = 0.5))
dev.off()

Number of publications over time for selected countries

We can also look at the trend in the number of publications per country per year. As you can see, China and Korea have produced many more publications per year since 1980!

df = spark.read.parquet('medline_parsed_affiliation.parquet')
df_sel = df.selectExpr('cast(year as int) year', 'pmid', 'country')
df_sel.createOrReplaceTempView('df_country')  # registerTempTable is deprecated since Spark 2.0

count_country = spark.sql("""
select count(distinct pmid) as n_publications, country, year 
from df_country 
group by country, year 
order by count(*) desc
""")
count_country_df = count_country.toPandas()
count_country_df.to_csv('publications_per_country_year.csv', index=False)
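
Before moving to R, the exported file can be sanity-checked in plain Python. The sketch below applies the kind of filter used for the plot (a set of countries, a year window, rows with missing values dropped) to a small made-up CSV with the same columns. `sample_csv` and `select_rows` are hypothetical, and the `int(float(...))` conversion assumes pandas may write years as `1990.0` when some values are missing.

```python
import csv
import io

# Made-up content with the same columns as publications_per_country_year.csv.
sample_csv = """n_publications,country,year
120,japan,1990.0
95,china,2010
40,korea,1985
7,,2001
15,france,
"""


def select_rows(handle, countries, first_year, last_year):
    """Keep rows for the chosen countries with first_year <= year < last_year."""
    kept = []
    for row in csv.DictReader(handle):
        if not row["year"] or row["country"] not in countries:
            continue  # drop missing years and unselected or missing countries
        year = int(float(row["year"]))  # tolerate "1990.0" as well as "1990"
        if first_year <= year < last_year:
            kept.append((row["country"], year, int(row["n_publications"])))
    return kept


selected = select_rows(io.StringIO(sample_csv), {"japan", "china", "korea"}, 1987, 2016)
print(selected)
```

For the real file, the same function could be called on `open('publications_per_country_year.csv', newline='')` instead of the in-memory sample.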

After exporting the dataframe to a CSV file, we can use ggplot2 to plot it.

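# Assumes ggplot2, ggthemes, scales and scientific_10() from the previous script are already loaded
countries <- c('united states of america', 'japan', 'germany', 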
               'united kingdom', 'china', 'france', 'italy', 'canada', 'australia', 
               'spain', 'netherlands', 'korea', 'sweden')

df = read.csv('publications_per_country_year.csv')
df = na.exclude(df[df$year >= 1987 & df$year < 2016 & df$country %in% countries, ])

pdf('medline_number_publications_year.pdf', width=9, height=5)

ggplot(df, aes(x = year, y = n_publications, color = reorder(country, -n_publications))) + 
  geom_line(size=1.1) + 
  scale_y_log10(labels=scientific_10, breaks=c(100, 1000, 10000, 100000)) + 
  theme_minimal() +
  xlab('Year') +
  ylab('Number of Publications') +
  # scale_colour_hue(name = 'Country', palette = "Greens" ) + 
  # scale_colour_hue(l=70, c=30, name = 'Country') +
  scale_colour_hue(name = 'Country', h=c(0, 270)) + 
  theme(axis.line.x = element_line(color="black", size = 0.5), 
        axis.line.y = element_line(color="black", size = 0.5))
dev.off()