Home
Titipat Achakulvisut edited this page Feb 3, 2017 · 16 revisions
This page documents the library and collects examples of how to run it.
from affiliation_parser import parse_affil
parse_affil("Department of Health Science, Kochi Women's University, Kochi 780-8515, Japan. [email protected]")
The output is as follows:
{'full_text': "Department of Health Science, Kochi Women's University, Kochi , Japan. ",
 'department': 'Department of Health Science',
 'institution': "Kochi Women's University",
 'location': 'Kochi , Japan',
 'country': 'japan',
 'zipcode': '780-8515',
 'email': '[email protected]'}
Here is an example of using parse_affil with PySpark 2.1. I parsed MEDLINE article details using pubmed_parser and saved the table to medline_lastview.parquet.
from affiliation_parser import parse_affil
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType

# here we change the column name to `country_medline` so that it doesn't clash with the `country` output from `parse_affil`
medline_df = sqlContext.read.parquet('medline_lastview.parquet').withColumnRenamed('country', 'country_medline')

# one string field per key returned by `parse_affil`
schema = StructType([
    StructField("full_text", StringType(), False),
    StructField("department", StringType(), False),
    StructField("institution", StringType(), False),
    StructField("location", StringType(), False),
    StructField("country", StringType(), False),
    StructField("zipcode", StringType(), False),
    StructField("email", StringType(), False)
])
udf_parse_affil = udf(parse_affil, schema)

# parse the `affiliation` column into a struct, then expand the struct into top-level columns
medline_parsed_affil = medline_df.select('*', udf_parse_affil("affiliation").alias("affil_parsed"))
medline_parsed_affil = medline_parsed_affil.select("*", "affil_parsed.full_text", "affil_parsed.department",
                                                   "affil_parsed.institution", "affil_parsed.location",
                                                   "affil_parsed.country", "affil_parsed.zipcode", "affil_parsed.email")
medline_parsed_affil.write.parquet('medline_parsed_affiliation.parquet')
See more on UDFs with multiple outputs here.
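Every field in the schema above is declared non-nullable, so it may be safer to wrap the parser so that each of the seven keys is always present and holds a string. The following is a minimal sketch under that assumption. `make_safe_parser` and `AFFIL_FIELDS` are hypothetical names that are not part of affiliation_parser, and `stub_parser` stands in for `parse_affil` so the snippet runs without the library installed.

```python
AFFIL_FIELDS = ("full_text", "department", "institution",
                "location", "country", "zipcode", "email")


def make_safe_parser(parser):
    """Wrap `parser` so the result always has every schema field as a string."""
    def safe_parse(text):
        # Missing or empty affiliations are treated as an empty parse result.
        parsed = parser(text) if text else {}
        # Absent keys and None values both become empty strings.
        return {field: str(parsed.get(field) or "") for field in AFFIL_FIELDS}
    return safe_parse


def stub_parser(text):
    # Stand-in for parse_affil (assumption: the real parser returns a dict).
    return {"full_text": text, "country": "japan", "zipcode": None}


safe_parse = make_safe_parser(stub_parser)
print(safe_parse("Kochi Women's University, Japan"))
print(safe_parse(None))
```

With PySpark, the wrapped function would presumably be passed to `udf` in place of the bare parser, for example `udf(make_safe_parser(parse_affil), schema)`.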
Next, we can count the total number of publications per country and export the result for plotting.
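df = spark.read.parquet('medline_parsed_affiliation.parquet')
df.registerTempTable('df_country')
# `n_publications` is the column name the R plotting code expects;
# the second aggregate counts rows whose parsed `full_text` is not an empty string
count_country = spark.sql("""
    select count(distinct pmid) as n_publications,
           count(if(full_text = '', NULL, 1)) as n_nonempty_affiliations,
           country
    from df_country
    group by country
    order by count(*) desc
""")
count_country_df = count_country.toPandas()
count_country_df.to_csv('publications_per_country.csv', index=False)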
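To make clear what the two aggregates in the query compute, here is a small pure-Python sketch on made-up rows (the records are illustrative, not MEDLINE data). Per country, it counts the distinct pmid values and, separately, the rows whose full_text is not an empty string.

```python
from collections import defaultdict

# Made-up rows for illustration only: (pmid, country, full_text)
rows = [
    (1, "japan", "Dept A, Univ X, Japan"),
    (1, "japan", "Dept B, Univ X, Japan"),  # same article, second row
    (2, "japan", ""),                       # empty parsed affiliation
    (3, "france", "Univ Y, France"),
]

pmids = defaultdict(set)      # country -> distinct pmids
non_empty = defaultdict(int)  # country -> rows with non-empty full_text

for pmid, country, full_text in rows:
    pmids[country].add(pmid)
    if full_text != "":
        non_empty[country] += 1

# (distinct publications, rows with a non-empty affiliation) per country
summary = {country: (len(pmids[country]), non_empty[country]) for country in pmids}
print(summary)
```

The duplicated pmid 1 is counted once as a publication, while both of its rows count toward the non-empty total, which is why the two numbers can differ for a country.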
We can then produce the plot using ggplot2 in R as follows:
library(ggplot2)
library(ggthemes)
library(scales)
df = read.csv('publications_per_country.csv')
# format axis labels as powers of ten, e.g. 1e+05 is rendered as 1 x 10^5
scientific_10 <- function(x) {
  parse(text=gsub("e", " %*% 10^", scientific_format()(x)))
}
pdf('medline_number_publications.pdf', width=6, height=8)
ggplot(df, aes(x = n_publications, y = reorder(country, n_publications))) +
  geom_point() +
  scale_x_log10(label=scientific_10, breaks=c(100, 1000, 10000, 100000, 1000000)) +
  ylab('') +
  xlab('Number of Publications') +
  theme_classic() +
  theme(axis.line.x = element_line(color="black", size = 0.5),
        axis.line.y = element_line(color="black", size = 0.5))
dev.off()
Number of publications over time for selected countries
We can also look at the trend in the number of publications per country per year. As you can see, China and Korea have produced many more publications per year since 1980!
df = spark.read.parquet('medline_parsed_affiliation.parquet')
df_sel = df.selectExpr('cast(year as int) year', 'pmid', 'country')
df_sel.registerTempTable('df_country')
count_country = sqlContext.sql("""
select count(distinct pmid) as n_publications, country, year
from df_country
group by country, year
order by count(*) desc
""")
count_country_df = count_country.toPandas()
count_country_df.to_csv('publications_per_country_year.csv', index=False)
After exporting the dataframe to a CSV file, we can use ggplot2 to plot it.
# reuses the libraries and the `scientific_10` helper from the previous R snippet
countries <- c('united states of america', 'japan', 'germany',
               'united kingdom', 'china', 'france', 'italy', 'canada', 'australia',
               'spain', 'netherlands', 'korea', 'sweden')
# read the per-country, per-year export written by the Spark step above
df = read.csv('publications_per_country_year.csv')
df = na.exclude(df[df$year >= 1987 & df$year < 2016 & df$country %in% countries, ])
pdf('medline_number_publications_year.pdf', width=9, height=5)
ggplot(df, aes(x = year, y = n_publications, color = reorder(country, -n_publications))) +
  geom_line(size=1.1) +
  scale_y_log10(label=scientific_10, breaks=c(100, 1000, 10000, 100000)) +
  theme_minimal() +
  xlab('Year') +
  ylab('Number of Publications') +
  # scale_colour_hue(name = 'Country', palette = "Greens" ) +
  # scale_colour_hue(l=70, c=30, name = 'Country') +
  scale_colour_hue(name = 'Country', h=c(0, 270)) +
  theme(axis.line.x = element_line(color="black", size = 0.5),
        axis.line.y = element_line(color="black", size = 0.5))
dev.off()
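For readers who prefer to stay in Python, the row filter from the R snippet above (years from 1987 up to but not including 2016, selected countries, no missing values) can be sketched with the standard library. The CSV content below is made up for illustration, and the country set is shortened for brevity; only the column names follow the exported file.

```python
import csv
import io

countries = {"japan", "china", "korea"}  # shortened list, for illustration

# Made-up CSV content with the same columns as the exported file.
sample = io.StringIO(
    "n_publications,country,year\n"
    "120,japan,1990\n"
    "15,china,1986\n"
    "300,china,2010\n"
    "42,brazil,2000\n"
    "7,korea,\n"
)

kept = []
for row in csv.DictReader(sample):
    if not row["year"]:  # drop rows with a missing year
        continue
    year = int(row["year"])
    if 1987 <= year < 2016 and row["country"] in countries:
        kept.append((row["country"], year, int(row["n_publications"])))

print(kept)
```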