-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathanalysis.py
151 lines (108 loc) · 4.41 KB
/
analysis.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
# %%[markdown]
# This file is created so that it can be run interactively using vscode or pycharm.
# It is possible to run this whole file as a script as well.
# Using the `Python: Export Current Python File as Jupyter Notebook` on the command palette,
# this `.py` file can be converted into a jupyter noteobok.
## Introduction
## Methodology
# Data was scraped from [NMC](nmc.org.np) website using python (request, beautifulsoup).
# Then it was added to a sqlite database. The insert technique used `concurrent.futures`
# for multithreading (to parallelize the write to the database).
# This analysis is written on a notebook style interactive document.
# It uses pandas for data manupulation and plotting
## Analysis
# Men outnumber women in medical profession in Nepal. If divided between different practices (degrees),
# except for BDS, the dental professionals, women outnumber men.
# In MBBS, a general practice doctor, and MD, specialized practice, men outnumber
# more than 2:1
# Almost a third (7k+) of the total doctors call Kathmandu Valley (Kathmandu, Bhaktapur, Lalitpur)
# About 118 of them did not have a location listed whereas another 117 or so
# did not have district name listed (abbreviated, house number, city/locality, etc.)
# %%[code]
# import libraries
import pandas as pd
import sqlite3
# connect to sqlite
from numpy import int64
with sqlite3.connect("data/nmc.db") as conn:
doctors = pd.read_sql("select * from doctors", conn)
# %%
total_doctors = len(doctors)
print(f'There are {total_doctors} doctors registered through Nepal Medical Council')
# %%
# first 5 rows of data
print(doctors.head())
# dimension of the dataframe
print(doctors.shape)
# describes the summary statistics of numeric columms
print(doctors.describe())
# column names in the dataframe
print(doctors.columns)
# data type in each columns
print(doctors.dtypes)
# %%
doctors.describe(include=[object])
# %%
# Nepal recognizes third gender
print(doctors.gender.value_counts())
print(doctors['degree'].value_counts())
# %%
doctors.set_index('nmc_number', inplace=True)
# %%
# plotting
doctors.groupby(['gender',"degree"])['gender'].count().unstack(0).plot(kind='bar')
# %%[markdown]
# Additional ideas for analysis
# - Location based analysis. What locations they come from
# - we can take out the districts maybe?
# - Name based analysis
# - Frequent first names
# - Frequent last names
# - Match last name with caste/ ethnicity, etc?
# %%
# location based analysis
clean = lambda x: str(x).strip(',. \n\t').replace(" ","").lower().split()
doctors['district'] = doctors['location'].\
apply(lambda x: x.strip(',.!? \n\t').lower().split(',')[-1].strip())
# %%
# counts = pd.DataFrame(doctors['location'].\
# apply(lambda x: x.strip(',.!? \n\t').lower().split(',')[-1].strip()).\
# value_counts())
#%%
# name based analyses
doctors['full_name'].\
apply(lambda x: x.lower().replace('dr. ','').split()[0]).\
value_counts().head(30).\
plot(kind = 'bar')
#%% [markdown]
# Next, try to analyze the demographic mix of doctors based on caste.
# While we do not have caste listed on our data, Nepali surname is a pretty
# good indicator of their caste. While some surnames are shared over multile castes/ethnicities,
# we will not dive into the details. This is a mere attempt at trying to see if such join is possible.
# The joined data is from the internet, so I cannot comment on it, but it looked complete (ish) and accurate (ish)
# from my novice eyes
#%%
doctors['last_name'] = doctors['full_name'].\
apply(lambda x: x.lower().replace('dr. ','').split()[-1])
# %%
# trying to convert caste.txt into a legible dataframe
with open('data/caste.txt', 'r') as caste:
next(caste)
next(caste)
caste = caste.readlines()
# with open('writecaste.csv','w') as writefile:
clean_cast =[line.strip().split(".")[1].strip().lower().split('>', maxsplit=3)
for line in caste]
# %%
caste_df = pd.DataFrame(clean_cast, columns= None)
caste_cols = ['surname', 'caste_ethnicity', 'subclass', 'extra']
caste_df.columns = caste_cols
caste_df['surname']= caste_df['surname'].str.strip()
caste_df.iloc[:,0] = caste_df.iloc[:,0].str.strip()
print(caste_df.head())
# %%
merged_doctors = doctors.merge(caste_df.drop_duplicates(\
subset='surname',keep='last'),\
left_on='last_name', right_on='surname')
merged_doctors.caste_ethnicity.value_counts().plot(kind ='bar')
# %%