We already have an example of procurement data - from the World Bank. @dannyparsons do you think there may now be an updated version of these data? So that is one example in this area.
Danny has now sent another set and I suggest this is perfect for the new theme of support for data science. I describe the set briefly and then have quite a lot of questions.
"It" comes in 3 (alternative?) forms
And the bottom of the list of sets to download is as follows:
a) I first downloaded the whole CSV set (9.2 GB). This is 12 files, one for each year. I was able to import one year (2016), which is one of the largest, into R-Instat. It has about 9 million rows and 130 variables. It is for all countries, so it is easy to compare countries in a single year. It is harder to look at changes between one year and the next.
b) I then chose one country (Spain) and downloaded their data. Again 12 files. This time I was able to read them all together into R-Instat (great!). They range from about 30,000 records in one year up to 450,000 in 2019. They each have 130 variables. I tried appending them, and this should work, but some variables are of different types in the different years.
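One way round the type mismatch when appending might be to read every year as text first, append, and only then convert each variable once for the whole set. Below is a minimal sketch of that idea in Python, using only the standard library. The two small in-memory "files" and the column names `tender_id` and `tender_value` are made up for illustration and are not taken from the real data.

```python
import csv
import io

def read_as_text(handle, year):
    """Read one yearly CSV, keeping every value as a string and tagging the year."""
    rows = []
    for row in csv.DictReader(handle):
        row["source_year"] = year
        rows.append(row)
    return rows

def convert_column(rows, name, caster):
    """Convert one column in place; blank or unparseable values become None."""
    for row in rows:
        value = row.get(name)
        if value is None or value.strip() == "":
            row[name] = None
            continue
        try:
            row[name] = caster(value)
        except ValueError:
            row[name] = None

# Hypothetical stand-ins for two yearly files: one year looks like integers,
# the other has decimals and a blank, so automatic type detection would disagree.
file_2018 = io.StringIO("tender_id,tender_value\nA1,1000\nA2,2500\n")
file_2019 = io.StringIO("tender_id,tender_value\nB1,1200.50\nB2,\n")

# Append first (everything is still text), then convert once for all years.
combined = read_as_text(file_2018, 2018) + read_as_text(file_2019, 2019)
convert_column(combined, "tender_value", float)
```

The same order of operations (import as text, append, then set the type) should carry over to other tools, though I have not checked how R-Instat itself handles this.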
I also downloaded a JSON file, but I don't know how to import it into R-Instat. I chose Malta as a small example. That's a single file.
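I have not looked at how the JSON file is structured, so the following is only a sketch of the usual approach: load the records, then flatten nested fields into columns so that each record becomes one row of a table. The sample record and its field names (`id`, `buyer`, `lots`) are invented for illustration. Nested lists are kept as JSON text here rather than expanded into extra rows.

```python
import json

def flatten(record, prefix=""):
    """Turn a nested dict into a flat dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, e.g. buyer -> buyer.name
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            # Keep lists as JSON text so no information is lost
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat

# Hypothetical sample; the real file's structure may differ.
sample = '[{"id": "MT-1", "buyer": {"name": "X", "city": "Valletta"}, "lots": [{"n": 1}]}]'
records = json.loads(sample)
rows = [flatten(r) for r in records]
```

Each entry in `rows` is then a flat record that could be written out as CSV and imported in the usual way.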
These look to be an excellent example for showing data and data science. The work on this task divides into 2 clear parts. The first has been the web-scraping and then tidying of the data. As shown in the figures above, this has been a lot of work. And some of the variables are clearly already derived from others. In particular there are many tender indicators in the data:
a) As statisticians, I suggest we need to learn a little more about the initial work that produced these files. We don't need to get into methodological or computational details that the team may wish to keep to themselves. We just need enough to know about the completeness and quality of the data.
b) Then I would like to understand what is currently done to process these data. It may be useful to do more data tidying, perhaps including changing the long variable names into short names plus labels.
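To make the "short names plus labels" idea concrete, here is a rough sketch in Python. It abbreviates each word of a long variable name to three letters, adds a number if two names collide, and keeps the full wording as the label. The abbreviation rule and the example variable names are my own assumptions for illustration, not the names used in the files.

```python
import re

def make_short_names(long_names, part_len=3):
    """Map each long name to a unique short name plus a readable label."""
    mapping = {}
    used = set()
    for long_name in long_names:
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", long_name) if p]
        base = "_".join(p[:part_len].lower() for p in parts)
        short = base
        n = 2
        while short in used:
            # Disambiguate collisions with a numeric suffix
            short = f"{base}{n}"
            n += 1
        used.add(short)
        mapping[long_name] = {"short": short, "label": " ".join(parts)}
    return mapping

# Hypothetical long names, chosen so that two of them collide.
names = [
    "tender_indicator_integrity_single_bid",
    "tender_indicator_integration_single_bid",
    "tender_final_price",
]
mapping = make_short_names(names)
```

A fixed lookup table agreed with the data team would probably be better than an automatic rule, but this shows the shape of the result: one short name and one label per variable.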