Data Science for Development #7961

rdstern · 2022-11-12T09:47:44Z

Following our meeting I was asked to suggest the books to recommend.
I am starting with one book, namely Introduction to Data Science by Rafael A. Irizarry.

This is just an elementary book, but I still suggest it can be the main book for the course. Given the proposed audience, and the starting skill-sets of some students, we need to include an elementary book somewhere. I suggest that basing the whole course largely on this book, and then introducing other books that cover particular areas in more depth, is a reasonable way to go.

It has six sections and I further propose that each section could lead to one-or-more course modules. The sections are as follows:

R,
Data Visualization,
Data Wrangling,
Statistics with R,
Machine Learning,
Productivity Tools.

So here we go on the course modules:

R: I propose 2 or 3 optional modules. Students should choose 1 to be assessed, so maybe one of them is compulsory?
For those who would like to use R, there is a programming in R course. Maybe a whole course is not needed, in which case it could include some of the productivity tools from part 6 of the book.
This should be at the start of the programme, so students can, if they wish, use R (with RStudio) for the course.
They may, instead, use R-Instat for many of the course modules, but we would expect them to use R or Python (instead of, or in addition to, R-Instat) on their project.
So later, in semester 2 or 3, there is an R through R-Instat course for those who would like that, and there is also a Python for data science course for those who prefer to use Python.
Data Visualization: That's descriptive statistics for us. The book only covers ggplot2, and we also need to include tabulation. This is where we remind people that (at least for development) large-scale surveys are still routinely collected and analysed. Being able to analyse the MICS surveys, etc. would be covered here. As it is data science, they don't need to design a survey, but they do need to be able to process the data. We might still call the module data visualisation, as it sounds better than descriptive statistics, but that's what it would be. We could include climatic data and producing PICSA-type graphs, etc. there too. I assume this would be a compulsory module.
Data Wrangling. That's a nice title for a compulsory module.
Statistics with R. We might have up to 3 modules here. Statistical Methods and perhaps Statistical Models 1 may be compulsory. I suspect that we may want Statistical Models 2 as well, but let's see. This topic is covered very superficially in the book. I suggest Statistical Methods could introduce the different methodologies - frequentist, Bayesian, randomisation - and might even be limited to relatively small problems. Statistical Models 1 might even be devoted largely to generalised linear models? We need more thought here. Maybe Models 1 is regression and Models 2 is classification. I am assuming we can't leave classification to the machine learning modules.
Machine Learning. This will probably be at least 2 modules, with one being compulsory. We probably leave splitting data and cross-validation for these modules, rather than covering them in the modelling above.
Do we want an optional module on what might be called bigger data? I hesitate to call it big data, because what's defined as big might change in a few years?
And I am not sure what's in the productivity tools part. I hope we might include that in the programming module.
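
The "splitting data and cross-validation" idea mentioned under Machine Learning can be sketched in a few lines. This is only an illustration, assuming plain Python with no extra packages; the function name `k_fold_indices` and the toy size of 10 rows are made up for the example and are not taken from the book.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Return k (train, test) pairs of row indices that partition range(n_rows)."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Slicing with a step of k deals the shuffled rows into k near-equal folds.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # Every fold except the i-th is used for fitting.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        splits.append((train, test))
    return splits


# Hypothetical toy example: 10 rows, 5 folds.
splits = k_fold_indices(n_rows=10, k=5)
for train, test in splits:
    print(sorted(test), "held out;", len(train), "rows used for fitting")
```

Each row is held out exactly once, so a model is always assessed on data it was not fitted to. In a real module we would more likely use an existing package for this than write it by hand.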

I suggest that others, perhaps particularly @volloholic, @jkmusyoka and @lilyclements, might comment next. If the proposal stands up, then one aspect I am unclear about is the sort of projects students might do, and an initial list could be useful. Perhaps @volloholic could easily start on this?

Then there will be further books, etc. on particular sections, and I will look further. In addition, I think we will need to have our own material, at least drafted for some parts, by then?

rdstern · 2022-11-19T08:48:06Z

In #7966 I mention possible content of 2-3 initial modules for the Data Science for development course.
One was on data and considered various sources of data. We might also distinguish between data in statistics and data in data science? What's different in the two areas? We are thinking here of typical data sets for statistics training and application that may not depend on automatic data collection. They don't come from the internet, or from satellites, or from automatic weather stations, etc.
Experiments, surveys and routine data.
Then automation: The time periods between observations become shorter. Volume becomes greater - sometimes far greater.

Tweets from Donald Trump - that is only one person - now examine and check all the tweets from all the users!
Loyalty cards in stores: all the shopping habits of all the customers.

Objectives are sometimes different. Data science (more than statistics) may aim to contact individual customers with appropriately customised messages.

Automation will always have problems and we often like the personal touch. But with the huge volumes of data and personal objectives, the analyses must be automated - machine learning is obviously what needs to be used!

An important point David raised is that our course - for development - will include spreadsheets! Not only are we proposing to use a GUI for R; we will also be using Excel and/or OpenOffice. There are multiple reasons:
a) Spreadsheets are often used - and often used badly - so we have to learn to use them well.
b) Professionals in data science for development have to be able to work in a team. Other team members may be comfortable with a spreadsheet. They could also see what's happening in a GUI for R much better than from the code. Hence, in team work you can be working with other team members, not simply working for them and quietly doing magic that others cannot question.
c) Why else???

rdstern · 2023-01-01T22:05:31Z

At the last meeting I was given the task of finding books for the course.

I have mentioned the Introduction to Data Science book elsewhere. It is only introductory, but I like it a lot and propose that graduates should (at least) know all of it.

This site discusses 5 free books on statistics for data science

The statistics section is not so strong in the introductory book. Among those 5 free books, one seems outstanding and is written by current giants in the subject.

It also has a sort of history of the subject that puts the down-from-the-mountains ideas in a much more modern perspective.

The second book is called Think Stats. It is much simpler. It could be useful for us because it is based on Python. I would like to get a copy of the data they use through the book, which needs Python. But then it could be interesting to give that course material the R-Instat treatment, i.e. can you start with R-Instat and then learn Python for statistics later?
