% Options for packages loaded elsewhere
\PassOptionsToPackage{unicode}{hyperref}
\PassOptionsToPackage{hyphens}{url}
\PassOptionsToPackage{dvipsnames,svgnames,x11names}{xcolor}
%
\documentclass[
krantz2]{krantz}
\usepackage{amsmath,amssymb}
\usepackage{lmodern}
\usepackage{iftex}
\ifPDFTeX
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provide euro and other symbols
\else % if luatex or xetex
\usepackage{unicode-math}
\defaultfontfeatures{Scale=MatchLowercase}
\defaultfontfeatures[\rmfamily]{Ligatures=TeX,Scale=1}
\fi
% Use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\IfFileExists{microtype.sty}{% use microtype if available
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\makeatletter
\@ifundefined{KOMAClassName}{% if non-KOMA class
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}}
}{% if KOMA class
\KOMAoptions{parskip=half}}
\makeatother
\usepackage{xcolor}
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{248,248,248}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.33,0.33,0.33}{#1}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textbf{\textit{#1}}}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.61,0.61,0.61}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.06,0.06,0.06}{#1}}
\newcommand{\BuiltInTok}[1]{#1}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.5,0.5,0.5}{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textbf{\textit{#1}}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0,0,0}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.27,0.27,0.27}{\textbf{#1}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.27,0.27,0.27}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.06,0.06,0.06}{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textbf{\textit{#1}}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.14,0.14,0.14}{\textbf{#1}}}
\newcommand{\ExtensionTok}[1]{#1}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.06,0.06,0.06}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0,0,0}{#1}}
\newcommand{\ImportTok}[1]{#1}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textbf{\textit{#1}}}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.27,0.27,0.27}{\textbf{#1}}}
\newcommand{\NormalTok}[1]{#1}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.43,0.43,0.43}{\textbf{#1}}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textit{#1}}}
\newcommand{\RegionMarkerTok}[1]{#1}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0,0,0}{#1}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.5,0.5,0.5}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.5,0.5,0.5}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0,0,0}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.5,0.5,0.5}{#1}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.37,0.37,0.37}{\textbf{\textit{#1}}}}
\usepackage{longtable,booktabs,array}
\usepackage{calc} % for calculating minipage widths
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{5}
\usepackage{booktabs}
\usepackage{longtable}
\usepackage[bf,singlelinecheck=off]{caption}
\captionsetup[table]{labelsep=space}
\captionsetup[figure]{labelsep=space}
\usepackage[scale=.8]{sourcecodepro}
\usepackage{framed,color}
\definecolor{shadecolor}{RGB}{248,248,248}
\renewcommand{\textfraction}{0.05}
\renewcommand{\topfraction}{0.8}
\renewcommand{\bottomfraction}{0.8}
\renewcommand{\floatpagefraction}{0.75}
\renewenvironment{quote}{\begin{VF}}{\end{VF}}
% \let\oldhref\href
% \renewcommand{\href}[2]{#2\footnote{\url{#1}}}
\makeatletter
\newenvironment{kframe}{%
\medskip{}
\setlength{\fboxsep}{.8em}
\def\at@end@of@kframe{}%
\ifinner\ifhmode%
\def\at@end@of@kframe{\end{minipage}}%
\begin{minipage}{\columnwidth}%
\fi\fi%
\def\FrameCommand##1{\hskip\@totalleftmargin \hskip-\fboxsep
\colorbox{shadecolor}{##1}\hskip-\fboxsep
% There is no \\@totalrightmargin, so:
\hskip-\linewidth \hskip-\@totalleftmargin \hskip\columnwidth}%
\MakeFramed {\advance\hsize-\width
\@totalleftmargin\z@ \linewidth\hsize
\@setminipage}}%
{\par\unskip\endMakeFramed%
\at@end@of@kframe}
\makeatother
\renewenvironment{Shaded}{\begin{kframe}}{\end{kframe}}
\usepackage{makeidx}
\makeindex
% \urlstyle{tt}
\usepackage{amsthm}
\makeatletter
\def\thm@space@setup{%
\thm@preskip=8pt plus 2pt minus 4pt
\thm@postskip=\thm@preskip
}
\makeatother
\frontmatter
\ifLuaTeX
\usepackage{selnolig} % disable illegal ligatures
\fi
\usepackage[]{natbib}
\bibliographystyle{apalike}
\IfFileExists{bookmark.sty}{\usepackage{bookmark}}{\usepackage{hyperref}}
\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
\urlstyle{same} % disable monospaced font for URLs
\hypersetup{
pdftitle={Hands-on Data Science for Librarians},
pdfauthor={Sarah Lin \& Dorris Scott},
colorlinks=true,
linkcolor={Maroon},
filecolor={Maroon},
citecolor={Blue},
urlcolor={Blue},
pdfcreator={LaTeX via pandoc}}
\title{Hands-on Data Science for Librarians}
\author{Sarah Lin \& Dorris Scott}
\date{2022-09-21}
\begin{document}
\maketitle
% you may need to leave a few empty pages before the dedication page
%\cleardoublepage\newpage\thispagestyle{empty}\null
%\cleardoublepage\newpage\thispagestyle{empty}\null
%\cleardoublepage\newpage
\thispagestyle{empty}
\begin{center}
To my son,
without whom I should have finished this book two years earlier
%\includegraphics{images/dedication.pdf}
\end{center}
\setlength{\abovedisplayskip}{-5pt}
\setlength{\abovedisplayshortskip}{-5pt}
{
\hypersetup{linkcolor=}
\setcounter{tocdepth}{2}
\tableofcontents
}
\listoffigures
\hypertarget{preface}{%
\chapter*{Preface}\label{preface}}
Resources to learn R are all over the internet and most libraries. However, easy access to resources doesn't mean it's easy to learn to do data science in R. This book spends time on an introduction to R and basic data cleaning tasks that are taught elsewhere because we want to provide a gentle, low-stress introduction to key aspects of data science using R. Librarians have varied backgrounds, but for most of us, rigorous education in mathematics, statistics, and computer science is not part of our expertise. That doesn't mean we can't learn to code or do data science in code. Based on our own experiences, we are particularly concerned that you, our reader, are able to access the content in this book with minimal frustration, exasperation, and despair.
The resources at the end of each chapter, in the appendix, and in the bibliography of this book will provide you with next steps to further your data science skills beyond this introductory text. With a basic foundation in data science skills, any of the resources we link to should be comprehensible, if challenging. We wish you well on your data science journey!
\hypertarget{what-youll-need}{%
\section*{What you'll need}\label{what-youll-need}}
Access to a personal computer (desktop or laptop) with permission to install programs, such as the Chrome web browser and extensions.
\hypertarget{software-information-and-conventions}{%
\section*{Software information and conventions}\label{software-information-and-conventions}}
We used the \emph{knitr}\index{knitr} package \citep{xie2015} and the \emph{bookdown}\index{bookdown} package \citep{R-bookdown} to compile this book. Our R session information is shown below:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{xfun}\SpecialCharTok{::}\FunctionTok{session\_info}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8
##
## Package version:
## base64enc_0.1.3 bookdown_0.28 bslib_0.4.0
## cachem_1.0.6 cli_3.3.0 compiler_4.2.1
## digest_0.6.29 evaluate_0.16 fastmap_1.1.0
## fs_1.5.2 glue_1.6.2 graphics_4.2.1
## grDevices_4.2.1 highr_0.9 htmltools_0.5.3
## jquerylib_0.1.4 jsonlite_1.8.0 knitr_1.39
## magrittr_2.0.3 memoise_2.0.1 methods_4.2.1
## R6_2.5.1 rappdirs_0.3.3 renv_0.15.5
## rlang_1.0.4 rmarkdown_2.14 rstudioapi_0.13
## sass_0.4.2 stats_4.2.1 stringi_1.7.8
## stringr_1.4.0 tinytex_0.40 tools_4.2.1
## utils_4.2.1 xfun_0.32 yaml_2.3.5
\end{verbatim}
Package names are in italic text (e.g., \emph{rmarkdown}), and inline code and filenames are formatted in a typewriter font (e.g., \texttt{knitr::knit(\textquotesingle{}foo.Rmd\textquotesingle{})}). Function names are followed by parentheses (e.g., \texttt{bookdown::render\_book()}).
We use the assignment (\texttt{\textless{}-}) operator in all code chunks to assign and store objects in this book, but you can also use the equals sign (\texttt{=}).
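As a quick illustration, the following two statements are equivalent; this book always uses the first form:
\begin{verbatim}
x <- 5  # assignment with the arrow operator
x = 5   # assignment with the equals sign
\end{verbatim}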
In 2022, the company RStudio, PBC changed its name to Posit, PBC. The open source IDE created by this company is now known as either ``the RStudio IDE'', or simply ``RStudio.'' We use both terms interchangeably in this book.
\hypertarget{acknowledgments}{%
\section*{Acknowledgments}\label{acknowledgments}}
We would like to thank Greg Wilson, Myfawnwy Johnston, Patrick Alston, Emily Nimsakont, Luke Johnston, and Carl Howe.
\hypertarget{about-the-authors}{%
\chapter*{About the Authors}\label{about-the-authors}}
Sarah Lin manages the Enterprise Information Management team at Posit, PBC. A graduate of the University of Illinois iSchool, Sarah worked as a technical services librarian in many different library types before moving into corporate librarianship and information management. She didn't know anything about coding in R before joining Posit.
Dorris Scott is \ldots{}
\mainmatter
\hypertarget{introduction}{%
\chapter{Introduction}\label{introduction}}
\hypertarget{data-science}{%
\section{What is data science?}\label{data-science}}
Data science degree and certificate programs have sprouted at academic institutions around the country, while books, articles, and presentations about data and how to analyze it regularly appear in library conference programs and educational events. The increased visibility of data science belies the fact that data science has been around for a while. Indeed, data collection and the need to make sense of it are not new. R, the programming language used in this book, has been around for decades. However, experts have some back-and-forth about the discipline of data science and its relationship to other subjects.
Rather than take sides, this book takes a broad view of what constitutes data science and highlights five interdependent elements. These include both \emph{mathematics} and \emph{statistics} on the computational side. With or without a graphical user interface, data science is made real through \emph{computer programming}. Practitioners of data science bring extensive \emph{subject matter knowledge}. Their expertise enables them to communicate their conclusions through data \emph{visualizations}, often providing pictures that speak louder than numbers.
\begin{figure}
\centering
\includegraphics{images/DS-AI-CS-Graphic-UPDATED-Aug2021.jpeg}
\caption{Data Science as Discipline Diagram, Data Science Program, Viterbi School of Engineering, University of Southern California, \url{http://datascience.usc.edu}, 2021}
\end{figure}
Data science is a discipline that extracts knowledge from data in various fields, including librarianship. While data science can help make decisions, it is not a substitute for human decision-making. It can provide insights and generalizations from collected observations (data). Aspects of some subjects remain unquantifiable yet comprehensible to human interpretation. Data analysis is fallible; it requires data science practitioners to bring their expertise to bear on interpretation and decision-making.
Whether we realize it or not, data science is a broad discipline that saturates our professional lives. For academic librarians, the faculty, staff, and students they serve learn and perform data science tasks daily, such as data cleaning, management, and visualization. This occurs in the computational sciences as well as the biological, physical, and social sciences, and even in the humanities. In addition, librarians can act as data curators who help researchers publish or deposit their data in data repositories and academic journals.
Corporations and other institutions with special libraries likely have teams using many tools to analyze the market or user behavior. Predictive text in search engines relies upon text mining and machine learning. Humanities and social science professionals use maps, analysis, web scraping, and text mining to create and analyze datasets. These disciplines need to communicate their findings through written reports and dashboards for their stakeholders and constituents. Data science also permeates the public sphere. Users are subject to machine learning algorithms in their daily lives within loan applications, resume screenings, social media feeds, news visualizations, public health data, social services eligibility, and medical care. Public librarians interact with patrons whose complex information needs may result from how data science impacts their lives. Data literacy is required when data science provides input for human decisions, particularly when those decisions affect others' well-being.
\hypertarget{learn-ds}{%
\section{Why learn data science?}\label{learn-ds}}
Librarians have long collected metrics about their collections and their patrons. However, the pervasiveness of data collection and the need to justify or rationalize library expenditures create an environment in which library and information professionals can put data science to work in their own best interests. Because librarians are both consumers of data and teachers of data literacy, they must acquire skills to perform data science and interrogate data analyses to determine their veracity.
Data literacy is the ability to read, interpret, and analyze data, and it is a requirement when people use data to distort the truth\footnote{\url{https://royalsocietypublishing.org/doi/10.1098/rsos.190161}}. Unfortunately, data literacy is a skill we need all too often. Data science enables data literacy and democratizes access to the source material; so much of our personal and professional lives are affected by data, whether created or influenced by data-driven decision-making. Data provides valuable information to help experts make decisions. Beyond just the economy, so much in our society rewards data literacy and penalizes the illiterate. Because of this, data is too valuable to be left only to data scientists, computer scientists, or statisticians. Instead, subject experts need to learn to code because they know their data best and are best suited to analyze it and draw sound, accurate conclusions. Your professional expertise lets you ask the right questions and interpret meaning from the data. When experts in their field add data science skills to their repertoire, data science is further democratized\footnote{\url{https://www.rstudio.com/resources/rstudioconf-2020/data-science-education-in-2022/}}, and data-driven decisions are more impactful.
\hypertarget{use-code}{%
\section{Why use code?}\label{use-code}}
Ever the proponents of literacy, librarians have embraced data literacy and data-driven decision-making for many years. Conference sessions to improve both data collection and analytics presentation abound. When data skills are adopted, it is usually in the context of a commercial spreadsheet or analytics program. Learning to code is not as common among library and information professionals; this book argues that learning to code is doable and provides increased utility and impact. In the long run, learning a programming language for data science is best because it is accessible to all, ensures data analysis is reproducible, and is future-proof as applications change.
If we define programming as being able to talk to computers in a language they understand, then most librarians have already done that and are probably quite good at it. Technical services and cataloging librarians will be familiar with MARC (Machine Readable Cataloging), the special syntax libraries use to catalog their collections so that computer software can read them. More commonly, if you've written formulas in a spreadsheet application, you've dabbled in the basics of computer programming. However, learning to code offers far greater applications and versatility than a spreadsheet application.
The core benefits of doing data science in code are interoperability and reproducibility. Many academic librarians will be familiar with the FAIR Principles\footnote{\url{https://www.go-fair.org/}} through their data curation work; this initiative focuses on making information Findable, Accessible, Interoperable, and Reusable. Doing data science in code ensures that data and data analysis are both interoperable and reproducible, neither of which is possible with proprietary software applications.
Interoperability means that other librarians, who may have completely different software applications on their computers, are able to run anyone else's code. The R programming language is an open-source tool that is free to anyone across the globe and provides transparent data analysis. Additionally, platform-agnostic tools like code can bring together the output of multiple commercial products to rationalize and analyze the data together.
Reproducibility is closely related to interoperability: beyond running on any application configuration, the analysis must be able to be re-run by another person with the same results. In the past few years, there have been stories in the news about spreadsheet errors that led researchers to draw erroneous conclusions. In one case, years of austerity measures around the globe rested on one economics research paper that was missing a few values for some variables\footnote{\url{https://www.businessinsider.com/thomas-herndon-michael-ash-and-robert-pollin-on-reinhart-and-rogoff-2013-4}}. Using code allows researchers to combine their data, code, and analysis, providing transparency into the process of data science. Unfortunately, there have been other examples of reproducibility problems in various scientific disciplines: physics\footnote{\url{https://physicstoday.scitation.org/do/10.1063/PT.6.1.20180822a/full/}}, psychology\footnote{\url{https://www.science.org/doi/10.1126/science.aac4716}}, and medical research\footnote{\url{https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165}} as well. A librarian will often need to re-run an analysis on new iterations of data without repeating the data cleaning and analysis steps manually. Thankfully, code can be run repeatedly with new data as input, saving hours and hours while repeating each step precisely. The ultimate benefit of doing data science using computer programming languages is the ability to share both the raw data and the steps of the analysis.
\hypertarget{vignette}{%
\section{Vignette}\label{vignette}}
This book creates an overarching narrative that presents realistic code examples and valuable outputs centered around a hypothetical outreach librarian in St.~Louis, MO. Envision that you are this outreach librarian and you want to create a partnership with community institutions to address unemployment in St.~Louis. Your goal is to present a report to stakeholders at the library and within the community that analyzes several data sources related to employment and unemployment in St.~Louis. You will employ different data science skills to compile the report. Each chapter in this book touches on a different aspect of your report, with the chapters building upon one another as you learn data science and code each analytical section in R.
The reader is invited to inhabit the role of this librarian, whom we will address as `you' throughout the book; we introduce each chapter with a scenario that describes what the librarian is trying to accomplish with each data science skill.
\hypertarget{book-structure}{%
\section{Structure of this book}\label{book-structure}}
In pursuit of data to justify a community partnership, you will learn R in incremental steps with a topic for each chapter that will produce one aspect of the final report. This book isn't an exhaustive textbook on R or data science but rather a guidebook through the central functional practices of data science in R. The focus is on immediately applicable skill acquisition made easier through library-specific hypothetical tasks. The chapter topics include:
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Use RStudio to code in R
\item
Learn to clean data using code
\item
Plot basic visualizations
\item
Scrape websites using code
\item
Visualize data using maps
\item
Use code to mine textual data
\item
Publish your code using R Markdown
\item
Communicate your findings via Flexdashboard
\item
Let stakeholders draw their conclusions from an interactive Shiny application
\item
Understand how AI intersects with employment by understanding how machine learning works
\end{enumerate}
To expand on this list, the first two chapters explain R, the RStudio IDE used to program in R, and how to get started cleaning data. In any data-related project, cleaning data is the first and often the most time-consuming task. Chapters three through nine teach different data science skills: plots/graphs, web scraping, geographic visualizations, text mining, publishing, dashboards, and interactive web applications. The final chapter covers machine learning, explaining the construction of algorithms and their implications for librarians who interact with them. An explanation of how résumé-screening software uses machine learning to accept or reject job applications connects the mechanics of machine learning to the experiences job seekers have through the prospective outreach partnership.
\hypertarget{audience}{%
\section{Who this book is for}\label{audience}}
The anticipated audience for this book is all librarians and information professionals interested in learning data science and applying it to their everyday jobs. Public, academic, medical, legal, special, and corporate librarians can all put the data science skills taught in this book to use in their daily work. The book has been designed with examples adaptable to many job positions and library types, creating a practical introduction to primary data science skills needed in a professional setting. This book does not include in-depth explanations of particular R packages, the statistical and mathematical principles behind package functions, or theoretical foundations of different analysis types. There are several related topics that, while not required, are helpful to learn alongside or following this book. The Appendix includes those topics, and resources to learn more about them.
\hypertarget{rstudio}{%
\chapter{Using RStudio's IDE}\label{rstudio}}
\hypertarget{rstudio-los}{%
\section{Learning Objective: use the RStudio IDE (Integrated Development Environment) for importing data.}\label{rstudio-los}}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Use your computer knowledge to install RStudio.
\item
Describe the function of each pane in the IDE.
\item
Modify IDE settings to your liking.
\item
Use the IDE to import a tabular data file.
\end{enumerate}
\hypertarget{rstudio-terms}{%
\section{Terms You'll Learn}\label{rstudio-terms}}
\begin{itemize}
\tightlist
\item
integrated development environment (IDE)
\item
package
\item
tidyverse
\item
session
\item
working directory
\end{itemize}
\hypertarget{rstudio-scenario}{%
\section{Scenario}\label{rstudio-scenario}}
You want to use R to do data science and publish a data-based report to support your outreach efforts, but you don't know how to code in R or get started.
\hypertarget{rstudio-intro}{%
\section{Introduction}\label{rstudio-intro}}
This chapter aims to get you up and running with programming in R using the RStudio \emph{Integrated Development Environment}, or IDE, which is generally referred to as `RStudio.' An IDE is a computer program that makes it easier to code; while you can use your computer's command line\footnote{the program that enables you to type commands that your computer will follow to complete a task, such as Terminal on macOS} or UNIX shell\footnote{\url{https://librarycarpentry.org/lc-shell/01-intro-shell/index.html}} interface to code, the graphical user interface of an IDE makes it a lot more accessible. The distinction between coding at the command line and using an IDE is a lot like the difference between finding stored files in the command line and using Finder/File Explorer on your work or personal computer. While there are some scenarios where using the command line makes the most sense, for the day-to-day, most computer users use the Finder/File Explorer to more easily navigate through their files and data. IDEs are very common in computer programming, and many different applications exist. We're using RStudio because it was designed specifically for R, though you can use it to program in Python and other languages. It is free and open-source, and using it to program in R is a widely-used way to wrangle and interpret data. We will also cover the basics of R as a programming language and a widely-used collection of packages called the Tidyverse, and then install RStudio to get started with R.
\hypertarget{what-is-r}{%
\section{What is R?}\label{what-is-r}}
Version 1.0 of the R programming language was released publicly in 2000\footnote{\url{https://blog.revolutionanalytics.com/2020/07/the-history-of-r-updated-for-2020.html}}, five years after initial distribution as open-source software. The intellectual genealogy of R comes from the S statistical programming language, created at Bell Labs in the 1970s\footnote{\url{https://youtu.be/jk9S3RTAl38}}. As a programming language, R was designed for statisticians to analyze data interactively. R's statistical and academic origins stand in contrast to other programming languages used for data science.
R is an object-based programming language, where code and outputs are stored as objects to be acted upon later. In algebra, a variable might store a single value or mathematical expression; an R object can hold one value or many. Where algebra uses an equals sign to denote what a variable is, such as \texttt{x\ =\ 5}, R uses \texttt{\textless{}-} in the same way. You can read the left-pointing arrow as the word ``is.'' We can use the \texttt{print()} function to display the value of an object when we put the object we want to see inside of the parentheses.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{x }\OtherTok{\textless{}{-}} \DecValTok{5}
\NormalTok{y }\OtherTok{\textless{}{-}}\NormalTok{ x }\SpecialCharTok{+} \DecValTok{2}
\FunctionTok{print}\NormalTok{(y)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] 7
\end{verbatim}
R lets us work with data interactively through the use of code. When we write code in R, we are usually creating and saving data objects of various classes according to our needs. We can then conduct operations and/or analyses on these data objects in our R session(s).
Common classes (types) for these objects include numeric, character (text), and logical (true/false). Objects of a single class are often collected and stored together as vectors. Vectors can in turn be grouped together to make larger data objects you might already be familiar with, including matrices, arrays, or data frames. In this book we focus on data frames.
The data frame structure is central to data analysis because it requires every column to have the same length, just like a table: each column must have the same number of rows. This consistent table-like structure is vital for many data science functions. Readers who move on to further data science tasks beyond this book will need to understand data classes and structures. Coding errors in R are often traced back to incompatible data structures or inconsistent application of classes.
Please note that any code preceded by a \texttt{\#} functions as a comment because R ignores anything following that character. We can combine multiple values into one object using \texttt{c()} and determine an object's class using \texttt{class()}.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# numeric vector}
\NormalTok{numbers }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\DecValTok{8}\NormalTok{, }\DecValTok{6}\NormalTok{, }\DecValTok{7}\NormalTok{, }\DecValTok{5}\NormalTok{, }\DecValTok{3}\NormalTok{, }\DecValTok{0}\NormalTok{, }\DecValTok{9}\NormalTok{) }
\FunctionTok{class}\NormalTok{(numbers)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] "numeric"
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\#logical vector}
\NormalTok{values }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\ConstantTok{TRUE}\NormalTok{, }\ConstantTok{TRUE}\NormalTok{, }\ConstantTok{TRUE}\NormalTok{, }\ConstantTok{FALSE}\NormalTok{, }\ConstantTok{FALSE}\NormalTok{, }\ConstantTok{FALSE}\NormalTok{, }\ConstantTok{TRUE}\NormalTok{)}
\FunctionTok{class}\NormalTok{(values)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] "logical"
\end{verbatim}
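Vectors like these can serve as the columns of a data frame. Here is a minimal sketch (the values are invented) showing that every column must have the same length:
\begin{verbatim}
# combine three equal-length vectors into a data frame
books <- data.frame(
  title       = c("A", "B", "C"),
  pages       = c(120, 250, 175),
  checked_out = c(TRUE, FALSE, TRUE)
)
class(books)  # "data.frame"
nrow(books)   # 3: every column has three rows
\end{verbatim}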
Designed explicitly to work with data, R works with many object types. Like many other programming languages, when data scientists find a need for specific applications and groups of related functions, they can create and bundle them into what R calls ``packages''. A \emph{package} is a group of associated functions equivalent to what in other languages might be called a ``library.''
The complete list of the thousands and thousands of contributed R packages is on CRAN (\url{https://cran.r-project.org/}). These run the gamut from technical to purely fun. Some packages focus on a particular skill (web scraping) or a specific dataset (the Project Gutenberg books). You'll become familiar with a dozen or so packages throughout this book.
\hypertarget{tidyverse}{%
\section{Introducing the Tidyverse}\label{tidyverse}}
One of the most helpful R packages to become familiar with is the \emph{tidyverse} package, which is a collection of packages\footnote{\url{https://www.tidyverse.org/packages/}} usually referred to as the Tidyverse. Each of these, listed below, focuses on a different aspect of cleaning or tidying data before it's used or analyzed further. While it is possible to use ``base R,'' meaning the functions that come loaded with R when installing it, many R users prefer the Tidyverse because its packages make common tasks in R easier. The Tidyverse packages all work together, and Posit, PBC staff maintain them.
The ``core'' Tidyverse packages include:
\begin{itemize}
\tightlist
\item
ggplot2, for data visualization
\item
dplyr, for data manipulation
\item
tidyr, for data tidying
\item
readr, for importing data from CSV files
\item
purrr, for functional programming (such as repetitive functions)
\item
tibble, for tibbles, a more straightforward way to create data frames
\item
stringr, for manipulating strings\footnote{\url{https://en.wikipedia.org/wiki/String_(computer_science)}}
\item
forcats, for factors (a data structure not used in this book)
\end{itemize}
There are several other packages in the Tidyverse, and we will use several of them in this book:
\begin{itemize}
\tightlist
\item
httr, for web APIs
\item
jsonlite, for JSON files
\item
readxl, for .xls and .xlsx files (not used in this book, but useful for those who use Microsoft Excel frequently)
\item
rvest, for web scraping
\item
xml2, for working with XML formats
\end{itemize}
This book will cover the purpose and functions of these packages as they are needed.
\hypertarget{ide-start}{%
\section{Getting Started with the RStudio IDE}\label{ide-start}}
There are many ways to interface with R on your computer, and you can choose the interface that makes the most sense for you. Millions of R users use the graphical user interface provided by RStudio:
\begin{quote}
The RStudio IDE is a set of integrated tools designed to help you be more productive with R and Python. It includes a console, syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging, and managing your workspace\footnote{\url{https://www.rstudio.com/products/rstudio/download/}}.
\end{quote}
RStudio is also open-source software, which means that the code used to create it is freely available to download, use, and modify. In contrast, other statistical analysis software programs have inaccessible code and require paid subscriptions. Additionally, Posit, PBC supports the continued development of RStudio by dedicating a portion of its engineering team to work only on open-source software projects.
\hypertarget{install-r}{%
\subsection{Install R}\label{install-r}}
The RStudio IDE does not come with R; instead, download the latest version of R for your operating system from the Comprehensive R Archive Network, or CRAN\footnote{\url{https://cran.r-project.org/}}. Follow the download and installation instructions for your operating system to install R.
\hypertarget{install-the-rstudio-ide}{%
\subsection{Install the RStudio IDE}\label{install-the-rstudio-ide}}
We will use the open-source desktop version of the IDE, which is available as a free download from Posit's website\footnote{\url{https://www.rstudio.com/products/rstudio/download/\#download}}. On the download page, you should select the correct version of the IDE that matches your operating system.\\
\includegraphics{images/ide-download.png}
After selecting the download button, follow the prompts on your computer to install RStudio.
\hypertarget{navigate-ide}{%
\subsection{Navigating RStudio}\label{navigate-ide}}
The RStudio IDE brings together all the tools you need to do data science: an editor to write code and text, a console to execute code, access to your computer's terminal, a file explorer, a viewer pane for graphs and visualizations, as well as a version control pane for those who use Git or GitHub (see Appendix~\ref{appendix}). While it can accommodate many programming languages, the focus of this book will be using RStudio to code in R. Within the ecosystem of R tools, it includes common code libraries and other tools, like spellcheck, which make the work of data science much more manageable.
\includegraphics{images/rstudio-landing-screen.png}
RStudio has numerous features, but this book covers only those necessary for the tasks at hand. The left-hand pane is called the console, where we can type code directly; the IDE also echoes code run from files there, so we can see a log of our code as it executes. Additional tabs in that pane include the terminal (see Appendix~\ref{appendix}). On the top right is the environment pane, where the R objects you create and use in your session are stored. The bottom right is the files pane, where you can navigate through your computer's file directory. Other useful tabs in that pane are Help and Viewer, which shows any graphs or plots you create.
RStudio uses the concept of project files, which group together all the code and dataset files for one project. Every new data science project should start with a new R project in the RStudio IDE. From the \emph{File} menu, select \emph{New Project} and follow the prompts to create a new project. Each project must have a name, which will create a folder of the same name; all your code and other files are saved within that folder. Naming projects separately keeps project files organized and more easily navigable from a file directory. When you open a project at the start of your work session, the IDE will use the file directory for that project as that session's working directory. Any files created will be automatically saved to that same directory or folder, which helps keep files organized.
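To confirm where a session is pointed, you can run the following in the console (the output will vary by machine):
\begin{verbatim}
getwd()       # print the current working directory
list.files()  # list the files saved there
\end{verbatim}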
Once you create a file or open one, the console moves to the left bottom, and an editor pane opens in the upper left. To store some R code as a file to access or re-run later, create a new R Script file by going to File \textgreater{} New File \textgreater{} R Script.
\includegraphics{images/new-r-script.png}
With four panes, the IDE screen looks like this:
\includegraphics{images/4-panes.png}
\hypertarget{pkgs-download}{%
\section{Packages needed for this book}\label{pkgs-download}}
As you progress in your data science journey, you will install more and more R packages. As with any new project you start, begin by installing all the packages you will need to use. Please see Appendix~\ref{appendix} for instructions on installing additional software that these packages depend on to function properly (commonly called `dependencies'). You might see prompts in the console during this process. If you're asked to install other packages, say `yes.' If you're asked whether you want to compile packages from source, say `no.'
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# install the packages needed by this book; you may need to install dependencies before proceeding or if you encounter an error message}
\FunctionTok{lapply}\NormalTok{(}\FunctionTok{c}\NormalTok{(}\StringTok{\textquotesingle{}xfun\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}tidyverse\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}gapminder\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}tidytext\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}jsonlite\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}units\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}rgdal\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}terra\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}sf\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}tmap\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}tidycensus\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}readr\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}textdata\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}tidymodels\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}flexdashboard\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}DT\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}shiny\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}Rcpp\textquotesingle{}}\NormalTok{, }\StringTok{\textquotesingle{}raster\textquotesingle{}}\NormalTok{), }\ControlFlowTok{function}\NormalTok{(pkg) \{}
\ControlFlowTok{if}\NormalTok{ (}\FunctionTok{system.file}\NormalTok{(}\AttributeTok{package =}\NormalTok{ pkg) }\SpecialCharTok{==} \StringTok{\textquotesingle{}\textquotesingle{}}\NormalTok{) }\FunctionTok{install.packages}\NormalTok{(pkg, }\AttributeTok{repos=}\StringTok{"https://cloud.r{-}project.org"}\NormalTok{)}
\NormalTok{\})}
\CommentTok{\# installing \_rgdal\_ can sometimes result in errors. Please see the Appendix for troubleshooting tips.}
\end{Highlighting}
\end{Shaded}
At the start of all subsequent chapters, you'll notice a code chunk that loads each package into your current session using the \texttt{library()} function. Installing a package happens only once, but loading a package must occur each time you open RStudio or start a new R session.
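As a minimal sketch of that pattern, using the Tidyverse as an example:
\begin{verbatim}
# run once per computer (or after upgrading R)
install.packages("tidyverse")

# run in every new R session that uses the package
library(tidyverse)
\end{verbatim}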
\hypertarget{ide-viewing}{%
\section{Viewing tabular data in RStudio}\label{ide-viewing}}
Let's read some data into R and get more comfortable with RStudio while exploring the data. We'll use COVID stats for the city of St.~Louis that are available at: \url{https://www.stlouis-mo.gov/covid-19/data/\#totalsByDate}. Scroll down to Totals By Specimen Collection Date and click View Data, then save the csv file.
After the file is saved, we can use the \texttt{read.csv()} function to read the file into R and make it available for us to use. (\texttt{read.csv()} comes with base R; the Tidyverse package \emph{readr} provides a similar, faster \texttt{read\_csv()} function.) First, we need to load the Tidyverse packages we already installed with the \texttt{library()} function.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\#load \_readr\_ as part of the \_tidyverse\_ package}
\FunctionTok{library}\NormalTok{(tidyverse) }
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## -- Attaching packages -------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ----------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\#create an object to store the csv data we read in}
\NormalTok{stl\_covid }\OtherTok{\textless{}{-}} \FunctionTok{read.csv}\NormalTok{(}\StringTok{"City{-}of{-}St{-}Louis{-}COVID{-}19{-}Case{-}Data.csv"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
Everything we want to work with must be stored as an object, so we created a \texttt{stl\_covid} object that contains the contents of the CSV file we just downloaded. Most COVID datasets are very large, so while we could click on this object in the Environment pane and open it to view the entire file, we can instead use a few R functions to get a sense of what this dataset looks like.
If we want to see the entire file, we can use the \texttt{view()} command (you may also see it capitalized as \texttt{View()}) to open up a spreadsheet view in our editor pane. The file is very large, as expected.
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{view}\NormalTok{(stl\_covid)}
\end{Highlighting}
\end{Shaded}
\includegraphics{images/view-covid.png}
We can also use some built-in base R functions to see snippets of the \texttt{stl\_covid} dataset. To see the first six lines (the default), we can use \texttt{head()}; \texttt{tail()} shows the last six. Both accept an \texttt{n} argument if you want a different number of rows. An additional function is \texttt{summary()}, which will display summary statistics for each column in the data frame.
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{head}\NormalTok{(stl\_covid)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## CONFIRMEDCASECHANGE DATE DEATHCHANGE
## 1 1 03-03-2020 0
## 2 0 03-04-2020 0
## 3 0 03-05-2020 0
## 4 0 03-06-2020 0
## 5 0 03-07-2020 0
## 6 0 03-08-2020 0
## PROBABLECASECHANGE R0 R0CIHIGH R0CILOW
## 1 0 NA NA NA
## 2 0 NA NA NA
## 3 0 NA NA NA
## 4 0 NA NA NA
## 5 0 NA NA NA
## 6 0 NA NA NA
## TOTALCONFIRMEDCASES TOTALDEATHS TOTALPROBABLECASES
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{tail}\NormalTok{(stl\_covid)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## CONFIRMEDCASECHANGE DATE DEATHCHANGE
## 743 8 03-15-2022 1
## 744 11 03-16-2022 0
## 745 11 03-17-2022 0
## 746 7 03-18-2022 0
## 747 5 03-19-2022 0
## 748 0 03-20-2022 0
## PROBABLECASECHANGE R0 R0CIHIGH R0CILOW
## 743 5 NA NA NA
## 744 3 NA NA NA
## 745 0 NA NA NA
## 746 1 NA NA NA
## 747 1 NA NA NA
## 748 0 NA NA NA
## TOTALCONFIRMEDCASES TOTALDEATHS TOTALPROBABLECASES
## 743 45378 746 7504
## 744 45389 746 7507
## 745 45400 746 7507
## 746 45407 746 7508
## 747 45412 746 7509
## 748 45412 746 7509
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{summary}\NormalTok{(stl\_covid)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## CONFIRMEDCASECHANGE DATE
## Min. : 0.0 Length:748
## 1st Qu.: 21.0 Class :character
## Median : 35.0 Mode :character
## Mean : 60.7
## 3rd Qu.: 66.0
## Max. :735.0
##
## DEATHCHANGE PROBABLECASECHANGE R0
## Min. : 0.000 Min. : -2 Min. :0.56
## 1st Qu.: 0.000 1st Qu.: 0 1st Qu.:0.89
## Median : 0.000 Median : 5 Median :0.99
## Mean : 0.997 Mean : 10 Mean :1.04
## 3rd Qu.: 2.000 3rd Qu.: 12 3rd Qu.:1.14
## Max. :12.000 Max. :241 Max. :3.99
## NA's :34
## R0CIHIGH R0CILOW TOTALCONFIRMEDCASES
## Min. :0.61 Min. :0.51 Min. : 1
## 1st Qu.:1.00 1st Qu.:0.76 1st Qu.: 6541
## Median :1.11 Median :0.89 Median :20429
## Mean :1.16 Mean :0.91 Mean :18537
## 3rd Qu.:1.28 3rd Qu.:1.00 3rd Qu.:26597
## Max. :5.36 Max. :2.63 Max. :45412
## NA's :34 NA's :34
## TOTALDEATHS TOTALPROBABLECASES
## Min. : 0 Min. : 0
## 1st Qu.:214 1st Qu.: 97
## Median :438 Median :1724
## Mean :396 Mean :2125
## 3rd Qu.:577 3rd Qu.:3401
## Max. :746 Max. :7509
##
\end{verbatim}
While this dataset originated as a CSV file, there are specific R packages for reading in Microsoft Excel (\emph{readxl}) and Google Sheets (\emph{googlesheets4}) files. This book works only with CSV files, but if you often work with those proprietary formats, these packages can save you from converting to CSV first.
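For instance, reading an Excel workbook with \emph{readxl} looks much like reading a CSV file; the sketch below assumes a hypothetical \texttt{.xlsx} version of our dataset:
\begin{verbatim}
library(readxl)
# read the first sheet of a (hypothetical) Excel workbook
stl_covid_xl <- read_excel("City-of-St-Louis-COVID-19-Case-Data.xlsx")
head(stl_covid_xl)
\end{verbatim}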
\hypertarget{rstudio-summary}{%
\section{Summary}\label{rstudio-summary}}
This chapter took you from no experience coding in R to interacting with data in the RStudio IDE using R functions. R is an object-oriented programming language used within RStudio's graphical user interface alongside several popular code packages, such as the Tidyverse. New users must install R and RStudio before learning the various features the IDE offers for data scientists. There are several ways to view data in RStudio, whether viewing the entire dataset file or using R functions to see snippets of the dataset within the console.
\hypertarget{rstudio-study}{%
\section{Further Practice}\label{rstudio-study}}
\begin{itemize}
\tightlist
\item
Read in a csv file of your own and run the same summary functions: \texttt{head()}, \texttt{tail()}, \texttt{summary()}
\item
Install \emph{janeaustenr} for use in Chapter~\ref{text-study}
\end{itemize}
\hypertarget{rstudio-resources}{%
\section{Additional Resources}\label{rstudio-resources}}
\begin{itemize}
\tightlist
\item
\emph{Hands-On Programming with R}, by Garrett Grolemund
\item
RStudio IDE, Base R, \& data import (\emph{readr}) cheatsheets: \url{https://www.rstudio.com/resources/cheatsheets/}
\item
``Getting Started with R and RStudio'': \url{https://moderndive.netlify.com/1-getting-started.html}
\item
An Introduction to R: \url{https://cran.r-project.org/doc/manuals/R-intro.html}
\end{itemize}
\hypertarget{dplyr}{%
\chapter{\texorpdfstring{Tidying data with \emph{dplyr}}{Tidying data with dplyr}}\label{dplyr}}
\hypertarget{dplyr-los}{%
\section{\texorpdfstring{Learning Objective: write code to perform data scrubbing functions with the Tidyverse's \emph{dplyr} package.}{Learning Objective: write code to perform data scrubbing functions with the Tidyverse's dplyr package.}}\label{dplyr-los}}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Use the IDE to load the \emph{dplyr} package.
\item
Identify data elements in RStudio's IDE that need to be changed.
\item
Summarize the most common functions \emph{dplyr} is used for.
\item
Use \emph{dplyr} functions to normalize fields in a dataset.
\end{enumerate}
\hypertarget{dplyr-terms}{%
\section{Terms You'll Learn}\label{dplyr-terms}}
\begin{itemize}
\tightlist
\item
API
\end{itemize}
\hypertarget{dplyr-scenario}{%
\section{Scenario}\label{dplyr-scenario}}
You need data on unemployment in the city of St.~Louis, and the first step to creating visualizations related to unemployment is to read in the data and tidy it. You'd like to target your outreach to areas of high unemployment, so you will need to prepare the data used to identify those areas. Knowing which occupations have the highest employment would also help you target job-seeker training toward jobs that are in demand.
\hypertarget{dplyr-pkgs}{%
\section{Packages \& Datasets Needed}\label{dplyr-pkgs}}
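This chapter assumes the packages installed in Chapter 2 are available and that the following are loaded at the start of your session:
\begin{verbatim}
library(tidyverse)
library(tidycensus)
\end{verbatim}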
\hypertarget{dplyr-intro}{%
\section{Introduction}\label{dplyr-intro}}
This chapter is focused on census data and learning data tidying functions to create an unemployment dataset for use in subsequent chapters. We are aided in this endeavor by the \emph{tidycensus} package\footnote{\url{https://walker-data.com/tidycensus/}}, which interfaces with US Census data and returns results that are ready to use with Tidyverse packages. \emph{Tidycensus} lets us access census data for many communities, St.~Louis included. The Census collects data about employment, occupation, gender, and location.
\hypertarget{census-setup}{%
\section{Getting started with U.S. Census data}\label{census-setup}}
Census data is available from the Census \emph{API}\footnote{\url{https://en.wikipedia.org/wiki/API}}. An API, or application programming interface, allows our computer to access the computer(s) storing the census data. APIs enable computers to talk to each other; they are a valuable tool for data scientists who want to get a dataset directly from the source. Many data sources provide API access to their databases, which we will visit again in Chapter~\ref{text-api}.
\hypertarget{census-prerequisites}{%
\subsection{Census prerequisites}\label{census-prerequisites}}
Before using \emph{tidycensus} to query the Census database, each user must have a unique identifier: an API key. This unique authorization code from the Census website allows you to access census data\footnote{\url{http://api.census.gov/data/key_signup.html}}.
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Create a Census API key
\end{enumerate}
If you're following along and entering this code into your R console, sign up for your own Census API key, delete the \texttt{\#}, and replace ``your-key-here'' with your own key.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# census\_api\_key("your{-}key{-}here") }
\end{Highlighting}
\end{Shaded}
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{1}
\tightlist
\item
Get FIPS codes
We are limiting our analysis to the city of St.~Louis and need to restrict our data to that area. To do that, we'll use Federal Information Processing Series (FIPS) codes. Thankfully, the \texttt{fips\_codes} dataset is already part of \emph{tidycensus}.
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{head}\NormalTok{(fips\_codes)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## state state_code state_name county_code
## 1 AL 01 Alabama 001
## 2 AL 01 Alabama 003
## 3 AL 01 Alabama 005
## 4 AL 01 Alabama 007
## 5 AL 01 Alabama 009
## 6 AL 01 Alabama 011
## county
## 1 Autauga County
## 2 Baldwin County
## 3 Barbour County
## 4 Bibb County
## 5 Blount County
## 6 Bullock County
\end{verbatim}
When combined with the state, each county has a code that allows us to query the Census database for only the geographic area of interest, like St.~Louis.
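For example, one way to look up the code for the city of St.~Louis (a sketch; it assumes \emph{dplyr} is loaded and that the Census lists the independent city as ``St. Louis city''):
\begin{verbatim}
fips_codes %>%
  filter(state == "MO", county == "St. Louis city")
\end{verbatim}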
\hypertarget{census-variables}{%
\subsection{Census variables}\label{census-variables}}
The Census collects a lot of data about the US population, but we don't need all that data! To narrow our scope to the most applicable data, we must select the Census report year, type, and metadata fields (variables) we want to analyze. The American Community Survey\footnote{\url{https://www.census.gov/programs-surveys/acs}} will provide the most valuable data for our analysis.
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Review all Census variables
We'll use \texttt{load\_variables()} to review the 2019 ACS 5-year survey data variables.
\end{enumerate}
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{var\_2019 }\OtherTok{\textless{}{-}} \FunctionTok{load\_variables}\NormalTok{(}\DecValTok{2019}\NormalTok{, }\StringTok{"acs5"}\NormalTok{)}
\NormalTok{var\_2019}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## # A tibble: 27,040 x 4
## name label concept geogr~1
## <chr> <chr> <chr> <chr>
## 1 B01001_001 Estimate!!Total: SEX BY~ block ~
## 2 B01001_002 Estimate!!Total:!!Male: SEX BY~ block ~
## 3 B01001_003 Estimate!!Total:!!Male:!~ SEX BY~ block ~
## 4 B01001_004 Estimate!!Total:!!Male:!~ SEX BY~ block ~
## 5 B01001_005 Estimate!!Total:!!Male:!~ SEX BY~ block ~
## 6 B01001_006 Estimate!!Total:!!Male:!~ SEX BY~ block ~
## 7 B01001_007 Estimate!!Total:!!Male:!~ SEX BY~ block ~
## 8 B01001_008 Estimate!!Total:!!Male:!~ SEX BY~ block ~
## 9 B01001_009 Estimate!!Total:!!Male:!~ SEX BY~ block ~
## 10 B01001_010 Estimate!!Total:!!Male:!~ SEX BY~ block ~
## # ... with 27,030 more rows, and abbreviated variable
## # name 1: geography
## # i Use `print(n = ...)` to see more rows
\end{verbatim}
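Scrolling through 27,040 rows by eye isn't practical. One way to narrow the list is to search the \texttt{concept} column; this sketch uses base R's \texttt{grepl()}, and the search string is simply our choice:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Keep only rows whose concept column mentions employment status.}
\NormalTok{var\_2019[}\FunctionTok{grepl}\NormalTok{(}\StringTok{"EMPLOYMENT STATUS"}\NormalTok{, var\_2019}\SpecialCharTok{$}\NormalTok{concept), ]}
\end{Highlighting}
\end{Shaded}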
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{1}
\tightlist
\item
Create new object of variables
Having pulled in the FIPS codes that allow us to identify data from St.~Louis and the variable names from the 2019 ACS, we can now create a new object that contains only the data we want:
\end{enumerate}
\begin{itemize}
\tightlist
\item
Survey: 5-year ACS
\item
Year: 2019
\item
Location: St.~Louis city, Missouri (FIPS county code 510)
\item
Variables:

  \begin{itemize}
  \tightlist
  \item
  Total population: B23025\_001
  \item
  Population not in the labor force (unemployed): B23025\_007
  \end{itemize}
\end{itemize}
One Base R function that we'll rely on for this code is \texttt{c()}, which combines values (numbers or text) into a single vector. We'll use it to combine the two variables we're interested in: total population and the number unemployed. The function \texttt{get\_acs()} sends our query to the Census API, returning the data we need for each Census tract. We want each variable spread out across its own columns, so we will use the \texttt{output\ =\ "wide"} setting to adjust the output.
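Before it goes inside \texttt{get\_acs()}, here is \texttt{c()} on its own. The labels on the left of each \texttt{=} (\texttt{total\_pop} and \texttt{unemployed}) are names we chose, not Census fields; \emph{tidycensus} uses them to rename the output columns:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# c() combines the two Census variable IDs into one named vector.}
\FunctionTok{c}\NormalTok{(}\AttributeTok{total\_pop =} \StringTok{"B23025\_001"}\NormalTok{, }\AttributeTok{unemployed =} \StringTok{"B23025\_007"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
With that vector in hand, the full query looks like this: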
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{data }\OtherTok{\textless{}{-}} \FunctionTok{get\_acs}\NormalTok{(}\AttributeTok{geography =} \StringTok{"tract"}\NormalTok{,}
  \AttributeTok{variables =} \FunctionTok{c}\NormalTok{(}\AttributeTok{total\_pop =} \StringTok{"B23025\_001"}\NormalTok{,}
                \AttributeTok{unemployed =} \StringTok{"B23025\_007"}\NormalTok{),}
  \AttributeTok{state =} \StringTok{"MO"}\NormalTok{, }\AttributeTok{county =} \StringTok{"510"}\NormalTok{, }\AttributeTok{year =} \DecValTok{2019}\NormalTok{, }\AttributeTok{output =} \StringTok{"wide"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Getting data from the 2015-2019 5-year ACS
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{data}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## # A tibble: 106 x 6
## GEOID NAME total~1 total~2 unemp~3 unemp~4
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 29510124200 Census ~ 2536 326 668 194
## 2 29510124300 Census ~ 3043 305 649 142
## 3 29510125500 Census ~ 3881 328 791 177
## 4 29510102100 Census ~ 2300 184 545 107
## 5 29510102400 Census ~ 2086 166 570 152
## 6 29510103800 Census ~ 3269 240 898 131
## 7 29510104200 Census ~ 3000 261 869 218
## 8 29510105500 Census ~ 2265 388 964 215
## 9 29510106500 Census ~ 2275 376 1135 265
## 10 29510107500 Census ~ 1730 312 904 220
## # ... with 96 more rows, and abbreviated variable
## # names 1: total_popE, 2: total_popM,
## # 3: unemployedE, 4: unemployedM
## # i Use `print(n = ...)` to see more rows
\end{verbatim}
\hypertarget{dplyr-tidy-tools}{%
\section{\texorpdfstring{Tidy data tools from \emph{dplyr}}{Tidy data tools from dplyr}}\label{dplyr-tidy-tools}}
The data we've pulled give us the total population and the number of unemployed in each tract, but that's not quite what we need to know. We need an unemployment rate; from there we can determine which areas have the highest and lowest unemployment and set them alongside occupation data. Note, too, that the wide output appended \texttt{E} (estimate) and \texttt{M} (margin of error) to each variable name, which is why the columns are called \texttt{total\_popE}, \texttt{total\_popM}, and so on. To get the rate, we must tidy and modify the \emph{tidycensus} data we have.
The \emph{dplyr} package within the Tidyverse contains a constellation of functions designed for data modification. Some of the actions we'll need to perform are:
\begin{itemize}
\tightlist
\item
renaming columns
\item
creating a new column for the unemployment rate, which involves performing a mathematical operation on other columns
\item
combining columns
\item
sorting column values
\item
chaining several functions together sequentially
\item
choosing specific columns or rows within a table
\item
viewing a snapshot of a dataset
\item
grouping data by column value
\item
filtering a subset of a table
\item
joining datasets based on common column values
\end{itemize}
\hypertarget{dplyr-start}{%
\section{\texorpdfstring{Getting started with \emph{dplyr} functions}{Getting started with dplyr functions}}\label{dplyr-start}}
One of the formative concepts of the Tidyverse, which we will rely upon heavily through the remainder of the book, is the pipe: \texttt{\%\textgreater{}\%}. Within a code chunk, this operator can be read as `then': it lets us chain \emph{dplyr} functions together sequentially, making our code more readable. You will often see Tidyverse code written in this ``object {[}then{]} function'' pattern.
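As a quick illustration, reusing the \texttt{fips\_codes} table from earlier, the two calls below are equivalent: the pipe passes the object on its left in as the first argument of the function on its right.
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# These two lines do the same thing:}
\FunctionTok{head}\NormalTok{(fips\_codes)}
\NormalTok{fips\_codes }\SpecialCharTok{\%\textgreater{}\%} \FunctionTok{head}\NormalTok{()}
\end{Highlighting}
\end{Shaded}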
\hypertarget{create-unemployment-rate}{%
\subsection{Create unemployment rate}\label{create-unemployment-rate}}
We'll use the ``object {[}then{]} function'' pattern to create a new variable and column for the unemployment rate. The ACS doesn't provide an unemployment rate, so we must calculate it from the columns we have: total population and population unemployed. Our task here is two-fold: 1) create a new column and 2) populate each row in the column with its calculated unemployment rate. This is the number unemployed divided by the total population:
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{unemployment\_data }\OtherTok{\textless{}{-}}\NormalTok{ data }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{mutate}\NormalTok{(}\AttributeTok{unemployment\_rate =} \FunctionTok{as.numeric}\NormalTok{(unemployedE)}\SpecialCharTok{/}\FunctionTok{as.numeric}\NormalTok{(total\_popE)) }
\end{Highlighting}
\end{Shaded}
In plain English, the code above says ``create a new object called \texttt{unemployment\_data}, which takes the \texttt{data} object and then makes a new column in it called \texttt{unemployment\_rate}; fill the rows in that new column with the value of the number unemployed divided by the total population.''
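A quick sanity check never hurts after calculating a new column. Because the rate is a proportion, every value should fall between 0 and 1; base R's \texttt{summary()} makes that easy to verify:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# The min and max of a proportion should sit between 0 and 1.}
\FunctionTok{summary}\NormalTok{(unemployment\_data}\SpecialCharTok{$}\NormalTok{unemployment\_rate)}
\end{Highlighting}
\end{Shaded}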
\hypertarget{save-unemployment-data}{%
\subsection{Save unemployment data}\label{save-unemployment-data}}
Before going any further, we will save the \texttt{unemployment\_data} object to a CSV file in the \texttt{data/} sub-directory, or folder, using the \texttt{write\_csv()} function from \emph{readr}. Note that the \texttt{data/} folder must already exist in your project; \texttt{write\_csv()} will not create it for you:
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{write\_csv}\NormalTok{(unemployment\_data, }\StringTok{"data/unemployment\_data.csv"}\NormalTok{) }
\end{Highlighting}
\end{Shaded}
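Saving the file means that, in a later session, we can reload the data with \emph{readr}'s \texttt{read\_csv()} instead of querying the Census API again; a minimal sketch, assuming the file sits where we just wrote it:
\begin{Shaded}
\begin{Highlighting}[]
\CommentTok{\# Reload the saved dataset in a future session.}
\NormalTok{unemployment\_data }\OtherTok{\textless{}{-}} \FunctionTok{read\_csv}\NormalTok{(}\StringTok{"data/unemployment\_data.csv"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}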
\hypertarget{find-the-areas-with-the-highest-unemployment}{%
\subsection{Find the areas with the highest unemployment}\label{find-the-areas-with-the-highest-unemployment}}
Getting back to \emph{dplyr}, we need to find the areas with the highest unemployment. We'll use \texttt{arrange()} to sort the dataset by unemployment rate in descending order and then look at only the top 10 locations in the dataset.
\begin{Shaded}