---
title: "ipumsr webinar presenter notes"
output: html_document
---
# NO HEADING (1)
Welcome, everyone, to this webinar on using IPUMS data in R with
ipumsr. We only have an hour, and we want to leave time for questions at the
end, so this will be a whirlwind tour of some of the key features of ipumsr.
Our goal is to get you excited about analyzing IPUMS data using ipumsr, and
point you to additional resources to explore on your own. As you use ipumsr, you
are bound to have questions, and at the end we will provide links to several
places where you can seek help, including by reaching out to our excellent IPUMS
user support team or posting a question on the IPUMS user forum.
We will also share with you the webinar recording, these slides, and all of the
code used to build this presentation, so there's no need to frantically jot down
notes during the webinar itself. You should, however, type any questions in the
Q&A as they occur to you, and our user support expert Grace will either answer
them or add them to the list to be answered at the end.
Finally, I just want to acknowledge that the ipumsr package was created by Greg
Freedman Ellis, and this webinar is based on a presentation created by Greg, so
thanks to Greg for creating a great package, and for letting us use your slides.
With that, we'll get started by briefly introducing ourselves. You'll primarily
be hearing from me and Dan, but the development of the webinar was a
collaborative effort between Dan, Kara, and myself.
# Who we are (2)
(Name, pronouns, academic field, time with IPUMS)
So, my name is Derek Burk, I use he/him pronouns, I've been working on the
IPUMS International team for the past four years, and my academic training is in
sociology.
# Who we are (3)
Rather than our faces, we thought everyone would appreciate pets.
**A FEW NOTES**
This presentation will be recorded and will be available along with slides so don't worry about trying to take notes.
You can always reach out to us through our **GitHub**, or via **user support**.
# Overview (4)
# Overview (5)
# What is IPUMS? (6)
IPUMS has grown substantially since its first beta release in 1993. It started
with **US census data** and has grown to include 9 collections.
*mostly just segueing to projects*
# Poll: Which IPUMS data collections have you used already? (7)
# NO HEADING (8)
# NO HEADING (9)
# NO HEADING (10)
# NO HEADING (11)
Check out Matt Gunther's blog using IPUMS Global Health Data
# NO HEADING (12)
# NO HEADING (13)
# NO HEADING (14)
# NO HEADING (15)
# NO HEADING (16)
# Poll: Which IPUMS data collections are you most excited to use in the future? (17)
# So what is IPUMS? (18)
So IPUMS really is **data**, and a whole lot of it. These 9 different projects interact with different types of data and at different scales, but they are united in their use of metadata that helps contextualize the data.
*So I know you're asking yourselves...*
# So what is IPUMS? (19)
# Overview (20)
# What is ipumsr? (21)
(Metadata such as value labels, variable labels, and detailed variable
descriptions.)
Initial API support will be for IPUMS USA, with more projects to follow soon.
# Why use ipumsr? (22)
Regarding "One package": Without ipumsr, you'd need to use a variety of
different approaches from different packages to read in and explore IPUMS
**microdata** from IPUMS: USA, CPS, and International, IPUMS
**aggregate data** (from NHGIS or IHGIS),
and **IPUMS shapefiles**. ipumsr provides one
package with a consistent interface for working with all these different types
of IPUMS data.
*a one stop shop that makes it easy to work with ipums data*
# One package to rule them all (23)
# Why use ipumsr? (24)
Regarding "More features": The aforementioned IPUMS API support will be the next
big feature. Another potential new feature is adding tools for properly handling
survey weights. Let us know what would be helpful to you via GitHub or email.
# Why use ipumsr? (25)
There are other ways to read IPUMS data into R, but ipumsr is the fastest way
to read the data in with attached metadata, such as variable and value labels.
This is a relatively small example; time costs, and the differences between
approaches, grow with larger datasets.
# Poll: Have you ever used ipumsr before? (26)
If you've never used ipumsr before, you may be asking, *where can I download it?*
# Installing ipumsr (27)
More advanced users might be thinking, *"this stable release is boring"* in which case you can download the development version to test out the latest features.
The API functions are in the development version, but the API itself is not yet
publicly available.
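In case it's useful, here is a minimal sketch of both install routes (the GitHub repository path is an assumption; check the ipumsr README for the current location):

```r
# Stable release from CRAN (one time)
install.packages("ipumsr")

# Development version from GitHub (repository path assumed; see the README)
# install.packages("remotes")
remotes::install_github("mnpopcenter/ipumsr")
```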
# To run the code in this webinar (28)
Repeat: slides/recording will be available. To run this code yourself, you can clone/download the repo from our GitHub. This provides the data extracts you'll need.
However, you may need to install some additional packages used in the example code below, as shown on the next slide.
# Install R packages (as needed) (29)
reminder, installing packages only needs to happen once, but some packages do update frequently so it can be a good idea to re-install once in a while
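As a sketch, with the package names assumed from the examples later in this webinar:

```r
# One-time installs; re-run occasionally to pick up updated versions
install.packages(c("dplyr", "ggplot2", "sf", "haven"))
```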
# Loading packages (30)
These packages are used in some of the examples we will walk through.
# Overview (31)
Dan passes to Kara/Derek
# Downloading your data extract (32)
Kind of confusing how to save the DDI/.xml file. THIS IS HOW.
The DDI is EXTREMELY important, as it contains all the instructions regarding the METADATA.
Once your extract is complete, download the data file and the DDI. Downloading
the DDI is a little bit different depending on your browser. On most browsers
you should right-click the file and select “Save As…”. If this saves a file with
a .xml file extension, then you should be ready. However, Safari users must
select “Download Linked File” instead of “Download Linked File As”. On Safari,
selecting the wrong version of these two will download a file with a .html file
extension instead of a .xml extension.
In case anyone was curious, DDI stands for "Data Documentation Initiative" --
the DDI project sets standards for documenting datasets, and the codebooks for
most IPUMS projects follow this standard.
Make sure to save the data and DDI files in the same location.
# Downloading your data extract (33)
The links under "Command Files" contain program-specific code for reading in the
data. The R one contains the code we'll show on the next slide.
This helper code checks that you have ipumsr installed, and if you do, it reads
in the DDI codebook and data into separate objects.
# Read in the data (34)
So you've downloaded both the data and the DDI codebook, and saved them in the
same folder. Here's how you actually read the data into R.
The first option, and the one I'd recommend, is to read the DDI codebook into an
object named "ddi" using the `read_ipums_ddi()` function, and then supply that
ddi object to `read_ipums_micro()`.
The second option is to read in the data by passing the file path to the DDI
codebook to the `read_ipums_micro()` function. However, we recommend the first
pattern because it's often helpful to save the codebook to its own object to
preserve the original metadata once you start messing around with the data.
Notice that in both cases we are supplying the codebook (either already read in
or as a file path), not the data file, to `read_ipums_micro()`. The fact that we
don't point to the actual data file when reading in the data might be confusing.
If I want to read in the data, shouldn't I point to the data file?
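Here is a minimal sketch of both options (the file name `usa_00001.xml` is a hypothetical placeholder for your own extract's DDI file):

```r
library(ipumsr)

# Option 1 (recommended): read the DDI codebook into its own object first
ddi <- read_ipums_ddi("usa_00001.xml")
data <- read_ipums_micro(ddi)

# Option 2: pass the DDI file path directly to read_ipums_micro()
data <- read_ipums_micro("usa_00001.xml")
```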
# Why do I point to the codebook to read in the data? (35)
Here's my attempt to explain this in a hopefully memorable way: the data file
is the raw ingredients, and the codebook is the recipe. And when you go to make
dinner, your recipe tells you how to combine your ingredients to get the
finished product.
# The data file is just raw ingredients (36)
Looking at the first ten lines of our example data file, it makes sense why you
can't just tell R "read this data file into a data.frame". The data file itself
doesn't contain any information about what the variables are called or which
numbers on each line correspond to which variables, let alone what the meaning
of those values might be.
(In case you're wondering what those blanks are, that's a geography variable
that is not available in some years of our data.)
# The DDI codebook is the recipe (37)
The DDI codebook tells you what to do with the ingredients in the data file. It
contains a bunch of information about your extract, including the name of the
data file, which is why you don't need to supply the path to the data file if
you saved the codebook and data in the same folder.
But perhaps the most important element for understanding and analyzing the data
is the "var_info" element, so let's take a look at that.
# The DDI codebook is the recipe (38)
As we can see here, var_info is just a data.frame where each row has information
about one variable, including the variable name, label, description, and value
labels, as well as where the variable is located in the fixed-width file.
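To look at this element yourself, `ipums_var_info()` returns it as a tibble (a sketch, assuming the `ddi` object read in earlier):

```r
# One row per variable: name, label, description, value labels, and
# the variable's position in the fixed-width data file
ipums_var_info(ddi)
```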
You don't need to remember the details of what's in the DDI object or this
var_info data.frame, as long as you remember that to read in the data, you must
point to the DDI codebook, not the data file.
# Overview (39)
# What's in my extract again? (40)
So we've read in our data -- how do we go about exploring it?
We could refer back to the description we wrote when creating the extract,
maybe that will be informative. Let's see (read extract description). Ooh,
that's not very helpful. I'm guessing this joke hits close to home for some of
you long-time IPUMS users.
# What's in my extract again? (41)
[Read text on slides]
# Available metadata (42)
So you can see here that ipumsr provides functions to extract information from
the DDI object without the need to slice and dice that "var_info" data.frame I
showed before. Here we grab the label and description for the variable PHONE.
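A sketch of those calls, assuming the `ddi` object from earlier:

```r
ipums_var_label(ddi, PHONE)  # short variable label
ipums_var_desc(ddi, PHONE)   # longer variable description
```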
# Available metadata (43)
Similarly, we can print the value labels by pointing to the DDI. We see here
that the variable PHONE takes four values: no, yes, not applicable, and
suppressed.
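As a sketch:

```r
ipums_val_labels(ddi, PHONE)  # value-label pairs for PHONE
```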
# An interactive view of metadata (44)
For a more browsable view of the metadata, the ipums_view() function makes a
nicely-formatted static html page that allows you to browse the metadata
associated with your data extract.
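For example:

```r
ipums_view(ddi)  # opens a browsable HTML summary of the extract metadata
```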
# Wrangling value labels (45)
IPUMS value labels don't translate perfectly to R factors. The most important
difference is that in a factor, values always count up from 1. In IPUMS
variables, the values themselves often encode meaningful information about a
nested structure, which we'll see with the education variable below.
To preserve both the value and label, `ipumsr` imports labelled variables as
`haven::labelled()` objects, but these aren't always the easiest to deal with,
because they aren't widely supported by functions from base R and other
packages. Thus, you will almost always want to convert these objects to
something like a factor or a regular numeric or character vector.
Luckily ipumsr provides helpers that allow you to use information from both the
value and label to transform your variables into the shape you want.
# haven::labelled columns at a glance (46)
Here's what the data look like when you read them in.
One thing I like about the behavior of `haven::labelled()` objects is how they
show the value and the label when you print your data.
But let's say you just don't want to deal with these special haven::labelled
objects, and you want to get rid of them in one fell swoop. I would advise
against this approach in general, but it could be what you want in some cases.
# Get rid of haven::labelled columns (47)
One option is the `as_factor()` function, which can be applied to the whole data
frame at once. Here we see that the values have been stripped away, leaving only
the labels as factor levels.
# Get rid of haven::labelled columns (48)
Alternatively, if you just want the numeric values, you can use `zap_labels()`
on the whole data frame. Here we see that the labels have been stripped away,
leaving only the numeric codes.
Both `as_factor()` and `zap_labels()` can be applied to a single column of a
data frame as well, to just strip away values or labels for that one column.
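A sketch of both the whole-data-frame and single-column usage (both functions come from the haven package):

```r
library(haven)

data_fct <- as_factor(data)   # labels become factor levels; values dropped
data_num <- zap_labels(data)  # numeric codes kept; labels dropped

data$PHONE_fct <- as_factor(data$PHONE)  # or convert just one column
```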
But if you're going to want to recode any of these variables, you'll often be
able to do that more elegantly when you have access to both values and labels,
with the help of special ipumsr functions that leverage all that information.
These special functions might take some trial and error when you first start
using them, but I promise the payoff will be worth it.
***I wonder if a transition slide here would be nice to transition away from
getting rid of labelled columns to using them with ipumsr. Would maybe help
to distinguish which functions (ours) we are suggesting to use. ***
# Using ipumsr label helper functions (49)
# `lbl_na_if()` (50)
The first label helper function is label NA if, which allows you to set values
to NA, which is R's code for missing values, according to either the value or
the label.
Notice that we've called `ipums_val_labels()` directly on the labelled column,
rather than pointing to the DDI object. This works if PHONE is still a
haven::labelled column.
In this case, given the value labels we see here, it makes sense to set values 0
and 8 to missing.
# `lbl_na_if()` (51)
`lbl_na_if()` supports a special syntax used by a few of ipumsr's label
functions that allows you to use anonymous functions (signaled by the "tilde"
symbol, which is that little squiggly line) that refer to either the value of a
variable (using the special dot val variable), the label (using dot lbl), or
both. Here we use dot val to specify that PHONE2 should be set to missing if the
original value is zero or eight.
This can be read as, "recode variable PHONE to NA if the value is zero or
eight".
Notice in all these examples that we create a new variable for the transformed
version, so that we don't overwrite the original. This can be useful so that
you can engage in some trial and error until you get the variable transformed as
you want it.
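A sketch of the call being described (assumes `data` holds the extract and dplyr is loaded):

```r
data <- data %>%
  mutate(PHONE2 = lbl_na_if(PHONE, ~ .val %in% c(0, 8)))
```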
# `lbl_na_if()` (52)
In case you're unfamiliar with the anonymous function notation, that squiggly
tilde could be replaced with a regular function definition, as shown here. So
anytime you see the tilde symbol in any of these label helper functions, you
can read it as "a function of dot val and dot lbl".
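Written out as a sketch, that equivalent looks like:

```r
data <- data %>%
  mutate(PHONE2 = lbl_na_if(PHONE, function(.val, .lbl) .val %in% c(0, 8)))
```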
# `lbl_na_if()` (53)
Because lbl_na_if works with both values and labels, we could accomplish the
same thing by referring to the labels. This time, we specify that PHONE3 should
be set to missing if the original label was "N/A" or "Suppressed".
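As a sketch:

```r
data <- data %>%
  mutate(PHONE3 = lbl_na_if(PHONE, ~ .lbl %in% c("N/A", "Suppressed")))
```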
# `lbl_collapse()` (54)
[Read bullet point]
We see here that values less than 10 indicate missing or no schooling values;
values between 10 and 19 all capture levels between nursery school and grade 4;
and so on.
# `lbl_collapse()` (55)
[Read bullet point]
The label collapse function allows you to collapse values based on a
hierarchical coding scheme.
Here we use the integer division operator "percent forward slash percent" to
group values with the same first digit, and label collapse automatically assigns
the label of the smallest original value.
You can
interpret this integer division expression as "how many times does 10 go into
this value?" For original values 0, 1, and 2, the answer is that 10 goes into
the value zero times, so those values all receive the same output value, with
the label coming from the smallest original value. I think the details of what's
going on here take a bit of time to unpack, but hopefully this gives you a sense
of the usefulness and power of this function.
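A sketch of that call (the output column name is a hypothetical choice):

```r
data <- data %>%
  mutate(EDUC_GROUPED = lbl_collapse(EDUCD, ~ .val %/% 10))
```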
# Still too detailed for a graph (56)
This shows the values and labels that we get by applying lbl_collapse() to
collapse the last digit. However, if we want to visualize educational
differences, as we'll do below, this might still be too much detail.
# `lbl_relabel()` (57)
That's where `lbl_relabel()` comes in.
Label relabel offers the most power and flexibility by allowing you to create an
arbitrary number of new labeled values, controlling both the resulting value
and label, using the special dot val and dot lbl variables to refer to the
values and labels of the input vector.
Label relabel is powerful, but its syntax may be confusing at first glance, so
let's unpack what's going on here. The first line just creates a regular
expression string that will match either 1, 2, or 3 years of college, and which
we use a few lines below.
The piped expression takes the EDUCD variable, then applies the same label
collapse expression we used before to collapse the last digit, resulting in the
values and labels shown on the previous slide. So, that collapsed education
variable is the input to the lbl_relabel() call.
Each line in the lbl_relabel() call creates a labeled value using the l-b-l (or
"label") helper function. So the first line can be read as "assign a value of
two with a label of "Less than High School" when the input value is greater
than 0 and less than 6".
# `lbl_relabel()` (58)
Notice that we have examples of using the dot val variable on the first and
fourth lines...
# `lbl_relabel()` (59)
...and of using the dot lbl variable on the second and third lines. The
third line of the lbl_relabel() call can be read as "assign a value of 4 with a
label of "Some college" when the input label matches our "college_regex"
regular expression.
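Here is a sketch reconstructing the call being described. The first and third `lbl()` lines follow the notes; the second and fourth are illustrative placeholders, and the `college_regex` pattern is an assumption:

```r
college_regex <- "^(1|2|3) years? of college"  # pattern assumed

data <- data %>%
  mutate(
    EDUC_SIMPLE = EDUCD %>%
      lbl_collapse(~ .val %/% 10) %>%
      lbl_relabel(
        lbl(2, "Less than High School") ~ .val > 0 & .val < 6,  # uses .val
        lbl(3, "High School") ~ .lbl == "Grade 12",             # placeholder; uses .lbl
        lbl(4, "Some college") ~ grepl(college_regex, .lbl),    # uses .lbl
        lbl(5, "College degree or more") ~ .val >= 10           # placeholder; uses .val
      )
  )
```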
# `lbl_relabel()` (60)
So, now we have an education variable with fewer categories, which is more
suitable for visualization.
For more on these helper functions, see the "Value labels" vignette, which we
will link to below.
# Overview (61)
# Phone availability (62)
Microdata often needs to be summarized at a higher level for visualization. In
this case, if we want to make a graph of phone availability over time, we need
to first summarize at the year level.
The first block of code computes the weighted proportion of persons with access
to a phone in each year. Once our data are summarized so that each row
represents a value for one year, we can make a graph by year.
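A sketch of that two-step pattern (PERWT as the person weight and `PHONE2 == 2` as "has a phone" are assumptions about this extract's coding):

```r
library(dplyr)
library(ggplot2)

# Step 1: weighted proportion with phone access, one row per year
phone_by_year <- data %>%
  group_by(YEAR) %>%
  summarize(phone_pct = weighted.mean(PHONE2 == 2, w = PERWT, na.rm = TRUE))

# Step 2: now that each row is a year, plot the trend
ggplot(phone_by_year, aes(x = YEAR, y = phone_pct)) +
  geom_line()
```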
# Phone availability (63)
Here we see a strange decline, then resurgence, in access to a phone line.
What's going on here?
# Interpretation (64)
If we look at the Comparability tab for the PHONE variable on the IPUMS USA
website, we find an explanation:
The 2008 ACS and 2008 PRCS instructed respondents to include cell phone service;
prior to 2008, this was not made explicit.
Just one small example of the value added to these data by the IPUMS USA team.
# Phone availability by education (65)
With our simplified education variable, we can also plot access to a phone line
by education. Here we see that anomalous drop and resurgence is present at all
education levels, and that differences in phone access by education have
declined since 1960.
# Overview (66)
# Getting geographic data (67)
[Read bullet points]
# Getting geographic data (68)
Geographic boundary data is usually found via a "Geography and GIS" link on the
left sidebar of the home page for a data collection, under the heading
"Supplemental Data".
Under that link, you can browse available shape files, then download the ones
you want in a zip archive, and unzip the contents of the archive into their
own folder within your project directory.
# Loading shape data (69)
The sp package (short for "spatial") has been around since 2005. It is more
established and some other R spatial packages might still assume you are using
sp data structures. The sf package (short for "simple features") is newer, but
seems to be on the rise as an alternative to sp for some use cases.
For more on the differences between sp and sf, see the free online book
*Geocomputation with R*, which we link to at the end of the slides.
# Loading shape data (70)
At the risk of oversimplifying, an sf object is basically a tibble or data.frame
with a special geometry column. That simplification helps me understand the
process of joining the sf object to data.
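As a sketch, loading the boundary file with ipumsr's `read_ipums_sf()` (the path is a hypothetical placeholder):

```r
library(sf)

# Returns an sf object: a tibble with a special geometry column
shape <- read_ipums_sf("shapefiles/us_conspuma.zip")
```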
# Joining shape data (71)
Before joining to shape data, we need to summarize our person level data at the
level of the geography we are joining to. Thus, the first block of code computes
the weighted proportion of persons with access to a phone for each CONSPUMA
unit in each year. Once our data are summarized so that each row represents a
value for one CONSPUMA unit in one year, we can join to the sf object we loaded
above.
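A sketch of that summarize-then-join pattern (again, PERWT, PHONE2, and the CONSPUMA id column are assumptions about this extract):

```r
# Weighted phone access for each CONSPUMA unit in each year
phone_by_geo <- data %>%
  group_by(YEAR, CONSPUMA) %>%
  summarize(phone_pct = weighted.mean(PHONE2 == 2, w = PERWT, na.rm = TRUE))

# Join the summary onto the sf object, matching on the CONSPUMA id
phone_shape <- ipums_shape_inner_join(phone_by_geo, shape, by = "CONSPUMA")
```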
# Plotting shape data (72)
# Plotting shape data (73)
As with phone access by education, these maps show evidence of declining
geographic variation in access to a phone line.
For more information on IPUMS geographic data, see the ipumsr vignette on
working with geographic data, or one of several collection-specific recorded
webinars and tutorials on that topic, which can be found on the tutorials page
which we will link to below.
# Overview (74)
# API Timeline (75)
The API is not yet publicly available, but we wanted to offer a preview of the
functionality that will soon be available in ipumsr.
We plan to support creating extracts via API for all of our data collections,
but that support will be rolled out one collection at a time, to allow us to
thoroughly test that the extract API is working as expected for each collection.
IPUMS USA is the first collection that will be supported, and we are currently
doing internal testing of the API for USA. We'll put out a call for beta testers
before the end of this year, but feel free to reach out in the meantime if you
want to be added to that list. We expect to open up the USA extract API to all
IPUMS USA users in the first few months of next year, depending on what sort of
issues arise during beta testing.
Another important note is that there is already a public API for the IPUMS NHGIS
collection, which offers access to tabular data from the US Census Bureau, as
well as corresponding geographic boundaries. ipumsr does not yet include
functions for interacting with the NHGIS API, but there is a guide to
interacting with that API in R, which we'll share a link to at the end of these
slides, and we plan to add that functionality to ipumsr sometime next year.
# What can I do with the API? (76)
So, your next question might be, "what will I be able to do with this API?"
Here's the high-level overview:
You'll be able to:
- Define and submit extract requests.
- Check the status of a submitted extract, or have R "wait" for an extract to
finish by periodically checking the status until it is ready to download.
- Download completed data extracts.
- Get information on past extracts, including the ability to pull down the
definition of a previous extract, revise that definition, and resubmit it.
And finally, you can save your extract definition to a JSON file, allowing you
to share the extract definition with other users for collaboration or
reproducibility. Saving as JSON makes the definition more easily shareable
across languages, since R is not the only way to interact with the IPUMS API --
you can also call the API using a general purpose tool like curl, and IPUMS is
developing API client tools for Python in parallel with the R client tools that
will be part of the ipumsr package.
# What can't I do with the API? (77)
The other important question to answer is what the API can't do.
Most importantly, it can't deliver data "on demand" -- extracts are still
created through the same extract system used by the website, which means you
have to wait for your extract to process before you can download it.
In other words, the API does not create a separate system of accessing IPUMS
data, but rather provides a programmatic way to create and download extracts
through the existing extract system.
Secondly, you can't use the API to explore what data are available from IPUMS.
We plan to add a public metadata API, but the timeline on that is more unknown.
A third limitation is that API users will not initially have access to all the
bells and whistles of the extract system, such as the ability to attach
characteristics of family members, like spouses or children. We plan to add
these capabilities once the core functionality is well-tested and stable.
# Pipe-friendly example (78)
Now let's get to some example code! We'll start with a brief example that shows
a typical API workflow using the "pipe" operator from the magrittr package,
which is often used in conjunction with tidyverse packages such as dplyr.
We start by defining an extract object using the `define_extract()` function,
and specifying
- a data collection -- in this case, "usa"
- an extract description -- "Occupation by sex and age"
- samples -- here we're asking for the 2017 and 2018 American Community Survey
- and variables -- sex, age, industry and occupation.
The next code chunk shows how we can go from this extract definition to having
our extract data available in our R session in one piped expression. For any of
you who are unfamiliar with the pipe operator, it can be read aloud as the word
"then". So this piped expression says, take "my_extract", *then* submit extract,
*then* wait for extract, *then* download extract, *then* read in the data using
`read_ipums_micro()`.
Since we have to wait for the extract to process, this piped expression will
take a bit of time to execute, depending on the size of your extract.
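Because these functions were still pre-release at the time, treat this as a sketch of the workflow rather than a final interface; the sample IDs and argument names are assumptions:

```r
my_extract <- define_extract(
  "usa",
  description = "Occupation by sex and age",
  samples = c("us2017a", "us2018a"),        # 2017 and 2018 ACS (IDs assumed)
  variables = c("SEX", "AGE", "IND", "OCC")
)

data <- my_extract %>%
  submit_extract() %>%
  wait_for_extract() %>%
  download_extract() %>%
  read_ipums_micro()
```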
# Pipe-friendly example (79)
And just for fun, here's what the same piped expression looks like with the new
base R pipe, available as of R 4.1.
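That is, something like:

```r
data <- my_extract |>
  submit_extract() |>
  wait_for_extract() |>
  download_extract() |>
  read_ipums_micro()
```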
# Pipe-friendly example (80)
Here are the messages you'll see if you run that code once the API is live. As
you can see, the `wait_for_extract()` function doubles the wait time after every
status check, but the maximum wait time between checks is capped at five
minutes. In the testing I've done, even large extracts of around 30 million
records and 40-50 variables can finish in under 30 minutes, but the wait times
also depend on how much traffic the extract system is getting.
# Revise and resubmit (81)
Another handy pattern with the extract API is a "revise and resubmit" workflow.
Often, you realize that you should have added one or a few more variables to
your previous extract, and you just want to resubmit that extract with those
additions.
To do this with ipumsr functions, you first pull down the definition of your
most recent extract using the function `get_recent_extracts_info_list()`, which
can return information on up to 10 recent extracts. Here, we specify that we
only want one recent extract (the most recent one), but because this function
always returns a list, we also have to subset the first element.
Alternatively, if we know the extract number of the extract we want to revise,
we can use `get_extract_info()` and specify a shorthand notation for USA
extract number 33, for example.
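A sketch of both approaches (pre-release interfaces, so argument details may shift):

```r
# Most recent extract: the function returns a list, so take the first element
old_extract <- get_recent_extracts_info_list("usa", how_many = 1)[[1]]

# Or, by extract number, using the "collection:number" shorthand
old_extract <- get_extract_info("usa:33")
```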
# Revise and resubmit (82)
Once we have the old extract definition, we can pass it to the
`revise_extract()` function to add a variable, then resubmit it.
The `revise_extract()` function currently allows you to add and remove variables
and samples, as well as change the description, data format, or data structure
of your extract definition.
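As a sketch (the added variable and the argument name are hypothetical):

```r
revised_extract <- revise_extract(old_extract, vars_to_add = "RACE")
submitted_extract <- submit_extract(revised_extract)
```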
# Share your extract definition (83)
One thing that really excites us about the API is the ability to share extract
definitions to facilitate collaboration or reproducibility.
To write your extract out to a JSON file, you can use the
`save_extract_as_json()` function as shown here.
Saving as JSON makes it easier to share an extract definition with another user
who might not use R -- they can use the JSON definition to submit the extract
using curl or Python, possibly using the currently in-development IPUMS API
client tools in the "ipumspy" module.
If you are using ipumsr, the `define_extract_from_json()` function will create
an extract object from a JSON-formatted definition shared by someone else.
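A sketch of the round trip (the file name is a hypothetical placeholder):

```r
# Write the definition out for sharing
save_extract_as_json(my_extract, file = "occupation_extract.json")

# A collaborator recreates the extract object from the shared file
shared_extract <- define_extract_from_json("occupation_extract.json")
```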
As Dan pointed out as we were preparing this webinar, this functionality could
also be useful for instructors who want to share an extract definition with
students for a class assignment.
# Overview (84)
# Resources (85)