Post: Missing elements in list columns #31

Open
gadenbuie opened this issue Jun 4, 2019 · 0 comments

I was going to post this on community.rstudio.com, but it might be better as a blog post: about halfway through I worked out a good answer.

When working with list columns, it can be useful to mark entire elements as missing, but I’m struggling to find a consistent and easy-to-use data structure that works well with unnest().

Here’s a small example with a list column of tibbles, where, ideally, the 2nd element is “missing”. I’d like to unnest() column y but keep all of the rows in the original data frame. In real life, the tibbles in y are more complicated, but when present they all have the same number and type of columns.

The first idea I tried was to store missingness in the list column as NULL, but unnest() throws an error in this case.

library(tidyverse)
(data_null <- tibble(x = 1:2, y = list(tibble(z = 1L), NULL)))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <tibble [1 × 1]>
#> 2     2 <NULL>
data_null %>% unnest()
#> Each column must either be a list of vectors or a list of data frames [y]

The second idea was to use a zero-row data frame. I was hopeful this would work because it’s easy to grab a valid example and use the valid_ex[0, ] trick to create a zero-row data frame with the correct number and type of columns. With a zero-row element, unnest() no longer throws an error, but we lose the row that holds the zero-row data frame.

(data_zero_tibble <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble())))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <tibble [1 × 1]>
#> 2     2 <tibble [0 × 0]>
data_zero_tibble %>% unnest()
#> # A tibble: 1 x 2
#>       x     z
#>   <int> <int>
#> 1     1     1

Even asking unnest() to .preserve column y still drops the row with the zero-row data frame.

data_zero_tibble %>% unnest(y, .preserve = "y")
#> # A tibble: 1 x 3
#>       x y                    z
#>   <int> <list>           <int>
#> 1     1 <tibble [1 × 1]>     1
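
The valid_ex[0, ] trick mentioned above is not shown in the reprex, so here is a minimal sketch of it. The object names valid_ex and empty_ex and the second column w are made up for illustration; the point is only that subsetting zero rows keeps the column names and types.

```r
library(tibble)

# Hypothetical "valid" element, standing in for a real tibble from column y
valid_ex <- tibble(z = 1L, w = "a")

# Subsetting zero rows keeps the columns and their types
empty_ex <- valid_ex[0, ]

dim(empty_ex)           # zero rows, two columns
sapply(empty_ex, class) # column types are unchanged
```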

What does work is to explicitly store a one-row data frame containing NA for each missing element.

(data_na_int <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble(z = NA_integer_))))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <tibble [1 × 1]>
#> 2     2 <tibble [1 × 1]>
data_na_int %>% unnest()
#> # A tibble: 2 x 2
#>       x     z
#>   <int> <int>
#> 1     1     1
#> 2     2    NA

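And neither the type nor the column name of the missing value seems to matter, although a column name that the other elements don't share shows up as an extra column in the result.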

(data_na_chr <- tibble(x = 1:2, y = list(tibble(z = 1L), tibble(.drop = NA_character_))))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <tibble [1 × 1]>
#> 2     2 <tibble [1 × 1]>
data_na_chr %>% unnest()
#> # A tibble: 2 x 3
#>       x     z .drop
#>   <int> <int> <chr>
#> 1     1     1 <NA> 
#> 2     2    NA <NA>

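This might be the best solution, because it's not necessary to know much about the other list elements in advance. All that is needed is a one-row data frame containing an NA; reusing one of the existing column names, as in the second iris example below, avoids adding an extra column. For comparison, the zero-row version of the iris example comes first and again drops the second row.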

(data_iris_zero <- tibble(x = 1:2, y = list(iris[1:2, ], iris[0,])))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <df[,5] [2 × 5]>
#> 2     2 <df[,5] [0 × 5]>
data_iris_zero %>% unnest()
#> # A tibble: 2 x 6
#>       x Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1     1          5.1         3.5          1.4         0.2 setosa 
#> 2     1          4.9         3            1.4         0.2 setosa

(data_iris_na <- tibble(x = 1:2, y = list(iris[1:2, ], data.frame(Sepal.Length = NA))))
#> # A tibble: 2 x 2
#>       x y               
#>   <int> <list>          
#> 1     1 <df[,5] [2 × 5]>
#> 2     2 <df[,1] [1 × 1]>
data_iris_na %>% unnest()
#> # A tibble: 3 x 6
#>       x Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1     1          5.1         3.5          1.4         0.2 setosa 
#> 2     1          4.9         3            1.4         0.2 setosa 
#> 3     2         NA          NA           NA          NA   <NA>
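
If the list column starts out with NULL or zero-row elements, they can be swapped for a one-row NA placeholder before unnesting. The sketch below uses a hypothetical helper, fill_missing(), which is not part of any package; it assumes the placeholder's column name matches a column in the other elements.

```r
library(tibble)

# Hypothetical helper: replace NULL or zero-row elements of a list
# with a one-row placeholder data frame
fill_missing <- function(lst, placeholder) {
  lapply(lst, function(el) {
    # NROW(NULL) is 0, so this catches both NULL and zero-row data frames
    if (NROW(el) == 0) placeholder else el
  })
}

data_null <- tibble(x = 1:2, y = list(tibble(z = 1L), NULL))
data_null$y <- fill_missing(data_null$y, tibble(z = NA_integer_))

# Every element of y is now a one-row tibble, so unnest() should keep both rows
sapply(data_null$y, nrow)
```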

Finally, another solution is to use the zero-row data frame element and full_join() the unnest()ed data with the original data, minus the list column.

full_join(
  data_iris_zero %>% unnest(),
  data_iris_zero %>% select(-y)
)
#> Joining, by = "x"
#> # A tibble: 3 x 6
#>       x Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1     1          5.1         3.5          1.4         0.2 setosa 
#> 2     1          4.9         3            1.4         0.2 setosa 
#> 3     2         NA          NA           NA          NA   <NA>
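
For readers on tidyr 1.0.0 or later (an assumption; the reprex above was created with an earlier version), unnest() has a keep_empty argument that replaces size-0 elements with a single row of missing values. In those versions the column to unnest also has to be named explicitly. A minimal sketch with the same iris data:

```r
library(tibble)
library(tidyr)  # assumes tidyr >= 1.0.0 for keep_empty

data_iris_zero <- tibble(x = 1:2, y = list(iris[1:2, ], iris[0, ]))

# keep_empty = TRUE keeps the row whose list element has zero rows
res <- unnest(data_iris_zero, y, keep_empty = TRUE)
res
```

This avoids both the placeholder data frame and the extra join, at the cost of depending on a newer tidyr.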

Created on 2019-06-04 by the reprex package (v0.2.1)

@gadenbuie gadenbuie self-assigned this Jun 4, 2019