Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compatibility with sf library #2273

Closed
rafapereirabr opened this issue Jul 17, 2017 · 31 comments · Fixed by #5224
Closed

compatibility with sf library #2273

rafapereirabr opened this issue Jul 17, 2017 · 31 comments · Fixed by #5224
Labels
non-atomic column e.g. list columns, S4 vector columns
Milestone

Comments

@rafapereirabr
Copy link

I wonder if it would be possible to make data.table compatible with the new sf library. The library sf is promising to be a game changer for spatial analysis in R so it sounds like good idea to bring together the power of both libraries.

Currently, the class of an sf object is "sf" "data.frame" and it brings a column named geometry of class "sfc_MULTIPOLYGON" "sfc" .

Right now, I think the main incompatibility between dt and sf is that when we convert an sf data.frame into a data.table using setDT(), the geometry column is ruined and the object is not recognised anymore as an sf class.

I'm sure there are other points to take into account when making these two great libraries together so I just wanted to put the ball rolling if someone has not done this before.

@MichaelChirico

This comment has been minimized.

@mattdowle
Copy link
Member

Reproducible example please.

@MichaelChirico
Copy link
Member

#1310 is related

@rafapereirabr
Copy link
Author

Reproducible Example

library(data.table)
library(sf)

# load sf data
  nc <- st_read( system.file("shape/nc.shp", package="sf") )
  
  class(nc)
  > [1] "sf"         "data.frame"


head(nc)
  > Simple feature collection with 6 features and 14 fields
  > geometry type:  MULTIPOLYGON
  > dimension:      XY
  > bbox:           xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
  > epsg (SRID):    4267
  > proj4string:    +proj=longlat +datum=NAD27 +no_defs
  >    AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79                        geometry
  > 1 0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1      10  1364     0      19 1 MULTIPOLYGON(((-81.47275543...
  > 2 0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0      10   542     3      12 2 MULTIPOLYGON(((-81.23989105...
  > 3 0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5     208  3616     6     260 3 MULTIPOLYGON(((-80.45634460...
  > 4 0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1     123   830     2     145 4 MULTIPOLYGON(((-76.00897216...
  > 5 0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9    1066  1606     3    1197 5 MULTIPOLYGON(((-77.21766662...
  > 6 0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7     954  1838     5    1237 6 MULTIPOLYGON(((-76.74506378...

Now when we try setting the sf object into data.table, note that the object looses both its "sf" class and the information in the geometry column

# set sf into data.table
  setDT(nc)
  
  class(nc)
  > [1] "data.table" "data.frame"
  
  head(nc)
  >     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79 geometry
  > 1: 0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1      10  1364     0      19     <XY>
  > 2: 0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0      10   542     3      12     <XY>
  > 3: 0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5     208  3616     6     260     <XY>
  > 4: 0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1     123   830     2     145     <XY>
  > 5: 0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9    1066  1606     3    1197     <XY>
  > 6: 0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7     954  1838     5    1237     <XY>
  ```

@rafapereirabr
Copy link
Author

For the record, I've also created an issue in the sf github page #428 because the compatibility between the two libraries might request a bit of collaboration from both sides.

@etiennebr
Copy link

etiennebr commented Jul 18, 2017

If you lose the sf class you can still rely on the sfc class, but you have to take care of more stuff. Also, returning a geometry in a data.table seems tricky, but my DT skills are rusty, so there might be a better way. There is no reason it shouldn't work in theory, but in practice some things might have to be ironed. Looking at the tidy verbs implementation should give a good idea of what needs to be reconciled.

library(sf)
#> Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.2.1 r14555
library(data.table)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
#> Reading layer `nc' from data source `/usr/local/lib/R/site-library/sf/shape/nc.shp' (...)
#> Simple feature collection with 100 features and 14 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
nc <- setDT(nc)

nc[AREA > 0.1, st_area(geometry)]
#> Units: m^2
#>  [1] 1137388604 1423489919 1520740530 1179347051 1232769242 1136287383
#>  [...]
#> [61] 1264050755 2288682896 2181174167 2450241815 2165843695
nc[AREA > 0.1, sum(st_area(st_union(geometry))), by = SID74]
#>     SID74              V1
#>  1:     1  3598847644 m^2
#>  2:     5 11299175600 m^2
#> [...]
#> 20:    15  3850824500 m^2
#> 21:    29  1978619669 m^2
#> 22:    31  2439553215 m^2
#>     SID74              V1

But returning geometries is a problem

nc[, st_union(geometry), by = SID74]
#>     SID74
#>  1:     1
#> [...]
#> 83:    31
#>     SID74
#>                                                                                                                           V1
#>  1:                                                                                                                   <list>
#>  2:                                                                                                                   <list>
[...]
#> 83: -78.86451,-78.91947,-78.95074,-78.97536,-79.00224,-79.00642, 34.47720, 34.45364, 34.44938, 34.39917, 34.38804, 34.36627,
#>                                                                                                                           V1

Another path is to nc <- st_as_sf(setDT(nc)), but there are still problems.

@rafapereirabr
Copy link
Author

A quick note that @SymbolixAU has been working on a project creating a spatial.data.table More info HERE. I'm not sure creating a new class is the way forward, but I'm sure there is a good overlap between all these projects.

@SymbolixAU
Copy link

FYI, the reason I'm extending the data.table class is because i want to make use of Google's encoded polylines to store shape information. But these polylines can be hundreds of characters, so I wanted a custom print method to handle them.

more info/background on what I'm doing
SO: https://stackoverflow.com/questions/44819369/extend-data-table-class-with-custom-print-method
blog: https://www.symbolix.com.au/blog-main/x7gdctwdhre678bsplakplkclg8hg4

@vlulla
Copy link
Contributor

vlulla commented Sep 7, 2017

The problem @etiennebr is describing is because aggregation of geometry can lead to two different kinds of geometries (POLYGON and MULTIPOLYGON in this case). Try the following:

nc <- st_read(system.file("shape/nc.shp",package="sf"))
nc_DT <- as.data.table(nc)
nc %>% group_by(SID74) %>% summarise(geom = st_union(geometry))
nc_DT[,st_union(geometry),by=SID74][,table(SID74)] # indices where frequency > 1 are multipolygon geometries!

I still don't understand why data.table shows 83 rows for nc_DT[, st_union(geometry),by=SID74]. Regardless of the geometry aggregation issue shouldn't data.table still return only 23 rows here?

@vlulla
Copy link
Contributor

vlulla commented Jan 29, 2019

I have realized my error and I know that I have to do nc_DT[, .(st_union(geometry)), by=.(SID74)] but that still doesn't give me a valid object that I can plot!

nc %>% group_by(SID74) %>% summarise(geom=st_union(geometry)) %>% plot  # works!
plot(nc_DT[, .(st_union(geometry)), by=.(SID74)]) # does not work...some weird error!

@SymbolixAU
Copy link

@vlulla try

setDT(nc)[, .(st_union(geometry)), by=.(SID74)] %>% sf::st_as_sf() %>% plot

@vlulla

This comment has been minimized.

@mattdowle
Copy link
Member

mattdowle commented Jan 29, 2019

Am I right in thinking it would still be nice if setDT() retained the sf in the class?
Or can this issue be closed now on the data.table side. The solutions above are workarounds?
I reran the reproducible example above with 1.12.0 but it's still the same. Note that geometry column is not really lost though. That's just a printing/display difference. It's still all there :

> nc$geometry
Geometry set for 100 features 
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
epsg (SRID):    4267
proj4string:    +proj=longlat +datum=NAD27 +no_defs
First 5 geometries:
MULTIPOLYGON (((-81.47276 36.23436, -81.54084 3...
MULTIPOLYGON (((-81.23989 36.36536, -81.24069 3...
MULTIPOLYGON (((-80.45634 36.24256, -80.47639 3...
MULTIPOLYGON (((-76.00897 36.3196, -76.01735 36...
MULTIPOLYGON (((-77.21767 36.24098, -77.23461 3...
>

@SymbolixAU
Copy link

SymbolixAU commented Jan 29, 2019

I just tried

setDT(nc)
attr(nc, "class") <- c(attr(nc, "class"), "sf")

and it seems to cause more issues, so not sure if there's any immediate benefit in retaining the sf class.

At the moment I'm happy with workarounds. There's lots which can be done inside the j argument using anonymous functions and the like. And as has been mentioned, you don't lose the sfc class on the geometry column, which is the main component.

@mattdowle
Copy link
Member

mattdowle commented Jan 30, 2019

Instead of putting sf last in the class I tried putting it first :
setattr(nc, "class", c("sf", "data.table", "data.frame"))
but this still resulted in problems. Perhaps you could try that and see if those problems can be overcome.
The root issue is perhaps that sf has its own [.sf method which either masks or is masked by [.data.table depending on whether sf is before or after data.table in class().
The class attribute being c("sf", "data.table", "data.frame") makes sense to me on first glance. data.table could change setDT() to retain sf like that. But then sf would likely need to become data.table-aware (import data.table) and call NextMethod() inside its [.sf if inherts(x,"data.table"). Something like that.

@vlulla
Copy link
Contributor

vlulla commented Jan 30, 2019

Where in https://github.com/r-spatial/sf/blob/master/R/sf.R#L299-L335 should this NextMethod() be called? In the else around Line 318?

At one time (when print.sf was not tbl_df aware) I had the following in my .Rprofile but it caused various problems that I did not know how to resolve so it's commented now:

#### print.sf <- function(x, ..., n = ifelse(options("max.print")[[1]] == 99999, 20, options("max.print")[[1]])) {
####   geoms = which(vapply(x,function(col) inherits(col,"sfc"), TRUE))
####   nf = length(x) - length(geoms)
####   app = paste("and", nf, ifelse(nf == 1, "field", "fields"))
####   if (any(!is.na(st_agr(x))))
####     app = paste0(app,"\n","Attribute-geometry relationship: ", sf:::summarize_agr(x))
####   if (length(geoms) > 1)
####     app = paste0(app,"\n","Active geometry column: ", attr(x, "sf_column"))
####   print(st_geometry(x),n=0,what="Simple feature collection with",append=app)
####   if(is.data.table(x)) {
####     ## data.table:::print.data.table(x, ...)
####     NextMethod()
####   } else {
####     print.data.frame(x, ...)
####   }
####   ## print.data.frame(x, ...)
####   invisible(x)
#### }

@mattdowle
Copy link
Member

The lines around 318 in [.sf are like this :

class(x) = setdiff(class(x), "sf") # one step down
x = if (missing(j)) {
	if (nargs == 2) # `[`(x,i)
		x[i] # do sth else for tbl?
	else
		x[i, , drop = drop]
} else
	x[i, j, drop = drop]

You don't need to remove sf from class(x) like that. That's what NextMethod() does for you.
I'm not sure exactly what you'd like to achieve in sf in terms of print or [ methods so I'm not sure what exactly to be done. If you can provide some inputs and expected outputs then I could maybe help out.

@jangorecki
Copy link
Member

jangorecki commented Dec 11, 2020

Since last Matt's reply almost 2 years ago there was no feedback. Also previously mentioned issues were addressed in comments by workarounds or suggestions on how to improve integration on sf side. This is quite popular issue so I am not closing it yet but encourage anyone interested into it, to read it, and provide comment what compatibility improvement could be added on data.table side, providing an example of course.

@jplecavalier
Copy link

I personally am a massive user of data.table and sf and the best way to integrate them together is to use sfc column inside a data.table instead of using an sf object. Everything integrates very well, the only known issue for me is #4217 and I suggested a temporary workaround in the issue.

@CWen001
Copy link

CWen001 commented Dec 13, 2020

Maybe this thread helps paleolimbot/wk#12. It seems that (1) data.table fully supports vctrs, or that (2) data.table supports geovctrs (wk) that are the bridge to sf can be potential solutions.

@grantmcdermott
Copy link
Contributor

As Matt noted above:

That's just a printing/display difference. It's still all there.

With @MichaelChirico's recent PR to enable custom printing for list column types (#3414), I'm wondering whether we could turn some special cases like sfg on by default. For instance:

library(data.table)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.3.1, PROJ 8.0.1

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
setDT(nc)

format_list_item.sfg = function(x, ...) format(x)
nc[1:5, .(NAME, geometry)]
#>           NAME                       geometry
#> 1:        Ashe MULTIPOLYGON (((-81.47276 3...
#> 2:   Alleghany MULTIPOLYGON (((-81.23989 3...
#> 3:       Surry MULTIPOLYGON (((-80.45634 3...
#> 4:   Currituck MULTIPOLYGON (((-76.00897 3...
#> 5: Northampton MULTIPOLYGON (((-77.21767 3...

Created on 2021-08-30 by the reprex package (v2.0.1)

@JoshOBrien
Copy link
Contributor

JoshOBrien commented Oct 13, 2021

@grantmcdermott Based on your nice suggestion, I've prepared a PR for the sf package that will produce pretty-printed sfg columns whenever both sf and data.table are loaded. Because format_list_item() is so far only available in the development version of data.table, though, I'll wait to submit the PR until data.table v1.14.4 is available on CRAN.

@grantmcdermott
Copy link
Contributor

Great, thanks @JoshOBrien!

(Aside: shouldn't the PR be on the data.table side, though?)

@JoshOBrien
Copy link
Contributor

@grantmcdermott That's a great question, and I honestly don't know the answer. FTR, here is the commit I would make into a PR if I do this via sf. Following that commit, an sfg column in a data.table gets printed like this:

library(sf)
library(data.table)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
DT <- data.table(nc)
DT[1:3, .(NAME, FIPS, geometry)]
##         NAME  FIPS                       geometry
## 1:      Ashe 37009 MULTIPOLYGON (((-81.47276 3...
## 2: Alleghany 37005 MULTIPOLYGON (((-81.23989 3...
## 3:     Surry 37171 MULTIPOLYGON (((-80.45634 3...

@jangorecki Do you have any opinion about whether I should submit a PR adding a format_list_item() method format_list_item.sfg() on the sf or the data.table side?

@jangorecki
Copy link
Member

I think it may fit better to sf package, then it may be reused in other packages as well.

@JoshOBrien
Copy link
Contributor

@jangorecki Sounds good. I'll drop another note here once data.table v1.14.4 is out if my PR then gets pulled into the sf package.

@MichaelChirico
Copy link
Member

Certainly my intention when writing the PR was that downstream packages would define methods for their own classes. sf is in a slightly unique case where it doesn't Depend/Import/Suggest data.table, so it's a bit more of a toss-up... still, I think hosting it in sf would be more appropriate.

If they are unwilling to merge we could do so here, since IINM we don't need to add any dependency, just to create an .sfg method.

I had a look at your PR, it looks quite elaborate for just adding a method! I am assuming because of the invisible-line dependency issue.

But the meat of your PR is very simple:

format_list_item.sfg = function(x, ...) {
    format.sfg(x)
}

Maybe format_list_item.default could check if there's a format method registered for the class of the object, and run that as a default, and only lacking that back up to the current <{class}[dim]> approach.

I'm not great with S3 registration stuff, but if this sounds easy to you, a PR would be great.

@grantmcdermott
Copy link
Contributor

Hmmm. Looks I'm in the minority here, thinking that an internal data.table method is the simplest and cleanest thing to do. As Michael (and my earlier post) suggest, it would basically require a single (unexported) line of code.

format_list_item.sfg = function(x, ...) format(x)

Similarly, I don't think this method would change the behaviour for any other data frame(ish) object on the sf side. Both tibbles and vanilla data frames would keep their existing print behaviour. So I'm not sure there's much incentive on their end?

Final argument for keeping this on the data.table side: Converting an sf object to a data.table must not only be done explicitly, but also necessitates a slight change in workflow. (As documented in this thread, we need to refer to the geometry column explicitly when performing spatial operations, e.g. st_union(nc) vs nc_dt[, st_union(geometry)].) That's a perfectly acceptable trade-off from my perspective, for those of us that want to combine the power of the two packages. But the point is we're selecting into this modified workflow. IMHO the print method is just a small (but much appreciated!) convenience layer from the data.table side further enabling this integration.

@JoshOBrien
Copy link
Contributor

Now that I've thought about this a bit more, I think I'm with @grantmcdermott on this one.

It'd be really nice to be able to sidestep the elaborate hoops I go through in that sf-side PR just to be able to register the method without adding data.table to sf as an Imports or Depends. As both @grantmcdermott and @MichaelChirico note, adding the method on the data.table side would require just three lines of code and no additional dependencies.

FTR, I looked into @MichaelChirico suggestion that we could possibly modify the code of data.table:::format_list_item.default(). That solution (as implemented by the code below) works just fine for a simple feature geometry. Unfortunately, it's not a great general approach, since it'll cause all sorts of problems when list items have format methods that don't produce such nicely succinct output For example, it fails (throwing an error) when printing a data.table such as this: x <- data.table(list(mtcars, mtcars))

## Alternative implementation, that uses each list element's class to dispatch a format method
format_list_item.default <- function (x, ...)
{
    exists_format_method <- function(x) {
        !(is.null(getS3method("format", class = x, optional = TRUE)))
    }
    if (is.null(x))
        ""
    else if (is.atomic(x) || inherits(x, "formula"))
        paste(c(format(head(x, 6L), ...), if (length(x) > 6L) "..."),
               collapse = ",")
    else if (any(sapply(class(x), exists_format_method)))
        format(x)
    else
        paste0("<", class(x)[1L], paste_dims(x), ">")
}

@MichaelChirico
Copy link
Member

Thanks for investigating Josh. point taken... I am only trying to avoid a maintenance headache if dozens of formatters are eventually requested in data.table.

the most scalable thing seems to me to be keeping a list of classes where the format method plays well.

but I think that can be kicked down the road a bit. I would be fine with a simple PR to data.table as described (with NEWS and mention it in the .Rd)

JoshOBrien added a commit to JoshOBrien/data.table that referenced this issue Oct 16, 2021
By adding `format_list_item.sfg()`, a method for the recently added
`format_list_item()` S3 generic, we can ensure that any simple feature
geometry columns (with elements of class `"sfg"`) in a data.table will
be pretty printed using the **sf** package's `format.sfg()`. To see
what that looks like, see the example below

```r
library(sf)
library(data.table)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
DT <- data.table(nc)
DT[1:3, .(NAME, FIPS, geometry)]
 ##         NAME  FIPS                       geometry
 ## 1:      Ashe 37009 MULTIPOLYGON (((-81.47276 3...
 ## 2: Alleghany 37005 MULTIPOLYGON (((-81.23989 3...
 ## 3:     Surry 37171 MULTIPOLYGON (((-80.45634 3...
```

FR Rdatatable#2273, starting
[here](Rdatatable#2273 (comment)),
includes a discussion of the pros and cons of adding this method, for
a data type defined in the **sf** package, to **data.table**.
@JoshOBrien
Copy link
Contributor

@MichaelChirico That makes total sense, and I like your idea of, in the longer term, using a list to record classes with format methods that produce reasonable print.data.table-ready output.

In the PR I just pushed, I added this as a new item in NEWS, but could also see folding it into NEWS item 17, which announced the format_list_item() generic.

Also, I didn't add an example to the *.Rd file, only because I didn't want to induce any dependency on sf or trigger any CRAN-check warnings. Let me know if you think an example would be good, though, and I'll see how best to add one that'll satisfy CRAN's requirements.

MichaelChirico pushed a commit to JoshOBrien/data.table that referenced this issue Oct 21, 2021
By adding `format_list_item.sfg()`, a method for the recently added
`format_list_item()` S3 generic, we can ensure that any simple feature
geometry columns (with elements of class `"sfg"`) in a data.table will
be pretty printed using the **sf** package's `format.sfg()`. To see
what that looks like, see the example below

```r
library(sf)
library(data.table)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
DT <- data.table(nc)
DT[1:3, .(NAME, FIPS, geometry)]
 ##         NAME  FIPS                       geometry
 ## 1:      Ashe 37009 MULTIPOLYGON (((-81.47276 3...
 ## 2: Alleghany 37005 MULTIPOLYGON (((-81.23989 3...
 ## 3:     Surry 37171 MULTIPOLYGON (((-80.45634 3...
```

FR Rdatatable#2273, starting
[here](Rdatatable#2273 (comment)),
includes a discussion of the pros and cons of adding this method, for
a data type defined in the **sf** package, to **data.table**.
MichaelChirico pushed a commit to JoshOBrien/data.table that referenced this issue Oct 21, 2021
By adding `format_list_item.sfg()`, a method for the recently added
`format_list_item()` S3 generic, we can ensure that any simple feature
geometry columns (with elements of class `"sfg"`) in a data.table will
be pretty printed using the **sf** package's `format.sfg()`. To see
what that looks like, see the example below

```r
library(sf)
library(data.table)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
DT <- data.table(nc)
DT[1:3, .(NAME, FIPS, geometry)]
 ##         NAME  FIPS                       geometry
 ## 1:      Ashe 37009 MULTIPOLYGON (((-81.47276 3...
 ## 2: Alleghany 37005 MULTIPOLYGON (((-81.23989 3...
 ## 3:     Surry 37171 MULTIPOLYGON (((-80.45634 3...
```

FR Rdatatable#2273, starting
[here](Rdatatable#2273 (comment)),
includes a discussion of the pros and cons of adding this method, for
a data type defined in the **sf** package, to **data.table**.
@mattdowle mattdowle added this to the 1.14.3 milestone Nov 26, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-atomic column e.g. list columns, S4 vector columns
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants