Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

count() gives wrong results with dbplyr 2.4.0, not 2.3.4 #1398

Closed
andrew-schulman opened this issue Nov 7, 2023 · 3 comments
Closed

count() gives wrong results with dbplyr 2.4.0, not 2.3.4 #1398

andrew-schulman opened this issue Nov 7, 2023 · 3 comments
Labels
backend 🕺 bug an unexpected problem or unintended behavior

Comments

@andrew-schulman
Copy link

With dplyr 2.4.0 and an Oracle database back end (ROracle 1.3), tbl() %>% count() always returns 11 when the table row count is more than 11.

# Example 1: dbplyr 2.4.0

install.packages('tidyverse')
#> Installing package into 'C:/Users/xxx/Software/R/win-library/4.3'
#> (as 'lib' is unspecified)
#> package 'tidyverse' successfully unpacked and MD5 sums checked
#> 
#> The downloaded binary packages are in
#>  C:\Users\xxx\AppData\Local\Temp\RtmpwXPxVZ\downloaded_packages
install.packages('https://cran.r-project.org/src/contrib/dbplyr_2.4.0.tar.gz')
#> Installing package into 'C:/Users/xxx/Software/R/win-library/4.3'
#> (as 'lib' is unspecified)
#> inferring 'repos = NULL' from 'pkgs'

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)

C <- dbConnect(ROracle::Oracle(),
               username = "xxx",
               password = "xxx", 
               dbname="(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=xxx)(PORT=1521))(CONNECT_DATA=(SID=xxx)))")

# Count of table rows, by direct query:
dbGetQuery(C, "select count(*) n from points_auto_removed")
#>       N
#> 1 29579

# The above is the correct row count.
# Count of table rows, by dbplyr:

T <- tbl(C, sql('points_auto_removed'))

# SQL is correct...
T %>% count() %>% show_query()
#> <SQL>
#> SELECT COUNT(*) AS "n"
#> FROM (points_auto_removed) "q01"

# ... but the count is wrong:
T %>% count()
#> # Source:   SQL [1 x 1]
#> # Database: OraConnection
#>       n
#>   <dbl>
#> 1    11

Created on 2023-11-07 with reprex v2.0.2

With dbplyr 2.3.4, that doesn't happen. The count is correct.

# Example 2: dbplyr 2.3.4

install.packages('tidyverse')
#> Installing package into 'C:/Users/xxx/Software/R/win-library/4.3'
#> (as 'lib' is unspecified)
#> package 'tidyverse' successfully unpacked and MD5 sums checked
#> 
#> The downloaded binary packages are in
#>  C:\Users\xxx\AppData\Local\Temp\Rtmp27yAZ1\downloaded_packages
install.packages('https://cran.r-project.org/src/contrib/Archive/dbplyr/dbplyr_2.3.4.tar.gz')
#> Installing package into 'C:/Users/xxx/Software/R/win-library/4.3'
#> (as 'lib' is unspecified)
#> inferring 'repos = NULL' from 'pkgs'

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)

C <- dbConnect(ROracle::Oracle(),
               username = "xxx",
               password = "xxx",
               dbname="(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=xxx)(PORT=1521))(CONNECT_DATA=(SID=xxx)))")

# Count of table rows, by direct query:
dbGetQuery(C, "select count(*) n from points_auto_removed")
#>       N
#> 1 29579

# dbplyr method agrees:
T <- tbl(C, sql('points_auto_removed'))
T %>% count()
#> # Source:   SQL [1 x 1]
#> # Database: OraConnection
#>       n
#>   <dbl>
#> 1 29579

Created on 2023-11-07 with reprex v2.0.2

@andrew-schulman
Copy link
Author

andrew-schulman commented Nov 8, 2023

More information:

  • When the row count is less than 11, count returns the correct count.
  • On a different client host, my coworker has the same issue, but with an upper count limit of 1000, not 11.
  • We don't see the same behavior with SQLite or RPostgres.
  • We do see the same behavior with odbc backed by an Oracle connection.

I'll work on developing more complete reprexes that can be more easily tested with different row counts, back ends, and dbplyr versions.

@hadley
Copy link
Member

hadley commented Dec 19, 2023

Specific problem excerpted from reprex above:

library(dplyr)
library(DBI)

C <- dbConnect(ROracle::Oracle(),
               username = "xxx",
               password = "xxx", 
               dbname="(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=xxx)(PORT=1521))(CONNECT_DATA=(SID=xxx)))")

dbGetQuery(C, "select count(*) n from points_auto_removed")
#>       N
#> 1 29579

# SQL is correct...
T %>% count() %>% show_query()
#> <SQL>
#> SELECT COUNT(*) AS "n"
#> FROM (points_auto_removed) "q01"

# ... but the count is wrong:
T %>% count()
#> # Source:   SQL [1 x 1]
#> # Database: OraConnection
#>       n
#>   <dbl>
#> 1    11

Can you double check that using exactly the same SQL as dbplyr generates gives the right value?

Otherwise I have no idea what is going wrong 😭 (But at least we now have https://odbc.r-dbi.org/articles/develop.html#oracle to start reproducing the problem.)

@hadley hadley added bug an unexpected problem or unintended behavior backend 🕺 labels Dec 19, 2023
@hadley
Copy link
Member

hadley commented Feb 14, 2024

Duplicate of #1436

@hadley hadley marked this as a duplicate of #1436 Feb 14, 2024
@hadley hadley closed this as completed Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend 🕺 bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

2 participants