[R] open_dataset - add file_name as column #30754
Comments
Martin du Toit / @martindut: |
Nicola Crane / @thisisnic: |
Martin du Toit / @martindut: I hope this makes sense |
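Nicola Crane / @thisisnic:
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
# `directory` is the folder that contains the CSV files
files <- list.files(directory, recursive = TRUE, full.names = TRUE)
for (file in files) {
  # read each file, add its path as a column, and write it back in place
  data <- read_csv_arrow(file)
  data <- mutate(data, filename = file)
  write_csv_arrow(data, file = file)
}
I also wonder whether the code required to solve ARROW-14612 might bring us closer to making this possible. It would be good to hear others' thoughts here. |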
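A variation on the loop above, for anyone who would rather not overwrite the source files: read each file, tag its rows with the path, and combine the results in memory. This is only a minimal sketch under stated assumptions. The temporary directory, the two small CSV files, and the names `tagged` and `combined` are made up for illustration, and everything is loaded eagerly, so it only suits data that fits in memory.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical example data: two small CSV files in a temporary directory.
directory <- tempfile()
dir.create(directory)
write_csv_arrow(mtcars[1:3, ], file.path(directory, "a.csv"))
write_csv_arrow(mtcars[4:6, ], file.path(directory, "b.csv"))

files <- list.files(directory, recursive = TRUE, full.names = TRUE)

# Read each file and tag its rows with the path they came from,
# leaving the files on disk untouched.
tagged <- lapply(files, function(file) {
  mutate(read_csv_arrow(file), filename = file)
})

# Stack the per-file data frames into one.
combined <- bind_rows(tagged)
```

Each row of `combined` then carries the path of the file it was read from in the `filename` column.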
Weston Pace / @westonpace: That being said, the low-level ScanBatchesAsync method actually returns a generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is a struct containing the record batch as well as the source fragment for that record batch. So if you were to execute a scan, you could inspect the fragment and, if it is a FileFragment, extract the filename.
Another challenge is that R is moving towards more and more access through an exec plan rather than using a scanner directly. For that to work, we would need to augment the scan results with the filename in C++ before sending them into the exec plan. Luckily, we already do something similar: we currently augment the scan results with the fragment index, the batch index, and whether the batch is the last batch in the fragment. Since ExecBatch can work with constants efficiently, I don't think there will be much performance cost in always including the filename.
So the remaining work is simply to add a new augmented field, __fragment_source_name, which is always attached if the underlying fragment is backed by a file. Users can then get this field by including "__fragment_source_name" in the list of columns they query for. |
Martin du Toit / @martindut: |
Nicola Crane / @thisisnic: |
Nicola Crane / @thisisnic: In Python, we can do something like In the body of So we'll need to make some sort of change so that we can select this "metadata" kind of column. It may be complicated further by the fact that this deviates a bit from the usual way of using |
Dewey Dunnington / @paleolimbot:
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
tf <- tempfile()
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)
# works!
scanner <- Scanner$create(
  open_dataset(tf),
  projection = c("__filename", names(ds))
)
as_tibble(scanner$ToTable())
#> # A tibble: 32 × 12
#> `__filename` mpg disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 /private/var/fol… 22.8 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 /private/var/fol… 24.4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 /private/var/fol… 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
#> 4 /private/var/fol… 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 5 /private/var/fol… 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 6 /private/var/fol… 33.9 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 7 /private/var/fol… 21.5 120. 97 3.7 2.46 20.0 1 0 3 1
#> 8 /private/var/fol… 27.3 79 66 4.08 1.94 18.9 1 1 4 1
#> 9 /private/var/fol… 26 120. 91 4.43 2.14 16.7 0 1 5 2
#> 10 /private/var/fol… 30.4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> # … with 22 more rows, and 1 more variable: cyl <int>
# seems that we still can't use __filename in a filter expr
Scanner$create(
  open_dataset(tf),
  projection = c("__filename", names(ds)),
  filter = Expression$create(
    "match_substring",
    Expression$field_ref("__filename"),
    options = list(pattern = "cyl=8")
  )
)
#> Error: Invalid: No match for FieldRef.Name(__filename) in mpg: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: int32
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1717 CheckNonEmpty(matches, root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/dataset/scanner.cc:782 ref.FindOne(*scan_options_->dataset_schema) |
Nicola Crane / @thisisnic: |
Neal Richardson / @nealrichardson:
This is tricky because we (and apparently elsewhere in the C++ code) have logic to filter out secret internal columns like this: https://github.com/apache/arrow/blob/master/r/R/query-engine.R#L159-L163. Sounds like we need to find a safe way to loosen that, or otherwise rethink the implementation. In terms of UX in R, a special helper like |
That's a C++ problem (just filed ARROW-16115)
In C++ if the user doesn't specify any projection we default to "all columns but not augmented columns". I think that's the only time we filter out these special columns and I think we want to keep this interpretation. |
Weston Pace / @westonpace:
In other words, you might think we would take the hint and read only the files matching that pattern. That is not the case: we will read the entire dataset and apply the "cyl=8" filter in memory. If we want to push filters on the filename column down to the scan, we will need to add some special logic. Feel free to create a JIRA. |
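Until filters on the filename can be pushed down, one possible way to avoid reading the whole dataset is to choose the files by path before opening it. The following is a minimal sketch, assuming `open_dataset()` is given a character vector of file paths. The names `all_files`, `cyl8_files` and `ds8` are illustrative and do not come from this thread, and the approach is a workaround rather than anything the scanner does for you.

```r
library(arrow, warn.conflicts = FALSE)

# Same example dataset as earlier in the thread: mtcars partitioned by cyl.
tf <- tempfile()
write_dataset(mtcars, tf, partitioning = "cyl")

# Choose the files up front so that only the matching ones are ever opened.
all_files <- list.files(tf, recursive = TRUE, full.names = TRUE)
cyl8_files <- grep("cyl=8", all_files, value = TRUE, fixed = TRUE)

# Open a dataset made up of just the selected files.
ds8 <- open_dataset(cyl8_files)
nrow(ds8)
```

Note that partition columns encoded in the directory names (here `cyl`) may not be recovered when files are passed explicitly like this.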
Neal Richardson / @nealrichardson: |
Hi. Is it possible to add the file_name as a column to a dataset?
This works, but I need the file_name as a column.
Thanks
Reporter: Martin du Toit / @martindut
Assignee: Nicola Crane / @thisisnic
Note: This issue was originally created as ARROW-15260. Please see the migration documentation for further details.