ARROW-6532 [R] write_parquet() uses writer properties (general and arrow specific) #5451
Conversation
Yeah, I'm not sure about this on a few levels. I think the R I want to type to write a compressed Parquet file looks like a single `compression` argument to `write_parquet()`. I'm also not sure whether this works as intended: the Parquet C++ code seems to have its own compression and writing logic; that may be a historical artifact, or it may be meaningful. Maybe we can get away without implementing bindings for those classes -- the proof would be a passing test of writing a compressed Parquet file and reading it back in. Then again, maybe in principle we should write the Parquet bindings to match the C++ library.
It is not a good idea to write a Parquet file into a `CompressedOutputStream`. Such a file will not be readable, since Parquet already compresses data internally.
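A minimal sketch of the distinction (the `CompressedOutputStream` usage is illustrative and left commented out; it is not part of this PR):

```r
library(arrow, warn.conflicts = FALSE)
df <- data.frame(x = 1:5)

# Anti-pattern: compressing the sink wraps the whole file, footer included,
# so Parquet readers can no longer seek to the magic bytes and row groups.
# sink <- CompressedOutputStream$create("bad.parquet.gz", codec = "gzip")
# write_parquet(df, sink)

# Preferred: let the Parquet writer compress data pages internally.
write_parquet(df, "good.parquet", compression = "gzip")
```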
Here's the way we handle it in Python; you'll need to do the same thing in R: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L363
Force-pushed from d85b6fc to aa2833e
Some progress, inspired by the Python implementation:

```r
write_parquet <- function(
  table,
  sink,
  chunk_size = NULL,
  version = NULL,
  compression = NULL,
  use_dictionary = NULL,
  write_statistics = NULL,
  data_page_size = NULL,
  properties = ParquetWriterProperties$create(
    version = version,
    compression = compression,
    use_dictionary = use_dictionary,
    write_statistics = write_statistics,
    data_page_size = data_page_size
  ),
  use_deprecated_int96_timestamps = FALSE,
  coerce_timestamps = NULL,
  allow_truncated_timestamps = FALSE,
  arrow_properties = ParquetArrowWriterProperties$create(
    use_deprecated_int96_timestamps = use_deprecated_int96_timestamps,
    coerce_timestamps = coerce_timestamps,
    allow_truncated_timestamps = allow_truncated_timestamps
  )
)
```

The new arguments are managed by the `ParquetWriterProperties` and `ParquetArrowWriterProperties` classes. Only simple versions work so far, e.g.

```r
library(arrow, warn.conflicts = FALSE)
df <- tibble::tibble(x = 1:5)
write_parquet(df, "/tmp/test.parquet", compression = "snappy")
read_parquet("/tmp/test.parquet")
#> # A tibble: 5 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5
```

but we can't yet specify that particular variables should get a particular compression. This is a good place, I think, for a tidy select, e.g. something like:

```r
df <- tibble::tibble(x1 = 1:5, x2 = 1:5, y = 1:5)
write_parquet(df, "/tmp/test.parquet",
  compression = list(snappy = starts_with("x"))
)
```

The list in Python goes the other way, so if we do something similar it would look like

```r
write_parquet(df, "/tmp/test.parquet",
  compression = list(x1 = "snappy", x2 = "snappy")
)
```

Perhaps we can have

```r
write_parquet(df, "/tmp/test.parquet",
  compression = compression_spec(snappy = starts_with("x"))
)
```
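For illustration, here is a minimal base-R sketch of how a per-column spec could be normalized; the helper name `expand_compression` and the default codec are hypothetical, not part of the PR:

```r
# Hypothetical helper: expand a compression spec into one codec per column.
# `spec` may be NULL, a single string, an unnamed vector with one entry per
# column, or a named vector covering a subset of the columns.
expand_compression <- function(spec, columns, default = "uncompressed") {
  out <- stats::setNames(rep(default, length(columns)), columns)
  if (is.null(spec)) {
    return(out)
  }
  if (is.null(names(spec))) {
    # A single codec recycles over all columns; otherwise one per column.
    stopifnot(length(spec) %in% c(1L, length(columns)))
    out[] <- spec
    return(out)
  }
  stopifnot(all(names(spec) %in% columns))
  out[names(spec)] <- spec
  out
}

expand_compression("snappy", c("x1", "x2", "y"))
#> x1, x2, y -> "snappy"
expand_compression(c(x1 = "snappy", x2 = "snappy"), c("x1", "x2", "y"))
#> x1, x2 -> "snappy"; y -> "uncompressed"
```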
One option we discussed with @nealrichardson was to be able to do e.g.

```r
write_parquet(df, "/tmp/test.parquet",
  compression = Codec$create("snappy", 5L)
)
```

But unfortunately that doesn't work out with the C++ `Codec` class. Instead, I followed Python's lead, and we can do this:

```r
write_parquet(df, "/tmp/test.parquet",
  compression = "snappy",
  compression_level = 5L
)
```
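As a usage sketch of that final shape (gzip is used here because it takes a meaningful compression level; whether every codec honors `compression_level` is an assumption):

```r
library(arrow, warn.conflicts = FALSE)
df <- tibble::tibble(x = 1:5)
tmp <- tempfile(fileext = ".parquet")
# A single codec for the whole file, with an explicit compression level.
write_parquet(df, tmp, compression = "gzip", compression_level = 5L)
read_parquet(tmp)
```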
These arguments are the ones handled by `ParquetWriterProperties`.
Taking a look now; FTR Travis says
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #5451       +/-   ##
===========================================
- Coverage    88.7%   76.76%    -11.94%
===========================================
  Files         964       59       -905
  Lines      128215     4330    -123885
  Branches     1501        0      -1501
===========================================
- Hits       113731     3324    -110407
+ Misses      14119     1006     -13113
+ Partials      365        0       -365
```

Continue to review full report at Codecov.
A few more notes. I'd also like to see better coverage on https://codecov.io/gh/apache/arrow/pull/5451/diff
```r
make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  if (is_integerish(version)) {
```
I'd write this function as

```r
make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  pq_version <- valid_versions[[version]]
  if (is.null(pq_version)) {
    stop('"version" should be one of ', oxford_paste(names(valid_versions), "or"), call. = FALSE)
  }
  pq_version
}
```
As it stands, `make_valid_version(1)` won't work, and it seems like it should. Per the codecov report, this code isn't being exercised.
Works for me:

```r
arrow:::make_valid_version("1.0")
#> [1] 0
arrow:::make_valid_version("2.0")
#> [1] 1
arrow:::make_valid_version(1)
#> [1] 0
arrow:::make_valid_version(2)
#> [1] 1
```

Created on 2019-09-27 by the reprex package (v0.3.0.9000)
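For context on why both the string and integer forms work, here is a minimal illustration; the exact definition of `valid_parquet_version` is an assumption, but the values match the reprex above:

```r
# Assumed mapping (values inferred from the reprex above).
valid_parquet_version <- list("1.0" = 0L, "2.0" = 1L)

# `[[` matches character input by name ...
valid_parquet_version[["1.0"]]  # 0
# ... and integerish input by position, which happens to line up.
valid_parquet_version[[1]]      # 0
```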
r/R/parquet.R (outdated):

```r
write_statistics = NULL,
data_page_size = NULL,

properties = ParquetWriterProperties$create(
```
I'm not sure there's value in including `properties` and `arrow_properties` in the signature here. I kept them in `read_delim_arrow()` because there were some properties they expose that aren't mapped to arguments in the `readr::read_delim` signature, but that doesn't seem to be the case here. (On reflection, that's probably not the right call there either; if you want lower-level access to those settings, you should probably be doing `CsvTableReader$create(...)` anyway.)
My rationale was that perhaps you'd already have built those `properties` and `arrow_properties` objects before.
But I get the point that maybe this could be diverted to using a `ParquetFileWriter` instance.
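A sketch of what that lower-level path might look like; the class and method names here follow the pattern of the other R6 bindings in this PR and are assumptions, not the merged API:

```r
library(arrow, warn.conflicts = FALSE)
tab <- Table$create(tibble::tibble(x = 1:5))

# Build the properties object once, then drive the writer directly
# instead of passing it through write_parquet().
props <- ParquetWriterProperties$create(compression = "snappy")
sink <- FileOutputStream$create("out.parquet")
writer <- ParquetFileWriter$create(tab$schema, sink, properties = props)
writer$WriteTable(tab, chunk_size = 1024L)
writer$Close()
```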
Some notes on the docs
r/R/parquet.R (outdated):

```r
#' @param compression compression specification. Possible values:
#' - a single string: uses that compression algorithm for all columns
#' - an unnamed string vector: specify a compression algorithm for each, same order as the columns
#' - a named string vector: specify compression algorithm individually
```
And not all columns need to be specified, correct?
Updated.
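For illustration, assuming the clarified docs (columns not named in the vector fall back to the default codec):

```r
df <- tibble::tibble(x1 = 1:5, x2 = 1:5, y = 1:5)
# Only x1 is named; x2 and y are assumed to get the default codec.
write_parquet(df, tempfile(fileext = ".parquet"),
  compression = c(x1 = "snappy")
)
```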
r/R/parquet.R
Outdated
#' - an unnamed string vector: specify a compression algorithm for each, same order as the columns | ||
#' - a named string vector: specify compression algorithm individually | ||
#' @param compression_level compression level. A single integer, a named integer vector | ||
#' or an unnamed integer vector of the same size as the number of columns of `table` |
Does this follow the same conventions as `compression`? Maybe there should be a paragraph/section in the docs that explains how these parameters work, since they're the same/similar.
I've refactored the documentation in `@details`.
r/R/parquet.R
Outdated
#' @param compression_level compression level. A single integer, a named integer vector | ||
#' or an unnamed integer vector of the same size as the number of columns of `table` | ||
#' @param use_dictionary Specify if we should use dictionary encoding. | ||
#' @param write_statistics Specify if we should write statistics |
Same, and what are statistics?
I don't know
Force-pushed from 354d263 to 50555f8
One final style question, but otherwise LGTM; happy to merge today regardless of where we land on the whitespace question.
```r
col_select = NULL,
as_data_frame = TRUE,
props = ParquetReaderProperties$create(),
...) {
```
This is "bad", according to the tidyverse style guide, which I believed we were trying to follow: https://style.tidyverse.org/functions.html#long-lines-1
I can get used to whatever style conventions we decide, just want to make sure we're in agreement.
I'll set up my RStudio to obey the style; perhaps we should use `styler::` once in a while to do that automatically.
Can you update the PR description to reflect what is actually in the PR? (Writing a Parquet file into a CompressedOutputStream isn't recommended -- you would have to decompress the entire file first to be able to read any part of it.)
The ability to preserve categorical values was introduced in #5077 as the convention of storing a special `ARROW:schema` key in the metadata. To invoke this, we need to call `ArrowWriterProperties::store_schema()`. The R binding is already ready for this, but it calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. Though I don't see the motivation for implementing it that way in #5451, considering that [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
This adds parameters to `write_parquet()` to control compression, whether to use dictionary encoding, etc., on top of the C++ classes `parquet::WriterProperties` and `parquet::ArrowWriterProperties`, e.g.
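The original example is elided here; based on the signature and examples earlier in the thread, a representative call presumably looks like:

```r
library(arrow, warn.conflicts = FALSE)
df <- tibble::tibble(x = 1:5)
# compression/use_dictionary/write_statistics map to parquet::WriterProperties;
# the timestamp options map to parquet::ArrowWriterProperties.
write_parquet(df, "/tmp/test.parquet",
  compression = "snappy",
  use_dictionary = TRUE,
  write_statistics = TRUE
)
```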