
ARROW-6532 [R] write_parquet() uses writer properties (general and arrow specific) #5451

Conversation

@romainfrancois (Contributor) commented Sep 20, 2019:

This adds parameters to write_parquet() to control compression, whether to use dictionary encoding, etc., on top of the C++ classes parquet::WriterProperties and parquet::ArrowWriterProperties, e.g.

write_parquet(tab, file, compression = "gzip", compression_level = 7)

@nealrichardson (Member) commented:
Yeah, I'm not sure about this on a few levels. I think the R I want to type to write a compressed Parquet file looks like write_parquet(df, file = "file.parquet", compression = "snappy"). This should be naturally exposed to the casual user, without having to create a CompressedOutputStream directly.

I'm also not sure whether this works as intended. The Parquet C++ code seems to have its own compression and writing logic; that may be a historical artifact, or it may be meaningful. Maybe we can get away without implementing bindings for those classes; the proof would be a passing test of writing a compressed Parquet file and reading it back in. Then again, maybe in principle we should write the Parquet bindings to match the C++ library.

@wesm (Member) commented Sep 20, 2019:

It is not a good idea to write a Parquet file into a CompressedOutputStream. Such a file will not be readable with read_parquet().

Parquet already compresses data internally.

@wesm (Member) commented Sep 20, 2019:

Here's how we handle it in Python; you'll need to do the same thing in R:

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L363

@romainfrancois force-pushed the ARROW-6532/write_parquet_compression branch from d85b6fc to aa2833e on September 24, 2019
@romainfrancois (Contributor, Author) commented:
Some progress, inspired by the Python implementation. write_parquet() gains many parameters:

write_parquet <- function(
  table,
  sink,
  chunk_size = NULL,
  version = NULL,
  compression = NULL,
  use_dictionary = NULL,
  write_statistics = NULL,
  data_page_size = NULL,
  properties = ParquetWriterProperties$create(
    version = version,
    compression = compression,
    use_dictionary = use_dictionary,
    write_statistics = write_statistics,
    data_page_size = data_page_size
  ),
  use_deprecated_int96_timestamps = FALSE,
  coerce_timestamps = NULL,
  allow_truncated_timestamps = FALSE,
  arrow_properties = ParquetArrowWriterProperties$create(
    use_deprecated_int96_timestamps = use_deprecated_int96_timestamps,
    coerce_timestamps = coerce_timestamps,
    allow_truncated_timestamps = allow_truncated_timestamps
  )
)

that are managed by the classes ParquetWriterProperties and ParquetArrowWriterProperties.

Only simple versions so far; e.g. compression may only be a single string, so we can do:

library(arrow, warn.conflicts = FALSE)

df <- tibble::tibble(x = 1:5)
write_parquet(df, "/tmp/test.parquet", compression = "snappy")
read_parquet("/tmp/test.parquet")
#> # A tibble: 5 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

but we can't, e.g., specify that particular columns should get a particular compression. This seems like a good place for a tidy select, e.g. something like this:

df <- tibble::tibble(x1 = 1:5, x2 = 1:5, y = 1:5)
write_parquet(df, "/tmp/test.parquet", 
  compression = list(snappy = starts_with("x"))
)

The mapping in Python goes the other way (column name to codec), so if we do something similar it would look like:

write_parquet(df, "/tmp/test.parquet", 
  compression = list(x1 = "snappy", x2 = "snappy")
)

Perhaps compression = could only handle the same kind of thing Python does, and we could then add a helper function so that we'd have e.g.:

write_parquet(df, "/tmp/test.parquet", 
  compression = compression_spec(snappy = starts_with("x"))
)
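To make the idea concrete, here is a toy base-R sketch of what the proposed compression_spec() helper could look like. This is purely hypothetical: real tidyselect semantics (starts_with() etc.) are elided, and the selector is treated as a plain column-name prefix.

```r
# Hypothetical sketch of the proposed compression_spec() helper.
# The selector is a plain prefix standing in for tidyselect's starts_with().
compression_spec <- function(spec, column_names) {
  # every column defaults to no compression
  out <- rep("uncompressed", length(column_names))
  names(out) <- column_names
  for (codec in names(spec)) {
    # stand-in for starts_with(): simple prefix match on column names
    out[startsWith(column_names, spec[[codec]])] <- codec
  }
  out
}

compression_spec(list(snappy = "x"), c("x1", "x2", "y"))
# x1 and x2 map to "snappy"; y stays "uncompressed"
```

The real helper would presumably evaluate tidyselect expressions against the table's schema rather than matching prefixes.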

@romainfrancois (Contributor, Author) commented:

One option we discussed with @nealrichardson was to be able to do e.g.

write_parquet(df, "/tmp/test.parquet", 
  compression = Codec$create("snappy", 5L)
)

But unfortunately, the C++ class arrow::util::Codec does not provide a way to get the compression level back out, so I can't do e.g. compression$level.

Instead, I followed Python's lead, so we can do this:

write_parquet(df, "/tmp/test.parquet", 
  compression = "snappy", 
  compression_level = 5L
)

@romainfrancois (Contributor, Author) commented:

The arguments handled by ParquetWriterProperties (compression, compression_level, use_dictionary, and write_statistics) can now be single values, unnamed vectors of the same length as the number of columns in the table, or named vectors.
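A base-R sketch of how such an argument could be normalized to one value per column; the helper name expand_option is hypothetical and not the PR's actual code:

```r
# Hypothetical helper: normalize a writer-properties argument
# (compression, use_dictionary, ...) to one value per column.
expand_option <- function(value, column_names, default) {
  n <- length(column_names)
  if (is.null(value)) {
    # nothing supplied: every column gets the default
    out <- rep(default, n)
  } else if (is.null(names(value))) {
    # a single value recycles; otherwise the length must match the columns
    stopifnot(length(value) == 1 || length(value) == n)
    out <- rep(value, length.out = n)
  } else {
    # named vector: listed columns get their value, the rest the default
    stopifnot(all(names(value) %in% column_names))
    out <- rep(default, n)
    names(out) <- column_names
    out[names(value)] <- value
    out <- unname(out)
  }
  out
}

expand_option("snappy", c("x", "y"), default = "uncompressed")
expand_option(c(x = "gzip"), c("x", "y"), default = "uncompressed")
```

Not all columns need to be named; unnamed ones fall back to the default.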

@nealrichardson (Member) commented:

Taking a look now; FTR Travis says

Missing link or links in documentation object 'write_parquet.Rd':

  ‘to_arrow’

@codecov-io commented Sep 26, 2019:

Codecov Report

Merging #5451 into master will decrease coverage by 11.93%.
The diff coverage is 65.9%.


@@             Coverage Diff             @@
##           master    #5451       +/-   ##
===========================================
- Coverage    88.7%   76.76%   -11.94%     
===========================================
  Files         964       59      -905     
  Lines      128215     4330   -123885     
  Branches     1501        0     -1501     
===========================================
- Hits       113731     3324   -110407     
+ Misses      14119     1006    -13113     
+ Partials      365        0      -365
Impacted Files Coverage Δ
r/R/record-batch.R 97.36% <ø> (-0.04%) ⬇️
r/R/field.R 92.85% <ø> (-0.48%) ⬇️
r/src/arrow_types.h 96% <ø> (ø) ⬆️
r/R/schema.R 31.25% <ø> (+7.72%) ⬆️
r/R/type.R 83.9% <ø> (-0.19%) ⬇️
r/R/enums.R 0% <ø> (ø) ⬆️
r/R/message.R 75% <ø> (+21.15%) ⬆️
r/R/array.R 77.14% <ø> (+4.92%) ⬆️
r/src/compression.cpp 85.71% <0%> (-14.29%) ⬇️
r/R/feather.R 63.33% <100%> (ø) ⬆️
... and 928 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 46a14db...413dd41.

@nealrichardson (Member) left a review comment:

A few more notes. I'd also like to see better coverage on https://codecov.io/gh/apache/arrow/pull/5451/diff

On r/R/parquet.R:

make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  if (is_integerish(version)) {
@nealrichardson (Member):
I'd write this function as

make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  pq_version <- valid_versions[[version]]
  if (is.null(pq_version)) {
    stop('"version" should be one of ', oxford_paste(names(valid_versions), "or"), call. = FALSE)
  }
  pq_version
}

As it stands, make_valid_version(1) won't work, and it seems like it should.

Per the codecov report, this code isn't being exercised.

@romainfrancois (Contributor, Author) replied:
Works for me (wfm):

arrow:::make_valid_version("1.0")
#> [1] 0
arrow:::make_valid_version("2.0")
#> [1] 1
arrow:::make_valid_version(1)
#> [1] 0
arrow:::make_valid_version(2)
#> [1] 1

Created on 2019-09-27 by the reprex package (v0.3.0.9000)
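A base-R sketch reconciling both behaviors, accepting "1.0"/"2.0" strings as well as bare numbers like 1 or 2. The names and the number-to-string coercion here are hypothetical, not the PR's actual implementation:

```r
# Hypothetical mapping of user-facing versions to the C++ enum values,
# based on the outputs shown in the reprex above (0 for "1.0", 1 for "2.0").
valid_parquet_version <- list("1.0" = 0L, "2.0" = 1L)

make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  if (is.numeric(version)) {
    # coerce bare numbers to the string form, e.g. 1 -> "1.0"
    version <- sprintf("%.1f", version)
  }
  pq_version <- valid_versions[[version]]
  if (is.null(pq_version)) {
    stop('"version" should be one of ',
         paste(names(valid_versions), collapse = " or "), call. = FALSE)
  }
  pq_version
}

make_valid_version("1.0")  # 0
make_valid_version(2)      # 1
```

Looking up through a named list keeps the validation and the error message in one place.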

On r/R/parquet.R:

  write_statistics = NULL,
  data_page_size = NULL,
  properties = ParquetWriterProperties$create(
@nealrichardson (Member):
I'm not sure there's value in including properties and arrow_properties in the signature here. I kept them in read_delim_arrow() because there were some properties they expose that aren't mapped to arguments in the readr::read_delim signature, but that doesn't seem to be the case here. (On reflection, that's probably not the right call there either; if you want lower-level access to those settings, you should probably be doing CsvTableReader$create(...) anyway.)

@romainfrancois (Contributor, Author):
My rationale was that perhaps you'd already have built the properties and arrow_properties objects beforehand.

@romainfrancois (Contributor, Author):
But I get the point that maybe this could be diverted to using a ParquetFileWriter instance.

@nealrichardson (Member) left a review comment:

Some notes on the docs

On r/R/parquet.R:
#' @param compression compression specification. Possible values:
#' - a single string: uses that compression algorithm for all columns
#' - an unnamed string vector: specify a compression algorithm for each, same order as the columns
#' - a named string vector: specify compression algorithm individually
@nealrichardson (Member):
And not all columns need to be specified, correct?

@romainfrancois (Contributor, Author):
Updated.

On r/R/parquet.R:
#' - an unnamed string vector: specify a compression algorithm for each, same order as the columns
#' - a named string vector: specify compression algorithm individually
#' @param compression_level compression level. A single integer, a named integer vector
#' or an unnamed integer vector of the same size as the number of columns of `table`
@nealrichardson (Member):
Does this follow the same conventions as compression? Maybe there should be a paragraph/section in the docs that explains how these parameters work since it's the same/similar.

@romainfrancois (Contributor, Author):
I've refactored the documentation into @details.

On r/R/parquet.R:
#' @param compression_level compression level. A single integer, a named integer vector
#' or an unnamed integer vector of the same size as the number of columns of `table`
#' @param use_dictionary Specify if we should use dictionary encoding.
#' @param write_statistics Specify if we should write statistics
@nealrichardson (Member):
Same, and what are statistics?

@romainfrancois (Contributor, Author):
I don't know

@romainfrancois force-pushed the ARROW-6532/write_parquet_compression branch from 354d263 to 50555f8 on September 27, 2019
@nealrichardson (Member) left a review comment:

One final style question, but otherwise LGTM; happy to merge today regardless of where we land on the whitespace question.

On r/R/parquet.R:

  col_select = NULL,
  as_data_frame = TRUE,
  props = ParquetReaderProperties$create(),
  ...) {
@nealrichardson (Member):
This is "bad", according to the tidyverse style guide, which I believed we were trying to follow: https://style.tidyverse.org/functions.html#long-lines-1

I can get used to whatever style conventions we decide, just want to make sure we're in agreement.

@romainfrancois (Contributor, Author):
I'll set up my RStudio to obey the style; perhaps we should run styler:: once in a while to do that automatically.

@wesm (Member) commented Sep 27, 2019:

Can you update the PR description to reflect what is actually in the PR? (Writing a Parquet file into a CompressedOutputStream isn't advisable: you would have to decompress the entire file first to be able to read any part of it.)

@romainfrancois changed the title from "ARROW-6532 [R] Write parquet files with compression" to "ARROW-6532 [R] write_parquet() uses writer properties (general and arrow specific)" on Sep 27, 2019
nealrichardson pushed a commit that referenced this pull request Jan 8, 2020
The ability to preserve categorical values was introduced in #5077 as the convention of storing a special `ARROW:schema` key in the metadata. To invoke this, we need to call `ArrowWriterProperties::store_schema()`.

The R binding is already prepared for this, but calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. Though I don't see the motivation for implementing it that way in #5451, considering that [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>