
ARROW-6532 [R] write_parquet() uses writer properties (general and arrow specific) #5451

Conversation

@romainfrancois (Contributor) commented Sep 20, 2019:

This adds parameters to write_parquet() to control compression, whether to use dictionary encoding, etc., on top of the C++ classes parquet::WriterProperties and parquet::ArrowWriterProperties, e.g.

write_parquet(tab, file, compression = "gzip", compression_level = 7)

@nealrichardson (Member) commented:
Yeah, I'm not sure about this on a few levels. I think the R I want to type to write a compressed Parquet file looks like write_parquet(df, file = "file.parquet", compression = "snappy"). This should be naturally exposed to the casual user, without having to create a CompressedOutputStream directly.

I'm also not sure whether this works as intended. The Parquet C++ code seems to have its own compression and writing logic; that may be a historical artifact, or it may be meaningful. Maybe we can get away without implementing bindings for those classes; the proof would be a passing test of writing a compressed Parquet file and reading it back in. Then again, maybe in principle we should write the Parquet bindings to match the C++ library.

@wesm (Member) commented Sep 20, 2019:

It is not a good idea to write a Parquet file into a CompressedOutputStream. Such a file will not be readable with read_parquet().

Parquet already compresses data internally.

@wesm (Member) commented Sep 20, 2019:

Here's how we handle it in Python; you'll need to do the same thing in R:

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L363

@romainfrancois force-pushed the ARROW-6532/write_parquet_compression branch from d85b6fc to aa2833e on September 24, 2019
@romainfrancois (Contributor, Author) commented:
Some progress, inspired by the Python implementation. write_parquet() gains many parameters:

write_parquet <- function(
  table,
  sink,
  chunk_size = NULL,
  version = NULL,
  compression = NULL,
  use_dictionary = NULL,
  write_statistics = NULL,
  data_page_size = NULL,
  properties = ParquetWriterProperties$create(
    version = version,
    compression = compression,
    use_dictionary = use_dictionary,
    write_statistics = write_statistics,
    data_page_size = data_page_size
  ),
  use_deprecated_int96_timestamps = FALSE,
  coerce_timestamps = NULL,
  allow_truncated_timestamps = FALSE,
  arrow_properties = ParquetArrowWriterProperties$create(
    use_deprecated_int96_timestamps = use_deprecated_int96_timestamps,
    coerce_timestamps = coerce_timestamps,
    allow_truncated_timestamps = allow_truncated_timestamps
  )
)

that are managed by the classes ParquetWriterProperties and ParquetArrowWriterProperties.

Only simple versions so far; e.g. compression may only be a single string, so we can do:

library(arrow, warn.conflicts = FALSE)

df <- tibble::tibble(x = 1:5)
write_parquet(df, "/tmp/test.parquet", compression = "snappy")
read_parquet("/tmp/test.parquet")
#> # A tibble: 5 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

but we can't, e.g., specify that particular columns should get a particular compression. This seems like a good place for a tidy select, e.g. something like this:

df <- tibble::tibble(x1 = 1:5, x2 = 1:5, y = 1:5)
write_parquet(df, "/tmp/test.parquet", 
  compression = list(snappy = starts_with("x"))
)

The mapping in Python goes the other way (column name to codec), so if we do something similar it would look like:

write_parquet(df, "/tmp/test.parquet", 
  compression = list(x1 = "snappy", x2 = "snappy")
)

Perhaps compression = could only handle the same kind of thing Python does, and we could then add a helper function so that we'd have e.g.:

write_parquet(df, "/tmp/test.parquet", 
  compression = compression_spec(snappy = starts_with("x"))
)
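To make the idea concrete, here is a toy base-R sketch of what the proposed compression_spec() helper could look like. This is purely hypothetical: real tidyselect semantics (starts_with() etc.) are elided, and the selector is treated as a plain column-name prefix.

```r
# Hypothetical sketch of the proposed compression_spec() helper.
# The selector is a plain prefix standing in for tidyselect's starts_with().
compression_spec <- function(spec, column_names) {
  # every column defaults to no compression
  out <- rep("uncompressed", length(column_names))
  names(out) <- column_names
  for (codec in names(spec)) {
    # stand-in for starts_with(): simple prefix match on column names
    out[startsWith(column_names, spec[[codec]])] <- codec
  }
  out
}

compression_spec(list(snappy = "x"), c("x1", "x2", "y"))
# x1 and x2 map to "snappy"; y stays "uncompressed"
```

The real helper would presumably evaluate tidyselect expressions against the table's schema rather than matching prefixes.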

@romainfrancois (Contributor, Author) commented:

One option we discussed with @nealrichardson was to be able to do e.g.

write_parquet(df, "/tmp/test.parquet", 
  compression = Codec$create("snappy", 5L)
)

But unfortunately, the C++ class arrow::util::Codec does not provide a way to get the compression level back out, so I can't do e.g. compression$level.

Instead, I followed Python's lead, so we can do this:

write_parquet(df, "/tmp/test.parquet", 
  compression = "snappy", 
  compression_level = 5L
)

@romainfrancois (Contributor, Author) commented:

The arguments handled by ParquetWriterProperties (compression, compression_level, use_dictionary, and write_statistics) can now be single values, unnamed vectors of the same length as the number of columns in the table, or named vectors.
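A base-R sketch of how such an argument could be normalized to one value per column; the helper name expand_option is hypothetical and not the PR's actual code:

```r
# Hypothetical helper: normalize a writer-properties argument
# (compression, use_dictionary, ...) to one value per column.
expand_option <- function(value, column_names, default) {
  n <- length(column_names)
  if (is.null(value)) {
    # nothing supplied: every column gets the default
    out <- rep(default, n)
  } else if (is.null(names(value))) {
    # a single value recycles; otherwise the length must match the columns
    stopifnot(length(value) == 1 || length(value) == n)
    out <- rep(value, length.out = n)
  } else {
    # named vector: listed columns get their value, the rest the default
    stopifnot(all(names(value) %in% column_names))
    out <- rep(default, n)
    names(out) <- column_names
    out[names(value)] <- value
    out <- unname(out)
  }
  out
}

expand_option("snappy", c("x", "y"), default = "uncompressed")
expand_option(c(x = "gzip"), c("x", "y"), default = "uncompressed")
```

Not all columns need to be named; unnamed ones fall back to the default.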

@nealrichardson (Member) commented:

Taking a look now; FTR Travis says

Missing link or links in documentation object 'write_parquet.Rd':

  ‘to_arrow’

@codecov-io commented Sep 26, 2019:

Codecov Report

Merging #5451 into master will decrease coverage by 11.93%.
The diff coverage is 65.9%.


@@             Coverage Diff             @@
##           master    #5451       +/-   ##
===========================================
- Coverage    88.7%   76.76%   -11.94%     
===========================================
  Files         964       59      -905     
  Lines      128215     4330   -123885     
  Branches     1501        0     -1501     
===========================================
- Hits       113731     3324   -110407     
+ Misses      14119     1006    -13113     
+ Partials      365        0      -365
Impacted Files Coverage Δ
r/R/record-batch.R 97.36% <ø> (-0.04%) ⬇️
r/R/field.R 92.85% <ø> (-0.48%) ⬇️
r/src/arrow_types.h 96% <ø> (ø) ⬆️
r/R/schema.R 31.25% <ø> (+7.72%) ⬆️
r/R/type.R 83.9% <ø> (-0.19%) ⬇️
r/R/enums.R 0% <ø> (ø) ⬆️
r/R/message.R 75% <ø> (+21.15%) ⬆️
r/R/array.R 77.14% <ø> (+4.92%) ⬆️
r/src/compression.cpp 85.71% <0%> (-14.29%) ⬇️
r/R/feather.R 63.33% <100%> (ø) ⬆️
... and 928 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 46a14db...413dd41.

@nealrichardson (Member) left a review comment:

A few more notes. I'd also like to see better coverage on https://codecov.io/gh/apache/arrow/pull/5451/diff

On r/R/parquet.R:

make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  if (is_integerish(version)) {
@nealrichardson (Member):
I'd write this function as

make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  pq_version <- valid_versions[[version]]
  if (is.null(pq_version)) {
    stop('"version" should be one of ', oxford_paste(names(valid_versions), "or"), call. = FALSE)
  }
  pq_version
}

As it stands, make_valid_version(1) won't work, and it seems like it should.

Per the codecov report, this code isn't being exercised.

@romainfrancois (Contributor, Author) replied:
Works for me (wfm):

arrow:::make_valid_version("1.0")
#> [1] 0
arrow:::make_valid_version("2.0")
#> [1] 1
arrow:::make_valid_version(1)
#> [1] 0
arrow:::make_valid_version(2)
#> [1] 1

Created on 2019-09-27 by the reprex package (v0.3.0.9000)
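A base-R sketch reconciling both behaviors, accepting "1.0"/"2.0" strings as well as bare numbers like 1 or 2. The names and the number-to-string coercion here are hypothetical, not the PR's actual implementation:

```r
# Hypothetical mapping of user-facing versions to the C++ enum values,
# based on the outputs shown in the reprex above (0 for "1.0", 1 for "2.0").
valid_parquet_version <- list("1.0" = 0L, "2.0" = 1L)

make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  if (is.numeric(version)) {
    # coerce bare numbers to the string form, e.g. 1 -> "1.0"
    version <- sprintf("%.1f", version)
  }
  pq_version <- valid_versions[[version]]
  if (is.null(pq_version)) {
    stop('"version" should be one of ',
         paste(names(valid_versions), collapse = " or "), call. = FALSE)
  }
  pq_version
}

make_valid_version("1.0")  # 0
make_valid_version(2)      # 1
```

Looking up through a named list keeps the validation and the error message in one place.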

On r/R/parquet.R:

  write_statistics = NULL,
  data_page_size = NULL,
  properties = ParquetWriterProperties$create(
@nealrichardson (Member):
I'm not sure there's value in including properties and arrow_properties in the signature here. I kept them in read_delim_arrow() because there were some properties they expose that aren't mapped to arguments in the readr::read_delim signature, but that doesn't seem to be the case here. (On reflection, that's probably not the right call there either; if you want lower-level access to those settings, you should probably be doing CsvTableReader$create(...) anyway.)

@romainfrancois (Contributor, Author):
My rationale was that perhaps you'd already have built the properties and arrow_properties objects beforehand.

@romainfrancois (Contributor, Author):
But I get the point that maybe this could be diverted to using a ParquetFileWriter instance.

@nealrichardson (Member) left a review comment:

Some notes on the docs

On r/R/parquet.R:
#' @param compression compression specification. Possible values:
#' - a single string: uses that compression algorithm for all columns
#' - an unnamed string vector: specify a compression algorithm for each, same order as the columns
#' - a named string vector: specify compression algorithm individually
@nealrichardson (Member):
And not all columns need to be specified, correct?

@romainfrancois (Contributor, Author):
Updated.

On r/R/parquet.R:
#' - an unnamed string vector: specify a compression algorithm for each, same order as the columns
#' - a named string vector: specify compression algorithm individually
#' @param compression_level compression level. A single integer, a named integer vector
#' or an unnamed integer vector of the same size as the number of columns of `table`
@nealrichardson (Member):
Does this follow the same conventions as compression? Maybe there should be a paragraph/section in the docs that explains how these parameters work since it's the same/similar.

@romainfrancois (Contributor, Author):
I've refactored the documentation into @details.

On r/R/parquet.R:
#' @param compression_level compression level. A single integer, a named integer vector
#' or an unnamed integer vector of the same size as the number of columns of `table`
#' @param use_dictionary Specify if we should use dictionary encoding.
#' @param write_statistics Specify if we should write statistics
@nealrichardson (Member):
Same, and what are statistics?

@romainfrancois (Contributor, Author):
I don't know

@romainfrancois force-pushed the ARROW-6532/write_parquet_compression branch from 354d263 to 50555f8 on September 27, 2019
@nealrichardson (Member) left a review comment:

One final style question, but otherwise LGTM; happy to merge today regardless of where we land on the whitespace question.

On r/R/parquet.R:

  col_select = NULL,
  as_data_frame = TRUE,
  props = ParquetReaderProperties$create(),
  ...) {
@nealrichardson (Member):
This is "bad", according to the tidyverse style guide, which I believed we were trying to follow: https://style.tidyverse.org/functions.html#long-lines-1

I can get used to whatever style conventions we decide, just want to make sure we're in agreement.

@romainfrancois (Contributor, Author):
I'll set up my RStudio to obey the style; perhaps we should run styler:: once in a while to do that automatically.

@wesm (Member) commented Sep 27, 2019:

Can you update the PR description to reflect what is actually in the PR? (Writing a Parquet file into a CompressedOutputStream isn't advisable: you would have to decompress the entire file first to be able to read any part of it.)

@romainfrancois changed the title from "ARROW-6532 [R] Write parquet files with compression" to "ARROW-6532 [R] write_parquet() uses writer properties (general and arrow specific)" on Sep 27, 2019
nealrichardson pushed a commit that referenced this pull request Jan 8, 2020
The ability to preserve categorical values was introduced in #5077 as the convention of storing a special `ARROW:schema` key in the metadata. To invoke this, we need to call `ArrowWriterProperties::store_schema()`.

The R binding is already prepared for this, but calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. Though I don't see the motivation for implementing it that way in #5451, considering that [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>