Commit

Incorporate Wes's revisions
nealrichardson committed Jul 31, 2019
1 parent ddb1857 commit c5dd6fa
Showing 3 changed files with 18 additions and 11 deletions.
2 changes: 1 addition & 1 deletion r/README.Rmd
@@ -30,7 +30,7 @@ Install the latest release of `arrow` from CRAN with
install.packages("arrow")
```

- On macOS and Windows, installing a binary package from CRAN will handle Arrow's C++ dependencies for you. On Linux, you'll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) for a list of PPAs from which you can obtain it.
+ On macOS and Windows, installing a binary package from CRAN will handle Arrow's C++ dependencies for you. On Linux, you'll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, such as Debian, Ubuntu, CentOS, and Fedora. Other Linux distributions must install the C++ library from source.

If you install the `arrow` package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call

7 changes: 4 additions & 3 deletions r/README.md
@@ -30,8 +30,9 @@ install.packages("arrow")
On macOS and Windows, installing a binary package from CRAN will handle
Arrow’s C++ dependencies for you. On Linux, you’ll need to first install
the C++ library. See the [Arrow project installation
- page](https://arrow.apache.org/install/) for a list of PPAs from which
- you can obtain it.
+ page](https://arrow.apache.org/install/) to find pre-compiled binary packages
+ for some common Linux distributions, such as Debian, Ubuntu, CentOS, and
+ Fedora. Other Linux distributions must install the C++ library from source.

If you install the `arrow` package from source and the C++ library is
not found, the R package functions will notify you that Arrow is not
@@ -57,7 +58,7 @@ set.seed(24)

tab <- arrow::table(x = 1:10, y = rnorm(10))
tab$schema
- #> arrow::Schema
+ #> arrow::Schema
#> x: int32
#> y: double
tab
@@ -24,7 +24,7 @@ limitations under the License.
{% endcomment %}
-->

- We are very excited to announce that the `arrow` R package is now available on CRAN.
+ We are very excited to announce that the `arrow` R package is now available on [CRAN](https://cran.r-project.org/).

[Apache Arrow](https://arrow.apache.org/) is a cross-language development platform for in-memory data that specifies a standardized columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. The `arrow` package provides an R interface to the Arrow C++ library, including support for working with Parquet and Feather files, as well as lower-level access to Arrow memory and messages.

@@ -34,7 +34,9 @@ You can install the package from CRAN with
install.packages("arrow")
```

- On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) for a list of PPAs from which you can obtain it. If you install the `arrow` package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call
+ On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, such as Debian, Ubuntu, CentOS, and Fedora. Other Linux distributions must install the C++ library from source.
+
+ If you install the `arrow` R package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call

```r
arrow::install_arrow()
@@ -45,7 +47,7 @@ library.

## Parquet files

- This release introduces read and write support for the [Parquet](https://parquet.apache.org/) columnar data file format. Prior to this release, options for accessing Parquet data in R were limited; the most common recommendation was to use Spark. The `arrow` package greatly simplifies this access and lets you go from a Parquet file to a `data.frame` and back easily, without having to set up a database.
+ This release introduces basic read and write support for the [Apache Parquet](https://parquet.apache.org/) columnar data file format. Prior to this release, options for accessing Parquet data in R were limited; the most common recommendation was to use Apache Spark. The `arrow` package greatly simplifies this access and lets you go from a Parquet file to a `data.frame` and back easily, without having to set up a database.

```r
library(arrow)
@@ -66,21 +68,25 @@ Just as you can read, you can write Parquet files:
write_parquet(df, "path/to/different_file.parquet")
```

+ Note that this read and write support for Parquet files in R is in its early stages of development. The Python Arrow library ([pyarrow](https://arrow.apache.org/docs/python/)) still has much richer support for Parquet files, including working with multi-file datasets. In the coming months, we hope to bring the R package towards feature equivalency.
+
## Feather files

- This release also includes full support for the Feather file format, providing `read_feather()` and `write_feather()`. [Feather](https://github.com/wesm/feather) was one of the initial products coming out of the Arrow project, providing an efficient, common file format for language-agnostic data frame storage, along with implementations in R and Python.
+ This release also includes a much faster and more robust implementation of the Feather file format, providing `read_feather()` and `write_feather()`. [Feather](https://github.com/wesm/feather) was one of the initial applications of Apache Arrow for Python and R, providing an efficient, common file format for language-agnostic data frame storage, along with implementations in R and Python.

As Arrow progressed, development of Feather moved to the [`apache/arrow`](https://github.com/apache/arrow) project, and for the last two years, the Python implementation of Feather has just been a wrapper around `pyarrow`. This meant that as Arrow progressed and bugs were fixed, the Python version of Feather got the improvements but sadly R did not.

- With this release, the R implementation of Feather catches up and now depends on the same underlying C++ library as the Python version does. This should result in more reliable and consistent behavior across the two languages.
+ With this release, the R implementation of Feather catches up and now depends on the same underlying C++ library as the Python version does. This should result in more reliable and consistent behavior across the two languages, as well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/).

We encourage all R users of `feather` to switch to using `arrow::read_feather()` and `arrow::write_feather()`.

- Note that both Feather and Parquet are columnar data formats that allow sharing data frames across R, Pandas, and other tools. When should you use Feather and when should you use Parquet? We currently recommend Parquet for long-term storage, as well as for cases where the size on disk matters because Parquet supports various compression formats. Feather, on the other hand, may be faster to read in because it matches the in-memory format and doesn't require deserialization, and it also allows for memory mapping so that you can access data that is larger than can fit into memory. See the [Arrow project FAQ](https://arrow.apache.org/faq/) for more.
+ Note that both Feather and Parquet are columnar data formats that allow sharing data frames across R, Pandas, and other tools. When should you use Feather and when should you use Parquet? Parquet is optimized to create small files and as a result can be more expensive to read locally, but it performs very well with remote storage like HDFS or Amazon S3. Feather is designed for fast local reads, particularly with solid-state drives, and is not intended for use with remote storage systems. Feather files can be memory-mapped and read in Arrow format without any deserialization while Parquet files always must be decompressed and decoded. See the [Arrow project FAQ](https://arrow.apache.org/faq/) for more.
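The Feather-versus-Parquet tradeoff described above can be illustrated by round-tripping the same data frame through both formats. This is a minimal sketch, not part of the commit: it assumes the `arrow` package is installed, and the temporary file paths are hypothetical.

```r
library(arrow)

df <- data.frame(x = 1:5, y = c(1.1, 2.2, 3.3, 4.4, 5.5))

# Parquet: compressed and compact on disk, suited to long-term
# storage and remote filesystems such as HDFS or Amazon S3
pq <- tempfile(fileext = ".parquet")
write_parquet(df, pq)
df_pq <- read_parquet(pq)

# Feather: matches the Arrow in-memory layout, so local reads skip
# decompression and decoding, and files can be memory-mapped
ft <- tempfile(fileext = ".feather")
write_feather(df, ft)
df_ft <- read_feather(ft)

# Both formats round-trip the data
all.equal(as.data.frame(df_pq), df)
all.equal(as.data.frame(df_ft), df)
```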

## Other capabilities

- In addition to these readers and writers, the `arrow` package has wrappers for other readers in the C++ library; see `?read_csv_arrow` and `?read_json_arrow`. It also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the [`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
+ In addition to these readers and writers, the `arrow` package has wrappers for other readers in the C++ library; see `?read_csv_arrow` and `?read_json_arrow`. These readers are being developed to optimize for the memory layout of the Arrow columnar format and are not intended as a direct replacement for existing R CSV readers (`base::read.csv`, `readr::read_csv`, `data.table::fread`) that return an R `data.frame`.
+
+ It also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the [`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).
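A small sketch of the Arrow CSV reader mentioned above, not part of the commit. It assumes the `arrow` package is installed; in early releases `read_csv_arrow()` may return an Arrow table rather than a `data.frame`, so the example coerces the result defensively.

```r
library(arrow)

# Write a small CSV, then parse it with the Arrow C++ CSV reader
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), csv, row.names = FALSE)

tab <- read_csv_arrow(csv)

# Coerce to a data.frame in case an Arrow table was returned
df <- as.data.frame(tab)
df$x
```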

## Acknowledgements

