layout | title | description |
---|---|---|
article | Powered by | List of projects powered by Apache Arrow |
Organizations creating products and projects for use with Apache Arrow, along with associated marketing materials, should take care to respect the trademark in "Apache Arrow" and its logo. Please refer to ASF Trademarks Guidance and associated FAQ for comprehensive and authoritative guidance on proper usage of ASF trademarks.
Names that do not include "Apache Arrow" at all have no potential trademark issue with the Apache Arrow project. This is recommended.
Names like "Apache Arrow BigCoProduct" are not OK, nor are names that include "Apache Arrow" in general. The guidance linked above, however, describes some exceptions, such as "BigCoProduct, powered by Apache Arrow" or "BigCoProduct for Apache Arrow".
It is common practice to create software identifiers (Maven coordinates, module names, etc.) like "arrow-foo". These are permitted. Nominative use of trademarks in descriptions is also always allowed, as in "BigCoProduct is a widget for Apache Arrow".
Projects and documents that want to include a logo for Apache Arrow should use the official logo.
To add yourself to the list, please open a pull request adding your organization name, URL, a list of which Arrow components you are using, and a short description of your use case.
- Apache Parquet: A columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. The C++ and Java implementations provide vectorized reads and writes to and from Arrow data structures.
- Apache Spark: Apache Spark™ is a fast and general engine for large-scale data processing. Spark uses Apache Arrow to
  - improve performance of conversion between Spark DataFrame and pandas DataFrame
  - enable a set of vectorized user-defined functions (`pandas_udf`) in PySpark.
- AWS Data Wrangler: Extends the power of the pandas library to AWS, connecting DataFrames with AWS data-related services such as Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, and Amazon QuickSight.
- Bodo: Bodo is a universal Python analytics engine that democratizes High Performance Computing (HPC) architecture for mainstream enterprises, allowing Python analytics workloads to scale efficiently. Bodo uses Arrow to support I/O for Parquet files, as well as internal support for data operations.
- Cylon: An open-source high performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. Cylon uses the Arrow memory format and exposes language bindings for C++, Java, and Python.
- Dask: Python library for parallel and distributed execution of dynamic task graphs. Dask supports using pyarrow for accessing Parquet files.
- Data Preview: Data Preview is a Visual Studio Code extension for viewing text and binary data files. Data Preview uses the Arrow JS API for loading, transforming and saving Arrow data files and schemas.
- Dremio: A self-service data platform. Dremio makes it easy for users to discover, curate, accelerate, and share data from any source. It includes a distributed SQL execution engine based on Apache Arrow. Dremio reads data from any source (RDBMS, HDFS, S3, NoSQL) into Arrow buffers, and provides fast SQL access via ODBC, JDBC, and REST for BI, Python, R, and more (all backed by Apache Arrow).
- Falcon: An interactive data exploration tool with coordinated views. Falcon loads Arrow files using the Arrow JavaScript module. Because Arrow data does not need to be parsed (unlike text-based formats such as CSV and JSON), startup cost is significantly reduced.
- FASTDATA.io: Plasma Engine (unrelated to Arrow's Plasma In-Memory Object Store) exploits the massive parallel processing power of GPUs for stream and batch processing. It supports Arrow as input and output, uses Arrow internally to maximize performance, and can be used with existing Apache Spark™ APIs.
- Fletcher: Fletcher is a framework that can integrate FPGA accelerators with tools and frameworks that use the Apache Arrow in-memory format. From a set of Arrow Schemas, Fletcher generates highly optimized hardware structures that allow accelerator kernels to read and write RecordBatches at system bandwidth through easy-to-use interfaces.
- GeoMesa: A suite of tools that enables large-scale geospatial query and analytics on distributed computing systems. GeoMesa supports query results in the Arrow IPC format, which can then be used for in-browser visualizations and/or further analytics.
- GOAI: Open GPU-Accelerated Analytics Initiative for Arrow-powered analytics across GPU tools and vendors.
- Graphistry: Supercharged Visual Investigation Platform used by teams for security, anti-fraud, and related investigations. The Graphistry team uses Arrow in its NodeJS GPU backend and client libraries, and is an early contributing member to GOAI and Arrow[JS] focused on bringing these technologies to the enterprise.
- HASH: HASH is an open-core platform for building, running, and learning from simulations, with an in-browser IDE. HASH Engine uses Apache Arrow to power the datastore for simulation state during computation, enabling zero-copy data transfer between simulation logic written across Rust, JavaScript, and Python.
- InAccel: A machine learning acceleration framework which leverages FPGAs-as-a-service. InAccel supports dataframes backed by Apache Arrow to serve as input for our implemented ML algorithms. Those dataframes can be accessed from the FPGAs with a single DMA operation by implementing a shared memory communication schema.
- libgdf: A C library of CUDA-based analytics functions and GPU IPC support for structured data. Uses the Arrow IPC format and targets the Arrow memory layout in its analytic functions. This work is part of the GPU Open Analytics Initiative.
- MATLAB: A numerical computing environment for engineers and scientists. MATLAB uses Apache Arrow to support reading and writing Parquet and Feather files.
- OmniSci (formerly MapD): In-memory columnar SQL engine designed to run on both GPUs and CPUs. OmniSci supports Arrow for data ingest and data interchange via CUDA IPC handles. This work is part of the GPU Open Analytics Initiative.
- pandas: A data analysis toolkit for Python programmers. pandas supports reading and writing Parquet files using pyarrow. Several pandas core developers are also contributors to Apache Arrow.
- Perspective: Perspective is a streaming data visualization engine in JavaScript for building real-time & user-configurable analytics entirely in the browser.
- Petastorm: Petastorm enables single machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. It can also be used from pure Python code.
- Polars: Polars is a blazingly fast DataFrame library and query engine that aims to utilize modern hardware efficiently (e.g. multi-threading, SIMD vectorization, hiding memory latencies). Polars is built upon Apache Arrow and uses its columnar memory, compute kernels, and several IO utilities. Polars is written in Rust and available in Rust and Python.
- Quilt Data: Quilt is a data package manager, designed to make managing data as easy as managing code. It supports the Parquet format via pyarrow for data access.
- Ray: A flexible, high-performance distributed execution framework with a focus on machine learning and AI applications. Uses Arrow to efficiently store Python data structures containing large arrays of numerical data. Data can be accessed with zero-copy by multiple processes using the Plasma shared memory object store, which originated in Ray and is now part of Arrow.
- Red Data Tools: A project that provides data processing tools for Ruby. It provides Red Arrow, a set of Ruby bindings for Apache Arrow based on Apache Arrow GLib, which serves as the project's core library. It also provides many Ruby libraries, built on Red Arrow, that integrate existing Ruby libraries with Apache Arrow.
- SciDB: Paradigm4's SciDB is a scalable, scientific database management system that helps researchers integrate and analyze diverse, multi-dimensional, high resolution data - like genomic, clinical, images, sensor, environmental, and IoT data - all in one analytical platform. SciDB streaming and accelerated_io_tools are powered by Apache Arrow.
- TileDB: TileDB is an open-source, cloud-optimized engine for storing and accessing dense/sparse multi-dimensional arrays and dataframes. It is an embeddable C++ library that works on Linux, macOS, and Windows and comes with numerous APIs and integrations. We use Arrow in our TileDB-VCF project for genomics to achieve zero-copy access to TileDB data from Spark and Dask.
- Turbodbc: Python module to access relational databases via the Open Database Connectivity (ODBC) interface. It provides the ability to return Arrow Tables and RecordBatches in addition to supporting the Python Database API Specification 2.0.
- Vaex: An out-of-core hybrid Apache Arrow/NumPy DataFrame library for Python, used for ML, visualization, and exploration of big tabular data at a billion rows per second.
- VAST: A network telemetry engine for data-driven security investigations. VAST uses Arrow as a standardized data plane to provide a high-bandwidth output path for downstream analytics. This makes it easy and efficient to access security data via pyarrow and other available bindings.