This project provides a library that reads Parquet files into Java objects.
Add this library as a dependency in your project's pom.xml file:

    <dependencies>
        <dependency>
            <groupId>com.exasol</groupId>
            <artifactId>parquet-io-java</artifactId>
            <version>LATEST VERSION</version>
        </dependency>
    </dependencies>

Replace LATEST VERSION with the latest released version of the library.
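The following short example shows how to use the library:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    // Row and RowParquetReader are provided by this library.

    final Path path = new Path("/data/parquet/part-0000.parquet");
    final Configuration conf = new Configuration();
    // try-with-resources closes the reader even if reading fails
    try (final ParquetReader<Row> reader = RowParquetReader
            .builder(HadoopInputFile.fromPath(path, conf)).build()) {
        Row row = reader.read();
        // read() returns null once the end of the file is reached
        while (row != null) {
            List<Object> values = row.getValues();
            System.out.println(values);
            row = reader.read();
        }
    } catch (final IOException exception) {
        // Handle or rethrow the failure; it is ignored here only to keep the example short.
    }

The following table shows how each Parquet data type is mapped to a Java data type.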
| Parquet Data Type | Parquet Logical Type | Java Data Type |
|---|---|---|
| boolean | | Boolean |
| int32 | | Integer |
| int32 | date | Date |
| int32 | decimal(p, s) | BigDecimal |
| int64 | | Long |
| int64 | timestamp_millis | Timestamp |
| int64 | timestamp_micros | Timestamp |
| int64 | decimal(p, s) | BigDecimal |
| float | | Float |
| double | | Double |
| binary | | String |
| binary | utf8 | String |
| binary | decimal(p, s) | BigDecimal |
| fixed_len_byte_array | | String |
| fixed_len_byte_array | decimal(p, s) | BigDecimal |
| fixed_len_byte_array | uuid | UUID |
| int96 | | Timestamp |
| group | | Map |
| group | LIST | List |
| group | MAP | Map |
| group | REPEATED | List |
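
Because row values arrive as plain Object instances, calling code usually has to check the Java type before using a value. The following is a minimal sketch of such a check. ValueDescriber and describe are hypothetical names and not part of the library. It assumes that Date and Timestamp in the table refer to java.sql.Date and java.sql.Timestamp, and that an unset optional field is returned as null; the table states neither.

```java
import java.math.BigDecimal;
import java.sql.Date;
import java.sql.Timestamp;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper, not part of parquet-io-java.
final class ValueDescriber {

    private ValueDescriber() {
        // static helper only
    }

    // Returns the simple name of the mapped Java type for one row value.
    static String describe(final Object value) {
        if (value == null) {
            return "null"; // assumption: unset optional fields are returned as null
        }
        if (value instanceof Boolean) return "Boolean";
        if (value instanceof Integer) return "Integer";
        if (value instanceof Long) return "Long";
        if (value instanceof Float) return "Float";
        if (value instanceof Double) return "Double";
        if (value instanceof BigDecimal) return "BigDecimal";
        if (value instanceof String) return "String";
        if (value instanceof UUID) return "UUID";
        // assumption: java.sql.Timestamp and java.sql.Date
        if (value instanceof Timestamp) return "Timestamp";
        if (value instanceof Date) return "Date";
        if (value instanceof List) return "List";
        if (value instanceof Map) return "Map";
        // Anything not covered by the mapping table
        return value.getClass().getName();
    }
}
```

Such a helper could be applied to each element of row.getValues() to inspect which Java types a given file produces.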
A Parquet schema can mark a single field or a group of fields as repeated.
parquet-io-java (PIOJ) reads such repeated types into a Java List.
For example, given the following Parquet schemas:

    message parquet_schema {
        repeated binary name (UTF8);
    }

    message parquet_schema {
        repeated group person {
            required binary name (UTF8);
        }
    }

PIOJ reads both of these Parquet types into a Java list such as ["John", "Jane"].
A repeated group with multiple fields, on the other hand, is read as a list of maps.

    message parquet_schema {
        repeated group person {
            required binary name (UTF8);
            optional int32 age;
        }
    }

PIOJ reads it into a list of person maps:

    [ Map("name" -> "John", "age" -> 24), Map("name" -> "Jane", "age" -> 22) ]
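
Once such a value has been cast to a list of maps, it can be processed with ordinary collection code. The sketch below illustrates this under stated assumptions: PersonList and format are hypothetical names, the maps are assumed to be keyed by the field names from the schema, and the optional age field is assumed to be missing or null when unset. The cast from the Object in row.getValues() to List<Map<String, Object>> is unchecked and left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of parquet-io-java.
final class PersonList {

    private PersonList() {
        // static helper only
    }

    // Formats each person map as "name (age)", or as "name" alone when age is unset.
    static List<String> format(final List<Map<String, Object>> persons) {
        final List<String> result = new ArrayList<>();
        for (final Map<String, Object> person : persons) {
            final Object name = person.get("name");
            // assumption: the optional field is missing or null when unset
            final Object age = person.get("age");
            result.add(age == null ? String.valueOf(name) : name + " (" + age + ")");
        }
        return result;
    }
}
```

For the two person maps shown above, format would return the entries John (24) and Jane (22).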