GH-39780: [Python][Parquet] Support hashing for FileMetaData and ParquetSchema (apache#39781)

I think the hash, especially for `FileMetaData`, could be better; maybe just use the return value of `__repr__`, even though that won't include row group info?
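For illustration, that alternative would look roughly like the following sketch (hypothetical; not what this PR implements):

```python
def __hash__(self):
    # Hypothetical alternative: hash the repr, which already summarizes
    # the metadata but omits per-row-group information.
    return hash(repr(self))
```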

### Rationale for this change

Helpful for dependent projects: hashable metadata objects can serve as `dict` keys or `set` members (e.g. for caching or deduplication).

### What changes are included in this PR?

Implement `__hash__` for `ParquetSchema` and `FileMetaData`.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Supports hashing metadata:

```python
In [1]: import pyarrow.parquet as pq

In [2]: f = pq.ParquetFile('test.parquet')

In [3]: hash(f.metadata)
Out[3]: 4816453453708427907

In [4]: hash(f.metadata.schema)
Out[4]: 2300988959078172540
```
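
Because both objects are hashable (and already define equality via their `equals` methods), they can be used as `set` members or `dict` keys. Continuing the session above with the same `test.parquet`:

```python
In [5]: seen = {f.metadata.schema}

In [6]: f.metadata.schema in seen
Out[6]: True
```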
* Closes: apache#39780

Authored-by: Miles Granger <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
milesgranger authored and dgreiss committed Feb 17, 2024
1 parent 55d4d8b commit 5f4e012
Showing 2 changed files with 36 additions and 0 deletions.
`python/pyarrow/_parquet.pyx` — 10 additions, 0 deletions
```diff
@@ -849,6 +849,13 @@ cdef class FileMetaData(_Weakrefable):
         cdef Buffer buffer = sink.getvalue()
         return _reconstruct_filemetadata, (buffer,)
 
+    def __hash__(self):
+        return hash((self.schema,
+                     self.num_rows,
+                     self.num_row_groups,
+                     self.format_version,
+                     self.serialized_size))
+
     def __repr__(self):
         return """{0}
   created_by: {1}
@@ -1071,6 +1078,9 @@ cdef class ParquetSchema(_Weakrefable):
     def __getitem__(self, i):
         return self.column(i)
 
+    def __hash__(self):
+        return hash(self.schema.ToString())
+
     @property
     def names(self):
         """Name of each field (list of str)."""
```
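
In plain Python terms, the `FileMetaData` hash added above is a tuple hash over a few summary properties, all of which are public. An illustrative check via the public API (assuming a local `test.parquet`):

```python
import pyarrow.parquet as pq

md = pq.read_metadata('test.parquet')

# Same tuple of public attributes that __hash__ combines internally
assert hash(md) == hash((md.schema, md.num_rows, md.num_row_groups,
                         md.format_version, md.serialized_size))
```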
`python/pyarrow/tests/parquet/test_metadata.py` — 26 additions, 0 deletions
```diff
@@ -499,6 +499,32 @@ def test_multi_dataset_metadata(tempdir):
     assert md['serialized_size'] > 0
 
 
+def test_metadata_hashing(tempdir):
+    path1 = str(tempdir / "metadata1")
+    schema1 = pa.schema([("a", "int64"), ("b", "float64")])
+    pq.write_metadata(schema1, path1)
+    parquet_meta1 = pq.read_metadata(path1)
+
+    # Same as 1, just different path
+    path2 = str(tempdir / "metadata2")
+    schema2 = pa.schema([("a", "int64"), ("b", "float64")])
+    pq.write_metadata(schema2, path2)
+    parquet_meta2 = pq.read_metadata(path2)
+
+    # different schema
+    path3 = str(tempdir / "metadata3")
+    schema3 = pa.schema([("a", "int64"), ("b", "float32")])
+    pq.write_metadata(schema3, path3)
+    parquet_meta3 = pq.read_metadata(path3)
+
+    # Deterministic
+    assert hash(parquet_meta1) == hash(parquet_meta1)  # equal w/ same instance
+    assert hash(parquet_meta1) == hash(parquet_meta2)  # equal w/ different instance
+
+    # Not the same as other metadata with different schema
+    assert hash(parquet_meta1) != hash(parquet_meta3)
+
+
 @pytest.mark.filterwarnings("ignore:Parquet format:FutureWarning")
 def test_write_metadata(tempdir):
     path = str(tempdir / "metadata")
```
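
One way a dependent project might use this: memoize per-file work keyed on a file's metadata. A sketch only — `describe` and its cached payload are hypothetical, not part of pyarrow:

```python
import pyarrow.parquet as pq

_cache = {}

def describe(path):
    # FileMetaData is hashable, so it can key a dict; dict lookups
    # also rely on its equality (equals) semantics.
    md = pq.read_metadata(path)
    if md not in _cache:
        _cache[md] = {"rows": md.num_rows, "row_groups": md.num_row_groups}
    return _cache[md]
```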
