fix: fsspec connectors returning data source version as integer (#2427)
Connector data source versions should always be string values. However,
the fsspec connectors were using the integer checksum value as the
version. This change casts that value to a string.

## Changes

* Cast the checksum value to a string when assigning the version value
for fsspec connectors.
* Add a test to validate that these connectors assign a string value
when an integer checksum is fetched.
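
The cast described above can be illustrated with a minimal, standalone sketch. `SourceMetadataSketch` and `build_metadata` are hypothetical names used only for illustration and are not part of the library; the sketch assumes only that a backend may report its checksum as either an integer or a string.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SourceMetadataSketch:
    """Hypothetical stand-in for the connector's metadata record."""

    version: Optional[str] = None


def build_metadata(checksum: Any) -> SourceMetadataSketch:
    # A backend may report its checksum as an int or a str; casting here
    # means downstream code always sees a string version.
    return SourceMetadataSketch(version=str(checksum))


int_based = build_metadata(1234567890)
str_based = build_metadata("abc123")
assert int_based.version == "1234567890"
assert str_based.version == "abc123"
```

Casting at the single point where the metadata is built keeps the string guarantee in one place, rather than relying on each filesystem backend to return a particular type.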

## Testing

Unit test added.
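
The added test stubs the filesystem with a mock whose checksum is an integer. The same technique can be sketched without the library, using only `unittest.mock`. `FileSystemSketch` and `version_for` are hypothetical names introduced for this illustration.

```python
from unittest.mock import MagicMock


class FileSystemSketch:
    """Hypothetical, minimal filesystem interface used only as a mock spec."""

    def checksum(self, path: str):
        raise NotImplementedError


def version_for(fs, path: str) -> str:
    # Code under test: always hand back a string version.
    return str(fs.checksum(path))


# spec= restricts the mock to attributes that exist on FileSystemSketch.
mock_fs = MagicMock(spec=FileSystemSketch)
mock_fs.checksum.return_value = 1234567890  # simulate an integer checksum

version = version_for(mock_fs, "test.txt")
assert isinstance(version, str)
mock_fs.checksum.assert_called_once_with("test.txt")
```

Passing `spec=` makes the mock raise `AttributeError` for attributes the real interface lacks, which helps catch typos in the code under test.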
ryannikolaidis authored Jan 19, 2024
1 parent 7378a37 commit 2e97494
Showing 3 changed files with 27 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -31,6 +31,7 @@
* **Fix documentation and sample code for Chroma.** Was pointing to wrong examples..
* **Fix flatten_dict to be able to flatten tuples inside dicts** Update flatten_dict function to support flattening tuples inside dicts. This is necessary for objects like Coordinates, when the object is not written to the disk, therefore not being converted to a list before getting flattened (still being a tuple).
* **Fix the serialization of the Chroma destination connector.** Presence of the ChromaCollection object breaks serialization due to TypeError: cannot pickle 'module' object. This removes that object before serialization.
* **Fix fsspec connectors returning version as integer.** Connector data source versions should always be string values, however we were using the integer checksum value for the version for fsspec connectors. This casts that value to a string.

## 0.12.0

25 changes: 25 additions & 0 deletions test_unstructured_ingest/unit/test_fsspec.py
@@ -0,0 +1,25 @@
from unittest.mock import MagicMock, patch

from fsspec import AbstractFileSystem

from unstructured.ingest.connector.fsspec.fsspec import FsspecIngestDoc, SimpleFsspecConfig
from unstructured.ingest.interfaces import ProcessorConfig, ReadConfig


@patch("fsspec.get_filesystem_class")
def test_version_is_string(mock_get_filesystem_class):
    """
    Test that the version is a string even when the filesystem checksum is an integer.
    """
    mock_fs = MagicMock(spec=AbstractFileSystem)
    mock_fs.checksum.return_value = 1234567890
    mock_fs.info.return_value = {"etag": ""}
    mock_get_filesystem_class.return_value = lambda **kwargs: mock_fs
    config = SimpleFsspecConfig("s3://my-bucket", access_config={})
    doc = FsspecIngestDoc(
        processor_config=ProcessorConfig(),
        read_config=ReadConfig(),
        connector_config=config,
        remote_file_path="test.txt",
    )
    assert isinstance(doc.source_metadata.version, str)
2 changes: 1 addition & 1 deletion unstructured/ingest/connector/fsspec/fsspec.py
@@ -134,7 +134,7 @@ def update_source_metadata(self):
         self.source_metadata = SourceMetadata(
             date_created=date_created,
             date_modified=date_modified,
-            version=version,
+            version=str(version),
             source_url=f"{self.connector_config.protocol}://{self.remote_file_path}",
             exists=file_exists,
         )