Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ndjson support #3845

Merged
merged 7 commits into from
Dec 19, 2024
Merged

feat: add ndjson support #3845

merged 7 commits into from
Dec 19, 2024

Conversation

rbiseck3
Copy link
Contributor

Description

Add ndjson file type support and treat is the same as json files.

@rbiseck3 rbiseck3 requested a review from scanny December 18, 2024 16:46
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few remarks.

test_unstructured/partition/test_ndjson.py Outdated Show resolved Hide resolved
)

assert len(chunks) == 9
assert all(isinstance(ch, CompositeElement) for ch in chunks)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the general case, chunks can be CompositeElement, Table, or TableElement. Not sure if there's a later case that exercises table behaviors but probably a good idea to have at least one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied all these tests verbatim from the json tests but updated it to use the ndjson support.

@@ -90,6 +90,7 @@ def test_it_detects_correct_file_type_for_CFB_and_ZIP_subtypes_detected_by_direc
(FileType.WAV, "CantinaBand3.wav", "audio/wav"),
(FileType.XML, "factbook.xml", "application/xml"),
(FileType.ZIP, "simple.zip", "application/zip"),
(FileType.NDJSON, "spring-weather.html.ndjson", "application/x-ndjson"),
Copy link
Collaborator

@scanny scanny Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a couple special-case tests below in the special-cases section, not so much for present purposes as to defend against later adjustments/rearrangements of ordering.

Cases I would try:

  • correctly identifies when no filename (so no extension) but correct content-type is specified.
  • correct when correct extension is specified but wrong content-type (especially application/json)
  • correct when content-type is right but extension is .json.

Not sure if all those can be made to work without creating a tangle, but worth a try I expect, should only take a few minutes to give those a whirl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep the changes in this PR isolated to only the ndjson support. I don't think the special cases you're asking about would really fall under that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests I'm proposing are all specifically for NDJSON, like the first would be that an NDJSON file is correctly identified as FileType.NDJSON when no filename is provided but application/x-ndjson content-type is provided.

The ordering of filetype-detection tests is subject to change as special cases arise, so it's important to have all the cases defended against someone inadvertently breaking your code by another change to say get CSV properly identified in some special case.

This is the kind of test I'm talking about:

def test_it_identifies_NDJSON_for_file_like_object_with_no_name_but_NDJSON_content_type():
    with open(example_doc_path("simple.ndjson"), "rb") as f:
        file = io.BytesIO(f.read())
    assert detect_filetype(file=file, content_type=FileType.NDJSON.mime_type) == FileType.NDJSON

def test_it_identifies_NDJSON_for_file_with_ndjson_extension_but_JSON_content_type():
    file_path = example_doc_path("simple.ndjson")
    assert detect_filetype(file_path, content_type=FileType.JSON.mime_type) == FileType.NDJSON

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_it_identifies_NDJSON_for_file_with_ndjson_extension_but_JSON_content_type fails at the moment which might be a good call out for improvement but I won't include that in this PR.

@@ -152,6 +154,29 @@ def elements_to_json(
return json_str


def elements_to_ndjson(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always disliked the elements_to_json() function because it tries to do two jobs and it makes the calling/typing ugly.

I'd make this two functions, elements_to_ndjson(elements) -> str and `elements_to_ndjson_file(elements, filename, encoding) -> None.

And feel free to leave one of those out if you don't have a current need for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not want to keep with the same pattern? I would think we'd want to split elements_to_json() then as well.

@rbiseck3 rbiseck3 changed the title feat/add ndjson support feat: add ndjson support Dec 18, 2024
@scanny scanny self-requested a review December 18, 2024 21:12
Copy link
Collaborator

@scanny scanny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after discussed items :)

@rbiseck3 rbiseck3 force-pushed the roman/add-ndjson-support branch from 8e30226 to d747e2f Compare December 19, 2024 13:30
@rbiseck3 rbiseck3 added this pull request to the merge queue Dec 19, 2024
Merged via the queue into main with commit 50ea6fe Dec 19, 2024
41 checks passed
@rbiseck3 rbiseck3 deleted the roman/add-ndjson-support branch December 19, 2024 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants