-
Notifications
You must be signed in to change notification settings - Fork 807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add ndjson support #3845
feat: add ndjson support #3845
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few remarks.
) | ||
|
||
assert len(chunks) == 9 | ||
assert all(isinstance(ch, CompositeElement) for ch in chunks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the general case, chunks can be CompositeElement
, Table
, or TableElement
. Not sure if there's a later case that exercises table behaviors but probably a good idea to have at least one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied all these tests verbatim from the json tests but updated it to use the ndjson support.
@@ -90,6 +90,7 @@ def test_it_detects_correct_file_type_for_CFB_and_ZIP_subtypes_detected_by_direc | |||
(FileType.WAV, "CantinaBand3.wav", "audio/wav"), | |||
(FileType.XML, "factbook.xml", "application/xml"), | |||
(FileType.ZIP, "simple.zip", "application/zip"), | |||
(FileType.NDJSON, "spring-weather.html.ndjson", "application/x-ndjson"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a couple special-case tests below in the special-cases section, not so much for present purposes as to defend against later adjustments/rearrangements of ordering.
Cases I would try:
- correctly identifies when no filename (so no extension) but correct content-type is specified.
- correct when correct extension is specified but wrong content-type (especially
application/json
) - correct when content-type is right but extension is
.json
.
Not sure if all those can be made to work without creating a tangle, but worth a try I expect, should only take a few minutes to give those a whirl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to keep the changes in this PR isolated to only the ndjson support. I don't think the special cases you're asking about would really fall under that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests I'm proposing are all specifically for NDJSON, like the first would be that an NDJSON file is correctly identified as FileType.NDJSON
when no filename is provided but application/x-ndjson
content-type is provided.
The ordering of filetype-detection tests is subject to change as special cases arise, so it's important to have all the cases defended against someone inadvertently breaking your code by another change to say get CSV properly identified in some special case.
This is the kind of test I'm talking about:
def test_it_identifies_NDJSON_for_file_like_object_with_no_name_but_NDJSON_content_type():
with open(example_doc_path("simple.ndjson"), "rb") as f:
file = io.BytesIO(f.read())
assert detect_filetype(file=file, content_type=FileType.NDJSON.mime_type) == FileType.NDJSON
def test_it_identifies_NDJSON_for_file_with_ndjson_extension_but_JSON_content_type():
file_path = example_doc_path("simple.ndjson")
assert detect_filetype(file_path, content_type=FileType.JSON.mime_type) == FileType.NDJSON
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_it_identifies_NDJSON_for_file_with_ndjson_extension_but_JSON_content_type
fails at the moment which might be a good call out for improvement but I won't include that in this PR.
@@ -152,6 +154,29 @@ def elements_to_json( | |||
return json_str | |||
|
|||
|
|||
def elements_to_ndjson( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've always disliked the elements_to_json()
function because it tries to do two jobs and it makes the calling/typing ugly.
I'd make this two functions, elements_to_ndjson(elements) -> str
and `elements_to_ndjson_file(elements, filename, encoding) -> None.
And feel free to leave one of those out if you don't have a current need for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not want to keep with the same pattern? I would think we'd want to split elements_to_json()
then as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after discussed items :)
8e30226
to
d747e2f
Compare
Description
Add ndjson file type support and treat is the same as json files.