Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Csv Reader Allow Use of mimetype=text/html Files Without Extension #4040

Merged
merged 3 commits into from
May 25, 2024

Conversation

oleibman
Copy link
Collaborator

Fix #4036. The issue was originally reported as #564 (and #811) and fixed for the most part, but this is a variation that was not covered by the original. Cells with html fragments can cause mime_content_type to identify the file as text\html. Original fix was to ignore mime_content_type when file extension is 'csv' or 'tsv'. However, if the file does not have one of those extensions, it will be rejected by Csv Reader as invalid mimetype. This PR adds text\html to the list of valid mimetypes.

I imagine that this type of problem might occur for other mimetypes. If any of those are reported in future, it might be better to just add a "suppress mimetype" check option, rather than extending the list forever. Html is unusual in that its rules are so lax, which is why it seems appropriate to add it here.

Note that IOFactory may still identify a file as Html even when intended as Csv. The sample associated with this issue does not fall into this category, but one of the unit tests on this ticket does. The file will still be read correctly by Csv Reader, but IOFactory load may cause it to use Html Reader instead.

This is:

  • a bugfix
  • a new feature
  • refactoring
  • additional unit tests

Checklist:

  • Changes are covered by unit tests
    • Changes are covered by existing unit tests
    • New unit tests have been added
  • Code style is respected
  • Commit message explains why the change is made (see https://github.com/erlang/otp/wiki/Writing-good-commit-messages)
  • CHANGELOG.md contains a short summary of the change and a link to the pull request if applicable
  • Documentation is updated as necessary

Why this change is needed?

Provide an explanation of why this change is needed, with links to any Issues (if appropriate).
If this is a bugfix or a new feature, and there are no existing Issues, then please also create an issue that will make it easier to track progress with this PR.

Fix PHPOffice#4036. The issue was originally reported as PHPOffice#564 (and PHPOffice#811) and fixed for the most part, but this is a variation that was not covered by the original. Cells with html fragments can cause `mime_content_type` to identify the file as `text\html`. Original fix was to ignore mime_content_type when file extension is 'csv' or 'tsv'. However, if the file does not have one of those extensions, it will be rejected by Csv Reader as invalid mimetype. This PR adds text\html to the list of valid mimetypes.

I imagine that this type of problem might occur for other mimetypes. If any of those are reported in future, it might be better to just add a "suppress mimetype" check option, rather than extending the list forever. Html is unusual in that its rules are so lax, which is why it seems appropriate to add it here.

Note that IOFactory may still identify a file as Html even when intended as Csv. The sample associated with this issue does not fall into this category, but one of the unit tests on this ticket does. The file will still be read correctly by Csv Reader, but IOFactory load may cause it to use Html Reader instead.
@oleibman
Copy link
Collaborator Author

Pretty funny result with Php nightly. mime_content_type is changed to recognize the files as text/csv after all, which means that this problem would have gone away on its own eventually. Of course, that'll be several years away, so we need to proceed with this change, but I have to tweak the tests.

oleibman added 2 commits May 24, 2024 21:00
mime_content_type will recognize files as text/csv rather than text/html.
@oleibman oleibman added this pull request to the merge queue May 25, 2024
Merged via the queue into PHPOffice:master with commit 7b25b18 May 25, 2024
12 of 13 checks passed
@oleibman oleibman deleted the issue4036 branch May 25, 2024 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Cannot read csv files that contain '<a href=' and do not have the extension ‘.csv’ ‘.tsv’.
1 participant