Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve checking of OCF file name characters #1408

Merged
merged 2 commits into from
Dec 6, 2022

Conversation

rdeltour
Copy link
Member

@rdeltour rdeltour commented Dec 2, 2022

  • significantly refactored OCFFilenameChecker:
    • now based on ICU4J for more up-to-date Unicode support
    • iterate over the input string codepoints (int) instead of code
      units (char) for better support of charactres outside the BMP.
    • rely on ICU UnicodeSet to define the set of forbidden characters
    • improve the message reported for forbidden characters (notably for
      non-printable characters)
    • the class now implements the Checker interface
  • add unit tests for the OCFFilenameChecker only in a new feature
    file filename-checker.feature, located along with the EPUB 3 OCF
    functional tests, for easier lookup.
  • add a few functional tests, to make sure OCFFilenameChecker is
    correctly called when checking full publications or single package
    documents.
  • update ICU4J to v72.1

Fixes #1240, fixes #1292, fixes #1302

- significantly refactored `OCFFilenameChecker`:
  - now based on ICU4J for more up-to-date Unicode support
  - iterate over the input string codepoints (`int`) instead of code
    units (`char`) for better support of charactres outside the BMP.
  - rely on ICU UnicodeSet to define the set of forbidden characters
  - improve the message reported for forbidden characters (notably for
    non-printable characters)
  - the class now implements the `Checker` interface
- add unit tests for the `OCFFilenameChecker` only in a new feature
  file `filename-checker.feature`, located along with the EPUB 3 OCF
  functional tests, for easier lookup.
- add a few functional tests, to make sure `OCFFilenameChecker` is
  correctly called when checking full publications or single package
  documents.
- update ICU4J to v72.1

Fixes #1240, fixes #1292, fixes #1302
@rdeltour rdeltour added this to the v5.0.0-beta milestone Dec 2, 2022
@rdeltour rdeltour self-assigned this Dec 2, 2022
@mattgarrish
Copy link
Member

Hm, I can't even check out this branch because of the file names:

git.exe checkout -b feat/ocf-filename-character-check remotes/origin/feat/ocf-filename-character-check --
error: invalid path 'src/test/resources/epub3/04-ocf/files/ocf-filename-character-forbidden-error/EPUB/content_|.xhtml'
error: invalid path 'src/test/resources/epub3/04-ocf/files/ocf-filename-character-forbidden-non-publication-resource-error/META-INF/file_<>.txt'

git did not exit cleanly (exit code 1) (125 ms @ 2022-12-02 10:43:55 AM)

Is this going to make the repository unusable on windows if it gets merged?

@rdeltour
Copy link
Member Author

rdeltour commented Dec 2, 2022

Is this going to make the repository unusable on windows if it gets merged?

ou damn. Are you able to checkout the repo in subsequent commits?
If not, I’ll have to rewrite history and move that test to a packaged EPUB

@mattgarrish
Copy link
Member

Are you able to checkout the repo in subsequent commits?

No, I kept getting the same error messages about the files in the one I tried.

@rdeltour
Copy link
Member Author

rdeltour commented Dec 2, 2022

Can you retry checking out after setting:

git config core.protectNTFS false

I'd like to avoid rewriting the history of released code if possible, even if they are only previews 🤞

@mattgarrish
Copy link
Member

That lets me check out the branch, yes. git still can't create the files so I can't build epubcheck, but if you plan to fix that in a future update it should be fine.

@rdeltour
Copy link
Member Author

rdeltour commented Dec 2, 2022

That lets me check out the branch, yes.

good.

git still can't create the files so I can't build epubcheck, but if you plan to fix that in a future update it should be fine.

If it's fine with you, temporarily, then yes I'd prefer to fix that in a future update.

In the mean time, you should be able to build if you either:

  • skip all the tests with mvn clean package -DskipTests
  • skip only the problematic tests by annotating them with an @ignore tag in the feature file, and run mvn clean package -DCucumber.options="--tags 'not @ignore'"

Regarding the fix, I'm wondering if packaging the EPUB will be enough, since EPUBCheck unpacks it at runtime, and Windows/NTFS might complain… fun times ahead 😊

@mattgarrish
Copy link
Member

I'm wondering if packaging the EPUB will be enough, since EPUBCheck unpacks it at runtime, and Windows/NTFS might complain…

Ya, that's a recipe for disaster!

Do you need to test the physical files? I see there's a similar ocf-filename-character-forbidden-error.opf file performing the same check. Maybe that's enough in this case.

@rdeltour
Copy link
Member Author

rdeltour commented Dec 2, 2022

Do you need to test the physical files? I see there's a similar ocf-filename-character-forbidden-error.opf file performing the same check. Maybe that's enough in this case.

yes, it's not the same code branch:

  • when checking a full publication, the file names are checked when the OCF container is parsed (from zip or directory)
  • when checking a single package-document, we fall back to checking the file names appearing in the package doc.

Doing only the latter is not optimal since files can be in the container without being listed in the package document.

So the tests ideally need to cover both cases. I'll try to find a workaround. If I can't find any, it's not that big a deal to prune the container scenarios, since the file name checker is unit-tested anyway.

@rdeltour
Copy link
Member Author

rdeltour commented Dec 5, 2022

I opened issue #1420 and PR #1421 to fix the test suite. @mattgarrish can you approve this one once you confirm #1421 fixes the checkout+testing issue?

Base automatically changed from refactor/item-location-info to release/v5.0.0 December 5, 2022 13:10
@rdeltour rdeltour merged commit bfd30b5 into release/v5.0.0 Dec 6, 2022
@rdeltour rdeltour deleted the feat/ocf-filename-character-check branch December 6, 2022 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants