Improve code quality of token sequence normalization #1872
Found one thing; the rest looks good to me.
core/src/main/java/de/jplag/normalization/TokenSequenceNormalizer.java
Had some time to do some systematic testing. I will paste the results under this comment. Toggle me!
Results (figures not preserved): TicTacToe + GPT-4 Obfuscation; TicTacToe + Insertion-based Obfuscation; PROGpedia-19 + Insertion-based Obfuscation.
TL;DR: I cannot reproduce the issue at a large scale. Thus, I am removing the disabled sorting from this PR. However, I will keep the PR, as the code quality improvements to TSN are good. The title and description will be adapted.
Quality Gate passed for 'JPlag Plagiarism Detector'.
This PR improves the code quality of the token sequence normalization module:
Old Description (Outdated)
Run token sequence normalization (`--normalize`) without topological sorting whenever match merging (`--match-merging`) is enabled to prevent interference. This PR also adapts the naming of `TokenStringNormalizer` to `TokenSequenceNormalizer`, which is the terminology used in JDoc and the scientific literature. This PR also refactors the token sequence normalization to improve code quality.
Background:
Reordering tokens to normalize the token order improves detection quality slightly, but the main impact comes from dead token removal. When match merging is also enabled, it struggles to rejoin split matches because the normalization's sorting changes the token order. Thus, when using both, the topological sorting is now disabled in favor of match merging.
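To make the behavior described in this background concrete, here is a minimal sketch. It is not JPlag's actual implementation: the `Token` model, the `normalize` method, and the `matchMerging` flag are hypothetical names chosen for illustration, and the dependency information is assumed to be given. Dead tokens are always dropped; the topological reordering (Kahn's algorithm with name-based tie-breaking) only runs when match merging is off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class NormalizationSketch {

    /** Hypothetical token: a name, a dead flag, and indices of tokens it depends on. */
    static final class Token {
        final String name;
        final boolean dead;
        final int[] dependsOn;

        Token(String name, boolean dead, int... dependsOn) {
            this.name = name;
            this.dead = dead;
            this.dependsOn = dependsOn;
        }
    }

    /** Drops dead tokens; reorders the rest topologically unless match merging is enabled. */
    static List<String> normalize(List<Token> tokens, boolean matchMerging) {
        List<String> result = new ArrayList<>();
        if (matchMerging) {
            // Assumption: keep the original order so match merging sees unshuffled tokens.
            for (Token token : tokens) {
                if (!token.dead) {
                    result.add(token.name);
                }
            }
            return result;
        }

        // Build the dependency graph over live tokens only.
        int n = tokens.size();
        int[] inDegree = new int[n];
        List<List<Integer>> dependents = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            dependents.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            if (tokens.get(i).dead) {
                continue;
            }
            for (int dep : tokens.get(i).dependsOn) {
                if (!tokens.get(dep).dead) {
                    dependents.get(dep).add(i);
                    inDegree[i]++;
                }
            }
        }

        // Kahn's algorithm; ties are broken by token name, then by original index,
        // so independent tokens always end up in the same canonical order.
        PriorityQueue<Integer> ready = new PriorityQueue<>((a, b) -> {
            int byName = tokens.get(a).name.compareTo(tokens.get(b).name);
            return byName != 0 ? byName : Integer.compare(a, b);
        });
        for (int i = 0; i < n; i++) {
            if (!tokens.get(i).dead && inDegree[i] == 0) {
                ready.add(i);
            }
        }
        while (!ready.isEmpty()) {
            int current = ready.poll();
            result.add(tokens.get(current).name);
            for (int next : dependents.get(current)) {
                if (--inDegree[next] == 0) {
                    ready.add(next);
                }
            }
        }
        return result;
    }
}
```

In this sketch, two independent tokens `b` and `a` followed by a dependent token `c` come out as `b, a, c` when `matchMerging` is true and as `a, b, c` when it is false; a dead token is removed in both cases.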