Improve code quality of token sequence normalization #1872

tsaglam · 2024-07-16T09:12:43Z

This PR improves the code quality of the token sequence normalization module:

- Add more documentation
- Use more descriptive names to avoid cryptic code
- Merge the graph builder into a dedicated graph class to avoid using unspecific data structures
- Rename concepts to be consistent with the paper
- Add paper link

Old Description (Outdated)

Run token sequence normalization (--normalize) without topological sorting whenever match merging (--match-merging) is enabled to prevent interference. This PR also renames TokenStringNormalizer to TokenSequenceNormalizer, which is the terminology used in JDoc and the scientific literature.

This PR also refactors the token sequence normalization to improve code quality.

Background:
Reordering tokens to normalize the token order improves detection quality slightly, but the main impact comes from dead token removal. When match merging is also enabled, match merging struggles to revert split matches because of the sorting performed by token sequence normalization. Thus, when both are used, topological sorting is now disabled in favor of match merging.
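
For readers unfamiliar with the two steps mentioned in the background, here is a minimal sketch of dead token removal followed by an optional topological reordering. It is not JPlag's implementation: the class name, method name, parameters, and the choice of Kahn's algorithm with alphabetical tie-breaking are all assumptions made for illustration. It also assumes unique token names and an acyclic dependency graph.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical sketch; none of these names are taken from the JPlag code base. */
class NormalizationSketch {

    /**
     * Drops dead tokens and, if sort is true, reorders the rest topologically.
     * predecessors maps a token to the tokens that must appear before it.
     * Assumes unique token names and no dependency cycles.
     */
    static List<String> normalize(List<String> tokens, Set<String> dead,
            Map<String, List<String>> predecessors, boolean sort) {
        // Step 1: dead token removal, always applied.
        List<String> live = new ArrayList<>();
        for (String token : tokens) {
            if (!dead.contains(token)) {
                live.add(token);
            }
        }
        if (!sort) {
            return live; // keep the original order, e.g. when match merging is enabled
        }

        // Step 2: Kahn's algorithm; ties are broken alphabetically so that
        // independent tokens end up in the same order regardless of input order.
        Map<String, Integer> unmet = new HashMap<>();
        Map<String, List<String>> successors = new HashMap<>();
        for (String token : live) {
            unmet.put(token, 0);
        }
        for (String token : live) {
            for (String pred : predecessors.getOrDefault(token, List.of())) {
                if (!unmet.containsKey(pred)) {
                    continue; // predecessor was removed or is unknown
                }
                unmet.merge(token, 1, Integer::sum);
                successors.computeIfAbsent(pred, k -> new ArrayList<>()).add(token);
            }
        }
        PriorityQueue<String> ready = new PriorityQueue<>();
        for (String token : live) {
            if (unmet.get(token) == 0) {
                ready.add(token);
            }
        }
        List<String> sorted = new ArrayList<>();
        while (!ready.isEmpty()) {
            String next = ready.poll();
            sorted.add(next);
            for (String successor : successors.getOrDefault(next, List.of())) {
                if (unmet.merge(successor, -1, Integer::sum) == 0) {
                    ready.add(successor);
                }
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("b", "unused", "a", "c");
        Set<String> dead = Set.of("unused");
        Map<String, List<String>> predecessors = Map.of("c", List.of("a"));
        System.out.println(normalize(tokens, dead, predecessors, false)); // [b, a, c]
        System.out.println(normalize(tokens, dead, predecessors, true)); // [a, b, c]
    }
}
```

With sorting disabled, only the dead token is dropped and the original order survives, which is the behavior this description proposed when match merging is active.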

uuqjz

Found one thing, the rest looks good to me.

core/src/main/java/de/jplag/normalization/TokenSequenceNormalizer.java

tsaglam · 2024-07-30T14:57:52Z

Had some time to do some systematic testing. I will paste the results under this comment.

Base = Default JPlag
TSN = --normalize
SMM = --match-merging
-new suffix = PR Branch

TicTacToe + GPT-4 Obfuscation:

TicTacToe + Insertion-based Obfuscation:

PROGpedia-19 + Insertion-based Obfuscation:

TL;DR: I cannot reproduce the issue at a large scale. Thus, I will remove the disabling of topological sorting from this PR. However, I will keep the PR, as the code quality improvements to TSN are worthwhile. The title and description will be adapted.

sonarqubecloud · 2024-07-30T15:55:43Z

Quality Gate passed for 'JPlag Plagiarism Detector'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
95.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

Run token sequence normalization without topological sorting whenever…

0f73cc0

… match merging is enabled to prevent interference.

tsaglam added the enhancement and minor labels Jul 16, 2024

tsaglam added 3 commits July 16, 2024 11:14

Remove unused import.

793007c

Extend normalization test cases.

d0f13a8

Minor code quality improvements.

c3e8703

tsaglam marked this pull request as ready for review July 16, 2024 11:37

tsaglam requested review from a team July 16, 2024 11:38

uuqjz reviewed Jul 16, 2024

core/src/main/java/de/jplag/normalization/TokenSequenceNormalizer.java (outdated, resolved)

TwoOfTwelve approved these changes Jul 17, 2024

tsaglam added 2 commits July 17, 2024 16:35

Refactor token sequence normalization completely to improve code qual…

46744f2

…ity.

Make fields required for construction transient to make sonar happy.

a88e538

tsaglam changed the title ~~Prevent conflicts between token sequence normalization and match merging~~ Imrpove code quality of token sequence normalization Jul 30, 2024

tsaglam added 2 commits July 30, 2024 17:45

Revert disabling of topological sorting.

5ccddb4

Revert to more compact method reference syntax.

86af3f8

tsaglam changed the title ~~Imrpove code quality of token sequence normalization~~ Improve code quality of token sequence normalization Jul 30, 2024

tsaglam merged commit 36f025c into develop Jul 31, 2024
44 checks passed

tsaglam deleted the feature/improved-normalization branch July 31, 2024 06:19
