Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve code quality of token sequence normalization #1872

Merged
merged 8 commits into from
Jul 31, 2024

Conversation

tsaglam
Copy link
Member

@tsaglam tsaglam commented Jul 16, 2024

This PR improves the code quality of the token sequence normalization module:

  • Add more documentation
  • Use more speaking names to avoid cryptic code
  • Merge graph builder into dedicated graph class to avoid using unspecific data structures
  • Rename concepts to be consistent with the paper
  • Add paper link
Old Description (Outdated)

Run token sequence normalization (--normalize) without topological sorting whenever match merging (--match--merging) is enabled to prevent interference. This PR also adapts the naming of TokenStringNormalizer to TokenSequenceNormalizer which is the terminology used in JDoc and the scientific literature.

This PR also refactors the token sequence normalization to improve code quality.

Background:
The reordering of tokens to normalize the token order helps to improve detection quality slightly, but the main impact comes from dead token removal. When match merging is also enabled, it struggles to revert split matches due to the sorting of the token sequence normalization. Thus, when using both, the topological sorting is now disabled in favor of match merging.

… match merging is enabled to prevent interference.
@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes minor Minor issue/feature/contribution/change labels Jul 16, 2024
@tsaglam tsaglam marked this pull request as ready for review July 16, 2024 11:37
@tsaglam tsaglam requested review from a team July 16, 2024 11:38
Copy link
Contributor

@uuqjz uuqjz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one thing, the rest looks got to me

@tsaglam
Copy link
Member Author

tsaglam commented Jul 30, 2024

Had some time to do some systematic testing. I will paste the results under this comment.

Toggle me!
  • Base = Default JPlag
  • TSN = --normalize
  • SMM = --match-merging
  • -new suffix = PR Branch

TicTacToe + GPT-4 Obfuscation:

image

TicTacToe + Insertion-based Obfuscation:

image

PROGpedia-19 + Insertion-based Obfuscation:

image

TL;DR: I cannot reproduce the issue at a large scale. Thus, I remove the disabled sorting from this PR. However, I will keep the PR as the code quality improvements to TSN are good. The title and description will be adapted.

@tsaglam tsaglam changed the title Prevent conflicts between token sequence normalization and match merging Imrpove code quality of token sequence normalization Jul 30, 2024
@tsaglam tsaglam changed the title Imrpove code quality of token sequence normalization Improve code quality of token sequence normalization Jul 30, 2024
Copy link

@tsaglam tsaglam merged commit 36f025c into develop Jul 31, 2024
44 checks passed
@tsaglam tsaglam deleted the feature/improved-normalization branch July 31, 2024 06:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Issue/PR that involves features, improvements and other changes minor Minor issue/feature/contribution/change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants