Multi Language Program Support #2087

Merged

TwoOfTwelve
Contributor

@TwoOfTwelve TwoOfTwelve commented Dec 2, 2024

This PR adds the capability to parse multiple languages with JPlag in one run.

A new language module is added, which identifies a module for each file and delegates the parsing to that module. The resulting tokens are concatenated and passed back to JPlag. This does not allow comparing different languages with each other, but it allows for projects that contain multiple languages (multi-language projects).

The languages are discovered at runtime using the same mechanism as the CLI. This made it necessary to move the LanguageLoader class from the cli module to the language-api module.
The language modules are identified using the suffixes defined in the individual language modules. This matches the way JPlag already identifies code files in a submission.
The user has to manually select all language modules that should be used, via a language-specific option.

If multiple language modules are selected for the same suffix, a module is chosen arbitrarily.
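
The delegation described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SketchLanguage`, `SimpleLanguage`, and `parseAll` are hypothetical stand-ins for JPlag's real types, tokens are plain strings, and skipping files that match no selected module is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins for illustration; these are NOT JPlag's real interfaces.
final class SuffixDispatchSketch {

    /** Minimal stand-in for a language module: a name, its suffixes, and a parser. */
    interface SketchLanguage {
        String name();

        List<String> suffixes();

        List<String> parse(String fileName); // returns the tokens for one file
    }

    record SimpleLanguage(String name, List<String> suffixes) implements SketchLanguage {
        @Override
        public List<String> parse(String fileName) {
            // A real module would tokenize the file content; here we emit one marker token.
            return List.of(name + ":" + fileName);
        }
    }

    /** Returns the first selected language with a suffix matching the file name, if any. */
    static Optional<SketchLanguage> findLanguageForFile(List<SketchLanguage> selected, String fileName) {
        return selected.stream()
                .filter(language -> language.suffixes().stream().anyMatch(fileName::endsWith))
                .findFirst();
    }

    /** Delegates each file to its module and concatenates the resulting tokens in file order. */
    static List<String> parseAll(List<SketchLanguage> selected, List<String> fileNames) {
        List<String> tokens = new ArrayList<>();
        for (String fileName : fileNames) {
            // Assumption: files without a matching module are skipped.
            findLanguageForFile(selected, fileName)
                    .ifPresent(language -> tokens.addAll(language.parse(fileName)));
        }
        return tokens;
    }
}
```

Because `findFirst` is used, the winner for a shared suffix depends only on the order of the selected modules, which is why the choice is effectively arbitrary from the user's point of view.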

Future considerations:

  • Change the CLI to allow configuration for multiple language modules. Right now it's not possible to set language-specific options for the selected languages. Changing that would be a major change to the CLI, though.
  • Allow implicitly selecting all languages. This would make JPlag easier to use, since a user would no longer have to select a language module, but it makes handling language modules with the same suffix harder.
  • Find a better way to identify the correct language module. This could be done by adding priorities to language modules, or by adding a function that analyzes the contents of a file beforehand.
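
One way the priority idea from the last point might look, as a rough sketch only: the `priority` field, the `PrioritizedLanguage` record, and the `select` method are all hypothetical and do not exist in JPlag.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; JPlag's language modules have no priority today.
final class PrioritySelectionSketch {

    /** Hypothetical module descriptor with an explicit priority (higher wins). */
    record PrioritizedLanguage(String name, List<String> suffixes, int priority) {
        boolean supports(String fileName) {
            return suffixes.stream().anyMatch(fileName::endsWith);
        }
    }

    /** Among all modules that support the file, picks the one with the highest priority. */
    static Optional<PrioritizedLanguage> select(List<PrioritizedLanguage> languages, String fileName) {
        return languages.stream()
                .filter(language -> language.supports(fileName))
                .max(Comparator.comparingInt(PrioritizedLanguage::priority));
    }
}
```

With distinct priorities, the result no longer depends on the order in which modules were loaded or selected. Ties would still need an explicit rule.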

Usage example:

jplag multi --languages java,cpp <path to submissions>

Contributor

@uuqjz uuqjz left a comment

Looks good. Just found two small things.

@TwoOfTwelve TwoOfTwelve requested a review from uuqjz December 2, 2024 10:18
Contributor

@uuqjz uuqjz left a comment

LGTM

tsaglam

This comment was marked as resolved.

@TwoOfTwelve TwoOfTwelve changed the title Feature/bachelor alex multi language Mulit language module Dec 2, 2024
Contributor

@robinmaisch robinmaisch left a comment

Neat!

@uuqjz
Contributor

uuqjz commented Dec 2, 2024

@TwoOfTwelve could you address the Sonar issues please?

Member

@tsaglam tsaglam left a comment

For conceptually larger PRs, PRs that add a lot of new functionality, or PRs that are research-related, the PR description should be more detailed. It should sufficiently describe the intent of the PR and how it works. It should also cover edge cases like the one mentioned below. Note that the PRs are linked in the release notes, so we should write them as intended for someone external.

As an example, for this PR, write:

  • What is the intent (parsing programs with code from multiple languages)
  • How is it implemented (language module that passes the files to the language-specific ones and receives tokens)
  • Design decisions (e.g., loading languages dynamically instead of dependencies)
  • Edge cases (e.g. see below)


private Optional<Language> findLanguageForFile(File file) {
    return this.languages.stream().filter(language -> Arrays.stream(language.suffixes()).anyMatch(suffix -> file.getName().endsWith(suffix)))
            .findFirst();
Member

This will currently pick the first language for cases where multiple languages support the same file type. Have you discussed this behavior? This affects the C/C++ modules and also the EMF modules.

Member

(also in future Java vs. Java-CPG)

Contributor Author

I have discussed this with Robin, Nils and Sebastian. Currently it's not that big of an issue, since the user has to select the modules manually. If there are multiple selected modules for the same file, the module is chosen arbitrarily.

This should be addressed in the future, maybe by adding priorities to language modules or by distinguishing files in more detail than just the suffix. I think it should be done separately though.

Member

Ah I understand; I overlooked that the users specify the languages when using the module.
In the long run, I considered making the multi-language module the default, but that would only be user-friendly if users do not need to specify languages. This would mean the multi-language module automatically parses all code that JPlag supports. Then we need prioritization.

With the current solution, we add yet another CLI argument, which is less likely to be used by many users. For now, let us leave it as is, but before the release, we need to think about which mode we truly want. If we want more people to try out the multi-language module, we probably need to implement the unparameterized version. However, in all cases, I would not make it the default language straight away.

@tsaglam tsaglam changed the title Mulit language module Multi Language Program Support Dec 3, 2024
@tsaglam tsaglam added the enhancement, major, and language labels Dec 3, 2024
@TwoOfTwelve TwoOfTwelve requested a review from tsaglam December 4, 2024 11:56
Member

@tsaglam tsaglam left a comment

Looks good; we should discuss in the future if we prefer specifying languages or if all supported languages should be used by default.

@TwoOfTwelve
Contributor Author

@tsaglam Can you look at the sonar issue with the cyclic dependency? I think it's fine in this case.

@tsaglam
Member

tsaglam commented Dec 4, 2024

> Can you look at the sonar issue with the cyclic dependency? I think it's fine in this case.

I briefly looked at the cycles.
The multi-language cycle could be resolved by avoiding the .class comparison and adding a method to the language interface. However, I would not do that, as this could also be resolved when changing the things discussed above in a later PR. So let us leave that cycle as is.

The other one was not immediately clear to me, I will have a look at it later.
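
For illustration, replacing a `.class` comparison with an interface method might look like the sketch below. It assumes the comparison serves to exclude the multi-language module from the loaded modules, which the thread does not state; `SketchLanguage`, `delegatesToOtherLanguages`, and the other names are hypothetical and not part of JPlag's Language interface.

```java
import java.util.List;

// Hypothetical sketch; none of these names exist in JPlag.
final class CycleFreeFilterSketch {

    interface SketchLanguage {
        String name();

        /** Hypothetical flag; a module that delegates to other modules overrides it. */
        default boolean delegatesToOtherLanguages() {
            return false;
        }
    }

    record PlainLanguage(String name) implements SketchLanguage {
    }

    record DelegatingLanguage(String name) implements SketchLanguage {
        @Override
        public boolean delegatesToOtherLanguages() {
            return true;
        }
    }

    /** Filters by the interface method, so the caller needs no reference to a concrete class. */
    static List<SketchLanguage> selectableLanguages(List<SketchLanguage> all) {
        return all.stream().filter(language -> !language.delegatesToOtherLanguages()).toList();
    }
}
```

The filtering code then depends only on the interface, so no reference back to the concrete multi-language class is needed.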

@TwoOfTwelve
Contributor Author

I didn't expect the other one either. I'll take a look at it too.

@tsaglam
Member

tsaglam commented Dec 4, 2024

I think the other one is: Submission --> JPlagOptions --> SimilarityMetric --> JPlagComparison --> Submission, which is not mentioned by Sonar. It should be fine as well. Structurally, I think SimilarityMetric is the culprit and needs a redesign, since it takes over the responsibility of the JPlagComparison class via the method references. But this has been part of the code for a long time. I would say, fix only the public modifier, and we ignore the cycles for now.

EDIT: I have an idea how to fix this cycle in a future PR, but this is not time-sensitive at all.


sonarqubecloud bot commented Dec 4, 2024

@robinmaisch robinmaisch merged commit 0774cc7 into jplag:develop Dec 4, 2024
41 checks passed