Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] CPD: Add total number of tokens to XML reports #4021

Merged
merged 11 commits into from
Jul 21, 2022

Conversation

maikelsteneker
Copy link
Contributor

@maikelsteneker maikelsteneker commented Jun 22, 2022

Describe the PR

This PR extends the XML output of CPD. This output produces <duplication> elements for each duplication that is encountered. In addition to this, it now also produces <file> elements for each file with the number of tokens in the file added to this format. For example:

<?xml version="1.0" encoding="UTF-8"?>
<pmd-cpd>
   <file filename="/home/maikel/CR28584/file1.cs" totalNumberOfTokens="102"/>
   <file filename="/home/maikel/CR28584/file2.cs" totalNumberOfTokens="102"/>
   <duplication lines="15" tokens="102">
...

Other output formats currently don't contain this information, but could easily be extended to add it if needed.

Ready?

  • Added unit tests for fixed bug/feature
  • Passing all unit tests
  • Complete build ./mvnw clean verify passes (checked automatically by github actions)
  • Added (in-code) documentation (if needed)

@pmd-test
Copy link

pmd-test commented Jun 22, 2022

1 Message
📖 Compared to master:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
Compared to master:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
Compared to master:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
Compared to master:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
Compared to master:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
Compared to master:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
Compared to master:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report

Generated by 🚫 Danger

@adangel adangel changed the title Add total number of tokens to XML reports [core] Add total number of tokens to XML reports Jun 23, 2022
@adangel adangel added the an:enhancement An improvement on existing features / rules label Jun 23, 2022
@oowekyala oowekyala self-requested a review June 23, 2022 13:35
Copy link
Member

@adangel adangel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I'm not sure, whether it's a good idea to add this additional method to the existing interface CPDRenderer.... wdyt @oowekyala ?

Documentation-wise we should also update the example on https://pmd.github.io/latest/pmd_userdocs_cpd_report_formats.html#xml
Maybe it's even time to add a formal documentation of the format or even a schema?
Add these additional elements to the XML format shouldn't be a problem. E.g. the maven-pmd-plugin would just ignore this additional elements (I hope...).

Could you also describe the use case a bit? What is this information used for? The number of tokens is an indication of how big the file is. Or do you maybe want to calculate later on that e.g. file x contains 20% duplicated code?

The XML format is now rather flat. Should we maybe group the duplication inside a <file> element? That of course would render the whole format incompatible, but we could introduce complete new XML renderer with a different format...

@maikelsteneker
Copy link
Contributor Author

Thank you for the comments!

On the use case: you seem to have guessed correctly already :) We're using data from CPD to calculate a percentage of duplicated code in a file/project. Until now, we've done this by calculating the number of lines that contain a duplication and we divide it by the total number of lines. This works reasonably well, but there are obvious examples where the results are unexpected (for example, removing all line breaks from a file means it can only have 0% or 100% duplication). If we can use the number of tokens instead, we'll have a more precise metric for the relative amount of duplication within a file.

On the XML format: my thinking was that adding additional elements would preserve backwards compatibility reasonably well (assuming users don't blindly consume all elements). Restructuring the format would require users to change the way the file is parsed. I think that changing the structure does not result in a worthwhile tradeoff in this case, but I'm not against changing it if you disagree.

@oowekyala
Copy link
Member

Thanks for the contribution @maikelsteneker

I'm not sure, whether it's a good idea to add this additional method to the existing interface CPDRenderer.... wdyt @oowekyala ?

This is indeed my main gripe with the code change. I think adding a method parameter is not the right way, if we want to extend the info available to renderers in the future, it would be more flexible to pack all parameters in an object (eg a "CpdReport"). I would rather introduce an interface CpdReportRenderer whose method takes a CpdReport and a Writer, and deprecate the old interface. This way we can add members to CpdReport later without breaking the renderer interface

I wouldn't add an AbstractCpdRenderer either. The renderer interfaces should remain simple and an abstract class hints otherwise. We can use an adapter pattern to convert from the old interface to the new one.

@maikelsteneker maikelsteneker marked this pull request as draft June 28, 2022 13:07
@maikelsteneker
Copy link
Contributor Author

@oowekyala Thank you for the feedback!

I think you raised some very good points, so I have reworked my solution to resemble your proposal more closely. That is, I have introduced a new interface, the XML renderer is the only one that is changed as a result and the remaining renderers use an adapter to remain functioning. I guess that in the PMD 7.x branch, all renderers should implement just the latest interface, with the older ones being removed.

@adangel @oowekyala Could you take another look to see if this new implementation is up to your standards? Thanks in advance for your efforts!

@maikelsteneker maikelsteneker marked this pull request as ready for review June 29, 2022 10:21
Copy link
Member

@oowekyala oowekyala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing these changes, just a couple of minor comments

Copy link
Member

@adangel adangel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes!

From my point of view the only thing we should decide about is: Should we return a List<Matches> instead of Iterator<Matches> in the new CPDReport?

My other comments are some minor things that can also be changed later.

@adangel adangel added this to the 6.48.0 milestone Jul 1, 2022
@adangel adangel self-requested a review July 18, 2022 19:33
Copy link
Member

@adangel adangel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updated PR.

I'll change the things I've mentioned and merge this afterwards.

@adangel adangel changed the title [core] Add total number of tokens to XML reports [core] CPD: Add total number of tokens to XML reports Jul 21, 2022
@adangel adangel added the in:cpd Affects the copy-paste detector label Jul 21, 2022
@adangel adangel merged commit 029b4b2 into pmd:master Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
an:enhancement An improvement on existing features / rules in:cpd Affects the copy-paste detector
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants