[core] CPD: Add total number of tokens to XML reports #4021

maikelsteneker · 2022-06-22T10:48:38Z

Describe the PR

This PR extends the XML output of CPD. The output already contains a <duplication> element for each duplication that is found. With this change, it also contains a <file> element for each file, carrying the number of tokens in that file. For example:

<?xml version="1.0" encoding="UTF-8"?>
<pmd-cpd>
   <file filename="/home/maikel/CR28584/file1.cs" totalNumberOfTokens="102"/>
   <file filename="/home/maikel/CR28584/file2.cs" totalNumberOfTokens="102"/>
   <duplication lines="15" tokens="102">
...

Other output formats currently don't contain this information, but could easily be extended to add it if needed.
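
A minimal sketch of how a consumer might read the new elements, assuming the attribute names shown in the example above (filename and totalNumberOfTokens). The class name CpdTokenCounts is hypothetical and not part of PMD. Only direct children of the root are inspected, since <duplication> elements may contain their own <file> children.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of PMD: collects token counts per file
// from a CPD XML report shaped like the example in the PR description.
final class CpdTokenCounts {

    static Map<String, Integer> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Integer> counts = new LinkedHashMap<>();
            // Look only at direct children of the root element.
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (!(node instanceof Element)) {
                    continue; // skip whitespace text nodes and comments
                }
                Element element = (Element) node;
                // Assumed attribute names, taken from the example output.
                if ("file".equals(element.getTagName())
                        && element.hasAttribute("totalNumberOfTokens")) {
                    counts.put(element.getAttribute("filename"),
                            Integer.parseInt(element.getAttribute("totalNumberOfTokens")));
                }
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse CPD report", e);
        }
    }
}
```

A consumer that ignores unknown elements is unaffected by the addition; one that wants the counts can call CpdTokenCounts.read(reportXml) and look up a file name in the returned map.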

Ready?

Added unit tests for fixed bug/feature
Passing all unit tests
Complete build ./mvnw clean verify passes (checked automatically by github actions)
Added (in-code) documentation (if needed)

pmd-test · 2022-06-22T11:17:43Z

	1 Message
📖	Compared to master: This changeset changes 0 violations, introduces 0 new violations, 0 new errors and 0 new configuration errors, removes 0 violations, 0 errors and 0 configuration errors. Full report

Generated by 🚫 Danger

adangel

Thanks for the PR!

I'm not sure whether it's a good idea to add this additional method to the existing interface CPDRenderer... wdyt @oowekyala?

Documentation-wise we should also update the example on https://pmd.github.io/latest/pmd_userdocs_cpd_report_formats.html#xml
Maybe it's even time to add formal documentation of the format, or even a schema?
Adding these additional elements to the XML format shouldn't be a problem. E.g. the maven-pmd-plugin would just ignore these additional elements (I hope...).

Could you also describe the use case a bit? What is this information used for? The number of tokens is an indication of how big the file is. Or do you maybe want to calculate later on that e.g. file x contains 20% duplicated code?

The XML format is now rather flat. Should we maybe group the duplications inside a <file> element? That of course would render the whole format incompatible, but we could introduce a completely new XML renderer with a different format...

pmd-core/src/main/java/net/sourceforge/pmd/cpd/SimpleRenderer.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/VSRenderer.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/renderer/CPDRenderer.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/CSVRenderer.java

pmd-core/src/test/java/net/sourceforge/pmd/cpd/CSVRendererTest.java

pmd-core/src/test/java/net/sourceforge/pmd/cpd/XMLRendererTest.java

pmd-core/src/test/java/net/sourceforge/pmd/cpd/CPDCommandLineInterfaceTest.java

maikelsteneker · 2022-06-24T14:32:14Z

Thank you for the comments!

On the use case: you seem to have guessed correctly already :) We're using data from CPD to calculate a percentage of duplicated code in a file/project. Until now, we've done this by calculating the number of lines that contain a duplication and we divide it by the total number of lines. This works reasonably well, but there are obvious examples where the results are unexpected (for example, removing all line breaks from a file means it can only have 0% or 100% duplication). If we can use the number of tokens instead, we'll have a more precise metric for the relative amount of duplication within a file.
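
A rough sketch of the token-based metric described above. It assumes the duplicated regions of a file are known as half-open token ranges [begin, end), which is an assumption for illustration: the XML in this PR only adds the per-file total. The class and method names are hypothetical. Overlapping ranges are counted once.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of PMD: percentage of a file's tokens
// that fall inside at least one duplicated range.
final class TokenDuplication {

    /**
     * @param totalTokens total number of tokens in the file
     * @param ranges      duplicated ranges as {begin, end} pairs, end exclusive
     */
    static double duplicatedPercent(int totalTokens, List<int[]> ranges) {
        if (totalTokens <= 0) {
            return 0.0;
        }
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));

        long covered = 0;
        int countedUpTo = 0; // first token index that has not been counted yet
        for (int[] range : sorted) {
            // Skip the part of this range that an earlier range already covered.
            int begin = Math.max(range[0], countedUpTo);
            int end = Math.min(range[1], totalTokens);
            if (end > begin) {
                covered += end - begin;
                countedUpTo = end;
            }
        }
        return 100.0 * covered / totalTokens;
    }
}
```

For the single-line case mentioned above, a file with 204 tokens of which the first 102 are duplicated yields 50% here, whereas a line-based metric could only report 0% or 100%.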

On the XML format: my thinking was that adding additional elements would preserve backwards compatibility reasonably well (assuming users don't blindly consume all elements). Restructuring the format would require users to change the way the file is parsed. I think that changing the structure does not result in a worthwhile tradeoff in this case, but I'm not against changing it if you disagree.

oowekyala · 2022-06-24T17:05:38Z

Thanks for the contribution @maikelsteneker

I'm not sure whether it's a good idea to add this additional method to the existing interface CPDRenderer... wdyt @oowekyala?

This is indeed my main gripe with the code change. I think adding a method parameter is not the right way. If we want to extend the info available to renderers in the future, it would be more flexible to pack all parameters in an object (e.g. a "CpdReport"). I would rather introduce an interface CpdReportRenderer whose method takes a CpdReport and a Writer, and deprecate the old interface. This way we can add members to CpdReport later without breaking the renderer interface.

I wouldn't add an AbstractCpdRenderer either. The renderer interfaces should remain simple and an abstract class hints otherwise. We can use an adapter pattern to convert from the old interface to the new one.
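
The proposal above could look roughly like the following sketch. All type names here (OldRenderer, SketchReport, ReportRenderer, OldRendererAdapter, RenderSketch) are hypothetical stand-ins chosen for illustration, and matches are simplified to strings; this is not the code that was merged.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Stand-in for the old contract: a renderer only ever sees the matches. */
interface OldRenderer {
    void render(Iterator<String> matches, Writer writer) throws IOException;
}

/** Report object: new members can be added later without touching renderers. */
final class SketchReport {
    private final List<String> matches;
    private final Map<String, Integer> tokensPerFile;

    SketchReport(List<String> matches, Map<String, Integer> tokensPerFile) {
        this.matches = matches;
        this.tokensPerFile = tokensPerFile;
    }

    List<String> getMatches() {
        return matches;
    }

    Map<String, Integer> getTokensPerFile() {
        return tokensPerFile;
    }
}

/** New contract: takes the whole report plus a Writer. */
interface ReportRenderer {
    void render(SketchReport report, Writer writer) throws IOException;
}

/** Adapter: lets an old-style renderer be used where the new contract is expected. */
final class OldRendererAdapter implements ReportRenderer {
    private final OldRenderer delegate;

    OldRendererAdapter(OldRenderer delegate) {
        this.delegate = delegate;
    }

    @Override
    public void render(SketchReport report, Writer writer) throws IOException {
        // The old renderer cannot use the extra data, so only matches are passed on.
        delegate.render(report.getMatches().iterator(), writer);
    }
}

/** Small convenience for trying a renderer out against an in-memory writer. */
final class RenderSketch {
    static String renderToString(ReportRenderer renderer, SketchReport report) {
        StringWriter out = new StringWriter();
        try {
            renderer.render(report, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With this shape, a renderer that needs the token counts implements ReportRenderer directly, while existing renderers are wrapped in OldRendererAdapter and keep working unchanged.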

maikelsteneker · 2022-06-29T10:21:42Z

@oowekyala Thank you for the feedback!

I think you raised some very good points, so I have reworked my solution to resemble your proposal more closely. That is, I have introduced a new interface. The XML renderer is the only one that changes as a result; the remaining renderers use an adapter to keep working. I guess that in the PMD 7.x branch, all renderers should implement just the latest interface, with the older ones being removed.

@adangel @oowekyala Could you take another look to see if this new implementation is up to your standards? Thanks in advance for your efforts!

oowekyala

Thanks for doing these changes, just a couple of minor comments

pmd-core/src/main/java/net/sourceforge/pmd/cpd/CPDTask.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/CPDReport.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/CPDConfiguration.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/CPD.java

adangel

Thanks for the changes!

From my point of view the only thing we should decide about is: Should we return a List<Matches> instead of Iterator<Matches> in the new CPDReport?

My other comments are some minor things that can also be changed later.

pmd-core/src/test/java/net/sourceforge/pmd/cpd/XMLRendererTest.java

pmd-core/src/test/java/net/sourceforge/pmd/cpd/CPDCommandLineInterfaceTest.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/XMLRenderer.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/CPDReport.java

adangel

Thanks for the updated PR.

I'll change the things I've mentioned and merge this afterwards.

pmd-core/src/main/java/net/sourceforge/pmd/cpd/renderer/CPDRendererAdapter.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/CPDConfiguration.java

pmd-core/src/main/java/net/sourceforge/pmd/cpd/CPDReport.java

adangel changed the title ~~Add total number of tokens to XML reports~~ [core] Add total number of tokens to XML reports Jun 23, 2022

adangel added the an:enhancement An improvement on existing features / rules label Jun 23, 2022

oowekyala self-requested a review June 23, 2022 13:35

adangel reviewed Jun 24, 2022

maikelsteneker marked this pull request as draft June 28, 2022 13:07

Add total number of tokens to XML reports

9fc8a56

maikelsteneker force-pushed the report_total_number_of_tokens branch from e7c08e6 to 9fc8a56 Compare June 29, 2022 09:37

maikelsteneker marked this pull request as ready for review June 29, 2022 10:21

oowekyala reviewed Jun 29, 2022

maikelsteneker and others added 2 commits June 30, 2022 11:11

Improve encapsulation of CPD report contents

d544efe

Hide CPDReport constructor

a15758f

Co-authored-by: Clément Fournier

adangel reviewed Jun 30, 2022

maikelsteneker added 2 commits July 1, 2022 10:34

Remove catch from test case

f4dd873

Move sorting to CPDReport class

4d78901

adangel added this to the 6.48.0 milestone Jul 1, 2022

adangel self-requested a review July 18, 2022 19:33

Merge branch 'master' into pr-4021

ce6fead

adangel approved these changes Jul 21, 2022

adangel added 3 commits July 21, 2022 15:45

[core] Internalize methods in CPDConfiguration and CPDRendererAdapter

ee8622e

[core] Refactor how CPD Renderers are determined

33bfd00

[core] Refactor CPDReport to use a List

94c19d2

adangel changed the title ~~[core] Add total number of tokens to XML reports~~ [core] CPD: Add total number of tokens to XML reports Jul 21, 2022

adangel added the in:cpd Affects the copy-paste detector label Jul 21, 2022

adangel added 2 commits July 21, 2022 16:48

[doc] Add deprecation notice for CPDRenderer

43a1733

[doc] Update release notes (pmd#4021)

4568176

adangel merged commit 029b4b2 into pmd:master Jul 21, 2022

pacvz mentioned this pull request Aug 17, 2022

[core] CPD: Added begin and end token to XML reports #4095

Merged

4 tasks

adangel mentioned this pull request Jan 12, 2023

[cpd][core] Add displaying total number of tokens per file #1569

Closed
