Extension CommitsLOC sometimes counts wrong. #120

Open
apepper opened this issue Aug 15, 2011 · 4 comments

Comments

@apepper
Contributor

apepper commented Aug 15, 2011

I wrote an SQL query to better understand the data quality of the Hunks extension. It sums the added and removed lines of all hunks per commit and compares the result with the output of the CommitsLOC extension (which parses git log --shortstat).

This is the query:

SELECT cl.commit_id AS commit_id,
       s.rev AS rev,
       cl.added AS added,
       h.added AS calc_added,
       cl.removed AS removed,
       h.removed AS calc_removed,
       s.message
FROM (
    SELECT commit_id,
           SUM(old_end_line - old_start_line + 1) AS removed,
           SUM(new_end_line - new_start_line + 1) AS added
    FROM hunks
    GROUP BY commit_id
  ) AS h
RIGHT JOIN commits_lines cl ON h.commit_id = cl.commit_id
JOIN scmlog s ON s.id = cl.commit_id
WHERE h.added != cl.added OR h.removed != cl.removed
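
To make the comparison concrete, here is a minimal Python sketch of the arithmetic the subquery performs. The `Hunk` tuple and the `hunk_totals` and `mismatches` helpers are hypothetical names for illustration and are not part of the extension code. The sketch assumes the line ranges are inclusive, which is what the `+ 1` in the query implies.

```python
from collections import defaultdict
from typing import NamedTuple


class Hunk(NamedTuple):
    # Hypothetical in-memory mirror of the columns the query reads from `hunks`.
    commit_id: int
    old_start_line: int
    old_end_line: int
    new_start_line: int
    new_end_line: int


def hunk_totals(hunks):
    """Sum removed/added lines per commit, treating ranges as inclusive."""
    totals = defaultdict(lambda: {"added": 0, "removed": 0})
    for h in hunks:
        totals[h.commit_id]["removed"] += h.old_end_line - h.old_start_line + 1
        totals[h.commit_id]["added"] += h.new_end_line - h.new_start_line + 1
    return dict(totals)


def mismatches(hunks, commits_lines):
    """Return commit ids whose hunk totals differ from the recorded counts.

    `commits_lines` maps commit_id -> (added, removed) and stands in for the
    `commits_lines` table. Commits without any hunks are skipped, just as
    the WHERE clause drops rows whose hunk sums are NULL.
    """
    totals = hunk_totals(hunks)
    return sorted(
        commit_id
        for commit_id, (added, removed) in commits_lines.items()
        if commit_id in totals
        and (totals[commit_id]["added"] != added
             or totals[commit_id]["removed"] != removed)
    )
```

A hunk covering old lines 10 to 12 contributes 3 removed lines (12 - 10 + 1), so an off-by-one in the stored ranges would show up directly as a mismatch here.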

While investigating why some commits don't add up, I have already published some patches to improve the data quality:


One really annoying thing is that CommitsLOC sometimes miscounts by up to 5 lines. I investigated the issue and found that this is a bug in git itself. I have already sent a bug report to the git mailing list, but so far there has been no answer.

Here is what I observed with the repository https://github.com/voldemort/voldemort.git:
For commit c21ad764 and the file .../readonly/mr/HadoopStoreBuilderReducer.java, the command git log --numstat c21ad764 reports 25 lines added and 22 lines removed.
But the patch for HadoopStoreBuilderReducer.java that I get with git show c21ad764 -- contrib/hadoop-store-builder/src/java/voldemort/store/readonly/mr/HadoopStoreBuilderReducer.java adds 30 lines and removes 27.

So 5 added and 5 removed lines are missing with git log --shortstat!
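
The gap can be checked mechanically. The sketch below parses text in the `--numstat` format (added count, removed count and path separated by tabs, with `-` in place of the counts for binary files) and subtracts it from counts taken from the patch. `parse_numstat` and `numstat_gap` are hypothetical helper names, and the sample string is reconstructed from the numbers quoted above plus a made-up binary entry; it is not captured git output.

```python
def parse_numstat(text):
    """Parse --numstat lines into {path: (added, removed)}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit headers, messages and blank lines
        added, removed, path = parts
        if not (added.isdigit() and removed.isdigit()):
            continue  # binary files are reported as "-<TAB>-<TAB>path"
        stats[path] = (int(added), int(removed))
    return stats


def numstat_gap(stats, path, patch_added, patch_removed):
    """How many lines the patch shows beyond what numstat reported."""
    added, removed = stats[path]
    return patch_added - added, patch_removed - removed


PATH = ("contrib/hadoop-store-builder/src/java/voldemort/store/readonly/mr/"
        "HadoopStoreBuilderReducer.java")
# Reconstructed sample: one text file (25 added, 22 removed) and one binary file.
sample = f"25\t22\t{PATH}\n-\t-\tsome/binary.png\n"
stats = parse_numstat(sample)
print(numstat_gap(stats, PATH, 30, 27))  # (5, 5)
```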

More commits where I observed this problem on the same repository:

  • 7e00fb6d2cf131dfed59c180f2171952808cc336 src/java/voldemort/client/rebalance/MigratePartitions.java
  • 78ad6f2a6ea327dbae2110f4530a5bd07e5deaac src/java/voldemort/client/rebalance/MigratePartitions.java (same commit on another branch)
  • 7871933f0f0f056e2eeac03a01db1e9cf81f8bda src/java/voldemort/client/protocol/admin/AdminClient.java
  • 2d6f68b09c3bdc23dcf3ae1f91c9285fbd668820 src/java/voldemort/store/readonly/ExternalSorter.java
  • 6fcacee866307ec34eb32b268e2c2b885a949319 build.xml

Maybe someone has an idea, or the C skills to build a working patch for git.

@apepper
Contributor Author

apepper commented Aug 17, 2011

An update:
On the git mailing list, Carlos Martín Nieto confirmed that git show --stat and git show | diffstat sometimes produce different counts. It seems that empty lines inside a hunk are not counted by git show --stat. So far, no solution has been offered on the git side.

Any ideas how we could solve this?

@philip-iii
Contributor

I have been concerned with this issue for a while as well and have created a similar query (albeit a bit more crude perhaps) to check the reliability of the hunks data. The query is:

SELECT
  COALESCE(old_start_line,0) as osl,COALESCE(old_end_line,0) as oel,
  COALESCE(new_start_line,0) as nsl,COALESCE(new_end_line,0) as nel,
  SUM(COALESCE(old_end_line,0)-COALESCE(old_start_line,1)+1) as h_removed,
  SUM(COALESCE(new_end_line,0)-COALESCE(new_start_line,1)+1) as h_added,
  removed,added,
  (SUM(COALESCE(old_end_line,0)-COALESCE(old_start_line,1)+1)-removed) as d_removed,
  (SUM(COALESCE(new_end_line,0)-COALESCE(new_start_line,1)+1)-added) as d_added,
  patch
FROM commits_lines l
LEFT JOIN hunks h ON h.commit_id = l.commit_id
JOIN patches p ON p.commit_id = l.commit_id
GROUP BY l.commit_id
HAVING h_removed!=removed OR h_added!=added

While most results are identical, within a small sample (produced with a slightly outdated version of cvsanaly) it does report around 30% more results, some due to deleted files.

Unfortunately, I have no specific idea how to fix this at the moment, but I'd be very interested to see how this gets resolved as well.

@apepper
Contributor Author

apepper commented Aug 19, 2011

I asked on Stack Overflow and got an answer. This may work: http://stackoverflow.com/questions/7122833/how-to-tell-git-log-numshortstat-to-count-empty-lines

I'll have a closer look at that on Monday.

@apepper
Contributor Author

apepper commented Aug 22, 2011

I just wrote an extension, PatchLOC, to work around this weakness of git. Instead of asking git how many lines were added and removed, PatchLOC counts all added and removed lines itself by parsing the patches from the patches extension and running regular expressions on them. See pull request #123. So CommitsLOC still counts wrong, but PatchLOC offers a way around it.
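
For readers who want to see the idea in code, here is a minimal Python sketch of counting added and removed lines per file from a unified diff. It is an illustration only and not the PatchLOC implementation from the pull request: `count_patch_lines` and both regular expressions are hypothetical, and the sketch assumes git-style headers (`--- a/path`, `+++ b/path`) with well-formed hunk headers.

```python
import re

# Hunk header of a unified diff: @@ -old_start[,old_len] +new_start[,new_len] @@
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")
# File headers; git prefixes the paths with "a/" and "b/".
OLD_FILE_RE = re.compile(r"^--- (?:a/)?(.+)$")
NEW_FILE_RE = re.compile(r"^\+\+\+ (?:b/)?(.+)$")


def count_patch_lines(patch):
    """Return {path: [added, removed]} for a unified diff.

    Lines are only counted while inside a hunk, so "---"/"+++" file headers
    are never mistaken for removed or added lines, and a removed line whose
    content starts with "--" is still counted.
    """
    counts = {}
    path = old_path = None
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in patch.splitlines():
        if old_left > 0 or new_left > 0:
            if line.startswith("+"):
                counts[path][0] += 1
                new_left -= 1
            elif line.startswith("-"):
                counts[path][1] += 1
                old_left -= 1
            elif not line.startswith("\\"):  # skip "\ No newline at end of file"
                old_left -= 1  # context line, present on both sides
                new_left -= 1
            continue
        m = OLD_FILE_RE.match(line)
        if m:
            old_path = m.group(1)
            continue
        m = NEW_FILE_RE.match(line)
        if m:
            # Deleted files have "+++ /dev/null"; fall back to the old path.
            path = m.group(1) if m.group(1) != "/dev/null" else old_path
            counts.setdefault(path, [0, 0])
            continue
        m = HUNK_RE.match(line)
        if m and path is not None:
            old_left = int(m.group(1) or 1)  # an omitted length means 1
            new_left = int(m.group(2) or 1)
    return counts
```

Tracking the lengths from the hunk header, rather than matching every line that starts with `+` or `-`, is a design choice in this sketch: it keeps empty added lines (a bare `+`) in the count while avoiding double counting of file headers.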

With the following SQL query, you can see how many mismatches the hunks data has, down to the file level:

SELECT pl.commit_id AS commit_id,
       pl.file_id,
       s.rev AS rev,
       pl.added AS added,
       h.added AS calc_added,
       pl.removed AS removed,
       h.removed AS calc_removed
FROM (
  SELECT h.id,
         h.commit_id,
         h.file_id,
         SUM(new_end_line - new_start_line + 1) AS added,
         SUM(old_end_line - old_start_line + 1) AS removed
  FROM hunks h
  GROUP BY commit_id, file_id
  ) AS h
RIGHT JOIN patch_lines pl ON h.commit_id = pl.commit_id AND h.file_id = pl.file_id
JOIN scmlog s ON s.id = pl.commit_id
WHERE IFNULL(h.added,0) != pl.added OR IFNULL(h.removed,0) != pl.removed
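
The IFNULL calls matter because the outer join yields NULL for files that have no hunk rows, and a comparison with NULL is never true, so such files would silently drop out of the result. The following minimal sketch demonstrates this with Python's built-in `sqlite3` module. The two-column tables are a toy, hypothetical schema rather than the real database layout, and the sketch uses a LEFT JOIN starting from `patch_lines`, which is equivalent to the RIGHT JOIN above, because older SQLite versions do not support RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hunk_totals (file_id INTEGER, added INTEGER);
    CREATE TABLE patch_lines (file_id INTEGER, added INTEGER);
    INSERT INTO hunk_totals VALUES (1, 10), (2, 4);
    -- file 3 has patch lines but no hunks at all
    INSERT INTO patch_lines VALUES (1, 10), (2, 5), (3, 7);
""")

BASE = """
    SELECT pl.file_id
    FROM patch_lines pl
    LEFT JOIN hunk_totals h ON h.file_id = pl.file_id
    WHERE {condition}
    ORDER BY pl.file_id
"""


def mismatching_files(condition):
    """Run the toy mismatch query with the given WHERE condition."""
    rows = conn.execute(BASE.format(condition=condition)).fetchall()
    return [file_id for (file_id,) in rows]


without_ifnull = mismatching_files("h.added != pl.added")
with_ifnull = mismatching_files("IFNULL(h.added, 0) != pl.added")
print(without_ifnull)  # [2]: file 3 is lost because NULL != 7 is not true
print(with_ifnull)     # [2, 3]
```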

Running this query against voldemort reveals 8 mismatches. I'll look into them in more detail over the next few days.
