[Git] Data mismatch for lines of code added/removed for a release commit #659

Open
linonymous opened this issue Apr 30, 2020 · 5 comments

@linonymous
Contributor

linonymous commented Apr 30, 2020

Issue:
The lines of code added and removed reported for a commit do not match the commit data shown on GitHub.

Command:

p2o.py --enrich --index sds-lfn-pnda-git-raw --index-enrich sds-lfn-pnda-git -e [redacted] -g --bulk-size 500 --scroll-size 1000 --db-host [redacted] --db-sortinghat [redacted] --db-user [redacted] --db-password [redacted] git https://github.com/pndaproject/platform-salt

Data for a commit (JSON)

Actual Data:

{
  ..............................
  "_source": {
    "branches": [],
    "title": "Support for Hortonworks HDP",
    "author_user_name": "",
    "time_to_commit_hours": 0.48,
    "author_gender": "Unknown",
    "lines_changed": 3266,
     ....................
    "files": 70,
    "lines_added": 3040,
    "tag": "https://github.com/pndaproject/platform-salt",
    "author_uuid": "ce136f854a3309899616dc4583176a89abb7a2f3",
    "hash": "c305399c1bccbdb0021395f1d6066419228846b6",
    "Commit_name": "James Clarke",
    "lines_removed": 226,
    ...............................
}

Expected Data:

For the following commit:
pndaproject/platform-salt@c305399

lines_added : 3085
lines_removed: 188

As you can see, this is a tagged release commit. Are such commits processed differently from the rest? An error of 40+ lines per commit adds up to a large error after aggregation.

@valeriocos
Member

valeriocos commented May 1, 2020

Hi @linonymous, thank you for reporting this issue!

Context
From the command posted above, I see that you are fetching data from this repo: https://github.com/pndaproject/platform-salt. I understand that you are reporting a mismatch between the data of the GitHub UI and the one generated by ELK.

I don't understand why you are then referencing a different commit from another repo. Can you clarify?

For the following commit:
cntt-n/CNTT@11f95c4

Analysis
For the commit with hash c305399c1bccbdb0021395f1d6066419228846b6 of the platform-salt repo, I confirm that the data reported above is the one generated by GrimoireLab:

    "files": 70,
    "lines_added": 3040,
    "lines_removed": 226,
    "lines_changed": 3266,

However the GitHub interface for that commit (pndaproject/platform-salt@c305399) reports different data. In particular, one more file:

[Screenshot of the GitHub commit view: captura_385]

I executed Perceval on that repo and the number of files returned is 70 for that commit (see command below). Thus, the problem may be in Perceval (which is the tool in charge of the collection phase).

perceval git https://github.com/pndaproject/platform-salt --git-path /tmp/x2

I compared the file names returned by Perceval with the ones listed in the GitHub UI, and the missing file is salt/hdp/setup_hadoop.sls.

I'm transferring the issue to the Perceval tracker.

valeriocos transferred this issue from chaoss/grimoirelab-elk May 1, 2020
@valeriocos
Member

Under the hood Perceval runs the following Git command to inspect the log history:

git log --reverse --topo-order --raw --numstat --pretty=fuller --decorate=full --parents -M -C -c --branches --tags --remotes=origin

The content retrieved for the commit c305399c1bccbdb0021395f1d6066419228846b6 is provided below.

commit c305399c1bccbdb0021395f1d6066419228846b6 2cd2631984666624d06770283a5aba2816082625
Author:     James Clarke <[email protected]>
AuthorDate: Fri May 12 12:08:18 2017 +0100
Commit:     James Clarke <[email protected]>
CommitDate: Wed Aug 9 11:39:12 2017 +0100
    Support for Hortonworks HDP
    
    PNDA-2445
...
:100644 100644 5abe419 05cff9b M	salt/cdh/setup_hadoop.sls
:100644 100644 5abe419 a899e66 C050	salt/cdh/setup_hadoop.sls	salt/hdp/setup_hadoop.sls
...
2	2	salt/cdh/setup_hadoop.sls
33	40	salt/{cdh => hdp}/setup_hadoop.sls
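For anyone following along, here is a minimal sketch of how the two --raw lines and the two --numstat lines above could be parsed. This is not Perceval's parser: parse_raw and expand_numstat_path are hypothetical helpers, and the input lines are copied from the output above assuming tabs as field separators.

```python
import re

# Lines copied from the git log output above (assumed to be tab-separated).
RAW_LINES = [
    ":100644 100644 5abe419 05cff9b M\tsalt/cdh/setup_hadoop.sls",
    ":100644 100644 5abe419 a899e66 C050\tsalt/cdh/setup_hadoop.sls\tsalt/hdp/setup_hadoop.sls",
]
NUMSTAT_LINES = [
    "2\t2\tsalt/cdh/setup_hadoop.sls",
    "33\t40\tsalt/{cdh => hdp}/setup_hadoop.sls",
]

BRACES = re.compile(r"\{(.*?) => (.*?)\}")


def parse_raw(line):
    """Split a --raw line into (action, old path, new path). Hypothetical helper."""
    meta, *paths = line.lstrip(":").split("\t")
    action = meta.split()[-1]
    old = paths[0]
    new = paths[1] if len(paths) > 1 else None
    return action, old, new


def expand_numstat_path(path):
    """Expand 'a/{x => y}/b' into (old, new); plain paths map to themselves."""
    match = BRACES.search(path)
    if not match:
        if " => " in path:
            old, new = path.split(" => ")
            return old, new
        return path, path
    old = path[:match.start()] + match.group(1) + path[match.end():]
    new = path[:match.start()] + match.group(2) + path[match.end():]
    # An empty side such as 'a/{ => y}/b' leaves a doubled slash; collapse it.
    return old.replace("//", "/"), new.replace("//", "/")


raw = [parse_raw(line) for line in RAW_LINES]

stats = []
for line in NUMSTAT_LINES:
    # Binary files use '-' instead of numbers; that case is not handled here.
    added, removed, path = line.split("\t")
    stats.append((int(added), int(removed), *expand_numstat_path(path)))

for entry in raw + stats:
    print(entry)
```

Both raw entries come out with the same old path, salt/cdh/setup_hadoop.sls, once as a modification and once as the source of a copy.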

The JSON generated for that commit is:

{
    "backend_name": "Git",
    "backend_version": "0.12.0",
    "category": "commit",
    "classified_fields_filtered": null,
    "data": {
        "Author": "James Clarke <[email protected]>",
        "AuthorDate": "Fri May 12 12:08:18 2017 +0100",
        "Commit": "James Clarke <[email protected]>",
        "CommitDate": "Wed Aug 9 11:39:12 2017 +0100",
        "commit": "c305399c1bccbdb0021395f1d6066419228846b6",
        "files": [
            ...
            {
                "action": "C050",
                "added": "33",
                "file": "salt/cdh/setup_hadoop.sls",
                "indexes": [
                    "5abe419",
                    "a899e66"
                ],
                "modes": [
                    "100644",
                    "100644"
                ],
                "newfile": "salt/hdp/setup_hadoop.sls",
                "removed": "40"
            },
            ...
        ],
        "message": "Support for Hortonworks HDP\n\nPNDA-2445",
        "parents": [
            "2cd2631984666624d06770283a5aba2816082625"
        ],
        "refs": []
    },
    "origin": "https://github.com/pndaproject/platform-salt",
    ...
    "uuid": "16d77c2ccc2de8218a8932f2a7d2b3f5165ee2fd"
}

Analysis

  1. The total number of files is 71 when running the Git command, but Perceval returns only 70.
  2. The file salt/hdp/setup_hadoop.sls is captured by the Git command, which marks it as copied (action C050) from salt/cdh/setup_hadoop.sls.
  3. For the file salt/hdp/setup_hadoop.sls, GitHub reports +76 -0, while Perceval reports +33 -40.
  4. The numstat entry "2 2 salt/cdh/setup_hadoop.sls" is not returned by Perceval.
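As a sanity check, the per-file numbers above seem to account for the whole gap between the two sets of totals. The sketch below only uses figures quoted in this thread, plus one assumption: that GitHub counts the salt/cdh/setup_hadoop.sls modification as +2 -2, the same as the numstat output.

```python
# Per-file (added, removed) counts quoted in this thread.
perceval_copy = (33, 40)    # Perceval: salt/hdp/setup_hadoop.sls, treated as a copy
github_new_file = (76, 0)   # GitHub: salt/hdp/setup_hadoop.sls, shown as a new file
# numstat entry for salt/cdh/setup_hadoop.sls, missing from Perceval's output.
# Assumption: GitHub counts this modification as +2 -2 as well.
cdh_modification = (2, 2)

perceval_total = {"added": 3040, "removed": 226}
github_total = {"added": 3085, "removed": 188}

# Swap Perceval's copy entry for GitHub's view of the file and restore the
# entry that Perceval dropped.
reconciled = {
    "added": perceval_total["added"] - perceval_copy[0]
    + github_new_file[0] + cdh_modification[0],
    "removed": perceval_total["removed"] - perceval_copy[1]
    + github_new_file[1] + cdh_modification[1],
}

print(reconciled)
print(reconciled == github_total)
```

If the assumption holds, the difference of 45 added and 38 removed lines comes entirely from these two entries for setup_hadoop.sls.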

Summary
The file salt/cdh/setup_hadoop.sls is inserted twice in the dict self.commit_files: https://github.com/chaoss/grimoirelab-perceval/blob/master/perceval/backends/core/git.py#L680. The first time the entry contains this data:

{
  'modes': ['100644', '100644'], 
  'indexes': ['5abe419', '05cff9b'], 
  'action': 'M', 
  'file': 'salt/cdh/setup_hadoop.sls', 
  'newfile': None
}

The second time the entry contains this data:

{
  'modes': ['100644', '100644'], 
  'indexes': ['5abe419', 'a899e66'], 
  'action': 'C050', 
  'file': 'salt/cdh/setup_hadoop.sls', 
  'newfile': 'salt/hdp/setup_hadoop.sls'
}

The information loss in Perceval occurs in the method _handle_stats_data (ref: https://github.com/chaoss/grimoirelab-perceval/blob/master/perceval/backends/core/git.py#L694). In particular, if the filename already exists in the commit_files dict, the previous information is replaced by the new one.

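        if filename not in self.commit_files:
            self.commit_files[filename] = {'file': filename}

        # Assigned unconditionally, so a second numstat entry for the same
        # filename overwrites the counts stored by the first one.
        self.commit_files[filename]['added'] = data['added']
        self.commit_files[filename]['removed'] = data['removed']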
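The overwrite can be reproduced outside Perceval with a simplified stand-in. In the sketch below, commit_files is a plain module-level dict and handle_stats_data is a hypothetical function that mirrors the excerpt above; it is not the actual Perceval method. It also assumes both numstat entries resolve to the old path, which is what the JSON above suggests, since the copy is stored under salt/cdh/setup_hadoop.sls.

```python
# Simplified stand-in for the parser state; not Perceval's actual class.
commit_files = {}


def handle_stats_data(data):
    """Mirror the excerpt above: the last entry for a filename wins."""
    filename = data["file"]
    if filename not in commit_files:
        commit_files[filename] = {"file": filename}
    commit_files[filename]["added"] = data["added"]
    commit_files[filename]["removed"] = data["removed"]


# Assumption: both numstat entries are keyed by the old path.
handle_stats_data({"file": "salt/cdh/setup_hadoop.sls", "added": "2", "removed": "2"})
handle_stats_data({"file": "salt/cdh/setup_hadoop.sls", "added": "33", "removed": "40"})

print(commit_files)
```

Only one entry survives, carrying the counts of the second call, so the "2 2" entry is lost.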

@linonymous
Contributor Author

linonymous commented May 1, 2020

@valeriocos Thanks a lot for this in-depth RCA!

As I understand it, commits with renamed files will suffer data loss?

@valeriocos
Member

You're welcome @linonymous! We should run more tests, but I would say that the data loss occurs when the same file appears twice in the commit (e.g., a modification of the old file plus a rename).

@sduenas
Member

sduenas commented May 5, 2020

We have two problems here:

  • Files that have two actions in the same commit. In this case the numbers are wrong. We should aggregate the data. I wasn't expecting to find something like this.
  • Files that are renamed. In this case, I think it is GitHub that gets it wrong. Like Git, we don't consider these files as new ones.
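One possible reading of "aggregate the data" for the first point is to sum the counts per filename instead of overwriting them. The sketch below is only an illustration of that idea, not a proposed patch: aggregate_stats is a hypothetical helper, and the two entries are the numstat values from this thread, both keyed by the old path as an assumption.

```python
def aggregate_stats(entries):
    """Sum added/removed per filename instead of keeping only the last entry."""
    totals = {}
    for entry in entries:
        name = entry["file"]
        stats = totals.setdefault(name, {"file": name, "added": 0, "removed": 0})
        stats["added"] += int(entry["added"])
        stats["removed"] += int(entry["removed"])
    return totals


# numstat values from this thread; keying both by the old path is an assumption.
entries = [
    {"file": "salt/cdh/setup_hadoop.sls", "added": "2", "removed": "2"},
    {"file": "salt/cdh/setup_hadoop.sls", "added": "33", "removed": "40"},
]

totals = aggregate_stats(entries)
print(totals)
```

With summing, the file ends up with 35 added and 42 removed lines. That would still differ from what GitHub shows, because of the second point about how copied or renamed files are counted.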

sduenas added the bug label Oct 11, 2023