Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#569] Allow finding previous authors for ignored commits #1565

Merged

Conversation

FH-30
Copy link
Contributor

@FH-30 FH-30 commented Jul 13, 2021

Resolves #569.

Enable RepoSense to find previous authors based on IgnoreCommitList

The current implementation of RepoSense assigns lines last modified by a commit in the IgnoreCommitList to an unknown author. This means that when a commit involving a bulk change occurs (ex: a formatting change) and that commit is added to the ignore commit list, the authorship.json file generated by RepoSense will not contain much meaningful information as most of the lines are assigned to an unknown author.

This PR fixes this issue by utilizing git's --ignore-revs-file command. RepoSense will create a .git-blame-ignore-revs file (containing the commit hashes from the IgnoreCommitList in the final config object passed to the analyzeRepo method in ReportGenerator.java) in the directory of each cloned repository that has the option to find previous authors enabled. This can be enabled either by adding yes to the Find Previous Authors column in repo-config.csv or by adding --find-previous-authors / -F through the CLI. The proposed implementation in this PR has the default behavior to still assign lines due to ignored commits to an unknown author in the event that there is no previous commit that touched that line before the ignored commit. Note that the blamed lines still remain the same (can be seen in authorship.json file generated) but only the author to who it is blamed to is changed when Finding Previous Authors is enabled.

Commit Message:

[#569] Allow finding previous authors for ignored commits

The current implementation of RepoSense assigns lines last modified by
a commit in the IgnoreCommitList to an unknown author. This means that
when a commit involving a bulk change occurs (ex: a formatting change)
and that commit is added to the ignore commit list, the
```authorship.json``` file generated by RepoSense will not contain
much meaningful information as most of the lines are assigned to an
unknown author.

Let's add a --find-previous-authors flag or -F to tell RepoSense to
use Git blame's --ignore-revs-file argument and generate a
.git-blame-ignore-revs file in the repositories analyzed to allow
RepoSense users to have a more meaningful authorship report from
RepoSense's analysis in the event that a bulk change occurs in the
repositories analyzed.

@FH-30 FH-30 marked this pull request as draft July 13, 2021 16:45
@damithc
Copy link
Collaborator

damithc commented Jul 15, 2021

@reposense/21-22-active-team any feedback for @FH-30 ?

@dcshzj
Copy link
Member

dcshzj commented Jul 17, 2021

I think the idea is great! However, I have some questions about this feature:

  1. Any reason for having both a CLI argument and a repo-config.csv column for specifying whether to find previous authors? What happens when they conflict?
  2. When this setting is enabled and a particular ignored commit adds a new file, will subsequent commits to that new file have the entire file attributed to that author, or will it retain the previous behavior of attributing to an unknown author?
  3. This Git blame ignore revs feature in Git was introduced in 2019 which is rather recent (version 2.23). Luckily GitHub Actions can support it as it is running >=2.31, but some users may still be running an older version of Git on their computers. Will there be a fail-safe mechanism implemented in the event that the user is running RepoSense locally that uses an older version of Git?

@FH-30
Copy link
Contributor Author

FH-30 commented Jul 19, 2021

Thanks for the feedback @dcshzj , these are my answers to your questions:

1.
The reason why I have decided to add both a CLI argument and a repo-config.csv column for specifying whether to find previous authors is to allow for flexibility. There might be scenarios where we want to just ignore commits (without finding previous authors) for some repositories but we want to find the previous authors of certain commits for some repositories. Placing it in repo-config.csv allows for users to utilize this feature in RepoSense for such scenarios . The CLI argument on the other hand makes it easy if the user wants to set the finding previous authors mode for all the repositories without having to add the yes flag to all rows in the FInd Previous Authors column in repo-config.csv.

If the CLI argument happens to conflict with the repo-config.csv argument, there are four possible scenarios as shown in the table below:

repo-config.csv for Find Previous Authors Column CLI for -F flag Find Previous Authors
False False False
True False True
False True True
True True True

This is similar to how current conflicting arguments in RepoSense are handled.

2.
I'm not sure if I got your question right, do correct me if I got it wrong (I'm not sure which author is "that author" referring to). Assuming "that author" is the author that modified the file, when this setting is enabled and a particular ignored commit adds a new file then subsequent commits to that file will not necessarily attribute that whole file to the author. Different lines of the file will remain attributed to the individual authors that modified those lines. An example of how this feature works in RepoSense:

Commits in chronological order:
Commit 0dfba by author Alice adds a new txt file to a repository called Test_Repo.
Commit 5dedf by author Bob modifies the txt file by adding something to line 1.
Commit abde5 by author Dave modifies the txt file by modifying line 1.
Commit 1ab41 by author Eve modifies the txt file by adding something to line 2.

Some different possibilities:

Content of Ignore Commit List Line 1 blamed to Line 2 blamed to
Dave Eve
0dfba Dave Eve
5dedf;1ab41 Dave Unknown
abde5 Bob Eve
5dedf;abde5 Unknown Eve
0dfba;5dedf;abde5 Unknown Eve
0dfba;5dedf;abde5;1ab41 Unknown Unknown

The creating of a file is never taken note of in authorship.json by RepoSense (this is not due to the PR change but it is in fact RepoSense's default behaviour) therefore Alice is never blamed for anything. The only evidence Alice created the file is probably in commits.json.

3.
I was also contemplating about this, is it preferable to throw an error (asking the user to upgrade his version of git if he wants to use this feature) when Find Previous Authors is enabled but the current git version of the user is not >= 2.23 or is it better to try to resort to using the git hyper-blame tool suggested in the issue discussion even though it seems to have been discontinued (no longer maintained) and is probably not as accurate as git's own implementation.

For this last question, maybe prof @damithc can give his opinion too, thanks!

@damithc
Copy link
Collaborator

damithc commented Jul 20, 2021

@FH-30 Good to see you are doing a deep investigation on this issue. Keep it up!

I was also contemplating about this, is it preferable to throw an error (asking the user to upgrade his version of git if he wants to use this feature) when Find Previous Authors is enabled but the current git version of the user is not >= 2.23 or is it better to try to resort to using the git hyper-blame tool suggested in the issue discussion even though it seems to have been discontinued (no longer maintained) and is probably not as accurate as git's own implementation.

For this last question, maybe prof @damithc can give his opinion too, thanks!

I think no need to use an additional tool if the user's git version is not matching, in the interest of keeping the code simple.

In general, it is OK to expect the user to have a recent version of Git, although we may not want to go for a bleeding edge version as the CI platforms we piggy back on e.g., GitHub Actions, Travis, etc. might not have support for very new Git versions.

It is ideal (but not compulsory) if the rest of the features work with older versions of Git if only the feature in concern requires a new version of Git.

@FH-30
Copy link
Contributor Author

FH-30 commented Jul 24, 2021

Hi @dcshzj, may I have your feedback based on my previous response? Thanks in advance!

@FH-30
Copy link
Contributor Author

FH-30 commented Jul 26, 2021

I have decided to throw an error instead here at line 81, can inform me if it is advisable not to do so.

@dcshzj
Copy link
Member

dcshzj commented Jul 28, 2021

@FH-30 Apologies for the delay, here are my replies to your comment.

The reason why I have decided to add both a CLI argument and a repo-config.csv column for specifying whether to find previous authors is to allow for flexibility.

Sounds good, let's go ahead with your implementation of having the configuration on both repo-config.csv and as a CLI argument. Do make sure that the documentation is clear about how the CLI argument has a higher priority (i.e. even if a user explicitly disables finding previous authors for a particular repository, it will still be done when the CLI argument is provided).

Different lines of the file will remain attributed to the individual authors that modified those lines.

Yep, this is the ideal behaviour. I was thinking of the following scenario:

  • Commit A by person A creates a new file of 500 lines.
  • Commit B by person B modifies the same file by changing only the last line

If commit A is added to the ignore commit list and Git cannot find any previous authors to this file, I am concerned whether Git will assign the rest of the 499 lines to person B. RepoSense's default behaviour is to assign to an unknown author, but I am not sure if Git blame ignore revs will do the same as RepoSense, or does it perform differently.

I was also contemplating about this, is it preferable to throw an error (asking the user to upgrade his version of git if he wants to use this feature) when Find Previous Authors is enabled but the current git version of the user is not >= 2.23 or is it better to try to resort to using the git hyper-blame tool suggested in the issue discussion even though it seems to have been discontinued (no longer maintained) and is probably not as accurate as git's own implementation.

My personal preference is to notify the user that this feature is not supported after checking the current Git version that the user has, and perform the default behaviour of not finding previous authors. I think let's not introduce hyper-blame into our project.

Finally, do mark this PR as ready to review, when you are ready to have it reviewed.

@FH-30
Copy link
Contributor Author

FH-30 commented Jul 28, 2021

Thanks for the review @dcshzj , no worries regarding the delay! I will proceed with implementing the tests and documentation shortly!

@FH-30
Copy link
Contributor Author

FH-30 commented Aug 29, 2021

@dcshzj All the checks have passed, you may proceed to review this PR! I do hope this PR gets wrapped up by the end of this week, thanks in advance!

@FH-30
Copy link
Contributor Author

FH-30 commented Sep 4, 2021

Hi @dcshzj , sorry for bothering (I know you must be busy too). I just hope that you can give a review over the weekend as the deadline for this PR is next Friday, thanks in advance!

src/main/java/reposense/git/GitVersion.java Show resolved Hide resolved
src/main/java/reposense/report/ReportGenerator.java Outdated Show resolved Hide resolved
src/main/java/reposense/util/FileUtil.java Outdated Show resolved Hide resolved
src/main/java/reposense/util/FileUtil.java Outdated Show resolved Hide resolved
src/test/java/reposense/parser/RepoConfigParserTest.java Outdated Show resolved Hide resolved
src/test/java/reposense/template/GitTestTemplate.java Outdated Show resolved Hide resolved
src/main/java/reposense/model/RepoConfiguration.java Outdated Show resolved Hide resolved
src/main/java/reposense/report/ReportGenerator.java Outdated Show resolved Hide resolved
@dcshzj dcshzj self-requested a review September 6, 2021 07:57
Copy link
Member

@dcshzj dcshzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@dcshzj dcshzj requested a review from a team September 6, 2021 08:15
@FH-30
Copy link
Contributor Author

FH-30 commented Sep 6, 2021

Hi, @fzdy1914. I'm guessing that you must be the stage 2 reviewer for this PR. I do hope you can review this PR soon as the deadline is this Friday, thanks!

@damithc
Copy link
Collaborator

damithc commented Sep 6, 2021

I do hope you can review this PR soon as the deadline is this Friday, thanks!

It's a soft deadline @FH-30 Can take a few more days if needed.

@FH-30
Copy link
Contributor Author

FH-30 commented Sep 6, 2021

Ah that's good to hear, thanks prof!

@fzdy1914
Copy link
Member

fzdy1914 commented Sep 9, 2021

I am sorry that this PR is too long for me to review in time, @dcshzj Can just merge before Friday.

@dcshzj dcshzj removed the request for review from a team September 9, 2021 03:25
@dcshzj dcshzj changed the title [#569] IgnoreCommitList: enhance to produce better results [#569] Allow finding previous authors for ignored commits Sep 10, 2021
@dcshzj
Copy link
Member

dcshzj commented Sep 10, 2021

@FH-30 Looks like the system tests are failing after #1586 was merged. Can I trouble you to look into it?

@FH-30
Copy link
Contributor Author

FH-30 commented Sep 10, 2021

Since I added new system tests for this PR and I noticed that the PR you just merged into the master branch had all the expected authorship.json and commits.json changed for the existing system tests, I believe it is the case that I need to change the expected authorship.json and commits.json for the new system tests in this PR. @dcshzj is it alright if I do so?

@dcshzj
Copy link
Member

dcshzj commented Sep 10, 2021

@FH-30 Yup please do, thanks!

@FH-30
Copy link
Contributor Author

FH-30 commented Sep 10, 2021

@dcshzj All the tests are now passing!

@dcshzj dcshzj merged commit 387401f into reposense:master Sep 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IgnoreCommitList: enhance to produce better results
5 participants