Create computed nCommit & nAuthor per month by project #25

Open
noah-22 opened this issue Feb 18, 2021 · 8 comments
Labels
clickhouse enhancement New feature or request

Comments

noah-22 commented Feb 18, 2021

@audrism

When analyzing a project's activity over time, commit counts and unique author counts per month (counting each author once, even if they made multiple commits that month) are great metrics. Could we pre-compute these metrics in one of our databases to allow for a query of Project.month.nCommits and Project.month.nAuthors?
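
To make the requested metrics concrete, here is a minimal sketch of the aggregation. It assumes commit records are available as `(project, author, unix_timestamp)` tuples and that months are bucketed in UTC; the function name, record layout, and sample data are hypothetical and not part of any existing database.

```python
from collections import defaultdict
from datetime import datetime, timezone


def monthly_activity(commits):
    """Count commits and distinct authors per (project, month).

    `commits` is an iterable of (project, author, unix_timestamp) tuples.
    This record layout is an assumption made for illustration.
    """
    n_commits = defaultdict(int)
    authors = defaultdict(set)
    for project, author, ts in commits:
        # Bucket by calendar month in UTC (an assumption).
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        key = (project, month)
        n_commits[key] += 1
        # A set counts each author once per month, however many commits they made.
        authors[key].add(author)
    return {
        key: {"nCommits": n_commits[key], "nAuthors": len(authors[key])}
        for key in n_commits
    }


# Hypothetical sample data.
sample = [
    ("proj_a", "alice", 1612137600),  # 2021-02-01 UTC
    ("proj_a", "alice", 1612224000),  # 2021-02-02 UTC
    ("proj_a", "bob", 1612310400),    # 2021-02-03 UTC
    ("proj_a", "alice", 1614556800),  # 2021-03-01 UTC
]

for (project, month), stats in sorted(monthly_activity(sample).items()):
    print(project, month, stats["nCommits"], stats["nAuthors"])
```

In the sample, alice commits twice in February but contributes only one to nAuthors for that month.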

audrism added the clickhouse and enhancement labels Feb 18, 2021
Collaborator

audrism commented Feb 18, 2021

OK, this may be a good candidate for CH (ClickHouse), as the number of records would exceed a billion?

Collaborator

audrism commented Feb 18, 2021

Also, what about number of blobs, types of files changed, APIs involved?

Author

noah-22 commented Feb 19, 2021

The number and types of files changed, along with blobs and APIs, would be very useful!

I propose a format here, in an existing ClickHouse table conversation/issue, since this is now a ClickHouse-related issue.

Collaborator

audrism commented Feb 20, 2021

/da1_data/basemaps/gz/P2mncFullS*.s
has project;month;nAuthors;nCommits
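
For anyone consuming this map, a minimal parsing sketch follows. It assumes each line is one semicolon-separated record in the field order given above and that project names contain no semicolons; the `MonthlyRecord` type, the function name, and the sample lines (including the month format) are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class MonthlyRecord(NamedTuple):
    project: str
    month: str  # kept as a string; the exact month encoding is not assumed
    n_authors: int
    n_commits: int


def parse_records(lines: Iterable[str]) -> Iterator[MonthlyRecord]:
    """Parse project;month;nAuthors;nCommits lines into records."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        project, month, n_authors, n_commits = line.split(";")
        yield MonthlyRecord(project, month, int(n_authors), int(n_commits))


# Hypothetical sample lines, not taken from the real map.
sample_lines = ["proj_a;2021-02;2;3\n", "proj_a;2021-03;1;1\n"]
records = list(parse_records(sample_lines))
for rec in records:
    print(rec.project, rec.month, rec.n_authors, rec.n_commits)
```

If the `.s` files are gzip-compressed (an assumption based on the `gz` directory name), the same function should work on a handle opened with `gzip.open(path, "rt")`.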

Author

noah-22 commented Feb 26, 2021

This basemap is wonderful, thank you!

My group from the hackathon, Inflection Points, is interested in continuing research. We are interested in rootFork projects, though, which are often not the deforked (P) project. Blobs changed would also be a great metric to have. Could the map be expanded to include both of these?

Collaborator

audrism commented Feb 26, 2021

Take a look at the new P_metadata.S in MongoDB: it might be good for sampling, though it does not have monthly blobs. You can always find the P for a rootFork via the p2P map. Otherwise these projects should be nearly identical. Do you have instances where they are not?
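
As a rough sketch of that lookup, assuming the p2P map can be read as semicolon-separated `project;deforked_project` lines (the line format, function names, and sample entries here are all hypothetical):

```python
def load_p2P(lines):
    """Build a project -> deforked project (P) dictionary.

    Assumes each line looks like 'p;P'; this layout is an assumption.
    """
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def deforked(project, p2P):
    """Return P for a project.

    Falling back to the project itself when it is missing from the map
    is an assumption, not documented behavior.
    """
    return p2P.get(project, project)


# Hypothetical sample entries.
p2P = load_p2P(["user_fork;upstream_repo\n", "upstream_repo;upstream_repo\n"])
print(deforked("user_fork", p2P))
print(deforked("unlisted_repo", p2P))
```

With P in hand for a rootFork, the existing per-P monthly map can be queried directly instead of computing a separate time series.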

Author

noah-22 commented Feb 26, 2021

Is WoC.P_metadata.S an improved version of WoC.proj_metadata.S? They appear to have similar fields, with the addition of monthly commit count and core size. I have questions about these additions, though.

  1. What is the purpose of the NumForks field, and what advantage does it provide over CommunitySize? NumForks often seems to be 0, and is therefore only useful when P = RootFork.
  2. Is Core defined as the authors who make up 80% of the contributions to the project? I've heard this metric mentioned before and am wondering if it is the same one used here.

I'll take it to be true that P is similar enough to the RootFork project that computing time series for both is not necessary.

Collaborator

audrism commented Feb 27, 2021

  1. NumForks is derived from GitHub, with all the associated issues (often incorrect, many missing, hierarchical), while CommunitySize is more general and based purely on shared commits. RootFork is also derived from GH.
  2. Correct.
  3. Generally, I would avoid using GH attributes like NumForks or RootFork, since ghTorrent data is spotty and the
    actual fork relationships in GH are not always what you think they are. Yes, I would not bother getting a separate time series for the root fork unless you have many examples to the contrary.
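
For reference, the 80% core definition confirmed in item 2 can be sketched as below. This assumes contributions are measured in commits and that the core is the smallest set of top authors whose commits reach the threshold; the function name and sample counts are hypothetical, and the way the Core field is actually computed in P_metadata may differ.

```python
def core_authors(commit_counts, threshold=0.8):
    """Return the top authors who together reach `threshold` of all commits.

    `commit_counts` maps author -> number of commits. Measuring
    contributions in commits is an assumption made for this sketch.
    """
    total = sum(commit_counts.values())
    if total == 0:
        return []
    core = []
    covered = 0
    # Most active authors first; ties broken by name for determinism.
    ranked = sorted(commit_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for author, n in ranked:
        if covered >= threshold * total:
            break
        core.append(author)
        covered += n
    return core


# Hypothetical sample: alice and bob together cover 85% of 100 commits.
counts = {"alice": 70, "bob": 15, "carol": 10, "dave": 5}
core = core_authors(counts)
print(core, len(core))
```

The core size is then simply `len(core)`.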
