Gitlogg is broken for some repositories #7

Closed
Inventitech opened this issue Oct 11, 2016 · 21 comments

Comments

@Inventitech
Contributor

gitlogg-parse-json.js is broken.

sh gitlogg-generate-log.sh                   
-e Generating git log for all repositories located at '../travistorrent-tools'. This might take a while!
-e The file ./gitlogg.tmp generated in: 0s
Generating JSON output...
[stdin]:40
  author_name = item[16].replace(/"/g, "'"),
                        ^

TypeError: Cannot read property 'replace' of undefined
    at [stdin]:40:25
    at Array.reduce (native)
    at [stdin]:23:4
    at Object.exports.runInThisContext (vm.js:54:17)
    at Object.<anonymous> ([stdin]-wrapper:6:22)
    at Module._compile (module.js:409:26)
    at node.js:579:27
    at nextTickCallbackWith0Args (node.js:420:9)
    at process._tickCallback (node.js:349:13)

Minimally demonstrative example: Analyse repository TestRoots/travistorrent-tools.

Thanks for looking into this :-), I liked the way you prepare the git log output in gitlogg-generate-log.sh. Should this problem be related to git log behaving strangely (printing/not printing empty lines), I might have a clean fix with awk.

@dreamyguy
Owner

Hi there @Inventitech, I missed this one completely, been away for a while.

Let me look into this and see if I can solve it within the gitlogg-parse-json.js.

Through this project I've come to realise that git log's output is surprisingly inconsistent across multiple repositories, but I've managed to isolate different issues by testing with 490+ repos so far.

Cheers for the heads-up!

@dreamyguy
Owner

Hi there again, I have some good news! ✨

I got the json output for TestRoots/travistorrent-tools rendered correctly, and it's on this gist.

It ran pretty fast too. 🚀

I could not reproduce the error you've reported with gitlogg-parse-json.js. I used what I call the "Simple Mode" (https://github.com/dreamyguy/gitlogg#simple-mode), by simply doing a git clone within the repos/ folder, which should be created at the repo's root. The repos directory is listed in .gitignore, so whatever is placed there won't be tracked by git.

@Inventitech
Contributor Author

Hey @dreamyguy, Thanks for the reply.

Interesting. I suspect there is some inconsistency in the intermediate representation extracted from git log. Which version of Git are you running?

➜  gitlogg git:(master) git --version
git version 2.9.3

Does parsing my gitlogg.tmp work for you? If it does, the error lies elsewhere.

@OrenRysn

OrenRysn commented Nov 10, 2016

Hey @dreamyguy! I'm actually running your gitlogg tool as well. Looking to utilize it to provide some clean, useful information across multiple repositories.

I'm seeing the same issue, but I've yet to debug the root cause. I've added a loop in gitlogg-parse-json.js to print every value in the item array:

var output = fs.readFileSync('gitlogg.tmp', 'utf8')
  .trim()
  .split('\n')
  .map(line => line.split('\\t'))
  .reduce((commits, item) => {

+    var i = 0;
+    for (i = 0; i < item.length; i++) {
+        console.log(chalk.blue('item[' + i + '] = ' + item[i]))
+    }
+
    // vars based on sequential values ( sanitise " to ' on fields that accept user input )

At some point while running gitlogg.sh, item[68], which normally holds the value for "stats", shows up empty, and the following item[0], which should contain only the string "commits", also contains the stats.

Expected (from same output just prior to issue)

item[65] = commit_notes
item[66] =
item[67] = stats
item[68] = 1 file changed, 1 insertion(+), 2 deletions(-)
item[0] = commits
item[1] = repository

Actual:

item[65] = commit_notes
item[66] =
item[67] = stats
item[68] =
item[0] = 2 files changed, 11 insertions(+), 41 deletions(-) commits
item[1] = repository

Trying to determine if there's an issue with the parsing logic that removes/converts newlines/line-breaks.

Edit: To clarify, the first occurrence of this issue does not cause the error. From that point on, though, every subsequent group of items has its stats value placed in item[0] instead of item[68], which leaves one last line at the bottom of gitlogg.tmp containing just the stats value.

When that final line is parsed, only item[0] holds any information, while item[1] through item[68] are undefined, which throws the error.
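
A minimal sketch of that failure, assuming fields are separated by the literal two-character sequence `\t` as in the parser snippet above. The sample lines and the `splitFields` helper are made up for illustration and are not taken from gitlogg itself.

```javascript
// Hypothetical helper: split one line of gitlogg.tmp into its fields.
// The separator is the literal two characters backslash + t, not a tab.
function splitFields(line) {
  return line.split('\\t');
}

// A well-formed line (shortened to a few fields for illustration).
const goodLine = 'commits\\trepository\\tdemo\\tstats\\t 1 file changed, 1 insertion(+)';
// A trailing line that holds nothing but the stats value.
const strayLine = ' 2 files changed, 11 insertions(+), 41 deletions(-)';

const good = splitFields(goodLine);
const stray = splitFields(strayLine);

console.log(good.length);   // 5
console.log(stray.length);  // 1
console.log(stray[16]);     // undefined, so stray[16].replace(...) throws a TypeError
```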

Edit 2: Autofilled the wrong username at the top. Meant to direct to @dreamyguy

@OrenRysn

After breaking down the git log and parsing step by step, I discovered that there was actually a carriage return (^M) character in one of the Git commit messages being parsed, which was adding an extra newline and throwing all subsequent parsing off as a result.

Modifying the parsing in gitlogg-generate-log.sh to include a carriage return deletion allows me to get past the issue.

        git log --all --no-merges --shortstat --reverse --pretty=format:'commits\trepository\t'"${PWD##*/}"'\tcommit_hash\t%H\tcommit_hash_abbreviated\t%h\ttree_hash\t%T\ttree_hash_abbreviated\t%t\tparent_hashes\t%P\tparent_hashes_abbreviated\t%p\tauthor_name\t%an\tauthor_name_mailmap\t%aN\tauthor_email\t%ae\tauthor_email_mailmap\t%aE\tauthor_date\t%ad\tauthor_date_RFC2822\t%aD\tauthor_date_relative\t%ar\tauthor_date_unix_timestamp\t%at\tauthor_date_iso_8601\t%ai\tauthor_date_iso_8601_strict\t%aI\tcommitter_name\t%cn\tcommitter_name_mailmap\t%cN\tcommitter_email\t%ce\tcommitter_email_mailmap\t%cE\tcommitter_date\t%cd\tcommitter_date_RFC2822\t%cD\tcommitter_date_relative\t%cr\tcommitter_date_unix_timestamp\t%ct\tcommitter_date_iso_8601\t%ci\tcommitter_date_iso_8601_strict\t%cI\tref_names\t%d\tref_names_no_wrapping\t%D\tencoding\t%e\tsubject\t%s\tsubject_sanitized\t%f\tcommit_notes\t%N\tstats\t' |
          sed '/^[ \t]*$/d' |               # remove all newlines/line-breaks, including those with empty spaces
+          tr -d '\r' |                      # Delete carriage returns
          tr '\n' 'ò' |                     # convert newlines/line-breaks to a character, so we can manipulate it without much trouble
          tr '\r' 'ò' |                     # convert carriage returns to a character, so we can manipulate it without much trouble
          sed 's/tòcommits/tòòcommits/g' |  # because some commits have no stats, we have to create an extra line-break to make `paste -d ' ' - -` consistent
          tr 'ò' '\n' |                     # bring back all line-breaks
          sed '{
              N
              s/[)]\n\ncommits/)\
          commits/g
          }' |                              # some rogue mystical line-breaks need to go down to their knees and beg for mercy, which they're not getting
          paste -d ' ' - -                  # collapse lines so that the `shortstat` is merged with the rest of the commit data, on a single line
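
To see why a stray carriage return ends up as an extra line, here is a rough JavaScript simulation of just the `tr` steps above (the `sed` and `paste` steps are left out). The function name and the sample record are made up for illustration.

```javascript
// Simulates the relevant `tr` steps of the pipeline on a string.
// `deleteCarriageReturns` mirrors the added `tr -d '\r'` step.
function simulatePipeline(text, deleteCarriageReturns) {
  let out = text;
  if (deleteCarriageReturns) {
    out = out.replace(/\r/g, '');   // tr -d '\r'
  }
  out = out.replace(/\n/g, 'ò');    // tr '\n' 'ò'
  out = out.replace(/\r/g, 'ò');    // tr '\r' 'ò'
  out = out.replace(/ò/g, '\n');    // tr 'ò' '\n'
  return out;
}

// Made-up record whose subject contains a carriage return.
const record = 'subject\\tFix bug\rmore text\\tstats\\t\n';

const before = simulatePipeline(record, false).split('\n').length;
const after = simulatePipeline(record, true).split('\n').length;
console.log(before); // 3
console.log(after);  // 2
```

Without the deletion step, the carriage return is turned into the placeholder and then into a real line break, so one record is spread over an extra line.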

@dreamyguy
Owner

dreamyguy commented Nov 15, 2016

Hi @Inventitech, thanks for posting the gitlogg.tmp output. It broke on the 5th line. :(

As @OrenRysn indirectly pointed out, gitlogg-parse-json.js is completely dependent on gitlogg-generate-log.sh to be able to parse gitlogg.tmp correctly, and if that fails the parsing will break.

Nearly all problems I've had with gitlogg-generate-log.sh so far were caused by unexpected characters finding their way into the git log. The carriage return ^M within placeholders was new to me, but I see now that this is possible and should be solved.

I'll look into this asap, and cheers for the feedback!

@dreamyguy
Owner

@Inventitech and by the way, we have the same git version, 2.9.3.

@dreamyguy
Owner

@OrenRysn only to comment on your first post, the index in the parser gets messed up when the output of gitlogg.tmp is broken, like the one mentioned above. It takes a single unexpected character on a single commit to break the whole structure of the temporary file, unfortunately. Characters that trigger a line-break are the absolute worst.

I'll see what I can do, and thanks again for the heads-up.

@dreamyguy
Owner

Hi guys, I did 3 commits that will hopefully help.

The first 2 didn't really correct anything, but help show what's happening on the console (how many repos will be parsed and which repo is getting its git log generated, live).

All credit for the 3rd commit goes to @OrenRysn 🏆 , but I did choose to replace carriage return with space instead of deleting it.

@Inventitech I'm really struggling to reproduce your problem. I haven't managed to get a broken gitlogg.tmp, even before my commits. I tried with your repo by itself (I pulled the latest changes) and mixed with 476 other repos, and it worked every time.

Could you pull my latest changes and try again?

@dreamyguy
Owner

dreamyguy commented Nov 19, 2016

Another update, with a few more commits since this issue was opened.

@Inventitech since I've been at it, I have successfully generated gitlogg.tmp from TestRoots/travistorrent-tools, by itself and alongside other repos, without a glitch. As long as that goes well, gitlogg-parse-json.js will do its job.

Try running gitlogg exactly as described here, with the latest changes, and let me know how it goes.

@Inventitech
Contributor Author

Error still persists :(

Generating git log for the one repository located at '../repos/*/'. This might take a while!
Outputting travistorrent-tools
The file ./gitlogg.tmp generated in: 1s
Generating JSON output...
[stdin]:42
  author_name = item[17].replace(/"/g, "'"),
                        ^

TypeError: Cannot read property 'replace' of undefined
    at [stdin]:42:25
    at Array.reduce (native)
    at [stdin]:23:4
    at Object.exports.runInThisContext (vm.js:54:17)
    at Object.<anonymous> ([stdin]-wrapper:6:22)
    at Module._compile (module.js:409:26)
    at node.js:579:27
    at nextTickCallbackWith0Args (node.js:420:9)
    at process._tickCallback (node.js:349:13)

@dreamyguy
Owner

dreamyguy commented Nov 21, 2016

That's really too bad @Inventitech. Is your gitlogg.tmp still breaking at the same line?

I have to find a way to validate gitlogg.tmp before running the js parser, otherwise one gets the impression that the problem is with the javascript.
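
One way such a validation could look, as a rough sketch only: it assumes a complete record has 69 fields (item[0] to item[68], going by the debug output earlier in this thread) separated by the literal sequence `\t`. The `findBrokenLines` name and the sample data are hypothetical.

```javascript
// Assumption: a complete record has 69 fields (item[0] to item[68]),
// based on the debug output earlier in this thread.
const EXPECTED_FIELDS = 69;

// Hypothetical validator: returns the 1-based numbers of lines whose
// field count differs from the expected one.
function findBrokenLines(tmpContents, expectedFields = EXPECTED_FIELDS) {
  return tmpContents
    .trim()
    .split('\n')
    .map((line, index) => ({ number: index + 1, fields: line.split('\\t').length }))
    .filter(entry => entry.fields !== expectedFields)
    .map(entry => entry.number);
}

// Tiny made-up example with 3 fields per record instead of 69.
const sample = ['a\\tb\\tc', 'a\\tb', 'a\\tb\\tc'].join('\n');
console.log(findBrokenLines(sample, 3)); // [ 2 ]
```

In practice the input could come from `fs.readFileSync('gitlogg.tmp', 'utf8')`, as in the parser snippet quoted earlier, and a non-empty result would point at the generated file rather than the JavaScript.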

I have prepared a gist with travistorrent-tools's gitlogg.json output, so you can play with it until a solution is found. 🎯

Just out of curiosity, what system are you on? I'm on OSX 10.11.6, El Capitan.

BTW I've pushed a new release today, which makes the initial setup simpler and less error-prone - hopefully. The README has been updated accordingly.

I'd try starting from scratch with a new git clone and take it from there.

@Inventitech
Contributor Author

I'm a Linux dude.

If I remember correctly, what breaks it is that occasionally, you just get a stats output for a whole range of commits, not a single commit. I'll have time to look into this in more depth in about two to three weeks.

@dreamyguy
Owner

@Inventitech while testing your pull-request with humongous git repositories, I came across a few problems around gitlogg-generate-log.sh, problems that break gitlogg.tmp's output.

I'll be creating issues on them when time allows.

@dreamyguy
Owner

dreamyguy commented Dec 15, 2016

I can now say, hand on heart, that this issue is resolved with the v0.1.9 release. 🎆

I've tested all these repos, in a single run of npm run gitlogg.

Note that the commit count is not updated, since I cloned these repos many weeks ago.

[634,959]  linux
[424,810]  android_kernel_yu_msm8916
[399,784]  core
[106,402]  odoo
[ 96,062]  nixpkgs
[ 70,822]  homebrew-core
[ 60,224]  rails
[ 45,091]  git
[ 23,494]  django
[  8,529]  react-native
[  7,590]  react
[  7,040]  beets
[    539]  travistorrent-tools
[    523]  fbctf
[    182]  flexbox-layout
[    118]  open-color
[     85]  color-consolidator
[     32]  sidhree-com
---------
1,886,349  total commit count

Stats:

gitlogg.tmp generation:     2,266 s ~= 37.76667 mins
gitlogg.json parsing:       137,331 ms ~= 2.28885 mins

gitlogg.tmp file size:      2,658,113,356 bytes ~= 2.65 GB
gitlogg.json file size:     1,515,488,035 bytes ~= 1.51 GB

gitlogg.tmp nr. lines:      1,809,164
gitlogg.json nr. lines:     1,809,168

There are fewer lines in the output file than the total number of commits, but that's because I exclude merges with the --no-merges option of git log.

Do try v0.1.9 out! 🏆

@Inventitech
Contributor Author

Still no luck, sir :(

➜  gitlogg git:(dd44796) ✗ sh scripts/gitlogg.sh -n 1
Generating git log for all 4 repositories located at './_repos/*/'. This might take a while!
Outputting ghtorrent-update
Outputting gi
Outputting gitlogg
Outputting UnifiedASATVisualizer
The file _tmp/gitlogg.tmp generated in: 5s
Parsing JSON output...
Something went wrong, _output/gitlogg.json could not be written / saved
[stdin]:162
  var time_array = author_date.split(' '),
                              ^

TypeError: Cannot read property 'split' of undefined
    at Transform.parser._transform ([stdin]:162:31)
    at Transform._read (_stream_transform.js:167:10)
    at Transform._write (_stream_transform.js:155:12)
    at doWrite (_stream_writable.js:300:12)
    at writeOrBuffer (_stream_writable.js:286:5)
    at Transform.Writable.write (_stream_writable.js:214:11)
    at LineStream.ondata (_stream_readable.js:542:20)
    at emitOne (events.js:77:13)
    at LineStream.emit (events.js:169:7)
    at readableAddChunk (_stream_readable.js:153:18)

@dreamyguy
Owner

@Inventitech I ran into this TypeError: Cannot read property 'split' of undefined all the time while using Gitlogg with xargs. The problem was that one commit got mixed with another on a single line, which threw the indexes off.
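
A small sketch of how lines with mixed-up commits could be spotted, assuming every record starts with the literal marker `commits\trepository\t` from the `--pretty=format:` string quoted earlier in this thread. The `countRecords` helper and the sample lines are hypothetical.

```javascript
// Assumption: every record starts with this literal marker, as in the
// `--pretty=format:` string quoted earlier in this thread.
const RECORD_MARKER = 'commits\\trepository\\t';

// Hypothetical check: how many records begin on a given line?
function countRecords(line) {
  return line.split(RECORD_MARKER).length - 1;
}

const single = 'commits\\trepository\\tfoo\\tstats\\t';
const merged = single + ' 1 file changed' + single;
console.log(countRecords(single)); // 1
console.log(countRecords(merged)); // 2
```

Any line for which this count is not exactly 1 would be a candidate for the kind of mix-up described here.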

Please do test this on a fresh clone of Gitlogg, so you're sure to have the very latest version, v0.1.9, which no longer has the parallelization changes.

I think you're still running an older version, since you created a pull-request two hours ago with code from v0.1.8. I released v0.1.9 quite late yesterday, so it's very new.

@Inventitech
Contributor Author

Inventitech commented Dec 15, 2016

No, as you can see above from gitlogg git:(dd44796), I am on v0.1.9. Also, I used -n 1, which disables parallelism.

@dreamyguy
Owner

dreamyguy commented Dec 16, 2016

Are these the repos you tested with?

https://github.com/gousiosg/ghtorrent-update
https://github.com/dspinellis/gi
https://github.com/dreamyguy/gitlogg
https://github.com/ClintonCao/UnifiedASATVisualizer

Note that parallel processing is completely removed in v0.1.9, so you can omit the CLI option.

@Inventitech
Contributor Author

Yes (regarding repositories).

@dreamyguy
Owner

dreamyguy commented Dec 16, 2016

@Inventitech I just tested these repos and got no error.

To be 100% sure we're taking the exact same steps, I've put this one-liner together. It's the same line I've used to test:

mkdir gitlogg-test && cd gitlogg-test && git clone https://github.com/dreamyguy/gitlogg.git && cd gitlogg && npm run setup && cd _repos && git clone --bare https://github.com/gousiosg/ghtorrent-update.git && git clone --bare https://github.com/dspinellis/gi.git && git clone --bare https://github.com/dreamyguy/gitlogg.git && git clone --bare https://github.com/ClintonCao/UnifiedASATVisualizer.git && npm run gitlogg

Run the line and let me know how it goes.

If you don't get it to work, let me know the specs of your OS and I'll open another issue that's specific to Linux, for I can't reproduce it on OSX.
