I'm working on an entry for the GitHub Data Challenge and trying to analyze a set of PushEvents, but I'm getting some strange(?) results.
require 'open-uri'
require 'zlib'
require 'yajl'

users = Hash.new(0)  # login => number of PushEvents seen
(0..23).each do |hour|
  # URI.open is needed on Ruby 3.0+, where a bare open() no longer fetches URLs
  gz = URI.open("http://data.githubarchive.org/2013-04-01-#{hour}.json.gz")
  js = Zlib::GzipReader.new(gz).read
  Yajl::Parser.parse(js) do |event|
    if event["type"] == "PushEvent" && event["actor_attributes"] && event["actor_attributes"]["login"]
      users[event["actor_attributes"]["login"]] += 1
    end
  end
end
This script works fine, but when I look at the highest number of commits made by one person via
users.values.max
I see someone has made over 7k commits in a day. When I go through and print out
event["payload"]["shas"]
all of the printed results are essentially the same:
585a2f02f36da9ee0625a42aa2d5e98836c8a2de
[email protected]
Notes added by 'git notes add'
Jenkins
true
I presume that the commit message associated with the PushEvent is "Notes added by 'git notes add'", so does this seem right? Or am I misreading some data here?
I know this is a pretty old question, but I just bumped into this today. When you say "essentially the same", what do you mean? Is that last boolean true in all of them?
If I'm not mistaken (and I may be, since I haven't found much documentation on the format of these archive dumps), that last boolean indicates whether the commit SHA is unique to that specific push, meaning that the commit has not been seen before in that repository. The same SHA, message, and so on could be pushed several times, but only one of those entries should have the boolean set to true.
Because Git is distributed, the same SHA will appear several times in PushEvents as forks and branches are opened, closed, and merged throughout the history of a repository. Since you are aggregating pushes to count one person's commits, I recommend either deduplicating by commit SHA or simply counting the number of true flags as the number of commits.
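Here is a rough sketch of both counting strategies. It assumes each entry in event["payload"]["shas"] is a five-element array in the order printed in the question (SHA, email, message, author name, boolean); I have not confirmed that layout against any official documentation. The method name count_commits is made up for illustration.

```ruby
require 'set'

# Assumed entry layout, inferred from the output printed in the question:
#   [sha, email, message, author_name, distinct_flag]
# count_commits is a hypothetical helper, not part of any library.
def count_commits(events)
  seen = Hash.new { |hash, login| hash[login] = Set.new } # login => unique SHAs
  flagged = Hash.new(0)                                   # login => number of true flags

  events.each do |event|
    next unless event["type"] == "PushEvent"

    login = event.dig("actor_attributes", "login")
    next unless login

    shas = event.dig("payload", "shas") || []
    shas.each do |sha, _email, _message, _name, distinct|
      seen[login] << sha                  # a Set ignores repeated SHAs
      flagged[login] += 1 if distinct     # count only entries flagged as new
    end
  end

  [seen.transform_values(&:size), flagged]
end
```

To use it with the script from the question, collect the parsed events into an array inside the Yajl::Parser.parse block and then call unique, flagged = count_commits(events). If the two counts for the 7k user come out far apart, that would suggest the same commits are being pushed over and over.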
As a side note, the name 'Jenkins' suggests that these commits were made by a continuous integration system (http://jenkins-ci.org/), so bugs or automated tasks may be behind those 7k repeated commit messages.