I wondered how this worked but couldn’t find a link to the source. I think this is it: https://github.com/Nagasaki45/krihelinator/tree/master/lib. The starting point seems to be periodically scraping the GitHub trending page. I wonder if it would be feasible to use GitHub Archive as an alternative data source.
Thanks for the reply. Your reference to the source code is correct.
Regarding “how it works”: There are two main background processes that gather and update the data.
The 1st runs continuously, fetches repositories from https://api.github.com/repositories (all of the repositories on GitHub, 100 per page, in a ~3-week loop), and passes them to scrapers that scrape the pulse page. The code for this process, which I call the “pipeline”, is here.
The 2nd runs every 6 hours, and (1) keeps only the 5K most active projects and removes the rest, (2) scrapes the trending page to make sure I’m not missing anything important, (3) rescrapes everything to update the stats. The code for this “periodic” process is here.
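To make the pipeline’s shape concrete, here is a minimal sketch (the listing endpoint and its 100-per-page `since` cursor are real; the function names and the `scrape_pulse` callback are stand-ins for illustration, not the project’s actual code, which is written in Elixir):

```python
import json
import urllib.request

LISTING_URL = "https://api.github.com/repositories"  # public listing, 100 repos per page

def github_page_getter(since):
    """Fetch one page of the public repository listing (real endpoint)."""
    with urllib.request.urlopen(f"{LISTING_URL}?since={since}") as resp:
        return json.load(resp)

def iter_repository_pages(get_page, since=0):
    """Walk the listing page by page; `since` is a cursor equal to the last
    repository id seen, which is how this endpoint paginates."""
    while True:
        page = get_page(since)
        if not page:
            return
        yield page
        since = page[-1]["id"]

def run_pipeline(get_page, scrape_pulse):
    """Feed every repository's full name to a pulse-page scraper; in the real
    project this loop takes roughly 3 weeks to cover all of GitHub."""
    for page in iter_repository_pages(get_page):
        for repo in page:
            scrape_pulse(repo["full_name"])
```

Injecting `get_page` keeps the pagination logic testable without hitting the network; `run_pipeline(github_page_getter, my_scraper)` would run it against the live endpoint.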
I really need to start documenting the project properly…
Thanks for pointing me to GitHub Archive. Someone already suggested it in the past. The only reason I didn’t use it is that I wasn’t aware of its existence when I started the project. However, considering the current state of the project, I’m not sure refactoring would be worthwhile compared to the amount of work it would require.
How does this counter the Matthew effect? Or other first-mover advantages? Every project that attempts to rank OSS projects by such blunt metrics (stars, commit rate) cannot help but reinforce existing network effects. (I realize this might not be an explicit goal.) GitHub stars can/will be gamed, as can commit rate, and incentivizing commit rate rewards poor designs that need tons of tweaks (thinking of a few Ruby libraries here that have ridiculously high numbers of commits).
Fundamentally, I disagree with the notion that additional flawed metrics are better than no metrics. The best way I’ve seen is word of mouth from experienced folk, especially when it deviates from current consensus.
First, thanks for the feedback!
I completely agree that the basic metric I suggested in this project is far from measuring contribution rate “objectively”. But why should it be?
When I started the project I thought about different methods to create and later validate metrics. I planned to collect projects, present them to a group of “experts”, and fit a regression line from a set of features to the experts’ ratings. In the end I concluded that any choice I made would be flawed, as you rightly argued. Instead, I decided to choose a simple and explicit metric, explain how it’s calculated on the about page, and keep it simple and opinionated. So to answer your question: no, it doesn’t counter the Matthew effect (thanks for the reference!).
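The rejected validation idea can be sketched as an ordinary least-squares fit from project features to expert ratings. Everything below is hypothetical (the features, the ratings, and the feature names are invented; no such dataset exists in the project):

```python
import numpy as np

# Invented per-project features: (commits, pull requests, contributors).
features = np.array([[120, 10, 4],
                     [300, 25, 9],
                     [40,   2, 1],
                     [500, 60, 20]], dtype=float)
# Invented ratings a hypothetical panel of experts might give.
expert_ratings = np.array([3.0, 7.0, 1.0, 9.0])

# Least-squares fit with an intercept column appended.
X = np.hstack([features, np.ones((len(features), 1))])
coef, *_ = np.linalg.lstsq(X, expert_ratings, rcond=None)

def predict(commits, pulls, contributors):
    """Score a project with the fitted coefficients."""
    return float(np.array([commits, pulls, contributors, 1.0]) @ coef)
```

The objection in the comment applies here too: the choice of features and of the expert panel is itself a flawed, opinionated choice.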
“Fundamentally, I disagree with the notion that additional flawed metrics are better than no metrics.”
It would be hard to argue against this :-) However, taking the GitHub trending page as a reference point (which I was addicted to in the past), my subjective experience is that this is an improvement.
“The best way I’ve seen is word of mouth from experienced folk, especially when it deviates from current consensus.”
Totally agree with that! But they are not always around :-)
Thanks for following up. I don’t mean to discourage you from working on the project.
I’m encouraged by the fact that it can resist some fad-seeking behavior. I hope you continue to work on it and refine it to see if there’s something that does even better.
considering the GitHub trending page as a reference
I am not sure this answers the GP comment: it is not a high bar to clear, and I guess I agree that discrediting the whole idea of something similar being useful could be better than trying to incrementally improve from such a low start.
collect projects, present them to a group of “experts”, and find a regression line
You have project history, you could measure the projects' past and see what correlates across a year or two of time and what doesn’t (if you already fetch a ton of stats for a lot of projects). By the way, if this yielded a page «projects likely to overperform compared to their relative lack of hype», that would be net-positive in terms of finding interesting stuff and fighting Matthew effect.
microsoft/vscode is listed twice with different capitalisation — and GitHub always ignores the letter case. It’s also funny to see a random fork of chromium be near the top for the (upstream) commits alone.
Is there any explicit reason that the metric doesn’t include active project members (with write-access)? Maybe offset by restricting pull-request count to non-committers.
Right now if a person provides a lot of simple clear fixes, the metrics would rank the project higher if this person does not have commit rights (it would be a PR per fix instead of commit per fix); I would prefer to rely on a project where small uncontroversial fixes do get in even if a few top people are too busy to care.
Disclosure: an implementation of such a correction is likely to raise NixPkgs in the ranking, which is the highest-ranking project where I contribute.
Thanks for the feedback. The microsoft/vscode duplication is a known, reported bug.
I’m not quite sure I understand the improvement you are suggesting, so correct me if I’m wrong. Right now, when someone submits a PR and it gets accepted, their contribution is counted in 3 ways: as one more contributor, as X more commits (as many commits as the PR contains), and as one more PR. In general, the metric I’m using is a simple weighted sum of the information from the pulse page.
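As a rough illustration of that weighted sum (the counter names and the weights below are made up for the example; the project’s actual coefficients live in its source):

```python
# Hypothetical weights over the counters scraped from a repo's pulse page.
WEIGHTS = {
    "authors": 2.0,
    "commits": 1.0,
    "merged_pulls": 2.0,
    "proposed_pulls": 2.0,
    "new_issues": 1.0,
    "closed_issues": 1.0,
}

def score(pulse_stats):
    """Weighted sum of pulse-page counters; a missing counter contributes 0."""
    return sum(weight * pulse_stats.get(name, 0) for name, weight in WEIGHTS.items())
```

Under this scheme, an accepted PR with several commits indeed bumps `authors`, `commits`, and `merged_pulls` all at once, which is the triple counting described above.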
Does this answer your question? If not, please elaborate.
As for the bug: there is no prominent «source code» link in the menu bar or on the about page, so I didn’t immediately know where to report it (and I was writing a comment for other reasons anyway). Thanks for the link.
My question is about the following two situations:
The amount of useful work performed for the project is the same. Your metric thinks the second scenario is significantly better. From the trending/growing/healthy point of view, I would expect the second scenario to show better viability in most cases.
Do you mean in the first scenario?
Yes, thanks for the correction. Switched scenarios throughout the comment.
First, I added a ticket for adding a link to the source code. Thanks for pointing that out!
Now I understand the question. It’s a good idea, but it could complicate things significantly. For example, what about contributors with write access who submit PRs, say for code review? In general, as I said in a previous comment, I don’t mind that the metric is “not accurate” and prefer simplicity.
Having said that, we can still open a ticket for this enhancement proposal.