How do you choose between technologies X and Y when you don’t know enough to understand the differences? If popularity plays a role in the decision, how do you measure it? In general, how do you track which are the most popular / trending / active open source projects out there? The Krihelinator tries to answer some of these questions by proposing an alternative to the GitHub trending page, where instead of stars, repositories are ranked by weekly contribution rates (number of contributors, commits, pull requests, and issues).
I’ve been working on this project for a few months now, and believe it is ready to get some exposure and feedback. Any criticism / feedback / comments will be much appreciated!
How does this counter the Matthew effect? Or other first-mover advantages? Every project that attempts to rank OSS projects by such blunt metrics (stars, commit rate) cannot help but reinforce existing network effects. (I realize this might not be an explicit goal.) GitHub stars can/will be gamed, as can commit rate, and incentivizing commit rate rewards poor designs that need tons of tweaks (thinking of a few Ruby libraries here that have ridiculously high numbers of commits).
Fundamentally, I disagree with the notion that additional flawed metrics are better than no metrics. The best way I’ve seen is word of mouth from experienced folk, especially when it deviates from current consensus.
First, thanks for the feedback!
I completely agree that the basic metric I suggested in this project is far from measuring contribution rate “objectively”. But why should it be?
When I started the project I thought about different methods to create and later validate metrics. I planned to collect projects, present them to a group of “experts”, and fit a regression from a set of features to the experts’ ratings. In the end I concluded that every choice I made would be flawed, as you rightly argued. Instead, I decided to choose a simple and explicit metric, explain how it’s calculated on the about page, and keep it simple and opinionated. To answer your question, therefore: no, it doesn’t counter the Matthew effect (thanks for the reference!).
“Fundamentally, I disagree with the notion that additional flawed metrics are better than no metrics.”
It will be hard to argue against this :-) However, considering the GitHub trending page as a reference point (which I was addicted to in the past), and in my subjective experience:
“The best way I’ve seen is word of mouth from experienced folk, especially when it deviates from current consensus.”
Totally agree about that! But they are not always around :-)
Thanks for following up. I don’t mean to discourage you from working on the project.
I’m encouraged by the fact that it can resist some fad-seeking behavior. I hope you continue to work on it and refine it to see if there’s something that does even better.
I am not sure this answers the GP comment: it is not a high bar to clear, and I guess I agree that discrediting the whole idea of something similar being useful could be better than trying to incrementally improve from such a low starting point.
You have project history: you could measure the projects’ past and see what correlates across a year or two and what doesn’t (if you already fetch a ton of stats for a lot of projects). By the way, if this yielded a page of «projects likely to overperform compared to their relative lack of hype», that would be a net positive in terms of finding interesting stuff and fighting the Matthew effect.
I wondered how this worked but couldn’t find a link to the source. I think this is it: https://github.com/Nagasaki45/krihelinator/tree/master/lib. The starting point seems to be periodically scraping the GitHub trending page. I wonder if it would be feasible to use GitHub Archive as an alternative data source.
Thanks for the reply. Your reference to the source code is correct.
Regarding “how it works”: There are two main background processes that gather and update the data.
The first runs continuously, fetches repositories from https://api.github.com/repositories (all of the repositories on GitHub, 100 per page, in a roughly three-week loop), and passes them to scrapers that scrape each repository’s pulse page. The code for this process, which I call the “pipeline”, is here.
The second runs every 6 hours, and (1) keeps only the 5K most active projects and removes the rest, (2) scrapes the trending page to make sure I’m not missing anything important, (3) rescrapes everything to update the stats. The code for this “periodic” process is here.
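In rough Python-like pseudocode, the pipeline boils down to something like this (the real code isn’t Python, and the names below are only for illustration):

    import time
    import requests

    def fetch_repo_pages():
        """Walk https://api.github.com/repositories with the `since` cursor (100 repos per page)."""
        since = 0
        while True:
            resp = requests.get("https://api.github.com/repositories", params={"since": since})
            resp.raise_for_status()
            page = resp.json()
            if not page:
                break
            yield page
            since = page[-1]["id"]  # the next page starts after the last repository id seen
            time.sleep(1)           # crude politeness delay; the real thing has to respect GitHub's rate limits

    def scrape_pulse(full_name):
        """Placeholder: scrape github.com/<full_name>/pulse for weekly contributors/commits/PRs/issues."""
        ...

    def pipeline():
        for page in fetch_repo_pages():
            for repo in page:
                stats = scrape_pulse(repo["full_name"])
                # persist `stats`; the 6-hourly job later trims the database to the ~5K most active projects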
I really need to start documenting the project properly…
Thanks for pointing me to GitHub Archive. Someone already suggested it in the past. The only reason I didn’t use it is that I wasn’t aware of its existence when I started the project. However, considering the current state of the project, I’m not sure refactoring would pay off compared to the amount of work it would require.
microsoft/vscode is listed twice with different capitalisation — and GitHub always ignores letter case. It’s also funny to see a random fork of Chromium near the top thanks to the (upstream) commits alone.
Is there any explicit reason that the metric doesn’t include active project members (with write-access)? Maybe offset by restricting pull-request count to non-committers.
Right now, if a person provides a lot of simple, clear fixes, the metric ranks the project higher if that person does not have commit rights (it would be a PR per fix instead of a commit per fix); I would prefer to rely on a project where small uncontroversial fixes do get in even if a few top people are too busy to care.
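To put rough numbers on it (the weights here are made up, only to show the shape of the problem): with, say, 20 points per contributor, 8 per PR, and 1 per commit, ten small fixes submitted as PRs add 20 + 10×8 + 10×1 = 110 to the score, while the same ten fixes pushed directly by a committer add only 20 + 10×1 = 30.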
Disclosure: an implementation of such a correction is likely to raise NixPkgs in the ranking, which is the highest-ranking project where I contribute.
Thanks for the feedback. The microsoft/vscode duplication is a known, reported bug.
I’m not quite sure I understand the improvement you are suggesting, so correct me if I’m wrong. Right now, when someone submits a PR and it gets accepted, their contribution is counted in three ways: as one more contributor, as X more commits (as many as there are commits in the PR), and as one more PR. In general, the metric I’m using is a simple weighted sum of the information from the pulse page.
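To make that concrete, the calculation has roughly this shape (Python sketch; the weights below are made up for the example, the actual ones are explained on the about page):

    # Illustrative weights only -- the real values are documented on the about page.
    WEIGHTS = {"contributors": 20, "commits": 1, "pull_requests": 8, "issues": 8}

    def activity_score(pulse_stats):
        """Weighted sum of the weekly numbers scraped from a repository's pulse page."""
        return sum(weight * pulse_stats.get(name, 0) for name, weight in WEIGHTS.items())

    # With these made-up weights, a merged PR with 3 commits from a new contributor
    # adds 20 (contributor) + 3 (commits) + 8 (pull request) = 31 to the score.
    print(activity_score({"contributors": 12, "commits": 87, "pull_requests": 9, "issues": 14}))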
Does this answer your question? If not, please elaborate.
As for the bug: there is no prominent «source code» link in the menu bar or on the about page, so I didn’t immediately know where to report it (and I was writing a comment for other reasons anyway). Thanks for the link.
My question is about the following two situations:
The amount of useful work performed for the project is the same. Your metric thinks the second scenario is significantly better. From the trending/growing/healthy point of view, I would expect the second scenario to show better viability in most cases.
Do you mean in the first scenario?
Yes, thanks for the correction. Switched scenarios throughout the comment.
First, I added a ticket for adding a link to the source code. Thanks for pointing that out!
Now I understand the question. It’s a good idea, but it could complicate things significantly. For example, what about contributors with write access who submit PRs, say for code review? In general, as I said in a previous comment, I don’t mind that the metric is “not accurate” and prefer simplicity.
Having said that, we can still open a ticket for this enhancement proposal.
I’d considered doing something related–basically, a fork of Hex that would only show projects that had:
I couldn’t care less about the number of stars or commit frequency or whatever, as long as those key metrics were being addressed–and yet, nobody has seen fit to use them to limit library visibility!
That’s just silly.
Why do you say that?
Coverage is precisely as easy to game as LoC or commit count/frequency.
As is number of stars, number of maintainers, or anything else. To define a concrete metric is to define a path for exploitation.
In the general, non-adversarial case, your observation isn’t a good reason not to try.
Several reasons. First, metrics can easily be gamed: I could easily generate 100% test coverage with tests that always end with assert(true). But even ignoring that, it’s silly because of the law of diminishing returns. Writing tests takes time and effort. Time and energy spent writing tests is time and energy that doesn’t go into writing useful features. There is a cost to writing tests. The benefit of writing tests is that you’ve reduced risk and your code has fewer bugs.
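To spell that out with a toy example (the buggy function is made up): a test like this executes every line of the code under test, so a coverage tool happily reports 100%, while checking nothing.

    def divide(a, b):
        # Deliberately buggy: silently returns 0 instead of raising on division by zero.
        if b == 0:
            return 0
        return a / b

    def test_divide():
        divide(4, 2)
        divide(1, 0)
        # Every line of divide() ran, so line coverage is 100%...
        assert True  # ...but nothing about the results was ever checked.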
The problem you’ll have is that with each test you have, it gets progressively harder for that test to eliminate/prevent new bugs, because your other tests have already covered very similar cases. Your first 10% of code coverage is hugely valuable, but the next 10% is a little less valuable. By the time you’re up to 70-80% code coverage, the benefits of adding new tests are pretty low, on average.
NB: this depends, of course, on your risk tolerance, risk likelihood and risk severity. If, for example, you’re making the software that controls how much radiation a radiotherapy system delivers to your patient, a bug in your code can kill people. 100% test coverage with static analysis may be perfectly reasonable in that case. If, on the other hand, you’re building a piece of social media software, or a video game, or one of a million productivity applications- then it’s probably not worth the time or energy to get to 100% test coverage. Delivering features is more important, even if those features are slightly unreliable.
The same mechanisms apply to documentation.
So, your assertions are:
As mentioned above, any metric people pick–lines of code, commit count, commit frequency, version number, number of stars, etc.–can be gamed by a project. As I said above, to define a metric is to define a way of subverting it. That said, we can’t just give up on metrics because somebody somewhere at sometime could do something–unless, of course, you write software with the Bush administration!
And yes, people can write tests that don’t matter, sure. But again, that’s the sort of thing we fix with training and mentoring. The same folks also write garbage code that doesn’t support features to spec.
“Writing tests takes away time from writing features.”
Perhaps, if you assume a fixed time budget and a zero defect rate.
The thing is, if you don’t write tests (or do manual testing) you’ll never know if you actually delivered the feature you promised. Your clients and users will have to bear the brunt of the code’s failings and weird code behavior.
Again, if you say “I don’t care about defect rate MOAR FEE CHAIRS” then sure, your point of view makes sense, but then you’re just slinging code and not doing anything approximating engineering.
More practically, the time saved by having good test coverage with good tests comes into play when:
I’ve worked in shops and projects with good testing, in shops with no testing, and on projects with 100% code coverage from tests–and I can guarantee that in a team setting things go a hell of a lot faster when tests catch fuckups early. And the things they miss? Those wouldn’t have been caught with fewer tests.
“Writing tests follows diminishing returns.”
So, lemme quote you here:
That math, to put it plainly, bears no relation to reality.
First, the idea that new tests cover fewer bugs because they overlap older tests implies that you have tests which test overlapping features. If you’ve written your tests that way–in the pathological case, adding a test that a string is always uppercase when you already have a test that makes sure strings only use the uppercase characters [A-Y] when those characters appear–then you’re writing your tests wrong. Tests should be written to cover disjoint features, precisely because of the thing you mentioned. And if your codebase/API seems to continually require testing non-orthogonal code? Maybe your code is poorly designed and could use a rewrite. That’s the sort of thing writing tests will reveal.
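A toy illustration of the difference (the normalize() function here is made up):

    def normalize(s):
        """Made-up function under test: strips whitespace and uppercases."""
        return s.strip().upper()

    # Overlapping: the second test is largely implied by the first.
    def test_result_is_uppercase():
        assert normalize("abc") == normalize("abc").upper()

    def test_result_only_contains_uppercase_letters():
        assert all(c.isupper() for c in normalize("abc"))

    # Disjoint: each test pins down a separate piece of behaviour.
    def test_uppercases_letters():
        assert normalize("abc") == "ABC"

    def test_strips_surrounding_whitespace():
        assert normalize("  abc  ") == "ABC"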
Second, the idea that there is some massive jump in bugs revealed between 0-10% test coverage and then 10-20% fewer and so on ignores how code is structured. Beyond the first bug discovered (whose bug ratio is 1/0), there is no guarantee that that first 10% of coverage will actually show more bugs than the next 10%, because the thing that matters is which 10% is covered. You can totally write 10% test coverage for leaf functions that do trivial things and never find a bug. You could manage to pick exactly the 10% of the functions that contain bugs, and get 100% bug reduction. If you want to make the argument “prioritize the parts that are more likely to be buggy”, I agree with you–but your presented model of bugs and testing here is incorrect.
Third, you assert that the value of adding new tests is low. The value of writing an individual function or object is probably pretty low too–but you do it enough and suddenly you have a product! I don’t know how you define value, and that’s a very project-specific and developer-specific thing, but consider the fact that the easiest time to add tests is before the system grows into a massive untested hairball.
Also, if you have decided to skip new tests on a minor new thing–say, credit-card billing; after all there are so many other features a simple Braintree integration is pretty minor–because additional tests are, on average, pretty low value (in terms of finding bugs), you’re irresponsible and should be fired.
“Writing documentation follows diminishing returns.”
This was not substantiated in any way, but I’ll address it. Something I didn’t note explicitly but will here: by 100% code documentation I didn’t mean a comment on every line, because that is absurd. A comment on each function or data structure is what I meant.
For a library, the entire surface of the exported API is what users will touch. It doesn’t matter if 80% is documented if your users only seem to use and keep banging their heads against that last 20%.
For an application, dumping a new developer into a codebase without an explanation of what each function does is a sure-fire way to get duplication of effort and to get “bugfixes” that just make more bugs, because it isn’t clear what the functions should do, what functions already exist, and how they fit into the bigger picture.
“More features beats reliable features.”
This is the part of your post that I disagree with most heavily, honestly.
From the library perspective “MOAR FEE CHAIRS” is the rallying cry of several communities, like the JS ecosystem. And instead of delivering reliable building blocks we can all depend on, we get endless churn and new fancy ways of doing things and not a lot of careful forethought.
cat and BLAS have been in use for over 30 years without significant new features–how much of the new, feature-rich stuff is going to last that long? How much harder is it to upgrade a library that has a bajillion overlapping features that may or may not play well together than it is to upgrade a library that does a few clearly-documented things and stops? How much harder is it to maintain, when each feature might have a user you can’t drop, but additional features might not guarantee new users?
Hell, how much harder is it to learn how to use a library that contains eighty ways of doing the same thing when compared with one or two?
And remember, this is all in the context of git repos, so we kinda assume the main usecase is libraries.
That’s a lot of text, but I’m starting to suspect that we’re talking about two different things. When you say, “100% code coverage”, I hear, “100% code coverage”, which means every single line in your code is touched by a test. But then you say something like this:
What the hell does that have to do with code coverage? That’s feature coverage. Sure, every feature should have tests. But features live at a totally different level of abstraction than code. If you wanted to say, “100% feature coverage”, I’d agree with you, because my objections are with targeting 100% code coverage.
Even if features live at a totally different level of abstraction, they are still impacted by the implementation flaws at the code level. Those flaws are only really reliably spotted by rigorous testing (or long-term use, which is similar).
And again, I come at this from the angle of writing and consuming libraries–if you want to talk about applications, that’s slightly different–after all, you can be doing really dumb heinous shit in an application and as long as the client doesn’t notice you still get paid.
I hold libraries meant to be building blocks for others to a much higher standard; if you don’t, that’s your choice, but I probably will prefer to avoid using your code.
But again, this comes down to a risk management problem. How much effort is required to write the test? How likely is the test to be correct? How bad would it be if the code doesn’t work as advertised? What’s the probability of the negative outcome? Assuming your test is incorrect, what’s the probability that it conceals a bug? What’s your risk tolerance?
Your risk tolerance is apparently zero, which is silly.
Considering that the code may well outlast me, and that every person who has to work around or suffer from a bug is losing valuable life force doing so, I ask you–how can you not want to make your library code as perfect as you can?
How many man-hours have been spent working around dumb bugs in, say, Node or Wordpress or Django or Rails?
The idea of risk management only applies if you can reasonably bound the risk–which, when you share a library that might be used by others, you can’t. I might accept that every once in a while my datetime library fucks up, but if somebody decides to use it for a medical records system or a critical product demo and gets bit I still have some moral responsibility for releasing bad software.
And yeah, we all throw the standard “no warranty given” stuff on our code, but that’s again not how engineering works. If we are too chickenshit to take responsibility for our code not working, we probably shouldn’t be sharing it with other people.
I’m sorry that you find writing good and thorough tests difficult, but maybe you should try working on smaller projects or just getting better at writing tests. For utility libraries and specialized things, provided you built them to be tested, it’s not that difficult.
I’d fear that would realistically exclude 80% of the useful libraries…
But the remaining 20% would probably need a lot less headdesk time.
If we ever want to make software engineering a real thing, we’ve got to start standardizing parts with published documentation and testing information. This would be a step in that direction.