As someone who has described myself as a failed scientist, this is a great article for describing how science is done, especially since it makes it clear how much the back-and-forth has degenerated. And it’s also clear from it what the next step in this research should be: everyone says “okay this is interesting but it’s gotten so personal you can’t trust either side too well and more work won’t get much out of it”, gives it a few years to settle down, then someone else tries to tackle the same fundamental question with a hopefully-fresh approach.

Also, one of the most important parts of the article:

Disclaimer: I’m not a neutral party here, as the first part of this debate features prominently in my talk on Empirical Software Engineering (ESE) and I picked a side. I eventually ended up reviewing a draft of the Replication Rebuttal Rebuttal. Finally, Emery Berger answered some technical questions for me, but he had no input on this essay and has not seen any of the drafts. I’ll try my best to be impartial, but there’s no guarantees I succeeded.

All science and all scientists are biased. The good ones recognize it, try to measure it, and attach the appropriate caveats to their information so the reader knows what the biases are.

The P-value is the probability that you would have seen the same result if the hypothesis wasn’t true, purely due to other factors.

It’s kind of subtle and a big reason why Bayesians don’t like frequentist stats, but this is not what a p-value is.

A p-value is the probability that you would observe a result more extreme than the one you observed under the assumption that the null hypothesis is true. So you observed something with a p-value of 0.02. That means that under your current assumptions, you would only observe something more extreme 2% of the time in the long run.

There is no probability statement associated with the actual result you saw. And you haven’t calculated anything under the assumption that the null hypothesis was false. Your calculations and resulting numbers all assumed your null hypothesis was true. And there’s no probability statement associated with whether the null hypothesis is true or not. It’s either true or false, with probability 1 or 0.
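
To make that concrete, here’s a toy simulation (my own sketch, not from the article): estimate a one-sided p-value for a coin-flip experiment by generating data under the null hypothesis of a fair coin and counting how often the simulated result is at least as extreme as the observed one.

```python
import random

random.seed(0)

def simulated_p_value(observed_heads, n_flips=100, n_sims=10_000):
    # Simulate the null hypothesis (a fair coin) many times and count
    # how often the simulated result is at least as extreme as observed.
    extreme = 0
    for _ in range(n_sims):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            extreme += 1
    return extreme / n_sims

# 62 heads in 100 flips: under the null, a result this extreme is rare
print(simulated_p_value(62))
```

Note that every simulated dataset is generated under the assumption that the null is true; the number that comes out says nothing about the probability that the null itself is true or false.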

This difficulty of interpreting the p-value is why Bayesians don’t like frequentist stats. Everyone wants to be able to make probability statements about uncertain claims, but the frequentist framework just doesn’t allow that.

Out of curiosity, does anybody call themselves a frequentist or defend frequentism? I’ve seen a lot of people identify as Bayesians but not the converse, and I’d really love to see a steelman of “frequentism”.

It used to be more common to be against Bayesian stats because they require more calculations and the choice of priors was questionable. Now that we have big computers and more consensus on priors, Bayesian stats have become feasible. There’s also the subjectivity of Bayesian stats, but I don’t think that’s as strong of a deterrent.
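
As a tiny illustration of why the choice of prior was considered questionable (a hypothetical coin-bias example of my own, not from the thread): with a conjugate Beta prior, the posterior mean after observing some heads and tails has a closed form, and different priors give noticeably different answers for the same data.

```python
def posterior_mean(heads, tails, a=1.0, b=1.0):
    # Beta(a, b) prior on the coin's bias; observing `heads` and `tails`
    # gives a Beta(a + heads, b + tails) posterior by conjugacy.
    return (a + heads) / (a + b + heads + tails)

# Same data, different priors, noticeably different answers:
print(posterior_mean(9, 1))              # flat Beta(1, 1) prior -> ~0.83
print(posterior_mean(9, 1, a=50, b=50))  # strong prior at 0.5  -> ~0.54
```

The upside of the Bayesian answer is that it *is* a probability statement about the uncertain quantity; the downside, as noted above, is that it depends on a subjective input.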

Anyway, yes, John Venn of the diagrams, for example, was the originator of the frequentist interpretation. And the whole framework of probability that we still teach to undergrads in math and science is almost completely frequentist too, again, because it’s what you can do when observations are difficult to obtain and computing power is scarce. So perhaps only accidentally, a lot of mathematicians and scientists are supporting frequentism by teaching it so much.

Feller, who wrote a very influential series of probability texts in the 1950s, is probably a more modern anti-Bayesian. You can read about that here:

http://www.stat.columbia.edu/~gelman/research/published/feller8.pdf

Is it “more extreme (>)” or “at least as extreme (>=)”, please? My intuition is that > vs >= should not matter because we’re talking about real numbers so the boundary is infinitely thin anyway.

In most cases, yeah, > and >= are the same. It’s rare to do hypothesis testing or confidence intervals on discrete distributions, and I’m not quite sure what would be done there. Probably stick to >.
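
For what it’s worth, on a discrete distribution the two tails genuinely differ, because the boundary value itself carries probability mass. A quick sketch of my own using an exact binomial tail:

```python
from math import comb

def binom_upper_tail(k, n, p=0.5, strict=False):
    # P(X > k) if strict, else P(X >= k), for X ~ Binomial(n, p)
    start = k + 1 if strict else k
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(start, n + 1))

# On a discrete distribution the boundary value carries real mass,
# so ">" and ">=" give different tail probabilities:
print(binom_upper_tail(15, 20))               # P(X >= 15) ~ 0.021
print(binom_upper_tail(15, 20, strict=True))  # P(X > 15)  ~ 0.006
```

For a continuous test statistic the boundary has measure zero, so the distinction only matters in discrete cases like this one.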

Who is more trustworthy: the people who spent years getting their paper through peer review, or a random internet commenter who probably stopped at the abstract?

Really good point, one that is certainly worth considering more often.

That quote is misleading without the context, which is that even if the premise of a published article is nonsense (which should make reviewers think twice before passing peer review), the way science/publishing is structured, it’s not enough to merely point out that the premise is nonsense and that therefore the entire article is not worth your time. You actually have to make the effort to write a good rebuttal and publish it, too.


Possibly relevant article by Cosma Shalizi and Andrew Gelman, but I’m way out of my depth here: https://datajobs.com/data-science-repo/Bayesian-Statistics-%5bGelman-and-Shalizi%5d.pdf.

Thank you

I think what I am missing in all these interactions is professional courtesy (on both sides).

@hwayne, there’s a typo under the “What about the false alarms?” section: “At the very least is suggests…” should be “At the very least it suggests…”

Pushed a fix!

The way you quoted it is liable to be misunderstood by others.

Great read!