I don’t think that this is intentional on the part of the author, but I find this article somewhat misleading.

The anecdote about de Moivre and livers describes a common misuse of the CLT: casually assuming a distribution is normal because we don’t understand the underlying process but think it could be the sum of many factors. And this is indeed bad when measuring livers or IQ. But I don’t want to throw out the baby with the bathwater! There are a lot of situations where the conditions you need to rigorously invoke the CLT are met, and you can use it to make strong conclusions about the real world. This applies to other distributions too; for example, the Poisson distribution has a simple set of assumptions that you also won’t meet in sociology or physiology but that come up a lot in the real world.
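When the conditions do hold (i.i.d. summands with finite variance), the convergence is easy to check numerically. A minimal sketch, with parameter choices of my own:

```python
import random

random.seed(0)

# Sum many i.i.d. uniform draws; by the CLT the standardized sum
# should look approximately standard normal.
n_terms = 1000
n_samples = 10_000
sums = [sum(random.random() for _ in range(n_terms)) for _ in range(n_samples)]

# Standardize using the exact mean (n/2) and variance (n/12) of the sum.
mean = n_terms / 2
sd = (n_terms / 12) ** 0.5
z = [(s - mean) / sd for s in sums]

# Roughly 68% of standardized sums should fall within one standard deviation.
frac = sum(abs(v) < 1 for v in z) / len(z)
print(round(frac, 2))
```

Here the CLT is being used with all its hypotheses satisfied, which is exactly the situation where it licenses strong conclusions.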

I also disagree with this other somewhat central point:

The problem with Named Distributions is that one can’t really point out the difference between mathematical truths and the priors within them.

You can, and someone who teaches these distributions without describing what the differences are is doing their students a disservice.

We shouldn’t call a venerable tool an artifact just because people aren’t clued in to how to use it right.

Also:

if anyone knows of a good article/paper/book arguing for why simple models are inherently better, please tell me and I’ll link to it here.

The AIC and BIC are quantitative justifications for simpler models.
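Both criteria penalize parameter count explicitly: AIC = 2k − 2 ln L̂ and BIC = k ln n − 2 ln L̂ (lower is better). A small sketch, with illustrative numbers of my own choosing, showing that a more complex model must buy enough extra likelihood to beat its penalty:

```python
import math

def aic(log_likelihood, k):
    # Akaike information criterion: 2k - 2 ln L-hat (lower is better)
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # Bayesian information criterion: k ln n - 2 ln L-hat (lower is better)
    return k * math.log(n) - 2 * log_likelihood

n = 100                          # sample size
simple = aic(-250.0, k=2)        # 2 parameters
complex_ = aic(-249.5, k=5)      # 3 extra parameters, small likelihood gain
print(simple < complex_)         # the simpler model is preferred
```

The same comparison under BIC penalizes the extra parameters even harder, since the penalty grows with ln n.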

You can, and someone who teaches these distributions without describing what the differences are is doing their students a disservice.

Could you expand a bit more on this?

I mean, I’m not a statistician, but I’m fairly decent at ML and math, and I think I’m past the stats-101 level; still, it has always seemed to me that there’s no “mathematical truth” in, e.g., the Poisson distribution any more than in any other arbitrary equation.

As mentioned below, there are certain formulas that work because of the equation defining the distribution, but once you remove the prior from the distribution you’re left with an inherently different equation and those formulas no longer work.

As opposed to, e.g., a neural network, where you can think of the set of equations as a prior, but if one of our assumptions is something like “batch norm and L1 regularization will help in cases x, y, z”, you can change everything else about the network, or at least some things about it, and that assumption still holds (e.g. you can change the activation, the shape, the operations applied to the weights, activations, and biases to yield the next activation, etc.).

It doesn’t seem to me like an equivalent can be done with distributions; e.g., if you switch from a standard to a bimodal distribution, the whole edifice now has a different set of properties, and knowing loads of stuff about the standard distribution won’t help you work with a bimodal one.

I don’t know enough about ML to speak about that comparison.

From my perspective, I think of the Poisson distribution not as the thing defined by its pmf or cdf, or as something chosen to be easy to work with, or as something drawn because it fits neatly through observed points, but as the distribution of counts in an interval of a Poisson process. The Poisson process I think of as the unique process that has independent increments whose expected value increases linearly with interval size, and a vanishing probability of observing two counts simultaneously. It’s a mathematical object that follows from its axioms. We might be divided on whether these axioms could be considered priors, but I don’t think so; it’s useful to make the distinction.
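That construction can be checked numerically: simulate a process with i.i.d. exponential inter-arrival times (one standard way to realize those axioms) and compare the counts per unit interval to the Poisson pmf. A sketch, with names and parameters of my own choosing:

```python
import math
import random

random.seed(1)

rate = 3.0          # expected events per unit interval
n_intervals = 20_000

def count_in_unit_interval(rate):
    """Count arrivals in [0, 1) of a process with exponential gaps."""
    t, count = 0.0, 0
    while True:
        t += random.expovariate(rate)
        if t >= 1.0:
            return count
        count += 1

counts = [count_in_unit_interval(rate) for _ in range(n_intervals)]

# Empirical frequency of k events vs. the Poisson pmf e^(-rate) rate^k / k!
for k in range(6):
    empirical = counts.count(k) / n_intervals
    pmf = math.exp(-rate) * rate ** k / math.factorial(k)
    print(k, round(empirical, 3), round(pmf, 3))
```

Nothing about the Poisson pmf was assumed in the simulation; it falls out of the axioms on the process, which is the sense in which the distribution is a mathematical consequence rather than a curve fit.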
