It’s useful to place this within a larger class of “information collapse” errors.
In general, averages can lead you astray. The fix might be to include the standard deviation, or (as here) a min or max statistic; but any time you replace the distribution itself with lossy summary statistics, you risk errors.
Many cognitive biases and puzzles in probability essentially boil down to this.
Rather than attempting to estimate the entropy of a scheme, wouldn’t it be better to just define a scheme that represents randomness directly in the encoding? If you target N bits of randomness, then you will not fall into these sorts of traps if you directly encode those bits in a symmetric number scheme, such as base16, base32, or diceware (base ~12.92).
The math is sort of interesting, but I’m not sure why anyone would in practice want to do anything other than directly encoding randomness.
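To make "directly encoding randomness" concrete, here is a minimal sketch, assuming a uniformly sampled wordlist and Python's `secrets` module; `random_passphrase` is a hypothetical helper name, not an existing API. Each word drawn uniformly from a list of size W contributes log2(W) bits, so you just draw enough words to cover the target:

```python
import math
import secrets

def random_passphrase(bits: int, wordlist: list[str]) -> str:
    """Encode at least `bits` bits of randomness as words drawn
    uniformly (diceware-style) from `wordlist`."""
    # Entropy per word: ~12.92 bits for a 7776-word diceware list,
    # 4 bits for base16 digits, 5 bits for base32 symbols, etc.
    bits_per_word = math.log2(len(wordlist))
    n_words = math.ceil(bits / bits_per_word)
    # secrets.choice is uniform and suitable for security use
    return " ".join(secrets.choice(wordlist) for _ in range(n_words))
```

Because every word is chosen independently and uniformly, the entropy is exactly `n_words * log2(len(wordlist))` by construction; there is nothing to estimate and no collapse-style trap to fall into.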
I became interested in this because I was trying to develop highly memorable passwords. For example, you might want to choose a random grammatical English sentence as your passphrase. A natural way to do this might be to choose a random parse tree of a given size, but if you do that you’ll have some duplicated sentences (with ambiguous parses).
There are also the EFF word lists, which are designed to be memorable, real, and chosen by dice roll. I particularly like the list where each word can be uniquely identified by its first three letters: https://www.eff.org/deeplinks/2016/07/new-wordlists-random-passphrases
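For reference, "chosen by dice roll" can be sketched as follows; `diceware_word` is my own illustrative helper, assuming one of the EFF short lists, which are indexed by four six-sided dice (6^4 = 1296 entries):

```python
import secrets

def diceware_word(wordlist: list[str], dice: int = 4) -> str:
    """Pick one word by simulating `dice` rolls of a six-sided die,
    treating the rolls as a base-6 index into the list."""
    assert len(wordlist) == 6 ** dice
    index = 0
    for _ in range(dice):
        roll = secrets.randbelow(6)  # uniform 0..5, i.e. one die roll
        index = index * 6 + roll
    return wordlist[index]
```

With physical dice you would do the same arithmetic by hand: read the four rolls as a base-6 number and look up that line in the list.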
I’ve created rpass for this. It generates random mnemonics, e.g. rpass 128 yields:
juthor kezrem xurvup kindit puxpem vaszun bok
Is it really English if the majority of the words are made up?
You’re right, it isn’t English, but “juthor”, “kezrem”, and “puxpem” are pretty memorable non-words IMO.