1. 24
  1.  

  2. 8

    KS gets some serious support in this article for, basically, good reason. But before you rush out and start using KS everywhere, take note that, as is typical, by making fewer assumptions a non-parametric test will be less powerful/efficient than its parametric cousins. It’s also significantly more complex to describe the decision boundaries of interest for this statistic, which is why the article talks about tables and approximations.

    Non-parametrics are great when you have a significant lack of knowledge about your data, but be cognizant of the tradeoffs they bring.

    1. 1

      Indeed,

      Further the text says: “The test is non-parametric and entirely agnostic to what this distribution actually is.” But I really don’t think it’s possible to have an algorithm that “works no matter what data you are working with” - garbage in, garbage out, is kind of the first rule of statistics.

      And in fact the article goes on to say K-S “is much better at detecting distributional differences when the sample medians are far apart than it is at detecting when the tails are different but the main mass of the distributions is around the same values.” which seems to contradict “agnosticism”.

      I assume a better way to phrase it is “this statistic works for a wide variety of distributions that are practically encountered” or some such.
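
      To make the medians-versus-tails point concrete, here is a minimal sketch of the two-sample K-S statistic (the largest gap between the two empirical CDFs) using only the standard library. The function name `ks_statistic` and the toy samples are made up for illustration and are not taken from the article.

      ```rust
      /// Two-sample K-S statistic: the largest absolute gap between the
      /// empirical CDFs of `a` and `b`. Illustrative sketch only.
      fn ks_statistic(a: &[f64], b: &[f64]) -> f64 {
          let mut a = a.to_vec();
          let mut b = b.to_vec();
          a.sort_by(|x, y| x.partial_cmp(y).expect("samples must not contain NaN"));
          b.sort_by(|x, y| x.partial_cmp(y).expect("samples must not contain NaN"));

          let (n, m) = (a.len() as f64, b.len() as f64);
          let (mut i, mut j) = (0usize, 0usize);
          let mut d: f64 = 0.0;

          // Walk both sorted samples in order; after consuming every value <= x,
          // i/n and j/m are the two empirical CDFs evaluated at x.
          while i < a.len() && j < b.len() {
              let x = a[i].min(b[j]);
              while i < a.len() && a[i] <= x {
                  i += 1;
              }
              while j < b.len() && b[j] <= x {
                  j += 1;
              }
              d = d.max((i as f64 / n - j as f64 / m).abs());
          }
          d
      }

      fn main() {
          // Toy data (assumed for illustration): 0, 1, ..., 9.
          let base: Vec<f64> = (0..10).map(|i| i as f64).collect();

          // Same shape, but the whole sample is shifted by 5.
          let shifted: Vec<f64> = base.iter().map(|x| x + 5.0).collect();

          // Same main mass, but the largest value is pushed far into the tail.
          let mut heavy_tail = base.clone();
          heavy_tail[9] = 100.0;

          println!("shifted:    D = {:.2}", ks_statistic(&base, &shifted));
          println!("heavy tail: D = {:.2}", ks_statistic(&base, &heavy_tail));
      }
      ```

      With these toy samples the shifted pair gives D of about 0.5, while the pair that differs only in its largest value gives D of about 0.1, no matter how extreme that value is. That is the insensitivity to tails the article describes.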

      1. 1

        Usually what that means is that nobody made any distributional assumptions besides the “standard” ones. It’s not always clear what those are and you’d have to read the proof to be sure, but typically these are quite mild. So within that whole class of mildly-constrained distributions the proof will hold.

        But the proof will merely say something like “things will converge at infinite data almost surely” or “this test will almost certainly have the proper alpha value”. In practice, test power/efficiency is important. Essentially, “how much data does it take to see a valid result accepting some minimal chance of false positivity?”

        K-S will be lower powered than a similar test making distributional assumptions. It will require more data.

    2. 6

      This was a great article! Really thorough explanation of the KS test, practical implementation, overview of Rust, empirical demonstration of KS in Rust, and finally best practices about QuickCheck (in Rust!)

      Really enjoyable read. The KS test has been on my todo list to add to Elasticsearch for a while, this article has motivated me to start hacking on it over the New Year’s holiday :)

      1. 6

        “Remember too that learning to manage memory safely in C/C++ is much harder than learning Rust.”

        I completely disagree with this statement. It’s too subjective.

        Both learning to manage memory safely in C and grasping the ownership abstraction in Rust are hard. Very hard. And it depends on your background. The fact is, we don’t have much time to invest in learning a completely different abstraction.

        “there is no compiler checking up on you in C/C++ to make sure your memory management is correct.”

        That is what tools like valgrind are for!

        1. 11

          As someone who only vaguely knows C++, I like Rust because it simply won’t let me compile something horribly broken. The difficulty / learning curve may ultimately be the same, but the timeline of feedback is very different.

          Rust forces me to confront my lack of knowledge immediately, or it simply won’t compile. Yes, this is hard. And yes, this can be very frustrating. But at least I won’t churn out some piece of code that superficially looks ok but is a ticking time bomb.

          In contrast, C/C++ will generally let me compile something that is horribly broken as long as it satisfies the language semantics. It’s only later that I’ll discover my dumb mistake, usually after I’ve moved on to different sections of code.

          As a newbie to manual memory management, that’s huge for me. I want to know right now that I messed something up, not hours/days later when things mysteriously start crashing or misbehaving. I don’t want to blissfully continue coding, thinking I’m doing things right, when in reality it’s all a house of cards waiting to tumble down.

          “That is what tools like valgrind for!”

          Eh, that’s like saying it’s ok that a knife cuts you because there are bandaids you can apply to your skin. It’d be better if the knife simply couldn’t cut skin (or your skin was impervious to knife cuts). Don’t get me wrong, tools like valgrind are great! But I’d prefer if the language was a bit more proactive in protecting my (dumb) self instead of relying on secondary tools.
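
          A minimal sketch of that immediate feedback, assuming a made-up helper called `consume`. Using a value after handing it off is the rough Rust analogue of a use-after-free, and the compiler refuses to build it rather than leaving it for a later tool to find:

          ```rust
          /// Hypothetical helper: takes ownership of the vector and returns its length.
          /// The vector is dropped (freed) when this function returns.
          fn consume(v: Vec<i32>) -> usize {
              v.len()
          }

          fn main() {
              let data = vec![1, 2, 3];

              // Ownership of `data` moves into `consume`; `data` is no longer usable here.
              let n = consume(data);

              // Uncommenting the next line makes the build fail with
              // error[E0382]: borrow of moved value: `data`
              // println!("{:?}", data);

              println!("consumed {} elements", n);
          }
          ```

          The equivalent mistake in C (reading a buffer after `free`) compiles cleanly and only shows up at runtime, if it shows up at all.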

          1. 2

            “Eh, that’s like saying its ok that a knife cuts you because there are bandaids you can apply to your skin. It’d be better if the knife simply couldn’t cut skin (or your skin was impervious to knife cuts).”

            I love your metaphor. But what if we don’t play with knives in the first place? For me, that’s like using valgrind as part of a continuous integration pipeline. No time bomb running in production. (Again, maybe it’s not as simple as running rustc, where if the build is OK you can believe there is no time bomb running in production.)

            I don’t mean to argue against Rust. I’m learning Rust too. And I agree about the timeline of feedback in Rust. It’s like iterative programming on steroids.

            1. 2

              “Eh, that’s like saying its ok that a knife cuts you because there are bandaids you can apply to your skin. It’d be better if the knife simply couldn’t cut skin (or your skin was impervious to knife cuts). Don’t get me wrong, tools like valgrind are great! But I’d prefer if the language was a bit more proactive in protecting my (dumb) self instead of relying on secondary tools.”

              This is why I trust only Luke Cage to code in C safely.

              1. 2

                Here is something that might be of interest then:

                http://www.tedunangst.com/flak/post/heartbleed-in-rust

                1. 3

                  Yep, I’ve read that…and I agree with it. I don’t think anyone is claiming that Rust (or other languages that strive to be safer) will protect you from everything. You can still live-lock yourself, reuse buffers, not validate input, call FFI with bad parameters, botch your unsafety, etc etc. It’s not a panacea.

                  But Rust does protect you from certain classes of bugs that C/C++ does not. I’ll take some over nothing any day :)
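
                  For the “reuse buffers” case, here is a hypothetical sketch (my own toy, not the code from the linked post; the `echo` function and its parameters are invented for illustration). It is entirely safe Rust, every access is bounds-checked, and it still leaks stale data because it trusts a caller-supplied length:

                  ```rust
                  /// Hypothetical echo handler: copies `payload` into a reused buffer and
                  /// replies with `claimed_len` bytes. The bug is trusting `claimed_len`
                  /// instead of the real payload length.
                  fn echo(buf: &mut [u8; 16], payload: &[u8], claimed_len: usize) -> Vec<u8> {
                      let n = payload.len().min(buf.len());
                      buf[..n].copy_from_slice(&payload[..n]);

                      // In bounds, so no panic and no memory unsafety, but bytes past `n`
                      // still hold whatever the previous request left behind.
                      let out_len = claimed_len.min(buf.len());
                      buf[..out_len].to_vec()
                  }

                  fn main() {
                      let mut buf = [0u8; 16];

                      // First request writes 15 bytes into the shared buffer.
                      echo(&mut buf, b"secret-password", 15);

                      // Second request sends 2 bytes but claims 15.
                      let leaked = echo(&mut buf, b"hi", 15);
                      println!("{}", String::from_utf8_lossy(&leaked)); // prints "hicret-password"
                  }
                  ```

                  The compiler has nothing to object to here: this is a logic error, not one of the memory-safety bugs Rust rules out.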