1. 44
  1. 10

    While I love mining data (the bulk of my Ph.D. was mining data from a set of experiments I did), I always remind myself that this is actually the wrong way of doing science and the wrong way of looking for effects, though, given real world constraints, a lot of research is done this way.

    The danger with this approach is identical to over-training a neural network: you will probably find some pattern, but the only way to verify it is to run a new, different experiment designed specifically to see if that signal exists, which of course we rarely do. No matter how compelling a story we make of the data, what we may have found is just some random construct: amplified noise.
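To make the "amplified noise" point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the sample size, the number of candidate features, and the helper names (`pearson`, `noise`, `mine_noise`) are all made up. It mines pure random data for the feature that best correlates with a random target, then measures how strong correlations look on freshly drawn data, which stands in for the follow-up experiment.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def noise(rng, n):
    """n independent standard-normal draws: no signal by construction."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def mine_noise(n_samples=30, n_features=500, n_replications=200, seed=0):
    """Return (best mined correlation, mean |correlation| on fresh data)."""
    rng = random.Random(seed)
    target = noise(rng, n_samples)
    features = [noise(rng, n_samples) for _ in range(n_features)]

    # "Mining": keep whichever feature happens to correlate best with the target.
    mined_r = max((pearson(f, target) for f in features), key=abs)

    # "New experiment": fresh feature and target draws, repeated many times.
    fresh = [
        abs(pearson(noise(rng, n_samples), noise(rng, n_samples)))
        for _ in range(n_replications)
    ]
    return mined_r, sum(fresh) / len(fresh)

mined_r, fresh_mean = mine_noise()
print(f"best mined correlation: {mined_r:+.2f}")
print(f"typical |correlation| on fresh data: {fresh_mean:.2f}")
```

With many candidate features and few samples, the mined correlation looks respectable even though nothing is there, while the fresh draws hover much closer to zero.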

    Tangentially, there is Simpson’s Paradox.
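For anyone who has not run into Simpson's Paradox before, a small sketch with hypothetical counts (the treatments, strata, and numbers below are invented purely for illustration): treatment A does better than B within every stratum, yet B looks better once the strata are pooled, because A was mostly given the hard cases.

```python
# Hypothetical (successes, trials) per treatment and case difficulty.
counts = {
    "A": {"easy": (9, 10), "hard": (60, 90)},
    "B": {"easy": (80, 90), "hard": (6, 10)},
}

def stratum_rate(treatment, stratum):
    """Success rate of one treatment within one stratum."""
    successes, trials = counts[treatment][stratum]
    return successes / trials

def pooled_rate(treatment):
    """Success rate of one treatment with all strata lumped together."""
    successes = sum(s for s, _ in counts[treatment].values())
    trials = sum(t for _, t in counts[treatment].values())
    return successes / trials

for stratum in ("easy", "hard"):
    print(stratum, f"A={stratum_rate('A', stratum):.2f}", f"B={stratum_rate('B', stratum):.2f}")
print("pooled", f"A={pooled_rate('A'):.2f}", f"B={pooled_rate('B'):.2f}")
```

The averaged view and the stratified view disagree about which treatment is better, so which one to trust depends on knowing how the cases were assigned.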

    1. 2

      Also, most of the time people lack training in Experimental Design because, well, it is not fun :)

    2. 5

      Embrace the complexity by rearranging, augmenting, and using the data itself to provide context.

      This reminds me of my all-time favourite quip. I no longer remember when I first heard it, or where. But it goes like this: a problem doesn’t get any simpler just because you’re too busy to hear all of it.

      The insidious cult of PowerPoint has given rise to a very strange way of doing things, where data is first cleaned up, averaged, and plotted, and the pretty graphs are used to draw conclusions. That’s the wrong way to go about it: first you ask the questions, then you try to get the answers from the data.

      Case in point: as an electrical engineer, the captions below the second and the third plot are exactly what I would want to know if someone were to ask for my advice regarding any non-trivial energy planning/policy/infrastructure problem. (There are likely others as well; I have only a basic familiarity with these things, my expertise is… closer to the milliamp end of the current scale).

      The first plot is basically useless for pretty much anything other than showing it during morning news to make TV viewers feel knowledgeable. It tells you something any first-year EE student could tell you – that (in places like California) average power consumption is higher during summer months – but that’s all you can glean from it. If you look carefully at the second plot, especially in the first half of the third quarter, you’ll notice it doesn’t even give you a good figure for the base load (which is super fun to explain to sophomore-year would-be engineers…).

      Questions like those implied by the captions (how large are demand variations? How abrupt? When do they occur? What are they correlated with? How do they vary?) are the ones from which you start solving virtually any problem related to managing and developing energy delivery – not just at the production/transport/delivery end (power plants, power grids) but at the consumer end, too (better and more power-efficient ACs, refrigerators, whatever).

      Granted, not drawing too many conclusions based on averages is pretty much just good data science. That being said, IMHO, if an organisation finds itself relying on data scientists to come up with those plots by themselves, what it lacks is not competent data scientists, but domain expertise and, more often than not, competent management.

      Tangential: here’s a fun read.

      1. 3

        There were some really nice examples in here. I sometimes have a hard time applying some of Tufte’s visualization principles, and this article gave me some new ideas. I especially liked the concept of “background data” that gives the viewer a basis of comparison when analyzing a subset.

        1. 2

          This was kind of mind-blowing. I’ll definitely be thinking about it next time I try charting/investigating metrics at work.

          1. 2

            As an ops person creating lots of graphs, I absolutely love it, even though it mentions “heatmap” by name only once. The first thing I check for in every monitoring solution is: does it allow displaying heatmaps? It’s such a basic thing and yet it is missing almost everywhere. Grafana gets top scores for making it available for any non-preaggregated data.

            Line plots are so easy to make and almost always wrong.
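A rough sketch of why, using made-up latency samples (the numbers, bucket sizes, and function names are assumptions for illustration, not taken from any monitoring tool): half the requests are fast and half are slow, so the per-minute mean that a line plot would draw lands at a value no request actually had, while the bucketed counts behind a heatmap show both clusters.

```python
def make_samples(n_minutes=3):
    """Hypothetical latencies in ms: 50 fast and 50 slow requests per minute."""
    samples = []
    for minute in range(n_minutes):
        for i in range(50):
            samples.append((minute, 10 + i % 5))    # fast cluster, 10-14 ms
            samples.append((minute, 200 + i % 5))   # slow cluster, 200-204 ms
    return samples

def mean_line(samples, n_minutes):
    """What a typical line plot shows: one averaged value per minute."""
    totals = [0] * n_minutes
    counts = [0] * n_minutes
    for minute, latency in samples:
        totals[minute] += latency
        counts[minute] += 1
    return [total / count for total, count in zip(totals, counts)]

def heatmap(samples, n_minutes, bucket_ms=50, n_buckets=5):
    """What a heatmap shows: request counts per (minute, latency bucket)."""
    grid = [[0] * n_buckets for _ in range(n_minutes)]
    for minute, latency in samples:
        bucket = min(latency // bucket_ms, n_buckets - 1)
        grid[minute][bucket] += 1
    return grid

samples = make_samples()
print("mean per minute:", mean_line(samples, 3))
for row in heatmap(samples, 3):
    print("bucket counts:  ", row)
```

The mean sits in the 100-150 ms bucket, which is empty in every row of the grid.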

            1. 1

              This article contains some pretty great visualizations and advice! I just learned about Observable Plot and it seems like a good complement to ggplot2/plotnine and matplotlib.

              But to get to the point of good visualizations, one usually needs to do some data processing and get the data into the right shape. So far, I’ve been using pandas (as we use Python), but that experience is usually quite frustrating. Does anyone have good recommendations for processing data in a comfortable way?

              1. 1

                Obligatory reference to Anscombe’s quartet:
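The quartet's point can be checked in a few lines of plain Python. This is a small sketch (the helper names are mine) that computes the mean of x, the mean of y, and the Pearson correlation for each of the four datasets; the summaries come out nearly identical even though the scatter plots look nothing alike.

```python
# Anscombe's quartet: datasets I-III share the same x values, IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def mean(values):
    return sum(values) / len(values)

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def summarize(xs, ys):
    """Return (mean of x, mean of y, correlation) for one dataset."""
    return mean(xs), mean(ys), correlation(xs, ys)

for name, (xs, ys) in quartet.items():
    mx, my, r = summarize(xs, ys)
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

Only plotting the points reveals that one set is roughly linear, one is a curve, and two are dominated by a single outlier.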