1. 11
  1.  

  2. 16

    As usual with science reporting, the headline doesn’t quite match the study’s conclusions. I’ve linked directly to the study below for convenience. The headline is true in nearly every real scenario, but there isn’t some theoretical limit, as its wording seems to suggest.

    What the study does do is demonstrate how easy re-identification is, in practice, with commonly used anonymization techniques. This is not news to anyone who works in privacy. We’re all well aware that, for example, medical research has never paid more than lip service to patient privacy, and that the anonymization techniques considered standard in medicine are in fact extremely weak.

    Sometimes, even when everyone already knows the conclusion, you need a study that says that exact thing, so that policymakers can no longer pretend not to understand it. I believe that is the case here, and I’m glad to see this research.

    https://www.nature.com/articles/s41467-019-10933-3

    1. 2

      Well said all around. It bears noting that “pretending not to understand” often seems like a job-description bullet point for many policymakers, at least at the higher, more elected levels. :-) So this helps, but it probably can’t fully solve that problem. :)

      1. 2

        It is totally unreasonable for someone to be identified simply by knowing the last four digits of their Social Security number, place of birth*, gender, race, approximate age, and favourite teacher :)

        But more seriously, I agree: the effectiveness of anonymization is greatly overstated by industries that depend on user data for commercial purposes. Meanwhile, in things like medical studies, many of the attributes functionally required for the data to be useful are also useful for de-anonymization (race, age, where someone lives or lived, family history, etc. can all be legitimately relevant to such studies).

        But as far as industry goes: don’t listen to companies that say they anonymize your data if they sell it (because purchasers will aggregate sources) or use it for their own revenue. If they can’t identify you uniquely, the data has no value; not directly linking your name to all the data is what passes for “anonymization”.

        * I know that until relatively recently SSN and place of birth were correlated, but let’s ignore that
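        The “purchasers will aggregate sources” point can be sketched in a few lines. This is a toy illustration with entirely made-up records and column names: a dataset stripped of names but keeping quasi-identifiers is joined against a second dataset that has both names and the same quasi-identifiers.

```python
# Hypothetical linkage attack: all records and field names below are invented
# for illustration, not taken from any real dataset.

# Dataset A: "anonymized" medical records (names removed, quasi-identifiers kept)
medical = [
    {"zip": "12345", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "67890", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]

# Dataset B: a purchased marketing list with names and the same quasi-identifiers
marketing = [
    {"name": "Alice Example", "zip": "12345", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "zip": "67890", "birth_year": 1975, "sex": "M"},
]

QUASI = ("zip", "birth_year", "sex")

def link(anonymized_rows, named_rows):
    """Re-identify 'anonymized' rows by matching on shared quasi-identifiers."""
    index = {tuple(r[k] for k in QUASI): r for r in named_rows}
    linked = []
    for rec in anonymized_rows:
        match = index.get(tuple(rec[k] for k in QUASI))
        if match:
            linked.append({"name": match["name"], "diagnosis": rec["diagnosis"]})
    return linked

print(link(medical, marketing))
# Each "anonymized" diagnosis is now attached to a name.
```

        With only three quasi-identifier columns the join is exact here; real attacks tolerate fuzzier matches, which only makes the problem worse.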
      2. 1

        A counterexample would seem to be a dataset of type Identity × Boolean, where anonymization drops the Identity coordinate. That’s definitely totally anonymous.

        1. 4

          As with everything, that depends on what the data is, and on what other information the attacker has. Suppose it’s an academic course roster, and the boolean represents who passed. Suppose the attacker was a student in that course and already knows several other people’s grades. In that case, they might gain new information even from something so simple as the count of how many trues and how many falses the data set contains.

          Of course that’s a contrived example, but the point is that there are very few safe generalizations about anonymization. If you’re not using some mathematically rigorous framework such as differential privacy, you’ve always got risks like the above.

          1. 2

            I could see that happening quite easily. My university used to post test and assignment grades on a board next to student ID numbers, which wouldn’t take much work to de-anonymize. But even genuinely random IDs for a study would need to stay consistent for each student across multiple courses, and if the course sizes were small enough, correlating the random numbers to actual people would likely not be too difficult even before you got to grades. (Postgrad CS courses at my university ranged from 2 students, largely through attrition intentionally caused by the lecturer, to maybe 25.)