
  2. 6

    Can’t wait to download all of gmail’s spam training data…

    1. 5

      How do you “explain” a multilayer neural network?

      1. 12

        That’s a good point, and one of the reasons that certain industries can’t use neural networks. I’ve heard that credit card companies have to use something like a decision tree because they have to be able to prove that race wasn’t a factor in the decision.
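
        Roughly, the appeal is that a fitted tree can be printed as explicit rules an auditor can read, which a network’s weight matrices don’t give you. A toy sketch (scikit-learn, made-up feature names, not any lender’s actual pipeline):

        ```python
        # A small decision tree's logic can be dumped as human-readable rules,
        # so each decision traces back to explicit thresholds on named features.
        from sklearn.datasets import make_classification
        from sklearn.tree import DecisionTreeClassifier, export_text

        X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
        tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

        # Hypothetical feature names, purely for illustration.
        print(export_text(tree, feature_names=["income", "debt_ratio", "tenure", "utilization"]))
        ```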

        1. 7

          Wow, that’s interesting to know. Leaving aside the clash of politics and science/engineering, why can’t they just use an NN and leave out race data from the feature dimensions? I would expect the result to be the same.

          1. 11

            From TFA:

            It is important to note that paragraph 38 and Article 11 paragraph 3 specifically address discrimination from profiling that makes use of sensitive data. In unpacking this mandate, we must distinguish between two potential interpretations. The first is that this directive only pertains to cases where an algorithm is making direct use of data that is intrinsically sensitive. This would include, for example, variables that code for race, finances, or any of the other categories of sensitive information. However, it is widely acknowledged that simply removing certain variables from a model does not ensure predictions that are, in effect, uncorrelated to those variables (e.g. Leese (2014); Hardt (2014)). For example, if a certain geographic region has a high number of low income or minority residents, an algorithm that employs geographic data to determine loan eligibility is likely to produce results that are, in effect, informed by race and income.

            Thus a second interpretation takes a broader view of ‘sensitive data’ to include not only those variables which are explicitly named, but also any variables with which they are correlated. This would put the onus on a data processor to ensure that algorithms are not provided with datasets containing variables that are correlated with the “special categories of personal data” in Article 10.

            However, this interpretation also suffers from a number of complications in practice. With relatively small datasets it may be possible to both identify and account for correlations between sensitive and ‘non-sensitive’ variables. However, as datasets become increasingly large, correlations can become increasingly complex and difficult to detect. The link between geography and income may be obvious, but less obvious correlations—say between browsing time and income—are likely to exist within large enough datasets and can lead to discriminatory effects (Barocas & Selbst, 2016). For example, at an annual conference of actuaries, consultants from Deloitte explained that they can now “use thousands of ‘non-traditional’ third party data sources, such as consumer buying history, to predict a life insurance applicant’s health status with an accuracy comparable to a medical exam” (Robinson et al., 2014). With sufficiently large data sets, the task of exhaustively identifying and excluding data features correlated with “sensitive categories” a priori may be impossible. The GDPR thus presents us with a dilemma with two horns: under one interpretation the non-discrimination requirement is ineffective, under the other it is infeasible.
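
            As a toy sketch of that proxy effect (made-up data, not from the paper): leave the sensitive attribute out of the feature set, keep a correlated geographic variable, and the model’s outputs still track the attribute.

            ```python
            # The sensitive attribute 'group' is never an input,
            # but a correlated proxy (region) carries it into the predictions.
            import numpy as np
            from sklearn.linear_model import LogisticRegression

            rng = np.random.default_rng(0)
            n = 10_000
            group = rng.integers(0, 2, n)               # sensitive attribute, excluded from X
            region = group + rng.normal(0, 0.3, n)      # geographic proxy correlated with group
            income = rng.normal(50 - 10 * group, 5, n)  # income also correlated with group
            approved = (income + rng.normal(0, 5, n) > 45).astype(int)

            X = np.column_stack([region, income])       # note: no 'group' column
            scores = LogisticRegression().fit(X, approved).predict_proba(X)[:, 1]

            # Predicted approval rates still differ sharply by group.
            print("group 0:", scores[group == 0].mean().round(3))
            print("group 1:", scores[group == 1].mean().round(3))
            ```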

            1. 4

              Right. And depending on your threshold for correlated, you can’t use ANY variable.

              It’s also interesting that gender, marital status and age are not excluded - at least in the US. Car insurance rates are gender, age and marital status dependent.

              1. 1

                Right. And depending on your threshold for correlated, you can’t use ANY variable.

                That’s the second horn of the dilemma mentioned in the last line of the quoted passage.

                It’s also interesting that gender, marital status and age are not excluded - at least in the US. Car insurance rates are gender, age and marital status dependent.

                I think this is because it is possible to make a specific business case for it; all three are considered protected classes, and discrimination on those grounds is forbidden in other contexts (like employment and housing).

                Also, responding to your previous comment: there are variables, like name, that can serve as high-fidelity proxies for e.g. race or sex.

              2. 2

                Is there any way to look for correlations with protected classes in the data, and remove those correlations, while still preserving the ability to make inferences from whatever information remains?

                1. 1

                  If there is, it’s probably related to differential privacy. It is subject to a problem of incentives, though; what motivation does anyone have to make the filter good?
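
                  For linear correlations, at least, one crude option is to regress each feature on the protected attribute and keep only the residuals. A rough NumPy sketch on toy data (it removes linear correlation only; non-linear dependence can survive, and the incentive problem remains):

                  ```python
                  # Residualize features against a protected attribute:
                  # removes linear correlation, not all statistical dependence.
                  import numpy as np

                  def decorrelate(X, protected):
                      A = np.column_stack([np.ones(len(protected)), protected])  # intercept + attribute
                      coef, *_ = np.linalg.lstsq(A, X, rcond=None)               # per-feature regression
                      return X - A @ coef                                        # keep the residuals

                  rng = np.random.default_rng(0)
                  protected = rng.integers(0, 2, 1000).astype(float)
                  X = np.column_stack([protected + rng.normal(0, 1, 1000),  # correlated feature
                                       rng.normal(0, 1, 1000)])             # unrelated feature

                  X_clean = decorrelate(X, protected)
                  print(np.corrcoef(X[:, 0], protected)[0, 1])        # clearly non-zero
                  print(np.corrcoef(X_clean[:, 0], protected)[0, 1])  # ~0 after residualizing
                  ```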

          2. 3

            Map each connection to an input for a Markov generator?

            1. 4

              :) One could also print out the weight matrix …