1. 16
  1.  

  2. 9

    I started the article thinking “while this does sound unfortunate, is it really a big deal?”. And part way through, this paragraph really stuck out to me:

    Any bias contained in word embeddings like those from Word2vec is automatically passed on in any application that exploits it. One example is the work using embeddings to improve Web search results. If the phrase “computer programmer” is more closely associated with men than women, then a search for the term “computer programmer CVs” might rank men more highly than women.

    But it totally answered my own question: Yes, this is a big deal!

    This led me to another question: what other applications that my work depends on enforce a potentially negative status quo? Nothing immediately comes to mind, but that’s not to say they don’t exist; perhaps it’s just not something I’m used to thinking about…
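    The concern in the quoted paragraph can be sketched with toy numbers. Everything below is made up for illustration (hand-picked 2-D vectors, not the actual Word2vec embeddings); it only shows how a cosine-similarity gap between “programmer”/“man” and “programmer”/“woman” is the kind of signal a downstream ranker could inherit:

```python
# Toy sketch, NOT the real Word2vec corpus: hand-made 2-D vectors
# illustrating the association the article describes.
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings where "programmer" sits closer to "man".
vec = {
    "man":        [1.0, 0.2],
    "woman":      [0.2, 1.0],
    "programmer": [0.9, 0.3],
}

# A positive gap means "programmer" leans toward "man" in this toy space.
bias = cosine(vec["programmer"], vec["man"]) - cosine(vec["programmer"], vec["woman"])
print(round(bias, 3))
```

    Any application that scores results by this kind of similarity inherits the gap automatically, which is exactly the article’s point.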

    1. 6

      Or it might not? Seems like a bug for the association to outrank the literal text. This is testable, no? Build a search engine and see what happens.

      Now, as for the actual test they did perform: man : programmer :: woman : homemaker. They say they’ve corrected the bias, but what’s the new result? If I query man : programmer :: woman : X against the fixed data, what do I get?
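      For reference, analogy queries of the a : b :: c : X form are usually answered by taking the nearest neighbor of vec(b) − vec(a) + vec(c). A minimal sketch over made-up toy vectors (answering the question for real would need the trained, debiased embeddings, which the article doesn’t show):

```python
# Toy analogy query; vectors and vocabulary are invented for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

vec = {
    "man":        [1.0, 0.1, 0.0],
    "woman":      [0.1, 1.0, 0.0],
    "programmer": [1.0, 0.1, 0.9],
    "homemaker":  [0.1, 1.0, 0.3],
    "engineer":   [0.6, 0.6, 0.9],
}

def analogy(a, b, c):
    """Return the word x maximizing cos(vec[x], vec[b] - vec[a] + vec[c])."""
    target = [vb - va + vc for va, vb, vc in zip(vec[a], vec[b], vec[c])]
    candidates = [w for w in vec if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vec[w], target))

print(analogy("man", "programmer", "woman"))  # → "homemaker" in this toy space
```

      With these toy numbers the biased answer beats “engineer”; the point of the debiasing work is to change which neighbor comes out on top, and the commenter is right that the corrected answer is the interesting unreported result.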

      1. 4

        Certainly, there is absolutely no reason to believe a production search engine will naively use said corpus and perform ranking based on word association frequency. It’s an interesting point that I had overlooked, and it gave me a “eureka” moment more than anything.

        However, I don’t believe it’s necessarily a “bug” for associations of certain kinds to outrank literal text; after all, that’s one of the main reasons Google eventually won against AltaVista, Yahoo!, and the other search engines of its time. It placed “associations” (of a certain kind) above literal text!

        1. 3

          Or it might not? Seems like a bug for the association to outrank the literal text.

          The problem isn’t the association outranking the literal text. It’s what happens when you have more matching literal text than can be displayed, and you need to pick what’s relevant. To pick a more neutral example: if people associate ‘computer programmer’ with ‘writing code’, then it makes sense to rank articles mentioning both ‘programmer’ and ‘writing code’ above ones that mention ‘programmer’ and ‘fishing’.

          This is fine, and totally expected. And it would also be the correct thing to do for ‘programmer’ and ‘male’, if it weren’t for the fact that society is actively trying to erase gender associations for certain professions.
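          To make that scenario concrete, here is a toy ranking sketch with made-up documents and association weights: every document matches the literal query, and the association score is only used to order them.

```python
# Toy ranking sketch: all documents contain the literal query term, and
# association weights break the tie. Names and numbers are invented.
docs = {
    "intro to programming":       {"programmer", "writing code"},
    "programmer fishing trip":    {"programmer", "fishing"},
    "code style for programmers": {"programmer", "writing code", "style"},
}

# Hypothetical association weights for terms co-occurring with the query.
association = {"writing code": 0.9, "fishing": 0.1, "style": 0.4}

def rank(query, docs):
    # Keep only literal matches, then order by the summed association
    # weight of the document's other terms.
    hits = [d for d, terms in docs.items() if query in terms]
    return sorted(hits, key=lambda d: -sum(association.get(t, 0.0) for t in docs[d] - {query}))

print(rank("programmer", docs))
```

          Swap ‘writing code’ for a gendered term and the same tie-breaking mechanism produces the kind of skew the thread is debating.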

          1. 2

            Ok, sure, but if I search for programming and get random articles from men’s health, gender bias aside, that’s just shitty results. If I get dungeons and dragons results, equally shitty, but not sexist.

            Full disclosure: I’ve not been impressed with TR after deciding their reporting could be better categorized as speculative fiction. They pick some experiment, often even a failed one, and then ask “but what if it worked?!?”. The article is written to really hype the if: scientists announce they’ve fused two hydrogen atoms; now all they need to do is scale it up a billion times and they’ve solved the energy crisis!

            This article doesn’t quite follow the mold, but it’s close. They’ve identified a problem, although it seems more theoretical than observed. They’ve got a solution. Just enumerate every combination of sexist and racist associations, then have an army of underpaid but totally unbiased humans evaluate them. And of course, despite having solved the problem, they didn’t actually demonstrate it working. As I noted, what is the corrected analogy after warping?

            I don’t question that ML techniques can perpetuate biases. I just think there are more motivating examples than this one.

            As a point not even addressed here, what do we do about correlations that are correct but “wrong”?
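            For what it’s worth, the “warping” in question, assuming the article is describing the hard-debias method of Bolukbasi et al., is a projection: each gender-neutral word vector has its component along a learned gender direction g removed, v′ = v − (v·g / g·g)g. A toy sketch with made-up numbers:

```python
# Hard-debias "neutralize" step as a projection; all numbers are invented.
def project_out(v, g):
    # Remove from v its component along direction g.
    scale = sum(x * y for x, y in zip(v, g)) / sum(x * x for x in g)
    return [x - scale * y for x, y in zip(v, g)]

g = [1.0, -1.0]          # toy gender direction (e.g. "he" minus "she")
programmer = [0.8, 0.2]  # leans toward the "he" end in this toy space

debiased = project_out(programmer, g)
print(debiased)  # → [0.5, 0.5]: no remaining component along g
```

            After the projection the toy ‘programmer’ vector is orthogonal to g, so the analogy arithmetic no longer pulls it toward either end of the gender direction; which word then wins the corrected analogy is exactly the unreported result being asked about.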

      2. 3

        AI says the darndest things. I’m sure if we feed it some public school curriculum data it will come out just fine.