    I mean, yes, it’s cheating. But it’s also pretty brilliant. Perhaps they should have used the scraped data as additional training data, gaining an advantage either by overfitting or by actually using it correctly to build a better model. Something, something, labeled data is key to ML.

      Seems like a bad rule, IMO. If the actual data for an entry is available, why predict anything? It gives better results that way, and isn’t that the real goal?

      Of course, breaking the rule gave them an unfair advantage because others followed it, so I understand the disqualification in that sense. For the best results, they should rerun the contest and allow all of the AI models to break this rule.

        But then it becomes a return-the-existing-data competition, which wouldn’t be very useful or interesting.

          I disagree. Knowing the exact answer because there’s an actual measurement is a special case.

          But skimming the article again just now, I think my point is moot; I misunderstood how the cheat worked. The problem isn’t that they returned the actual data for entries their model had been trained on, but that they scraped PetFinder’s website to expand their training data and returned the actual results for entries their program wasn’t supposed to have seen before. So instead of training on a subset of PetFinder’s database and being tested on a different subset, they effectively trained on the entire database.
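
          To make that concrete, here is a minimal, hypothetical Python sketch of that kind of leakage; the function names, pet IDs, and label values are made up, and the real entry no doubt obfuscated its lookup far better. Once the scraped labels cover the hidden test split, “prediction” degenerates into a dictionary lookup, with a genuine model left only as a fallback.

          ```python
          # Hypothetical illustration of test-set leakage via scraped labels.
          # Names and values are invented; this is not the actual competition code.
          from typing import Dict, List, Optional

          def train_model(train_labels: Dict[str, int]):
              """Stand-in for a legitimately trained model (details don't matter here)."""
              def predict(entry_id: str, features: List[float]) -> int:
                  return 0  # dummy baseline prediction
              return predict

          # Legitimate setup: the model only ever sees the official training split.
          official_train_labels = {"pet_001": 3, "pet_002": 1}
          model = train_model(official_train_labels)

          # The cheat: labels scraped from the live PetFinder site also cover the
          # hidden test split, so those entries need no model at all.
          scraped_labels: Dict[str, int] = {"pet_001": 3, "pet_002": 1, "pet_900": 4}

          def cheating_predict(entry_id: str, features: List[float]) -> int:
              known: Optional[int] = scraped_labels.get(entry_id)
              if known is not None:
                  return known                    # return the real outcome, no prediction involved
              return model(entry_id, features)    # fall back to the genuine model

          print(cheating_predict("pet_900", []))  # 4 -- a "perfect prediction" for an unseen test entry
          ```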