1. 12
  1. 3

    Victor: I wonder what percentage of the data set was used to train the model, and whether the test results (i.e., accuracy, F1 score) come from the portion held out during training. That would be good to note alongside the results.

    Also, do you think more accurate results could be achieved by moving beyond an [unordered] bag-of-words model? For example, would an RNN (specifically, an LSTM) for sequence classification perform better? Here’s an example of what I mean. It seems like a good portion of the “profane/hate speech” requires more than one word to cross the line, as it were.
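
    A quick illustration of what “unordered” means here — a bag-of-words representation discards word order entirely, so two sentences with the same words in different orders become indistinguishable to any downstream classifier (sketch using scikit-learn’s `CountVectorizer`):

    ```python
    from sklearn.feature_extraction.text import CountVectorizer

    # Two sentences with very different meanings but identical word counts.
    docs = ["the dog bit the man", "the man bit the dog"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)

    # Under a bag-of-words model both sentences map to the exact same vector,
    # so no classifier trained on these features can tell them apart.
    print((X[0] != X[1]).nnz == 0)  # True: the rows are identical
    ```

    A sequence model like an LSTM, by contrast, sees the tokens in order, which is exactly what multi-word phrases would seem to require.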

    1. 5

      hey, good questions. I actually experimented with a lot of different models (including LSTM-based ones), and the ones that were more accurate than the BOW model achieved that at a huge cost in speed. Since this library is intended to be accurate but also performant, I went with the BOW model: it’s quite robust in many cases while also being extremely fast.

      The train/test split was 80/20, and the test results were of course computed on the unseen 20%. I followed standard procedure when experimenting.
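
      A minimal sketch of the setup described above — an 80/20 split with a bag-of-words pipeline. The actual training data and classifier aren’t shown in the thread, so the toy corpus and the `LinearSVC` here are stand-ins:

      ```python
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import LinearSVC

      # Toy stand-in corpus; the real training data is much larger.
      texts = ["you are awful", "i hate you", "what a lovely day",
               "great work", "terrible awful person", "hate this so much",
               "nice job friend", "have a wonderful day",
               "you awful hateful person", "lovely great stuff"]
      labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]  # 1 = offensive, 0 = clean

      # 80/20 split: the model never sees the held-out 20% during training.
      X_train, X_test, y_train, y_test = train_test_split(
          texts, labels, test_size=0.2, random_state=0)

      model = make_pipeline(CountVectorizer(), LinearSVC())
      model.fit(X_train, y_train)
      print(model.score(X_test, y_test))  # accuracy on the unseen 20%
      ```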

    2. 3

      This is absolutely terrific! I’ve also loved how short scikit-learn-based code is. I recently sat with my daughters, 10 and 12, to write a missing-shoe detector in scikit-learn; the actual meat of the program was about eight lines of code. All very high-level concepts that the children could grasp.

      And now if they find a wayward shoe, all they have to do is measure it, weigh it, and mention its color and Python will let them know to which family member the shoe belongs!
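
      The original program isn’t shown, but a shoe detector along those lines might look something like this — all measurements, names, and the choice of a decision tree are made up for illustration:

      ```python
      from sklearn.tree import DecisionTreeClassifier

      # Hypothetical measurements: [length_cm, weight_g, color_code]
      # (color encoded as a number, e.g. 0 = black, 1 = white, 2 = red).
      shoes = [
          [19.0, 150, 2], [20.0, 160, 2],   # younger daughter's shoes
          [23.0, 220, 1], [23.5, 230, 0],   # older daughter's shoes
          [27.0, 400, 0], [28.0, 420, 0],   # a parent's shoes
      ]
      owners = ["younger", "younger", "older", "older", "parent", "parent"]

      # The "meat" really is only a few lines:
      clf = DecisionTreeClassifier(random_state=0)
      clf.fit(shoes, owners)

      # A stray shoe turns up: measure it, weigh it, note the color,
      # and ask the model whose it is.
      print(clf.predict([[22.8, 225, 1]])[0])
      ```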