Thanks for putting this together! I’m interested in doing some natural language processing, but I’m not sure if it falls under the heading of sentiment analysis. I’d like to extract common themes from a bunch of text snippets, which is sort of like collocations, but with perhaps a little more fuzz (i.e., a theme doesn’t have to be word-for-word identical). One example would be Yelp’s review highlights (“15 people said something like this…”).
Do you know what tools or techniques people tend to use for this?
If you are looking at reviews (typically subjective text expressing opinions), this could fall under aspect-based sentiment analysis. Getting the themes out of texts can be seen as a topic modeling task. Hope that helps :)
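To give a rough feel for what "fuzzy themes" could mean in code, here is a minimal sketch. It is not a topic model and not aspect-based sentiment analysis; it swaps in a much simpler technique, greedy grouping by Jaccard word overlap, just to illustrate the idea. The stopword list, the 0.4 threshold, the sample snippets, and all function names are assumptions made up for this example.

```python
import re

# Tiny hand-picked stopword list (an assumption; a real one would be much larger).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "and", "of", "to", "it", "i", "very", "really"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_snippets(snippets, threshold=0.4):
    """Greedily put each snippet into the first group whose first member
    (the group's representative) is similar enough; otherwise start a new group."""
    groups = []  # each group is a list of (snippet, token_set) pairs
    for snippet in snippets:
        toks = tokens(snippet)
        for group in groups:
            if jaccard(toks, group[0][1]) >= threshold:
                group.append((snippet, toks))
                break
        else:
            groups.append([(snippet, toks)])
    return [[snippet for snippet, _ in group] for group in groups]


# Hypothetical review snippets, invented for this example.
reviews = [
    "The garlic fries are amazing",
    "Loved the garlic fries",
    "Service was slow",
    "Really slow service",
    "Parking is hard to find",
]

for group in group_snippets(reviews):
    print(f"{len(group)} people said something like: {group[0]}")
```

Word overlap like this only catches snippets that share vocabulary, so paraphrases with different words would land in separate groups; that gap is where topic models and similar techniques come in.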
This looks quite useful, thanks!
The only real caveat I’d add is that you should be careful about assuming off-the-shelf implementations will work for your problem: make sure to look at the model assumptions, validate whether the models are accurate on your own domain, etc.
Two common pitfalls:
Many papers and systems are explicitly or implicitly targeted at specific kinds of texts and/or specific kinds of sentiments. The less your texts resemble the ones they were built for, the bigger the grain of salt you should take their outputs with. A large proportion of research targets either mining positive/negative sentiment from product reviews, or mining positive/negative sentiment from news articles. Both of these are pretty structured domains, with relatively short texts and generally not a “literary” style of writing. There are lots of other things you might want to use sentiment analysis for; for example, I’ve seen people want to use it to graph the sentimental trajectories of novels. If you use a system designed for classifying product reviews on a novel, it’ll produce numbers and you can graph the result, but it may or may not show what you claim it does.
Be extra careful if using the extracted sentiment in further processing, since errors can be magnified and, worse, are often biased. For example, if you are correlating sentiment with some other variable, you probably can’t assume (without some kind of validation) that even a 95%-accurate sentiment classifier won’t completely skew your data: if those 5% misclassifications are highly non-random, they can throw everything in the analysis off. And they often are non-random: sentiment classifiers more often fail on whole categories of sentences than misclassify sentences with uniformly random probability.
The short version of this is probably: don’t assume sentiment analysis is a solved problem where you can download something that just works (except in very specific cases).
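To make the second pitfall concrete, here is a toy sketch. Nothing in it comes from a real classifier or dataset: the "plain" and "sarcastic" categories, the counts, and the function names are all invented for illustration. It builds two hypothetical 95%-accurate classifiers, one whose errors are spread evenly and one that fails on a whole category, and compares what each reports for the share of positive sentences per category.

```python
from collections import defaultdict


def make_corpus():
    """Toy corpus of (category, true_label) pairs; 1 = positive, 0 = negative.
    Both categories are truly 50% positive."""
    corpus = [("plain", 1)] * 900 + [("plain", 0)] * 900
    corpus += [("sarcastic", 1)] * 100 + [("sarcastic", 0)] * 100
    return corpus


def uniform_errors(corpus):
    """Flip every 20th item within each (category, label) cell,
    so the 5% of errors are spread evenly over the data."""
    seen = defaultdict(int)
    predictions = []
    for category, label in corpus:
        seen[(category, label)] += 1
        flip = seen[(category, label)] % 20 == 0
        predictions.append(1 - label if flip else label)
    return predictions


def biased_errors(corpus):
    """Read every negative sarcastic sentence as positive; everything else is correct."""
    return [
        1 if (category == "sarcastic" and label == 0) else label
        for category, label in corpus
    ]


def accuracy(corpus, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(pred == label for (_, label), pred in zip(corpus, predictions))
    return correct / len(corpus)


def positive_rate(corpus, predictions, category):
    """Share of sentences in one category that the classifier calls positive."""
    preds = [pred for (cat, _), pred in zip(corpus, predictions) if cat == category]
    return sum(preds) / len(preds)


corpus = make_corpus()
for name, classify in [("uniform", uniform_errors), ("biased", biased_errors)]:
    preds = classify(corpus)
    print(
        f"{name}: accuracy={accuracy(corpus, preds):.2f}, "
        f"plain positive rate={positive_rate(corpus, preds, 'plain'):.2f}, "
        f"sarcastic positive rate={positive_rate(corpus, preds, 'sarcastic'):.2f}"
    )
```

Both toy classifiers score 95% accuracy, but the evenly spread errors cancel out and leave the sarcastic positive rate at its true 0.50, while the biased one reports 1.00. Any downstream analysis comparing sentiment across the two categories would then find a difference that does not exist in the data.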
Thanks for the remark! That is important information that I left out. I’ve added the warning in the relevant section: https://github.com/xiamx/awesome-sentiment-analysis#open-source-implementations
This has been posted to Lobsters a couple of times before but your comment really reminds me of this Google paper talking about pitfalls of machine learning.
Machine Learning: The High Interest Credit Card of Technical Debt
Author here. We use this approach to write a large REST API running in production. It has helped us catch a few bugs early in the development stage. Would love to hear what you guys think.