We got complete documentation about the state of spellcheckers (or what is mostly accepted as one), what problems they have, and what a better solution would look like. I think this is very valuable grumbling, even though you may not have found the peace to write a holistic solution :)
Would not texts from major books and other manually-checked publications provide a reasonable basis for training data?
Having trained a neural network on a large corpus before: gathering data is only half the battle. You still have to clean your data and run the training, and in this case you may also want examples of incorrect spellings along with their corrections (that is, sentences with typos and the fixed versions).
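Since real typo-and-fix pairs are scarce, one common workaround (my own suggestion, not something from the post) is to synthesize them from clean sentences. A minimal sketch in Python; the adjacency map and edit types are toy placeholders:

```python
import random

# Toy keyboard-adjacency map; a real one would cover the whole layout.
ADJACENT = {"a": "qwsz", "e": "wsdr", "o": "ikp", "t": "rfgy"}

def inject_typo(word: str, rng: random.Random) -> str:
    """Return a plausible misspelling of `word` via one random edit."""
    if len(word) < 3:
        return word
    i = rng.randrange(len(word))
    kind = rng.choice(["swap", "drop", "neighbour"])
    if kind == "swap" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "drop":
        return word[:i] + word[i + 1:]
    neighbours = ADJACENT.get(word[i])
    if neighbours:
        return word[:i] + rng.choice(neighbours) + word[i + 1:]
    return word

def make_pairs(sentences, seed=0):
    """Yield (noisy_sentence, clean_sentence) training pairs."""
    rng = random.Random(seed)
    for sent in sentences:
        words = sent.split()
        j = rng.randrange(len(words))
        noisy = words[:j] + [inject_typo(words[j], rng)] + words[j + 1:]
        yield " ".join(noisy), sent

for noisy, clean in make_pairs(["the cat sat on the mat"]):
    print(noisy, "->", clean)
```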
And shoving text from books into your training data corpus can run into Extremely Fun Copyright Issues.
No, because they provide only half of the puzzle. A spell checker, at its most reductionist, is a map from misspellings to correct spellings. In actual implementation, it’s a lossily compressed map from misspellings to correct spellings, using a lot of clever tricks to make the loss low and the compression high. A set of books provides you with the output of this map, but not its input, and the input is the most valuable bit. If you’ve got a load of volunteers, you could ship a spell checker that has only the dictionary of correct things and records the things that people type and what they correct them to. I believe this is how a lot of on-screen keyboards for mobile things work: they collect the corrections that people make, aggregate them, send them to the provider, and then train a neural network (a generalised data structure for lossily encoding map functions) on the output.
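For concreteness, a toy sketch of that aggregation step, assuming hypothetical (typed, corrected-to) events; the min_count cutoff is my own guess at how you’d keep one-off personal fixes out of the shared map:

```python
from collections import Counter, defaultdict

# Hypothetical correction events, as an on-screen keyboard might report
# them: (what the user typed, what they corrected it to).
events = [
    ("teh", "the"), ("teh", "the"), ("adress", "address"),
    ("teh", "ten"), ("recieve", "receive"),
]

def aggregate(events, min_count=2):
    """Collapse raw correction events into a misspelling -> correction
    map, keeping only corrections seen at least `min_count` times."""
    by_typo = defaultdict(Counter)
    for typed, fixed in events:
        by_typo[typed][fixed] += 1
    table = {}
    for typed, fixes in by_typo.items():
        fixed, n = fixes.most_common(1)[0]
        if n >= min_count:
            table[typed] = fixed
    return table

print(aggregate(events))  # {'teh': 'the'}
```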
Given how many misspellings I have to remove from autocorrect, I doubt crowdsourcing is the way to go.
Note that I’m suggesting crowdsourcing the incorrect spellings, not the correct ones. You can build the correct ones from a large corpus of publications (you can probably do that even with a not-very-curated set by excluding words that occur fewer than a threshold number of times in a sufficiently large corpus of mostly correct text, such as Wikipedia). The thing that you’re trying to get from the crowdsourcing is the set of common typos that people then correct to things in your corpus.
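A sketch of that thresholding idea; the regex tokenizer and the cutoff of 5 are arbitrary illustrative choices:

```python
from collections import Counter
import re

def build_wordlist(corpus_text: str, min_freq: int = 5) -> set[str]:
    """Extract a 'correct words' list from a mostly-correct corpus by
    dropping anything rarer than `min_freq`: typos are individually
    rare, while real words repeat."""
    words = re.findall(r"[a-z']+", corpus_text.lower())
    counts = Counter(words)
    return {w for w, n in counts.items() if n >= min_freq}
```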
Others have already given good answers here, but I’ll add that one of my points is: “one-size-fits-all” spellchecking wouldn’t work well (well, it would work well in 80% of cases, which might be enough for some). The same goes for a “corpus from books” (NYTimes/Wikipedia): you’ll probably need to sort source text into a myriad of “bins” to teach your spellchecker that “in a mid-XX-century formal context, this sequence of words is very probable”, “in a XXI-century blog post, the probabilities are different”, “in a tech-related article, there will be a lot of uncommon words that would more probably be misspellings in other contexts”, “in literary criticism, the entire structure of the phrase would imply different word possibilities”, etc. And it changes every day (what’s probably a good word, what’s suggested by phrase structure, etc.).
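To make the “bins” point concrete with a toy sketch: the same word pair can be perfectly ordinary in one bin and a likely misspelling signal in another. All counts below are invented:

```python
from collections import Counter

# Invented per-bin bigram counts; a real system would train one model
# per genre/era/domain from binned source text.
bins = {
    "tech_blog": Counter({("git", "repo"): 40, ("the", "repo"): 25}),
    "literary":  Counter({("the", "repose"): 12}),
}

def bigram_score(bin_name: str, prev: str, word: str) -> float:
    """Relative frequency of (prev, word) within one bin's corpus."""
    counts = bins[bin_name]
    total = sum(counts.values()) or 1
    return counts[(prev, word)] / total

# "repo" is unremarkable in a tech blog but suspicious in literary prose:
print(bigram_score("tech_blog", "the", "repo"))  # ~0.38
print(bigram_score("literary", "the", "repo"))   # 0.0
```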
For formal writing, yes. But a lot of writing is now informal. Then again, I suppose one could argue that informal writing doesn’t really need to be spell-checked.
I’d consider writing posts on here ‘informal’, but I’d still want it to be spellchecked.
Of course. But a “good” spellchecker (what I am writing about: one which will consider context, and guess “it is incorrect” and “how to correct it” the best way possible) will be different for posts/comments than for a magazine article, different again for a fiction book, and different for a work email. (Like, misspell “post” as “pots”: in some contexts it can be caught by a good spellchecker, but not by a word-by-word one, and not without understanding at least some of the context.)
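A toy sketch of how context can catch a real-word error like “pots” for “post”: keep confusion sets of easily swapped words and flag a word when a confusable alternative is much more probable after the preceding word. The sets, counts, and ratio are all invented for illustration:

```python
from collections import Counter

# Invented confusion sets and bigram counts; a real checker would learn
# both from a (binned) corpus.
CONFUSION = {"pots": {"post", "pots"}, "post": {"post", "pots"}}
BIGRAMS = Counter({
    ("blog", "post"): 50, ("the", "post"): 30,
    ("the", "pots"): 4, ("flower", "pots"): 9,
})

def flag_real_word_errors(sentence: str, ratio: float = 5.0):
    """Flag a dictionary word when a confusable alternative is at least
    `ratio` times more frequent after the preceding word."""
    words = sentence.lower().split()
    for prev, word in zip(words, words[1:]):
        for alt in CONFUSION.get(word, ()):
            if alt != word and BIGRAMS[(prev, alt)] > ratio * max(BIGRAMS[(prev, word)], 1):
                yield word, alt

print(list(flag_real_word_errors("great blog pots")))  # [('pots', 'post')]
```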