1. 54
  1.  

  2. 8

    My guess is that languages with fewer users have their dictionaries prepared by professional linguists, while more common languages’ dictionaries are authored by IT people.

    Falsehoods programmers believe about languages

    1. 10

      Hehe. BTW, I once thought that while digging into Hunspell I had found enough interesting stuff to write quite a lengthy “falsehoods programmers believe about” article, fully dedicated to spellchecking :)

      1. 5

        I would certainly read that!

    2. 3

      Absolutely fascinating read! This is the kind of stuff that I enjoy Lobste.rs for so much!

      1. 2

        Thanks!

      2. 2

        blossoming complexity of organically evolved software that solves the complicated task.

        Assuming the same complicated task (with the possible exception of the extra features you identified as not being used in any dictionaries), do you have any ideas or opinions on how it could be done more elegantly if one were designing the format from scratch?

        1. 5

          This is actually The Question I am looking for an answer to with the entire project of “understanding Hunspell” (or, “rebuilding Hunspell in order to understand it”). TBH, at the start of the project I had some optimism about the existence of an analytical solution; but after spending some months trying to wrap my head around all the edge cases that brought all this… I have a firm belief that the task is more statistical by its very nature and an ML-based approach is inevitable.

          I’ll cover this line of thinking in one of the next posts, but one of the points of this one is that what looks like an analytical way of storing linguistic knowledge is rather a way of manually “encoding” some experience (e.g. a model in the ML sense) to cover a never-ending spread of edge cases.

          1. 1

            The ML approach sounds interesting and makes you wonder how well it would pass a test suite against Hunspell!

            Most languages have exceptions and exceptions to exceptions, so I’m happy to deal mostly in Finnish, which is quite regular. Complex, sure, but regular enough that you can type mostly anything into the Verbix conjugator and it will spit out an analytically-generated (and quite accurate) table of forms.

            Looking forward to reading the further installments, keep ’em coming!

            1. 2

              The ML approach sounds interesting and makes you wonder how well it would pass a test suite against Hunspell!

              It would hardly be an interesting experiment :( Hunspell’s test suite is mostly dedicated to checking that “it works like Hunspell”. I wrote a bit about it here:

              The current Hunspell’s development consensus “what’s the best suggestion algorithm” is maintained by a multitude of synthetic test dictionaries, validating that one of the suggestion features, or set of them, works (and frequently indirectly validating other features). This situation is both a blessing and a curse: synthetic tests provide a stable enough environment to refactor Hunspell; on the other hand, there is no direct way to test the quality—the tests only confirm that features work in an expected order. So, there is no way to prove that some big redesign or some alternative spellchecker passes the quality check at least as good as Hunspell and improves over this baseline.

              Most languages have exceptions and exceptions to exceptions, so I’m happy to deal mostly in Finnish, which is quite regular. Complex, sure, but regular enough that you can type mostly anything into the Verbix conjugator and it will spit out an analytically-generated (and quite accurate) table of forms.

              Oh, Finnish! While playing with Hunspell, I found out that it can’t properly support Finnish at all (which is quite weird, considering that Hungarian, the language Hunspell was originally created for, is of the same language family); so I have a plan/dream to one day get my hands on Voikko to understand how its approaches differ :)

              Looking forward to reading the further installments, keep ’em coming!

              Thanks, will do!

              1. 1

                What a pity about the tests :/

                I know very little about this domain, but a common opinion around here is that Hungarian isn’t as close as some people think. Spoken Hungarian can still sound a lot like Finnish, and some Hungarian words double as stems in Finnish, I believe, like (iirc) “ver” and “mes” and “käs”. These are not words I’d have picked out in speech if I wasn’t told, though.

                Yet spoken Estonian does not sound the same; it’s more intelligible. By that I mean not knowing what the words mean but recognizing some common structure, while Hungarian looks quite alien.

                Is Hunspell any good with Estonian, do you know?

                1. 2

                  What a pity about the tests :/

                  Yeah… In hindsight, Hunspell probably should’ve gathered and kept all the realistic cases that users brought up (the ones that led to all this complexity), but… Here we are!

                  I know very little about this domain, but a common opinion around here is that Hungarian isn’t as close as some people think.

                  Well, I know even less. I just relied on vaguely remembering Hungarian being classified as “Finno-Ugric” in some trivia or another (like “did you know that in the middle of Europe some people speak a language which is unrelated to all of their neighbors’ languages?”) :)

                  Is Hunspell any good with Estonian, do you know?

                  I know very little about it, but it is quite telling that all links from the (existing) Hunspell Estonian dictionaries lead to the author’s page, http://www.meso.ee/~jjpp/speller/, which has the word Voikko in the very top paragraph (which Google Translate translates exactly as one might expect: everything here is outdated, use Voikko).

                  1. 1

                    Finno-Ugric, yeah. I’ll probably go down a rabbit hole at some point and see if there’s easily understandable research into the genealogy, so as not to go spouting my understandings as too factual, though I have heard from Hungarians as well that Finnish spoken down the corridor, where you can’t make out the words, sounds really Hungarian. Works both ways!

                    That also says 2013, all this stuff is surprisingly old as well ;) But please do share your findings on Voikko if you get any!