1. 3

    Warning: this story ends on a cliffhanger. I recommend waiting to read this until Part 3 is posted “next week”.

    1. 6

      My bad :’-( I really wanted to tell the entire story in one go, but got carried away!

      1. 2

        don’t sweat it, all good second acts end on a cliffhanger

    1. 3

      Amazing idea for advent of code. I only read the summary at the end, but did you find anything in particular that really surprised you?

      1. 6

        As the whole set of techniques behind the procedural art generation was only vaguely familiar to me, the most surprising thing was probably how the most tender and “hand-drawn” things emerge from a clever combination of simple geometry and gradient noise. The original art involved a lot of hand-crafting, but the tools used are incredibly simple, and powerful at the same time!
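
        To show what I mean, here is a toy sketch of that idea (my own illustration, not code from the post, which combines the pieces far more cleverly): a plain circle whose radius is perturbed by smooth value noise, a simpler cousin of gradient noise, already looks pleasantly “hand-drawn”.

        ```python
        # Toy "simple geometry + noise" sketch: a circle with a noisy radius.
        import math
        import random

        def value_noise_1d(n_points, n_knots=8, seed=1):
            """Smooth periodic 1-D noise: random knots + cosine interpolation."""
            rng = random.Random(seed)
            knots = [rng.uniform(-1.0, 1.0) for _ in range(n_knots)]
            samples = []
            for i in range(n_points):
                t = i / n_points * n_knots                        # position in "knot space"
                k = int(t) % n_knots                              # left knot index
                mu = (1 - math.cos((t - int(t)) * math.pi)) / 2   # cosine easing
                a, b = knots[k], knots[(k + 1) % n_knots]
                samples.append(a * (1 - mu) + b * mu)
            return samples

        def wobbly_circle(cx=100, cy=100, r=80, amplitude=6, n=200):
            """Points of a circle whose radius is jittered by smooth noise."""
            noise = value_noise_1d(n)
            return [
                (cx + (r + amplitude * noise[i]) * math.cos(2 * math.pi * i / n),
                 cy + (r + amplitude * noise[i]) * math.sin(2 * math.pi * i / n))
                for i in range(n)
            ]

        # Dump as a minimal SVG so the result can be eyeballed in a browser.
        points = " ".join(f"{x:.1f},{y:.1f}" for x, y in wobbly_circle())
        with open("wobbly_circle.svg", "w") as f:
            f.write(f'<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
                    f'<polygon points="{points}" fill="none" stroke="black"/></svg>')
        ```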

      1. 2

        Also relevant here is the NIST TREC CAR track, which provided a comprehensive structured parse of Wikipedia. The Mediawiki parser is available. The tools used to process the extraction are also available, although, these being internal tools, the documentation could be better.

        full disclosure: I have previously served as an organizer of TREC CAR and was responsible for much of the data pipeline.

        1. 2

          Thanks, that’s super-interesting, and I haven’t seen it previously!

          (As a side note, for some time I went down the road of MediaWiki markup parsing, too, but at some point decided it wouldn’t work. The reasons are in WikipediaQL’s README.)

        1. -3

          The article forgets to describe prior art, except Mathematica, Wolfram Alpha, and Google. Further research will yield more tools like conceptnet.io (which scrapes wiktionaries), dbpedia (which scrapes wikipedia infoboxes), further work on the wikidata side to make wikidata easier to query (https://query.wikidata.org/querybuilder/); there is also Marie Destandau’s work on an ergonomic SPARQL federated query builder, and mediawiki’s $10M project dubbed Abstract Wikipedia. There is also the work on wikidata Q&A: https://hal.archives-ouvertes.fr/hal-01730479/document.

          The article did not mention the fuzzy situation regarding licensing terms (wiktionary and wikipedia vs. wikidata) and, more broadly, the fuzzy legal framework.

          A more fundamental problem that the OP @zverok seems to have no clue about: extracting structured data from unstructured or semi-structured data like html or plain text is, in the general case, a very hard problem and possibly AI-Complete.

          So, yes, I agree: given wikipedia and wiktionary, and the fact that they are mainstream, well-established, and eventually became user-friendly, it would have been better for wikidata to bet on extracting structured RDF triples from those, and to invest in automated and semi-automated approaches to extracting structured data from unstructured data; it is merely for legal reasons that the mediawiki C-level wants wikidata to be CC0.

          Also, I want to stress that the mainstream ML frenzy overshadowed one of Google’s tools of choice: Freebase, and its related tools and practices.

          We need to parse our civilization’s knowledge and make it programmatically available to everybody. And then we’ll see what’s next.

          Yes!

          On a related note: I stopped giving money to wikimedia.

          1. 7

            The article forgets to describe prior art

            That’s not a scientific article, rather a blog post about some problems I am thinking about and working on. I am aware of (at least some of) the “prior art” (well, since I have been playing with the problems at hand for 6 years now, my “links.txt” is a few books’ worth). I honestly don’t feel obliged to mention everything that is done in the field unless my work is inspired by or related to others’ work. Yes, a lot of people do a lot of stuff, some of it is dead-ends, some of it is fruitful, and I have the highest respect for them; but being just the developer I am, I am just doing what seems interesting and writing about it, no more, no less.

            A more fundamental problem that the OP @zverok seems to have no clue about: extracting structured data from unstructured or semi-structured data like html or plain text is, in the general case, a very hard problem and possibly AI-Complete.

            What makes you think that a person stating they have spent many years on a problem “has no clue about” one of the most obvious aspects of it? The problem is obviously unsolvable “generically” (e.g. “fetch any random page from Internetz and say what information it contains”), and that’s the exact reason I am investigating approaches to solve it for some practical purposes, in a somewhat generic way.

            1. 5

              The problem is obviously unsolvable “generically” (e.g. “fetch any random page from Internetz and say what information it contains”)

              Someone saying a problem is AI-complete is a big red flag for me in terms of taking the rest of what they say seriously. I am reminded of the famous Arthur C. Clarke quote:

              When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.

              The issue is not that there aren’t AI-complete problems. It is not even that this problem isn’t AI-complete. The point is that AI-completeness is just not a relevant criterion for evaluating a project that aims to be used by humans. Language translation is also AI-complete, but I have heard some people use google translate…

              An AI-Complete problem is by definition achievable by human intelligence (if not, it is simply called an impossible problem), and therefore tools that help humans think, remember, or formulate things can help with all of them.

              Also, any time you take an AI-complete problem and reduce it to a finite and programmatically defined domain, it ceases to be AI-complete. I remember when I used to try to convince people that real language AI should be part of computer games (I have given up now). This was people’s standard reply. But it is simply not true. AI understanding real human language in every possible case in the real world is AI-complete. AI understanding and generating basic descriptive sentences in a single non-evolving language, in a world that is small and has well-defined boundaries (both in actual size and also in complexity), and where every aspect of the world is programmatically defined and accessible to the AI is not even really AI-hard; it is merely a quite large and complicated software problem that will take careful planning and a lot of work.

              There is a long history of people saying “AI will never be able to do X” and being completely wrong. The people working on AI don’t really listen though.

            2. 4

              A more fundamental problem that the OP @zverok seems to have no clue about

              That’s quite harsh. Was this really necessary in this context?

              1. 1

                I did not mean to be harsh /cc @zverok, sorry! I agree with the other threads: whether it is AI-Complete or not, I like the project, it is an interesting project, possibly difficult and possibly with low-hanging fruit. The tone of my comment was not the one it was meant to have; I was trying to draw a picture of existing, and sometimes far-fetched, related projects, so that someone (OP or else) can jump in more easily, in a subject that interests me a lot and that imo deserves more attention!

                Re @Dunkhan: the whole comment is interesting; I quote one part:

                AI understanding and generating basic descriptive sentences in a single non-evolving language, in a world that is small and has well-defined boundaries (both in actual size and also in complexity), and where every aspect of the world is programmatically defined and accessible to the AI is not even really AI-hard; it is merely a quite large and complicated software problem that will take careful planning and a lot of work.

                +1, hence the importance of reviewing prior art, whether one is a science officer or a hobbyist, of putting what-is-done vs. what-is-impossible on a continuum, and of trying to find a new solution “for some practical purposes, in a somewhat generic way”.

                I am eager to read a follow-up article on the subject.

                Again, sorry for the tone of my comment. I repeat that, for sure, I am clueless about what the OP is clueless about. I only meant to serve the common good.

            1. 4

              What are the disadvantages of Wikidata here? I’ve made SPARQL queries against Wikidata and have gotten similar behavior to what the Mathematica example was showing. Wikidata is certainly still incomplete compared to Wikipedia, but I think it’s fairly feasible to programmatically bridge the two.

              1. 6

                That’s a good question I answer every time I am talking about this project/set of projects (so maybe the next blog post will be dedicated to it).

                First of all, I am not rejecting Wikidata (the “reality” project used it, too, alongside Wikipedia and OpenStreetMap). But currently, Wikipedia has two huge advantages: a) just more content (including structured content), and b) the path to the content is more discoverable for “casual” developers/data users.

                On “more content”: look, for example, at Everest’s entries in Wikipedia and Wikidata. The former has many interesting structured and semi-structured tables and lists not represented (yet?) in the latter (like “Selected climbing records”, or even “Climate”, a neat and regular table present in many geographic entities), in addition to unstructured text data which can still be fruitfully regex-mined.

                Or look at another item linked from the Wikidata item, the list of 20th-century summiters of Mount Everest… and the corresponding Wikipedia article.

                On “discoverability”: if we look at, say, Bjork on Wikipedia and Wikidata, the latter does have albums associated, but not the filmography (movies Bjork starred in). If you start from a movie, you can see they are linked via the cast member predicate, so “all movies where Bjork is a cast member” can be fetched with SPARQL, but you need to investigate and guess that the data is there; on Wikipedia, most of the people-from-movies just have a “Filmography” section in their articles.
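
                For illustration, here is roughly what that SPARQL lookup can look like against the public query endpoint (a sketch of mine, not from the article; P161 is the “cast member” property, Q11424 is “film”, and matching the person by label keeps the example self-contained):

                ```python
                # Sketch: movies where Björk is a cast member, via the Wikidata SPARQL endpoint.
                import requests

                QUERY = """
                SELECT ?film ?filmLabel WHERE {
                  ?person rdfs:label "Björk"@en .
                  ?film wdt:P161 ?person .       # the film has this person as a cast member
                  ?film wdt:P31 wd:Q11424 .      # ...and is an instance of "film"
                  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
                }
                """

                response = requests.get(
                    "https://query.wikidata.org/sparql",
                    params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "wikidata-filmography-sketch/0.1"},
                    timeout=60,
                )
                response.raise_for_status()
                for row in response.json()["results"]["bindings"]:
                    print(row["filmLabel"]["value"])
                ```

                The point stands, though: nothing on the Bjork item itself tells you that P161 is where to look.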

                That being said, playing with Wikidata to make it more accessible is definitely in the scope of my studies :)

                1. 4

                  If you can extract the data from Wikipedia programmatically, then it sounds as if it would be quite easy to programmatically insert it into Wikidata. I’d love to see tooling for this improved. I hope that eventually Wikipedia will become a Wikidata front end, so any structured data added there is automatically reflected back.

                  1. 3

                    Yeah, just posted the same here: https://lobste.rs/s/mbd6le/why_wikipedia_matters_how_make_sense_it#c_kxgtem (down to the front/back wording :))

                2. 2

                  Maybe we can kill two birds with one stone here. What is the chance that a project like this could actually be used to assist wikidata contributors? I don’t think it would be great to attempt to just programmatically fill out wikidata like this, but having a nice tool where you can produce wikidata content for a specific wikipedia entry and then look over it and manually check it would save a lot of time for contributors.

                  I also feel like the wikidata approach is slightly flawed. Wikipedia collects its data by leveraging the combined knowledge of all humans (ideally) in a very accessible way. If we have a common data system that tries to collect the same data, but in a much less accessible way and only from trained data professionals, we have lost a huge amount of utility. Especially given that the less accessible approach means way fewer eyeballs looking for errors.

                  1. 2

                    Maybe we can kill two birds with one stone here. What is the chance that a project like this could actually be used to assist wikidata contributors?

                    Bingo! One of my hopes is indeed that the new tool (once it matures a bit) can be used fruitfully both on Wikipedia (to analyze article structure) and Wikidata (to propose missing data on the basis of Wikipedia).

                    I also feel like the wikidata approach is slightly flawed. Wikipedia collects its data by leveraging the combined knowledge of all humans (ideally) in a very accessible way. If we have a common data system that tries to collect the same data, but in a much less accessible way and only from trained data professionals, we have lost a huge amount of utility. Especially given that the less accessible approach means way fewer eyeballs looking for errors.

                    That’s totally true. My (distant) hope is that Wikipedia/Wikidata might one day become coupled more tightly: with Wikipedia being the “front-end” (both more friendly for the reader and for the contributor) and Wikidata the “back-end” (ensuring the formal structure of the important parts). But there is a long road ahead, even if this direction is what Wikimedia wants.

                    1. 1

                      This was exactly where I was going with my reply in fact. Cheers.

                  1. 4

                    One thing I’d be interested in hearing your thoughts on: Using Wikipedia as an Impromptu Rotten Tomatoes API was in Ruby. Why did you switch to Python for this? Making it more accessible to a wider audience?

                    1. 5

                      Why did you switch to Python for this? Making it more accessible to a wider audience?

                      Yes, exactly. I have been a Rubyist for the last 20 years, and that’s why, when I first started doing all this (the set of projects for “programmatic access to common knowledge”, with reality being the most high-level, but in general the GitHub org contains a lot of work), like, 5 years ago, I believed that I could make cool non-Rails-related stuff in Ruby and make it noticeable and used (and, if people wanted, ported to other languages). I was basically wrong :) (as with most of my assumptions about this area for those first 5 years, TBH)

                      So that article was just a quick demo of “look, Wikipedia has a lot of stuff”, using the libraries I have worked on since 2016; you could say, stretching my muscles for the next attempt to approach the problem.

                      For better or for worse, Python seems to be today’s go-to language for working with data, so, here we are.

                    1. 3

                      What’s the value of presenting the regular facts in such irregular prose? Why does Wikipedia hold for it? Is there a middle way between “everything’s prose” and “everything’s RDF statement” (Wikidata), which WikiMedia project will follow?

                      As an aside, I found it really hard to actually get data out of WikiData. A few weeks ago I wanted to get some aggregates of election data for the UK and NL to make a certain point. The data is right there, but actually getting at it programmatically requires a crapton of code – more than I was prepared to write, so I just gave up after an hour or so 🤷

                      It’s a shame, because there’s loads of really good information in there. It was the only structured resource of this kind of data I could find (the government publishes them, of course, but as PDFs in inconsistent formats varying from election to election).

                      Wikidata’s documentation sucks too.

                      1. 3

                        I find the SPARQL Query Endpoint to be quite easy to use. There’s a learning curve to SPARQL, but it’s not too difficult if you have knowledge of SQL and have some understanding of how RDF graphs work.

                        EDIT: I’m always happy to answer SPARQL questions since I really love the tech, if not the practical realization.

                        1. 1

                          From my experience of using Wikidata for one experimental project, it is actually quite powerful; but the power is a bit unconventional, you kinda should “know where to look” and get intimate with its implementation model.

                          Between the MediaWiki API (extended with Wikidata-specific modules like wbsearchentities) and the SPARQL query endpoint (which has a lot of examples and links to docs), you can do a lot of complex things just in a few queries… But first you need to wrap your head around all of it :)
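
                          As a rough sketch (mine, simplified) of the first half of that flow: wbsearchentities turns a free-text name into item IDs, which you can then feed into SPARQL or further API calls.

                          ```python
                          # Sketch: resolve a name to Wikidata item IDs via the wbsearchentities module.
                          import requests

                          def search_entities(name, language="en"):
                              response = requests.get(
                                  "https://www.wikidata.org/w/api.php",
                                  params={
                                      "action": "wbsearchentities",
                                      "search": name,
                                      "language": language,
                                      "format": "json",
                                  },
                                  headers={"User-Agent": "wbsearchentities-sketch/0.1"},
                                  timeout=30,
                              )
                              response.raise_for_status()
                              # Each hit carries an item ID, a label, and a short description.
                              return [(hit["id"], hit.get("description", ""))
                                      for hit in response.json()["search"]]

                          for item_id, description in search_entities("Mount Everest"):
                              print(item_id, description)
                          ```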

                          1. 1

                            you can do a lot of complex things just in a few queries… But first you need to wrap your head around all of it :)

                            No doubt it’s powerful, and no doubt that I just don’t understand it well enough to use it effectively. I tried to adapt some of the examples for my own needs, but I found the entire SPARQL thing hard to understand in a reasonable amount of time and couldn’t really get it to do what I wanted. I really don’t want to spend 2 or 3 days (if not a lot more) just for a Reddit post.

                            IMHO WikiData would be a lot more valuable if more people could use it with fairly little training, instead of being usable by just a comparatively small group of SPARQL experts. The entire project really comes off as managed by a bunch of tech people with no real vision for UX, which is a real shame because I think it has a lot of potential.

                        1. 4

                          If you’re downloading the html of each wiki page, why not do the same with RT? Then the formatting of each rating would be consistent.

                          1. 10

                            That’s an interesting question with a long answer!

                            1. First, in the code that the article/gist shows, I am neither downloading HTML nor fetching each wiki page separately.
                            2. What the Infoboxer.wikipedia.category('Best Drama Picture Golden Globe winners') does is use the Wikipedia query API to fetch lists of pages, 50 at a time (so for 81 pages it is just two HTTP requests); see the rough Python sketch after this list.
                            3. The pages’ content is fetched as Wikitext, which is then parsed and navigated from Ruby.
                            4. Even if (as the last sections suggest) we’ll need to actually fetch rendered HTML, we’d still be better off utilizing the Wikipedia API: query to fetch page lists by any criteria we want, and then the parse API to fetch the rendered text.
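
                            Here is the rough Python sketch mentioned in (2), my own translation of what such a category fetch boils down to (not Infoboxer’s actual code):

                            ```python
                            # One query-API request pulls the wikitext of up to 50 category members.
                            import requests

                            API = "https://en.wikipedia.org/w/api.php"

                            def category_wikitext(category, limit=50):
                                """Yield (title, wikitext) for pages in a category, 50 per HTTP request."""
                                params = {
                                    "action": "query",
                                    "format": "json",
                                    "generator": "categorymembers",
                                    "gcmtitle": f"Category:{category}",
                                    "gcmlimit": limit,
                                    "prop": "revisions",
                                    "rvprop": "content",
                                    "rvslots": "main",
                                }
                                session = requests.Session()
                                session.headers["User-Agent"] = "wiki-category-sketch/0.1"
                                while True:
                                    data = session.get(API, params=params, timeout=60).json()
                                    for page in data.get("query", {}).get("pages", {}).values():
                                        rev = page.get("revisions", [{}])[0]
                                        text = rev.get("slots", {}).get("main", {}).get("*", "")
                                        yield page["title"], text
                                    if "continue" not in data:
                                        break
                                    params.update(data["continue"])   # fetch the next batch

                            for title, text in category_wikitext("Best Drama Picture Golden Globe winners"):
                                print(title, len(text))
                            ```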

                            So, “why not just fetch from RT”? Mainly for two reasons:

                            1. It is harder to automate. For a single movie, you’ll probably just construct a URL like https://www.rottentomatoes.com/m/nomadland, but for “The Resort (2021)”, if you try to just fetch “The Resort”, RT will just give you the wrong one: https://www.rottentomatoes.com/m/the_resort, while Wikipedia will include a hint (autoparseable!) that it’s not what you are looking for: https://en.wikipedia.org/wiki/The_Resort. Also, on Wikipedia, you can mass-fetch movies by years, genres, awards, and many other grouping criteria, and fetch them efficiently.
                            2. And second, there is the question of the acceptability/legality of what you are doing. The Wikipedia API is definitely bot-friendly, and its content is definitely CC BY-SA; RT scraping is probably something they don’t want you to do (and they will throttle and ban automated attempts to do so), and its content is copyrighted.

                            About (2): I am not sure (as I stated in the article) that “fetching bits of copyrighted content from open sources” is 100% legally clear, but it is definitely less questionable than direct scraping.

                          1. 3

                            I’m not sure why the author is so convinced that no one could be doing something new or fun in Ruby land or that if they did no one would talk to them…

                            1. 2

                              Because he tried for many years, maybe? And observed the community for said years? ¯\_(ツ)_/¯

                              1. 1

                                Isn’t the stuff that Andrew (https://github.com/ankane) builds considered cool? He did a lot of ML stuff. All of his gems integrate nicely without heavy external dependencies.

                                Same with https://github.com/ioquatix

                                1. 2

                                  Andrew is an absolute hero indeed, and there are several other groups working on ML gems; but in the scope of the point I’ve tried to make, it is important to notice that

                                  • a) these gems typically have very low usage in the community (even if “considered cool”, it is more “Oh, it is a cool thing, I’ll bookmark it in case somebody on Reddit says Ruby doesn’t do ML”); and
                                  • b) it is more of a “chasing the leader” kind of work (“Ruby now can do this and that too”).

                                  Nothing wrong with chasing the leader, and it is work that needs to be done.

                                  But my argument was about the fact that it is very hard to do something new in Ruby that gets noticed in the industry in general, not just mentioned in the “Ruby newsletter” or upvoted on /r/ruby, even though Ruby’s traits are very suitable for this kind of work (inventing something “completely new” out of the blue).

                            1. 1

                              Would not texts from major books and other manually-checked publications provide a reasonable basis for training data?

                              1. 7

                                Having trained a neural network on a large corpus before: gathering data is only half the battle. You have to do training, you have to clean your data, and in this case you may also want examples of incorrect spelling along with corrections (that is, sentences with typos and the fixed versions).

                                And shoving text from books into your training data corpus can run into Extremely Fun Copyright Issues.

                                1. 3

                                  No, because they provide only half of the puzzle. A spell checker, at its most reductionist, is a map from misspellings to correct spellings. In actual implementation, it’s a lossily compressed map from misspellings to correct spellings, using a lot of clever tricks to make the loss low and the compression high. A set of books provides you with the output of this map, but not its input, and the input is the most valuable bit. If you’ve got a load of volunteers, you could ship a spell checker that only has the dictionary of correct things and records the things that people type and the things that they correct them to. I believe this is how a lot of on-screen keyboards for mobile things work: they collect the corrections that people make, aggregate them, send them to the provider, and then train a neural network (a generalised data structure for lossily encoding map functions) on the output.
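
                                  To put that framing into the most naive code possible (my own illustration, with made-up entries): the dictionary is the map’s output side, and the collected corrections are its input side.

                                  ```python
                                  # A deliberately naive "spell checker as a map" illustration.
                                  KNOWN_WORDS = {"the", "received", "spell", "checker"}          # the "outputs"
                                  OBSERVED_CORRECTIONS = {"teh": "the", "recieved": "received"}  # the "inputs"

                                  def check(word):
                                      """Return a suggestion for a word, or None if it looks fine."""
                                      if word in KNOWN_WORDS:
                                          return None                        # already a correct spelling
                                      return OBSERVED_CORRECTIONS.get(word)  # only as good as the collected map

                                  print(check("teh"))        # -> "the"
                                  print(check("checker"))    # -> None (already correct)
                                  print(check("chekcer"))    # -> None: nobody has contributed this typo yet
                                  ```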

                                  1. 2

                                    Given how many misspellings I have to remove from autocorrect, I doubt crowdsourcing is the way to go.

                                    1. 1

                                      Note that I’m suggesting crowdsourcing the incorrect spellings, not the correct ones. You can build the correct ones from a large corpus of publications (you can probably do that even with a not-very-curated set, by excluding words that only occur below a threshold number of times in a sufficiently large corpus of mostly correct text, such as Wikipedia). The thing that you’re trying to get from the crowdsourcing is the set of common typos that people then correct to things in your corpus.
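
                                      In code, the threshold idea is roughly this (the corpus file name and the cut-off are placeholders of mine, purely for illustration):

                                      ```python
                                      # Keep only words that occur often enough in a large, mostly correct corpus,
                                      # so that one-off typos fall below the cut.
                                      import re
                                      from collections import Counter

                                      MIN_OCCURRENCES = 5   # arbitrary threshold for this illustration

                                      def build_wordlist(path, min_count=MIN_OCCURRENCES):
                                          counts = Counter()
                                          with open(path, encoding="utf-8") as corpus:
                                              for line in corpus:
                                                  counts.update(re.findall(r"[a-z']+", line.lower()))
                                          # Rare words are as likely to be typos as real vocabulary; drop them.
                                          return {word for word, count in counts.items() if count >= min_count}

                                      wordlist = build_wordlist("wikipedia_dump.txt")   # hypothetical corpus file
                                      print(len(wordlist), "words kept")
                                      ```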

                                  2. 3

                                    Others have already given good answers here, but I’ll add that one of my points is: “one-size-fits-all” spellchecking wouldn’t work well (well, it would work well in 80% of cases, which might be enough for some); the same goes for a “corpus from books” (NYTimes/Wikipedia): you’ll probably need to put the source text into a myriad of “bins” to teach your spellchecker that “in a mid-XX-century formal context, this sequence of words is very probable”, “in a XXI-century blog post, the probabilities are different”, “in a tech-related article, there will be a lot of uncommon words that are more likely to be misspellings in other contexts”, “in literary criticism, the entire structure of the phrase would imply different word possibilities”, etc. etc. etc. And it changes every day (what’s probably a good word, what’s suggested by the phrase structure, etc.)

                                    1. 1

                                      For formal writing, yes. But a lot of writing is now informal. Then again, I suppose one could argue that informal writing doesn’t really need to be spell-checked.

                                      1. 4

                                        I’d consider writing posts on here ‘informal’, but I’d still want it to be spellchecked.

                                        1. 2

                                          Of course. But a “good” spellchecker (what I am writing about: one which will consider context, and guess “it is incorrect” and “how to correct it” in the best way possible) will be different for posts/comments than for a magazine article, than for a fiction book, than for a work email. (Like, misspell “post” as “pots”, and in some contexts it can be caught by a good spellchecker, but not by a word-by-word one, and not without understanding at least some of the context.)
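
                                          A toy sketch of why the context matters (mine, with hand-made numbers): a word-by-word checker accepts “pots” everywhere, while even crude bigram counts can flag “blog pots” as a likely slip for “blog post”.

                                          ```python
                                          # Context-aware real-word error detection with toy bigram counts.
                                          from collections import Counter

                                          # In reality these counts would come from a large corpus; hand-made here.
                                          BIGRAMS = Counter({
                                              ("blog", "post"): 120, ("the", "post"): 95,
                                              ("pots", "and"): 40, ("flower", "pots"): 30,
                                          })
                                          CONFUSABLE = {"post": {"pots"}, "pots": {"post"}}

                                          def suspicious(prev_word, word):
                                              """Flag `word` if a confusable alternative fits `prev_word` much better."""
                                              current = BIGRAMS[(prev_word, word)]
                                              for alternative in CONFUSABLE.get(word, ()):
                                                  if BIGRAMS[(prev_word, alternative)] > 10 * max(current, 1):
                                                      return alternative
                                              return None

                                          print(suspicious("blog", "pots"))    # -> "post": unlikely bigram, likely slip
                                          print(suspicious("flower", "pots"))  # -> None: "flower pots" is fine
                                          ```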

                                    1. 2

                                      blossoming complexity of organically evolved software that solves the complicated task.

                                      Assuming the same complicated task (with the possible exception of the extra features you identified as not being used in any dictionaries), do you have any ideas or opinions on how it could be done more elegantly if one were designing the format from scratch?

                                      1. 5

                                        This is actually The Question which I am looking for an answer to with the entire project of “understanding Hunspell” (or, “rebuilding Hunspell in order to understand it”). TBH, at the start of the project I had some optimism about the existence of an analytical solution; but after spending some months trying to wrap my head around all the edge cases that brought all this… I have a firm belief that the task is more statistical by its very nature, and an ML-based approach is inevitable.

                                        I’ll cover this line of thinking in one of the next posts, but one of the points of this one is that what looks like an analytical way of storing linguistic knowledge is rather a way of manually “encoding” some experience (essentially, a model in the ML sense) to cover a never-ending spread of edge cases.

                                        1. 1

                                          The ML approach sounds interesting and makes you wonder how well it would pass a test suite against Hunspell!

                                          Most languages have exceptions and exceptions to exceptions, so I’m happy to deal mostly in Finnish, which is quite regular. Complex, sure, but regular enough that you can type mostly anything into the Verbix conjugator and it will spit out an analytically-generated (and quite accurate) table of forms.

                                          Looking forward to reading the further installments, keep ’em coming!

                                          1. 2

                                            The ML approach sounds interesting and makes you wonder how well it would pass a test suite against Hunspell!

                                            It would hardly be an interesting experiment :( Hunspell’s test suite is mostly dedicated to checking that “it works like Hunspell”; I wrote a bit about it here:

                                            The current Hunspell development consensus on “what’s the best suggestion algorithm” is maintained by a multitude of synthetic test dictionaries, validating that one of the suggestion features, or a set of them, works (and frequently indirectly validating other features). This situation is both a blessing and a curse: synthetic tests provide a stable enough environment to refactor Hunspell; on the other hand, there is no direct way to test the quality: the tests only confirm that features work in an expected order. So, there is no way to prove that some big redesign or some alternative spellchecker passes the quality check at least as well as Hunspell and improves over this baseline.

                                            Most languages have exceptions and exceptions to exceptions, so I’m happy to deal mostly in Finnish, which is quite regular. Complex, sure, but regular enough that you can type mostly anything into the Verbix conjugator and it will spit out an analytically-generated (and quite accurate) table of forms.

                                            Oh, Finnish! While playing with Hunspell, I found out that it can’t properly support Finnish at all (which is quite weird, considering that Hungarian, the language Hunspell was originally created for, is of the same language family); so I have a plan/dream to one day lay my hands on Voikko to understand how its approaches differ :)

                                            Looking forward to reading the further installments, keep ’em coming!

                                            Thanks, will do!

                                            1. 1

                                              What a pity about the tests :/

                                              I know very little about this domain, but a common opinion around here is that Hungarian isn’t as close as some people think. Spoken Hungarian can still sound a lot like Finnish, and some Hungarian words double as stems in Finnish, I believe, like (iirc) “ver” and “mes” and “käs”. These are not words I’d have picked out in speech if I wasn’t told, though.

                                              Yet spoken Estonian does not sound the same, it’s more legible. That includes not knowing what the words mean, but recognizing some common structurality, while Hungarian looks quite alien.

                                              Is Hunspell any good with Estonian, do you know?

                                              1. 2

                                                What a pity about the tests :/

                                                Yeah… In hindsight, Hunspell should probably have gathered and kept all the realistic cases that users brought up (the ones that led to all this complexity), but… Here we are!

                                                I know very little about this domain, but a common opinion around here is that Hungarian isn’t as close as some people think.

                                                Well, I know even less; I just relied on vaguely remembering that Hungarian is classified as “Finno-Ugric” in some trivia or another (like “did you know that in the middle of Europe some people speak a language which is unrelated to all of their neighbors’ ones”?) :)

                                                Is Hunspell any good with Estonian, do you know?

                                                I know very little about it, but the fact that all links from (existing) Hunspell Estonian dictionaries lead to the author’s page: http://www.meso.ee/~jjpp/speller/ – and it has the word Voikko in the very top paragraph (which Google Translate translates exactly as one might expect: everything here is outdated, use Voikko) is quite telling.

                                                1. 1

                                                  Finno-Ugric, yeah. I’ll probably go down a rabbit hole at some point and see if there’s easily understandable research into the genealogy, so as not to go spouting my understandings as too factual; though I have heard from Hungarians as well that Finnish spoken down the corridor, so you don’t make out the words, sounds really Hungarian. Works both ways!

                                                  That also says 2013; all this stuff is surprisingly old as well ;) But please do share your findings on Voikko if you get any!

                                      1. 3

                                        Absolutely fascinating read! This is the kind of stuff that I enjoy Lobste.rs for so much!

                                        1. 2

                                          Thanks!

                                        1. 8

                                          My guess is languages with fewer users have their dictionaries prepared by professional linguists, while more common languages’ dictionaries are authored by IT people.

                                          Falsehoods programmers believe about languages

                                          1. 10

                                            Hehe. BTW, I once thought that while digging into hunspell I found enough interesting stuff to write quite a lengthy article in the “falsehoods programmers believe about…” genre, fully dedicated to spellchecking :)

                                            1. 5

                                              I would certainly read that!