1. 37
  1. 11
    1. 8

      As a former editor, I’m really happy to see this sort of permanent extension of Wikipedia. Automatic generation of articles or knowledge boxes based on Wikidata is just the sort of ability that is needed to continue scaling the encyclopedic effort. Mere data is so much more inclusive than individual hand-written articles, in the inclusionist/deletionist paradigm.

      The technical paper is pretty good and addresses every concern one might have about the problem. The actual implementation plan looks much more daunting, though.

      1. 1

        Mere data is so much more inclusive than individual hand-written articles

        That statement is at odds with several statements in the whitepaper, such as

        contributing to Abstract Wikipedia will possibly have a higher bar than contributing to most Wikipedia projects

        and

        compared to natural language the expressivity of Wikidata is extremely limited

        But, more to the point, “mere data” has no meaning in itself. Gathering it is the easiest part of the problem.

        I find the whitepaper almost humorous in how thoroughly it hand-waves its way through a big swath of really thorny problems, while studiously ignoring many others. Not to mention the long history of failed attempts at this particular pipe dream. The first bullet point in the “risks” section says

        Leibniz was neither the first nor the last, but probably the most well-known proponent of a universal language, which he called Characteristica universalis [74]. Umberto Eco wrote an entire book describing the many failures towards this goal [24]. Abstract Wikipedia is indeed firmly within this tradition, and in preparation for this project we studied numerous predecessors

        … but, you just have to take his word for it, because he doesn’t actually cite any real prior work, let alone do any analysis.

        I’d be willing to take 10:1 odds on this project going nowhere. I sure hope the Wikimedia Foundation doesn’t waste any money on it. They should instead invest in cleaning up their (absolutely terrible) code base.

        1. 2

          Most Wikipedia projects are single-language encyclopedias, and most of those have relatively low bars to contribution. The proposal author works on Croatian Wikipedia, and part of the motivation is to gain lots of articles in Croatian with the same structure and detail as in English. That is where the imagined bar for quality comes from: we might imagine Abstract Wikipedia presentations capable of generating English Wikipedia Featured Articles, which have one of the highest bars across all of the Foundation’s projects.

          When I say that this move is inclusive, I mean explicitly that the number of articles that can be maintained this way, generalized and template-generated, is strictly greater than the number that can be handwritten. Additionally, some articles that might have been too expensive to write by hand, due to the long-tail effect, might be easy to generate automatically; this could be the future of stubbed articles.

          Wikidata is highly structured. While any single datum is not meaningful on its own, the structure between data can be recovered. If we look at a particularly famous item representing a famous author, we can see that each piece of associated data is structured as a span which connects the author, by the respective relation, to the respective associated datum. Citation is also built directly into the structure.
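          As a rough illustration of that structure, here is a small sketch that pulls one item’s statements and their references through the public wbgetentities API; Q42 (Douglas Adams) is just a convenient, well-populated example item:

          ```python
          # Sketch: inspect the claims (property -> value) and the references attached
          # to them on a single Wikidata item. Q42 (Douglas Adams) is only an example.
          import requests

          API = "https://www.wikidata.org/w/api.php"

          resp = requests.get(API, params={
              "action": "wbgetentities",
              "ids": "Q42",
              "props": "claims",
              "format": "json",
          })
          entity = resp.json()["entities"]["Q42"]

          for prop, statements in entity["claims"].items():
              for st in statements:
                  value = st["mainsnak"].get("datavalue", {}).get("value")
                  refs = len(st.get("references", []))  # citations live on the statement itself
                  print(f"{prop}: {value!r}  ({refs} reference group(s))")
          ```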

          This structure of spans gives a well-studied category which generalizes Rel, the category of sets and relations. (Details are at the nLab page on spans.) The upshot is that we can do all of our ordinary relational algebra on Wikidata. While I’m not a fan of SPARQL, there is a well-documented SPARQL interface to Wikidata.
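          For example, a minimal query against the public endpoint at query.wikidata.org; P50 (“author”) and Q42 are illustrative picks, and the wdt:/wd: prefixes and the label service are predefined by that endpoint:

          ```python
          # Sketch: a small relational query over Wikidata via its public SPARQL endpoint.
          import requests

          ENDPOINT = "https://query.wikidata.org/sparql"

          query = """
          SELECT ?work ?workLabel WHERE {
            ?work wdt:P50 wd:Q42 .   # works whose author is Q42
            SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
          }
          LIMIT 10
          """

          resp = requests.get(
              ENDPOINT,
              params={"query": query, "format": "json"},
              headers={"User-Agent": "abstract-wikipedia-sketch/0.1 (example)"},
          )
          for row in resp.json()["results"]["bindings"]:
              print(row["work"]["value"], "-", row["workLabel"]["value"])
          ```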

          Hm, ten-to-one is about 91%, which is just over one nine. I would take one nine’s belief that this effort, or something extremely similar with the same design and goals, replaces the knowledge-box construction that’s currently used across Wikipedia. Those boxes are built from MediaWiki templates, which is nice for letting ordinary editors work on them, but perhaps they could start to be built from Wikidata queries instead. Now, as to replacing actual article generation on any single-language Wikipedia, I’d say less than one nine. This is a grand design that needs to achieve many intermediate steps in order to even possibly succeed.
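          To make the “boxes built from Wikidata queries” idea concrete, here is a toy sketch; the hard-coded property-to-label mapping is purely illustrative, and a real knowledge box would resolve labels, datatypes, and formatting from Wikidata itself:

          ```python
          # Toy sketch: build an infobox-style summary from Wikidata statements instead
          # of a hand-maintained MediaWiki template. INFOBOX_FIELDS is illustrative only.
          import requests

          INFOBOX_FIELDS = {   # property ID -> display label
              "P569": "Born",
              "P570": "Died",
              "P106": "Occupation",
          }

          def infobox(item_id: str) -> dict:
              resp = requests.get("https://www.wikidata.org/w/api.php", params={
                  "action": "wbgetentities", "ids": item_id,
                  "props": "claims", "format": "json",
              })
              claims = resp.json()["entities"][item_id]["claims"]
              box = {}
              for prop, label in INFOBOX_FIELDS.items():
                  for st in claims.get(prop, []):
                      snak = st["mainsnak"]
                      if snak["snaktype"] == "value":
                          box.setdefault(label, []).append(snak["datavalue"]["value"])
              return box

          print(infobox("Q42"))
          ```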

          I think we all know that MediaWiki will never be replaced, and also that nothing short of a full rewrite can fix its architectural issues. I don’t like it either.

      2. 2

        Very interesting. I would definitely try to use Grammatical Framework for a prototype, since its “resource grammars” for natural languages represent a ton of usable linguistic work. It’s very briefly cited in the technical paper but only to say that functional languages are promising for language generation. Indeed GF works today and has a pleasant formalism.
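        Not GF itself, but here is a toy sketch of the idea GF formalizes: one abstract representation of the content, with a separate linearization rule per language. The fact type and the naive German rendering below are made up for illustration; agreement, case, and word order are exactly what GF’s resource grammars take care of.

        ```python
        # Toy sketch (not GF): an abstract "meaning" plus one linearization per language.
        from dataclasses import dataclass

        @dataclass
        class CityIn:          # language-independent content
            city: str
            country: str

        def linearize_en(fact: CityIn) -> str:
            return f"{fact.city} is a city in {fact.country}."

        def linearize_de(fact: CityIn) -> str:
            # Naive: a real grammar would handle articles, case, and country-name translation.
            return f"{fact.city} ist eine Stadt in {fact.country}."

        fact = CityIn("Zagreb", "Croatia")
        for linearize in (linearize_en, linearize_de):
            print(linearize(fact))
        ```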

        1. 1

          Whenever I see something like this all I can think is: Fuck translation. Make everyone use English.

          Good command of English is almost necessary for lots of things and definitely incredibly useful. Make people practice and let’s all benefit from not needing translations anymore.

          English is also the best of the widespread languages to standardize on.

          1. 1

            The plan does not seem feasible. Has anybody ever created a Turing-complete language “from scratch” in less than a year?