1. 13

  2. 4

    What are the disadvantages of Wikidata here? I’ve made SPARQL queries against Wikidata and have gotten similar behavior to what the Mathematica example was showing. Wikidata is certainly still incomplete compared to Wikipedia, but I think it’s fairly feasible to programmatically bridge the two.

    1. 6

      That’s a good question, one I answer every time I talk about this project/set of projects (so maybe the next blog post will be dedicated to it).

      First of all, I am not rejecting Wikidata (the “reality” project used it, too, alongside Wikipedia and OpenStreetMap). But currently, Wikipedia has two huge advantages: a) just more content (including structured content), and b) the path to the content is more discoverable for “casual” developer/data users.

      On “more content”: look, for example, at Everest’s entries in Wikipedia and Wikidata. The former has many interesting structured and semi-structured tables and lists not represented (yet?) in the latter, like “Selected climbing records” or even “Climate” (a neat and regular table present in many geographic entities), in addition to unstructured text that can still be fruitfully regex-mined (see the sketch below).

      Or look at another item linked from the Wikidata one, the list of 20th-century summiters of Mount Everest… and the corresponding Wikipedia article.
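
      Just to show how reachable that semi-structured content already is, here is a minimal sketch (assuming Python with pandas and lxml; the article URL and the “Climate” filter are purely illustrative):

      ```python
      # Minimal sketch: pull the semi-structured tables straight from the Wikipedia
      # article. Assumes pandas + lxml are installed; the URL and the "Climate"
      # filter are illustrative, not a stable interface.
      import pandas as pd

      tables = pd.read_html(
          "https://en.wikipedia.org/wiki/Mount_Everest",
          match="Climate",  # keep only tables whose text mentions "Climate"
      )
      for table in tables:
          print(table.head())  # each table arrives as an ordinary DataFrame
      ```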

      On “discoverability”: if we look at, say, Björk on Wikipedia and Wikidata, the latter does have her albums associated, but not her filmography (the movies Björk starred in). If you start from a movie, you can see they are linked via the “cast member” predicate, so “all movies where Björk is a cast member” can be fetched with SPARQL (a sketch follows at the end of this comment), but you need to investigate and guess that the data is there; on Wikipedia, most people from the movie world just have a “Filmography” section in their articles.

      That being said, playing with Wikidata to make it more accessible is definitely in the scope of my studies :)
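
      For the record, here is roughly what the “investigate and guess” path looks like once you do know where the data lives; a minimal sketch assuming Python with requests and the public Wikidata SPARQL endpoint (P31/Q11424 is “instance of: film”, P161 is “cast member”; the entity is matched by its English label instead of a hard-coded QID):

      ```python
      # Rough sketch of the "all movies where Björk is a cast member" query.
      # Assumes the requests package and the public Wikidata SPARQL endpoint;
      # matching by rdfs:label avoids hard-coding a QID, at the cost of a slower query.
      import requests

      QUERY = """
      SELECT ?film ?filmLabel WHERE {
        ?person rdfs:label "Björk"@en .   # match the entity by its English label
        ?film wdt:P31 wd:Q11424 ;         # instance of: film
              wdt:P161 ?person .          # cast member
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
      }
      """

      response = requests.get(
          "https://query.wikidata.org/sparql",
          params={"query": QUERY, "format": "json"},
          headers={"User-Agent": "wikipedia-vs-wikidata-example/0.1"},
      )
      for row in response.json()["results"]["bindings"]:
          print(row["filmLabel"]["value"])
      ```

      Which is exactly the discoverability gap: the query itself is short, but you first have to know that P161 is the predicate holding the link.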

      1. 4

        If you can extract the data from Wikipedia programmatically, then it sounds as if it would be quite easy to programmatically insert it into Wikidata. I’d love to see tooling for this improved. I hope that eventually Wikipedia will become a Wikidata front end, so any structured data added there is automatically reflected back.

        1. 3

          Yeah, just posted the same here: https://lobste.rs/s/mbd6le/why_wikipedia_matters_how_make_sense_it#c_kxgtem (down to the front/back wording :))

      2. 2

        Maybe we can kill two birds with one stone here. What is the chance that a project like this could actually be used to assist wikidata contributors? I don’t think it would be great to attempt to just programmatically fill out wikidata like this, but having a nice tool where you can produce wikidata content for a specific wikipedia entry and then look over it and manually check it would save a lot of time for contributors (roughly along the lines of the sketch at the end of this comment).

        I also feel like the wikidata approach is slightly flawed. Wikipedia collects its data by leveraging the combined knowledge of all humans (ideally) in a very accessible way. If we have a common data system that tries to collect the same data, but in a much less accessible way and only from trained data professionals, we have lost a huge amount of utility. Especially given that the less accessible approach means way fewer eyeballs looking for errors.
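
        To make the reviewing-tool idea above a bit more concrete, here is a hedged sketch: the QIDs are placeholders, the extraction of candidate statements from the Wikipedia entry is assumed to happen in a parser not shown here, and the only live call is to the standard wbgetentities endpoint of the Wikidata API. It filters out what Wikidata already has and writes the rest to a file for a contributor to check:

        ```python
        # Hypothetical sketch of a "propose, then manually review" helper: given candidate
        # (item, property, value) statements extracted from a Wikipedia entry (by a parser
        # not shown here), keep only those Wikidata does not already have and write them
        # out as tab-separated lines for a contributor to check before any import.
        import requests

        WIKIDATA_API = "https://www.wikidata.org/w/api.php"

        def existing_item_values(item_id, property_id):
            """Return the QIDs already used as values of property_id on item_id."""
            data = requests.get(
                WIKIDATA_API,
                params={"action": "wbgetentities", "ids": item_id,
                        "props": "claims", "format": "json"},
                headers={"User-Agent": "wikidata-proposal-example/0.1"},
            ).json()
            claims = data["entities"][item_id].get("claims", {}).get(property_id, [])
            values = set()
            for claim in claims:
                snak = claim["mainsnak"]
                if snak.get("snaktype") == "value":
                    values.add(snak["datavalue"]["value"]["id"])
            return values

        # Candidate statements extracted from a Wikipedia article (placeholders here).
        candidates = [("Q_FILM_PLACEHOLDER", "P161", "Q_PERSON_PLACEHOLDER")]

        with open("proposals.tsv", "w") as out:
            for item, prop, value in candidates:
                if value not in existing_item_values(item, prop):
                    out.write(f"{item}\t{prop}\t{value}\n")  # left for a human to review
        ```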

        1. 2

          Maybe we can kill two birds with one stone here. What is the chance that a project like this could actually be used to assist wikidata contributors?

          Bingo! One of my hopes is indeed that the new tool (once it matures a bit) can be used fruitfully both on Wikipedia (to analyze article structure) and on Wikidata (to propose missing data on the basis of Wikipedia).

          I also feel like the wikidata approach is slightly flawed. Wikipedia collects its data by leveraging the combined knowledge of all humans (ideally) in a very accessible way. If we have a common data system that tries to collect the same data, but in a much less accessible way and only from trained data professionals, we have lost a huge amount of utility. Especially given that the less accessible approach means way fewer eyeballs looking for errors.

          That’s totally true. My (distant) hope is that Wikipedia/Wikidata might one day become more tightly coupled: with Wikipedia being the “front-end” (more friendly both for the reader and for the contributor) and Wikidata the “back-end” (ensuring the formal structure of the important parts). But there is a long road ahead, even if this direction is what Wikimedia wants.

          1. 1

            This was exactly where I was going with my reply in fact. Cheers.

        2. -3

          The article forgets to describe prior art other than Mathematica, Wolfram Alpha, and Google. Further research will yield more tools, like conceptnet.io (which scrapes wiktionaries) and DBpedia (which scrapes Wikipedia infoboxes); on the Wikidata side there are efforts to make Wikidata easier to query, such as https://query.wikidata.org/querybuilder/, as well as Marie Destandau’s work on an ergonomic SPARQL federated query builder and Wikimedia’s $10M project dubbed Abstract Wikipedia. There is also work on Wikidata Q/A, e.g. https://hal.archives-ouvertes.fr/hal-01730479/document.

          The article also did not mention the fuzzy situation regarding licensing terms (Wiktionary and Wikipedia vs. Wikidata) and, more broadly, the fuzzy legal framework.

          A more fundamental problem that the OP @zverok seems to have no clue about: extracting structured data from unstructured or semi-structured data like HTML or plain text is, in the general case, a very hard problem and possibly AI-complete.

          So, yes, I agree: given that Wikipedia and Wiktionary are mainstream, well established, and have eventually become user-friendly, it would have been better for Wikidata to bet on extracting structured RDF triples from those, and to invest in automated and semi-automated approaches to extracting structured data from unstructured data. If only for legal reasons: Wikimedia’s C-level wants Wikidata to be CC0.

          Also, I want to stress that the mainstream ML frenzy overshadowed one of Google’s tools of choice: Freebase, and its related tools and practices.

          We need to parse our civilization’s knowledge and make it programmatically available to everybody. And then we’ll see what’s next.

          Yes!

          On a related note: I stopped giving money to Wikimedia.

          1. 7

            The article forgets to describe prior art

            That’s not a scientific article, rather a blog post about some problems I am thinking about and working on. I am aware of (at least some of) the “prior art” (well, since I have been playing with the problems at hand for 6 years now, my “links.txt” is a few books’ worth). I honestly don’t feel obliged to mention everything that is done in the field unless my work is inspired by or related to others’ work. Yes, a lot of people do a lot of stuff, some of it dead ends, some of it fruitful, and I have the highest respect for them; but being just the developer I am, I am simply doing what seems interesting and writing about it, no more, no less.

            A more fundamental problem that the OP @zverok seems to have no clue about: extracting structured data from unstructured or semi-structured data like HTML or plain text is, in the general case, a very hard problem and possibly AI-complete.

            What makes you think that a person who has stated they spent many years on a problem “has no clue about” one of its most obvious aspects? The problem is obviously unsolvable “generically” (e.g. “fetch any random page from Internetz and say what information it contains”), and that’s the exact reason I am investigating approaches to solve it for some practical purposes, in a somewhat generic way.

            1. 5

              The problem is obviously unsolvable “generically” (e.g. “fetch any random page from Internetz and say what information it contains”)

              Someone saying a problem is AI-complete is a big red flag for me in terms of taking the rest of what they say seriously. I am reminded of the famous Arthur C. Clarke quote:

              If an elderly but distinguished scientist says that something is possible, he is almost certainly right; but if he says that it is impossible, he is very probably wrong.

              The issue is not that there aren’t AI-complete problems. It is not even that this problem isn’t AI-complete. The point is that AI-completeness is just not a relevant criterion for evaluating a project that aims to be used by humans. Language translation is also AI-complete, but I have heard some people use Google Translate…

              An AI-complete problem is by definition achievable by human intelligence (if not, it is simply called an impossible problem), and therefore tools that help humans think, remember, or formulate things can help with all of them.

              Also, any time you take an AI-complete problem and reduce it to a finite and programmatically defined domain, it ceases to be AI-complete. I remember when I used to try to convince people that real language AI should be part of computer games (I have given up now). This was people’s standard reply. But it is simply not true. AI understanding real human language in every possible case in the real world is AI-complete. AI understanding and generating basic descriptive sentences in a single non-evolving language, in a world that is small and has well defined boundaries (both in actual size and also in complexity), and where every aspect of the world is programmatically defined and accessible to the AI is not even really AI-hard, it is merely a quite large and complicated software problem that will take careful planning and a lot of work.

              There is a long history of people saying “AI will never be able to do X” and being completely wrong. The people working on AI don’t really listen though.

            2. 4

              A more fundamental problem that the OP @zverok seems to have no clue about

              That’s quite harsh. Was this really necessary in this context?

              1. 1

                I did not mean to be harsh (/cc @zverok), sorry! I agree with the other threads: whether it is AI-complete or not, I like the project; it is an interesting project, possibly difficult and possibly with low-hanging fruit. The tone of my comment was not the one I meant it to have. I was trying to draw a picture of existing, and sometimes far-fetched, related projects, so that someone (the OP or someone else) can jump more easily into a subject that interests me a lot and that imo deserves more attention!

                Re @Dunkhan: the whole comment is interesting; I quote one part:

                AI understanding and generating basic descriptive sentences in a single non-evolving language, in a world that is small and has well defined boundaries (both in actual size and also in complexity), and where every aspect of the world is programmatically defined and accessible to the AI is not even really AI-hard, it is merely a quite large and complicated software problem that will take careful planning and a lot of work.

                +1, hence the importance of reviewing prior art, whether one is a science officer or a hobbyist, of putting on a continuum what is done vs. what is impossible, and of trying to find a new solution “for some practical purposes, in a somewhat generic way”.

                I am eager to read a follow-up article on the subject.

                Again, sorry for the tone of my comment. I repeat: for sure, I am clueless about what the OP is clueless about. I only meant to contribute to the common good.