1. 27
  1.  

  2. 4

    If you’re downloading the HTML of each wiki page, why not do the same with RT? Then the formatting of each rating would be consistent.

    1. 10

      That’s an interesting question with a long answer!

      1. First, in the code the article/gist shows, I am neither downloading HTML, nor fetching each wiki page separately.
      2. What Infoboxer.wikipedia.category('Best Drama Picture Golden Globe winners') does is use the Wikipedia query API to fetch the list of pages, 50 at a time (so for 81 pages it is just two HTTP requests).
      3. The pages’ content is fetched as wikitext, which is then parsed and navigated from Ruby.
      4. Even if (as the last sections suggest) we’ll need to actually fetch rendered HTML, we’d still be better off using the Wikipedia API: the query API to fetch page lists by any criteria we want, and then the parse API to fetch the rendered text (see the sketch below).
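
      Roughly, the same flow can be sketched against the raw MediaWiki API (this is just an illustration, not Infoboxer’s actual implementation; the category name and limits are taken from the example above):

      ```ruby
      require 'net/http'
      require 'json'
      require 'uri'

      API = URI('https://en.wikipedia.org/w/api.php')

      # Fetch members of a category, 50 titles per request, via the query API.
      def category_members(category)
        titles = []
        params = {
          action: 'query', format: 'json',
          list: 'categorymembers',
          cmtitle: "Category:#{category}",
          cmlimit: 50
        }
        loop do
          uri = API.dup
          uri.query = URI.encode_www_form(params)
          data = JSON.parse(Net::HTTP.get(uri))
          titles.concat(data['query']['categorymembers'].map { |m| m['title'] })
          break unless data['continue']
          params = params.merge(data['continue']) # follow API pagination
        end
        titles
      end

      # If rendered HTML is ever needed, the parse API returns it per page.
      def rendered_html(title)
        uri = API.dup
        uri.query = URI.encode_www_form(action: 'parse', page: title,
                                        prop: 'text', format: 'json')
        JSON.parse(Net::HTTP.get(uri)).dig('parse', 'text', '*')
      end

      puts category_members('Best Drama Picture Golden Globe winners').size
      ```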

      So, “why not just fetch from RT”? Mainly for two reasons:

      1. It is harder to automate. For a single movie, you’ll probably just construct a URL like https://www.rottentomatoes.com/m/nomadland, but for “The Resort (2021)”, if you try to just fetch “The Resort”, RT will give you the wrong one: https://www.rottentomatoes.com/m/the_resort, while Wikipedia will include a hint (auto-parseable!) that it’s not what you are looking for: https://en.wikipedia.org/wiki/The_Resort. Also, on Wikipedia you can mass-fetch movies by years, genres, awards, and many other grouping criteria, and fetch them efficiently.
      2. Second, there is the question of the acceptability/legality of what you are doing. The Wikipedia API is definitely bot-friendly, and its content is definitely CC BY-SA; scraping RT is probably something they don’t want you to do (and they will throttle and ban automated attempts), and its content is copyrighted.

      About (2) — I am not sure (as I stated in the article) that “fetching bits of copyrighted content from open sources” is 100% legally clear, but it is definitely less questionable than direct scraping.

    2. 3

      What’s the value of presenting regular facts in such irregular prose? Why does Wikipedia stick with it? Is there a middle way between “everything’s prose” and “everything’s an RDF statement” (Wikidata) that the Wikimedia project will follow?

      As an aside, I found it really hard to actually get data out of Wikidata. A few weeks ago I wanted to get some aggregates of election data for the UK and NL to make a certain point. The data is right there, but actually getting at it programmatically requires a crapton of code – more than I was prepared to write, so I just gave up after an hour or so 🤷

      It’s a shame, because there’s loads of really good information in there. It was the only structured resource of this kind of data I could find (the government publishes them, of course, but as PDFs in inconsistent formats varying from election to election).

      Wikidata’s documentation sucks too.

      1. 3

        I find the SPARQL Query Endpoint to be quite easy to use. There’s a learning curve to SPARQL, but it’s not too difficult if you have knowledge of SQL and have some understanding of how RDF graphs work.

        EDIT: I’m always happy to answer SPARQL questions since I really love the tech, if not the practical realization.
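
        For what it’s worth, querying the endpoint programmatically is just one HTTP request. A minimal sketch in Ruby (assuming the public endpoint at https://query.wikidata.org/sparql and using the well-known “house cats” example query, where wdt:P31 is “instance of” and wd:Q146 is “house cat”):

        ```ruby
        require 'net/http'
        require 'json'
        require 'uri'

        # "Everything that is an instance of house cat", with English labels.
        query = <<~SPARQL
          SELECT ?item ?itemLabel WHERE {
            ?item wdt:P31 wd:Q146.
            SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
          }
          LIMIT 10
        SPARQL

        uri = URI('https://query.wikidata.org/sparql')
        uri.query = URI.encode_www_form(query: query, format: 'json')
        # Note: Wikimedia asks clients to send a descriptive User-Agent for heavier use.
        rows = JSON.parse(Net::HTTP.get(uri))['results']['bindings']
        rows.each { |r| puts "#{r['item']['value']}  #{r['itemLabel']['value']}" }
        ```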

        1. 1

          From my experience of using Wikidata for one experimental project, it is actually quite powerful; but the power is a bit unconventional: you kinda need to “know where to look” and get intimate with its implementation model.

          Between the MediaWiki API (extended with Wikidata-specific modules like wbsearchentities) and the SPARQL query endpoint (which has a lot of examples and links to docs), you can do a lot of complex things just in a few queries… But first you need to wrap your head around all of it :)
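
          For example, the kind of two-step lookup I mean could look roughly like this (a hypothetical sketch: wbsearchentities to resolve a name to an entity ID, then a SPARQL query about that entity; the property and example chosen here are just for illustration):

          ```ruby
          require 'net/http'
          require 'json'
          require 'uri'

          # Step 1: resolve a human-readable name to a Wikidata entity ID
          # via the wbsearchentities module of the MediaWiki API.
          def search_entity(name)
            uri = URI('https://www.wikidata.org/w/api.php')
            uri.query = URI.encode_www_form(action: 'wbsearchentities', search: name,
                                            language: 'en', format: 'json')
            JSON.parse(Net::HTTP.get(uri)).dig('search', 0, 'id')
          end

          # Step 2: ask the SPARQL endpoint something about that entity
          # (here: what it is an "instance of", wdt:P31, with English labels).
          def instance_of(entity_id)
            query = <<~SPARQL
              SELECT ?classLabel WHERE {
                wd:#{entity_id} wdt:P31 ?class.
                SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
              }
            SPARQL
            uri = URI('https://query.wikidata.org/sparql')
            uri.query = URI.encode_www_form(query: query, format: 'json')
            JSON.parse(Net::HTTP.get(uri))['results']['bindings']
              .map { |b| b['classLabel']['value'] }
          end

          id = search_entity('Nomadland')
          puts "#{id}: #{instance_of(id).join(', ')}"
          ```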

          1. 1

            you can do a lot of complex things just in a few queries… But first you need to wrap your head around all of it :)

            No doubt it’s powerful, and no doubt that I just don’t understand it well enough to use it effectively. I tried to adapt some of the examples for my own needs, but I found the entire SPARQL thing hard to grasp in a reasonable amount of time and couldn’t really get it to do what I wanted. I really don’t want to spend 2 or 3 days (if not a lot more) just for a Reddit post.

            IMHO Wikidata would be a lot more valuable if more people could use it with fairly little training, instead of being usable only by a comparatively small group of SPARQL experts. The entire project really comes off as managed by a bunch of tech people with no real vision for UX, which is a real shame because I think it has a lot of potential.