1. 20

I’m planning to set up a Gopher site with text versions of news sites and a URL converter. This Python code is the starting point. Readability extraction is built in rather than handled by an external service.

Demo: https://s7.spnw.nl/txtnws/

The GopherHole should also include RSS feeds over Gopher.
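
A feed menu like that could be served as a gophermap, where each line is an item type character followed by tab-separated display text, selector, host, and port. The following is only a minimal sketch, assuming the feed entries are already parsed into (title, selector) pairs. The function names, `example.org`, and the `error.host` placeholder on info lines are illustrative assumptions and not part of the posted script:

```python
def gophermap_line(item_type, display, selector="", host="error.host", port=1):
    """Format one gophermap line: type + display, then selector, host, port."""
    # Tabs are field separators, so they must not appear in the display text.
    display = display.replace("\t", " ")
    return f"{item_type}{display}\t{selector}\t{host}\t{port}"


def feed_to_gophermap(feed_title, entries, host, port=70):
    """Turn (title, selector) pairs into a gophermap menu.

    The feed title becomes an info line ('i'); every entry becomes a
    text-file item ('0') pointing at the given selector on this host.
    """
    lines = [gophermap_line("i", feed_title)]
    for title, selector in entries:
        lines.append(gophermap_line("0", title, selector, host, port))
    return "\n".join(lines) + "\n"
```

Calling `feed_to_gophermap("News", [("First story", "/news/1.txt")], "example.org")` yields one info line followed by one text-file item.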

The site won’t be accessible via http and I hope to stay under the radar, since all those ads and trackers are stripped out as well.

Any tips or suggestions? Especially on the Cookiewall part? There are a few workarounds already, but those are simple. I don’t want to use Selenium or a full-fledged browser. Just enough to pass the Cookiewall…


  2. 2

    Really cool stuff. It reminds me of the sites on gopher://fritterware.org/1/. I’ve added your gopher hole to my aggregator on gopher.black. Keep up the awesome work

    1. 2

      Thank you for adding me! Could you maybe tell a bit more about what runs gopher.black? Self-written scripts, which language, how do you track GopherHoles without RSS, etc.?

      Fritterware does indeed look like my idea. Most of the RSS feeds are incomplete, due to the HTML parsing I guess. That is always an uphill battle, sadly.

      1. 1

        gopher.black runs on a Raspberry Pi in my apartment. I run motsognir as my main gopherd, but also have gophernicus running for Tor (I could have done both over motsognir, but did the gophernicus bit for more complex silly reasons). I author content using my gopher helper, burrow (https://github.com/jamestomasino/burrow) and push all content into git where it is mirrored through cron-job pulls on several other systems (sdf, tilde.town, etc). I have a few stand-alone scripts that do things like pull my reading lists out of goodreads (via RSS). Most importantly I run a very large aggregator using Alex Schroeder’s moku-pona (http://github.com/kensanata/moku-pona). This is updated by a cron job 4 times a day. My site list for moku-pona is tracked in git (https://github.com/jamestomasino/dotfiles/blob/master/mokupona/sites.txt) in case the pi dies or I get eaten by a whale.

        Come join us in irc.sdf.org or irc.tilde.chat in the #gopher rooms

        1. 1

          That is really cool! Let’s hope the whale 🐋 thing doesn’t happen anytime soon. This would be worthy of its own article, if you have some time for that.

    2. 1

      Since you explicitly asked for suggestions regarding cookie walls: Googlebot as user-agent might work

      (cf. my comment https://lobste.rs/s/6yo4wi/totext_py_convert_url_rss_feed_text_with#c_53ujqr )

      1. 1

        Thanks for the suggestion! The user agent is hard-coded but could be integrated into the cookie workarounds. For Twitter I already fetch some sort of security token and set some cookies; otherwise they serve a 403.

        One Dutch site, tweakers.net, gives you an immediate IP ban if you set your user agent to Googlebot. But they do provide an HTTP header, X-cookies-accepted: 1, to skip the Cookiewall. I’d love a uniform way like that for all sites.
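
        A header-based workaround like the tweakers.net one can be kept in a small per-host table. This is a rough sketch with the standard library only; the table layout, the `build_request` name, and the default user-agent string are assumptions for illustration, and only the tweakers.net header itself comes from the comment above:

```python
import urllib.request
from urllib.parse import urlsplit

# Hypothetical per-host table of extra headers that skip a cookiewall.
# Only the tweakers.net entry is taken from the discussion.
COOKIEWALL_HEADERS = {
    "tweakers.net": {"X-Cookies-Accepted": "1"},
}


def build_request(url, user_agent="Tiny Tiny RSS"):
    """Return a urllib Request with any host-specific cookiewall headers."""
    host = urlsplit(url).hostname or ""
    headers = {"User-Agent": user_agent}  # placeholder string, an assumption
    for domain, extra in COOKIEWALL_HEADERS.items():
        # Match the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            headers.update(extra)
    return urllib.request.Request(url, headers=headers)
```

        The resulting object can be passed to `urllib.request.urlopen()`; hosts without an entry get only the user-agent header.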

        1. 1

          Just had another idea: uBlock Origin somehow makes the popup at derstandard.at go away after 1 second. I did not check how they do this. They might either automatically click a button or set the required cookies. In the second case, maybe you could download the relevant rules that uBlock Origin uses and use them for your project.

          Might also be “I don’t care about cookies” doing this… I did not check which plugin makes the popup go away :)

      2. 1

        Fantastic! This is a lot better than my solution to that problem.

        Regarding cookiewalls, I’d be surprised if urllib couldn’t supply/spoof cookies. Are you planning to allow user-agent spoofing too? A lot of sites simply won’t serve content if they don’t recognize your user-agent string as belonging to a major browser.
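
        For reference, the standard library can both keep and preload cookies through `http.cookiejar`. A minimal sketch follows; the helper names and the `consent=1` cookie are made up for illustration, since the real cookie name and value would have to be looked up per site:

```python
import http.cookiejar
import urllib.request


def make_cookie(name, value, domain, path="/"):
    """Build a simple session cookie (no expiry) for preloading a jar."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def opener_with_cookies(cookies=()):
    """Return (opener, jar); the opener sends and stores cookies via the jar."""
    jar = http.cookiejar.CookieJar()
    for cookie in cookies:
        jar.set_cookie(cookie)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar
```

        Requests made with `opener.open(url)` then carry the preloaded cookie for matching hosts, and any cookies the site sets in response end up in the same jar.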

        1. 1

          Cookiewalls would require me to reverse engineer each site. The script now sets the user agent to Tiny Tiny RSS, but does support custom headers and cookies. I guess when I want to scrape a site, I’ll add a specific handler if it’s behind a Cookiewall.

          And, if your script works for you, why would it be bad? w3m seems simpler than python with a boatload of external libraries just to get some text ;)

          1. 2

            In my experience, using Googlebot as the user agent also solves problems with many sites. I would prefer not to spoof the user-agent, but what should we do if the websites break the internet…

            E.g. Austrian newspaper derStandard shows human visitors (and also my crawler, when I was still crawling with my custom user-agent) a big banner before you can read the news. Unlike on most other websites, the content was not available in the source code either. If you do not have the right cookie set(?), they just serve you a page without the actual content until you click the banner.

            Thus, I started using Googlebot as user-agent for derStandard and now it works fine.

            Bonus point: German newspaper Zeit has a group of articles in between paywalled and public (require free registration). With user agent Googlebot you do not need a registration.

            (These experiences come from running a newspaper monitoring service for almost two years now)

            1. 1

              Could you tell me more on the newspaper monitor?

              1. 1

                I started crawling derstandard.at in summer 2017 and added more newspapers later. Now I started setting up some analysis, but that’s still early-alpha and not much there yet. I still have to think about stuff to analyse and program it.

                This was inspired by a talk called SpiegelMining at the Chaos Communication Congress. I think a categorization for each newspaper would be nice (like “far right”, “liberal”, …). But first I want to have a nice site set up with simple statistical data like number of articles per week/year, categories, trends in category distributions (what becomes more important), …

                One or two weeks before the parliamentary elections in Germany I focused on the coverage of the parties. There were two different results from this:

                1. Right-wing AfD was mentioned in a very large number of article titles, but far less prominently in the article bodies
                2. It was possible to estimate the election results with an error of ~5%, based only on how often party names were mentioned in articles
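
                The comment does not say how the mentions were counted, so the following is only one possible sketch: count whole-word, case-insensitive mentions per party and normalize them to shares. The function name and the word-boundary matching are assumptions, and the party names in the usage note are placeholders, not real data:

```python
import re
from collections import Counter


def mention_shares(texts, parties):
    """Return each party's share of all party mentions across the texts."""
    counts = Counter()
    for party in parties:
        # Whole-word, case-insensitive match; real names may need aliases.
        pattern = re.compile(r"\b" + re.escape(party) + r"\b", re.IGNORECASE)
        for text in texts:
            counts[party] += len(pattern.findall(text))
    total = sum(counts.values())
    return {p: (counts[p] / total if total else 0.0) for p in parties}
```

                Running it once over article titles and once over article bodies, e.g. `mention_shares(titles, ["Alpha", "Beta"])`, would give the two distributions to compare.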

                My goal is to monitor the most important newspapers across the whole of Europe.

            2. 1

              w3m seems simpler than python with a boatload of external libraries just to get some text ;)

              It’s more that, by processing RSS with regex, I’m probably getting both false positives and false negatives.

              Also, w3m -dump produces hard newlines at some assumed terminal width, which is a problem if I want to then process the results or view them at a different terminal width. For instance, I occasionally scrape sites for use in training Markov models, and because distinguishing paragraph breaks from other kinds of whitespace produces better results in Markov model output for longform training data, I usually need to jump through hoops to guess which newlines are real and which were injected by the browser in order to reconstruct a more ‘natural’ rendering of the text. Because the python stuff here seems to be extracting text as opposed to attempting a console-based ‘rendering’ of the document, it seems like it’s less likely to try to reconstruct frames or tables, interpret layout markup, or inject its own newlines based on the value of $COLS.
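
              One simple version of that guessing step, as a sketch only: treat blank lines as the real paragraph breaks and join every other newline with a space. This assumes the dump keeps blank lines between paragraphs, and it will mangle lists, tables, and preformatted text; the `unwrap` name is made up here:

```python
def unwrap(text):
    """Join hard-wrapped lines into paragraphs separated by blank lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            # Non-empty line: part of the paragraph being collected.
            current.append(stripped)
        elif current:
            # Blank line: close the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```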

          2. 1
            1. 1

              Those long lists of simple, unadorned links leading to simple, unadorned text files - the way the web was meant to be IMnsHO - remind me of things like the old DejaNews archive (for which I made a two-pane interface which worked just fine until Google bought and subsequently closed them down, alas). Add it to gopher if you want, but it is already usable the way it is in any text browser.

              1. 1

                Gopher example is up as well: