1. 6

Something I often wanted to see on Hacker News was an automatic link to a cached copy of a submission, so when the site goes down, there’s still something to read. I thought about taking it one step further and making the text of the article automatically available.

A user recently requested that story texts be made available through RSS, which got me thinking about this some more, so I coded up the feature on a branch.

When a story is submitted, the site would cache its text (using Diffbot) and show it as a collapsed block on the story page:

http://i.imgur.com/Dmna6.png -> http://i.imgur.com/NW8eo.png

This text would also be used as the story text in the RSS feed, and it would be included in the Sphinx index so that articles can be found through the search engine.
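
The caching step would boil down to something like this (just a sketch; the Diffbot endpoint and JSON field names shown are assumptions from my reading of their docs, not necessarily what the branch does):

    # Rough sketch of the caching step, not the actual branch code.
    # The endpoint URL and the "objects"/"text" fields are assumptions about
    # Diffbot's article API; adjust them to whatever their docs actually say.
    import requests

    DIFFBOT_TOKEN = "..."  # API token placeholder

    def fetch_story_text(story_url):
        resp = requests.get(
            "https://api.diffbot.com/v3/article",
            params={"token": DIFFBOT_TOKEN, "url": story_url},
            timeout=10,
        )
        resp.raise_for_status()
        objects = resp.json().get("objects", [])
        return objects[0].get("text", "") if objects else ""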

Of course there are copyright concerns with caching other people's pages, not to mention the decreased traffic and possible loss of ad revenue when people read the story here or through RSS rather than clicking through. Diffbot doesn't seem to honor robots.txt or the meta tags that would otherwise restrict Google's or the Wayback Machine's archiving.
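
One partial mitigation would be to check robots.txt ourselves before caching anything, e.g. with Python's standard robotparser (a sketch; the user-agent string is just a placeholder):

    # Sketch: honor robots.txt ourselves before caching, since Diffbot doesn't.
    # "LobstersCache" is a placeholder user-agent, not anything that exists today.
    from urllib import robotparser
    from urllib.parse import urlparse, urlunparse

    def caching_allowed(story_url, user_agent="LobstersCache"):
        parts = urlparse(story_url)
        robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        try:
            rp.read()
        except OSError:
            return False  # can't read robots.txt, so err on the side of not caching
        return rp.can_fetch(user_agent, story_url)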

What do you guys think?

  1.  

  2. 6

    As an occasional content creator (aka writer), I would not want my blog posts handled this way. It's not about ad revenue, but about bringing readers to my site, getting them to comment there, and hopefully having them click around, read other stuff, and subscribe to my feed.

    1. 5

      Could we make it an opt-in feature? That way, content creators who want the ad revenue could simply not opt in to having their text cached; the article description or comments would show up in RSS instead, and everything would be “business as usual” for the end user on lobste.rs.

      I also think that, for both size reasons and out of respect, we should truncate the cached text to just the first paragraph. That would still give users a general idea of what the article says, since titles can sometimes be misleading.
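
      Something like this would do for the truncation (a sketch that assumes the extracted text separates paragraphs with blank lines):

          # Sketch: keep only the first non-empty paragraph of the cached text.
          # Assumes the extracted text separates paragraphs with blank lines.
          def first_paragraph(text):
              for para in text.split("\n\n"):
                  para = para.strip()
                  if para:
                      return para
              return ""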

      1. 4

        A compromise might be to cache the stories but keep them private, then give users the ability to “flag” a link as broken. Once a certain number of members flag the link, the cached copy is also displayed for a limited time (e.g. 24 hours); a rough sketch of that rule is below.

        I think both sides of the issue have good, valid arguments. Writers deserve credit and traffic, but if the site is down, they're getting neither either way. I would opt for a solution similar to this.
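
        The rule itself could be as simple as this (a sketch; the threshold, field names, and window are made up for illustration):

            # Sketch of the display rule: show the cached copy only once enough
            # members have flagged the link as broken, and only for a limited window.
            # FLAG_THRESHOLD, the 24-hour window, and the arguments are illustrative.
            from datetime import datetime, timedelta

            FLAG_THRESHOLD = 3
            DISPLAY_WINDOW = timedelta(hours=24)

            def show_cached_copy(broken_flags, threshold_reached_at, now=None):
                now = now or datetime.utcnow()
                if broken_flags < FLAG_THRESHOLD:
                    return False
                return now - threshold_reached_at <= DISPLAY_WINDOW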

        1. 4

          I think it’s a neat idea, but your last paragraph seems like the killer problem. If the linked-to sites are OK with it, it seems fine (IMO). But figuring out if it’s OK or not sounds like a hard problem.

          1. 3

            Legally, it seems like it’s asking for trouble. Somebody’s bound to get ornery about their copyright at some point.

            It also just feels a little tacky: it's a bit of a kick in the teeth to anyone who's blogging to drive traffic to their site for ad revenue, or to increase awareness of their personal or corporate brand, or whatever. People are either losing name/brand recognition on the content they produce, or having their name associated with a presentation of their content that they may prefer not to be associated with (think formatting, layout, style, embedded images, code-section formatting, etc.).

            Explicit opt-in, fine.

            1. 3

              If the goal is to display the text when originating sites go down, how about a cron job that pings the page, checks whether it still loads, and displays the cached text if/when the site goes down?
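
              Roughly this, assuming a timeout or a 5xx response is a good enough definition of “down”:

                  # Sketch of the cron job's check: treat the story as down when the
                  # page times out, refuses the connection, or returns a server error.
                  import requests

                  def site_seems_down(story_url, timeout=10):
                      try:
                          resp = requests.get(story_url, timeout=timeout)
                      except requests.RequestException:
                          return True  # DNS failure, timeout, connection refused, ...
                      return resp.status_code >= 500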

              1. 4

                Maybe on submission the site should just fetch the Coral CDN URL to bring the page into Coral's cache, and always show a “cached” link on the story page. That way nothing is hosted on Lobsters, site owners can control Coral's fetching if they want to opt out, and the original page is (hopefully) presented in its original form from the cache.
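
                The URL rewrite itself is trivial; something like this, assuming Coral's usual append-.nyud.net-to-the-hostname scheme (only plain http URLs on the default port are handled here):

                    # Sketch: rewrite a story URL to its Coral CDN form by appending
                    # ".nyud.net" to the hostname. Only plain http URLs on the default
                    # port are handled; anything else is returned unchanged.
                    from urllib.parse import urlparse, urlunparse

                    def coralize(story_url):
                        parts = urlparse(story_url)
                        if parts.scheme != "http" or parts.port is not None:
                            return story_url
                        return urlunparse(parts._replace(netloc=parts.netloc + ".nyud.net"))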

                1. 1

                  You’d have to ping pretty aggressively to make it worthwhile or users will be clicking dead links between pings, and then you’re just making the problem worse.

                2. 2

                  Perhaps a link to a cached copy on the comments page. The main links would only go to the original site. This way, seeing the local cache is more work than visiting the original, so people will be encouraged to go where they should. Google’s model is vaguely like this.

                  Only show the cached copies to logged-in users, so they aren't universally scrapeable (plus robots.txt). Expire them too?
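
                  The visibility check would then be something like this (a sketch; the 30-day expiry is just a guess at a reasonable window):

                      # Sketch: only logged-in users see a cached copy, and copies
                      # expire after some period (30 days here is an arbitrary choice).
                      from datetime import datetime, timedelta

                      CACHE_TTL = timedelta(days=30)

                      def can_view_cached_copy(user, cached_at, now=None):
                          now = now or datetime.utcnow()
                          return user is not None and (now - cached_at) <= CACHE_TTL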