1. 12
    1. 4

      Using wget is cute, but I’d rather Firefox stored the DOM, which would let me search by content on JS-dependent pages.

      1. 1

        Why store the DOM, which doesn’t have a standard representation? National libraries have standardised on WARC, which records something more faithful to what the servers actually send.

        Storing a static DOM would require you to choose a canonical point in time, which would be hard to automate. What if content is only inserted into the DOM after the splash screen is dismissed? WARC solves this.

        1. 2

          My understanding of WARC may be incomplete, but is replaying the stream of blobs received from the server really enough? Can it recreate the output of JS code paths that depend on other sources like the current time and user input? What if the content isn’t fetched until after the user dismisses the splash screen? Then it won’t show up in the WARC, making it just as hard to automate.

          But having the browser simply dump the state of the currently loaded web page when you bookmark it is foolproof, because who in their right mind would bookmark a splash screen?
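
          For what it’s worth, a headless browser can already do that dump. A minimal sketch with Playwright (the URL and output filename are placeholders):

          ```python
          # Serialize the DOM after JS has run (pip install playwright,
          # then: playwright install chromium).
          from playwright.sync_api import sync_playwright

          with sync_playwright() as pw:
              browser = pw.chromium.launch()
              page = browser.new_page()
              # "networkidle" waits until the page stops making requests
              page.goto("https://example.com", wait_until="networkidle")
              with open("snapshot.html", "w", encoding="utf-8") as f:
                  f.write(page.content())  # the live DOM, not the raw response
              browser.close()
          ```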

    2. 2

      Firefox bookmarks are stored in an SQLite file, wget can download a full page with all dependencies, and Python will play happily with both.
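
      Roughly like this, as a sketch: it assumes you’ve copied places.sqlite out of your profile directory (Firefox locks the live file) and that wget is on your PATH.

      ```python
      import sqlite3
      import subprocess

      # moz_bookmarks.fk points at moz_places.id, which holds the URL.
      con = sqlite3.connect("places.sqlite")
      rows = con.execute(
          "SELECT p.url FROM moz_bookmarks b "
          "JOIN moz_places p ON b.fk = p.id "
          "WHERE b.type = 1"  # type 1 = bookmark, 2 = folder
      ).fetchall()
      con.close()

      for (url,) in rows:
          # -p: page requisites, -k: rewrite links for local viewing,
          # -E: add .html extensions, -P: output directory
          subprocess.run(["wget", "-p", "-k", "-E", "-P", "archive", url])
      ```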

      Another way to get your bookmarks out of Firefox is to press Ctrl+Shift+O and create a backup from the window that opens. You will get a JSON file containing all of your bookmarks. Maybe it’s less convenient than reading the SQLite file directly, but JSON is a bit simpler to work with.
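
      The backup is a nested tree of folders, so a short recursive walk flattens it into (title, uri) pairs. A sketch, with the filename as a placeholder:

      ```python
      import json

      def walk(node):
          # Bookmark entries carry a "uri"; folders carry "children".
          if "uri" in node:
              yield node.get("title", ""), node["uri"]
          for child in node.get("children", []):
              yield from walk(child)

      with open("bookmarks-backup.json", encoding="utf-8") as f:
          root = json.load(f)

      for title, uri in walk(root):
          print(title, uri)
      ```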

    3. 1

      If the goal is search, I bet you could make the download size a lot smaller. You don’t even need to keep the text (let alone the HTML and JavaScript); you just need the unique set of words.
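
      Something like this sketch, sticking to the standard library (the tokenizer is deliberately crude):

      ```python
      import re
      from html.parser import HTMLParser

      class TextExtractor(HTMLParser):
          """Collect visible text, skipping <script> and <style> contents."""
          def __init__(self):
              super().__init__()
              self.skip = 0  # nesting depth inside script/style
              self.chunks = []

          def handle_starttag(self, tag, attrs):
              if tag in ("script", "style"):
                  self.skip += 1

          def handle_endtag(self, tag):
              if tag in ("script", "style") and self.skip:
                  self.skip -= 1

          def handle_data(self, data):
              if not self.skip:
                  self.chunks.append(data)

      def word_set(html: str) -> set[str]:
          p = TextExtractor()
          p.feed(html)
          return set(re.findall(r"[a-z0-9]+", " ".join(p.chunks).lower()))
      ```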

      1. 2

        This gave me an idea! :)
        https://github.com/deejayy/content-saver-extension

        (just a proto)

        1. 2

          Ping me if it works; I would like to test it.

      2. 1

        Back in the old days (before widespread HTTPS), I wrote this as an HTTP proxy that indexed every page you viewed. I guess today it would need to be either an extension or an after-the-fact history walker.
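
        The history-walker variant is straightforward, since Firefox keeps history in the same places.sqlite as bookmarks. A sketch (again, work on a copy of the file):

        ```python
        import sqlite3

        # moz_places holds history; last_visit_date is microseconds since epoch.
        con = sqlite3.connect("places.sqlite")
        urls = [u for (u,) in con.execute(
            "SELECT url FROM moz_places "
            "WHERE last_visit_date IS NOT NULL "
            "ORDER BY last_visit_date DESC"
        )]
        con.close()
        # ...hand `urls` to a fetch-and-index job from here.
        ```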

        1. 1

          A MITM proxy is also valid if it’s running locally.

        2. 1

          If you are running it locally, you can install a local signing cert in your browser / OS and still intercept everything.
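
          mitmproxy makes that a few lines: a sketch of a script-style addon that sees every decrypted HTML response (the actual indexing is left out):

          ```python
          # Run with: mitmproxy -s indexer.py
          # (after trusting mitmproxy's CA cert in the browser/OS)
          from mitmproxy import http

          def response(flow: http.HTTPFlow) -> None:
              ctype = flow.response.headers.get("content-type", "")
              if "text/html" in ctype:
                  url = flow.request.pretty_url
                  body = flow.response.text  # decoded response body
                  # index (url, body) here
          ```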

    4. 1

      I’ve thought about doing this before and never pulled the trigger. But not just for bookmarks - go one step further and do it for your history and then slap it into a vector db for semantic search (and to keep your own personal cache of fodder for LLM contextual priming).

      Edit: I actually read the article rather than just skimming it and noticed the last line: :D

      Now imagine having this offline library, with local AI to search all that…
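
      A sketch of what that search could look like, assuming the sentence-transformers package and pages already reduced to plain-text files:

      ```python
      import numpy as np
      from pathlib import Path
      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer("all-MiniLM-L6-v2")

      paths = sorted(Path("archive").glob("*.txt"))
      texts = [p.read_text(errors="ignore") for p in paths]
      # Normalized embeddings make dot product equal cosine similarity.
      vecs = model.encode(texts, normalize_embeddings=True)

      def search(query: str, k: int = 5):
          q = model.encode([query], normalize_embeddings=True)[0]
          return [paths[i] for i in np.argsort(vecs @ q)[::-1][:k]]
      ```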

    5. 1

      This is funny. The Firefox Screenshots feature started off as the Page Shots experiment by Ian Bicking (check his blog!) and actually created shareable content from the pages as excerpts from the DOM.

      My understanding is that, through iteration and user feedback, it became the screenshot feature Firefox has now. People mostly didn’t want to share potentially outdated versions; they preferred exact screenshots or just links. The in-between use case was mostly unpopular and didn’t work for people on messengers or in social contexts.

      Maybe this is worth exploring in an archival context rather than the social one. Probably interesting for getpocket.com.