I’m working on a follow-up post about the technical details of my search engine, but the tl;dr is I’m not actually doing any spidering yet; just taking a single list of URLs directly, checking robots.txt and content-type, sending the HTML through pandoc, and stuffing it in SQLite’s full-text indexer. The search side is a very simple web app written in moonmint, a framework created and abandoned in 2016, which uses “luv”, the Lua bindings to libuv (the I/O subsystem used by Node.js).
All the code is written in Fennel: https://git.sr.ht/~technomancy/search/
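For anyone curious what that pipeline looks like end to end, here is a rough Python sketch of the steps as described (not the actual Fennel code; the database filename, table layout, user-agent string, and urls.txt input file are all made up for illustration, and it assumes pandoc is on the PATH and SQLite was built with FTS5):

```python
import sqlite3
import subprocess
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin

db = sqlite3.connect("index.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)")

def allowed(url, agent="personal-search"):
    # Check robots.txt before fetching the page itself.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    return rp.can_fetch(agent, url)

def index_url(url):
    if not allowed(url):
        return
    with urllib.request.urlopen(url) as resp:
        # Only index pages that actually claim to be HTML.
        if "text/html" not in resp.headers.get("Content-Type", ""):
            return
        html = resp.read()
    # Let pandoc strip the markup down to plain text for indexing.
    text = subprocess.run(
        ["pandoc", "-f", "html", "-t", "plain"],
        input=html, capture_output=True, check=True,
    ).stdout.decode("utf-8", errors="replace")
    db.execute("INSERT INTO pages (url, body) VALUES (?, ?)", (url, text))
    db.commit()

for line in open("urls.txt"):
    index_url(line.strip())
```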
A bit off-topic, but thanks for your work on Fennel, it made working on Factorio mods a lot more pleasant. I think it might finally be my gateway Lisp as well, after so many glancing blows with that family of languages.
I’ve gotta ask the question I always ask when SQLite FTS comes up. How do you sanitise the input of the search query before putting it in the SQL query? FTS queries have their own DSL of sorts.
Belatedly: Apparently like this? https://git.sr.ht/~technomancy/search/tree/main/item/sql.fnl#L10
I think I found the right code, at least.
Yeah, the search function in that file is key: it replaces every character that isn’t an ASCII letter or digit with a space. SQLite really should provide a better way to do this natively.
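In other words, something along these lines (a minimal Python sketch of the same idea, not the actual Fennel code; the table name and column are made up), where the sanitised string is the only thing that ever reaches MATCH:

```python
import re
import sqlite3

def sanitize(query):
    # Replace anything that isn't an ASCII letter or digit with a space,
    # so FTS5 punctuation syntax (quotes, *, -, :, parentheses) never
    # reaches the query parser.
    return re.sub(r"[^A-Za-z0-9]+", " ", query).strip()

db = sqlite3.connect("index.db")
terms = sanitize('"lua (sqlite* OR fts)"')  # -> 'lua sqlite OR fts'
if terms:  # an all-punctuation query would otherwise produce an empty MATCH
    rows = db.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank",
        (terms,),
    ).fetchall()
```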
This reminds me of an old interview/talk with Joe Armstrong, where he gets asked to predict the future. One of those “what do you think will happen in the next 10 years?” kinds of questions. His answer, which stuck with me, was something along the lines of: “you have to look at where things change by an order of magnitude and think about what the implications of that will be”. “For example”, he continued, “when disks grow by two [or three?] orders of magnitude we’ll be able to store everything humans have ever written on a single computer”.
I believe it was this conclusion that perhaps led him to think about search. He gave a talk called Sherlock’s Last Case, which is about searching for similar things using compression. A year later he refined some of the ideas as part of another talk. If I remember correctly, his argument was something along the lines of: we can do better search locally, because we have more compute (per user) than what we get on a shared server hosted by a search engine provider (for example, his compression-based comparison wouldn’t scale).
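For the curious, the compression trick is in the spirit of normalized compression distance; a toy Python version (my sketch of the general idea, not Armstrong’s code) looks roughly like this:

```python
import zlib

def ncd(a: bytes, b: bytes) -> float:
    # Normalized compression distance: related documents compress better
    # together than unrelated ones, so smaller values mean "more similar".
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

print(ncd(b"the quick brown fox jumps", b"the quick brown dog jumps"))   # smaller
print(ncd(b"the quick brown fox jumps", b"completely unrelated bytes"))  # larger
```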
Peter Norvig’s genre is search. He literally wrote the book on good old-fashioned AI, where every problem is reduced, for better or worse, to a search problem.
From the Sudoku affair:
I fully expect this is the year of semantic web and GOFAI on the desktop. At least, it is at my org.
Very nice. My method is to keep browser history around forever (I also have a personal script to clean the URL patterns I don’t want to stick around out of the SQLite db) and set Firefox to show me suggestions from history and bookmarks instead of query completions. I’ve also started adding search keyword bookmarks, but honestly I keep forgetting to use them.
Doesn’t load for me, but the repository copy does https://git.sr.ht/~technomancy/search/blob/main/static/why.html
Doesn’t load for me either
Thanks for the heads-up; clearly I need some daemon management more sophisticated than “make run inside tmux”. =)
I can highly recommend while true; do make run; sleep 1; done!
Here’s a tiny quality-of-life improvement on that pattern: while sleep 1; do thing; done. Makes it so much easier to Ctrl-C that loop!
And that is how I run my son’s Minecraft server. Good enough.
I have worked on a Web-based system (from the beginning, so it is even my fault) where there are multiple SBCLs in a while true; loop in separate screens behind Nginx balancing between them… The total number of unique users having logged in at least once in a specific month is non-single-digit thousands.
And whatever issues arise, it is not because of while true; loops failing to restart a process!
Uh, I’ve had a very similar situation (nginx, sbcl, screens and thousands of people swarming the server at the same time), but nginx’s load balancing was disabled because sbcl is just so great, it never crashed.
Good times
The one time we had to restart the service was because of a bug in the hypervisor. That was hell to debug xD
Well, we did hit an old SBCL FD leak bug when calling LaTeX, but also some of our admin-side summary operations could run out of memory if one asked for too much data… The notion of too much evolved with the deployment changes, of course.
I wonder if it would be feasible to standardize on an indexing format and federate a search engine that searches through the collected indexes. I’m not super interested in finding stuff in sites I already know about and frequent. But maybe you know things I don’t.
As far as I’m aware, commoncrawl.org has a standard format. I seem to remember a federated search project, but can’t find it now.
Totally; Common Crawl uses the WARC (Web ARChive) format to store all its raw crawled sites. This seems like the correct choice for crawl data and offers any tool that wants to consume crawled information a huge corpus to test on. However, I’m not convinced that what we want is a federated crawling database rather than a federated index. Specifically, an index should be much smaller and avoid any content-distribution oddities. So rather than passing around WARCs, I wonder if it would be more useful to pass around term frequencies mapped to URLs (plus needed metadata: outlinks, maybe a crawler signature, etc.).
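As a concrete (and entirely hypothetical) example of what such a per-URL index entry could look like, sketched in Python; the field names and the 1000-term cap are made up:

```python
from collections import Counter
import json
import re

def index_entry(url, text, outlinks, crawler="example-crawler/0.1"):
    # A per-URL record small enough to pass around: term frequencies plus a
    # little metadata, rather than the raw crawled content.
    terms = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return {
        "url": url,
        "terms": dict(terms.most_common(1000)),  # cap to keep entries small
        "outlinks": outlinks,
        "crawler": crawler,
    }

entry = index_entry(
    "https://example.com/post",
    "Personal search engines are fun. Search is fun.",
    ["https://example.com/about"],
)
print(json.dumps(entry, indent=2))
```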
YaCy?
You know, on several occasions I’ve been unable to find something that I know I bookmarked but couldn’t remember the title. This kind of thing would be useful for that scenario too!
Why only index your own bookmarks? I shared mine on my website and wrote a tool for searching them and others’ bookmarks: https://jak2k.schwanenberg.name/post/indiebookmarks/
Because I’m just getting started! This is just a few days worth of work so far but it is by no means finished.
(But also because my friends don’t share their bookmarks, at least not yet.)
This is really neat! I’m working on a project that’s like an ActivityPub-based version of this, with all the niceties you’d expect from a full-blown bookmark manager. I think including bookmarks from friends, and friends of friends, will make a pretty high-quality index.
That’s nice! I decided to keep it simple because I wanted it to work with static site generators.
My approach to friends-of-friends indexes is that you just also bookmark nice stuff from them, or use a blogroll to discover indexes.
This inspires me to take my existing ArchiveBox archive, add my browser history to it, and then use my kbgrep tool for a nice full text search TUI.
I have no idea why I haven’t thought to do this yet. Thanks.
How well does ArchiveBox work these days, with all the “Prove you’re not a bot” interstitials?
I’ve been using the WebScrapbook Firefox extension for what I don’t use yt-dlp for, because it saves the current page as already loaded into the browser.
Neat! I did something vaguely similar a while back where I wanted full-text search over my browser history, so I could have an actually searchable full history of everything I’ve ever read/visited without needing to manually bookmark (because I always forget). Then I toyed with the idea of spidering to build a kind of personal search engine consisting of things that would likely be related to all the stuff I visit. That was built in Go and used a full-text search database called Bleve, which was quite nice but took a bit of tuning to get working.
I didn’t really continue working on it, but if you want the code I could open source it; it seems somewhat tangential!
The majority of things I read on the web, I download the HTML page first, convert it to text second, and then read in Vim. Sometimes this happens in real time, sometimes my script grabs stuff in advance. I don’t index properly, but from time to time I grep for stuff.
An observation about the modern Internet: often, curl-impersonate pretending to be a relatively old Chrome gets you usable HTML with all the content, while an unknown user-agent gets a JS-only version (typically with a CAPTCHA).
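If you want the same trick from Python rather than the curl-impersonate CLI wrappers, the curl_cffi package wraps the same library; a minimal sketch (the available “impersonate” profile names depend on the installed version, so treat the value below as an assumption):

```python
# Sketch using curl_cffi, a Python binding over curl-impersonate.
from curl_cffi import requests

# "chrome" picks a recent Chrome fingerprint; specific versioned profiles
# vary by curl_cffi release.
resp = requests.get("https://example.com/article", impersonate="chrome")
print(resp.status_code, len(resp.text))
```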
Out of curiosity, what did you use for this? I’m using pandoc -f html -t markdown and it’s pretty good, but I haven’t spent much time evaluating alternatives.
A custom Common Lisp converter. So that, for example, for many URL-in-URL tricks I get both the full link and the extracted probable-target link.
For a personal project I was looking for a way to turn HTML into text/markdown. I ended up writing something using moz-readability plus turndown, but I also found trafilatura, which looks great and even includes a command-line tool if you’d rather not write Python.
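For reference, the Python side of trafilatura is about this small (a sketch based on its documented fetch/extract helpers; check the docs for output options, and the URL here is just a placeholder):

```python
import trafilatura

# Fetch a page and extract the main article text, dropping boilerplate.
downloaded = trafilatura.fetch_url("https://example.com/article")
text = trafilatura.extract(downloaded)
print(text)
```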
I thought about something similar, but never implemented anything. How did you save the history? I thought about writing a browser plugin that would save the text of all visited pages. However, thinking about it again now as I write this, perhaps it would be easier to access the actual browser history (the URLs) and re-download their content periodically…
Firefox stores browser history in SQLite, so it’s like a 1-liner to extract URLs without any plugins or even running Firefox at all.
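For example, a minimal Python sketch against the standard moz_places table (run it on a copy of places.sqlite from the profile directory, or with Firefox closed; column names are as I remember them, so double-check):

```python
import sqlite3

# Pull the most recently visited URLs straight out of Firefox's history DB.
db = sqlite3.connect("places.sqlite")
for url, title in db.execute(
    "SELECT url, title FROM moz_places ORDER BY last_visit_date DESC LIMIT 20"
):
    print(url, "-", title or "")
```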
sqlite3 "file:places.sqlite?immutable=1&cache=shared" "..."will let you run the query without having to quite firefox.i did first think of using this approach but all the browsers on all the platforms use different approaches so it was way more work compared to just building a browser extension, plus I needed a user interface for the searching and having that inside the browser itself was the only good choice
So this app was two things: a browser extension and a desktop app written in Go. (I chose Go because the desktop app didn’t actually need a user interface, which made life easier: all it needed was a simple system tray icon, since the user interface itself is handled entirely by the extension.)
The extension then just requests access to visited sites, and every time you load a page it sends that URL, plus some metadata that’s directly available, to the desktop app (similar to how most of these extension+app products work: the app binds to the first available port from a small list of ports, kinda like how 1Password works).
The desktop app then handles the rest: indexing into the Bleve database and responding to queries. The extension renders search results either in the omnibox via a Chrome API, or in the new-tab page in a nice UI with link previews and search/filter tools.
All in all, the foundations were there and it worked. I even managed to build installers for Windows and Mac (though you can’t do automated code signing and Mac App Store deployment in CI, you need to do it via an Apple device, an absolute nightmare; Windows was vastly simpler). But I lost interest and realised selling it would be very difficult due to the privacy concerns of syncing between devices; I’d have to build a whole bunch of E2EE infrastructure, etc. So I binned it and moved on!
Thanks OP for reminding me that my similar project is sitting rotting.
I finally finished the text scoring today after reading this, so I think it’s time to push it to production. The code is an incomprehensible mash of snippets cobbled together whenever I can snatch a few moments, so it needs a tidy before I publish it anywhere.
I did something along these lines in January after getting fed up with all the bookmark websites being walled gardens.
I combined the MarkDownload browser plugin, which converts the current page to Markdown and downloads it locally, with an Obsidian vault. Instead of sending things to Pinboard, I have the full Markdown copy plus images locally. I can search through it with Obsidian (or ripgrep, or later turn it into embeddings and use it in a RAG system to answer questions).
Here are the notes on how I set it up for myself: https://gist.github.com/tednaleid/2947b2fad2b5307071c398218fdf7d74
Joplin’s Web Clipper plugin is similarly very handy - clip the full page to Markdown or HTML format, or grab the selection, URL, or a screenshot.
This is a good reminder, though; I should use it more often, perhaps in addition to the couple of bookmarking services I have now.