How does Stork compare to tinysearch, which was posted here recently? https://endler.dev/2019/tinysearch/
Tinysearch is great! I was chatting with Matthias briefly after I realized we had been working on something similar. He put a lot more work into the data structure, doing some really cool work implementing a bloom filter to get the index size down. I focused on ease of integration, making it a fully hosted library. Eventually I want to borrow some of Matthias’ ideas for Stork!
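For anyone unfamiliar with the trick: a Bloom filter records set membership in a fixed-size bit array, so it can say "definitely not here" or "probably here" without storing the words themselves. Here’s a minimal sketch in Python. It is not tinysearch’s actual code; the class name, the bit count, the hash count, and the one-filter-per-page arrangement are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


# Hypothetical usage: one filter per page, queried word by word.
page_filter = BloomFilter()
for word in "the federalist papers on the constitution".split():
    page_filter.add(word)

print("constitution" in page_filter)  # True: added words are always found
print(len(page_filter.bits))          # 128 bytes, however long the words are
```

The size win comes from the last line: the filter stays at 128 bytes no matter how much text goes in, at the cost of a false-positive rate that rises as it fills up.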
I would be interested to see a performance comparison with other in-browser search index schemes, like lunr.js, flexsearch, and fuse.js.
(@jil you might want to check your homepage design in WebKit browsers - on my iPhone 10 a lot of text was off the edge of the screen with no way to fix it!)
Coming in kinda late, but just fixed the text on the site. Thanks again for alerting me.
Same on Firefox Preview (Android)
Oh no! Thanks for letting me know. I had fixed a layout bug everywhere else… and must have accidentally borked things on mobile.
CSS is frustrating sometimes.
Stork seems large. It’s about 180k and, from my reading of the code, doesn’t do any sort of stemming; it just looks up words in a hashmap and finds the associated documents.
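To make the "words in a hashmap" description concrete, here’s a minimal sketch of that kind of lookup in Python. This is not Stork’s code; the function names and sample documents are made up for illustration. Because nothing is stemmed, "factions" and "faction" are different keys.

```python
import re
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of document ids containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Exact-match lookup: no stemming, so the query must match the stored form."""
    return set(index.get(query.lower(), set()))


docs = {
    "federalist-10": "The same subject continued: factions and liberty",
    "federalist-51": "Checks and balances between the departments",
}
index = build_index(docs)

print(sorted(search(index, "and")))       # ['federalist-10', 'federalist-51']
print(sorted(search(index, "factions")))  # ['federalist-10']
print(sorted(search(index, "faction")))   # []: no stemming, so no match
```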
I’m a bit of a skeptic about the viability of rust+wasm on the web.
The index of the Federalist Papers is a further 1.8MB. There’s 70k of ancillary script, too. Add them all together and you’ve got a transfer size comparable with the header image on a Medium blog.
That’s going to add 2 to 3 seconds to your first load time on a world-median 6Mbps connection (these are highly cacheable resources, so there’s no network transfer on page 2). In real terms that could be a very worthwhile tradeoff for some workloads.
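The back-of-envelope number above is easy to check. A small sketch in Python, assuming decimal units, no compression, and no latency or connection ramp-up (all simplifications of mine, not measurements):

```python
# Sizes quoted in the comment above, in bytes (decimal units, uncompressed).
STORK_BYTES = 180_000
INDEX_BYTES = 1_800_000
SCRIPT_BYTES = 70_000

LINK_BITS_PER_SECOND = 6_000_000  # the 6 Mbps "world-median" figure quoted above


def transfer_seconds(total_bytes: int, bits_per_second: int) -> float:
    """Idealized transfer time: ignores latency, TCP ramp-up and compression."""
    return total_bytes * 8 / bits_per_second


total = STORK_BYTES + INDEX_BYTES + SCRIPT_BYTES
print(total)                                                    # 2050000
print(round(transfer_seconds(total, LINK_BITS_PER_SECOND), 2))  # 2.73
```

So under these assumptions the first load costs between 2 and 3 seconds; compression on the wire would pull that down.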
That said, it’s kind of amazing that this works at all!
Lately I’ve been thinking about adding search to some documentation servers we have at work, which essentially serve static content produced from a bunch of asciidocs.
Both Stork and tinysearch could be useful for that task, as we could somehow hack together the mapping of which plaintext belongs to which page. However, I have a feeling that pushing all the indexes to the clients would introduce a big overhead that is unnecessary for our use case, since we actually control the web server and could run a search process there. Are there any tools for the use case where one has a static site, but also controls the webserver? I would like to avoid Elasticsearch, but I’m not aware of any alternatives.
This is probably the best time to send indexes to the client! Your business probably provides decent machines to developers/doc users, as well as a good uplink to wherever the docs are hosted!
For an example of a client-side search system that works well, check out mkdocs. Mkdocs is a static site generator written in Python that includes lunr.js client-side indexed search; see the docs here: https://www.mkdocs.org/user-guide/configuration/#search
I built the engineering documentation system at Airbnb on mkdocs with a hybrid search model. Here’s how it works: each project generates a standard mkdocs static site. The build outputs a bunch of HTML files as well as an index.json which contains flattened, stopword-processed plain text that’s very easy to index. When a project’s docs are deployed, this build output gets pushed up to S3 for static serving. Inside a single project we used mkdocs’ built-in search feature to search that index on the client.
There’s also a small node.js server that enumerates all the projects, pulls the index.json file for each project into memory, and combines them into a single lunr.js index. The server handles top-level queries when the user isn’t sure what project they need to look at. Lunr.js and S3 are both fast enough that the server process can re-fetch and re-index any changed index.json content as part of servicing a search request - at least, for the 100 or so projects that changed docs about 10 times a day. Search quality was probably lower than you’d get with Elasticsearch, but the simplicity of the system was well worth it - I made this whole system happen in hours instead of the days or weeks it would have taken to magick up a cluster, figure out all the fiddly Elasticsearch bits, etc.
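The server-side half of that hybrid model can be sketched in a few lines. The real system was node.js with lunr.js; the sketch below is a Python stand-in that uses a plain dictionary instead of lunr.js. The index.json shape (a "docs" list with "location" and "text" fields), the function names, and the sample projects are assumptions for illustration, and an in-memory dict stands in for the files fetched from S3.

```python
import json
import re
from collections import defaultdict


def load_project_docs(raw_index_json: str) -> list[dict]:
    """Parse one project's index.json (assumed shape: {"docs": [{...}, ...]})."""
    return json.loads(raw_index_json)["docs"]


def build_combined_index(projects: dict[str, str]) -> dict[str, set[tuple[str, str]]]:
    """Merge every project's docs into one word -> {(project, location)} index."""
    combined: dict[str, set[tuple[str, str]]] = defaultdict(set)
    for project, raw in projects.items():
        for doc in load_project_docs(raw):
            for word in re.findall(r"[a-z0-9]+", doc["text"].lower()):
                combined[word].add((project, doc["location"]))
    return combined


def search_all(combined, query: str) -> list[tuple[str, str]]:
    """Return pages containing every query word, across all projects."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(set(combined.get(w, set())) for w in words))
    return sorted(hits)


# Stand-in for the per-project index.json files the real server pulled from S3.
projects = {
    "payments": json.dumps(
        {"docs": [{"location": "setup/", "text": "Deploy the payments service"}]}
    ),
    "search": json.dumps(
        {"docs": [{"location": "deploy/", "text": "How to deploy the search service"}]}
    ),
}
combined = build_combined_index(projects)

print(search_all(combined, "deploy service"))
# [('payments', 'setup/'), ('search', 'deploy/')]
print(search_all(combined, "payments"))
# [('payments', 'setup/')]
```

Rebuilding the combined index is just a loop over small JSON files, which is why re-indexing on demand stays cheap at the scale of about a hundred projects.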