Zoekt is for text (trigram, regex) search; this is for precise code intelligence, which understands the semantics of the code being queried (go-to-def, find refs, diagnostics, etc.).
Are you still using SQLite databases for the original repos, or did I miss a complete transition away from SQLite somewhere along the way? If the former, how do you handle durable storage with high availability in a cloud deployment? For example, do you use an NFS-based storage service, upload the SQLite files to object storage and download them when needed, or something else?
Good question! We’re still using SQLite for each bundle, which buys us some nice properties in terms of eviction (it’s easy to just delete an SQLite file when the commit it’s attached to gets sufficiently far away from the tip of the branch, as no one is likely to request code intel results for that commit anymore). It does have some issues with exclusive access.
I tried to swap out SQLite for something like BadgerDB (https://github.com/sourcegraph/sourcegraph/pull/11052), but the performance improvement wasn’t big enough to merit the swap (and to force the required migrations on users). I’m still curious about other backends we could use - both embedded ones (a la LevelDB) and client/server ones (nothing is preventing us from re-evaluating Dgraph as a storage backend after the product and our knowledge of it have progressed a bit).
Right now we aren’t planning on horizontally scaling the bundle manager via replication, but via sharding. That means that each SQLite file would be guarded by exactly one process, and we could split hot bundle managers by increasing the shard count. The performance of bundle managers at our current scale isn’t an issue and we haven’t had to increase the shard count past one (but we do have some precedent with sharding services like this - it’s how we scale our gitservers).
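To make the sharding idea concrete, here’s a minimal sketch of hash-based shard routing. The `shardFor` function and the modulo scheme are illustrative assumptions on my part, not Sourcegraph’s actual routing code; the point is just that a repo name deterministically maps to one shard, so one process owns each SQLite file. (Note that naively increasing the shard count remaps most bundles, so files would need to move or be re-uploaded.)

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically maps a bundle's repository name to one of
// n bundle-manager shards by hashing. Each SQLite file is then guarded
// by exactly one process, and a hot shard can be split by raising n.
func shardFor(repo string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(repo))
	return int(h.Sum32() % uint32(n))
}

func main() {
	for _, repo := range []string{"github.com/a/b", "github.com/c/d"} {
		fmt.Printf("%s -> shard %d\n", repo, shardFor(repo, 4))
	}
}
```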
I’ve also had the idea for a while that we could keep “hot” bundles in a local SSD cache but store them permanently in block storage. This would work well since the SQLite databases are write-once read-many, and it would also allow us to scale our bundle managers horizontally via replication: any bundle manager replica could pull down the same bundle without issue. I haven’t yet evaluated this strategy (bundles can be kind of large, so making the initial requests fast would take some engineering work), but it’s still on my radar.
I’ve seen this pattern in my own test code as well (and the code from co-workers over multiple jobs). I wrote https://github.com/derision-test/go-mockgen in order to automate the interface mock definitions (and keep them in sync with the code via go:generate).
This somewhat defeats the purpose of mocked interfaces, which, at least at some level, exist to explicitly allow drift between consumer and producer. Interfaces are consumer contracts: lightweight schemas defined by consuming code to model, and decouple from, dependencies. If you regenerate your client-side mocks every time your server-side implementation changes, or if you define the mocks on the producer side rather than with the consumer, the two sides are still tightly coupled.
These are good points and I’ve changed my thinking over time to prefer defining interfaces at the use site, not near the definition.
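A tiny Go example of what “defining interfaces at the use site” looks like; the names (`GitClient`, `revisionResolver`) are made up for illustration. The consumer declares only the one method it needs, and Go’s implicit interface satisfaction means the producer conforms without ever referencing the consumer’s type, so the two sides stay decoupled.

```go
package main

import "fmt"

// Producer side: a concrete client with a wide API surface.
type GitClient struct{}

func (GitClient) ResolveRevision(repo, rev string) (string, error) { return "deadbeef", nil }
func (GitClient) Archive(repo, commit string) ([]byte, error)      { return nil, nil }

// Consumer side: the calling code declares only what it depends on.
// GitClient satisfies this interface implicitly, so the producer can
// grow or change freely as long as this one method keeps its shape -
// and a test can substitute a mock that implements just this method.
type revisionResolver interface {
	ResolveRevision(repo, rev string) (string, error)
}

func headCommit(r revisionResolver, repo string) (string, error) {
	return r.ResolveRevision(repo, "HEAD")
}

func main() {
	commit, err := headCommit(GitClient{}, "github.com/a/b")
	fmt.Println(commit, err)
}
```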
That said, I still find utility in this tool when mocking interfaces defined as part of the code under test.