1. 20
  1. 4

    Other documents on the web seem to have more detailed documentation about the technology used for the search index, etc., even though they are quite old. At least this is what I’ve found:

    https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.7362&rep=rep1&type=pdf

    Back then they used a system based on a “distributed in-memory index” that requesters queried via UDP broadcasts to locate the requested files stored on the disks; a rough sketch of that query pattern is below.

    I wonder how that system has evolved in the last decade.
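
    As a very rough sketch of that broadcast-lookup pattern (the port number and message format are made up for illustration, not the Archive’s actual protocol), a requester would broadcast a query and collect replies from whichever storage nodes hold the item:

    ```python
    # Minimal sketch of a UDP-broadcast lookup. The port and the plain-text
    # message format are hypothetical, chosen only to illustrate the idea.
    import socket

    QUERY = b"LOCATE example-item-id"

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.settimeout(1.0)
    sock.sendto(QUERY, ("255.255.255.255", 8404))  # hypothetical port

    # Each node whose in-memory index contains the item replies with its own
    # address, so the requester can then fetch the file from that node's disk.
    replies = []
    try:
        while True:
            data, addr = sock.recvfrom(4096)
            replies.append((addr, data.decode()))
    except socket.timeout:
        pass

    print(replies)
    ```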

    1. 5

      Until recently, index scans were performed very infrequently because each index scan caused the permanent loss of up to 10 hard disks. The specific cause of the disk failures seems to have been related to insufficient data center cooling capacity. Actively accessing the disks raised the machine room temperature by at least 5 degrees Fahrenheit.

      That is the greatest thing I’ve read today.

    2. 2

      Great talk, thanks for sharing. I didn’t know the Internet Archive was a library!

      1. 1

        I’m really glad they have their own infrastructure; it makes sense for something so core to the Internet. I was really surprised to see that all their storage is in the Bay Area, though. Hopefully there’s at least an off-site backup somewhere else in the world, far, far away… At the end it’s mentioned that they’re working on replicating across the country, but I’m surprised that wasn’t already in place.

        I’m also surprised that the cables are literally buried under the sidewalk and dug up whenever a new one needs to be routed. I would have expected a city like San Francisco to have a massive existing underground network of tunnels that could be used.