1. 6

Not sure if such posts are allowed; if not, let me know or downvote me into oblivion :) I have been thinking about this for a while and wanted the community to vote on it. Idea: implement a distributed search engine leveraging some blockchain concepts.

Values:

* Global, neutral, transparent search engine
* Everyone is allowed access to the search engine, plus an API for everyone to use
* Privacy is the default: no tracking of end users unless they opt in (so they can get personalized search results)
* Use global computing power to run a super-scalable learning algorithm
* Anti-manipulation built in, to prevent nodes from manipulating or skewing search results
* The entire web is included, clearnet and darknet

At a high level (there are enough details to publish a paper, but I will keep it high level):

* User-operated nodes which run discovery, crawling, indexing, learning, etc.
* Website owners, server owners, and webmasters are encouraged to run their own nodes, which will index and learn their content, provide a search function for their own sites, and help get their content discovered.
* API constructs let you both run search queries and FEED your structured data and learning data to the network.
* Serving ads (or we can call it sponsored content, or random content) is allowed: if you run a node, you may serve ads proportional to the computing power you have contributed to discovering, crawling, and indexing the web.
* A user-facing search site that is distributed and runs on the various nodes.
* When a user accesses DS.com, DNS resolves to the nearest node. The nearest node serves queries either from its local index or by trying to find rankings from nearby nodes, etc. (see the sketch below).
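
To make the last two bullets concrete, here is a minimal sketch of the query path. Everything in it (the Node class, the peer list, the term-to-(url, score) index, the hop limit) is hypothetical and invented for illustration, not taken from any existing system:

```python
# Illustrative sketch only: a node that answers a query from its local
# index and falls back to asking peer nodes it knows about.

class Node:
    def __init__(self, node_id, peers=None):
        self.node_id = node_id
        self.peers = peers or []     # other Node instances we know about
        self.local_index = {}        # term -> list of (url, score)

    def index(self, term, url, score):
        self.local_index.setdefault(term, []).append((url, score))

    def search(self, term, hops=2):
        # 1. Try the local index first (the "nearest node" case).
        hits = list(self.local_index.get(term, []))
        # 2. If there is nothing local, ask nearby peers, up to a hop limit.
        if not hits and hops > 0:
            for peer in self.peers:
                hits.extend(peer.search(term, hops=hops - 1))
        # 3. Merge and rank before returning to the user.
        return sorted(set(hits), key=lambda hit: hit[1], reverse=True)

# "DNS resolves to the nearest node" would happen outside this sketch;
# here we just pretend the user landed on node_a.
node_a, node_b = Node("a"), Node("b")
node_a.peers.append(node_b)
node_b.index("lobsters", "https://lobste.rs", 0.9)
print(node_a.search("lobsters"))   # answered by a peer, not the local index
```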

Ask questions, poke holes, upvote or downvote into oblivion… and does anyone want to work with me on this?


  2. 5

    Trust is a major problem. You touch on it by saying you’ll have “anti-manipulation built in”, but there’s a big cost in work duplication or other overhead in dealing with the problem that, at any time, any node could return data that is subtly wrong. That cost appears at literally every step of “discovery, crawling, indexing, learning” and searching… even in resolving trust disputes and other meta-level activities, such as assessing claims like “proportional to your computing power”.

    Google doesn’t pay these costs because it’s a singular entity. It does every step, every piece of the puzzle itself. They have to guard against bugs, but they don’t have to build a recursive, self-validating trust network among unauthenticated peers with financial incentives to be bad actors just for one node to ask another if the page at “https://lobste.rs” contains the word “lobsters”.

    Designing that trust network would be a good way to earn a Ph.D. in computer science, even if it doesn’t entirely work.

    (BTW, use * instead of - to get a bulleted list.)

    1. 3

      So, lemme suggest a modification of this idea. I’m not a security guy, or a search guy, so maybe this is all garbage.

      That said:

      Implementation details:

      • The index is considered basically <search term, [ results by page rank ]>. This may be the wrong definition to use.
      • Index is stored redundantly across all participating nodes, using something similar to Bittorrent. I suspect the index itself isn’t that large given modern storage constraints.
      • Searching is done using the local copy of the index, because speed (and because security). This also means that you never leak your search information.
      • Crawling is done by nodes, with node assignments in the crawlspace derived using hashing. The hash should be crippled to force multiple nodes to crawl the same pages, preventing any one node from having the authoritative history of a particular crawl (a toy sketch follows this list).
      • Index amendments are signed by nodes, so defective/compromised nodes can be backed out later or routed around when referencing the index.
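
      To make the hashing and signing bullets concrete, here is a toy sketch. The node names, the static two-bucket assignment, and the HMAC stand-in for real signatures are all assumptions made for illustration; a real system would use proper keypairs (e.g. Ed25519) and balance bucket membership dynamically.

```python
# Toy illustration of the hashing and signing bullets above. Node names,
# the static bucket assignment, and the HMAC "signature" are all made up.
import hashlib
import hmac

BUCKETS = 2   # deliberately coarse ("crippled") so many URLs share a bucket

# Static node -> bucket assignment for the toy; two nodes per bucket, so
# every page is always crawled by more than one node.
NODE_BUCKET = {"node-a": 0, "node-b": 1, "node-c": 0, "node-d": 1}
NODE_KEYS = {n: f"secret-{n}".encode() for n in NODE_BUCKET}  # stand-in for real keypairs

def bucket_of(url):
    # Truncated ("crippled") hash: only BUCKETS possible values, so the
    # crawlspace splits into big overlapping chunks rather than one-per-node.
    return int(hashlib.sha256(url.encode()).hexdigest(), 16) % BUCKETS

def crawlers_for(url):
    # Every node assigned to the URL's bucket crawls it independently.
    return [n for n, b in NODE_BUCKET.items() if b == bucket_of(url)]

def signed_amendment(node, term, url, rank):
    # Index amendments carry the signing node's identity, so bad nodes can
    # be backed out or routed around later. HMAC is only a placeholder for
    # a real signature scheme such as Ed25519.
    payload = f"{term}|{url}|{rank}".encode()
    sig = hmac.new(NODE_KEYS[node], payload, hashlib.sha256).hexdigest()
    return {"node": node, "term": term, "url": url, "rank": rank, "sig": sig}

print(crawlers_for("https://lobste.rs"))                      # two independent crawlers
print(signed_amendment("node-a", "lobsters", "https://lobste.rs", 1))
```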

      Operational limits:

      • No support for structured data. Most users don’t care about that.
      • Clearnet only, as nodes should not be punished by local authorities for accessing crimethink.
      • Somehow omit indexing of news sites, aggregators, etc. This is done to prevent the explosion of useless product spam and news from clogging up the participating node indices.
      • No provision for ads–ads are the mindkiller. This would be a public-service sort of thing.

      These guidelines should be enough to at least hack together a proof-of-concept for a distributed, opt-in search engine. There are a lot of ways of screwing this up by adding fancy-pants buzzwords, but a straightforward concept like the above might be a good first step.

      1. 2

        Unless I’m missing something, YaCy comes close to what you’re describing.

        1. 1

          I looked into YaCy and Faroo before starting to think about this. Faroo is not even close. YaCy, on the other hand, is close enough, but it comes with a number of drawbacks, such as being very slow (due to fraud and spam protection) and relying on conventional ranking methods rather than AI/learning. With that said, it might be a good codebase or concept to start from.

          YaCy also works conventionally by building an index and then traversing that index to respond to queries, but what I have in mind (and have yet to start experimenting with) is no index at all: just a large-scale global neural network that holds the information within the network itself. Now, this comes with a million issues, but on the other hand, since you don’t fully understand the impact of your node on the search result (since it is part of the larger global network), you cannot, in theory, manipulate the search results.
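
          For what it’s worth, here is one very rough toy reading of “no index, the network holds the information”: a tiny softmax model whose weight matrix is the only thing mapping query terms to documents, so retrieval is a forward pass rather than an index lookup. The vocabulary, documents, and training pairs are all invented for illustration, and this says nothing about how such a model would actually be trained or distributed across nodes.

```python
# Very rough toy of the "no index, the network is the index" idea.
# Vocabulary, documents and training pairs are invented for illustration;
# this is nothing like a real distributed ranking model.
import numpy as np

vocab = ["lobster", "rust", "search", "blockchain"]
docs = ["https://lobste.rs", "https://www.rust-lang.org"]

def encode(query):
    # Bag-of-words query vector over the toy vocabulary.
    return np.array([1.0 if w in query.split() else 0.0 for w in vocab])

# Training pairs: query text -> index of the relevant document.
pairs = [("lobster search", 0), ("rust", 1), ("lobster", 0), ("rust search", 1)]

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), len(docs)))  # the "index" lives here

for _ in range(500):                    # plain softmax regression, SGD
    for query, target in pairs:
        x = encode(query)
        logits = x @ W
        p = np.exp(logits - logits.max())
        p /= p.sum()
        W -= 0.1 * np.outer(x, p - np.eye(len(docs))[target])

def search(query):
    scores = encode(query) @ W          # a forward pass, not an index lookup
    return docs[int(np.argmax(scores))]

print(search("lobster"))                # expected: https://lobste.rs
```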

          1. 1

            PeARS is also worth looking into (recently funded by Mozilla).

            About the NN, I’m not sure what that would do.

            > since you don’t fully understand the impact of your node on the search result you cannot in theory manipulate the search results

            Not sure about how that would translate into practice.

            > YaCy on the other hand is close enough but comes with a number of drawbacks such as that it is very slow

            Reaching feature-parity with YaCy would require a ton of effort. But, you can always fork YaCy and try out your ideas and see how it goes.

        2. 2

          Another crypto-based system in dire need of an overhaul of a distributed nature: DNS and X.509. With a blockchain-based system, there would not be any need to trust a cert-issuing authority. https://github.com/okTurtles/dnschain

          1. 2

            One of the core problems is “What is interesting? What is Good? How should you rank pages?”

            Reddit/Lobste.rs/Voat-like sites use “upvotes” as a very crude measure of the hive mind.

            What if you could make the page ranking personal to you?

            Imagine if you ranked every piece of content you came across on a -10 to 10 scale: -10 meaning “never show me stuff like that again”, 0 meaning “meh, who cares”, and 10 meaning “that’s gold”.

            Now imagine if I also ranked some subset of that content.

            We could at least tell how similar we were.

            Now if you and I were very similar, my rankings and search spider cache would be useful to you (and vice versa).

            No doubt we are not identical…. but if you were searching for something I was interested in… asking my search engine would be A Good Idea.

            Now to the maths…..

            Regard each person’s “interest vector” as a high-dimensional vector in the space of all content.

            Your ranking of a piece of content is the “dot product” between your interest vector and the content.

            i.e. the space of all content forms a very redundant, very high-dimensional, non-orthogonal basis for the “content” vector space.

            However the maths behind this is doable. It’s called Singular Value Decomposition.

            So we can each have our own personal spider crawling the web, ranking stuff according to our interests (probably via something like the way Bayesian spam filters identify spam).

            If the spider finds another such spider, it works out the difference in interest vectors and asks the other spider for the most interesting things in the direction the asker is interested in.
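
            Here is a small numerical sketch of the interest-vector idea, using numpy’s SVD. The ratings matrix, the choice of two latent dimensions, and the helper functions are all made up for illustration; the point is just that low-dimensional interest vectors and dot-product rankings fall straight out of the decomposition.

```python
# Numerical sketch of the interest-vector idea using SVD. The ratings and
# dimensions are made up: -10..10 scores from three users over four pages.
import numpy as np

# rows = users, columns = pieces of content, entries = -10..10 ratings
ratings = np.array([
    [ 10.0,  8.0, -10.0, 0.0],
    [  9.0,  7.0,  -8.0, 1.0],
    [-10.0, -6.0,  10.0, 2.0],
])

# Truncated SVD: ratings ~ U @ diag(s) @ Vt, keeping k dimensions.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
interest = U[:, :k] * s[:k]   # each row: a user's low-dimensional interest vector
content = Vt[:k, :]           # each column: a piece of content in the same space

def similarity(a, b):
    # Cosine similarity between two users' interest vectors.
    return interest[a] @ interest[b] / (
        np.linalg.norm(interest[a]) * np.linalg.norm(interest[b]))

def predicted_rating(user, item):
    # "Your ranking of a piece of content is the dot product between
    #  your interest vector and the content."
    return interest[user] @ content[:, item]

print(similarity(0, 1))         # close to 1: users 0 and 1 could share crawl results
print(similarity(0, 2))         # strongly negative: not a useful peer
print(predicted_rating(0, 2))   # reconstructed rating, close to the original -10
```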

            1. 1

              Would the blockchain here be used as a storage database of some sort?

              It seems like the concepts from TAHOE-LAFS where people give up space and some computing power for a “distributed cloud” would work a little better here, especially considering the awesome drama going on in the Bitcoin community about their block sizes.

              Overall it does seem like a pretty cool idea, and I guess having a distributed index that any search engine system can access would make it possible to have multiple implementations of search engines.

              1. 1

                It is not exactly a blockchain, but basically imagine a few things: 1- A global tree that is the index of keywords with sites, etc. We won’t need to retain the actual content of the sites, only enough data for indexing and retrieval. 2- Whatever additional input is necessary to support the learning algorithm that runs and serves queries.
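
                A tiny sketch of point 1, assuming the “global tree” is something like a character trie from keyword to (url, rank) postings. The KeywordTree class and its fields are invented for illustration; the point is only that nothing beyond what is needed for retrieval gets stored.

```python
# Minimal sketch of point 1: a keyword tree (character trie) that keeps
# only what is needed to answer queries (a URL and a rank per keyword),
# never the page content itself. All names are invented for illustration.

class KeywordTree:
    def __init__(self):
        self.children = {}   # next character -> subtree
        self.postings = []   # (url, rank) pairs for the keyword ending here

    def add(self, keyword, url, rank):
        node = self
        for ch in keyword:
            node = node.children.setdefault(ch, KeywordTree())
        node.postings.append((url, rank))   # metadata only, no body text

    def lookup(self, keyword):
        node = self
        for ch in keyword:
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted(node.postings, key=lambda p: p[1], reverse=True)

tree = KeywordTree()
tree.add("lobsters", "https://lobste.rs", 0.9)
print(tree.lookup("lobsters"))   # [('https://lobste.rs', 0.9)]
```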

              2. 1

                Bitcoin-style proof-of-work relies on the fact that the compute is burned, useless for any other purpose. What’s to stop someone putting a lot of compute into indexing zillions of variations on something they want to promote? And how would you force the indexing work to be “genuine” (e.g. what’s to stop a hostile node pretending it did a lot of crawling when in fact it had direct access to the index)?

                1. 1

                  Great questions :) Let me think about this one a little more.