1. 3

  2. 3

    This is not ready to be posted :(

    1. 4

      You didn’t add anything to suggest it was a draft, and it’s ‘live’ on the internet, so you can’t blame them for posting it here…

    2. 2

      Making your own search engine comes with a lot of challenges:

      • No open source web search engine exists. Your best shot is to use Lucene and dive deep to make it scale, add PageRank support, etc.
      • Crawling is impossible. Cloudflare blocks all (non-bigtech) crawlers. https://commoncrawl.org/ is a tiny dataset; just using Wikipedia’s dump and crawling its external links would give you better results.
      • It’s still too costly if you just want to use it yourself: you’d have to make it a business, at which point you really need to worry about scaling. And remember, no open source solutions currently exist.
      1. 1
        • That is why I am working on one

        • Cloudflare cannot block every IP, and people can spoof the user agent. My project is meant to help people host their own search engine

        • Too costly at scale, but not necessarily for a personal search engine

        1. 1

          I thought about that a few months ago. I came to the conclusion that the only way was to do a hybrid SE: meta search + collaborative.

          The meta part uses the API (or any privileged access) of some reference websites (Wikipedia, SO, official websites, …)

          The collaborative part is a web browser plug-in that reads any page you visit, builds an inverted index of it, and sends that to the SE pipelines. The advantage is that you bypass any Cloudflare/captcha because you are a real human. The human is the crawler.
          Problem to be solved: privacy. How to anonymize data that reveals your browsing history?
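
To make the indexing step concrete, here is a minimal sketch of what the plug-in side might build, assuming plain-text page content and a naive tokenizer. The names `tokenize`, `index_page`, and `search` are hypothetical, and nothing here addresses the privacy problem; it only shows the term-to-URL mapping.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_page(index, url, text):
    """Add one visited page to the inverted index (term -> set of URLs)."""
    for term in set(tokenize(text)):
        index[term].add(url)


def search(index, query):
    """Return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical pages, standing in for what a user actually visited.
index = defaultdict(set)
index_page(index, "https://example.org/a", "Python socket hangs on recv")
index_page(index, "https://example.org/b", "Python list comprehension tips")
```

Calling `search(index, "python socket")` returns only the first URL, since the query is treated as an AND over terms. A real pipeline would likely also need stemming, term positions, and some way to merge indexes coming from many users.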

          As for the PageRank algorithm: let users decide which pages are relevant by voting through the plug-in. The plug-in may ask, “Is this page relevant to your terms ‘Python’, ‘socket’, ‘hang’?”
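
One way the voting idea could be turned into a ranking, as a rough sketch only: tally yes/no votes per (term, URL) pair and sort pages by the smoothed share of "yes" votes. The `VoteRanker` class and the Laplace smoothing are my own assumptions for illustration, not something the idea above specifies.

```python
from collections import defaultdict


class VoteRanker:
    """Hypothetical vote store: (term, url) -> [yes_votes, total_votes]."""

    def __init__(self):
        self.votes = defaultdict(lambda: [0, 0])

    def record(self, terms, url, relevant):
        """Store one user's answer to 'is this page relevant to these terms?'."""
        for term in terms:
            tally = self.votes[(term.lower(), url)]
            tally[0] += int(relevant)
            tally[1] += 1

    def score(self, terms, url):
        """Average smoothed 'yes' share over the terms; 0.5 means no evidence."""
        if not terms:
            return 0.5
        total = 0.0
        for term in terms:
            yes, count = self.votes.get((term.lower(), url), (0, 0))
            # Laplace smoothing (an assumption) so that a single vote
            # does not push the score all the way to 0 or 1.
            total += (yes + 1) / (count + 2)
        return total / len(terms)

    def rank(self, terms, urls):
        """Order candidate URLs from most to least relevant."""
        return sorted(urls, key=lambda u: self.score(terms, u), reverse=True)


ranker = VoteRanker()
ranker.record(["Python", "socket", "hang"], "https://example.org/a", True)
ranker.record(["Python", "socket", "hang"], "https://example.org/b", False)
ranked = ranker.rank(
    ["python", "socket"],
    ["https://example.org/b", "https://example.org/a", "https://example.org/c"],
)
```

Here a page with no votes sits in the middle, between the upvoted and the downvoted one. This ignores vote spam and sybil accounts entirely, which would probably be the hard part in practice.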

          I have no idea what the result would be. However, I’m sure it’d be pretty fun to run that.