1. 12
  2. 10

    Seems like a great optimization to add to the query planner based on table statistics!

    1. 8

      Yeah this is genuinely embarrassing. I feel embarrassed for them

    2. 9

      They use CrateDB as a database for storing and searching product data. They chose CrateDB because it allows them to scale the webshop easily, according to Gestalten.de CEO Frank Rakow.

      With ~4.6M rows, they’d be well served by vanilla PostgreSQL. Then again, I don’t know all of their requirements, perhaps they have need of ElasticSearch’s features (CrateDB is a SQL+management layer on top of ElasticSearch). Hopefully Gestalten.de is using this database for analytics and not important data storage.

      That said, you couldn’t pay me enough to use ElasticSearch as a primary datastore. A search index? Absolutely. Temporary log storage? Sure. Analytics? I suppose. ElasticSearch has measurably improved over the years, but once bitten, twice shy as the saying goes.

      In any case, CrateDB is a remarkably immature database to be running a business on. And let's look at their marketing page:

      On the other hand, CrateDB may not be a good choice if you require:

      • Strong (ACID) transactional consistency
      • Highly normalized schemas with many tables and many joins

      Oh dear.

      1. 2

        Why hopefully? These numbers are embarrassing.

        I don’t understand why anyone would be proud of these numbers.

        product:([sku:3300?`5]; a:3300?0)
        xsell:asc ([] sku:4600000?`5; cross_sku:4600000?`5; tstamp:.z.p-4600000?0)
        \t select from xsell where not sku in exec sku from product
        87
        

        That’s msec; kdb is 100x faster than crate (on my macbook).

        I suppose I should be at least happy that crate.io puts some actual benchmarks up when they say it’s “fast” so that we know that they mean not at all fast.
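        For readers who don't speak q: the benchmark above is an anti-join, i.e. "xsell rows whose sku has no match in product". A rough sketch of the same query in Python with sqlite3, using the table and column names from the snippet (tiny hand-made row counts here, not the 3,300/4.6M of the benchmark):

        ```python
        import sqlite3

        con = sqlite3.connect(":memory:")
        cur = con.cursor()

        # Small stand-ins for the product and xsell tables in the q snippet.
        cur.execute("CREATE TABLE product (sku TEXT, a INTEGER)")
        cur.execute("CREATE TABLE xsell (sku TEXT, cross_sku TEXT, tstamp TEXT)")
        cur.executemany("INSERT INTO product VALUES (?, ?)",
                        [("aaa", 0), ("bbb", 0)])
        cur.executemany("INSERT INTO xsell VALUES (?, ?, ?)",
                        [("aaa", "bbb", "t1"),   # sku exists in product
                         ("zzz", "aaa", "t2")])  # sku missing from product

        # The anti-join: xsell rows whose sku does not appear in product.
        rows = cur.execute(
            "SELECT * FROM xsell WHERE sku NOT IN (SELECT sku FROM product)"
        ).fetchall()
        print(rows)  # [('zzz', 'aaa', 't2')]
        ```

        The timing differences in the thread come from how each engine executes this shape, not from the query itself being exotic.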

        1. 0

          Grouping in a distributed database is much harder than on your local disk. (And yes, this can be a reason not to pick distributed software, but if that's not your main query, it's also okay.)

          1. 6

            …but it’s 4M rows: you don’t need a distributed database for that, and you don’t even need one for 1bn rows. Christ, this is roughly what I would say is the upper bound for CSV files chewed through with UNIX sort/join!

        2. 0

          With ~4.6M rows, they’d be well served by vanilla PostgreSQL. Then again, I don’t know all of their requirements, perhaps they have need of ElasticSearch’s features (CrateDB is a SQL+management layer on top of ElasticSearch). Hopefully Gestalten.de is using this database for analytics and not important data storage.

          Webshops have the problem that they are rarely written in-house and come with their own share of issues. For example, a popular system is OXID. That means you are often not free to choose the database layer.

          That said, you couldn’t pay me enough to use ElasticSearch as a primary datastore. A search index? Absolutely. Temporary log storage? Sure. Analytics? I suppose. ElasticSearch has measurably improved over the years, but once bitten, twice shy as the saying goes.

          Elasticsearch themselves do not market it that way.

          ES is very popular in the shop scene, where the databases of your shop software are often a pain to work with and most of your frontend is search anyway. So, what happens is that they still use the shop software to store all articles, user data and transactions and sync that to Elasticsearch, with which they drive their full frontend. I’ve seen that in quite a number of deployments and it works well.

          Elasticsearch has good stability, just no guarantees. It’s perfectly fine to use it for something important, as long as you can recreate it if it blows up.

          On the other hand, CrateDB may not be a good choice if you require:

          • Strong (ACID) transactional consistency
          • Highly normalized schemas with many tables and many joins

          That’s perfectly fine if none of this is needed on that store.

        3. 6

          In my experience with PostgreSQL, it’s almost always the opposite.

          1. 3

            Note that CrateDB is a distributed store based on Elasticsearch; both the on-disk format and the query execution vastly differ from what PostgreSQL does.

            1. 1

              My experience with SQL Server mostly agrees with yours. But the real offender is the IN operator, not the subquery per se. A subquery in the FROM clause isn’t (always) that bad.

            2. 1

              I tend to always use subqueries. I like them because I can write them incrementally. I.e., write my subquery, make sure it works, then drop it inside another query, make sure that works… repeat until done.

              I certainly understand joins and use them, but I’m not as fast writing them. I’m curious about the opposite problem: when is it much slower to use a subquery than a join?

              1. 1

                Don’t quote me on this but my very vague memory is that MySQL’s query planner used to be(*) famously not quite as good as you’d hope at spotting the equivalence between SELECT x FROM t1 WHERE y IN (SELECT z FROM t2 WHERE…) and SELECT x FROM t1 JOIN t2 ON y = z.

                (* I have no idea if it does better now, no idea about what the state of any of the forks like Maria or Percona is).
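                The equivalence in question is easy to see concretely. A quick sketch with sqlite3 (table and column names taken from the comment above; note the JOIN form only matches the IN form while t2.z has no duplicates):

                ```python
                import sqlite3

                con = sqlite3.connect(":memory:")
                cur = con.cursor()
                cur.execute("CREATE TABLE t1 (x INTEGER, y INTEGER)")
                cur.execute("CREATE TABLE t2 (z INTEGER)")
                cur.executemany("INSERT INTO t1 VALUES (?, ?)",
                                [(1, 10), (2, 20), (3, 30)])
                cur.executemany("INSERT INTO t2 VALUES (?)", [(10,), (30,)])

                # IN-subquery form ...
                sub = sorted(cur.execute(
                    "SELECT x FROM t1 WHERE y IN (SELECT z FROM t2)").fetchall())
                # ... and the JOIN form; duplicate z values in t2 would
                # duplicate output rows here, which is why planners must be
                # careful when rewriting one into the other (semi-join).
                join = sorted(cur.execute(
                    "SELECT x FROM t1 JOIN t2 ON y = z").fetchall())
                print(sub, join)  # [(1,), (3,)] [(1,), (3,)]
                ```

                A planner that spots this equivalence can pick whichever execution order is cheaper; one that doesn't may re-run the subquery per row.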