1. 16

  2. 15

    Sigh. Here we go.

    Why is it so slow?

    Not because of MySQL. Actually, MySQL is doing the right thing here.

    It’s because the term “commercial” used in my query is common (~300,000 documents contain the term)

    Yes, sort of.

    and because MySQL is not good at merging indexes

    Not relevant.

    Basically, MySQL uses the full-text index to look up the term “commercial”, and then it does a kind of nested join to look up the 300,000 rows and check the condition on id.

    Mostly true.

    The former is very quick,

    Nope.

    but the latter is extremely slow.

    No, that’s the fast part.

    It would be really great for MySQL to be able to combine multiple indexes using an in memory bitmap.

    For full text search? That would be fucking insane. And I do mean insanely dumb, not insanely cool.

    Since InnoDB wasn’t written in an intro to algorithms course by a college sophomore, it doesn’t just use a basic inverted index of words -> document ids. That’s a blatantly awful strategy. Sure, looking up the document list for a single word is O(1), but then you still have a list of n documents to deal with. English writing is mostly the same 10,000 words, and each document typically has more than one word, so these per-word document lists are going to be huge. Even with a small dataset of 2 million documents, a word like “commercial” has 300,000 documents! For people with significant data volume, that type of index would be completely worthless.
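    The structure in question can be sketched in a few lines of Python (a toy model with hypothetical documents, nothing like what InnoDB actually stores):

```python
from collections import defaultdict

# Hypothetical toy corpus: a naive inverted index maps each word to the
# list of document ids containing it.
docs = {
    1: "commercial real estate listing",
    2: "commercial break during the game",
    3: "garden gnome appreciation society",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for word in set(text.split()):
        index[word].append(doc_id)

# Looking up the posting list for one word is O(1), but the list itself
# is the problem: with 2 million documents, a common word like
# "commercial" alone hands you ~300,000 ids to process.
print(sorted(index["commercial"]))  # -> [1, 2]
```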

    Bitmap joins are cute but they can’t change that you’ll be joining on 1-5% of all your documents for every single search. 15% in your example.

    Why would you even want that sort of join to begin with? When your database says “ur query matches these 5 million docz haha LOL here u go” you do not LOL, you weep with your head in your hands, wondering why anyone let the JS hipsters touch a problem like full text search. Because a huge unsorted pile of 5 million documents isn’t quite the level of specificity your user had in mind, and they will not LOL either.

    A good full text search engine is optimized to return the top N most relevant results for a search.

    You took a full text search engine that cleverly, carefully, iteratively produces the top N results rather than materializing ~300,000 rows for teh lolz, and forced it to materialize all ~300,000 anyway. And then you wrote a bitchy blog post about how MySQL full text search is slow, when really you just have no idea what you’re doing. Maybe you would have saved some of your “precious hours” if you had read the fucking docs.
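    The two strategies can be sketched in Python (hypothetical relevance scores, purely to show the shape of the difference; a real engine streams postings from its index rather than from a list):

```python
import heapq

# Hypothetical (doc_id, relevance_score) pairs for ~300,000 matches.
scored = [(i, (i * 37) % 1000) for i in range(300_000)]

# Strategy 1: materialize everything, sort it all, then take 50
# (what the filtered query forces on the engine).
top_slow = sorted(scored, key=lambda d: d[1], reverse=True)[:50]

# Strategy 2: keep only a bounded "best 50 so far" while streaming
# (what the optimized ORDER BY score DESC LIMIT N path can do).
top_fast = heapq.nlargest(50, scored, key=lambda d: d[1])

# Same answer, but strategy 2 never holds 300,000 rows at once.
assert [s for _, s in top_slow] == [s for _, s in top_fast]
```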

    Optimizations are applied to certain kinds of FULLTEXT queries against single InnoDB tables. Queries with these characteristics are particularly efficient:

    • FULLTEXT queries that sort the matching rows in descending order of score and apply a LIMIT clause to take the top N matching rows. For this optimization to apply, there must be no WHERE clauses and only a single ORDER BY clause in descending order.

    It’s almost as if it was written just for you.

    And there’s a good reason too. Exactly how the fuck would MySQL optimize that query? If you go through the normal filters first, you can’t use the full text index. If you go through the full text index first, the most relevant documents might not match the filters for a while. Or ever, in this contrived example. MySQL prefers the full text index. Why? Because if you wrote a full text match in your where clause, you probably want to use the fucking full text index.

    And if you don’t want the full text index because you know your filters are highly selective, you do this:

    select id, match(content) against ('commercial' in boolean mode) as rank
    from document 
    where id > 10000000
    and rank > 0
    order by rank desc
    limit 50
    

    This applies the normal filters first, then orders by relevance within that subset of documents. A perfectly legitimate thing to do. It’s a waste to go through the more complex full text index looking for just a few documents. Likewise, it would be downright stupid to join a few documents with ~300,000 document ids when you can just match on the fly. Should MySQL try to apply this optimization using table statistics? Maybe. Since statistics can be incomplete or out of date, and that would fuck you particularly hard in this circumstance, I personally say no.

    If your filters are broad, no problem. If you’re smart you’ll still protect yourself from runaway queries by limiting the full text search first in a sub select, and logging anything that hits it. The tricky part is when your filters are categorizing your data into big groups that aren’t perfectly uniform. For example, a comment system provider that supports multiple sites. Filtering where site = 'database-love.org' and match(content) against ('sql' in boolean mode) won’t cause any problems, but when someone searches for sql on cupcake-nation.net, you’re gonna trudge through a ton of stuff before you hit even a single document you want.

    A compound key (site_id, content) would be nice, but you can’t make compound keys that cover normal and fulltext indexes in MySQL. Instead you’d need to partition the indexes out to multiple tables, which honestly isn’t a terrible idea to begin with. Full text indexes need maintenance that regular tables don’t, so having them detached can save your ass. Anyway, something like this:

    create table group_within_things_fts_index (
        id bigint unsigned primary key,
        content longtext not null,
        fulltext key (content),
        foreign key (id) references things(id) on delete cascade
    );
    

    Naming the primary key FTS_DOC_ID causes InnoDB to use that as the internal document key, saving a little space.

    If you don’t have exclusive groups like sites or date ranges, or small groups you can match without the full text index, it kinda sucks. And not only in MySQL. I don’t know of a compact way to combine arbitrary general indexes with full text indexes, while still ordering the index by some notion of text relevancy. I’ve only seen a few systems that handle non-exclusive tagging well, and they’ve all been full custom proprietary stacks operated by big teams.

    In conclusion, full text search is hard, don’t be a whiner, especially when you’re wrong.

    1. 6

      Since InnoDB wasn’t written in an intro to algorithms course by a college sophomore, it doesn’t just use a basic inverted index of words -> document ids.

      And what do you think InnoDB uses? If you check the source code, you’ll discover that InnoDB full-text search uses an inverted index, like almost every other FTS engine (Lucene, Xapian and PostgreSQL FTS, for example). Implementation details may differ, but the general principle remains the same.

      The FTS index is stored in a hidden InnoDB table with the following schema (source):

      CREATE TABLE $FTS_PREFIX_INDEX_[1-6] (
      	word         VARCHAR(FTS_MAX_WORD_LEN),
      	first_doc_id INT NOT NULL,
      	last_doc_id  UNSIGNED NOT NULL,
      	doc_count    UNSIGNED INT NOT NULL,
      	ilist        VARBINARY NOT NULL,
      	UNIQUE CLUSTERED INDEX ON (word, first_doc_id))
      

      The column ilist contains the list of document IDs and word positions where the word appears (source). The list is sorted.
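      Because each ilist is sorted by document ID, the posting lists of several terms can be intersected in one linear merge pass. A sketch of that general technique (toy posting lists, not InnoDB’s actual code):

```python
def intersect_postings(a, b):
    """Merge-intersect two doc-id lists that are sorted ascending."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical sorted posting lists for two query terms.
commercial = [2, 5, 9, 14, 21]
license_ = [5, 9, 22]
print(intersect_postings(commercial, license_))  # -> [5, 9]
```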

      Bitmap joins are cute but they can’t change that you’ll be joining on 1-5% of all your documents for every single search. 15% in your example.

      Databases that support bitmap joins do that all the time without any issue.

      A good full text search engine is optimized to return the top N most relevant results for a search.

      You are implying that full-text search results should be sorted by relevance, which is a narrow use case. It is quite common to sort search results by date or by price, for example, and most FTS engines support this well.

      You took a full text search engine that cleverly, carefully, iteratively produces the top N results rather than materializing ~300,000 rows for teh lolz, and forced it to materialize all ~300,000 anyway.

      I’m not forcing MySQL to materialize the 300,000 rows. Quite the contrary. My query even includes a limit.

      Maybe you would have saved some of your “precious hours” if you had read the fucking docs.

      Optimizations are applied to certain kinds of FULLTEXT queries against single InnoDB tables. Queries with these characteristics are particularly efficient: […] FULLTEXT queries that sort the matching rows in descending order of score and apply a LIMIT clause to take the top N matching rows. For this optimization to apply, there must be no WHERE clauses and only a single ORDER BY clause in descending order.

      You’d be surprised, but I have read the “fucking docs”, including the paragraph you quoted. It says some kinds of queries are “particularly efficient”. It doesn’t say other queries are extremely slow and should be avoided.

      Since we are quoting the docs, and you are cherry-picking the parts that support your point, I’d like to point out the following paragraph that explicitly acknowledges that being unable to use the Index Merge optimization with full-text indexes is a known deficiency of MySQL (source): “The Index Merge optimization algorithm has the following known deficiencies: […] Index Merge is not applicable to full-text indexes”.

      Exactly how the fuck would MySQL optimize that query? If you go through the normal filters first, you can’t use the full text index. If you go through the full text index first, the most relevant documents might not match the filters for a while.

      Your comment is funny because you’re almost describing the raison d’être of in-memory bitmaps, according to Wikipedia: “A bitmap index scan combines expressions on different indexes, thus requiring only one index per column to support all possible queries on a table”.

      MySQL could optimize the query by combining multiple indexes using an in memory bitmap, as I wrote in the article. The technique is explained in PostgreSQL documentation. Even MongoDB is considering it.
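      For what it’s worth, the technique I mean can be sketched with Python integers standing in for bitmaps (a toy model of a bitmap index scan, not any particular engine’s implementation):

```python
def to_bitmap(doc_ids):
    """Pack a collection of row ids into one integer used as a bitmap."""
    bm = 0
    for d in doc_ids:
        bm |= 1 << d
    return bm

# Hypothetical results of two independent index scans.
matches_fulltext = to_bitmap([3, 7, 8, 12])   # rows matching 'commercial'
matches_id_range = to_bitmap(range(8, 16))    # rows with id above some bound

# One bitwise AND combines both conditions without a nested-loop join.
combined = matches_fulltext & matches_id_range
rows = [d for d in range(16) if combined >> d & 1]
print(rows)  # -> [8, 12]
```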

      select id, match(content) against ('commercial' in boolean mode) as rank
      from document 
      where id > 10000000
      and rank > 0
      order by rank desc
      limit 50
      

      Your query doesn’t work (Unknown column 'rank' in 'where clause'). You can’t use an alias declared in the SELECT clause in the WHERE clause.

      I can replace rank with the MATCH .. AGAINST condition:

      select id, match(content) against ('commercial' in boolean mode) as rank
      from document 
      where id > 10000000
      and match(content) against ('commercial' in boolean mode) > 0
      order by rank desc
      limit 50
      

      But then I’m back to my original issue: the query is extremely slow because I combine a “full-text” condition with a “normal” condition.

      I don’t know of a compact way to combine arbitrary general indexes with full text indexes, while still ordering the index by some notion of text relevancy.

      In most FTS engines, document selection and document sorting are mostly independent. The posting lists of each term are not sorted by relevance; they are sorted by document IDs. First, the documents are filtered according to the terms in the query, then the result is sorted by relevance, or by something else like the date or the price.
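      That two-phase shape can be sketched like this (hypothetical posting lists and prices; select by term intersection first, then sort the survivors by a non-text attribute):

```python
# Hypothetical posting lists (doc ids, sorted) and per-document prices.
postings = {
    "commercial": [1, 3, 4, 7, 9],
    "license": [3, 4, 9, 11],
}
price = {1: 40, 3: 15, 4: 99, 7: 8, 9: 25, 11: 60}

# Phase 1: selection -- documents containing every query term.
selected = set(postings["commercial"]) & set(postings["license"])

# Phase 2: sorting -- order the selected documents by price, cheapest first.
results = sorted(selected, key=lambda d: price[d])
print(results)  # -> [3, 9, 4]
```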

      In conclusion, full text search is hard, don’t be a whiner, especially when you’re wrong.

      FTS is hard, I agree, and as you can see, I tried to do my homework. For the record, I’m not the only one whining about MySQL FTS.

      One last thing: calling people “JS hipsters” or “whiners” and using the word “fuck” five times doesn’t help; technical arguments do.

      1. 1

        Apologies for any typos, it’s pretty late here and I really ought to go to bed.

        And what do you think InnoDB uses? If you check the source code, you’ll discover that InnoDB full-text search uses an inverted index, like almost every other FTS engines (Lucene, Xapian and PostgreSQL FTS for example). Implementation details may differ, but the general principle remains the same.

        Details? Details? It’s the difference between an index that costs O(n) for every query and an index actually usable at scale. That’s like saying a CSV file and a B-tree differ only in implementation details: they’re both row stores, so the general principle remains the same.

        Bitmap joins are cute but they can’t change that you’ll be joining on 1-5% of all your documents for every single search. 15% in your example.

        Databases that support bitmap joins do that all the time without any issue.

        Yes, but not for large scale document search. You highlight postgres, so what does it do? I’m ready to be impressed, I’ve been told that this… this is the good stuff!

        Index Scan using documents_pkey on documents
          Index Cond: (id > 1000000)
          Filter: (tsv_precomputed @@ '''commercial'''::tsquery)
        

        Hmmm, that’s weird… oh! Silly me, I forgot to run analyze!

        Seq Scan on documents
          Filter: ((id > 1000000) AND (tsv_precomputed @@ '''commercial'''::tsquery))
        

        Um. Okay, that wasn’t so good.

        With set enable_seqscan=false; postgres uses the index to search the whole table. Now querying for 15% of the database is decently quick, partly because the data scale is small. And then I order by rank and we’re back to miserably slow. Darn.

        A good full text search engine is optimized to return the top N most relevant results for a search.

        You are implying that full-text search results should be sorted by relevance, which is a narrow use case. It is quite common to sort search results by date or by price, for example, and most FTS engines support this well.

        Do they? Well postgres was a bust, but people mostly agree its FTS isn’t that great. I personally think it’s alright if you know how to wrangle FTS, e.g. the ability to embed precomputed text search vectors in the table (like I did above in the tsv_precomputed column) speeds up queries where you can’t rely on the GIN index. But the whole issue where you can’t quickly order by relevance, or anything else, is a legitimate concern. Anyway, onwards to a real FTS engine, Elasticsearch!

        Sorting on a full-text analyzed field can use a lot of memory.

        Yikes. Well I sure hope that this wasn’t what you had in mind, cause it sounds like it’s materializing all the doc ids in memory to sort.

        Hey now though, that was dishonest of me. I know full well we don’t actually want to sort by price or date, we want the top N for a price or date. ES supports this with the top_hits aggregation. Ah, but that’s a metrics aggregation, which doesn’t contribute to index selection. Still, MySQL doesn’t have a built in top N aggregation, so 1 point to ES for… making it easier to do a linear scan of 15% of your documents, I guess. At least ES supports multi-node setups so you can spread your hideously slow algorithm across a ton of CPUs, thus approximating the sensation of understanding algorithmic complexity.

        I’m not forcing MySQL to materialize the 300,000 rows. Quite the contrary. My query even includes a limit.

        Alright, do explain the execution plan you would choose for your query. Ah right, a bitmap join. Which will pull all 300,000 doc ids into a bitmap, perform the join on id relatively quickly, and then return 50 rows in no particular order whatsoever. Very useful. And if you then wanted to order by price or date you sort all the rows, because a bitmap join by nature destroys index ordering.

        [The docs say] some kinds of queries are “particularly efficient”. It doesn’t say other queries are extremely slow and should be avoided.

        Well I think you could have inferred that, since you ran a query violating the performance guidelines and it was extremely slow. However I will grant that the MySQL FTS docs suck. As is often the case with complex and nuanced technologies, pretty much all FTS docs suck. The MySQL ones are particularly bad since they outright lie to you about some things, but that’s just because some of the docs are legacy and haven’t been updated. No one pays enough attention to docs. Nevertheless, I think it’s pretty easy to deduce what’s actually going on if you know a thing or two about databases and full text search.

        Since we are quoting the docs, and you are cherry-picking the parts that support your point, I’d like to point out the following paragraph that explicitly acknowledges that being unable to use the Index Merge optimization with full-text indexes is a known deficiency of MySQL (source): “The Index Merge optimization algorithm has the following known deficiencies: […] Index Merge is not applicable to full-text indexes”.

        True, that does suck. However since your fundamental problem is about excessively large full text search result sets, this is irrelevant.

        Exactly how the fuck would MySQL optimize that query? If you go through the normal filters first, you can’t use the full text index. If you go through the full text index first, the most relevant documents might not match the filters for a while.

        Your comment is funny because you’re almost describing the raison d’être of in-memory bitmaps, according to Wikipedia: “A bitmap index scan combines expressions on different indexes, thus requiring only one index per column to support all possible queries on a table”.

        MySQL could optimize the query by combining multiple indexes using an in memory bitmap, as I wrote in the article. The technique is explained in PostgreSQL documentation. Even MongoDB is considering it.

        When I wrote this paragraph I assumed that we were on the same page that a scan of 15% of your doc ids was not a winning move, and constant factor optimizations won’t solve that algorithmic complexity problem. I saw from your response above, and now from your response here, that I was mistaken. But I do appreciate that you at least skimmed the wikipedia article for bitmap indexes before writing your blog post.

        A bitmap index scan does well when there isn’t any other strategy except suck it up and accept that you’re going to query a significant fraction of the table. For full text search this works great if your data volume is small, and querying significant fractions of your documents table for every user search is acceptable. You mentioned sorting by price, well that’s a great use case for not giving a shit about algorithmic complexity, because most eCommerce sites have a completely negligible SKU cardinality.

        Your query doesn’t work (Unknown column ‘rank’ in ‘where clause’). You can’t use in the WHERE clause an alias declared in the SELECT clause.

        I can replace rank by the MATCH .. AGAINST condition:

        select […]

        But then I’m back to my original issue: the query is extremely slow because I combine a “full-text” condition with a “normal” condition.

        Ah my mistake. I admit I wrote the query in the lobste.rs comment field and didn’t run it. You COULD put the match in the where clause, but you apparently know already that won’t work. I’m mystified how someone who did their homework doesn’t know about subselects:

        select id
        from (
          select id, match(content) against ('commercial' in boolean mode) as rank
          from document 
          where id > 10000000
          order by rank desc
          limit 50
        ) top_ranked
        where rank > 0;
        

        I don’t know of a compact way to combine arbitrary general indexes with full text indexes, while still ordering the index by some notion of text relevancy.

        In most FTS engines, document selection and document sorting are mostly independent. The posting lists of each term are not sorted by relevance; they are sorted by document IDs. First, the documents are filtered according to the terms in the query, then the result is sorted by relevance, or by something else like the date or the price.

        I don’t recall saying I don’t know how rudimentary FTS indexes work, I said I don’t know of a compact way to combine arbitrary general indexes with full text indexes, while still ordering the index by some notion of relevancy. Which you absolutely want to do, because separating document selection and document sorting only works at tiny scale. Making a specialized inverted index for an external key like price is trivial. Making a fast full text index optimized for relevancy is a lot harder, which is why all these full text search engines come with order by relevancy.

        You have a go at sorting your 300,000 documents matching “commercial” post-selection and let me know how that works out for you.

        I tried to do my homework.

        You did, and I’m proud of you. But you didn’t do enough to declare MySQL full text search is a waste of time.

        For the record, I’m not the only one whining about MySQL FTS.

        True. But you could at least bother to do it well. Otherwise, I’m calling bullshit.

    2. 4

      It would have been useful to see the EXPLAIN statement and attempt to rewrite the query to see if it helps with the query planner.

      1. 3

        I agree.. but I can’t shake the feeling that the output would be truncated somehow..

        1. 1

          Exactly. This is why I haven’t included EXPLAIN output in the article. Here is the output: https://lobste.rs/s/p12ocv/don_t_waste_your_time_with_mysql_full_text#c_tyclwd

        2. 1

          Checking the query execution plan is of course one of the first things I did:

          mysql> EXPLAIN SELECT id
              -> FROM document
              -> WHERE match(content) AGAINST ('commercial' IN BOOLEAN MODE)
              -> LIMIT 50;
          +----+-------------+----------+----------+---------------+---------+---------+------+------+-------------+
          | id | select_type | table    | type     | possible_keys | key     | key_len | ref  | rows | Extra       |
          +----+-------------+----------+----------+---------------+---------+---------+------+------+-------------+
          |  1 | SIMPLE      | document | fulltext | content       | content | 0       | NULL |    1 | Using where |
          +----+-------------+----------+----------+---------------+---------+---------+------+------+-------------+
          1 row in set (0.01 sec)
          
          mysql> EXPLAIN SELECT id
              -> FROM document
              -> WHERE match(content) AGAINST ('commercial' IN BOOLEAN MODE)
              -> AND id > 10000000
              -> LIMIT 50;
          +----+-------------+----------+----------+-----------------+---------+---------+------+------+-------------+
          | id | select_type | table    | type     | possible_keys   | key     | key_len | ref  | rows | Extra       |
          +----+-------------+----------+----------+-----------------+---------+---------+------+------+-------------+
          |  1 | SIMPLE      | document | fulltext | PRIMARY,content | content | 0       | NULL |    1 | Using where |
          +----+-------------+----------+----------+-----------------+---------+---------+------+------+-------------+
          1 row in set (0.03 sec)
          

          I didn’t share it in the article because, as you can see, the query is so simple (no joins) that there is nothing relevant in the EXPLAIN output.

          As a side note, I observe that EXPLAIN output is a lot more detailed and useful in PostgreSQL than in MySQL.