  2. 3

    SELECT * FROM repositories ORDER BY RANDOM() LIMIT 3

    The author says this query forces the database to sort all the rows, but I’d expect a top-k sort. It could scan the table once and maintain a sorted list of just the top 3 elements. (And call random() once per row, in the same pass.) Was it not finding this plan, for some reason?
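
    If this is Postgres, EXPLAIN ANALYZE should settle it; running the article's query under it would show whether the planner actually picked the top-k strategy:

    -- assumes PostgreSQL: "Sort Method: top-N heapsort" means the top-k plan,
    -- "external merge" means it really did sort everything on disk
    EXPLAIN ANALYZE
    SELECT * FROM repositories ORDER BY RANDOM() LIMIT 3;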

    1. 2

      Won’t this result in non-random grouping of results, since the 2D positions are static? E.g. if I execute this query once and observe rows A and B near each other in the results, then I would expect to see A and B near each other in another batch of results.
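
      For concreteness, here’s my guess at the kind of scheme being discussed (the pos column, the index name, and PostgreSQL geometric types are all my own assumptions, not from the article): give every row a fixed random 2D position once, index it, then query the nearest neighbours of a fresh random point:

      -- hypothetical sketch, assuming PostgreSQL; "pos" is a made-up column
      ALTER TABLE repositories ADD COLUMN pos point;
      UPDATE repositories SET pos = point(random(), random());
      CREATE INDEX repositories_pos_idx ON repositories USING gist (pos);

      -- the scalar subquery is evaluated once, so every row is measured
      -- against the same random target point
      SELECT * FROM repositories
      ORDER BY pos <-> (SELECT point(random(), random()))
      LIMIT 3;

      Under that scheme the positions never change between queries, which is exactly why I’d expect A and B to keep turning up together.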

      1. 1

        Yeah, I really feel like I’m missing something here. It seems you’d need to rebuild the index constantly to shuffle the points. This might be “random enough” for some applications but it’s very very different from the original inefficient query.

        Choosing three random points and unioning all of the single nearest neighbors to those points might give a uniform distribution, but intuitively that still seems incorrect to me — I would expect bias towards more “central” points. But someone who knows actual math might be able to explain if/why that would be a uniform distribution.

        It can’t be though, right? A point near a corner is going to show up much less frequently than a point near the center. Each point gets picked with probability proportional to the area of its Voronoi cell, which given enough points will be approximately the same size, but not actually the same size.
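
        A quick way to convince yourself (made up, but runnable on PostgreSQL): drop a handful of fixed random points, fire a lot of random query points at them, and count how often each fixed point wins as nearest neighbour. The hit counts come out proportional to each point’s Voronoi cell area, not uniform:

        -- Monte Carlo check, assuming PostgreSQL: 10 fixed points, 100k random queries
        WITH fixed_points AS (
          SELECT i AS id, random() AS x, random() AS y
          FROM generate_series(1, 10) AS i
        ),
        queries AS (
          SELECT random() AS qx, random() AS qy
          FROM generate_series(1, 100000)
        )
        SELECT nearest.id, count(*) AS hits
        FROM queries q
        CROSS JOIN LATERAL (
          -- pick the fixed point closest to this query point
          SELECT p.id
          FROM fixed_points p
          ORDER BY (p.x - q.qx)^2 + (p.y - q.qy)^2
          LIMIT 1
        ) AS nearest
        GROUP BY nearest.id
        ORDER BY hits DESC;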

      2. 1

        Any database has a limit on the size of data that can be sorted in memory to ensure the stability of the database server. [… T]hese values are typically less than 10MB. Because of this, to sort a large data set the database creates a temporary file to sort small chunks one at a time and merge them together later.

        The article already suggests picking values from some indexed unique-valued column, and it seems like a lazy but effective way to address the memory limit issue would be to just use that to limit the amount of data loaded:

        SELECT * FROM repositories WHERE id IN (
          SELECT id FROM repositories ORDER BY RANDOM() LIMIT 3
        )
        

        This will be slower on very small tables but should be able to avoid hitting even the default memory limit until the table grows very large, even under a dumb query plan (per @dpercy’s comment), and should only degrade slowly after that. It also avoids all the distribution issues of alternative schemes, as discussed both in the article itself (!) and by @Forty-Bot and @ianthehenry here.
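
        And if you want to check that this actually keeps the sort small (assuming PostgreSQL, where work_mem is the in-memory sort limit being quoted above), EXPLAIN ANALYZE will show the inner sort operating over just the id column:

        -- assumes PostgreSQL; the inner scan's estimated "width" should be only
        -- a few bytes (just id), so the sort stays tiny even on a wide table
        EXPLAIN ANALYZE
        SELECT * FROM repositories WHERE id IN (
          SELECT id FROM repositories ORDER BY RANDOM() LIMIT 3
        );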

        1. 1

          Funny, I faintly remember this being a problem in MySQL 15 years ago, but I’m not sure whether it was performance-related or simply “bad randomness”, where basically everybody observed it being not really random.