1. 4

You left us hanging… What effect did the “Emissions” optimization have on the query? How did you solve it eventually? Did you just disable JIT completely?

    I just upgraded a large (Django) DB from 11 -> 12. Can’t say I encounter weird plans such as the one described in the article.

    1. 3

Ha-ha, funny how errors repeat. :) “Emissions” is not an optimization, it’s the time that JIT took to emit the optimized query! That whole line is statistics about JIT, not about the query itself.

So yeah, we’ve just disabled JIT for now. “Large” is relative. Our DB is ~300GB, but it seems the number of rows in auth_user and user_userprofile is a little bit over the default limits for JIT.
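For anyone wondering how we did that: one way to turn it off for a single Django connection is to pass the GUC through the libpq options (a sketch; the database name is made up, and server-wide you could instead run ALTER SYSTEM SET jit = off and reload the config):

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",  # hypothetical
        # psycopg2 forwards libpq's `options`, so every session opened
        # through this connection runs with JIT disabled.
        "OPTIONS": {"options": "-c jit=off"},
    }
}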

      1. 1

I assume that auth users and user profiles have roughly the same number of rows, so NL is probably not the right choice. I wonder what made the DB pick that plan.

        I haven’t had a chance to explore JIT yet, so I was glad to see an article that mentioned it. Thanks 👌

        1. 1

          Nested loop is the right choice when both sides of the join are point lookups to a single row. 1 * 1 = 1. You can see in the analyze that’s what happened: loops=1. The plan is fine, JIT is the problem here.

          1. 1

            Right! I missed the fact it’s fetching only one user.

    1. 6

An interesting article! I personally tend to associate dependency injection with brain-bruisingly complex Java frameworks like Spring or Guice, but as explained in this article I can at least see where the extra complexity is clearly adding value.

      That said, the article uses datetime.now() as an example of where DI can help, but I found myself wondering “Why not just use a test fake instead?”

      1. 8

I was also wondering why mocking the date class was considered out of scope, i.e.

import datetime

def test_tomorrow(mocker, monkeypatch):
    # tomorrow() is the function under test from the article
    FakeDate = mocker.MagicMock()
    fake_now = mocker.MagicMock()
    FakeDate.now = fake_now
    fake_now.return_value = datetime.date(1, 2, 3)
    with monkeypatch.context() as c:
        # notice how we override both the class and function with a mock:
        c.setattr(datetime, "date", FakeDate)
        assert tomorrow() == fake_now.return_value
        

        You don’t have to override datetime.date inside the datetime module! Only the reference inside the module that uses it!! The above pattern is very common in our tests.
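For example, assuming the function under test lives in a hypothetical module app.py that does import datetime, the patch targets the name as seen from app rather than the stdlib module:

import datetime

import app  # hypothetical module containing tomorrow()


def test_tomorrow(mocker, monkeypatch):
    fake_now = mocker.MagicMock(return_value=datetime.date(1, 2, 3))
    FakeDate = mocker.MagicMock(now=fake_now)
    fake_module = mocker.MagicMock(date=FakeDate)
    # Replace only the `datetime` name that `app` imported; the real
    # stdlib module is untouched for everyone else.
    monkeypatch.setattr(app, "datetime", fake_module)
    assert app.tomorrow() == fake_now.return_value
    # If app did `from datetime import date` instead, you would patch
    # `app.date` rather than `app.datetime`.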

        1. 4

That’s interesting. I was looking for different ways to mock datetime (mainly for research though, because DI holds additional benefits over Mock) and didn’t find any. Thanks!

      1. 8

        Like every other problem in Python, there are libraries that provide a solution. Two libraries that stand out are freezegun and libfaketime. Both provide the ability to mock time at different levels. However, resorting to external libraries is a luxury only developers of legacy system can afford. For new projects, or projects that are small enough to change, there are other alternatives that can keep the project free of these dependencies.

        The article only touches on libraries like freezegun, using a rather dismissive argument that “only developers of legacy systems can afford” them, but I tend to disagree - libraries like these are designed as conveniences rather than crutches. One example I have is testing token validation (in our case JWTs) - you can test the same token right before it’s supposed to be valid, during the valid period and after, without having to dig into the internals of the PyJWT library. With freezegun, this was trivial. When you don’t control the code actually calling datetime.now(), this method of mocking out dates is quite a bit easier to control.
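A rough sketch of that kind of test (freezegun and PyJWT really behave this way; the issue_token helper and the exact timestamps are made up):

import datetime

import jwt
import pytest
from freezegun import freeze_time

SECRET = "test-secret"  # made-up signing key


def issue_token():
    # made-up helper: token valid from 12:00 to 13:00 UTC on 2020-01-01
    return jwt.encode(
        {
            "nbf": datetime.datetime(2020, 1, 1, 12, 0),
            "exp": datetime.datetime(2020, 1, 1, 13, 0),
        },
        SECRET,
        algorithm="HS256",
    )


def test_token_lifecycle():
    token = issue_token()
    with freeze_time("2020-01-01 11:59:00"):
        with pytest.raises(jwt.ImmatureSignatureError):  # not valid yet
            jwt.decode(token, SECRET, algorithms=["HS256"])
    with freeze_time("2020-01-01 12:30:00"):
        jwt.decode(token, SECRET, algorithms=["HS256"])  # inside the valid window
    with freeze_time("2020-01-01 13:01:00"):
        with pytest.raises(jwt.ExpiredSignatureError):  # expired
            jwt.decode(token, SECRET, algorithms=["HS256"])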

        Note that there are plenty of good examples here… in particular some of the stuff around designing Protocols is super useful, but similar to the clickbait-style title, the datetime portion seems to brush past a valid answer because it doesn’t fit the narrative of the post.

        1. 2

I agree that an external library with an API like the one you are describing may force you to use libraries such as libfaketime. Not because it’s the best or the right way, but because you have to.

The point of the article is that the library API can possibly be changed so that you won’t have to use libfaketime. Instead, the library can provide you with a simpler way of testing the token, for example an argument that sets the time the token is validated at, as the article suggests.
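Something in that spirit (a sketch only, not the article’s or any real library’s actual API; the token is assumed to carry not_before / expires_at attributes):

import datetime


def validate_token(token, at=None):
    """`at` defaults to "now", but a test can simply pass the moment it
    cares about instead of mocking or freezing the clock."""
    at = at or datetime.datetime.now(datetime.timezone.utc)
    return token.not_before <= at < token.expires_at

A test then calls validate_token(token, at=...) once for each of the before/during/after cases, with no mocking involved.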

        1. 4

I find it funny that the intro to the article promises a minimal and modern setup, and then immediately after the table of contents it lists three external dependencies (pyenv, poetry and click).

I believe experienced developers can probably recognize when they don’t need an extra dependency, but new developers might actually think they need all of this stuff (they don’t; venv, requirements.txt and argparse are good enough 99% of the time).
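For the record, the stdlib-only version of a small CLI really is just a handful of lines (a generic sketch, not the article’s example):

import argparse


def main():
    parser = argparse.ArgumentParser(description="Say hello.")
    parser.add_argument("name")
    parser.add_argument("--shout", action="store_true")
    args = parser.parse_args()
    greeting = f"Hello, {args.name}!"
    print(greeting.upper() if args.shout else greeting)


if __name__ == "__main__":
    main()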

          1. 2

I don’t think this is entirely fair: click is only a dependency of the example project, not of the workflow, and it’s probably used instead of argparse mainly to show how to handle external dependencies with this workflow.

Also, pyenv is just used to lock the Python version used in the project; if you omit it, the rest of the post still works just fine.

            1. 2

I wasn’t criticizing the author’s choice of external dependencies, just his choice of words, specifically calling such a setup “minimal”…

          1. 2

            Obligatory note: python-markdown isn’t CommonMark-compliant. Whether you care about this is really up to you; here’s a list of CommonMark implementations for Python if you do.

            1. 1

Thanks, I did not know that. I did notice the CommonMark package for Python was last updated 5 years ago.

            1. 9

              I agree, but I find it annoying that it’s a “feature” in test frameworks. A “parameter” is something that a function provides, and every language has functions.

              Basically the problem is that test frameworks want to control the “main” and have a somewhat rigid view of testing.

              I just use a shell script to run the same test twice with different parameters (two different versions of the binary). I’m using a shell function with “parameters”, but you can also use env vars.

              https://github.com/oilshell/oil/blob/master/test/parse-errors.sh#L668

              If I’m unit testing in Python, I’ll also just write an explicit loop rather than using weird test framework features.

              1. 3

                If I’m unit testing in Python, I’ll also just write an explicit loop rather than using weird test framework features.

                I find it helpful to unroll the loop: that way the stack trace tells you which example input failed.

                examples = [
                    ("", 0),
                    ("asdf", 4),
                    ("asdfzxcv", 8),
                    ...
                ]
                for input, output in examples:
                    # This line failed, but which loop iteration was it?
                    test_it(input, output)
                

                vs

                test_it("", 0)
                test_it("asdf", 4)    # This line failed :)
                test_it("asdfzxcv", 8)
                ...
                
                1. 3
                  1. 1

Great tips! We’re using nose’s test generators for this. In the above case, it will generate 3 tests.
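Roughly like this (a sketch of the nose-style generator; check_length stands in for whatever assertion the real test makes):

# Each yielded tuple becomes its own test case, so a failure names the
# exact example that broke instead of an anonymous loop iteration.
def test_examples():
    examples = [
        ("", 0),
        ("asdf", 4),
        ("asdfzxcv", 8),
    ]
    for text, expected in examples:
        yield check_length, text, expected


def check_length(text, expected):
    assert len(text) == expected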

                  2. 1

                    The C++ unit testing framework catch has logging macros which take scope into account, meaning you can log the current state in each of your nested loops, but it will only print it out when a test fails: https://github.com/catchorg/Catch2/blob/master/docs/logging.md#top

                  3. 2

                    In rspec, the parameter feature lets you have default values for the whole suite (near the top of the file) while specifying overrides closer to where they are used.

                    That’s the only time I have encountered a feature like that and thought “yes, this is better than a function call”.

                    1. 2

I too quite liked that feature in rspec, but the more I use it, the more I think Common Lisp’s more flexible scoping is better. Rather than being a feature of the test harness, it’s a language feature:

                      CL-USER> (defvar *id* 42 "The default ID we will override in tests")
                      *ID*
                      CL-USER> (defun make-entity () (list :id *id*))
                      MAKE-ENTITY
                      CL-USER> (equal
                                 (make-entity)
                                 '(:id 42))
                      T
                      CL-USER> (let ((*id* 43))
                                 (equal
                                   (make-entity)
                                   '(:id 43)))
                      T
                      CL-USER> (equal
                                 (make-entity)
                                 '(:id 42))
                      T
                      
                      1. 1

                        While I’m enthusing about Common Lisp, the language supports a similar approach to functions:

                        CL-USER> (ql:quickload :cl-smtp)
                        To load "cl-smtp":
                          Load 1 ASDF system:
                            cl-smtp
                        ; Loading "cl-smtp"
                        ..
                        (:CL-SMTP)
                        CL-USER> (flet ((cl-smtp:send-email (&rest _) (declare (ignore _)) "200 Success"))
                                   (cl-smtp:send-email "mail.example.com" "from@example.com" "to@example.com" "subject" "body"))
                        "200 Success"
                        

                        I forget who once said “patterns are what you do when you run out of language” … there’s a similar thing going on here with test harnesses.

                        There are good, feature-rich, test harnesses and mocking/stubbing libraries for Common Lisp. But the language is powerful enough to make their implementation easier than in many other languages. And you can use the same concepts to make your own code more powerful.

                    1. 1

                      I have a hard time understanding why things like Django’s ORM are a win. They make simple queries possible and complex queries tedious at worst, as compared to just writing the SQL. My workflow for writing non-trivial queries is something like:

                      1. Think about the query I want to write
                      2. Figure out how to express said query in the ORM
                      3. Look at the generated query to figure out why it’s slow and what I need to nudge to get the ORM to generate something reasonable

                      I’d be curious to hear about success stories from people using these tools in large, non-trivial codebases.

                      1. 1

                        Two benefits that come to mind are

1. Portability: using an ORM reduces the dependency on the database engine. For products / projects that need to operate on multiple DB engines, using an ORM makes things much easier.

2. Composability: using an ORM, with Django’s QuerySet as an example, you can easily compose queries. Think about a website with a list of filters: some filters are on fields of the same table, others on related tables that require a join, some are on expression fields, etc. How would you implement that w/o an ORM? String interpolation? I think you’ll end up creating your own tiny ORM to add fields and joins lazily… (rough sketch below)
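A rough sketch of that composition with a Django QuerySet (Order and its fields are made up):

from myapp.models import Order  # hypothetical model


def search_orders(params):
    qs = Order.objects.all()
    if "status" in params:
        qs = qs.filter(status=params["status"])               # field on the same table
    if "country" in params:
        qs = qs.filter(customer__country=params["country"])   # join added lazily
    if "min_total" in params:
        qs = qs.filter(total__gte=params["min_total"])        # expression-style filter
    return qs  # nothing hits the database until the queryset is iterated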

As someone who is not bound by the first constraint, I often resort to raw SQL in places where the ORM falls short (example).

                      1. 1

                        https://hakibenita.com/

Static, sure: using Pelican (Python) with my own theme.

                        1. 2

                          Great read.

                          We usually have a type called State defined at the top of each component. For example:

type State = {
    state: "searching",
} | {
    state: "failed",
    error: string | null,
} | {
    state: "results_found",
    results: any[],
}
                          
                          1. 2

This is closer to how I do it. But I split out the definition of each variant, and define helper functions to construct the variant from the data. The result is close to algebraic data types, except for the lack of exhaustiveness checking. I’ve seen some articles that try to use the type system to get exhaustiveness checking, but I haven’t had any luck with it so far.

                            1. 1

We implement an “assert never” function to help with exhaustiveness checking for this type of state definition, and it’s been working pretty well (catching missing cases, alerting when unknown values are used).

                              Found this similar project on GitHub:

                              https://github.com/aikoven/assert-never/blob/master/README.md
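For comparison (typed Python rather than TypeScript), the same assert-never idea exists as typing.assert_never in Python 3.11+, where the type checker flags the final call if a variant is unhandled; a small sketch with made-up variants:

from dataclasses import dataclass, field
from typing import assert_never


@dataclass
class Searching:
    pass


@dataclass
class Failed:
    error: str | None = None


@dataclass
class ResultsFound:
    results: list = field(default_factory=list)


State = Searching | Failed | ResultsFound


def render(state: State) -> str:
    if isinstance(state, Searching):
        return "searching…"
    if isinstance(state, Failed):
        return f"failed: {state.error}"
    if isinstance(state, ResultsFound):
        return f"{len(state.results)} results"
    assert_never(state)  # type checker errors here if a new variant is missed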

                              1. 2

                                Yeah, that’s what I tried, but I think it only works if you’re returning from all the branches, which did not apply for me.

                          1. 1

On the “Use BETWEEN Only For Inclusive Ranges” example, I would use WHERE YEAR(created) = 2019, which is clear and sidesteps the issue completely. I use MySQL but I assume Postgres will accept it as well.

                            This is even more useful when working with months, because it means you don’t need to calculate what the next month or the last day of the month are.

Also when grouping (GROUP BY YEAR(created)), I don’t even want to think how I’d do it with ranges.

                            1. 1

Truncating a timestamp this way might be dangerous… See the section about time zones in the article.

                              1. 1

                                It seems a bit unfair to say this when you aren’t considering timezones in that example, but yes, depending on the case it’s important to keep in mind. And yet even then I would convert to the right timezone and truncate to year instead of using a range.

                                1. 3

You’re right, just thought I’d point it out.

                                  Another thing you might also want to keep in mind is that filtering on YEAR(date) might prevent the DB from using indexes on the field.

I actually had a whole PR to Django Admin just to replace this behavior with a range predicate, for the sole purpose of utilizing the index on the date field.

                                  https://hakibenita.com/django-admin-range-based-date-hierarchy
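The range form in ORM terms looks something like this (Entry is a made-up model; with USE_TZ on, the datetimes should be timezone-aware):

import datetime

from myapp.models import Entry  # hypothetical model with an indexed `created` field

start = datetime.datetime(2019, 1, 1)
end = datetime.datetime(2020, 1, 1)

# Half-open range on the column itself: WHERE created >= %s AND created < %s,
# which lets a plain b-tree index on `created` be used, unlike
# WHERE YEAR(created) = 2019 where the column is hidden inside a function call.
in_2019 = Entry.objects.filter(created__gte=start, created__lt=end)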

                            1. 2

                              I dunno if it’s just me but the highlighted lines of query text are extremely hard to read. I’m colorblind so I’m not sure if that has something to do with it.

                              Example

                              1. 2

                                Yeah… I need to adjust the highlight colors in the dark theme. Sorry 😟

                                1. 1

                                  No problem, just thought I’d point it out!

                              1. 4

                                This is a very nice breakdown of the problem and an elegant solution; I loved reading the story from problem to solution.

In the past I had a similar issue with Excel/PDF exports taking so long they would time out, but instead of spending a day or two on debugging I wrapped the functionality in a worker job and added some JS to the front end that requested an export job and then updated the user on progress before finally pulling the file down. I seem to remember one particular export taking more than five minutes to complete, so we had it email the resulting file to the requesting user so they didn’t need to keep a browser open.
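For readers who haven’t built this before, the shape is roughly the following (a sketch with Celery; the build/write/email helpers are placeholders, not real functions):

from celery import shared_task


@shared_task(bind=True)
def export_report(self, report_id):
    rows = list(build_report_rows(report_id))      # placeholder: the slow query
    for i, row in enumerate(rows, start=1):
        append_row_to_file(report_id, row)         # placeholder: write one row
        # the front end polls AsyncResult(task_id) and reads this meta dict
        self.update_state(state="PROGRESS", meta={"done": i, "total": len(rows)})
    path = finalize_file(report_id)                # placeholder
    email_file_to_requester(report_id, path)       # placeholder
    return path

The view just enqueues export_report.delay(report_id) and returns the task id for the JS to poll.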

                                1. 2

Thanks ;) We’ve talked several times in the past about creating such a “background reporting system”, but even if you want to go down this road, you still want to make the reports as fast as possible. Hopefully this solution bought us some time…

                                  1. 1

you still want to make the reports as fast as possible

Absolutely. In the teams I have worked with, most endpoints completed in a sub-70ms time frame, so they had automatic reporting that would flag for performance review any endpoint that consistently took more than 200ms to complete.

This resulted in a lot of tasks being shifted to background jobs that would update the user on their progress if needed. Even then we kept an eye on job completion times, because even with 32 worker processes that autoscaled to 256, a queue of 1,000 jobs each taking a few minutes to complete would still take a noticeable length of time to process.

                                    Performance is king :)

                                    1. 1

                                      I’m actually most interested in this part:

                                      They had automatic reporting that would flag for performance review any endpoint that consistently took more than 200ms to complete.

                                      How did they check if an endpoint is consistently over 200ms? Weighted average on a rolling window? Some static threshold?

                                      1. 2

It’s been a few years since I worked there, but from memory I think it was both. The weighted average on a rolling window would be computed over n requests served with a 2xx (rather than over, say, 15 minutes). For alarms that proved too sensitive, the window would be increased.

Then, in addition to that, there was a static threshold: is this 2xx response taking more than double the average response time for that endpoint within the rolling window? Anything that returned a 5xx would immediately result in the digital team being notified.

In addition to the delta between the request being received and the upstream returning a response, we collected other light debug statistics grouped by endpoint, such as application processing time, template render time, database query time and number of queries, plus the memory consumption of all five. Query count, for example, had a static threshold in order to quickly catch any n^n silliness the ORM might have produced in an edge case.

                                        1. 2

I figured it would be something like that… I’ve run into this problem several times already. Sometimes it’s difficult to set hard limits, especially on time series with seasonality. I often use something like MAD (median absolute deviation) with a threshold. Thanks 👏
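For what it’s worth, a minimal sketch of the MAD idea (the numbers and the threshold k are made up):

import statistics


def is_anomalous(window_ms, new_value_ms, k=5.0):
    """Flag a latency that is more than k robust deviations from the
    window's median (MAD = median absolute deviation)."""
    median = statistics.median(window_ms)
    mad = statistics.median(abs(x - median) for x in window_ms)
    if mad == 0:
        return False  # flat window, nothing meaningful to compare against
    return abs(new_value_ms - median) > k * mad


is_anomalous([70, 72, 68, 75, 71, 69], 300)  # True with these made-up numbers
is_anomalous([70, 72, 68, 75, 71, 69], 73)   # False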

                                1. 1

At my job I tend to jump on the problematic projects, and/or supervise them once in a while. It’s very common to be hit by this; I’ve seen it in C#, Python, etc.

Just today I rewrote a django -> python -> csv export to use openpyxl+lxml, as they wanted to have autofilters on the file. openpyxl, besides being faster with lxml, also has a write-only mode that is speedier and easier on memory.

                                  The key in all these exports is to avoid lazy queries (annotate/prefetch_related) and, if possible, avoid loading all data into memory before starting to stream it.

                                  Forgot to say avoid.. :D
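For reference, the write-only mode looks roughly like this (the queryset and columns are hypothetical; with lxml installed, openpyxl uses it automatically):

from openpyxl import Workbook


def export_xlsx(path, queryset):
    wb = Workbook(write_only=True)          # rows are streamed out, not kept in memory
    ws = wb.create_sheet(title="export")
    ws.append(["id", "email", "created"])   # header row
    for obj in queryset.iterator():         # .iterator() avoids caching the whole queryset
        ws.append([obj.pk, obj.email, obj.created])
    wb.save(path)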

                                  1. 2

                                    Hey, openpyxl should work faster when lxml is available. Do you happen to know by how much?

                                    1. 2

I just did a quick test using their benchmark snippet; almost double the speed with 10k rows:

                                      Versions:
                                      python: 3.6.8
                                      Dimensions:
                                          Cols = 50
                                          Sheets = 1
                                          Proportion text = 0.10
                                      optimised = True
                                      
                                      Rows = 10000
                                      Times:
                                      openpyxl: 2.6.3 using LXML True:   3.70
                                      openpyxl: 2.6.3 using LXML False:   6.03
                                      
                                      Rows = 1000
                                      Times:
                                      openpyxl: 2.6.3 using LXML True:   0.37
                                      openpyxl: 2.6.3 using LXML False:   0.57
                                      

                                      Edit: I didn’t profile memory usage, but according to their docs, that should be considerably lower too.

Edit 2: XlsxWriter is faster than openpyxl, as that benchmark shows, but we also need to read xlsx files, and I prefer to keep one library for both if possible.

                                      1. 1

                                        Thanks, that’s interesting.

                                        Your numbers are lower than the ones in the benchmark (hardware maybe?), but it does look like just by installing lxml you get double the speed.

pyexcelerate is still faster, improved by a factor of ~3.5.

                                  1. 1

                                    xlsx is just zipped-up XML (go on, unzip an xlsx file and take a look), so I’d love to know what the original exporter was actually doing to be so slow. Outputting XML isn’t usually computationally expensive.
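Easy to see for yourself (any .xlsx you have lying around will do; the filename here is made up):

import zipfile

with zipfile.ZipFile("report.xlsx") as zf:
    print(zf.namelist())                      # e.g. ['xl/workbook.xml', 'xl/worksheets/sheet1.xml', ...]
    print(zf.read("xl/workbook.xml")[:200])   # plain XML bytes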

                                    1. 2

Seems like openpyxl is using the built-in xml module[0], which, from the looks of it, is significantly slower than lxml. One of the things I wanted to check following this article is how much faster openpyxl can get using lxml.

                                      [0] https://bitbucket.org/openpyxl/openpyxl/src/default/openpyxl/xml/functions.py

                                    1. 8

Re: format choice: I would recommend considering a minimalistic HTML file with a single table. It’s not much worse than CSV w.r.t. serialisation overhead and simplicity, contains its own encoding header, can be opened on any device capable of downloading the export file from the website in the first place, and opens in a sane way in many spreadsheet and word processor applications.

                                      Not always the best option, but probably worth using more than it is used now.
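A sketch of how little it takes (the rows are invented; html.escape keeps user-generated values from breaking the markup):

import html


def rows_to_html(header, rows):
    def tr(cells, tag="td"):
        return "<tr>" + "".join(f"<{tag}>{html.escape(str(c))}</{tag}>" for c in cells) + "</tr>"

    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body><table>'
        + tr(header, "th")
        + "".join(tr(row) for row in rows)
        + "</table></body></html>"
    )


print(rows_to_html(["id", "name"], [(1, "Alice 👋"), (2, 'Bob "Jr." O\'Brien')]))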

                                      1. 4

CSV is a terrible format: almost zero cross-platform consistency outside of plain 7-bit US-ASCII less the comma. Excel still thinks an interchange format should use the machine’s charset; there are BOM and embedded-newline corner cases, multiple quoting “standards”, and so on.

I’ve probably written four xlsx parsers/generators at this point. It takes an afternoon, about as long as reading some random unknown-quality “library” API, and you’ll at least know the performance characteristics. But look: if you can’t be bothered, at least consider using JSON and a JS-Zip-based xlsx writer. Most users will be fine with that (and real API consumers will prefer JSON to guessing what particular flavour of CSV this is).

                                        1. 1

These were exactly our pain points with CSV: BOM, formatting and quoting, delimiters, Unicode issues, etc. Some of the fields in the data are user-generated, so we had to deal with how emojis and newlines are encoded… huge pain.

Generating xlsx on the client is probably a good solution, but it would have required some time we didn’t have back then.

                                        2. 2

Without thinking of all the potential corner cases, that looks like quite a good idea.

Moreover, one could ship the HTML file with a JS library (linked or inlined) to produce an Excel file on the client side.

                                          1. 2

                                            Agree it’s great.

                                            Biggest issue with this approach (and it’s not that big an issue) is that large enough HTML tables will hang/crash most browsers.

                                          1. 1

If you have the resources, I would highly recommend setting up an observability tool like Zipkin. It can set up measurements on your incoming requests so you can be signaled about which views are taking up the most time.

                                            You can also instrument things you think might take a while. For example instrumenting SQL queries can help you to see how much time is being spent in SQL land and how much in Python land.

Doing this has been a huge eye-opener in helping to find what is taking a while. It’s often what you think, but just enough times it’s something totally unexpected.
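Not Zipkin itself, but as a lightweight illustration of the “instrument the SQL” idea, Django’s execute_wrapper hook can time every query issued inside a block (run_report is a stand-in for whatever you want to measure):

import time

from django.db import connection


class QueryTimer:
    def __init__(self):
        self.total_seconds = 0.0
        self.count = 0

    def __call__(self, execute, sql, params, many, context):
        start = time.monotonic()
        try:
            return execute(sql, params, many, context)
        finally:
            self.total_seconds += time.monotonic() - start
            self.count += 1


timer = QueryTimer()
with connection.execute_wrapper(timer):
    run_report()  # stand-in for the view / job being measured
print(f"{timer.count} queries, {timer.total_seconds * 1000:.1f}ms spent in SQL")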

                                            1. 1

                                              We use AWS and we have alerts in place using CloudWatch.

One of the takeaways from this incident for us was to set up an alert on CPU as well. Our system is mostly IO-bound, so we have alerts in place for BurstIO etc. We also adjusted thresholds for existing alerts that didn’t catch this in time.

                                            1. 1

                                              I appreciate the post, especially putting all the DDL in a <details>. I didn’t like how both credit events had the same date, so I put the 50 earlier than the 100.

                                              I write T-SQL mostly; I didn’t know about the distinct on trick, that’s cool. I am trying to get better at PostgreSQL.

                                              This is how I think about & approach the problem, with separate tables, row_number() & CTEs: db-fiddle.

                                              1. 1

                                                Thanks, I just realized I accidentally put the same date on two records. Good catch!

                                                The downside (and upside) to using row number is that it makes it harder to add other aggregates. I just added a bit about comments I got from readers with my thoughts on them.

                                                1. 1

The downside to using RANK/DENSE_RANK/ROW_NUMBER is that they don’t work with GROUP BY. As your example below demonstrates, you need to use more than one query and join the results. Another downside, which is probably more significant, is that RANK etc. require a sort. Using max(array[]) keeps a “running max” in memory and doesn’t need any kind of sort.

                                                  1. 1

                                                    I don’t understand what you mean by saying they don’t work with group by. Window functions have PARTITION BY syntax for evaluating over groups.

                                                    You are definitely right that performance could be improved on huge datasets by using your array trick. I would still prefer the window functions unless this query is a prominent bottleneck.

                                                    1. 1

                                                      Window functions cannot be used with group by. What you did was to combine the window function with DISTINCT ON, so you essentially calculate the aggregates for the entire set, and then take only the first row.

This is different from how a GROUP BY works, in the sense that GROUP BY “reduces” rows into groups while computing the aggregates. Window functions operate on the entire set (w/o DISTINCT ON you would get duplicate rows).

                                                      1. 1

                                                        I’m sorry, now I’m really confused. I did not write DISTINCT ON.

                                                        1. 1

                                                          It is a unique way of phrasing it, but if I were to guess, he’s saying: “What you [must have] did [at some point] was to combine the window function with DISTINCT ON [which while similar, has important differences]”

                                                  1. 1

Very nice, I did not know about unlogged tables. Is there some way to transform them into logged ones afterwards?

                                                    1. 1

                                                      Thanks.

Is there some way to transform them into logged ones afterwards?

                                                      I never tried it and I can’t think of a use case for it, but from the documentation of ALTER TABLE it looks like it’s possible.
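Indeed, per the ALTER TABLE docs there is a SET LOGGED form; something like this (made-up table name, run through a Django cursor), keeping in mind that it writes the whole table to the WAL, so it isn’t free:

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("ALTER TABLE staging_results SET LOGGED")  # made-up table name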

                                                    1. 1

                                                      If numeric type in PostgreSQL can go up to 16383 digits after the decimal point, how many comments can I store this way?

                                                      1. 4

If we’re talking Postgres, you might as well have parents INTEGER[] and save yourself a lot of messing about.

                                                        1. 2

                                                          Or just use a recursive CTE
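For completeness, a rough sketch of that approach (a hypothetical comment table with id/parent_id/body columns), kept in Python via a raw cursor:

from django.db import connection

SUBTREE_SQL = """
    WITH RECURSIVE thread AS (
        SELECT id, parent_id, body, 1 AS depth
        FROM comment
        WHERE id = %s
        UNION ALL
        SELECT c.id, c.parent_id, c.body, t.depth + 1
        FROM comment c
        JOIN thread t ON c.parent_id = t.id
    )
    SELECT id, parent_id, body, depth FROM thread ORDER BY depth, id
"""


def fetch_subtree(root_id):
    with connection.cursor() as cursor:
        cursor.execute(SUBTREE_SQL, [root_id])
        return cursor.fetchall()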