1. 4

    This is a very nice breakdown of the problem and an elegant solution; I loved reading the story from problem to solution.

    In the past I had a similar issue with an Excel/PDF export taking so long it would time out. Instead of spending a day or two on debugging, I wrapped the functionality in a worker job and added some JS to the front end that requested an export job, updated the user on progress, and finally pulled the file down. I seem to remember one particular export taking more than five minutes to complete, so we had it email the resulting file to the user who requested it so they didn’t need to keep a browser open.
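
    For reference, a rough sketch of that pattern with Celery might look something like this (the task, the progress granularity and the CSV body are all made up for illustration):

    import csv
    import io

    from celery import shared_task
    from django.core.mail import EmailMessage


    @shared_task(bind=True)
    def export_report(self, recipient, rows):
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        total = len(rows)

        for i, row in enumerate(rows, start=1):
            writer.writerow(row)
            if i % 100 == 0:
                # The front-end JS polls the task id and reads this meta for its progress bar.
                self.update_state(state="PROGRESS", meta={"current": i, "total": total})

        # For the really slow exports, email the result instead of keeping a browser open.
        message = EmailMessage("Your export is ready", "File attached.", to=[recipient])
        message.attach("export.csv", buffer.getvalue(), "text/csv")
        message.send()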

    1. 2

      Thanks ;) we talked several times in the past about creating such a “background reporting system”, but even if you want to go down this road, you still want to make the reports as fast as possible. Hopefully this solution bought us some time…

      1. 1

        you still want to make the reports as fast as possible

        Absolutely. In the teams I have worked with, most endpoints completed in a sub-70ms time frame. They had automatic reporting that would flag for performance review any endpoint that consistently took more than 200ms to complete.

        This resulted in a lot of tasks being shifted to background jobs that would update the user on their progress if needed. Even then we kept an eye on job completion times, because even with 32 worker processes that autoscaled to 256, a queue of 1000 jobs each taking a few minutes to complete would still take a noticeable length of time to work through.

        Performance is king :)

        1. 1

          I’m actually most interested in this part:

          They had automatic reporting that would flag for performance review any endpoint that consistently took more than 200ms to complete.

          How did they check if an endpoint is consistently over 200ms? Weighted average on a rolling window? Some static threshold?

          1. 2

            It’s been a few years since I worked there, but from memory I think it was both. The weighted average on a rolling window would be computed over n requests served with a 2xx (rather than over, say, 15 minutes). For alarms that proved too sensitive, the window would be increased.

            Then in addition to that there was a static threshold: is this 2xx response taking more than double the average response time for that endpoint within the rolling window? Anything that returned a 5xx would immediately result in the digital team being notified.

            In addition to the delta between the request being received and the upstream returning a response, we collected other lightweight debug statistics grouped by endpoint, such as application processing time, template render time, database query time and query count, plus memory consumption for all of them. Query count, for example, had a static threshold in order to quickly catch any N+1 query silliness the ORM might have produced in an edge case.
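
            Roughly, the two checks in Python looked something like the sketch below (numbers and names here are illustrative, and the real system used a weighted average rather than the plain mean shown):

            from collections import deque


            class EndpointMonitor:
                def __init__(self, window_size=500, slow_avg_ms=200.0):
                    self.samples = deque(maxlen=window_size)  # last N 2xx response times
                    self.slow_avg_ms = slow_avg_ms

                def record(self, duration_ms):
                    alerts = []
                    avg = sum(self.samples) / len(self.samples) if self.samples else None

                    # Static check: a single 2xx taking more than double the rolling average.
                    if avg is not None and duration_ms > 2 * avg:
                        alerts.append("response over 2x rolling average")

                    self.samples.append(duration_ms)

                    # Rolling-window check: the endpoint is consistently slow.
                    if len(self.samples) == self.samples.maxlen:
                        if sum(self.samples) / len(self.samples) > self.slow_avg_ms:
                            alerts.append("rolling average over 200ms")
                    return alerts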

            1. 2

              I figured it should be something like that… I’ve run into this problem several times already. Sometimes it’s difficult to set hard limits, especially on time series with seasonality. I often use something like MAD with a threshold. Thanks 👏
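
              For reference, the MAD check I usually reach for looks roughly like this (a sketch; the 3.0 threshold is arbitrary and 1.4826 is the usual constant that rescales MAD to be comparable with a standard deviation):

              import statistics


              def is_outlier(samples, value, threshold=3.0):
                  """Flag `value` if it deviates from the median of `samples`
                  by more than `threshold` times the rescaled MAD."""
                  median = statistics.median(samples)
                  mad = statistics.median(abs(x - median) for x in samples)
                  if mad == 0:
                      return False  # degenerate window: all samples identical
                  return abs(value - median) / (1.4826 * mad) > threshold


              print(is_outlier([100, 110, 95, 105, 102, 98], 400))  # True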

    1. 1

      At my job I tend to jump into the problematic projects, and/or supervise them once in a while. It’s very common to be hit by this; I’ve seen it in C#, Python, etc.

      Just today I rewrote a django->python->csv export to use openpyxl+lxml, as they wanted to have autofilters on the file. Openpyxl, besides being faster with lxml, also has a write-only mode that is speedier and easier on memory.

      The key in all these exports is to avoid lazy queries (annotate/prefetch_related) and, if possible, avoid loading all data into memory before starting to stream it.

      Forgot to say avoid.. :D
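
      A rough sketch of the write-only approach (function and column names are made up; rows would typically be something like queryset.values_list(...).iterator() so nothing is held in memory):

      from openpyxl import Workbook


      def export_rows(path, headers, rows):
          # openpyxl uses lxml for writing when it is installed, which speeds this up further.
          wb = Workbook(write_only=True)
          ws = wb.create_sheet(title="Export")
          ws.append(headers)

          # rows: any iterable of tuples, streamed straight into the sheet.
          for row in rows:
              ws.append(list(row))

          wb.save(path)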

      1. 2

        Hey, openpyxl should work faster when lxml is available. Do you happen to know by how much?

        1. 2

          I just did a quick test using their benchmark snippet; almost double with 10k rows:

          Versions:
          python: 3.6.8
          Dimensions:
              Cols = 50
              Sheets = 1
              Proportion text = 0.10
          optimised = True
          
          Rows = 10000
          Times:
          openpyxl: 2.6.3 using LXML True:   3.70
          openpyxl: 2.6.3 using LXML False:   6.03
          
          Rows = 1000
          Times:
          openpyxl: 2.6.3 using LXML True:   0.37
          openpyxl: 2.6.3 using LXML False:   0.57
          

          Edit: I didn’t profile memory usage, but according to their docs, that should be considerably lower too.

          Edit 2: XlsxWriter is faster than openpyxl, as that benchmark shows, but we also need to read xlsx files, so I prefer to keep one library for both if possible.

          1. 1

            Thanks, that’s interesting.

            Your numbers are lower than the ones in the benchmark (hardware maybe?), but it does look like just by installing lxml you get double the speed.

            pyexcelerate is still faster, improved by a factor of ~3.5.

      1. 1

        xlsx is just zipped-up XML (go on, unzip an xlsx file and take a look), so I’d love to know what the original exporter was actually doing to be so slow. Outputting XML isn’t usually computationally expensive.
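
        For anyone who hasn’t looked, a few lines of Python are enough to see the XML inside:

        import zipfile

        with zipfile.ZipFile("report.xlsx") as archive:  # any existing xlsx file
            for name in archive.namelist():
                print(name)  # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml, ...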

        1. 2

          Seems like openpyxl is using the built-in xml module[0], which from the looks of it is significantly slower than lxml. One of the things I wanted to check following this article is how much faster openpyxl can get using lxml.

          [0] https://bitbucket.org/openpyxl/openpyxl/src/default/openpyxl/xml/functions.py

        1. 8

          Re: format choice — I would recommend considering a minimalistic HTML file with a single table. It is not much worse than CSV w.r.t. serialisation overhead and simplicity, contains its own encoding header, can be opened on any device capable of downloading the export file from the website in the first place, and opens in a sane way in many spreadsheet and word processor applications.

          Not always the best option, but probably worth using more than it is used now.
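
          A minimal sketch of the idea in Python (names are illustrative): one self-contained file, one table, an explicit charset, rows escaped as they are written.

          from html import escape


          def rows_to_html(headers, rows):
              parts = [
                  "<!DOCTYPE html>",
                  '<html><head><meta charset="utf-8"><title>Export</title></head><body>',
                  "<table>",
                  "<tr>" + "".join(f"<th>{escape(h)}</th>" for h in headers) + "</tr>",
              ]
              for row in rows:
                  parts.append(
                      "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
                  )
              parts.append("</table></body></html>")
              return "\n".join(parts)


          print(rows_to_html(["id", "name"], [(1, "René"), (2, "☃")]))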

          1. 4

            CSV is a terrible format: almost zero cross-platform consistency outside of plain 7-bit US-ASCII less the comma. Excel still thinks an interchange format should use the machine’s charset; there are BOM and embedded-newline corner cases, multiple quoting “standards” and so on.

            I’ve probably written four xlsx parsers/generators at this point. It takes an afternoon, about as long as reading some random unknown-quality “library” API, and you’ll at least know the performance characteristics. But look: if you can’t be bothered, at least consider using JSON and a JS-Zip-based xlsx writer. Most users will be fine with that (and real API consumers will prefer JSON to guessing what particular flavour of CSV this is).

            1. 1

              These were exactly our pain points with CSV - BOM, formatting and quoting, delimiters, unicode issues etc. Some of the fields in the data are user generated, so we had to deal with how emojis and newlines are encoded… huge pain.

              Generating xlsx in the client is probably a good solution, but it would have required some time we didn’t have back then.

            2. 2

              Without thinking of all the potential corner cases, that looks like quite a good idea.

              Moreover, one could ship the HTML file with a JS library (linked or inlined) to produce an Excel file on the client side.

              1. 2

                Agree it’s great.

                Biggest issue with this approach (and it’s not that big an issue) is that large enough HTML tables will hang/crash most browsers.

              1. 1

                If you have the resources I would highly recommend setting up an observability tool like Zipkin. It can set up measurements on your incoming requests so you can be signalled about which views are taking the most time.

                You can also instrument things you think might take a while. For example, instrumenting SQL queries can help you see how much time is being spent in SQL land and how much in Python land.

                Doing this has been a huge eye opener in helping to find what is taking a while. It’s often what you think, but just enough times it’s something totally unexpected.
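
                Not Zipkin itself, but for Django specifically a rough version of the SQL-vs-Python split can be had with execute_wrapper (the logger name and reporting format here are arbitrary):

                import logging
                import time
                from contextlib import contextmanager

                from django.db import connection

                logger = logging.getLogger("sql_timing")


                @contextmanager
                def time_queries():
                    totals = {"queries": 0, "sql_seconds": 0.0}

                    def timing_wrapper(execute, sql, params, many, context):
                        start = time.monotonic()
                        try:
                            return execute(sql, params, many, context)
                        finally:
                            totals["queries"] += 1
                            totals["sql_seconds"] += time.monotonic() - start

                    # Every query executed inside the block goes through the wrapper.
                    with connection.execute_wrapper(timing_wrapper):
                        yield totals

                    logger.info("%d queries took %.3fs", totals["queries"], totals["sql_seconds"])

                Wrapping a view (or one suspicious block) in with time_queries() as t: and comparing t["sql_seconds"] to the total time gives a crude version of the same signal.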

                1. 1

                  We use AWS and we have alerts in place using CloudWatch.

                  One of the takeaways from this incident for us was to set up an alert on CPU as well. Our system is mostly IO-bound, so we have alerts in place for BurstIO etc. We also adjusted thresholds for existing alerts that didn’t catch this in time.

                1. 1

                  I appreciate the post, especially putting all the DDL in a <details>. I didn’t like how both credit events had the same date, so I put the 50 earlier than the 100.

                  I write T-SQL mostly; I didn’t know about the distinct on trick, that’s cool. I am trying to get better at PostgreSQL.

                  This is how I think about & approach the problem, with separate tables, row_number() & CTEs: db-fiddle.

                  1. 1

                    Thanks, I just realized I accidentally put the same date on two records. Good catch!

                    The downside (and upside) to using row number is that it makes it harder to add other aggregates. I just added a bit about comments I got from readers with my thoughts on them.

                    1. 1

                      The downside to using RANK/DENSE_RANK/ROW_NUMBER is that they don’t work with group by. As your example below demonstrates, you need to use more than one query and join the results. Another downside, which is probably more significant, is that rank etc. require a sort. Using max(array[]) keeps a “running max” in memory and doesn’t need any type of sort.

                      1. 1

                        I don’t understand what you mean by saying they don’t work with group by. Window functions have PARTITION BY syntax for evaluating over groups.

                        You are definitely right that performance could be improved on huge datasets by using your array trick. I would still prefer the window functions unless this query is a prominent bottleneck.

                        1. 1

                          Window functions cannot be used with group by. What you did was to combine the window function with DISTINCT ON, so you essentially calculate the aggregates for the entire set, and then take only the first row.

                          This is different from how a group by works, in the sense that a group by “reduces” rows into groups while computing the aggregates. Window functions operate on the entire set (w/o distinct on you would get duplicate rows).

                          1. 1

                            I’m sorry, now I’m really confused. I did not write DISTINCT ON.

                            1. 1

                              It is a unique way of phrasing it, but if I were to guess, he’s saying: “What you [must have] did [at some point] was to combine the window function with DISTINCT ON [which while similar, has important differences]”

                      1. 1

                        Very nice, I did not know about unlogged tables. Is there some way to transform them into logged ones afterwards?

                        1. 1

                          Thanks.

                          Is there some way to transform them into logged ones afterwards?

                          I never tried it and I can’t think of a use case for it, but from the documentation of ALTER TABLE it looks like it’s possible.

                        1. 1

                          If numeric type in PostgreSQL can go up to 16383 digits after the decimal point, how many comments can I store this way?

                          1. 4

                            If we’re talking Postgres, might as well have parents INTEGER[] and save yourself a lot of messing about.

                            1. 2

                              Or just use a recursive CTE

                          1. 4

                            On the same subject, there is a discussion on the Django developers forum about using Black.

                            1. 22

                              It pains me to see how happily people are getting herded around by GAFAM with dirty tricks like this, but I honestly don’t know what to do. I haven’t managed to convince even a single person to use “freer” software in years.

                              Once upon a time I would’ve tried to spread the Holy Word of open-source (or at least user-respecting) software like Firefox and Linux, but I’ve mostly given up on that. In my experience, “normal people” associate these things (yes, even Firefox) with “technical people”, so they’re too scared to try. (And actual technical people are too stubborn ;-D)

                              1. 29

                                My last job was a shop of 20 or so developers. We all had the same standard-issue MacBook. I could tell mine apart because I put a Firefox sticker on it. I evangelized quite a bit with my teammates. By the end of my three-year tenure there, I had convinced nobody that at least supporting Firefox in our web applications was necessary, and nobody to use Firefox as their main browser.

                                I left that job a few weeks ago, and I learned from my old boss that he and two others started using Firefox because they missed me and my ridiculous evangelism.

                                It’s okay to give up on, or take a break from, evangelizing because it’s kind of exhausting and puts your likability with others at risk. But I think it’s hard to tell when you’re actually being effective. There’s no real feedback loop. So if you’re tired of it and don’t want to do it anymore, by all means stop evangelizing. But if you want to stop just because you don’t think you’re being effective, you may be wrong.

                                1. 14

                                  You can’t convince “normal” people by saying it’s “freer” and open source (unfortunately, not even more private) - they don’t care about this stuff.

                                  I convinced my wife to try FF once Quantum came out, saying “it is much faster”. She got used to it.

                                  I managed to convince a teammate to use FF because of the new grid inspector.

                                  People only care about “what’s in it for me”.

                                  1. 9

                                    Of course that’s what they care about, I know. IMHO not having a multi-billion-dollar corporation breathing down your neck is a pretty big plus, but I guess that’s just a bit too abstract for most people, as you seem to be implying.

                                1. 2

                                  I was wondering why Django does not generate an SQL migration statement with the CONCURRENTLY keyword directly.

                                  I have found a recent discussion about it on the GitHub project: https://github.com/django/django/pull/10901

                                  1. 1

                                    I’ve seen this PR. It essentially does the same thing described in the article, but using a custom migration command.

                                    I feel the problem with this approach is that

                                    1. It forces you to edit the migration, replace the command and set the migration to be non-atomic. I think a safer solution (if any) offered by the migration framework should be implemented as an attribute on the existing command.
                                    2. This is very PostgreSQL-specific. As mentioned in the article, other database vendors, such as Oracle, also support this but with a different syntax.
                                  1. 5

                                    As a visually impaired developer, I’ve both used and developed screen readers, particularly on Windows. I’ll be happy to answer any questions anyone may have. Ask me anything.

                                    1. 2

                                      I have a code documentation and blog site. I’m pretty confident in the overall layout of the site and the prose - this is reasonably simple semantic html - but what about presenting code samples? My markup for those is kinda a mess, I’m not super happy with it semantically, but it renders fairly well to the eye after css styling. Lots of stuff like

                                      <pre><span class="line-number">1</span><span class="comment">/* foo</span>
                                      <span class="line-number">2</span><span class="comment">    stuff*/</span></pre>
                                      

                                      Except some of my samples are hundreds of lines long. Not necessarily because all that code is worth reading in detail, but because I want to provide a complete example for copy and paste tinkering. It is meant to be skimmed.

                                      Do you have any tips on a good way to better mark up those code samples to make them more usable to a visually impaired user?

                                      1. 2
                                          1. When adding a “back to top” button, it’s often recommended to move the focus as well. What am I supposed to move the focus to? The first focusable element on the page? What if this element is below the fold? Should I focus the first element even if it isn’t focusable?

                                          2. Can you give a rule of thumb for aria-live sections in an SPA? Are they really useful? How should I move focus?

                                          A11y nowadays feels a lot like developing a website 10 years ago, where you had to test it on Chrome and IE. You have JAWS, TalkBack and ten others, times Chrome and Edge. TBH I’m still unable to install a screen reader on GNOME in my language. It’s still too hard!

                                      1. 16

                                        Having interviewed a lot of people, SQL is one of those things that everyone thinks they know well and very few actually do. In interviews now, if I see them claiming expertise, I ask if they can do something more than “SELECT * FROM foo” because so often that’s all it takes to “know SQL” on your resume.

                                        Good database knowledge can be extremely valuable. I can’t tell you how many times I’ve seen code that makes four or five round-trips to the DB, grabbing data, processing it client-side, and sending it back that could’ve been done in one round-trip with a well-crafted SQL query.

                                        There’s the other side too: I’ve seen some truly awful database designs that have required multiple round trips because the DB designer didn’t understand foreign keys or constraints or whatever.

                                        1. 4

                                          Depends on how you interview me.

                                          If you sit me down with a pen and paper and tell me to write a complex query…. I might well fail.

                                          If you sit me down with say sqlitebrowser… and an editor side by side, I will rapidly iterate starting from very simple queries (yes, I will start with select * from foo) to as complex as need be.

                                          And yes, I might occasionally refer to the syntax for the gnarlier corners.

                                          But I will get it done pretty rapidly in a single very well crafted query.

                                          Conversely, I’m a true believer in what CJ Date has been saying for decades…. so if you allow me to redesign the table definitions…. you’d be shocked by how simple my queries will be.

                                          1. 3

                                            Imo one of the best arguments against relying on ORMs is performance: in some situations a CTE or window function can get you one or two orders of magnitude improvement over a naive orm solution.

                                            1. 2

                                                Nothing prevents you from implementing those edge cases in your ORM though. I personally use SQLAlchemy, and I feel like it covers what I need as-is 90% of the time, and for the remaining 10% it gives me the tools to build more complex queries and even lets me keep using the “object mapping”. SQLAlchemy supports CTEs and window functions!

                                                For queries that are too complex even for that, it might also be possible to wrap them in an SQL function and simply map your ORM to that function.
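
                                                For example, in SQLAlchemy Core (1.4+ style; the table and column names are made up) a CTE plus a window function looks roughly like this:

                                                import sqlalchemy as sa

                                                metadata = sa.MetaData()
                                                employee = sa.Table(
                                                    "employee", metadata,
                                                    sa.Column("id", sa.Integer, primary_key=True),
                                                    sa.Column("department", sa.String),
                                                    sa.Column("salary", sa.Numeric),
                                                )

                                                # DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC)
                                                salary_rank = sa.func.dense_rank().over(
                                                    partition_by=employee.c.department,
                                                    order_by=employee.c.salary.desc(),
                                                ).label("salary_rank")

                                                ranked = sa.select(employee, salary_rank).cte("ranked")
                                                second_highest = sa.select(ranked).where(ranked.c.salary_rank == 2)
                                                print(second_highest)  # prints the generated SQL, CTE and window function included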

                                              1. 2

                                                  Oh, to clarify, I think ORMs are great, I just don’t think it’s great to rely on them. They do 90% of the things for you, but you need to know SQL for the other 10% of cases. Some things aren’t like that, where knowing the underlying layer doesn’t give you such a wildly significant benefit as knowing SQL does in this case.

                                                1. 1

                                                  This is very true. It’s also helpful to know (in rough terms) what the ORM is doing under the hood, something that’s only really possible if you understand the SQL behind the operations.

                                                2. 1

                                                  Yep, Peewee orm also supports things like CTE, window functions, insert/on conflict, etc. The query builders for both libraries are very flexible, but you pretty much have to know SQL and then translate it into the APIs the query builder exposes. For experienced programmers this is no problem, but someone less fluent with SQL is going to have no idea how to use this advanced functionality, regardless of how it is exposed.

                                                  1. 1

                                                    Definitely! My point was mostly about ORM and “advanced”/performant SQL not being mutually exclusive.

                                                    1. 1

                                                        I like ORMs or query builders not because I don’t know SQL, but rather because I detest the SQL syntax. I wish there were a better relational language. Maybe Tutorial D in front of Postgres’ storage engine.

                                              2. 2

                                                What would you consider beyond “SELECT * FROM foo"? I don’t touch SQL on a daily basis, but I could throw together a left, right, and full join if I needed to, and I’m aware of how to use subqueries. What SQL skills would someone have in order for you to consider them competent / good at SQL?

                                                1. 4

                                                  JOINs, GROUP BY, that sort of thing. If they’re going to be building DBs, understanding indexes, unique constraints, and foreign keys.

                                                  If you’re gonna be my lead database programmer/DB admin/DB architect, I’d want to see understanding of good database design (normal forms, good foreign key relationships, good key choices), CTEs, triggers (in your preferred database), being able to spot when a full table scan might happen, understanding when and how to use precomputed values for future queries, and so on.

                                                  1. 4

                                                    The use of aggregate functions, defining new aggregate functions, the various joins, the 3VL truth table, grouping, subselects, CTEs, that kind of thing.

                                                    1. 3

                                                      I like asking candidates

                                                      • find me the employee with the highest salary (entire row, not just the salary) - will usually use a subselect with max.
                                                      • find me the employee with the second highest salary - expected to use a window function, but you can still get away with two subqueries.
                                                      • find me the employee with the second highest salary in every department - expected to use a window function with partition (a quick sketch of this one is below).

                                                      If you find a guy who thought about equal values (rank / dense rank / row number), you know he did some work. Hire him.

                                                      Haven’t touched joins yet.
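
                                                      For anyone who wants to play with the last one without spinning up a database, SQLite ships window functions too (3.25+); a throwaway sketch like this works, and DENSE_RANK keeps ties together, which is exactly the equal-values point above:

                                                      import sqlite3

                                                      conn = sqlite3.connect(":memory:")
                                                      conn.executescript("""
                                                          CREATE TABLE employee (name TEXT, department TEXT, salary INTEGER);
                                                          INSERT INTO employee VALUES
                                                              ('alice', 'eng', 100), ('bob', 'eng', 90), ('carol', 'eng', 90),
                                                              ('dave', 'sales', 80), ('erin', 'sales', 70);
                                                      """)

                                                      query = """
                                                          SELECT name, department, salary
                                                          FROM (
                                                              SELECT name, department, salary,
                                                                     DENSE_RANK() OVER (
                                                                         PARTITION BY department ORDER BY salary DESC
                                                                     ) AS rnk
                                                              FROM employee
                                                          ) AS ranked
                                                          WHERE rnk = 2
                                                          ORDER BY name;
                                                      """
                                                      print(conn.execute(query).fetchall())
                                                      # [('bob', 'eng', 90), ('carol', 'eng', 90), ('erin', 'sales', 70)]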

                                                      1. 2

                                                        I don’t know window functions, but I believe that I know joins.

                                                        1. 3

                                                          It’s well worth the time learning window functions.

                                                          As Markus Winand (who created use-the-index-luke.com and the newer modern-sql.com) says in his talk on Modern SQL, there’s life before window functions and life after window functions.

                                                        2. 1

                                                          I’d much prefer lateral joins be used in some of these cases.

                                                        3. 1

                                                          I’ve seen no mention yet of EXPLAIN queries (or the equivalent in RDBMSs other than PostgreSQL?). I haven’t done a lot of SQL in the past, but lately I had to work with some complex queries over a large dataset, and most of my learning involved playing with EXPLAIN [ANALYZE], understanding the result and tweaking the base query to fix the performance issue. Once you understand that, you can work from there to find the best index, where to place subqueries, set a large enough working memory, etc.

                                                      1. 2

                                                        Just wanted to mention that there is an interesting bit about structural subtyping using mypy typing_extensions in the article.

                                                        1. 2

                                                          Thank you for drawing attention to that section of your article. I have an instance of it whereby I have a variety of kinds of equipment that are rack mountable. In designing classes for this scenario you typically see some kind of abstract base class like physical-machine or rackable-computer or similar. I don’t like solving the problem that way, as those classes are not real in the same way a switch or a dom0 is. The equipment has compatible physical properties, more realistically expressed as an interface.

                                                          Seeing an example of accomplishing this in Python is a treat. I have not encountered Mypy before.

                                                          1. 2

                                                            Mypy is a static type checker for Python. It’s still not as common as it should be, but it’s gaining traction.

                                                            The motivation for using a Protocol is static type checking. The protocol does not act as an interface or abstract base class as you might know it from other languages.

                                                            We also started with abstract base classes but we quickly realized they are not a good fit for Django models (as described in the article, they cause some problems with migrations).

                                                              If you like the concept, take a look at Go.
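
                                                              To make that concrete, a rough sketch of a Protocol applied to the rack-mountable example above (class and attribute names invented here, not taken from the article):

                                                              from typing_extensions import Protocol


                                                              class RackMountable(Protocol):
                                                                  rack_units: int

                                                                  def mounting_depth_mm(self) -> int: ...


                                                              class Switch:  # note: no inheritance from RackMountable
                                                                  rack_units = 1

                                                                  def mounting_depth_mm(self) -> int:
                                                                      return 350


                                                              def fits_in_slot(device: RackMountable, free_units: int) -> bool:
                                                                  # mypy checks Switch against RackMountable structurally;
                                                                  # no inheritance or registration is needed at runtime.
                                                                  return device.rack_units <= free_units


                                                              print(fits_in_slot(Switch(), free_units=2))  # True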