1. 16
    1. 8

      Re: format choice: I would recommend considering a minimalistic HTML file with a single table. It is not much worse than CSV w.r.t. serialisation overhead and simplicity, it contains its own encoding header, it can be opened on any device capable of downloading the export file from the website in the first place, and it opens in a sane way in many spreadsheet and word processor applications.

      Not always the best option, but probably worth using more than it is used now.
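
      For illustration, a minimal sketch of such an export in plain Python (the function name and its inputs are hypothetical, not from any library):

      import html

      def export_html_table(rows, headers):
          # Hypothetical helper: `rows` is an iterable of sequences,
          # `headers` a list of column names.
          parts = [
              "<!DOCTYPE html>",
              "<html><head>",
              '<meta charset="utf-8">',  # the self-describing encoding header
              "</head><body><table>",
              "<tr>" + "".join("<th>%s</th>" % html.escape(h) for h in headers) + "</tr>",
          ]
          for row in rows:
              cells = "".join("<td>%s</td>" % html.escape(str(c)) for c in row)
              parts.append("<tr>" + cells + "</tr>")
          parts.append("</table></body></html>")
          return "\n".join(parts)

      with open("export.html", "w", encoding="utf-8") as f:
          f.write(export_html_table([("a", 1), ("b", 2)], ["name", "count"]))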

      1. 4

        CSV is a terrible format: almost zero cross-platform consistency outside of plain 7-bit US-ASCII less the comma. Excel still thinks an interchange format should use the machine’s charset; there are BOM and embedded-newline corner cases, multiple quoting “standards”, and so on.

        I’ve probably written four xlsx parsers/generators at this point. Each takes an afternoon, about as long as it takes to read the API of some random library of unknown quality, and you’ll at least know the performance characteristics. But look: if you can’t be bothered, at least consider returning JSON and using a JSZip-based xlsx writer on the client. Most users will be fine with that (and real API consumers will prefer JSON to guessing what particular flavour of CSV this is).
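
        To make those corner cases concrete, here is roughly the best you can do with the Python stdlib, and even this only targets one Excel flavour:

        import csv

        rows = [
            ["name", "comment"],
            ["Ada", 'said "hi",\nthen left 👋'],  # embedded quote, comma, newline, emoji
        ]

        # utf-8-sig writes a BOM, which is what lets Excel detect UTF-8 at all;
        # newline="" is required so csv.writer controls line endings itself.
        with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
            csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)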

        1. 1

          These were exactly our pain points with CSV: BOM, formatting and quoting, delimiters, Unicode issues, etc. Some of the fields in the data are user-generated, so we had to deal with how emojis and newlines are encoded… a huge pain.

          Generating xlsx in the client is probably a good solution, but it would have required time we didn’t have back then.

      2. 2

        Without thinking through all the potential corner cases, that looks like quite a good idea.

        Moreover, one could ship the HTML file with a JS library (linked or inlined) to produce an Excel file on the client side.

      3. 2

        Agree it’s great.

        Biggest issue with this approach (and it’s not that big an issue) is that large enough HTML tables will hang/crash most browsers.

    2. 4

      This is a very nice breakdown of the problem and an elegant solution; I loved reading the story from problem to solution.

      In the past I had a similar issue with Excel/PDF exports taking so long they would time out. Instead of spending a day or two on debugging, I wrapped the functionality in a worker job and added some JS to the front end that requested an export job and then updated the user on progress before finally pulling the file down. I seem to remember one particular export taking more than five minutes to complete, so we had it email the resulting file to the requesting user so they didn’t need to keep a browser open.
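
      A rough sketch of that pattern, assuming Celery (the actual stack doesn’t matter) and hypothetical fetch_rows/append_to_file/send_download_link helpers:

      from celery import shared_task

      @shared_task(bind=True)
      def export_report(self, report_id):
          # fetch_rows, append_to_file and send_download_link are hypothetical
          # helpers; the point is the shape: heavy work runs off the
          # request/response cycle, with progress the front end can poll.
          rows = fetch_rows(report_id)
          for i, row in enumerate(rows, 1):
              append_to_file(report_id, row)
              if i % 1000 == 0:
                  self.update_state(state="PROGRESS",
                                    meta={"done": i, "total": len(rows)})
          send_download_link(report_id)  # or email it, as described above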

      1. 2

        Thanks ;) we talked several times in the past about creating such a “background reporting system”, but even if you want to go down this road, you still want to make the reports as fast as possible. Hopefully this solution bought us some time…

        1. 1

          > you still want to make the reports as fast as possible

          Absolutely. In the teams I have worked with, most endpoints completed in a sub-70ms time frame. They had automatic reporting that would flag for performance review any endpoint that consistently took more than 200ms to complete.

          This resulted in a lot of tasks being shifted to background jobs that would update the user on their progress if needed. Even then we kept an eye on job completion times, because even with 32 worker processes that autoscaled to 256, a queue of 1000 jobs each taking a few minutes would still take a noticeable length of time to process (at, say, three minutes per job across 256 workers, that is still roughly 12 minutes of wall-clock time).

          Performance is king :)

          1. 1

            I’m actually most interested in this part:

            > They had automatic reporting that would flag for performance review any endpoint that consistently took more than 200ms to complete.

            How did they check whether an endpoint was consistently over 200ms? A weighted average over a rolling window? Some static threshold?

            1. 2

              It’s been a few years since I worked there, but from memory I think it was both. The weighted average over a rolling window was computed over the last n requests served with a 2xx (rather than over, say, 15 minutes). For alarms that proved too sensitive, the window would be increased.

              Then in addition to that there was a static threshold: is this 2xx response taking more than double the average response time for that endpoint within the rolling window? Anything that returned a 5xx would immediately result in the digital team being notified.

              In addition to the delta between the request being received and upstream returning a response, we collected other lightweight debug statistics grouped by endpoint: application processing time, template render time, database query time, query count, and memory consumption, five in all. Query count, for example, had a static threshold in order to quickly catch any n^n silliness the ORM might have produced in an edge case.
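
              A minimal sketch of that scheme (a plain rather than weighted average for brevity; the window size and multiplier here are assumptions, not the values we used):

              from collections import deque

              class EndpointMonitor:
                  def __init__(self, window=500, slow_ms=200, spike_factor=2.0):
                      self.samples = deque(maxlen=window)  # only 2xx responses recorded
                      self.slow_ms = slow_ms
                      self.spike_factor = spike_factor

                  def record(self, duration_ms):
                      alerts = []
                      if self.samples:
                          avg = sum(self.samples) / len(self.samples)
                          if duration_ms > self.spike_factor * avg:
                              alerts.append("spike")  # single response over double the average
                          if avg > self.slow_ms:
                              alerts.append("consistently-slow")  # rolling average over budget
                      self.samples.append(duration_ms)
                      return alerts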

              1. 2

                I figured it would be something like that… I’ve run into this problem several times already. Sometimes it’s difficult to set hard limits, especially on time series with seasonality. I often use something like MAD with a threshold. Thanks 👏
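
                For reference, a minimal sketch of the MAD check I mean (the 3.5 cut-off is just a common default, and 1.4826 rescales MAD to a normal standard deviation):

                import statistics

                def mad_outlier(history, value, threshold=3.5):
                    median = statistics.median(history)
                    mad = statistics.median(abs(x - median) for x in history)
                    if mad == 0:
                        return False  # degenerate window; fall back to a static limit
                    return abs(value - median) / (1.4826 * mad) > threshold

                print(mad_outlier([102, 98, 110, 95, 101, 99, 104], 240))  # True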

    3. 1

      If you have the resources, I would highly recommend setting up an observability tool like Zipkin. It can set up measurements on your incoming requests so you can see which views are taking the most time.

      You can also instrument things you think might take a while. For example, instrumenting SQL queries can help you see how much time is being spent in SQL land and how much in Python land.

      Doing this has been a huge eye-opener in helping to find what is taking a while. It’s often what you think, but just often enough it’s something totally unexpected.
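
      Not Zipkin itself, but as a sketch of the SQL-instrumentation idea in plain Django (connection.execute_wrapper, available since Django 2.0; SomeModel is a placeholder):

      import time
      from contextlib import contextmanager
      from django.db import connection

      @contextmanager
      def timed_sql(bucket):
          # Appends (sql, seconds) tuples to `bucket` for every query
          # executed inside the block.
          def wrapper(execute, sql, params, many, context):
              start = time.monotonic()
              try:
                  return execute(sql, params, many, context)
              finally:
                  bucket.append((sql, time.monotonic() - start))
          with connection.execute_wrapper(wrapper):
              yield

      queries = []
      with timed_sql(queries):
          list(SomeModel.objects.all()[:100])  # SomeModel is a placeholder
      print(sum(t for _, t in queries), "seconds across", len(queries), "queries")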

      1. 1

        We use AWS and we have alerts in place using CloudWatch.

        One of the takeaways from this incident for us was to set up alerts on CPU as well. Our system is mostly IO-bound, so we have alerts in place for BurstIO etc. We also adjusted thresholds for existing alerts that didn’t catch this in time.

    4. 1

      xlsx is just zipped-up XML (go on, unzip an xlsx file and take a look), so I’d love to know what the original exporter was actually doing to be so slow. Outputting XML isn’t usually computationally expensive.
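
      You can verify this with nothing but the stdlib ("report.xlsx" standing in for any workbook you have lying around):

      import zipfile

      # An .xlsx file is a zip archive; the sheet data lives under xl/worksheets/.
      with zipfile.ZipFile("report.xlsx") as z:
          for name in z.namelist():
              print(name)  # e.g. [Content_Types].xml, xl/workbook.xml, xl/worksheets/sheet1.xml
          print(z.read("xl/worksheets/sheet1.xml")[:200])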

      1. 2

        It seems openpyxl is using the built-in xml module[0], which from the looks of it is significantly slower than lxml. One of the things I wanted to check after reading this article is how much faster openpyxl can get using lxml.

        [0] https://bitbucket.org/openpyxl/openpyxl/src/default/openpyxl/xml/functions.py
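
        If you want to check which serialiser you are actually getting, the openpyxl 2.x line seems to expose a module-level flag for it (an assumption on my part, inferred from the “using LXML” line its own benchmark prints, quoted further down):

        import openpyxl

        # openpyxl picks lxml automatically at import time when it is installed;
        # the LXML flag here is an assumption based on the "using LXML True/False"
        # line in the benchmark output below.
        print(openpyxl.__version__, "using LXML", openpyxl.LXML)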

    5. 1

      At my job I tend to jump onto the problematic projects, and/or to supervise them once in a while. It’s very common to be hit by this; I’ve seen it in C#, Python, etc.

      Just today I rewrote a django->python->csv export to use openpyxl+lxml, because they wanted autofilters on the file. Besides being faster with lxml, openpyxl also has a write-only mode that is speedier and easier on memory.

      The key in all these exports is to avoid lazy queries (use annotate/prefetch_related instead) and, if possible, to avoid loading all the data into memory before starting to stream it; see the sketch below.

      Forgot to say avoid.. :D
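
      A rough sketch of that combination, a write-only workbook fed by a streamed queryset (Order and its fields are placeholders):

      from openpyxl import Workbook

      wb = Workbook(write_only=True)  # openpyxl's streaming, append-only mode
      ws = wb.create_sheet()
      ws.append(["id", "customer", "total"])  # header row
      # .values_list(...).iterator() streams rows from the database instead of
      # materialising the whole queryset; Order is a placeholder model.
      for row in Order.objects.values_list("id", "customer", "total").iterator():
          ws.append(row)
      wb.save("orders.xlsx")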

      1. 2

        Hey, openpyxl should work faster when lxml is available. Do you happen to know by how much?

        1. 2

          I just did a quick test using their benchmark snippet; almost double the speed with 10k rows:

          Versions:
          python: 3.6.8
          Dimensions:
              Cols = 50
              Sheets = 1
              Proportion text = 0.10
          optimised = True
          
          Rows = 10000
          Times:
          openpyxl: 2.6.3 using LXML True:   3.70
          openpyxl: 2.6.3 using LXML False:   6.03
          
          Rows = 1000
          Times:
          openpyxl: 2.6.3 using LXML True:   0.37
          openpyxl: 2.6.3 using LXML False:   0.57
          

          Edit: I didn’t profile memory usage, but according to their docs, that should be considerably lower too.

          Edit 2: XlsxWriter is faster than openpyxl, as that benchmark shows, but we also need to read xlsx files, and I prefer to keep one library for both if possible.

          1. 1

            Thanks, that’s interesting.

            Your numbers are lower than the ones in the benchmark (hardware, maybe?), but it does look like just installing lxml doubles the speed.

            pyexcelerate is still faster, by a factor of ~3.5.