1. 24
  1.  

  2. 16

    Regarding this comment on your website:

    My goto style when dealing with python dictionaries. In my latest one, I was building a dictionary of lists: try dict[k].append(element), except dict[k] = [element]. Glad to know the EAFP fancy name that I can throw around my peers. And thanks for the good insights on the performance cuts this approach entails.

    Use a defaultdict. If the key does not exist it defaults to a given type.

    from collections import defaultdict
    
    dict_of_list = defaultdict(list)
    dict_of_list[key].append(element)
    

    It is more readable and faster!

    1. 8

      Or setdefault method for the dict built-in: d.setdefault(key, []).append(element)

      1. 5

        That causes the python vm to build a new empty list for each element, and make two function calls instead of one. This adds up when looping, and may fragment the heap.

        >>> import dis
        >>> def asd(): return d.setdefault(key, []).append(element)
        >>> dis.dis(asd)
          1           0 LOAD_GLOBAL              0 (d)
                      2 LOAD_METHOD              1 (setdefault)
                      4 LOAD_GLOBAL              2 (key)
                      6 BUILD_LIST               0
                      8 CALL_METHOD              2
                     10 LOAD_METHOD              3 (append)
                     12 LOAD_GLOBAL              4 (element)
                     14 CALL_METHOD              1
                     16 RETURN_VALUE
        >>> def asd(): return d[key].append(element)
        >>> dis.dis(asd)
          1           0 LOAD_GLOBAL              0 (d)
                      2 LOAD_GLOBAL              1 (key)
                      4 BINARY_SUBSCR
                      6 LOAD_METHOD              2 (append)
                      8 LOAD_GLOBAL              3 (element)
                     10 CALL_METHOD              1
                     12 RETURN_VALUE
        

        I believe the defaultdict simply overrides the __missing__ method of dict, effectively making it a ‘ask for forgiveness’ implementation.

    2. 12

      The post sorely missies a benchmark measuring actual performance difference. Given that message_is_compressed(message) is probably not message.startswith("{"), the actual JSON parsing should completely dominate the check, and I would be surprised if this indeed makes a measurable difference.

      1. 3

        True, I felt the same writing the post, since since it’s from a year ago, and the reason for blogging it was a friend’s chat I didn’t have the chance to measure it.

        Will try to measure it next time. But the point is -even though not measured- a string scan easily adds up in overhead over millions of messages. For ~0.02% of the cases?

        1. 4

          What do you mean by “a string scan”? What was the nature of the check to see if a message was compressed? Was it trialing an unpacking of the message?

          I ask because you’ve brought it up a lot of times, once in reference to checking a single char at the start of the string, and it feels like you’ve been bitten by something related hard enough that you’ve got a visceral reaction here, rather than measuring things out?

      2. 10

        Clever trick, but I think that down this path lies madness. The 256kb limit hasn’t gone away; it will just creep up on you when you least expect it. Suppose you switch compression algorithms to buy more time - now every message detected as base64 will have to rotate through a bunch of decompression algorithms to find one that generates reasonable-looking output. There goes your performance improvement.

        Some alternatives:

        1. Spin up an alternative SNS topic for compressed messages, and have different lambdas processing compressed/uncompressed data. Possibly causes a maintenance explosion if you need to have multiple consumers of both topic.
        2. Prefix your payloads with some metadata (at least a version number), in a way that’s fast to parse.
        3. Use a faster language if speed really is that important, and probably pick up safety guarantees too given recent advances in popular typesafe languages.
        4. Store large bodies elsewhere, like S3 or DynamoDB. Adds a lot more complexity to your messaging system.
        1. 8

          Devil’s advocate: why impose a maintenance or complexity burden up front? If you can use a good enough solution right now, you can attack the complexity later on when you have a better idea of how the problem is going to change. If it never comes up then you avoided over designing your system.

          That said, I don’t totally disagree with you.

          1. 10

            Alternative 2 is the best idea imo. Encoding the message “kind” in a header can be very cheap and very fast.

            1. 1

              Agreed.

          2. 4

            We were already considering option #4, but as already mentioned by other comments, we didn’t want to add that complexity upfront at the moment.

            1. 3

              FWIW, I worked on a system dealing with SQS’s similar limit, and we’ve also dealt with compression and multiple formats. Observations, though I’m not sure they change anything for anyone:

              • We could be confident that compression does what we need because we know the kind of content we put in. For us it’s batches of similar items and we get much more than gzip’s typical ~3:1 for text or such.
              • We temporarily ran a fork of the system where individual items could be huge, and implemented option 4: a JSON struct in SQS with some metadata and a pointer to S3. (We also switched to zstd, which can better handle repetition across large items.) There was enough work to do on each batch that the S3 latency didn’t really matter.
              • Hacking in a second format without an explicit version number was easier in practice than in theory. We run the producer and the consumer. Our compressed messages predictably started with gzip’s magic bytes base64’d, and our JSON pointing to S3 with {. A msg.startswith is negligible even compared to just the SQS fetch, and we still have a lot of magic-number (or other format-distinguishing) space open.
              • SQS and SNS also both have message attributes you could use for version numbers, S3 pointers, etc. Seems like you’re essentially having your AWS library do a serialize/deserialize for you, with no advantage in terms of size limit or billing, but they still may be convenient.

              Enough of the code was the application (as opposed to the fetching/decoding) that I’m not 100% sure if we used msg.startswith or try/except and the difference is small compared to other work being done and other code.

            2. 4

              https://quoteinvestigator.com/2018/06/19/forgive/

              It didn’t sound right that Grace Hopper coined that phrase. I’d heard it in too many contexts for her to be the origin.

              1. 4

                Encoding message type in-band like this is a great way to create security vulnerabilities. I would reject this approach in review on that general principle.

                The specifics of how you’re suggesting doing the detection are also weird. In addition to the likely unique prefix of the JSON that matklad notes, your compressed blob probably already has a unique prefix (e.g., a gzip or zlib header). Checking for that magic in base64 form should have negligible cost.

                1. 3

                  If these are json messages, I’d check if maybe you can compress everything. Stream compression of text is really cheap and at 256kb, I wonder if the uncompressed sending takes longer than just compressing everything.

                  1. 3

                    Only recently did try/except become “zero-cost” in CPython. Before, the if may well have been faster.

                    1. 4

                      If the condition check itself is fast, sure.

                      1. 1

                        Yes, good point.

                    2. 3

                      Forget about performance – making a habit of this prevents race conditions and TOCTOU security problems.

                      1. 2

                        Most of the time I would chose clarity instead of performance, if I can’t have both, and IMHO the ifs version is clearer here. Moreover, in the case of a lambda function, I think you will spend way more time starting the python interpreter than computing an if statement !

                        1. 6

                          That happens mostly with cold starts, most of the time this lambda function is warm/hot, continuously processing a considerable load of messages. A string scan easily adds up in overhead over millions of messages. For ~0.02% of the cases?

                          Readability counts, but it also a tradeable trade-off in such cases. And I don’t personally see a try/catch that terribly less readable.

                        2. 1

                          To circumvent this limit, we’ve found that 256KB is well enough room if we compress these kind of messages. But compressing produces bytes (binary data), and AWS SNS accepts text-based messages, so we had to encode the data with base64 before sending over to SNS.

                          Could you store the binary value in a MessageAttribute? Then you’d save time on decoding the base64. Might work for the 99.8% of smaller messages, too: there’s probably a library for decoding json from binary files that’s faster than parsing text.

                          1. 1

                            For the specific example given, you may want to consider using Ascii85 instead of Base64? SNS apparently supports at least 0x20..0x7f, and you could even swap 0x7f = DEL for e.g. 0x0a = \n if you want printable characters, so…

                            1. 1

                              This is all well and good until your compression | base64 step accidentally generates valid JSON.

                              How likely is that to happen? I don’t know, perhaps impossible for gzip, but that might not hold for other algorithms. I wouldn’t like to be the one who has to chase down this possibility as part of a debugging run.

                              1. 2

                                Base 64 doesn’t include quotation marks or brackets, but it could be all digits. That’s extremely unlikely, but theoretically possible.

                                1. 2

                                  Yes, all digits was my thinking, or possibly null