1. 27

  2. 5

    Personally, the most interesting aspect of this is the different ways you could represent the same information: schemaless (tuple, dict) vs schemaful (data class, named tuple), limited-size (NumPy arrays) vs effectively unlimited (the others). The size of each object wouldn’t come into consideration except in extremely niche programs, such as those where the max size of each datum is known in advance (where NumPy arrays make sense).

    Also, how does each of these data types scale in memory if you have a million of them, each with a million fields?
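    As a rough sketch of how you might start comparing them (shallow sizes only: sys.getsizeof does not follow references to the contained values, and exact numbers vary across CPython versions and platforms):

        import sys
        from collections import namedtuple
        from dataclasses import dataclass

        import numpy as np

        Point = namedtuple("Point", ["x", "y"])

        @dataclass
        class PointDC:
            x: int
            y: int

        # Shallow size of the same two fields in each representation.
        for obj in ((1, 2),
                    {"x": 1, "y": 2},
                    Point(1, 2),
                    PointDC(1, 2),
                    np.array([1, 2], dtype=np.int64)):
            print(type(obj).__name__, sys.getsizeof(obj))

    (Note the data class instance’s __dict__ is counted separately, while the NumPy array’s reported size includes its fixed-width data buffer.)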

    1. 1

      Store them in a more compact form such as pickle, and unpickle them on demand with a cache.
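      One rough sketch of how that could look (the store/load helpers here are hypothetical, not from any library):

          import pickle
          from functools import lru_cache

          _blobs = {}  # record id -> pickled bytes (the compact form)

          def store(record_id, obj):
              _blobs[record_id] = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

          @lru_cache(maxsize=1024)  # keep recently used records unpickled
          def load(record_id):
              return pickle.loads(_blobs[record_id])

          store(1, {"name": "ada", "scores": [1, 2, 3]})
          print(load(1))  # unpickled on first access, served from the cache after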

    2. 5

      This seems like it’s probably misleading, because it’s measuring the overhead of the high-level data structure and not accounting for how the size in memory varies with what you put in there. If you only ever store (Python) ints and short strings containing code points from the latin-1 range, it’ll look very different than if you store, say, longer strings containing code points outside latin-1 (which will at least double the storage for the string) or more complex types like lists or dicts.

      The string storage varies because internally Python always uses a fixed-width encoding, but chooses it per-string as the narrowest one capable of encoding the widest code point in the string — 1, 2, or 4 bytes.
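      A quick way to see this with sys.getsizeof (exact byte counts are a CPython implementation detail and vary by version):

          import sys

          # Each string picks the narrowest width that fits its widest code point.
          print(sys.getsizeof("a" * 100))           # 1 byte per character
          print(sys.getsizeof("\u0101" * 100))      # 2 bytes per character
          print(sys.getsizeof("\U0001F600" * 100))  # 4 bytes per character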

      1. 2

        > The string storage varies because internally Python always uses a fixed-width encoding, but chooses it per-string as the narrowest one capable of encoding the widest code point in the string — 1, 2, or 4 bytes.

        I thought that Python used UTF-8 internally. TIL, I guess.

        https://peps.python.org/pep-0393/

        1. 2

          As the PEP says, Python used to use either 2-byte (with surrogates) or 4-byte internal Unicode, with the choice baked into the interpreter as a compile-time flag. Then Python 3.3 switched to the dynamic per-string encoding choice used ever since.

          The advantage of this is that strings always have fixed-width storage — which is convenient for a lot of interactions with them, both at the Python level and the underlying C level — but without the inefficiency of UTF-32. There are cases where this makes better use of memory than “UTF-8 always” (since Python can store the first 256 code points in a 1-byte encoding while UTF-8 can only do the first 128), and cases where it doesn’t (since a single code point above U+00FF switches the whole string to a wider encoding).
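          To make both cases concrete (sizes are approximate and CPython-specific):

              import sys

              s = "café" * 1000              # every code point fits in 1 byte internally
              print(sys.getsizeof(s))        # ~4 KB of storage
              print(len(s.encode("utf-8")))  # 5000 bytes: UTF-8 needs 2 bytes per 'é'

              t = s + "\N{SNOWMAN}"          # one code point above U+00FF...
              print(sys.getsizeof(t))        # ...widens the whole string to ~8 KB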

          1. 3

            What’s an example of an algorithm where fixed-width code points are actually useful?

            1. 1

              Iterating is a lot easier with fixed width. Indexing is a lot easier with fixed width. Calculating length is a lot easier with fixed width.

              I know a lot of people really like to say that you shouldn’t be allowed to do those things to Unicode, but there are enough real-world use cases which require them that there’s no reason not to have them be that little bit nicer.

              Also it avoids the issue of how to expose variable-width storage to the programmer: in the old days before PEP 393, a “narrow” (2-byte Unicode) build of Python would leak lone surrogates up to the programmer, and if you’re going to normalize those away you face a complex API design problem of how to do it. Currently Python strings are iterables of code points, not of bytes or code units, which is a cleaner abstraction all around.
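              For instance, all of these are simple and predictable precisely because the storage is fixed-width per string:

                  s = "naïve \N{GRINNING FACE} café"
                  print(len(s))    # 12 code points, computed in constant time
                  print(s[6])      # constant-time indexing lands on the emoji
                  print(s[-4:])    # slicing works in code points too: 'café'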

          2. 1

            Does any language use UTF-8 internally? That seems like a very slow choice if you are doing a lot of string manipulation.

            1. 1

              Swift does for most purposes. If that is too slow, you are probably not dealing with proper strings.

              1. 1

                It doesn’t seem like it’s actually storing anything as UTF-8 in memory, but rather as a collection of Character objects, which, from reading the documentation, seem to me to be a subclass of an integer. That is pretty much the OOP way of implementing strings, as far as I can tell.

                1. 1

                  Storage is UTF-8 according to https://www.swift.org/blog/utf8-string/

                  A Character is a Unicode grapheme cluster and is realised at a much higher level than the raw storage.

          3. 2

            Also, I don’t know if it’s still the case, but I know that it at least used to be true that Python’s dict aggressively resizes to keep the hash table sparse. So even if you use the lowest-memory-overhead container you can find, putting even a single dict inside that container is likely to wipe out all your careful micro-optimizing choices.
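            A sketch of the effect (shallow sizes, CPython-specific numbers):

                import sys

                print(sys.getsizeof((1, 2, 3)))  # a few dozen bytes: header plus 3 pointers

                d = {i: i for i in range(100)}
                print(sys.getsizeof((1, 2, d)))  # same shallow size: getsizeof is not recursive
                print(sys.getsizeof(d))          # the dict alone is a few KB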

          4. 1

            Very useful for me! Thank you for posting this.