1. 74
    1. 35

      A bit more of a writeup of the technical side: https://www.da.vidbuchanan.co.uk/blog/exploiting-acropalypse.html

      1. 25

        Oh, delightful. The root cause:

        Google was passing “w” to a call to parseMode(), when they should’ve been passing “wt” (the t stands for truncation). This is an easy mistake, since similar APIs (like POSIX fopen) will truncate by default when you simply pass “w”. Not only that, but previous Android releases had parseMode(“w”) truncate by default too! This change wasn’t even documented until some time after the aforementioned bug report was made.

        So changing the defaults in one place in the program broke something somewhere else. Easy mistake to make indeed!

        1. 18

          Google was passing “w” to a call to parseMode(), when they should’ve been passing “wt” (the t stands for truncation).

          I don’t think this is fair.

          The failure is the group/team that made a fairly major change that silently broke bincompat in a way that impacted data security. The bug is not “google should’ve been passing “wt”” it is “the library authors should not have changed the behavior of a shipped API in a way that impacted storage of user data”. But then ParcelFileDescriptor looks like it’s a google library so the answer is still “google did a mistake” but we should not put the blame on the google team that made the blackout the image code for using the API as specified, when the API providers silently change said behaviour.

          1. 10

            Oh, it’s not fair. It’s a stupid mistake that wasn’t even documented, let alone documented effectively. But you can 1000% sympathize with how it happened. Left hand and right hand didn’t know what each other was doing, and presumably the team that made this change had some reason for doing it.

            Now I’m wondering, what tools do we have to prevent this sort of thing from happening? A different kind of file mode API, such as proper sum types or bitflags, would probably have made the difference more clear, but wouldn’t have detected the changed default. Something like Rust’s File builder type is the same story. A unit test that exercises the “crop” function may or may not have caught it; I would write the test so that it crops a test image, then reads a reference image and compares the image pixels one at a time. If I was feeling a little fancy I could test that the actual output file was actually identical to a reference file, either by raw comparison or checksum, but I probably wouldn’t think to do that by default. A code coverage tool miiiiight have noticed that something changed, though most likely only if it were run on both the application and the library. Fuzzing could notice a different code-path was taken in the library but might not really highlight it as a concern.

            Tricky problem. Better invariants for the library (and tests to detect when they change) might really be the only good option then.

            1. 15

              This isn’t even left hand vs right hand.

              Once you ship an API you can’t change fundamental behaviours of that API. This API change obviously broke google’s code, but who know how many other projects use this API and have also been impacted.

              This is API maintenance 101 (especially for system libraries). It’s somewhat shocking that this change got through basic change review let alone platform API review without being automatically denied.

              1. 6

                Totally, but life also isn’t fair.

                Would I do this on purpose? I would certainly try not to. Would I trust myself not to do this sort of shit by accident? Hell no. Hence me pondering automated tools.

                1. 9

                  I take this as a lesson in accidental compatibility. To prevent latent accidents like this from accidentally compiling, rename the function when you change its behavior. I usually start with that, even if it’s just to invoke the help of the compiler and I’m going to change it back within the same commit. The difference is that if someone else uses your API, you don’t change it back.

            2. 8

              Now I’m wondering, what tools do we have to prevent this sort of thing from happening? A different kind of file mode API, such as proper sum types or bitflags, would probably have made the difference more clear, but wouldn’t have detected the changed default. Something like Rust’s File builder type is the same story. A unit test that exercises the “crop” function may or may not have caught it; I would write the test so that it crops a test image, then reads a reference image and compares the image pixels one at a time.

              IMHO this is one of those things that you can’t fix with tech. I’ve seen them fixed, but not purely at the tooling level. You can use tooling to implement the tech side of the fix, but this is a failure at the management “layer”.

              One way to fix it – I won’t claim it’s the best one, it’s just the one that I’ve seen working – is:

              1. Formally document default behaviours of all API functions (as in, have a “default behaviour” sub-section in the documentation)
              2. Ensure similar APIs follow similar default conventions via good ol’ cross-teams documentation review
              3. Don’t allow changes of default API function behaviour. If you really think that e.g. an API function is dangerous and should be phased out, deprecate it in favour of the “safe” one in the next release, then yank out the old one. If, for whatever reason, you absolutely need to change a default behaviour, do it early in a release cycle, as atomically as possible in terms of branch commits (oh a “release” happens across sixty repositories, sucks to be you…) backport it across all LTS branches, and make its integration a release-blocking task everywhere.

              The people who’d come up with this did it after getting bit by several tech-only approaches:

              • Banning modal arguments: they basically ran into literally the same bug, only with flash-backed key-value storage: it turned out you can still truncate by default with config_open_rw in one API and not truncate by default with app_storage_open in another API.
              • Comprehensive test suite to cover positive and negative outcomes of defaults. Caught default behaviour changes early (i.e. on WIP feature branches) but failed to prevent ensuing bugs because catching a default behaviour change doesn’t fix anything, everyone else changing their code does. Also, resulted in murky regressions in backported fixes, as the API change and all dependencies got ported, but old LTS code that had been refactored or even deprecated on the newer branch remained unchanged.
              • Periodic code coverage analysis. Caught default-altering changes but largely resulted in behavioural fixes: instead of walking back on the default-altering change – which was often done by someone in another team, in a library codebase, so not something folks on the application side were comfortable working with – people altered their code to handle it correctly, which certainly fixed the bugs. But also resulted in more complex code that broke more easily, and a lot of needless boilerplate, as a lot of code on the application side was now written to handle things that should’ve not been allowed on the library side. It went south, badly, when the default behaviour of an API changed twice.

              It was definitely draining to implement. It required real documentation to be written, rather than boilerplate “app_storage_open opens the application storage section”-type docs to tick off the “documentation” task in Jira. It also required cross-team reviews, which meant cross-team expertise had to be maintained, which makes managers iffy in a lot of companies. And it meant even early decisions had consequences, so people who did API work had to spend a lot of time figuring out how to build their interface (on the bright side, that also meant the examples/ folder was full of useful material halfway through!) “Worst” of all – it had been adopted before I’d been hired at that company, but folklore about its difficult adoption process was still circulating – it introduced straightforward accountability at the management layer, so it was extremely unpopular with some of the folks in middle management.

              Part of this approach was obviously implemented via better tooling and coding conventions. E.g. the default behaviours in #1 were covered by tests, and the ban on modal arguments actually stuck, because it made writing the tests more difficult. But this was just the tech side of a wider implementation that boiled down to tl;dr make sure everyone’s on the same page, which is an extremely un-tech thing.

              1. 2

                Banning modal arguments: they basically ran into literally the same bug, only with flash-backed key-value storage: it turned out you can still truncate by default with config_open_rw in one API and not truncate by default with app_storage_open in another API.

                What’s a modal argument? Enums?

                1. 2

                  Yeah, as in, the mode argument to open. The idea was that you’d no longer pass operation modes to e.g. I/O functions (read/write, blocking/non-blocking etc.) but instead call a specific function for the mode of operation you wanted (e.g. open_config_rw/open_config_r). This was actually meant to help with some other things as well, but one of the things we’d hoped it would do was smooth some dangerous differences, where – as above – some APIs took “write” to include “truncate” and some didn’t. Unfortunately, that still required people to follow convention, so you still had to check that they actually followed convention, which meant code and design reviews.

                  Modern languages do make this paradigm a lot more powerful. E.g. in Rust you can do something like s = DataStore::from_file(...).open().rw().truncate(). Requiring the truncate() to be explicit is useful, but it’s not a magical fix – you still have to mandate that rw() doesn’t truncate, and you still gotta enforce that with code and design reviews.

              2. 2

                Whew, thank you for all the info! I learned a lot.

                But this was just the tech side of a wider implementation that boiled down to tl;dr make sure everyone’s on the same page…

                Yeah, that’s the Real Goal. IMO the purpose of the tools are to help remind you to do it and help catch when things change under the hood. But the real place the bug occurs is in the interface between one person’s brain and another’s, and building institutional systems that help those puzzle-pieces fit together is its own kind of technology that IMO gets underrated. Especially since implementing it involves social skills.

            3. 3

              The main thing is to not have default modal arguments. Unfortunately in this case the change was from a safe default (wt) to one that presumably was deemed more convenient.

              For the consumer of apis, remember the rule: explicit is better than implicit. Always pass an argument and don’t accept the default. Fortunately linters are now advanced enough that you could have a warning about accepting default arguments.

        2. 13

          Also, the parseMode() change is in AOSP, while the screenshot editing tool is Pixel-only, so they’re almost certainly done by different teams.

      2. [Comment removed by author]

    2. 12

      You also have to be careful when sharing files in the HEIF format, because it supports lossless crop as a feature.

      1. 3

        Any idea what the use case for that is?

        1. 16

          Lossless editing is a useful feature for a lot of production pipelines. You crop the image to the size of the box that it will go in for the final print run, then there’s a lad-minute change to layout (an advertiser pulls something, copy editing makes an article slightly shorter and that pulls a paragraph back to the previous page, and so on). You then need to make the image larger, so you want to extend the crop rectangle slightly. This is much easier if the image that you’ve embedded is effectively the source plus a crop rectangle, rather than an extract. PostScript and most other things in the same space have something similar.

          It’s also useful sometimes to have a few pixels past the edge of the crop. If you’re printing an image on the edge of the page, you typically extend the crop rectangle slightly outside of the printed page so that if the cutting is off by half a millimetre you don’t get a white border.

          Beyond that, for any macroblock-based compression, if you crop anywhere that isn’t a macroblock boundary (or even one that is if the compression includes adjacent macroblocks) then you will introduce net artefacts. Try cropping one pixel from every side of a JPEG and saving, then look at the two images zoomed in: you’ll see artefacts. Ideally, you want to avoid introducing any loss as part of the editing process, so keeping the original image plus the set of transforms (crop, colour changes, and so on) lets the final render work on uncompressed raster data and send that to the printer.

        2. 1

          HEIF (AVIF, HEIC) is a bit of a kitchen sink of features, mainly to support everything that a photo library/digital camera roll may need.

          This is different from handling image sizes that aren’t an even multiple of a block size. That type of cropping is handled automatically in the codec, not in the metadata.

        3. 1

          Especially JPEGs from Instant Messengers like WhatsApp and especially LINE (geez their compression is aweful) give you low quality JPEGs. Using jpeg lossless cropping, you can crop those without re-encoding an already low quality image. Very useful to preserve quality, where otherwise a re-encode to preserve quality after copping would entail a large file size, since it would re-encode the low-quality artifacts as well.

      2. 1

        Jpeg also supports lossless crop. (Although it’s not in the standard itself, jpeg can be cropped losslessly on 8px steps)

    3. 10

      A genuinely baffling vulnerability. Absolutely astounding that an image crop can’t just be an image crop. If only there were an avenue for this to get Google in legal trouble.

      1. 17

        This is unfair. The image crop does everything you would expect assuming you were doing it yourself. The bug is an undocumented change to the IO routine. Specifically prior to android 10 there was a file IO routine, that takes a mode flag like C’s open, where you open a file to write by passing "w", e.g. open(“a.png”, “w”)`. That opens the file for writing, and if the file already exists the output is truncated to 0 size.

        So the code for editing the image did (pseudo code):

        pathToImage = "path/to/image"
        someImageFile = fopen(pathToImage, "r")
        someImageObject = parseImage(someImageFile)
        fclose(someImageFile)
        editImage(someImageObject) // crop stuff, user editing UI or whatever. This is just pseudo code
        // Now save in place
        outputFile = fopen(pathToImage, "w")
        someImageObject.storeImage(outputFile)
        

        This is essentially what google was doing, but instead of fopen() they were using an android API that took the mode string. The important thing is that when you open a file for writing with “w”, if the file already exists then it is truncated to zero bytes, e.g. it’s functionally a new file.

        That is the way the android API worked, which makes sense: you’ve got an API modeled after another API you should be consistent with that other API. At the point this image editing code was written everything was fine: crop, culling, redacting, etc your image is fine.

        But then in android 10 someone made a change (and apparently didn’t document it) that broke bincompat by changing the behavior so the default behavior. Now opening a file to write with “w” stopped erasing the old file, so the above code - with no changes - started leaking private user data. So where previously you got

        file.png: oooooooooo file.png: nnnnn (o for original n for new)

        but post this bincompat breaking change you get

        file.png: oooooooooo file.png: nnnnnooooo

        Now depending on what and where the image was edited you can reconstruct the state of the compression in the original image, so you can recover the original pixels from the tail of the “edited” image.

        It’s important to note that any program editing a file of any kind would start leaking information like this if the the underlying IO functions were changed like this (imagine how bad the result of changing this behavior in fopen() would be).

        The long story short is: the image editing was done correctly, and wasn’t doing anything “stupid”. An underlying library decided that their API wasn’t good and decided to change it, but they changed it after it was already in use, which is just not something you can do without creating this kind of problem. That’s why API design takes time, and why “API stability” doesn’t just mean ABI stability.

        1. 5

          My comment was posted prior to the technical write-up about the discovery of the bug, or explanation of the underlying API change, was made public (afaik). It’s still baffling, even knowing the reason (baffled by such a change to the API being used, going undocumented and subsequently unnoticed).

    4. 5

      So, with samsung making things up with super zoom, Google allowing uncrop… suddenly this scene from RedDwarf is far more realistic :)