1. 21

  2. 6

    The “AI art” debate usually assumes that image-generation models are already as good as humans, or soon will be. While I enjoy a philosophical discussion, my personal experience with those image generators suggests it will take a long time for those discussions to actually become relevant. ;)

    In that release post, for example, the image for a dream of an open-source universe of creativity is kinda neat: a robed figure in front of filament domes with a starry sky in the background. I tried that prompt myself and the first result looked like a generic space-themed image with “dreamm-drame” (sic) text on a purple background. I tried a couple more, and I don’t know how many attempts it takes to get anything like the one they picked for the post.

    When DALL-E et al. became public, I was quite excited because I thought I could finally generate simple illustrations without bothering any humans. The images in release announcements do look great for AI-generated outputs for sure, so I thought I could get the same.

    In reality, even in simple cases I had to sift through piles of garbage to pick out a somewhat decent-looking image. Generating a photo of an imaginary person without iconic GAN artifacts is still ten misses per one hit — most of the time you can see where exactly it stitched multiple faces together. So, out of a hundred images, 90 are just bad and the remaining ten are not bad but not necessarily what you want.

    For a time I thought that generating less challenging stuff was doable and I was just bad at writing prompts. Writing better prompts kinda does help one get closer to what one wants, but it’s not a key to avoiding images that are poorly generated. The models still routinely misunderstand very simple instructions like “looking at viewer”. Some things simply weren’t in the training data: both DALL-E and Stable Diffusion don’t understand what’s a shovel or a pitchfork, so generating images of farmers gives very amusing results.

    It’s interesting to imagine a future where AI-generated art competes with human illustrators, but that future isn’t here yet and I doubt it will be soon.

    1. 6

      both DALL-E and Stable Diffusion don’t understand what’s a shovel or a pitchfork, so generating images of farmers gives very amusing results.

      This is an interesting case of implicit bias: because the training set consists of publicly available images, and a lot of those are stock photography, you will get good results for “businessperson” and “student” but less good results for those work categories where stock photography doesn’t have as much of a market.

      1. 2

        I have a related, and possibly stupid question that I hope someone who’s more familiar with modern AI/statistical learning can answer.

        So a long time ago (to put it in context, GPGPUs were a hot new thing) the team I was in experimented with some statistical models. This was in a completely unrelated field, with largely unrelated technology, and it’s distilled from boring shit based on Maxwell’s equations, so the description may not be very clear, hence the “possibly stupid” part. Also it was a long time ago, sorry for the lack of detail – I really don’t remember much. This was a super-early experiment that our profs weren’t too confident about anyway, so I never built anything worth publishing.

        One of the problems we struggled with was that the more “things” we tried to get our models to recognize, the worse they got at discriminating between them. Basically, when our model recognized “spheres”, “pyramids” and “cubes” (air quotes ‘cause those weren’t spheres and cubes) it did okay. But as we expanded it to more “shapes”, it got worse at discriminating between things that didn’t look like archetypal “spheres” and “cubes”. Expanding the model effectively diluted its ability to discriminate subtle features, to the point that “spheres” that looked enough like “cubes” if you squinted would be unreliably labeled as “spheres”, even though an earlier and technically not as trained model could reliably discriminate between them.

        This was, like I said, super early, and I had zero experience with statistical learning, so it’s obviously likely I just sucked at using this thing. However, I did get to meet someone who’d been distantly involved in the development of a similar model and she told me this was a problem that they’d been aware of.

        Do models like those involved in DALL-E and SD suffer from analogous problems? As in, if the training set is expanded to add further sample point categories (i.e. further work categories), without adding further samples for the currently-known categories, would the model get worse at generating images?

        1. 2

          I think the question is mostly about model capacity. Once your model’s capacity is saturated, it tends to “forget” things it has previously seen. Large models, as conventionally understood, are “under-trained”: over-emphasizing particular samples would make them overfit, but if the sample distribution is “fair”, they won’t. Also, attention (transformers) helps, as demonstrated by GATO (https://www.deepmind.com/publications/a-generalist-agent), and there are other more advanced models (less popular, due to their size and, TBH, doubts about their effectiveness), such as MoE (Mixture of Experts) models, that try to address this issue of incremental learning, among other things.
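
          For a rough idea of what MoE means, here is a minimal sketch in plain Python. It is not how GATO or any production MoE model is implemented: the `moe_forward` function, the two toy “experts” and the gate weights are all made up for illustration. The point it tries to show is that a gate routes each input to a small subset of experts, so adding an expert for a new kind of input does not have to overwrite what the existing experts already do.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=1):
    """Route input x (a list of floats) to the top_k experts picked by a linear gate.

    All names here are hypothetical; this is a toy, not a real library API.
    """
    # One gate score per expert: dot product of x with that expert's gate vector.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    # Keep only the top_k experts and renormalise their weights to sum to 1.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen


# Two made-up "experts", each a tiny function standing in for a sub-network.
experts = [lambda x: 2.0 * x[0], lambda x: -3.0 * x[1]]
# Made-up gate: expert i is favoured when component i of the input is large.
gate_weights = [[1.0, 0.0], [0.0, 1.0]]

out_a, chosen_a = moe_forward([5.0, 0.0], gate_weights, experts)
out_b, chosen_b = moe_forward([0.0, 5.0], gate_weights, experts)
print(chosen_a, out_a)
print(chosen_b, out_b)
```

          With `top_k=1` only one expert runs per input, which is the sparse part: the model as a whole can hold more parameters than any single input ever touches.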

          1. 1

            Ooh, cool, thanks! My handful of weeks of hands-on experience with this predates most of the articles in the GATO paper bibliography, so a lot of that material is way over my head, but I think I can get a basic grasp of this by hopping from keyword to keyword. Awesome!