I wish that any company with “Open” in their name would share the source code when they have achieved something interesting.
Exactly this! I was like “ok, that’s cool, show me the code!”
This is the culmination of two worlds that a niche group of researchers has been working to bring together, combining text transformers with image convolutional nets, and it's really great to see this first mainstream outcome.
This direction is much bigger than "oh cool, I can make images from text." A big problem in NLP is that it has been trapped in learning from text alone. In reality, we learn language from text, speech, images, and other stimuli.
When I hear people talk about NLU, many folks seem unaware that even if a model perfectly represents a text label as an embedding, the embedding itself still lacks meaning. The label "dog" might be captured by a 768-dimensional vector, but aside from co-occurrence with other words and contexts, the model still doesn't know what a dog is.
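To make that concrete, here's a toy sketch (not any real model's code, and a hypothetical corpus I made up): vectors built purely from co-occurrence counts place "dog" near "cat" simply because they appear in similar contexts. The vectors encode usage patterns, not anything about what a dog actually is.

```python
# Toy distributional vectors: co-occurrence counts only, no grounding.
from collections import Counter
from math import sqrt

corpus = [
    "the dog chased the ball",
    "the cat chased the ball",
    "the dog barked at the mailman",
    "the cat napped on the couch",
]

def context_vector(word, sentences, window=2):
    """Count words co-occurring with `word` within a +/-window span."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t != word:
                continue
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i:
                    counts[toks[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

dog, cat, couch = (context_vector(w, corpus) for w in ("dog", "cat", "couch"))
# "dog" looks more like "cat" than "couch" -- purely from shared contexts.
print(cosine(dog, cat) > cosine(dog, couch))  # → True
```

Real 768-dimensional embeddings are learned rather than counted, but the point stands: similarity reflects distributional statistics, not understanding.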
So here's hoping that the marriage of different vector spaces will get us closer to models that actually ground meaning.
Shameless plug: I’m writing more about these concepts as a contributing author for the book AI-Powered Search. https://www.manning.com/books/ai-powered-search
As far as I can tell, this does not use a CNN for images at all. Do you have any evidence to the contrary?
Yes, you are right! They are using Variational Autoencoders. I mixed it up with some other research I saw using outputs from ResNet and variants. Thanks for calling this out - I'm much more familiar with the text side of things.