The AI world is still coming to grips with DALL-E 2, OpenAI’s remarkable image-generating model that can draw, paint, or imagine nearly anything you ask of it. Not to be outdone, Google Research has announced a similar model it has been working on, which it claims is even better than its rival.

Imagen is a text-to-image diffusion model that combines a deep level of language understanding with an unprecedented degree of photorealism.

The model takes a text prompt such as “a dog on a bike” and produces a corresponding image. This task has been tackled for years, but it has recently seen a sharp rise in both quality and accessibility.

The first part of the model relies on diffusion. The process starts with an image of pure noise and refines it bit by bit until the model decides it cannot make the result look any more like a dog on a bike than it already does. This is an improvement over top-to-bottom generators, which could get things hilariously wrong on the first guess, and over approaches that could easily be led astray along the way.
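To make that loop concrete, here is a minimal, heavily simplified sketch of the general diffusion-sampling idea. It is not Imagen’s actual code: `predict_noise` is a dummy stand-in for a trained, text-conditioned denoising network, and the step sizes are arbitrary.

```python
import numpy as np

def predict_noise(image, step, prompt_embedding):
    """Stand-in for a trained denoising network conditioned on the text prompt.
    A real model would return its estimate of the noise in `image`; here we
    return random values so the sketch runs end to end."""
    rng = np.random.default_rng(step)
    return rng.standard_normal(image.shape)

def sample(prompt_embedding, steps=50, shape=(64, 64, 3)):
    image = np.random.randn(*shape)  # start from pure noise
    for step in reversed(range(steps)):
        noise_estimate = predict_noise(image, step, prompt_embedding)
        image = image - 0.1 * noise_estimate  # remove a little of the estimated noise
        if step > 0:
            # re-inject a small amount of noise so the refinement stays gradual
            image = image + 0.05 * np.random.randn(*shape)
    return image

low_res = sample(prompt_embedding=None)  # a 64x64 "dog on a bike", in principle
```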

The other part is improved language understanding, achieved through large transformer-based language models of the kind behind recent advances such as GPT-3.
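As a rough illustration of that idea, the sketch below encodes a prompt with an off-the-shelf pretrained transformer encoder via the Hugging Face Transformers library. The specific model (“t5-small”) is an illustrative stand-in, not the encoder Google used; the point is that the text encoder is taken off the shelf and kept frozen, and the image model conditions on its output.

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")
encoder.eval()  # the text encoder stays frozen; only the image model would be trained

prompt = "a dog on a bike"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    # One embedding vector per token; a diffusion model conditions on these.
    text_embeddings = encoder(**tokens).last_hidden_state

print(text_embeddings.shape)  # e.g. (1, num_tokens, 512) for t5-small
```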

How Does Imagen Work?

The model first generates a small image (64×64 pixels) and then runs two “super-resolution” passes on it to bring it up to 1024×1024. This isn’t like normal upscaling, though: AI super-resolution invents new detail consistent with the smaller image, using the original as the basis for a new one.
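The sketch below shows the shape of that cascade, assuming a 64×64 base output and two 4× super-resolution stages as described above. The model functions are dummy placeholders (the “super-resolution” here just repeats pixels so the code runs); a real super-resolution diffusion model would generate new detail conditioned on the small image and the prompt.

```python
import numpy as np

def base_model(prompt_embedding):
    # Stand-in for the base text-to-image diffusion model (64x64 output).
    return np.random.rand(64, 64, 3)

def super_resolve(image, factor, prompt_embedding):
    # Stand-in for a super-resolution diffusion model. A real one does not just
    # stretch pixels; it invents detail conditioned on the image and the prompt.
    return image.repeat(factor, axis=0).repeat(factor, axis=1)

def generate(prompt_embedding):
    image_64 = base_model(prompt_embedding)                     # (64, 64, 3)
    image_256 = super_resolve(image_64, 4, prompt_embedding)    # (256, 256, 3)
    image_1024 = super_resolve(image_256, 4, prompt_embedding)  # (1024, 1024, 3)
    return image_1024

print(generate(prompt_embedding=None).shape)  # (1024, 1024, 3)
```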

Say you have a dog on a bike, and the dog’s eye is about 3 pixels across in the first small image. That doesn’t leave much room for expression! But in the second image, the same eye is 12 pixels across. Where does the detail needed for this come from? The AI supplies it.

The model knows what a dog’s eye looks like, so it generates more detail as it draws. This happens again when the eye is drawn once more, this time at 48 pixels across. It is like the process many artists follow: they start with the equivalent of a rough sketch, fill it out in a study, and then go to town on the final canvas.

This isn’t the first time such a technique has been used; artists working with AI models already rely on it to create pieces much larger than the AI can handle in one attempt. If you split a canvas into several parts and super-resolve each part separately, you end up with a larger, more detailed image, and the process can be repeated until you reach the outcome you wanted, all grounded in the first image.
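For illustration, here is a rough sketch of that tiling workflow under simple assumptions: the image is cut into fixed-size tiles, each tile is upscaled independently by a placeholder `upscale_tile` function, and the results are stitched back together. A real pipeline would use an AI upscaler and overlap the tiles to hide seams.

```python
import numpy as np

def upscale_tile(tile, factor=4):
    # Placeholder upscaler: repeats pixels. A real model would invent new detail.
    return tile.repeat(factor, axis=0).repeat(factor, axis=1)

def tiled_upscale(image, tile_size=64, factor=4):
    h, w, c = image.shape
    out = np.zeros((h * factor, w * factor, c), dtype=image.dtype)
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            out[y * factor:(y + tile.shape[0]) * factor,
                x * factor:(x + tile.shape[1]) * factor] = upscale_tile(tile, factor)
    return out

big = tiled_upscale(np.random.rand(128, 128, 3))
print(big.shape)  # (512, 512, 3)
```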

Google’s researchers claim advances on several fronts. According to them, existing text models can be used for the text-encoding portion, and their quality matters more than simply increasing visual fidelity. That makes sense intuitively: a detailed picture of gibberish is worse than a slightly less sharp picture of exactly what you asked for.

In Google’s own tests, Imagen came out ahead in human evaluations of both accuracy and fidelity. Still, its limitations cannot be ignored.

Limitations and Societal Impact of Imagen

  • The risks of misuse raise concerns about responsibly open-sourcing code and demos, and Google’s team has decided not to release either code or a public demo.
  • The data requirements of text-to-image models have led researchers to rely heavily on large web-scraped datasets. Such datasets reflect social stereotypes, oppressive viewpoints, derogatory or otherwise harmful associations, and undesirable content such as pornographic imagery and toxic language. As a result, the model inherits the social biases and limitations of large language models.

DALL-E 2 vs. Imagen

OpenAI is a step or two ahead of Google in several ways. DALL-E 2 is much more than a research paper; it is available as a private beta that people are actually using, just as they used its predecessor and GPT-2 and GPT-3.

Ironically, the company with “open” in its name has focused on making its text-to-image research available, while Google has yet to attempt that.

That difference is evident in the choice DALL-E 2’s researchers made to curate the training dataset ahead of time and remove any content that might violate their own guidelines. Google’s team, by contrast, used large datasets known to contain inappropriate material.

Final Thoughts

Like the others, Imagen is still clearly experimental and not ready to be used in anything other than a strictly human-supervised manner. Once Google decides to make its capabilities more accessible, we will learn far more about how and why it works.