It’s no secret that large models, such as DALL-E 2 and Imagen, trained on vast numbers of documents and images taken from the web, absorb the worst aspects of that data as well as the best. OpenAI and Google explicitly acknowledge this.
Scroll down the Imagen website, past the dragon fruit wearing a karate belt and the small cactus wearing a hat and sunglasses, to the section on societal impact and you get this: “While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized [the] LAION-400M dataset, which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.”
It’s the same kind of acknowledgment that OpenAI made when it unveiled GPT-3 in 2020: “Internet-trained models have internet-scale biases.” And, as Mike Cook, who researches AI and creativity at Queen Mary University of London, points out, so do the ethics statements that accompanied Google’s large language model PaLM and OpenAI’s DALL-E 2. In short, these companies know that their models are capable of producing awful content, and they have no idea how to fix it.