In April 2022, when DALL-E, a text-to-image vision-language model, was released, it allegedly attracted more than a million users within its first three months. This was followed by ChatGPT, which reportedly reached 100 million active users in January 2023, just two months after its launch. Both mark striking moments in the development of generative AI, which in turn has brought an explosion of AI-generated content to the web. The bad news is that, in 2024, this means we will also see an explosion of fabricated, nonsensical information, of misinformation and disinformation, and of the amplification of the negative social stereotypes encoded in these AI models.
The AI revolution has not been driven by a recent theoretical breakthrough – indeed, most of the fundamental work underlying artificial neural networks has existed for decades – but by the ‘availability’ of massive data sets. Ideally, an AI model captures a given phenomenon, whether human language, cognition, or the visual world, in a way that represents the actual phenomenon as closely as possible.
For example, for a large language model (LLM) to generate human-like text, it is important that the model is trained on vast amounts of data that somehow represent human language, interaction, and communication. The belief is that the larger the data set, the better it captures human affairs, in all their inherent beauty, ugliness, and even cruelty. We are in an era characterized by an obsession with scaling up models, data sets, and GPUs. Current LLMs, for example, have now entered the era of trillion-parameter machine learning models, which means they require billion-scale data sets. Where can such data be found? On the web.
It is assumed that this data from the web captures a ‘ground truth’ for human communication and interaction, a proxy from which language can be modeled. Although various researchers have now shown that online data sets are often of poor quality, tend to amplify negative stereotypes, and contain problematic content such as racial slurs and hate speech, often directed at marginalized groups, this has not stopped the large AI companies from using such data in the race to scale.
With generative AI, this problem is about to get much worse. Rather than representing the social world from input data in an objective way, these models encode and amplify social stereotypes. Recent work indeed shows that generative models encode and reproduce racist and discriminatory attitudes toward historically marginalized identities, cultures, and languages.
It is difficult, if not impossible, even with current tracking and detection tools, to know for sure how much text, image, audio, and video data is currently being generated, and at what pace. Stanford University researchers Hans Hanley and Zakir Durumeric estimate a 68 percent increase in the number of synthetic articles posted to Reddit between January 1, 2022, and March 31, 2023. Boomy, a generative AI music company, claims to have generated 14.5 million songs (or 14 percent of the world's recorded music) so far. In 2021, Nvidia predicted that by 2030 there would be more synthetic data than real data in AI models. One thing is certain: the web is being flooded with synthetically generated data.
The worrying thing is that these vast quantities of generative AI output will, in turn, be used as training material for future generative AI models. As a result, a significant portion of the training data for generative models in 2024 will be synthetic data produced by generative models themselves. Soon we will be trapped in a recursive loop in which we train AI models using only synthetic data produced by AI models. Much of this data will be contaminated with stereotypes that will continue to entrench historical and social inequalities. Unfortunately, this will also be the data used to train generative models applied to high-stakes sectors, including medicine, therapy, education, and law. We have yet to grapple with the disastrous consequences. By 2024, the generative AI explosion of content that we now find so fascinating will have become a massive toxic dump that comes back to bite us.
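To make the dynamic concrete, here is a minimal toy sketch, a hypothetical illustration not drawn from any study cited above, of what researchers call ‘model collapse’: a simple statistical model is fitted to data, new data is sampled from the fitted model, and the process repeats. When each generation is trained only on the previous generation's synthetic output, estimation error compounds, the distribution narrows, and the rare values present in the original data disappear.

```python
# Toy illustration (an assumption for this sketch, not a result from the
# article): "model collapse" in a recursive training loop. We fit a
# one-dimensional Gaussian "model" to data, sample a new data set from it,
# and repeat -- the analogue of training each model generation on the
# previous generation's synthetic output. With small samples, estimation
# error compounds and the variance collapses: rare, tail values (the
# analogue of minority voices in a corpus) vanish within a few hundred
# generations.
import numpy as np

rng = np.random.default_rng(42)
n = 50                                   # data set size per "generation"
data = rng.normal(0.0, 1.0, size=n)      # generation 0: "real" data

for gen in range(201):
    mu, sigma = data.mean(), data.std()  # "train": fit the model to the data
    if gen % 25 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}  std={sigma:.3f}")
    data = rng.normal(mu, sigma, size=n) # next generation sees only synthetic data
```

Real generative models are vastly more complex than a fitted Gaussian, but the degenerative logic of training on one's own outputs is the same.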