Synthetic data should be the option of last resort for training AI systems

I recently read about how artificial intelligence (AI) systems make their own data to train on because they are running out of real data.

There was a story in the Financial Times about how many large companies use AI to produce the data that the same AI system then uses to train itself, for example, Large Language Models (LLMs) like ChatGPT. Another article discusses AI systems trained on data generated by the AI system itself.

I must say from the outset that no synthetic data is better than data from the physical world. For example, if someone wants to create an AI system that can distinguish between cancer cells and normal cells, the best way is to give the AI system pictures of cancerous and normal cells taken from actual cells, not from synthetic ones.

Anything else, such as synthetically generated ultrastructural data on cancerous and healthy cells, makes the AI detection system less reliable. Despite all this, researchers continue to generate synthetic data.

Artificial intelligence has changed many things in our lives, including how data is collected and used. One of the most striking uses of artificial intelligence is to make data that doesn’t exist. Synthetic data is generated in a computer rather than coming from actual events. In that regard, synthetic data isn’t the real deal; it is fake!

Fake data produces broken AI systems. Therefore, synthetic data should be the last option rather than the first choice when training AI systems, and when it is used, it should be used with caution.

Synthetic data is artificially generated information that is similar to actual data in terms of basic characteristics and statistical properties but does not correspond to actual events. It is frequently used when actual data is limited, sensitive, or expensive to collect. When actual data is unavailable or unusable, synthetic data can be used for model training, testing, and validation.

Artificial intelligence, especially machine learning, is critical in creating synthetic data.

Generative models such as generative adversarial networks (GANs) are often used to generate synthetic data. AI can also create synthetic data through data augmentation techniques, which produce new data by modifying existing data.

But the existing data must be representative. Using non-representative data to generate data that is meant to be representative is problematic: the generated data inherits the gaps of the original. In the case of image data, possible augmentation techniques include rotating, scaling, flipping and cropping, but again, the same representation dilemma applies.
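To make the augmentation idea concrete, here is a minimal sketch using the torchvision library; the library choice, file name and transform parameters are illustrative assumptions, not a prescription. Note that each transform only alters an existing image, so it cannot add categories of people who were never photographed in the first place.

```python
from PIL import Image
from torchvision import transforms

# Each transform below produces a modified copy of an existing image;
# it cannot invent information absent from the original dataset.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                     # rotate
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # scale and crop
    transforms.RandomHorizontalFlip(p=0.5),                    # flip
])

image = Image.open("face.jpg")  # hypothetical input image
synthetic_variants = [augment(image) for _ in range(10)]
```

Each of the ten variants is statistically derived from the single source photograph, which is exactly why augmentation cannot fix a dataset that is unrepresentative to begin with.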

Algorithmic bias difficulties

Some studies have estimated that up to 60% of the data used to train AI systems will be synthetic by 2024. One of the more sophisticated reasons given for using synthetic data is to deal with algorithmic bias.

For example, more data is collected in Europe than in Africa, even though the population of Africa is larger than that of Europe. As a result, algorithms trained using this data for facial recognition, for example, will perform better for European faces than for African faces.

Technological solutions to augment the African data set with synthetic data so that AI algorithms understand African faces as well as European faces are fraught with difficulties. Again, the representation dilemma is at play here.

It is difficult to use an underrepresented African dataset to create synthetic African data that can make that same dataset representative.

The only way this will work is if the original African database, although limited, contains all the categories of people available in the African population, which is not always the case.

Thus, class representation is key to solving this dilemma. Class representation in the training data helps ensure that the AI system is fair and inclusive. Class representation refers to the distribution of the different classes or categories within the AI training data.

For example, in a binary classification problem, the two categories can be “positive” and “negative”. The training data should ideally contain an equal or at least adequate representation of all classes to ensure that the model learns to predict all classes accurately.

However, in practice, many of the datasets used to train AI models are imbalanced, with some categories over-represented (such as European faces in face recognition) and others under-represented (such as African faces). This imbalance can produce skewed AI models that perform well for over-represented categories but poorly for under-represented ones.
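As an illustration of how such an imbalance can be detected before training, here is a minimal sketch in Python; the class names, counts and the 30% threshold are illustrative assumptions, not values from any real dataset.

```python
from collections import Counter

# Hypothetical label list for a face-recognition training set.
labels = ["european"] * 9000 + ["african"] * 1000
counts = Counter(labels)
total = sum(counts.values())

# Report the share of each class in the training data.
for cls, n in counts.items():
    print(f"{cls}: {n} samples ({n / total:.0%})")

# Flag any class below a chosen minimum share (the threshold is an assumption).
MIN_SHARE = 0.30
underrepresented = [c for c, n in counts.items() if n / total < MIN_SHARE]
if underrepresented:
    print("Warning: under-represented classes:", underrepresented)
```

In this hypothetical example, the check would flag the "african" class at 10%, prompting more real data collection rather than synthetic padding.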

The imbalance in class representation directly affects the fairness of AI systems.

A 2019 study showed that biased training data can lead to discriminatory AI systems. For example, a healthcare AI system trained mostly on data from one sex may not work well for the other. Inequality in AI systems can have severe consequences, including exclusion and discrimination.

A study by Buolamwini and Gebru (the Gender Shades project) found that commercial gender classification systems had higher error rates for darker-skinned and female individuals because of a lack of training data for these groups. This exclusion can exacerbate existing social disparities and create a digital divide.

Strategies to reduce the negative impact of class imbalance are also needed to ensure fairness and inclusion. In addition, AI systems can be made more transparent by disclosing the characteristics of their training data and their performance across different categories.
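As a minimal sketch of what such disclosure could look like, the snippet below reports a model's accuracy separately for each demographic group; the group names, labels and predictions are hypothetical, and a real audit would use the actual evaluation set.

```python
from collections import defaultdict

# Hypothetical per-sample records: demographic group, true label, prediction.
groups = ["lighter", "lighter", "lighter", "darker", "darker", "darker"]
y_true = [1, 0, 1, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0]

correct = defaultdict(int)
total = defaultdict(int)
for g, t, p in zip(groups, y_true, y_pred):
    total[g] += 1
    correct[g] += int(t == p)

# Report accuracy per group, making any performance gap visible.
for g in total:
    print(f"{g}: accuracy {correct[g] / total[g]:.0%} ({total[g]} samples)")
```

Publishing this kind of per-group breakdown, rather than a single overall accuracy figure, is what makes disparities like those found in the Gender Shades study visible.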

Ensuring diverse and proportionate class representation in the training data is essential when developing inclusive AI systems.

Moreover, Silicon Valley, the worldwide hub of advanced technology, innovation and social media, must become more inclusive. Silicon Valley and other similar hubs must have people from different backgrounds. Most workers in Silicon Valley are men, mostly white or Asian. There should be more women, and more Black, Latina and Indigenous people.

This lack of diversity affects how AI is designed and used and leads to biased algorithms. Employment programs should focus on diversity training to deal with unconscious bias and mentor underrepresented groups.

We need to address the economic problems that have led to an excessive concentration of resources in one area to the exclusion of others. The African continent is a major part of the technology value chain; for example, many of the raw materials used in these technologies come from Africa.

Therefore, it is necessary to reform the global financial architecture to ensure the creation of a just world digitally. We need to fix these issues so that the data poverty that leads to the need to generate synthetic data is minimized, especially in the developing world. DM
