Powerful AI Needs Powerful Data: The Two Are Inseparable

Technology is changing our lives and our society at a rapid pace, and artificial intelligence (AI) is at the heart of this change. So far, the discussion has focused on new (generative) AI models such as Large Language Models (LLMs) and on the general improvement of algorithms - ChatGPT being the most prominent example. What is often overlooked, however, is the essential ingredient for AI: access to comprehensive, high-quality and legally compliant data.

Data is the basis for AI development

ChatGPT would be an exciting scientific project at best if OpenAI did not have access to a huge corpus of publicly accessible training data (the Internet) and exclusive data license agreements with companies such as Axel Springer, Reddit and Shutterstock. Only the training and fine-tuning of AI models with this public and proprietary data creates real added value. It is not without reason that many tech giants publish AI frameworks and models (e.g. Google's TensorFlow or Meta's Llama) as open source - without additional access to proprietary training data, they offer little real differentiation.

The shift in discussion from AI models to data access has also arrived in Germany: with the founding of DataHub Europe, Deutsche Bahn and Schwarz Digits have recognized that access to and secure exchange of data is essential for successful AI development. Even Aleph Alpha, one of DataHub Europe's first customers, is turning away from developing its own language models and focusing on the entire interplay of models, data, infrastructure and compliance with PhariaAI. 

There are different sources of training data for AI

Proprietary data forms the core 

If companies want to use AI, the focus should primarily be on internal and “proprietary” data. Proprietary data is unique and exclusive to each company, such as sales data, production data, financial data, process data or transaction data. This data can be used to train machine learning (ML) models or fine-tune LLMs to gain business-critical insights faster and make better predictions of business performance. In short, proprietary data gives AI its “micro-economic” context.

Another advantage of the intensive use of internal data lies in total control: internal data can be checked for quality and legal certainty much more easily than external data. When using personal data for AI training, companies must ensure that they have valid declarations of consent from customers and users to process their data, in accordance with the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA) and other privacy regulations. With proprietary data, companies can ensure that they are GDPR- and CCPA-compliant much more easily than if they were to procure data from a third party.

Like privacy compliance, intellectual ownership is also easier to control and define when dealing with internal data. It is always extremely important to clarify whose intellectual property the data is. Companies can ask: does the internal data belong to our company? Or is it the intellectual property of our customers or partners? In the latter case, it must be clearly ensured in contracts and terms of use that this data may be used to train AI models. 

Data from business partners is essential

After internal data, data from business partners such as major customers, suppliers or service providers is of great importance for the application and training of AI models.  

For consumer goods companies such as Henkel, Unilever or Procter & Gamble, it is business-critical for corporate management to have timely and regular access to detailed sales figures from their major customers, such as Walmart or Schwarz Group. The shorter the time lag between the creation and exchange of data between two business partners, the better AI models can make up-to-the-day predictions that inform business decisions.

For example, data from suppliers can be used to train AI models that make better predictions to secure the supply chain. If an AI model is continuously fed with production data from all suppliers, production bottlenecks can be identified at an early stage and a supply chain early warning system can be created. 
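The early warning idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production system: it uses a simple threshold rule in place of a trained AI model, and the function name `early_warnings`, the supplier names, the `window` and `threshold` parameters and all figures are hypothetical.

```python
from statistics import mean


def early_warnings(production, window=3, threshold=0.8):
    """Flag suppliers whose latest output fell below a share of their recent average.

    production: dict mapping supplier name -> list of output figures, oldest first.
    window:     number of earlier periods used as the baseline (assumed value).
    threshold:  fraction of the baseline below which a warning is raised (assumed value).
    """
    flagged = []
    for supplier, series in production.items():
        if len(series) <= window:
            continue  # not enough history to form a baseline
        # Baseline is the average of the `window` periods before the latest one.
        baseline = mean(series[-window - 1:-1])
        if baseline > 0 and series[-1] < threshold * baseline:
            flagged.append(supplier)
    return sorted(flagged)


# Made-up production figures, for illustration only.
production = {
    "supplier_a": [100, 102, 98, 101],  # stable output
    "supplier_b": [80, 82, 79, 55],     # sharp drop in the latest period
    "supplier_c": [50, 51],             # too little history to judge
}

print(early_warnings(production))
```

In practice, a trained model would replace the fixed threshold, but the data flow stays the same: the more current and complete the supplier feeds, the earlier a drop becomes visible.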

The exchange of data between two business partners must always be accompanied by clear agreements and contracts for the use of AI. Data exchange agreements have been around for a long time, but contractual agreements specifically for the use of data in AI training are newer legal territory. If personal data is exchanged, there must be clear documentation of the unambiguous consent of data subjects to the use of their data for the partner's AI training.

External data provides the big picture

If proprietary data from companies and their partners establishes the “micro-economic” context for AI models, then external data primarily establishes the “macro-economic” context. External data includes open data from public authorities or research institutions, as well as commercial data from professional organizations.

External data comes in many forms: financial market data, weather data, company data, image data, map data, import/export data, and hundreds of other categories. There are not only thousands of open data portals worldwide, but also tens of thousands of commercial data providers. The trick for companies is to find exactly the right, relevant and legally compliant data for their AI models - the proverbial needle in a haystack.  

Imagine a large ice cream manufacturer: historical sales figures can be used to train an AI model that predicts sales on a daily basis. Here the model only has the internal context and can only make a prediction “looking in the rear-view mirror”. But what if ice cream sales depend largely on the weather? In that case, not only historical weather data, but also daily weather data and forecasts could provide the AI model with important context. This data allows the AI model to look “out of the side window and the windshield”.
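The difference between the two views can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not a recommended forecasting method: a one-variable least-squares line stands in for the AI model, and the function names and all temperature and sales figures are made up.

```python
from statistics import mean

# Made-up history for illustration: (daily maximum temperature in °C, units sold).
history = [(15, 120), (18, 150), (22, 210), (25, 260), (28, 300), (31, 350)]


def fit_line(points):
    """Ordinary least squares fit of units = slope * temperature + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def rear_view_forecast(points):
    """Internal context only: predict the historical average of sales."""
    return mean(y for _, y in points)


def weather_aware_forecast(points, forecast_temp):
    """Internal plus external context: use tomorrow's temperature forecast."""
    slope, intercept = fit_line(points)
    return slope * forecast_temp + intercept


print("Rear-view mirror:", round(rear_view_forecast(history)))
print("With a 30 °C forecast:", round(weather_aware_forecast(history, 30)))
print("With a 12 °C forecast:", round(weather_aware_forecast(history, 12)))
```

The rear-view forecast returns the same number regardless of the weather, while the weather-aware forecast moves with the external temperature forecast. That movement is the added context external data provides.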

However, the use of external data also comes with challenges. Companies have little control over the data quality of third-party providers, have to diligently check their suppliers' legal certainty in terms of intellectual property and data protection compliance, and finally have to arrive at a fair price for the data. Long-term availability and support also play a major role: what happens, for example, if an open data portal or data provider is no longer available? Are there alternative data sources? Who will provide support if the data does not meet the expected quality and quantity?

The responsible use of data 

In addition to the numerous benefits that data offers for AI development, we must also face the challenges that come with the use of data. Responsible use of data is crucial to ensure that the benefits of the AI revolution are enjoyed by all and not just a few. When it comes to personal data, data protection and data security are key issues. Companies must ensure that the data they collect and use is properly protected and that user privacy is respected. This requires not only robust technical solutions, but also clear ethical guidelines and transparency towards users. 

Originally published in Business Punk.
