As a product manager in today's AI-driven world, you may have already faced this challenge, or it may only be a matter of time before you do: you're eager to train an AI/ML model for your next product, be it a groundbreaking innovation or a smarter update to an existing feature, but you lack the proper training data. Fortunately, you have at least three viable options for navigating this obstacle, which we'll explore in detail in this blog post.
Need for Data

Imagine that during product discovery, you and your team have identified a high-impact, high-priority user pain point. It's an ideal opportunity for an AI/ML solution that aligns perfectly with your product strategy. However, there's a problem: your data scientist or machine learning engineer tells you that the available data is of insufficient quality or quantity to train a model that achieves the accuracy your product goal requires. Assuming simpler alternatives have already been ruled out due to the complexity and scale of the user need, and given that your team is confident the situation can be solved, the decision is made to source data of the right quality and quantity.
Before we explore potential solutions, it's important to address a fundamental question: Why do you need training data at all for AI products?
Data is the fuel for AI algorithms. Without sufficient high-quality fuel, the successful launch of your AI product is not feasible. Just as low-quality fuel can impair a vehicle's performance, poor-quality data can lead to unreliable and ineffective AI models, a classic case of "garbage in, garbage out." Poor data can affect everything from feature selection to the choice of algorithms, compromising the product's utility and user satisfaction.
It's not just about collecting initial training data: you also need high-quality validation and testing data. Training data is what the model learns patterns from; validation data is used to refine the model iteratively during development; and testing data provides an unbiased evaluation of the trained and validated model's performance against your chosen metrics on previously unseen data, ensuring real-world accuracy.
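To make the three splits concrete, here is a minimal sketch using scikit-learn's train_test_split. The file name and the 70/15/15 proportions are illustrative assumptions, not a recommendation from this post.

```python
# Minimal sketch: splitting a dataset into training, validation,
# and test sets with scikit-learn (illustrative 70/15/15 split).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("interactions.csv")  # hypothetical dataset

# First carve out the 70% training portion...
train_df, holdout_df = train_test_split(df, test_size=0.30, random_state=42)
# ...then split the remaining 30% evenly into validation and test sets.
val_df, test_df = train_test_split(holdout_df, test_size=0.50, random_state=42)

print(len(train_df), len(val_df), len(test_df))
```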
[Figure: Visualization of the initial data collection splits]

Keeping in mind both the quality and the quantity necessary for ML training, let's explore some viable strategies to secure the data.
Solution 1: Start collecting data

This point may seem obvious, but in certain situations, you may have no other choice. Due to various organizational, product, or technical reasons, there might be no quality data available for training an AI algorithm, and you may be restricted to using only internal data. This could be because you are developing a first-of-its-kind system, applying AI to a very specific, niche problem, or operating within a highly sensitive or secretive domain. In these cases, you need to start collecting the right quality and quantity of internal data.
Assuming the necessary data wasn’t historically recorded, you'll need to initiate data collection of user activities, such as views, clicks, watch time and so on. In addition to setting up data capture methods to record transactions and interactions with your product, you might also consider asking users for specific feedback, similar to how LinkedIn asks for feedback on suggested posts, and Netflix allows users to rate movies and shows with a thumbs up or down.
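As a rough illustration of what such data capture can look like, here is a minimal sketch of a structured interaction event. The event types, field names, and file-based storage are hypothetical placeholders; in practice you would send events to your analytics or streaming pipeline.

```python
# Minimal sketch: recording user interaction events for later model
# training. Event names, fields, and the storage step are hypothetical.
import json
import time
import uuid

def track_event(user_id: str, event_type: str, item_id: str, **properties):
    """Build and persist one interaction event (e.g., view, click, thumbs_up)."""
    event = {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "event_type": event_type,  # "view", "click", "watch_time", "thumbs_up", ...
        "item_id": item_id,
        "timestamp": time.time(),
        "properties": properties,
    }
    # Appending to a local JSONL file keeps the sketch self-contained;
    # a real product would write to an event pipeline instead.
    with open("events.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

track_event("user-123", "thumbs_up", "show-456", surface="homepage")
```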
Depending on your product’s usage volume, accumulating enough data for AI training could take several weeks or months. If you can afford the time it might take to collect the right internal data, it’s better to start sooner rather than later. Moreover, it’s preferable to take the time to collect high-quality data rather than to train a model on suboptimal data points.
Solution 2: Source data internally or externally

A common scenario occurs when the needed training data is already within reach, either elsewhere within your organization or externally in the digital ecosystem. You and your team just need to locate it and gain access.
If the data exists internally, such as in another division or team, finding it may be relatively straightforward. Yet, accessing it can be challenging due to data protection standards, or the data might not be in an ML-training-friendly format. These, however, are topics for another discussion.
When sourcing data externally, you have several options depending on your use case. You might use publicly available datasets, scrape data from the open internet (as has been done to train some LLMs), purchase datasets from data brokers or data management platform vendors, or establish partnerships with other companies, such as your suppliers, through data sharing agreements.
Regardless of the method, it’s crucial to ensure your data sourcing activities are legal and ethical, always adhering to applicable data privacy and copyright laws.
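For the public-dataset route, the first step is often as simple as pulling the data into a DataFrame for inspection. A minimal sketch follows; the URL is a hypothetical placeholder, and you should substitute a real public dataset whose license permits ML training.

```python
# Minimal sketch: pulling a publicly available dataset into a DataFrame.
# The URL is a placeholder, not a real dataset; always confirm the
# dataset's license and terms of use before training on it.
import pandas as pd

PUBLIC_DATASET_URL = "https://example.com/open-data/retail_transactions.csv"  # hypothetical

df = pd.read_csv(PUBLIC_DATASET_URL)
print(df.shape)
print(df.head())
```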
A real-life example: Years ago, our teams faced a similar challenge while developing an ML-driven product recommendation solution aimed at helping online grocery shoppers quickly build their next basket from their previous purchase patterns, saving them the time of adding recurring items one by one with every order. To train the initial ML model, we already had some online purchasing data from a recently onboarded enterprise partner, with whom we had agreed to run a pilot. However, that data was not sufficient to train the model effectively. Rather than waiting to collect more internal data, a process that could have delayed our progress, we opted to source additional relevant purchase history from our partner's previous online transactions. After lengthy discussions and alignment meetings, we got the green light to use their de-identified transactional data for our training and testing needs.

Solution 3: Generate synthetic data

An alternative to the above methods is to create your own "artificial" dataset. This involves generating synthetic data that mimics the patterns and characteristics of real-world data. Synthetic data generation is particularly useful when real-world data is unavailable, insufficient, or too sensitive to use directly in model training.
There are several ways to create synthetic data, such as using rule-based algorithms or generative AI tools, masking or de-identifying sensitive personal or health information, or even constructing simulations (like Nvidia Drive Sim) that produce high-quality, privacy-conscious imitations of real-world data. For more insights into these methods, see this relevant research paper.
These methods can serve as an extension of the previous strategies, particularly when real-world data alone is insufficient. They can enhance existing datasets through partial data generation—filling in missing values, hiding sensitive information or increasing data volume—as needed for your specific use case.
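To give a feel for the simplest, rule-based approach, here is a minimal sketch that fabricates grocery-style purchase records. The item catalog, distributions, and field names are purely illustrative and not drawn from any real dataset.

```python
# Minimal sketch: rule-based synthetic data generation for grocery-style
# purchase records. Items, distributions, and fields are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)
items = ["milk", "bread", "eggs", "coffee", "apples"]

def synthetic_orders(n_users: int, max_orders: int = 10) -> pd.DataFrame:
    rows = []
    for user_id in range(n_users):
        # Each synthetic user places a random number of orders...
        for order_id in range(rng.integers(1, max_orders)):
            # ...and each order contains 1-3 distinct items.
            basket = rng.choice(items, size=rng.integers(1, 4), replace=False)
            rows.extend(
                {"user_id": user_id, "order_id": order_id, "item": item}
                for item in basket
            )
    return pd.DataFrame(rows)

print(synthetic_orders(n_users=100).head())
```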
Next Steps: Data integration

Congratulations! You and your team have secured sufficient quantities of data for model training. The next step is likely moving this data into a data warehouse or data lake. This centralized storage allows access for multiple data science teams and optimizes the data for exploration, training, and testing.
However, integrating data from various systems and formats into a unified system like a data lake, known as data integration, can be challenging. For instance, some datasets might reside in specific databases, others could be accessible via APIs, or they might exist in file form. This diversity requires a careful and systematic approach to ensure all data sources are combined effectively.
[Figure: Data integration example]

Successfully integrating data, especially on an ongoing basis to ensure freshness, typically requires the expertise of a data engineering team. Fortunately, tools like Airbyte can simplify this process. Airbyte enables your team to create ELT (extract, load, transform) pipelines quickly with a no-code approach, which is particularly beneficial for those not deeply familiar with data engineering.
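For intuition about what such a pipeline does under the hood, here is a minimal hand-rolled extract-and-load sketch covering a file source and an API source. The endpoint, file path, and connection string are hypothetical; this is exactly the kind of plumbing that tools like Airbyte automate, schedule, and keep fresh for you.

```python
# Minimal sketch of an extract-and-load step: pull records from a CSV
# export and a REST API, then land both raw into one analytical store.
# All paths, URLs, and credentials below are hypothetical placeholders.
import pandas as pd
import requests
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@warehouse:5432/analytics")  # hypothetical

# Source 1: a file-based export.
orders = pd.read_csv("exports/orders.csv")
orders.to_sql("raw_orders", engine, if_exists="replace", index=False)

# Source 2: a REST API.
resp = requests.get("https://api.example.com/v1/products", timeout=30)
products = pd.DataFrame(resp.json())
products.to_sql("raw_products", engine, if_exists="replace", index=False)
```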
Now that the initial data has been transferred to a data lake, the next phases of data cleansing, normalization, feature extraction, and data labeling begin. This marks the start of the intensive data science work.
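To hint at what those first data-prep steps involve, here is a minimal sketch of deduplication, missing-value handling, and normalization with pandas and scikit-learn. The column names and imputation choices are hypothetical.

```python
# Minimal sketch of early data-prep steps: deduplicate, handle missing
# values, and normalize a numeric feature. Column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("raw_orders.csv")

df = df.drop_duplicates()                      # remove duplicate rows
df = df.dropna(subset=["user_id", "item_id"])  # drop rows missing key IDs
df["quantity"] = df["quantity"].fillna(1)      # impute a sensible default

# Scale a numeric feature to [0, 1] so it plays well with many models.
df["price_scaled"] = MinMaxScaler().fit_transform(df[["price"]]).ravel()
```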
Wrapping up

From here onwards, the next steps are a "walk in the park": select an appropriate AI model, train the algorithm, evaluate its performance, and iteratively refine it until its performance is sufficient for your audience. After that, the model is deployed and continuously monitored and refined to stay aligned with your product goals.
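For completeness, here is the train-and-evaluate loop in miniature, using a generated toy dataset and a simple baseline classifier. The model choice and metric are illustrative assumptions, not a recommendation for any particular product.

```python
# Minimal sketch of the train-evaluate loop: fit a baseline model on the
# training split and score it on the held-out test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for your prepared training set.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```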
In this blog post, we focused on the initial data sourcing for the model training of an AI-driven product. In our next posts, we will explore the challenges of running such a solution in production. This includes tackling cold start problems, building a continuous data pipeline to keep the model updated with fresh data, and leveraging tools like Airbyte to enhance these processes. Stay tuned for the upcoming posts!