Congratulations! By now, you've successfully navigated the cold start challenges of launching your AI-driven product. Now the focus shifts to maintaining and enhancing the algorithm's performance over time. Ensuring the recommendation engine continuously learns from user interactions is core to its success.
So, how do you keep your AI model from becoming outdated and guarantee it stays accurate? Let’s dive into why models degrade and how to set up a robust data pipeline to keep your algorithm fueled with fresh data.
Models Degrade Over Time

Continuing from our running example, imagine yourself as an AI Product Manager leading a machine-learning-driven product recommender on an e-commerce platform. Your product team is starting to notice a decline in the algorithm's performance. This is a typical issue: the predictive power of your model diminishes due to changes in user behavior, market trends, or other external factors. Eventually, this can lead to less relevant product recommendations, declining customer satisfaction, and even smaller basket sizes.
Your data scientist suggests that regularly retraining the model with newly collected data could maintain and even improve its accuracy. However, determining which data to use and how frequently to retrain requires a strategic approach that weighs user needs, business goals, technical feasibility, and viability.
[Figure: Static models degrade over time]
Understanding Why Models Degrade Over Time

AI/ML models' prediction accuracy can degrade over time mainly due to model drift, a phenomenon driven by evolving real-world data that diverges from the original training data. In other words, the data your model was originally trained on is no longer representative of current real-world data, so the model's predictions become less accurate. This can be attributed to three main causes:
1. Data Drift: The statistical properties of the input data change, degrading model performance. This can happen due to evolving user preferences, new user segments adopting your product, or new types of user interactions the model was never trained on (a minimal drift check is sketched after this list).
2. Concept Drift: The relationship between input data and the target variable changes over time. For example, a product recommender might see shifts in what features are predictive of a purchase due to changing market conditions or consumer trends. Concept drift can be:
- Gradual: evolving buyer behaviors or fraudster techniques.
- Sudden: global lockdowns or other major disruptions.
- Recurring: seasonal divergences during holidays, like the Black Friday or Christmas period.

3. Environmental Changes: External factors such as new competitors, economic shifts, or regulatory changes can alter the environment in which your model operates, requiring updates for the model to remain effective.
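As an illustration, here is a minimal, hypothetical sketch of how a team might flag data drift: compare each feature's distribution in recent data against the training data with a two-sample Kolmogorov-Smirnov test. The feature names and the 0.05 significance threshold are illustrative assumptions, not part of our running example.

```python
# A minimal, hypothetical drift check using a two-sample KS test.
# Feature names and the 0.05 threshold are illustrative assumptions.
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(train_df: pd.DataFrame, recent_df: pd.DataFrame,
                 columns: list, p_threshold: float = 0.05) -> dict:
    drifted = {}
    for col in columns:
        # Null hypothesis: both samples come from the same distribution;
        # a small p-value suggests the feature has drifted.
        _, p_value = ks_2samp(train_df[col].dropna(), recent_df[col].dropna())
        drifted[col] = p_value < p_threshold
    return drifted

# Example usage with assumed feature names from the recommender's input data:
# flags = detect_drift(train_df, last_week_df, ["basket_size", "session_length"])
```

A check like this, run on a schedule against fresh production data, turns "the model feels stale" into a concrete, per-feature signal your team can act on.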
While model drift is a major cause of performance regression, other factors can also degrade performance, such as overfitting to the training data, choosing an ill-suited ML model, or data quality issues. However, we remain focused on model drift, as it is the most common.
Combating these issues requires thoughtful monitoring, frequent retraining, and awareness of how models influence the very data they interact with. Careful data handling and continuous learning are essential for maintaining model accuracy in production over the long term: regularly updating the model with fresh data mitigates drift and keeps recommendations accurate and relevant.
Solution: Establishing a Data Pipeline

According to AI Multiple's research, depending on the reason for the drift, the model can be retrained on recent data in different ways:
- using only fresh data, if the previous data set is now outdated,
- combining the old and new data, if the old data is still relevant, or
- a hybrid approach that assigns higher weights to recent data, so the model pays less attention to older records (sketched below).

You also need to decide with your product team on either a predetermined retraining frequency (likely weekly or monthly) or a retraining trigger, such as the availability of a significant volume of new data or a certain threshold being hit in your performance metrics or business KPIs.
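For the hybrid approach, here is a minimal sketch of recency weighting, assuming a scikit-learn model and a hypothetical 30-day half-life; older examples still contribute, just with exponentially smaller weight.

```python
# A minimal sketch of the hybrid retraining approach (illustrative, not the
# only way): combine old and new data, weighting recent examples higher.
import numpy as np
from sklearn.linear_model import LogisticRegression

def recency_weights(ages_in_days: np.ndarray, half_life_days: float = 30.0) -> np.ndarray:
    """Each example's weight halves every `half_life_days`,
    so older data still contributes, just less."""
    return 0.5 ** (ages_in_days / half_life_days)

# Placeholder training set standing in for combined old + new data.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))                          # placeholder features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # placeholder labels
ages = rng.uniform(0, 180, size=1000)                   # days since each example

model = LogisticRegression()
model.fit(X, y, sample_weight=recency_weights(ages))
```

The half-life is the key product decision here: a short one makes the model chase trends quickly; a long one makes it more stable but slower to adapt.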
The alternative to the above is to employ online learning, continuously updating the model with streaming data, where the model architecture supports this. For this to work, your team should establish a real-time data pipeline that integrates your data repositories with your data lakehouse, from where model training occurs. By automating this data flow, you ensure that your model can be retrained at optimal intervals.
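Where the architecture allows it, the online learning loop can look roughly like the sketch below, using scikit-learn's SGDClassifier and its partial_fit API; the stream_batches generator is a stand-in for your real streaming source.

```python
# A rough sketch of online learning with incremental updates; stream_batches()
# is a dummy stand-in for a real-time pipeline (e.g., a consumer reading from
# a message queue). Model choice and batch shapes are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDClassifier

def stream_batches(n_batches: int = 10, batch_size: int = 32):
    """Dummy stand-in: yields (features, labels) mini-batches."""
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 5))
        y = (X[:, 0] > 0).astype(int)
        yield X, y

model = SGDClassifier(loss="log_loss")  # supports incremental fitting
for i, (X_batch, y_batch) in enumerate(stream_batches()):
    if i == 0:
        # All classes must be declared on the first partial_fit call.
        model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
    else:
        model.partial_fit(X_batch, y_batch)
```

The trade-off: online updates keep the model current without full retrains, but they make reproducibility and rollback harder, so many teams pair them with periodic full retrains anyway.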
[Figure: Refreshing models helps keep quality over time]
Now, from the product team’s standpoint, a best-practice approach to building this data pipeline would be:
1. Identify Data Sources: Determine which data sources are most valuable for retraining your model. This could include user interaction data, purchase history, product reviews, and any other relevant first-party or third-party data. Consult with your Data Scientist about which data points might help achieve your product goals.
2. Data Ingestion: Use a reliable data integration tool to move data from these sources to your data lakehouse. Airbyte, a leading provider of open-source data integration solutions, can facilitate this process efficiently. With Airbyte, you can set up connectors to various data sources and automate the data ingestion process (a minimal sync-trigger sketch follows this list).
3. Data Transformation: Cleanse and transform the ingested data to ensure it's in a suitable format for model training. This may involve standardizing data formats, handling missing values, and aggregating data to the appropriate granularity (see the transformation sketch below).
4. Model Retraining: Schedule regular intervals for model retraining based on business needs and technical feasibility. This could range from weekly to monthly retraining sessions, depending on how fast your data changes and the computational resources available (a simple trigger check is sketched below).
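For step 2, here is a hypothetical sketch of triggering an Airbyte connection sync from a script. The base URL and connection ID are assumptions about a self-hosted deployment, and the endpoint path follows Airbyte's configuration API; verify both against the docs for your Airbyte version.

```python
# A hypothetical sketch: triggering an Airbyte connection sync from a script.
# The base URL and connection ID are placeholders for a self-hosted instance;
# check the Airbyte API docs for your version before relying on this.
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"  # assumed local Airbyte instance
CONNECTION_ID = "your-connection-id-here"     # placeholder

def trigger_sync(connection_id: str) -> dict:
    # Airbyte's configuration API exposes a sync trigger per connection.
    resp = requests.post(f"{AIRBYTE_API}/connections/sync",
                         json={"connectionId": connection_id},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()  # includes job info for the triggered sync

# job = trigger_sync(CONNECTION_ID)
```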
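For step 3, a hypothetical transformation pass in pandas: standardizing timestamps, imputing missing values, and aggregating raw events to a daily user-product granularity. The event schema (event_time, user_id, product_id, event_type, price) is an illustrative assumption, not a prescribed format.

```python
# A hypothetical transformation pass for step 3; the column names assume an
# illustrative e-commerce event schema.
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    df["event_time"] = pd.to_datetime(df["event_time"], utc=True)  # standardize timestamps
    df["price"] = df["price"].fillna(df["price"].median())         # impute missing prices
    df = df.dropna(subset=["user_id", "product_id"])               # drop unusable rows
    # Aggregate raw events to one row per user/product/day.
    return (df.assign(day=df["event_time"].dt.date)
              .groupby(["user_id", "product_id", "day"], as_index=False)
              .agg(events=("event_type", "size"),
                   spend=("price", "sum")))
```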
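And for step 4, a simple sketch of combining a scheduled trigger with a metric-based one: retrain when the model is older than a week, or when a monitored KPI drops too far below baseline. The 7-day cadence, the click-through-rate metric, and the 10% drop threshold are all illustrative assumptions.

```python
# A simple retraining trigger for step 4: retrain on a schedule OR when a
# monitored metric degrades. All thresholds here are illustrative assumptions.
from datetime import datetime, timedelta

def should_retrain(last_trained: datetime,
                   current_ctr: float,
                   baseline_ctr: float,
                   max_age: timedelta = timedelta(days=7),
                   max_rel_drop: float = 0.10) -> bool:
    stale = datetime.utcnow() - last_trained > max_age          # scheduled trigger
    degraded = current_ctr < baseline_ctr * (1 - max_rel_drop)  # metric trigger
    return stale or degraded

# Example: last retrained 9 days ago, CTR slipped from 4.0% to 3.5%:
# should_retrain(datetime.utcnow() - timedelta(days=9), 0.035, 0.040) -> True
```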
An over-simplified example from my past: we kept our grocery product recommender fresh and accurate through regular, but not too frequent, retraining of the ML model on newly accumulated purchase history, giving slightly more weight to recent purchases to gently pick up on trending seasonal foods or dietary changes among our users. If a tool like Airbyte had been available to our teams back then, it would likely have saved our engineers long hours.

Using Airbyte for Data Integration

Airbyte offers several advantages for building your data pipeline:
- Open-Source Flexibility: Customize and extend the platform to fit your specific needs. Airbyte's open-source nature means you can tailor the data ingestion process to your unique requirements without being locked into proprietary solutions.
- Wide Range of Connectors: Access a variety of pre-built connectors for common data sources, including databases, APIs, and cloud storage services. This makes it easy to connect all your relevant data sources to your data lakehouse.
- Ease of Use: Simplify the setup and management of data pipelines with user-friendly interfaces and detailed documentation. Airbyte's intuitive setup process allows your team to get up and running quickly without extensive technical expertise.

Conclusion

In summary, to maintain and improve your recommendation algorithm's performance, it's essential to retrain it regularly with fresh data, so your users remain delighted by the accuracy of its predictions. Establishing a data pipeline that integrates your various data sources with your data lakehouse is a strategic move that enables a constant flow of new data into your model's training pool. Tools like Airbyte can simplify this process, ensuring your model stays relevant and accurate over time.
In our next post, we explore how to enhance the accuracy of your recommendations by expanding the dataset used in predictions. Stay tuned!