Integrating diverse datasets is essential for building accurate and robust models in both traditional analytics systems and Generative AI (Gen AI) workflows. However, the variety of data sources - each with unique formats, schemas, and quirks - can make data ingestion and utilization a daunting task. Airbyte simplifies this challenge by providing a platform that connects to any source and seamlessly extracts data regardless of its origin or structure.
At Airbyte, we intentionally adopt an unopinionated stance on the "proper" schema of our connectors. Most connectors output data exactly as retrieved from the source APIs. This approach offers significant advantages:
- Flexibility: Users can decide what normalization best suits their specific needs, tailoring data processing to their unique contexts.
- Ease of Migration: Users transitioning from homegrown ingestion systems can continue using familiar data models, as Airbyte's output closely mirrors the source APIs.
- High Data Fidelity: All the nuances, metadata, and unique fields from a source are preserved.

Furthermore, keeping the data in the same 'shape' as the source allows users to reference the provider's documentation and terminology. However, this strategy requires users to understand the intricacies of each source, which adds cognitive overhead. While it might be the right approach for the majority of data-related use cases, your job might go beyond analytics. For Gen AI applications, where data must be harmonized to facilitate effective retrieval and model training, data normalization becomes indispensable.
In this blog post, we'll delve into the role of normalization in Gen AI data pipelines, explore its benefits, and provide practical advice to improve the maintenance and usability of your data.
The role of data normalization
Gen AI applications are powered by a multitude of data sources - APIs, databases, unstructured files. Each source brings its own set of challenges and quirks, such as varying formats, inconsistent schemas, and differing data quality. This diversity makes it difficult to combine and utilize the data effectively without significant investment in data wrangling and preprocessing.
Data normalization plays a crucial role in addressing these challenges by standardizing disparate data into a consistent, unified format. There are two main reasons why normalization is important:
1) Protecting data consumers from unexpected API changes
APIs evolve over time, often introducing changes that can disrupt existing data pipelines. These changes are outside of your control and can include modifications to data schemas, field names, data types, or even the addition and removal of entire data structures. Without a normalization layer, such changes can have a cascading effect, breaking downstream applications and analyses.
By implementing a normalization layer that translates the API's evolving schema into a fixed schema you control, you shield your data consumers - whether they are application developers or data analysts - from these unexpected changes. This abstraction provides several benefits:
- Schema Stability: Your internal data model remains consistent over time, even as external APIs change.
- Reduced Maintenance Effort: Developers and analysts don't need to constantly update their code to accommodate schema changes in the data sources.
- Improved Reliability: Applications are less prone to breaking due to unforeseen changes, leading to more robust and dependable systems.

While these types of issues aren't especially frequent, they are always difficult to troubleshoot because they occur without any code change and aren't guaranteed to be well documented in API changelogs. One example was HubSpot's API updating the Contact model's hs_latest_source_timestamp field from a date to a datetime ("2023-03-07" became "2023-03-07T17:50:11.025000+00:00"), which broke downstream systems.
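As a minimal illustration of this kind of guard (in Python, with a hypothetical helper name; this is not an Airbyte API), a normalization layer for the HubSpot change above can accept either upstream shape and always hand consumers a single, fixed type:

```python
# Hypothetical normalization guard for a field whose upstream type drifted
# from a date to a datetime, as happened with hs_latest_source_timestamp.
from datetime import datetime, timezone

def normalize_timestamp(value: str) -> datetime:
    """Accept either '2023-03-07' or '2023-03-07T17:50:11.025000+00:00' and
    always return a timezone-aware datetime, so downstream code sees one type."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed

# Both upstream shapes normalize to the same fixed type:
print(normalize_timestamp("2023-03-07"))
print(normalize_timestamp("2023-03-07T17:50:11.025000+00:00"))
```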
2) Enabling integration across multiple sources
Your application might need to integrate data from multiple sources that provide similar information but in different formats or structures. For instance, consider Customer Relationship Management (CRM) data from Salesforce and HubSpot, or ad reports from Facebook Marketing and TikTok. While both CRM platforms manage contacts, leads, and sales data, they use different schemas and field conventions.
By normalizing data to a common data model, you can:
- Support Multiple Vendors: Your application can handle data from either Salesforce or HubSpot, or even both, without any changes to your codebase.
- Simplify Business Logic: Developers can write application logic against a unified schema, reducing complexity and potential errors.
- Facilitate Scalability: Adding new CRM sources in the future becomes more straightforward, as you simply map them to the existing normalized schema.

Even if you don't plan on supporting multiple sources immediately, normalizing your data ensures that your application isn't tightly coupled to the intricacies of a specific API's schema. For example, the fields in Shopify's bulk API for its products endpoints are formatted as camelCase (inventoryQuantity), while they are formatted as snake_case (inventory_quantity) in its REST endpoint. Your Gen AI application shouldn't have to worry about these kinds of inconsistencies.
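As a sketch of what such a mapping can look like (field names here are illustrative, not a prescribed Airbyte schema), a thin translation layer is often all it takes to absorb vendor-specific naming:

```python
# Sketch of a vendor-to-common-model mapping. Each mapping translates a
# source-specific record into the same normalized shape, so application code
# never has to branch on the vendor. All field names are illustrative.
FIELD_MAPPINGS = {
    "salesforce": {"Name": "name", "Amount": "amount",
                   "StageName": "stage", "CloseDate": "close_date"},
    "hubspot": {"dealname": "name", "amount": "amount",
                "dealstage": "stage", "closedate": "close_date"},
}

def to_common_model(vendor: str, record: dict) -> dict:
    """Rename vendor-specific fields into the unified schema."""
    mapping = FIELD_MAPPINGS[vendor]
    normalized = {common: record[source]
                  for source, common in mapping.items() if source in record}
    normalized["source"] = vendor
    return normalized

# The application only ever sees the unified shape:
print(to_common_model("hubspot", {"dealname": "Acme", "amount": "25000", "dealstage": "qualified"}))
```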
By decoupling your application from the intricacies of data sources through normalization, you allow developers to focus on the semantic meaning of the data, making applications easier to maintain and future-proof.
The Gen AI Data Pipeline
While Gen AI introduces new possibilities, the foundational principles of data pipelines remain consistent with established best practices. One key aspect to emphasize is the importance of treating the data normalization process as separate from your typical transformation step.
In an ELT pipeline, data is first extracted from various sources and loaded into a centralized data storage system—such as a data lake or data warehouse—before any transformations are applied. This approach allows for the storage of raw data, providing flexibility and scalability, as transformations can be applied as needed and adapted over time.
However, for Gen AI applications, it's beneficial to further distinguish between normalization and the typical transformation step. This effectively results in an Extract, Normalize, Load, Transform (ENLT) pipeline, which is equivalent to the EtLT flow Airbyte recommends for various use cases.
Why Separate These Steps?
- Stability of the Normalization Layer: The normalization step produces a fixed schema that remains consistent over time. This stable foundation allows developers to build applications without worrying about changes in source data schemas or underlying structures.
- Flexibility in Transformations: The transformation step needs to evolve frequently. Technologies for data retrieval and processing, especially with Large Language Models (LLMs), are advancing rapidly. New techniques and optimizations emerge regularly, necessitating updates to how data is prepared for these models. Implementing these transformations on top of a consistent foundation makes them easier to maintain.

What does this mean concretely? At Airbyte, we think about data pipelines for Gen AI use cases as a four-step workflow:
1) Extract the data
Pull data from all relevant sources and persist it in its raw form. This ensures that you have an unaltered copy of the data, which is valuable for auditing and reprocessing.
When processing sales opportunities, this might result in records representing deals. Coming from HubSpot, they might look like the example below.
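The record below is purely illustrative; actual deal properties depend on your HubSpot portal and API version:

```python
# Hypothetical HubSpot deal record, roughly as returned by the CRM API
# (property names and values are illustrative).
hubspot_deal = {
    "id": "9123456789",
    "properties": {
        "dealname": "Acme Corp - Annual Subscription",
        "amount": "25000",
        "dealstage": "presentationscheduled",
        "closedate": "2024-09-30T00:00:00Z",
        "pipeline": "default",
        "hs_lastmodifieddate": "2024-08-15T17:50:11.025Z",
    },
}
```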
While in Salesforce, the equivalent record might look like the following.
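Again, the fields are illustrative; the exact shape depends on your Salesforce org's configuration:

```python
# Hypothetical Salesforce Opportunity record (fields and values are illustrative).
salesforce_opportunity = {
    "Id": "0065g00000XyZabAAF",
    "Name": "Acme Corp - Annual Subscription",
    "Amount": 25000.0,
    "StageName": "Proposal/Price Quote",
    "CloseDate": "2024-09-30",
    "LastModifiedDate": "2024-08-15T17:50:11.000+0000",
}
```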
2) Normalize the data
Apply a light transformation to standardize the data into a consistent schema:
- Convert dates, times, country names, etc. into a standardized format
- Align the schemas by renaming fields and restructuring the data to fit a unified schema

This might result in a format like the example below.
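One possible unified shape (illustrative only; your normalized schema should reflect your own application's needs):

```python
# Illustrative normalized deal record: one schema regardless of the source,
# with a passthrough slot for vendor-specific properties.
normalized_deal = {
    "source": "hubspot",          # or "salesforce"
    "source_id": "9123456789",
    "name": "Acme Corp - Annual Subscription",
    "amount": 25000.0,
    "currency": "USD",
    "stage": "proposal",
    "close_date": "2024-09-30",
    "updated_at": "2024-08-15T17:50:11+00:00",
    "additional_properties": {"pipeline": "default"},
}
```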
3) Load the data
Load the normalized data into your datastore.
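A minimal sketch of this step, using sqlite purely as a stand-in for your actual warehouse or lake destination:

```python
# Minimal load sketch: persist normalized records into a datastore.
# sqlite3 stands in here for your warehouse or lake.
import json
import sqlite3

# Normalized record from the previous step (abbreviated).
normalized_deal = {"source": "hubspot", "source_id": "9123456789",
                   "name": "Acme Corp - Annual Subscription", "amount": 25000.0}

conn = sqlite3.connect("deals.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS deals (source TEXT, source_id TEXT, name TEXT, amount REAL, record TEXT)"
)
conn.execute(
    "INSERT INTO deals VALUES (?, ?, ?, ?, ?)",
    (normalized_deal["source"], normalized_deal["source_id"],
     normalized_deal["name"], normalized_deal["amount"],
     json.dumps(normalized_deal)),  # keep the full normalized record as well
)
conn.commit()
conn.close()
```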
4) Transform according to your application needs
Perform further transformations tailored to your specific Gen AI application. This might include feature engineering and data enrichment; for example, you might use an LLM to perform sentiment analysis on the leads.
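A sketch of such an enrichment step, assuming the OpenAI Python SDK and an API key in the environment (any hosted or local LLM would work the same way; the model name and prompt are illustrative):

```python
# Sketch of a downstream transformation: enrich normalized deals with a
# sentiment label derived from an LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def enrich_with_sentiment(deal: dict, notes: str) -> dict:
    """Return a copy of the normalized deal with an added sentiment label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable model works
        messages=[{
            "role": "user",
            "content": f"Classify the sentiment of these sales notes as positive, neutral, or negative: {notes}",
        }],
    )
    return {**deal, "notes_sentiment": response.choices[0].message.content.strip()}
```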
Mitigating loss of fidelity
While data normalization offers significant benefits for Gen AI applications, it's important to be mindful of potential challenges and strategize accordingly. One potential downside is that normalization can lead to a loss of fidelity: subtle nuances or source-specific details may be dropped when standardizing data. This can be problematic, especially if those details are crucial for certain analyses or troubleshooting.
Retain the raw data alongside normalized records so no information is permanently lost. This helps diagnose issues or discrepancies in your applications and allows you to augment the normalized records during the transformation phase if needed. The normalized schema should also be flexible enough to support future use cases: when modeling a contact or a user, support multiple addresses and phone numbers for the inevitable edge cases, and leave a spot in your schema to store additional, vendor-specific properties.

Conclusion
Data normalization is not just a technical necessity for Gen AI; it's a strategic enabler that empowers your organization to unlock the full potential of your application. It allows you to:
- Enhance Data Quality: Provide consistent, reliable data that improves model accuracy and insights.
- Increase Operational Efficiency: Reduce the time and resources spent on data wrangling and troubleshooting.
- Foster Innovation: Enable your teams to focus on developing advanced AI models and applications rather than grappling with data inconsistencies.
- Support Scalability: Build a flexible data foundation that can accommodate new sources, technologies, and business requirements as they emerge.

As you navigate the complexities of modern data landscapes, remember that foundational practices like data normalization remain critical to your success. By investing in robust normalization processes and thoughtful pipeline architectures, you position your organization to thrive in the era of Gen AI.
Looking to speak with our AI team about what you're building with AI? Just fill out this form.