How to Handle Unstructured Data

Every stage of an unstructured data pipeline (parsing, chunking, embedding, indexing) can fail without throwing an error. The parser returns text, the chunker produces chunks, the embedding model returns vectors, and retrieval quality quietly degrades because the output at each stage is subtly wrong.

Teams that treat unstructured data handling as ongoing infrastructure build agents that actually work. Teams that treat it as a one-time ETL job debug retrieval problems for months.

TL;DR

  • Handling unstructured data for AI agents requires a seven-stage pipeline: ingest → parse → chunk → extract metadata → embed → index → enforce permissions at retrieval.
  • Parsing is where most information loss occurs. Each file format requires a different parser, and each parser loses specific information such as table structure, headers, and visual layout.
  • Chunking strategy must match embedding model constraints. Chunks exceeding the model's token limit are silently truncated. Chunks that are too small produce embeddings without enough context for accurate retrieval.
  • Permissions and quality failures often surface only at retrieval: ACLs must be attached to every chunk to enforce access, and detection requires evaluating retrieval results against known queries (not just pipeline health metrics).


How Do You Parse Unstructured Data?

Parsing converts raw files into plain text that downstream stages can process. The challenge is that every file format encodes information differently, and every parser makes trade-offs about what to preserve and what to discard. Here's what works and what breaks.

| Format | Common Sources | Parsing Tools | What's Preserved | What's Lost | Failure Detection |
|---|---|---|---|---|---|
| PDF (text-based) | Google Drive, SharePoint, email attachments | PyPDF2, pdfplumber | Body text, basic formatting | Table cell boundaries, multi-column layout, headers/footers, image content | Manual review: compare parsed output to rendered PDF |
| PDF (scanned/image) | Legacy documents, signed contracts, faxes | Tesseract OCR, AWS Textract, Google Vision | Recognized text (varies by quality) | Handwriting, low-resolution text, layout structure, all non-text content | OCR confidence scores; compare character count to expected |
| Word (DOCX) | SharePoint, Google Drive, email | python-docx | Text, headings, basic structure | Tracked changes (included/excluded unpredictably), embedded objects, complex formatting | Diff parsed output against rendered document |
| HTML | Confluence, web pages, help docs | BeautifulSoup, lxml | Main content text | Navigation/sidebar content (often included as noise), JS-rendered content, boilerplate | Check for nav/header/footer elements in parsed text |
| Slides (PPTX) | Google Drive, SharePoint | python-pptx | Slide text, speaker notes (optional) | Visual layout, diagrams, image content, slide-to-slide relationships | Compare chunk count to slide count |
| Slack / Messages | Slack API, Teams API | API export | Message text, timestamps, author | Thread structure (flattened), emoji reactions, file attachments, edited history | Check for thread parent/reply relationships in output |
| Email (EML/MSG) | Gmail, Outlook | email (Python stdlib), extract-msg | Subject, body, sender, recipients | Attachments (require separate handling), HTML formatting, inline images | Verify attachment count matches source |
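The format-to-parser mapping above can be sketched as a simple dispatcher. The parser functions here are stubs standing in for the real libraries; names and return values are illustrative:

```python
from pathlib import Path

# Stub parsers -- each would wrap one of the libraries listed in the table
# (PyPDF2, python-docx, BeautifulSoup, ...).
def parse_pdf(path): return f"pdf text from {path}"
def parse_docx(path): return f"docx text from {path}"
def parse_html(path): return f"html text from {path}"

PARSERS = {
    ".pdf": parse_pdf,
    ".docx": parse_docx,
    ".html": parse_html,
}

def parse(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        # Fail loudly: an unrecognized format should never pass silently.
        raise ValueError(f"no parser registered for {ext}")
    return PARSERS[ext](path)
```

Raising on unknown extensions matters more than it looks: a pipeline that silently skips unsupported files is the first of many failures that never throw an error.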

Two parsing failures cause the most damage downstream.

Tables Are the Hardest Problem

PDF tables are the most common source of parsing failure. 

Standard text extraction tools like PyPDF2 often flatten tables into sequences of words with no cell boundaries, column headers, or row relationships. "Q1 Revenue: $4.2M" becomes indistinguishable from surrounding body text. pdfplumber handles basic tables with just a few lines of code, exposing bounding boxes and character positions for layout preservation. 

AWS Textract provides automatic table detection and cell-level metadata, making it suitable when table extraction is critical.

No parser handles all table formats reliably. A 2024 comparative study across six document categories found that rule-based tools like Camelot, PDFPlumber, and Tabula all performed poorly on table extraction recall outside of a few specific document types. Table-heavy documents like financial reports, compliance docs, and technical specs require manual validation of parsed output against the source.

Boilerplate Pollutes the Embedding Space

HTML pages include navigation bars, cookie notices, footers, and sidebar content that parsers extract alongside main content. Confluence pages include breadcrumbs and page metadata. These boilerplate elements produce embeddings that compete with actual content during retrieval.

An agent searching for "quarterly revenue analysis" may retrieve a chunk that's 40% navigation text and 60% revenue content because navigation keywords contributed to the embedding. Trafilatura is a post-parsing tool that strips boilerplate before embedding. Adding a boilerplate removal step after parsing typically improves retrieval precision.
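A minimal sketch of that removal step using only the standard library, assuming boilerplate lives in `nav`/`header`/`footer`/`aside` elements (a purpose-built tool like Trafilatura handles far more cases):

```python
from html.parser import HTMLParser

# Assumed boilerplate containers -- tune this set per source system.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

class BoilerplateStripper(HTMLParser):
    """Collects text that appears outside of boilerplate elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # how many boilerplate elements we are inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def strip_boilerplate(html: str) -> str:
    stripper = BoilerplateStripper()
    stripper.feed(html)
    return " ".join(stripper.parts)
```

Running this before chunking means navigation keywords never reach the embedding model in the first place.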

How Do You Chunk for Retrieval?

Chunking splits parsed text into segments sized for the embedding model. This is where you determine what unit of information the model sees and what the agent retrieves.

| Strategy | How It Works | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Fixed-size | Split every N tokens (e.g., 500 tokens per chunk) | Simple, predictable chunk count, easy to implement | Splits mid-sentence, mid-paragraph; breaks semantic units | Homogeneous content with consistent density (e.g., legal clauses) |
| Recursive (text-splitter) | Split by paragraph → sentence → character boundaries, recursing to fit target size | Preserves sentence/paragraph boundaries; widely supported (LangChain, LlamaIndex) | May produce uneven chunk sizes; long paragraphs still split arbitrarily | General-purpose default for most content types |
| Semantic | Use sentence embeddings to detect topic shifts; split at low-similarity boundaries | Best semantic coherence per chunk; splits at natural topic transitions | Expensive (requires embedding pass before chunking); non-deterministic | High-value documents where retrieval precision matters most |
| Document-aware | Split by section headers (H1/H2/H3), maintaining document hierarchy | Preserves document structure; section titles become metadata | Requires parser that preserves headers; fails on documents without clear structure | Technical documentation, knowledge bases, wikis with consistent heading structure |
| Sliding window (overlap) | Add N tokens of overlap between adjacent chunks | Preserves cross-boundary context; reduces information loss at chunk edges | Increases storage and embedding cost (10-20% more chunks) | Any strategy above; applied as a modifier, not standalone |

Chunk size must satisfy two constraints simultaneously. First, every chunk must fit within the embedding model's token limit. E5-large caps at 512 tokens, text-embedding-3-small accepts up to 8,191 tokens, and Qwen3-Embedding can go up to 32,768. Text exceeding the limit is either silently truncated or rejected with an error, depending on the model. Either way, the embedding represents incomplete content.

Second, every chunk must contain enough context for the embedding to be meaningful. A 30-token chunk like "See Table 3 for Q2 results" produces an embedding that captures almost no retrievable meaning. You also want to avoid chunks that mix multiple topics; they dilute the semantic signal and can reduce recall for any single query.

The overlap parameter matters more than most teams realize. Adding 50-100 tokens of overlap between adjacent chunks means information at chunk boundaries appears in two embeddings. Without overlap, a sentence split across two chunks may not be retrievable by either chunk's embedding.

Start with recursive chunking at 400-500 tokens with 50-token overlap. Switch to document-aware or semantic chunking when retrieval evaluation shows quality problems.
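A sketch of that starting point, using whitespace word count as a rough token proxy (production code should count tokens with the embedding model's own tokenizer):

```python
def approx_tokens(text: str) -> int:
    # Word count as a crude token proxy; swap in the real tokenizer
    # (e.g. tiktoken) for accurate limits.
    return len(text.split())

def recursive_chunk(text, max_tokens=450, overlap=50):
    """Split on paragraph, then sentence boundaries, merging pieces up to
    max_tokens; fall back to a fixed-size sliding window otherwise."""
    for sep in ("\n\n", ". "):
        pieces = [p for p in text.split(sep) if p.strip()]
        if len(pieces) > 1:
            chunks, current = [], ""
            for piece in pieces:
                candidate = (current + sep + piece).strip() if current else piece
                if approx_tokens(candidate) <= max_tokens:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = piece
            if current:
                chunks.append(current)
            # Recurse on any chunk that is still too large.
            out = []
            for c in chunks:
                if approx_tokens(c) > max_tokens:
                    out.extend(recursive_chunk(c, max_tokens, overlap))
                else:
                    out.append(c)
            return out
    # Base case: no natural boundaries left -- sliding window with overlap.
    words = text.split()
    step = max_tokens - overlap
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

The overlap only kicks in at the fallback level here; libraries like LangChain's `RecursiveCharacterTextSplitter` apply it at every level, which is usually what you want in production.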

How Do You Enforce Access Controls Across Parsing, Chunking, and Retrieval?

Source systems enforce access controls: a Google Drive document shared with the marketing team but not engineering, a Confluence space restricted to the legal department, a Slack channel limited to a project group. When the pipeline parses, chunks, and embeds that document, the access control must travel with every chunk.

Two enforcement approaches exist:

  • Pre-filtering applies permission checks before similarity search, shrinking the search space so unauthorized documents never leave the vector database. It takes more upfront work than post-filtering: permission metadata (user IDs, group memberships, access levels) must be stored alongside every chunk in the vector database and filtered on at query time.
  • Post-filtering runs similarity search first, then removes unauthorized chunks from results. It is simpler to implement but wastes compute on chunks the user can't see and can load unauthorized content into application memory.
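An in-memory sketch of pre-filtering, assuming each chunk carries an `allowed_groups` set attached at ingestion (real vector databases express the same idea as a metadata filter on the query):

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Chunk store: vector plus permission metadata attached at ingestion.
chunks = [
    {"id": "c1", "vector": [1.0, 0.0], "allowed_groups": {"marketing"}},
    {"id": "c2", "vector": [0.9, 0.1], "allowed_groups": {"engineering"}},
    {"id": "c3", "vector": [0.0, 1.0], "allowed_groups": {"marketing", "engineering"}},
]

def search(query_vec, user_groups, k=2):
    # Pre-filter: unauthorized chunks never enter similarity scoring.
    visible = [c for c in chunks if c["allowed_groups"] & user_groups]
    visible.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["id"] for c in visible[:k]]
```

The key property: a chunk outside the user's groups is excluded before scoring, so it can never appear in results regardless of how similar it is to the query.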

The harder problem is permission changes. When a user loses access to a source folder, every chunk from that folder's documents must be filtered going forward. In most implementations, you can update permission metadata in the vector store without re-embedding because the vectors do not change when access controls change. This requires the pipeline to track source-to-chunk lineage so it knows which chunks to update when access controls change.
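A sketch of that metadata-only update, with hypothetical `lineage` and `chunk_acls` stores standing in for the vector database's metadata layer:

```python
# Source-to-chunk lineage: which chunks came from which source document.
lineage = {"doc-42": ["c1", "c2"], "doc-7": ["c3"]}

# Per-chunk permission metadata stored alongside the (unchanged) vectors.
chunk_acls = {
    "c1": {"marketing", "engineering"},
    "c2": {"marketing", "engineering"},
    "c3": {"legal"},
}

def apply_acl_change(source_id, new_groups):
    """Propagate a source-level permission change to every derived chunk.
    Only metadata is rewritten; embeddings are untouched."""
    for chunk_id in lineage.get(source_id, []):
        chunk_acls[chunk_id] = set(new_groups)

apply_acl_change("doc-42", {"marketing"})  # engineering loses access
```

Without the lineage map, the only safe response to a permission change would be re-ingesting the document, which is exactly the cost this tracking avoids.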

How Do You Reprocess Chunks and Embeddings When Source Documents Change?

Unstructured data handling isn't a one-time ETL job. Source documents change: Confluence pages are edited, Slack conversations continue, and new files are uploaded to Google Drive. Each change means existing chunks and embeddings may be stale.

Full reprocessing re-ingests, re-parses, re-chunks, and re-embeds the entire corpus. Simple but expensive. For large knowledge bases, full reprocessing can take hours to days and can drive significant embedding API charges.

Incremental reprocessing detects which source documents changed via modification timestamps, content hashing, or Change Data Capture (CDC), then re-processes only those documents. CDC tracks modifications at the source level and detects changes as they happen. This requires tracking which chunks came from which source document so stale chunks can be deleted and replaced. Incremental processing reduces cost and latency but adds pipeline complexity because you need change detection mechanisms, versioning strategies, and state management for every source.
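A minimal sketch of hash-based change detection (the `seen` state would live in a persistent store between runs; document ids and contents here are illustrative):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# State from the previous run: source id -> hash of its content.
seen = {
    "page-1": content_hash("old revenue analysis"),
    "page-2": content_hash("unchanged onboarding guide"),
}

def detect_changes(current_docs):
    """Return source ids that need re-parsing, re-chunking, re-embedding."""
    changed = []
    for doc_id, text in current_docs.items():
        h = content_hash(text)
        if seen.get(doc_id) != h:   # new document or edited content
            changed.append(doc_id)
            seen[doc_id] = h
    return changed

to_reprocess = detect_changes({
    "page-1": "new revenue analysis",        # edited
    "page-2": "unchanged onboarding guide",  # same
    "page-3": "brand-new runbook",           # new document
})
```

Hashing catches edits that timestamp comparison misses (some sources touch modification dates without changing content), at the cost of fetching each document to hash it; webhook-based CDC avoids even that fetch.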

The freshness requirement depends on the data type. Sub-minute data like Slack messages and support tickets may need updates visible to users within 30 seconds via webhook-based CDC. Knowledge base articles that change weekly can tolerate daily reprocessing. Financial documents that change quarterly may only need monthly reprocessing. 

Get the cadence wrong in either direction — reprocessing too often wastes compute, too rarely serves stale answers — and the failures won't announce themselves.

What Are the Most Common Failure Modes in Unstructured Data Pipelines?

Standard pipeline monitoring checks whether each stage completes without errors. Every failure mode in the table below passes that check. The pipeline runs green while retrieval quality degrades.

| Stage | Silent Failure | Symptom at Retrieval | How to Detect | How to Fix |
|---|---|---|---|---|
| Parsing | Table rendered as flat text | Agent cites table data out of context ("Revenue: 4.2" without column header) | Spot-check parsed output against source documents | Use table-aware parser (pdfplumber for tables, Textract for complex layouts) |
| Parsing | Boilerplate included | Agent retrieves navigation text, cookie notices, disclaimers | Check for common boilerplate patterns in chunk sample | Add boilerplate removal step after parsing |
| Chunking | Chunk exceeds model token limit | Embedding represents truncated content; conclusions/summaries lost | Compare chunk token counts against model's max_tokens | Enforce max chunk size less than or equal to the model token limit |
| Chunking | Chunk too small (< 50 tokens) | Embedding lacks context; retrieval returns fragments | Check chunk size distribution for small outliers | Set minimum chunk size; merge small adjacent chunks |
| Metadata | Source permissions not extracted | Agent returns chunks from documents user can't access | Query as restricted user; check if unauthorized content appears | Extract ACLs at ingestion; attach to every chunk |
| Metadata | Section titles not attached | Agent can't distinguish which section a chunk comes from | Filter retrieval by section; check for empty metadata fields | Use document-aware chunking that preserves heading hierarchy |
| Embedding | Domain mismatch | Retrieval returns "close but wrong" results for specialized terms | Run retrieval evaluation with domain-specific queries; measure recall@5 | Fine-tune embedding model on domain data or switch to a domain model |
| Indexing | Stale embeddings | Agent answers with outdated information despite source update | Compare source modification dates against embedding generation dates | Implement incremental re-embedding triggered by source changes |

Detection requires retrieval-level evaluation: run a set of known queries with known correct answers, then measure whether the pipeline returns the right chunks. Key metrics include Precision@K (fraction of returned results that are relevant) and NDCG (Normalized Discounted Cumulative Gain), which scores result ranking quality from 0-1. 
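Both metrics are a few lines of code, shown here with binary relevance (a chunk either is or isn't a known-correct answer; the ids are illustrative):

```python
from math import log2

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def ndcg_at_k(retrieved, relevant, k):
    """Normalized Discounted Cumulative Gain: 1.0 means every relevant
    chunk is ranked as high as possible."""
    dcg = sum(1 / log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1 / log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

retrieved = ["c7", "c2", "c9", "c4", "c1"]  # pipeline output for one query
relevant = {"c2", "c4"}                     # known correct chunks

p = precision_at_k(retrieved, relevant, 5)  # 2/5 = 0.4
```

Averaging these scores over the full query set gives a single number to track across pipeline changes; a drop pinpoints a regression even when every stage still reports success.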

When these metrics drop, trace backward through the pipeline to find which stage degraded. Run this evaluation after every pipeline change and periodically on stable pipelines to catch drift from source document changes or embedding model updates.

What's the Most Reliable Way to Handle Unstructured Data for AI Agents?

The priority order matters: fix parsing before tuning embeddings, enforce chunk size limits before experimenting with chunking strategies, and attach permissions from day one rather than retrofitting them. Each stage depends on the one before it, so improvements at the wrong layer waste effort.

Teams assembling these stages from independent components — parsing libraries, chunking logic, embedding APIs, vector database clients, permission systems — spend most of their maintenance time on the boundaries between them. Does the parser's output match the chunker's input? Does the chunker respect the model's token limit? Do permissions survive through embedding and indexing?

Airbyte's Agent Engine eliminates those boundaries. As context engineering infrastructure, it connects to 600+ enterprise sources, handles parsing and metadata extraction across file formats, manages chunking and embeddings, and delivers to vector databases like Pinecone, Weaviate, Milvus, and Chroma — with row-level and user-level access controls preserved through every stage and incremental sync with CDC keeping embeddings fresh.

Connect with us to see how Airbyte handles unstructured enterprise data for production AI agents.

You build the agent. We'll bring the data.

Authenticate once. Fetch, search, and write in real-time.

Try Agent Engine →


Frequently Asked Questions

What is the biggest challenge in handling unstructured data?

Parsing quality. Every downstream stage (chunking, embedding, retrieval) depends on the parser producing clean, structurally accurate text. Tables, multi-column layouts, scanned documents, and boilerplate content are the most common parsing challenges.

How do you choose a chunking strategy?

Recursive chunking at 400-500 tokens with 50-token overlap is a reliable starting point because it preserves sentence and paragraph boundaries while fitting within most embedding model limits. Switch to document-aware chunking (split by section headers) for structured documentation, or semantic chunking when retrieval evaluation shows topic-mixing problems.

How do you handle permissions for unstructured data?

Extract access controls at ingestion and attach them as metadata to every chunk. The harder challenge is permission changes: when a user loses access in the source system, every chunk derived from those documents must reflect the change. This requires source-to-chunk lineage tracking so the pipeline knows which metadata to update.

How often should you reprocess unstructured data?

Match reprocessing cadence to source change frequency and staleness tolerance. CDC-based incremental processing handles high-change sources like Slack and support tickets without reprocessing the full corpus. Stable content like knowledge bases and archived documentation can use scheduled batch runs.

Can you handle structured and unstructured data in the same pipeline?

Yes. Many enterprise data sources contain both: a CRM has structured contact records and attached proposals (unstructured), and a ticketing system has structured ticket fields and message threads (unstructured). Handling both in the same pipeline avoids maintaining parallel infrastructure and ensures permissions apply consistently across data types.

