How to Handle Unstructured Data

Every stage of an unstructured data pipeline (parsing, chunking, embedding, indexing) can fail without throwing an error. The parser returns text, the chunker produces chunks, the embedding model returns vectors, and retrieval quality quietly degrades because the output at each stage is subtly wrong.

Teams that treat unstructured data handling as ongoing infrastructure build agents that actually work. Teams that treat it as a one-time ETL job debug retrieval problems for months.

TL;DR

  • Handling unstructured data for AI agents requires a seven-stage pipeline: ingest → parse → chunk → extract metadata → embed → index → enforce permissions at retrieval.
  • Parsing is where most information loss occurs. Each file format requires a different parser, and each parser loses specific information such as table structure, headers, and visual layout.
  • Chunking strategy must match embedding model constraints. Chunks exceeding the model's token limit are silently truncated. Chunks that are too small produce embeddings without enough context for accurate retrieval.
  • Permissions and quality failures often surface only at retrieval: ACLs must be attached to every chunk to enforce access, and detection requires evaluating retrieval results against known queries (not just pipeline health metrics).


How Do You Parse Unstructured Data?

Parsing converts raw files into plain text that downstream stages can process. The challenge is that every file format encodes information differently, and every parser makes trade-offs about what to preserve and what to discard. Here's what works and what breaks.

| Format | Common Sources | Parsing Tools | What's Preserved | What's Lost | Failure Detection |
|---|---|---|---|---|---|
| PDF (text-based) | Google Drive, SharePoint, email attachments | PyPDF2, pdfplumber | Body text, basic formatting | Table cell boundaries, multi-column layout, headers/footers, image content | Manual review: compare parsed output to rendered PDF |
| PDF (scanned/image) | Legacy documents, signed contracts, faxes | Tesseract OCR, AWS Textract, Google Vision | Recognized text (varies by quality) | Handwriting, low-resolution text, layout structure, all non-text content | OCR confidence scores; compare character count to expected |
| Word (DOCX) | SharePoint, Google Drive, email | python-docx | Text, headings, basic structure | Tracked changes (included/excluded unpredictably), embedded objects, complex formatting | Diff parsed output against rendered document |
| HTML | Confluence, web pages, help docs | BeautifulSoup, lxml | Main content text | Navigation/sidebar content (often included as noise), JS-rendered content, boilerplate | Check for nav/header/footer elements in parsed text |
| Slides (PPTX) | Google Drive, SharePoint | python-pptx | Slide text, speaker notes (optional) | Visual layout, diagrams, image content, slide-to-slide relationships | Compare chunk count to slide count |
| Slack / Messages | Slack API, Teams API | API export | Message text, timestamps, author | Thread structure (flattened), emoji reactions, file attachments, edited history | Check for thread parent/reply relationships in output |
| Email (EML/MSG) | Gmail, Outlook | email (Python stdlib), extract-msg | Subject, body, sender, recipients | Attachments (require separate handling), HTML formatting, inline images | Verify attachment count matches source |
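The format-to-parser mapping above can be sketched as a simple dispatcher. The parser functions here are stubs standing in for the real libraries; names and return values are illustrative:

```python
from pathlib import Path

# Stub parsers -- each would wrap one of the libraries listed in the table
# (PyPDF2, python-docx, BeautifulSoup, ...).
def parse_pdf(path): return f"pdf text from {path}"
def parse_docx(path): return f"docx text from {path}"
def parse_html(path): return f"html text from {path}"

PARSERS = {
    ".pdf": parse_pdf,
    ".docx": parse_docx,
    ".html": parse_html,
}

def parse(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        # Fail loudly: an unrecognized format should never pass silently.
        raise ValueError(f"no parser registered for {ext}")
    return PARSERS[ext](path)
```

Raising on unknown extensions matters more than it looks: a pipeline that silently skips unsupported files is the first of many failures that never throw an error.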

Two parsing failures cause the most damage downstream.

Tables Are the Hardest Problem

PDF tables are the most common source of parsing failure. 

Standard text extraction tools like PyPDF2 often flatten tables into sequences of words with no cell boundaries, column headers, or row relationships. "Q1 Revenue: $4.2M" becomes indistinguishable from surrounding body text. pdfplumber handles basic tables with just a few lines of code, exposing bounding boxes and character positions for layout preservation. 

AWS Textract provides automatic table detection and cell-level metadata, making it suitable when table extraction is critical.

No parser handles all table formats reliably. A 2024 comparative study across six document categories found that rule-based tools like Camelot, PDFPlumber, and Tabula all performed poorly on table extraction recall outside of a few specific document types. Table-heavy documents like financial reports, compliance docs, and technical specs require manual validation of parsed output against the source.

Boilerplate Pollutes the Embedding Space

HTML pages include navigation bars, cookie notices, footers, and sidebar content that parsers extract alongside main content. Confluence pages include breadcrumbs and page metadata. These boilerplate elements produce embeddings that compete with actual content during retrieval.

An agent searching for "quarterly revenue analysis" may retrieve a chunk that's 40% navigation text and 60% revenue content because navigation keywords contributed to the embedding. Trafilatura is a post-parsing tool that strips boilerplate before embedding. Adding a boilerplate removal step after parsing typically improves retrieval precision.
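A minimal sketch of that removal step using only the standard library, assuming boilerplate lives in `nav`/`header`/`footer`/`aside` elements (a purpose-built tool like Trafilatura handles far more cases):

```python
from html.parser import HTMLParser

# Assumed boilerplate containers -- tune this set per source system.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

class BoilerplateStripper(HTMLParser):
    """Collects text that appears outside of boilerplate elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # how many boilerplate elements we are inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def strip_boilerplate(html: str) -> str:
    stripper = BoilerplateStripper()
    stripper.feed(html)
    return " ".join(stripper.parts)
```

Running this before chunking means navigation keywords never reach the embedding model in the first place.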

How Do You Chunk for Retrieval?

Chunking splits parsed text into segments sized for the embedding model. This is where you determine what unit of information the model sees and what the agent retrieves.

| Strategy | How It Works | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Fixed-size | Split every N tokens (e.g., 500 tokens per chunk) | Simple, predictable chunk count, easy to implement | Splits mid-sentence, mid-paragraph; breaks semantic units | Homogeneous content with consistent density (e.g., legal clauses) |
| Recursive (text-splitter) | Split by paragraph → sentence → character boundaries, recursing to fit target size | Preserves sentence/paragraph boundaries; widely supported (LangChain, LlamaIndex) | May produce uneven chunk sizes; long paragraphs still split arbitrarily | General-purpose default for most content types |
| Semantic | Use sentence embeddings to detect topic shifts; split at low-similarity boundaries | Best semantic coherence per chunk; splits at natural topic transitions | Expensive (requires embedding pass before chunking); non-deterministic | High-value documents where retrieval precision matters most |
| Document-aware | Split by section headers (H1/H2/H3), maintaining document hierarchy | Preserves document structure; section titles become metadata | Requires parser that preserves headers; fails on documents without clear structure | Technical documentation, knowledge bases, wikis with consistent heading structure |
| Sliding window (overlap) | Add N tokens of overlap between adjacent chunks | Preserves cross-boundary context; reduces information loss at chunk edges | Increases storage and embedding cost (10-20% more chunks) | Any strategy above; applied as a modifier, not standalone |

Chunk size must satisfy two constraints simultaneously. First, every chunk must fit within the embedding model's token limit. E5-large caps at 512 tokens, text-embedding-3-small accepts up to 8,191 tokens, and Qwen3-Embedding can go up to 32,768. Text exceeding the limit is either silently truncated or rejected with an error, depending on the model. Either way, the embedding represents incomplete content.

Second, every chunk must contain enough context for the embedding to be meaningful. A 30-token chunk like "See Table 3 for Q2 results" produces an embedding that captures almost no retrievable meaning. You also want to avoid chunks that mix multiple topics; they dilute the semantic signal and can reduce recall for any single query.

The overlap parameter matters more than most teams realize. Adding 50-100 tokens of overlap between adjacent chunks means information at chunk boundaries appears in two embeddings. Without overlap, a sentence split across two chunks may not be retrievable by either chunk's embedding.

Start with recursive chunking at 400-500 tokens with 50-token overlap. Switch to document-aware or semantic chunking when retrieval evaluation shows quality problems.
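A sketch of that starting point, using whitespace word count as a rough token proxy (production code should count tokens with the embedding model's own tokenizer):

```python
def approx_tokens(text: str) -> int:
    # Word count as a crude token proxy; swap in the real tokenizer
    # (e.g. tiktoken) for accurate limits.
    return len(text.split())

def recursive_chunk(text, max_tokens=450, overlap=50):
    """Split on paragraph, then sentence boundaries, merging pieces up to
    max_tokens; fall back to a fixed-size sliding window otherwise."""
    for sep in ("\n\n", ". "):
        pieces = [p for p in text.split(sep) if p.strip()]
        if len(pieces) > 1:
            chunks, current = [], ""
            for piece in pieces:
                candidate = (current + sep + piece).strip() if current else piece
                if approx_tokens(candidate) <= max_tokens:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = piece
            if current:
                chunks.append(current)
            # Recurse on any chunk that is still too large.
            out = []
            for c in chunks:
                if approx_tokens(c) > max_tokens:
                    out.extend(recursive_chunk(c, max_tokens, overlap))
                else:
                    out.append(c)
            return out
    # Base case: no natural boundaries left -- sliding window with overlap.
    words = text.split()
    step = max_tokens - overlap
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

The overlap only kicks in at the fallback level here; libraries like LangChain's `RecursiveCharacterTextSplitter` apply it at every level, which is usually what you want in production.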

How Do You Enforce Access Controls Across Parsing, Chunking, and Retrieval?

Source systems enforce access controls: a Google Drive document shared with the marketing team but not engineering, a Confluence space restricted to the legal department, a Slack channel limited to a project group. When the pipeline parses, chunks, and embeds that document, the access control must travel with every chunk.

Two enforcement approaches exist:

  • Pre-filtering applies permission checks before similarity search, shrinking the search space so unauthorized documents never leave the vector database. It takes more upfront work than post-filtering: permission metadata (user IDs, group memberships, access levels) must be stored alongside every chunk in the vector database and filtered on at query time.
  • Post-filtering runs similarity search first, then removes unauthorized chunks from results. It is simpler to implement but wastes compute on chunks the user can't see and can load unauthorized content into application memory.
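An in-memory sketch of pre-filtering, assuming each chunk carries an `allowed_groups` set attached at ingestion (real vector databases express the same idea as a metadata filter on the query):

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Chunk store: vector plus permission metadata attached at ingestion.
chunks = [
    {"id": "c1", "vector": [1.0, 0.0], "allowed_groups": {"marketing"}},
    {"id": "c2", "vector": [0.9, 0.1], "allowed_groups": {"engineering"}},
    {"id": "c3", "vector": [0.0, 1.0], "allowed_groups": {"marketing", "engineering"}},
]

def search(query_vec, user_groups, k=2):
    # Pre-filter: unauthorized chunks never enter similarity scoring.
    visible = [c for c in chunks if c["allowed_groups"] & user_groups]
    visible.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["id"] for c in visible[:k]]
```

The key property: a chunk outside the user's groups is excluded before scoring, so it can never appear in results regardless of how similar it is to the query.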

The harder problem is permission changes. When a user loses access to a source folder, every chunk from that folder's documents must be filtered going forward. In most implementations, you can update permission metadata in the vector store without re-embedding because the vectors do not change when access controls change. This requires the pipeline to track source-to-chunk lineage so it knows which chunks to update when access controls change.
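A sketch of that metadata-only update, with hypothetical `lineage` and `chunk_acls` stores standing in for the vector database's metadata layer:

```python
# Source-to-chunk lineage: which chunks came from which source document.
lineage = {"doc-42": ["c1", "c2"], "doc-7": ["c3"]}

# Per-chunk permission metadata stored alongside the (unchanged) vectors.
chunk_acls = {
    "c1": {"marketing", "engineering"},
    "c2": {"marketing", "engineering"},
    "c3": {"legal"},
}

def apply_acl_change(source_id, new_groups):
    """Propagate a source-level permission change to every derived chunk.
    Only metadata is rewritten; embeddings are untouched."""
    for chunk_id in lineage.get(source_id, []):
        chunk_acls[chunk_id] = set(new_groups)

apply_acl_change("doc-42", {"marketing"})  # engineering loses access
```

Without the lineage map, the only safe response to a permission change would be re-ingesting the document, which is exactly the cost this tracking avoids.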

How Do You Reprocess Chunks and Embeddings When Source Documents Change?

Unstructured data handling isn't a one-time ETL job. Source documents change: Confluence pages are edited, Slack conversations continue, and new files are uploaded to Google Drive. Each change means existing chunks and embeddings may be stale.

Full reprocessing re-ingests, re-parses, re-chunks, and re-embeds the entire corpus. Simple but expensive. For large knowledge bases, full reprocessing can take hours to days and can drive significant embedding API charges.

Incremental reprocessing detects which source documents changed via modification timestamps, content hashing, or Change Data Capture (CDC), then re-processes only those documents. CDC tracks modifications at the source level and detects changes as they happen. This requires tracking which chunks came from which source document so stale chunks can be deleted and replaced. Incremental processing reduces cost and latency but adds pipeline complexity because you need change detection mechanisms, versioning strategies, and state management for every source.
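A minimal sketch of hash-based change detection (the `seen` state would live in a persistent store between runs; document ids and contents here are illustrative):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# State from the previous run: source id -> hash of its content.
seen = {
    "page-1": content_hash("old revenue analysis"),
    "page-2": content_hash("unchanged onboarding guide"),
}

def detect_changes(current_docs):
    """Return source ids that need re-parsing, re-chunking, re-embedding."""
    changed = []
    for doc_id, text in current_docs.items():
        h = content_hash(text)
        if seen.get(doc_id) != h:   # new document or edited content
            changed.append(doc_id)
            seen[doc_id] = h
    return changed

to_reprocess = detect_changes({
    "page-1": "new revenue analysis",        # edited
    "page-2": "unchanged onboarding guide",  # same
    "page-3": "brand-new runbook",           # new document
})
```

Hashing catches edits that timestamp comparison misses (some sources touch modification dates without changing content), at the cost of fetching each document to hash it; webhook-based CDC avoids even that fetch.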

The freshness requirement depends on the data type. Sub-minute data like Slack messages and support tickets may need updates visible to users within 30 seconds via webhook-based CDC. Knowledge base articles that change weekly can tolerate daily reprocessing. Financial documents that change quarterly may only need monthly reprocessing. 

Get the cadence wrong in either direction — reprocessing too often wastes compute, too rarely serves stale answers — and the failures won't announce themselves.

What Are the Most Common Failure Modes in Unstructured Data Pipelines?

Standard pipeline monitoring checks whether each stage completes without errors. Every failure mode in the table below passes that check. The pipeline runs green while retrieval quality degrades.

| Stage | Silent Failure | Symptom at Retrieval | How to Detect | How to Fix |
|---|---|---|---|---|
| Parsing | Table rendered as flat text | Agent cites table data out of context ("Revenue: 4.2" without column header) | Spot-check parsed output against source documents | Use table-aware parser (pdfplumber for tables, Textract for complex layouts) |
| Parsing | Boilerplate included | Agent retrieves navigation text, cookie notices, disclaimers | Check for common boilerplate patterns in chunk sample | Add boilerplate removal step after parsing |
| Chunking | Chunk exceeds model token limit | Embedding represents truncated content; conclusions/summaries lost | Compare chunk token counts against model's max_tokens | Enforce max chunk size less than or equal to the model token limit |
| Chunking | Chunk too small (< 50 tokens) | Embedding lacks context; retrieval returns fragments | Check chunk size distribution for small outliers | Set minimum chunk size; merge small adjacent chunks |
| Metadata | Source permissions not extracted | Agent returns chunks from documents user can't access | Query as restricted user; check if unauthorized content appears | Extract ACLs at ingestion; attach to every chunk |
| Metadata | Section titles not attached | Agent can't distinguish which section a chunk comes from | Filter retrieval by section; check for empty metadata fields | Use document-aware chunking that preserves heading hierarchy |
| Embedding | Domain mismatch | Retrieval returns "close but wrong" results for specialized terms | Run retrieval evaluation with domain-specific queries; measure recall@5 | Fine-tune embedding model on domain data or switch to a domain model |
| Indexing | Stale embeddings | Agent answers with outdated information despite source update | Compare source modification dates against embedding generation dates | Implement incremental re-embedding triggered by source changes |

Detection requires retrieval-level evaluation: run a set of known queries with known correct answers, then measure whether the pipeline returns the right chunks. Key metrics include Precision@K (fraction of returned results that are relevant) and NDCG (Normalized Discounted Cumulative Gain), which scores result ranking quality from 0-1. 
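Both metrics are a few lines of code, shown here with binary relevance (a chunk either is or isn't a known-correct answer; the ids are illustrative):

```python
from math import log2

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def ndcg_at_k(retrieved, relevant, k):
    """Normalized Discounted Cumulative Gain: 1.0 means every relevant
    chunk is ranked as high as possible."""
    dcg = sum(1 / log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1 / log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

retrieved = ["c7", "c2", "c9", "c4", "c1"]  # pipeline output for one query
relevant = {"c2", "c4"}                     # known correct chunks

p = precision_at_k(retrieved, relevant, 5)  # 2/5 = 0.4
```

Averaging these scores over the full query set gives a single number to track across pipeline changes; a drop pinpoints a regression even when every stage still reports success.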

When these metrics drop, trace backward through the pipeline to find which stage degraded. Run this evaluation after every pipeline change and periodically on stable pipelines to catch drift from source document changes or embedding model updates.

What's the Most Reliable Way to Handle Unstructured Data for AI Agents?

The priority order matters: fix parsing before tuning embeddings, enforce chunk size limits before experimenting with chunking strategies, and attach permissions from day one rather than retrofitting them. Each stage depends on the one before it, so improvements at the wrong layer waste effort.

Teams assembling these stages from independent components — parsing libraries, chunking logic, embedding APIs, vector database clients, permission systems — spend most of their maintenance time on the boundaries between them. Does the parser's output match the chunker's input? Does the chunker respect the model's token limit? Do permissions survive through embedding and indexing?

Airbyte's Agent Engine eliminates those boundaries. As context engineering infrastructure, it connects to 600+ enterprise sources, handles parsing and metadata extraction across file formats, manages chunking and embeddings, and delivers to vector databases like Pinecone, Weaviate, Milvus, and Chroma — with row-level and user-level access controls preserved through every stage and incremental sync with CDC keeping embeddings fresh.

Connect with us to see how Airbyte handles unstructured enterprise data for production AI agents.

You build the agent. We'll bring the data.

Authenticate once. Fetch, search, and write in real-time.

Try Agent Engine →


Frequently Asked Questions

What is the biggest challenge in handling unstructured data?

Parsing quality. Every downstream stage (chunking, embedding, retrieval) depends on the parser producing clean, structurally accurate text. Tables, multi-column layouts, scanned documents, and boilerplate content are the most common parsing challenges.

How do you choose a chunking strategy?

Recursive chunking at 400-500 tokens with 50-token overlap is a reliable starting point because it preserves sentence and paragraph boundaries while fitting within most embedding model limits. Switch to document-aware chunking (split by section headers) for structured documentation, or semantic chunking when retrieval evaluation shows topic-mixing problems.

How do you handle permissions for unstructured data?

Extract access controls at ingestion and attach them as metadata to every chunk. The harder challenge is permission changes: when a user loses access in the source system, every chunk derived from those documents must reflect the change. This requires source-to-chunk lineage tracking so the pipeline knows which metadata to update.

How often should you reprocess unstructured data?

Match reprocessing cadence to source change frequency and staleness tolerance. CDC-based incremental processing handles high-change sources like Slack and support tickets without reprocessing the full corpus. Stable content like knowledge bases and archived documentation can use scheduled batch runs.

Can you handle structured and unstructured data in the same pipeline?

Yes. Many enterprise data sources contain both: a CRM has structured contact records and attached proposals (unstructured), and a ticketing system has structured ticket fields and message threads (unstructured). Handling both in the same pipeline avoids maintaining parallel infrastructure and ensures permissions apply consistently across data types.

