What Is Unstructured Data and How Is It Used?

Most enterprise knowledge lives outside of databases. Policies, conversations, contracts, and decisions are captured in documents, messages, and files that have no predefined schema. For AI agents, this is the most common source of context to retrieve and reason over, and it is also the hardest to make reliably accessible. The challenge is not storage but building a pipeline that parses, chunks, embeds, and indexes this content while preserving permissions and freshness at every stage.

TL;DR

  • Unstructured data is information without a predefined schema (docs, images, audio/video, messages) and represents the majority of enterprise data and organizational knowledge.
  • AI agents shift the challenge from storage to access: they need to retrieve and reason over documents and conversations, not just structured records.
  • Making unstructured data usable for agents requires a six-stage pipeline (ingestion, parsing, chunking, metadata extraction, embedding, and indexing), where chunking strategy is a primary driver of retrieval quality.
  • Permissions do not automatically survive the pipeline; access controls must be explicitly propagated through every stage or the agent may expose restricted data.
  • Parsing, chunking strategy, and permission propagation are the three areas where pipelines most commonly fail in production.


What Is Unstructured Data?

Unstructured data is information without a predefined data model. A Customer Relationship Management (CRM) record has defined fields (name, email, deal stage) and lives in a relational database with a rigid schema. A PDF contract, a Slack thread, or a recorded meeting has no consistent schema. The content is rich, but the format is unpredictable, and each type requires different storage, processing, and access patterns.

Structured vs. Semi-Structured vs. Unstructured Data

| Dimension | Structured | Semi-Structured | Unstructured |
| --- | --- | --- | --- |
| Schema | Predefined, rigid (rows and columns) | Partial (tags, markers, metadata) | None |
| Examples | CRM records, transaction logs, spreadsheets | JSON, XML, email headers, log files | PDFs, Word docs, images, audio, video, Slack messages |
| Storage | Relational databases (SQL) | Document databases, key-value stores | Data lakes, object storage, vector databases |
| Query method | SQL | JSON queries, XPath | Full-text search, vector similarity search, natural language processing (NLP) |
| AI agent relevance | Structured records (contacts, deals, tickets) | API responses, configuration data | Knowledge base content (docs, conversations, files) |

Where Unstructured Data Lives in the Enterprise

For AI engineers building agents, the relevant question is not what unstructured data is in the abstract. It is where it lives in the enterprise tools your agent needs to access.

| Source Category | Tools | Unstructured Data Types | Agent Use Case |
| --- | --- | --- | --- |
| Knowledge management | Confluence, Notion, SharePoint, Google Docs | Wiki pages, documentation, meeting notes, policies | Enterprise search, employee copilot, onboarding assistant |
| Communication | Slack, Teams, email, Intercom | Messages, threads, transcripts, support conversations | Customer support agent, internal Q&A, sentiment analysis |
| File storage | Google Drive, Dropbox, Box, S3 | PDFs, Word docs, spreadsheets, presentations, images | Contract analysis, document review, research assistant |
| Project management | Jira, Linear, Asana, Monday | Ticket descriptions, comments, attachments | Sprint assistant, bug triage, project status agent |
| Customer data | Salesforce files, HubSpot attachments, Zendesk articles | Support articles, case attachments, proposals, contracts | Customer-facing copilot, deal intelligence, support automation |

The agent use cases in the right column drive the practical importance of unstructured data. An enterprise search agent that only accesses structured CRM records misses the Confluence pages, Slack threads, and Google Docs where most decisions and context actually live.

Why Does Unstructured Data Matter for AI Agents?

Large Language Models (LLMs) consume text natively. They were trained on unstructured data and reason over it directly, making unstructured enterprise content the highest-value input for AI agents, and Retrieval-Augmented Generation (RAG) is the primary pattern for connecting agents to it. 

Here is what agents gain when they can access unstructured data through agentic RAG:

  • Context behind structured records. A CRM deal stage says "Negotiation," but the Slack thread explains why procurement stalled and the meeting notes in Notion capture the timeline change. Unstructured data turns a single field into a complete picture.
  • Knowledge base coverage. Confluence pages, support articles, contract PDFs, and conversation histories contain the policies and decisions that structured databases never capture.
  • Native LLM compatibility. LLMs already reason over text, so unstructured documents and conversations feed directly into retrieval and generation without transformation into tabular formats.
  • Cross-source reasoning. An agent answering a single question can pull from Google Drive, Slack, Notion, and Salesforce files simultaneously, synthesizing context that no single system holds on its own.

The question is how to make that context retrievable without rebuilding every integration from scratch.

How Does Unstructured Data Become Usable for AI?

Making unstructured data consumable for AI agents requires a six-stage pipeline, and each stage introduces its own engineering complexity.

| Stage | What Happens | Engineering Challenge | Failure Mode |
| --- | --- | --- | --- |
| Ingestion | Collect files from sources (Google Drive, Confluence, SharePoint, S3) | Each source has different APIs, auth protocols, and file format support | Missing files, broken auth, incomplete sync |
| Parsing | Extract clean text from PDFs, DOCX, PPTX, HTML, images (OCR) | Different formats encode content differently; layout and structure often lost | Tables become flat text; images dropped; headers stripped |
| Chunking | Break text into segments (300–500 tokens) for embedding | Too small: loses context. Too large: dilutes relevance. Strategy must match content type | "The contract" chunk references "Section 4," but Section 4 is in a different chunk |
| Metadata extraction | Preserve source, author, date, permissions; enrich with section headers, topics | Permissions from source system must be explicitly carried through every stage | Permissions dropped; user sees documents they should not access |
| Embedding | Convert chunks to vectors using embedding models | General models underperform on domain-specific vocabulary; dimension choice affects cost | Legal term "force majeure" embedded near "military force" instead of "contract clause" |
| Indexing | Store vectors + metadata in vector database for similarity search | Index must support filtered queries (by permission, date, source), not just similarity | Agent retrieves semantically similar but access-restricted document |
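The six stages can be sketched end to end. This is a minimal illustration, not a production design: `parse_file`, `embed`, and the in-memory index are hypothetical stand-ins for format-aware parsers, an embedding model, and a vector database.

```python
import hashlib

def parse_file(raw: bytes) -> str:
    """Stand-in parser: real pipelines dispatch on format (PDF, DOCX, OCR)."""
    return raw.decode("utf-8", errors="replace")

def chunk(text: str, size: int = 400) -> list[str]:
    """Naive fixed-size chunking by whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def embed(chunk_text: str) -> list[float]:
    """Stand-in embedding: real pipelines call an embedding model here."""
    digest = hashlib.sha256(chunk_text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def run_pipeline(doc: dict, index: list) -> None:
    """Ingest one document and index its chunks with metadata attached."""
    text = parse_file(doc["raw"])                         # parsing
    for i, piece in enumerate(chunk(text)):               # chunking
        index.append({
            "vector": embed(piece),                       # embedding
            "text": piece,
            "metadata": {                                 # metadata extraction
                "source": doc["source"],
                "chunk_id": i,
                "allowed_groups": doc["allowed_groups"],  # permissions travel too
            },
        })                                                # indexing

index: list = []
run_pipeline(
    {"raw": b"Renewal terms are in Section 4.", "source": "contract.pdf",
     "allowed_groups": ["legal"]},
    index,
)
```

The point of the sketch is the metadata dictionary: source, chunk position, and access controls are attached at write time, because none of the later stages can reconstruct them.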

Three stages carry the most consequential tradeoffs.

Parsing Is Harder Than It Looks

Every file format encodes content differently:

  • A table in a PDF is positioned text elements, not a data structure
  • A heading in a Word document is implemented as a paragraph style (e.g., 'Heading 1'), and that style is used as a semantic marker for the document's heading structure
  • An image in a presentation requires Optical Character Recognition (OCR) before it becomes text

The failures get specific. A PDF contract with a signature block renders as positioned text elements. The parser extracts "John Smith" and "March 15, 2025," but loses the spatial relationship showing those are the signer name and date. The agent cannot determine who signed or when without the layout context. If parsing silently drops structure, every downstream stage inherits that loss.
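One way to avoid inheriting that loss is to have the parser emit typed elements rather than flat text. The element schema below is hypothetical, but it shows the difference between extraction that discards roles and extraction that keeps them recoverable downstream:

```python
# Hypothetical normalized parser output: each element keeps its role so
# chunking and retrieval can follow document structure instead of flat text.
elements = [
    {"type": "heading", "level": 1, "text": "4. Termination"},
    {"type": "body", "text": "Either party may terminate with 30 days notice."},
    {"type": "signature", "name": "John Smith", "date": "March 15, 2025"},
]

def flatten_lossy(elements: list[dict]) -> str:
    """What naive extraction produces: roles and relationships are gone."""
    parts = []
    for el in elements:
        parts.extend(str(v) for k, v in el.items() if k not in ("type", "level"))
    return " ".join(parts)

def flatten_structured(elements: list[dict]) -> str:
    """Keep roles as inline markers so the chunker and the LLM can recover them."""
    lines = []
    for el in elements:
        if el["type"] == "heading":
            lines.append("#" * el["level"] + " " + el["text"])
        elif el["type"] == "signature":
            lines.append(f"Signed by: {el['name']} on {el['date']}")
        else:
            lines.append(el["text"])
    return "\n".join(lines)
```

With the lossy version, "John Smith" and "March 15, 2025" are just adjacent strings; with the structured version, the signer-and-date relationship survives into the text the agent retrieves.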

Chunking Determines Retrieval Quality

Chunk size and strategy directly affect what the agent retrieves. Fixed-size chunking (split every 500 tokens) is simple, but it splits mid-sentence and mid-section. Semantic chunking groups related content, but it adds an embedding pass before the main embedding, which increases latency and cost. Structure-aware chunking follows headings and sections and preserves the author's intended organization.

The tradeoff is consistent: smaller chunks improve retrieval precision (find the specific paragraph), but they lose surrounding context (why that paragraph matters). Larger chunks preserve context, but they dilute relevance when the agent only needs one sentence.

Permissions Must Survive the Entire Pipeline

A document in Google Drive has access controls: specific users or groups can view it. When that document is parsed, chunked, embedded, and indexed in a vector database, access controls must travel with it as metadata. If permissions drop during any stage, the agent retrieves and surfaces content that the querying user should not see. This is the default behavior of most custom pipelines, which handle content conversion but not permission propagation. In regulated environments subject to SOC 2 or HIPAA requirements, this gap creates audit and compliance exposure on top of the data leak itself.
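The enforcement pattern is simple once ACLs are stored as chunk metadata: filter every retrieval by the querying user's groups. The index layout and group names below are assumptions for illustration, not a specific vector database's API:

```python
# Sketch: ACLs carried as chunk metadata, enforced as a filter at query time.
chunk_index = [
    {"text": "Q3 layoff plan draft", "allowed_groups": {"exec"}},
    {"text": "Public pricing FAQ", "allowed_groups": {"exec", "all-employees"}},
]

def retrieve(candidates: list[dict], user_groups: set[str]) -> list[str]:
    """Drop any chunk the querying user's groups cannot see.

    Production systems apply this as a metadata filter inside the vector
    query itself, not as post-filtering, so restricted chunks never leave
    the index and cannot leak through the agent's answer.
    """
    return [c["text"] for c in candidates
            if c["allowed_groups"] & user_groups]

print(retrieve(chunk_index, {"all-employees"}))   # ['Public pricing FAQ']
```

The filter only works if `allowed_groups` was populated during metadata extraction; a pipeline that drops it has nothing to enforce at query time.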

What Are the Tradeoffs Teams Underestimate?

Several tradeoffs tend to be minimized in pipeline tutorials and only surface at production scale.

Embedding Quality vs. Domain Specificity

General-purpose embedding models (OpenAI, Cohere, sentence-transformers) perform well on common text but struggle with specialized vocabulary. Domain-specific RAG systems often show accuracy drops when they confront specialized terminology. Legal terms, medical abbreviations, and internal company acronyms can embed imprecisely, so retrieval returns semantically similar but contextually wrong results. Fine-tuning embedding models on domain data can recover performance, but it requires labeled examples and adds pipeline complexity.

Freshness vs. Reprocessing Cost

Documents change. A Confluence page updated yesterday should produce different embeddings than it did last week. Full reprocessing, which includes re-parsing, re-chunking, and re-embedding the entire corpus, guarantees freshness but costs more at scale.

The math: an enterprise corpus of 50,000 documents with an average of 10 chunks each produces 500,000 embeddings. At approximately $0.0001 per embedding (OpenAI ada-002 pricing as of early 2025), full reprocessing costs roughly $50 per run. Daily reprocessing costs approximately $1,500/month. Delta processing that catches only the 2% of documents that changed daily costs around $30/month. Without freshness infrastructure, agents answer from outdated content while appearing to function correctly.
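The arithmetic above, spelled out with the article's stated assumptions (the per-embedding price is the article's approximation, not a quote):

```python
# Cost of full vs. delta reprocessing, using the article's assumptions.
docs = 50_000
chunks_per_doc = 10
cost_per_embedding = 0.0001      # approximate, per the article
daily_churn = 0.02               # 2% of documents change per day

embeddings = docs * chunks_per_doc                 # 500,000 embeddings
full_run = embeddings * cost_per_embedding         # ~$50 per full reprocess
full_monthly = full_run * 30                       # ~$1,500/month if run daily
delta_monthly = full_run * daily_churn * 30        # ~$30/month, changed docs only

print(round(full_run), round(full_monthly), round(delta_monthly))  # 50 1500 30
```

The 50x gap between full and delta reprocessing is why incremental sync matters more as the corpus grows, not less.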

Storage Cost vs. Coverage

Embedding every document in an enterprise corpus generates significant vector storage costs. Not all documents are worth embedding. Outdated drafts, duplicate files, and irrelevant archives consume storage and add noise to retrieval results. Teams that skip a filtering step before embedding often find that retrieval quality degrades as the corpus grows, because the vector index returns increasingly irrelevant matches from stale or duplicate content.
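A minimal filtering pass can run before embedding. This sketch catches exact duplicates by content hash and skips documents flagged as drafts or archived; the `status` field is an assumed convention, and near-duplicate detection (e.g., MinHash) would be a further step not shown here:

```python
import hashlib

def filter_corpus(docs: list[dict]) -> list[dict]:
    """Skip exact duplicates and documents flagged as drafts or archived."""
    seen, keep = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode()).hexdigest()
        if digest in seen or doc.get("status") in ("draft", "archived"):
            continue
        seen.add(digest)
        keep.append(doc)
    return keep

corpus = [
    {"text": "Expense policy v2", "status": "published"},
    {"text": "Expense policy v2", "status": "published"},  # exact duplicate
    {"text": "Expense policy v1", "status": "archived"},   # stale version
]
print(len(filter_corpus(corpus)))   # 1
```

Every document filtered here is one that never consumes vector storage and never competes with a better match at retrieval time.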

How Do You Build an Unstructured Data Pipeline That Works?

The pipeline stages (described above) are well understood individually. The engineering challenge is assembling them into a system that handles multiple sources, multiple formats, permissions, and freshness without becoming a maintenance burden. Most teams face a build-vs-buy decision early.

What to Build vs. What to Buy

Teams building pipelines often assemble parsing libraries (Apache Tika, PyMuPDF), chunking logic, embedding model integrations, vector database connections, and metadata and permission propagation. Each component can work independently, but integrating them into a reliable, multi-source, multi-format pipeline with freshness and permission handling is the data engineering work that AI engineers underestimate.

The build-vs-buy inflection point is format coverage. Handling PDFs and plain text is manageable. Adding DOCX, PPTX, HTML, images, and Slack exports across multiple enterprise sources with different authentication and permission models crosses into infrastructure that a platform tends to handle better than custom code.

What Airbyte's Agent Engine Provides

Airbyte's Agent Engine handles structured records and unstructured files in the same connection. The platform provides 600+ connectors to enterprise sources, automatic parsing and metadata extraction across file formats, chunking with embedding generation, delivery to vector databases (Pinecone, Weaviate, Milvus, Chroma) for RAG pipelines, and permission-aware access controls through the entire pipeline. Incremental sync with Change Data Capture (CDC) handles freshness without full reprocessing — the difference between a pipeline that works on day one and one that still works at scale six months later.

What's the Fastest Way to Make Unstructured Data Usable for AI Agents?

Every stage of the pipeline, from parsing through indexing, has tradeoffs that most teams discover iteratively in production. The teams that ship agents fastest are the ones that remove data plumbing from their critical path entirely.

Airbyte's Agent Engine handles the full pipeline so your team focuses on retrieval quality, tool design, and agent behavior instead of integration maintenance. PyAirbyte adds a programmatic, open-source interface for teams that need to configure and manage pipelines in code.

Connect with us to see how Airbyte turns enterprise documents and files into governed, agent-ready context.

You build the agent. We'll bring the data.

Authenticate once. Fetch, search, and write in real-time.

Try Agent Engine →


Frequently Asked Questions

What percentage of enterprise data is unstructured?

The commonly cited estimate is 80–90%, though the original 1998 methodology was never published. Regardless of the precise number, the operational implication is the same: any AI agent that only queries structured databases ignores the majority of an organization's knowledge. This is why unstructured data pipelines are a prerequisite for production-grade RAG.

What is the difference between unstructured and semi-structured data?

Semi-structured data has partial organization through tags or metadata (JSON, XML, email headers) but no rigid schema, which means it can be queried with known keys but not with SQL joins. Unstructured data has no predefined structure at all, so it requires parsing, chunking, and embedding before it becomes queryable. The distinction matters for pipeline design because each type enters the pipeline at a different stage with different tooling requirements.

Why is unstructured data important for RAG?

RAG grounds LLM responses in retrieved documents rather than relying on training data alone. Since most enterprise knowledge, including policies, contracts, meeting decisions, and support conversations, lives in unstructured formats, the quality of a RAG system depends directly on how well the pipeline handles these files. Without an unstructured data pipeline, agents cannot access organization-specific context.

What is the hardest part of building an unstructured data pipeline?

Permission propagation, because it fails silently. Parsing errors and chunking issues produce visibly bad output that teams catch in testing, but when access controls drop during embedding or indexing, the agent still returns fluent, correct-looking answers — it just surfaces content the querying user should not see. This makes permission failures the hardest to detect and the most consequential to miss, especially under SOC 2 or HIPAA audit.

Can structured and unstructured data be used together by AI agents?

Yes, and the strongest agent architectures combine both. A support agent resolving a billing dispute, for example, needs the structured transaction record from the payments database and the unstructured email thread where the customer described the issue. Routing the agent to only one data type produces either a factually correct but context-blind response or a well-contextualized response built on incomplete data.

