
Most teams pick an embedding model based on benchmark scores, then embed their data, and wonder why retrieval quality disappoints. The model gets the blame, but the real problem is usually everything upstream of it: parsing that mangles tables into flat text, chunks that split mid-paragraph, metadata that never makes it to the vector database.
The embedding model amplifies what it receives. Feed it broken input, and no amount of model swapping fixes retrieval.
TL;DR
- Embedding models convert data into dense numerical vectors where semantic similarity maps to mathematical proximity. They are the foundation of RAG and semantic search, but pipeline quality sets the ceiling on what any model can deliver.
- For AI agents, a missed retrieval produces a confident wrong answer, not just a less-relevant result. Model selection is high-stakes, but benchmark scores alone are a poor guide.
- Selection criteria that actually matter: retrieval accuracy on your data (not aggregate benchmarks), dimensions (often 512–1,024), token limit (512 to 32K), latency, and domain fit. Specialized domains often benefit from fine-tuning, which typically improves retrieval by 5–15%.
- The upstream pipeline (parsing, chunking, metadata) deserves more iteration than the model itself. Strong infrastructure should handle these steps and keep embeddings fresh.
What Are Embedding Models?
An embedding model is a machine learning model trained to convert input data (text, images, audio, code) into dense numerical vectors called embeddings. Inputs with similar meaning produce vectors that are mathematically close together, making semantic similarity searchable through vector distance. Every semantic search and RAG system depends on this conversion from meaning into math.
Traditional approaches miss this nuance entirely. Keyword matching treats "bank" as the same word whether it means a financial institution or a riverbank, and one-hot encoding produces sparse vectors with no learned relationships between terms. Embedding models capture context from training data, producing different vectors for the same word depending on its surrounding text.
The output vector's dimensions, typically 384 to 4,096 numbers, encode features the model learned during training. Each dimension doesn't correspond to a human-interpretable concept. The model discovered its own representation of meaning, organizing information along axes that capture patterns in language no one explicitly defined.
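"Mathematically close" almost always means cosine similarity in practice. A minimal sketch with toy 4-dimensional vectors (real models emit hundreds to thousands of dimensions, and these numbers are invented for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in [-1, 1]; higher means semantically closer."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: "river bank" and "shoreline" share a meaning axis;
# "ATM fee" points in a different direction.
river_bank = np.array([0.9, 0.1, 0.0, 0.2])
shoreline = np.array([0.8, 0.2, 0.1, 0.3])
atm_fee = np.array([0.1, 0.9, 0.8, 0.0])

print(cosine_similarity(river_bank, shoreline))  # high: related meanings
print(cosine_similarity(river_bank, atm_fee))    # low: unrelated meanings
```

Retrieval systems rank documents by exactly this score between the query vector and every indexed vector.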
What Types of Embedding Models Exist?
The data your agent reasons over determines which model type you need.
Text embedding models dominate the RAG and agent ecosystem because most enterprise knowledge exists as text: documents, messages, tickets, and wiki pages. Code embedding models are a growing category for developer-focused agents. Multimodal models that embed text and images in the same vector space allow cross-modal retrieval (searching images with text queries), but add complexity and compute cost.
Most agent applications start with text embedding models. The decision to expand to other types should follow from a concrete retrieval gap, not from a desire for architectural elegance.
How Do Embedding Models Work?
The mechanics of how embedding models train and run inference explain why two models performing "similarity search" can produce very different retrieval results.
The Training Phase
Modern text embedding models are transformers trained with contrastive learning. The model sees pairs of texts labeled as similar or dissimilar and learns to produce vectors that are close together for similar pairs and far apart for dissimilar pairs.
The training dataset and objective define what "similar" means for that model:
- A model trained on search query-document pairs learns retrieval-oriented similarity
- A model trained on paraphrase pairs learns semantic equivalence
- A model trained on Natural Language Inference (NLI) data learns logical relationships like entailment and contradiction
This distinction matters because swapping models changes retrieval behavior in unpredictable ways, even when both models claim to do "semantic search."
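The contrastive objective itself is compact. A sketch of a widely used InfoNCE-style loss with toy 2-dimensional vectors (real training operates on batches of model outputs; the temperature value here is illustrative):

```python
import numpy as np

def info_nce_loss(query: np.ndarray, positive: np.ndarray,
                  negatives: np.ndarray, temperature: float = 0.05) -> float:
    """Contrastive loss for one (query, positive, negatives) example.

    Low when the query is closer to its positive than to any negative;
    training pushes the model's vectors toward that arrangement.
    """
    def sim(a, b):  # cosine similarity
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    logits = np.array([sim(query, positive)] +
                      [sim(query, n) for n in negatives]) / temperature
    # Softmax cross-entropy with the positive at index 0.
    return float(-logits[0] + np.log(np.exp(logits).sum()))

q = np.array([1.0, 0.0])
relevant = np.array([0.9, 0.1])
irrelevant = np.array([0.0, 1.0])
# Low loss: the labeled positive really is close to the query.
print(info_nce_loss(q, relevant, np.array([irrelevant])))
# High loss: the labeled positive is far from the query.
print(info_nce_loss(q, irrelevant, np.array([relevant])))
```

Because the gradient only cares about which pairs the training data labels as positive, the dataset (queries vs. paraphrases vs. NLI pairs) fully determines what geometry the model learns.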
The Inference Phase
When you call the model, it tokenizes input text into subwords, processes them through transformer layers that capture contextual relationships between tokens, then pools the result into a single fixed-size vector. Mean pooling (averaging all token vectors) is a common choice for retrieval models, though some models pool from a dedicated [CLS] token instead. The result is a dense vector of N dimensions that represents the input's semantic content.
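Pooling is the one inference step worth seeing concretely. A minimal mean-pooling sketch, assuming the transformer has already produced per-token vectors and an attention mask (the shapes here are invented for illustration):

```python
import numpy as np

def mean_pool(token_vectors: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors into one fixed-size embedding, ignoring padding."""
    mask = attention_mask[:, None]               # (tokens, 1)
    summed = (token_vectors * mask).sum(axis=0)  # sum over real tokens only
    return summed / mask.sum()

# 5 token vectors of 8 dimensions each; the last 2 positions are padding.
tokens = np.random.rand(5, 8)
mask = np.array([1, 1, 1, 0, 0])
embedding = mean_pool(tokens, mask)
print(embedding.shape)  # (8,)
```

Masking matters: averaging padding vectors into the result would dilute the embedding with meaningless values.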
What Are the Embedding Model Tradeoffs Most Teams Underestimate?
Three tradeoffs consistently catch teams off guard after deployment. Each one is difficult to detect and expensive to fix retroactively.
Domain Mismatch Is the Silent Killer
A general-purpose embedding model trained on web text embeds "force majeure" near "military force" instead of near "contract clause." "HIPAA compliance" embeds near "hip replacement" instead of near "regulatory requirements." The model has never seen these terms used in their domain-specific context.
The impact on domain-heavy question answering is severe. The Stanford HAI and Patronus AI FinanceBench paper reports low baseline performance for finance-focused RAG-style evaluation sets and shows that substantial system work is often required to reach high accuracy.
Fine-tuning on domain data typically improves retrieval by 5–15%, but it requires labeled training pairs (often 5,000–10,000 examples) and adds pipeline complexity. Most teams discover domain mismatch after deploying to production, when users report that the agent's answers are "close but wrong." By that point, the cost of fixing it includes not just the fine-tuning itself but also reprocessing every document through the updated model.
Token Limits Truncate Silently
When input text exceeds a model's token limit, the behavior varies by implementation. OpenAI's API returns explicit errors for oversized inputs. Some local models silently return zero-vector embeddings when input exceeds their limits. Those zero-vectors provide no retrieval value and trigger no alerts.
A 1,200-token chunk fed to a 512-token model loses its final 688 tokens, which are often the conclusion or most specific information. This failure is invisible: the model returns a valid-looking embedding, just one that represents an incomplete version of the content. The fix is implementing pre-embedding token counting and aligning chunk size with model capacity, but most teams don't add that safeguard until after silent truncation has already polluted their vector store.
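A pre-embedding guard takes only a few lines. The sketch below uses a crude whitespace word count as a stand-in tokenizer; a real guard should swap in the tokenizer matched to your embedding model, since subword counts run meaningfully higher than word counts:

```python
def check_chunk(text: str, token_limit: int = 512,
                count_tokens=lambda t: len(t.split())) -> str:
    """Raise instead of letting an oversized chunk be silently truncated.

    count_tokens is a whitespace proxy here; in production, pass the
    tokenizer for your actual model.
    """
    n = count_tokens(text)
    if n > token_limit:
        raise ValueError(
            f"Chunk is {n} tokens but the model accepts {token_limit}; "
            f"re-chunk upstream instead of embedding a truncated version."
        )
    return text
```

Failing loudly at ingestion is the point: a rejected chunk gets re-chunked, while a silently truncated one pollutes the vector store indefinitely.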
Benchmark Scores Mislead
MTEB evaluates models across dozens of tasks: retrieval, classification, clustering, reranking. A model that tops the overall leaderboard may rank lower on retrieval specifically, which is the task that matters for RAG. NVIDIA's analysis of NV-Embed scores demonstrates how overall and retrieval scores can diverge within the same benchmark report.
Benchmark contamination compounds the problem. DataStax has documented cases of potential data leakage in MTEB evaluations. Evaluate on the retrieval subtask relevant to your use case, and validate against your own data. A model that ranks fifth on MTEB overall may outperform the leader on your specific corpus.
Why Does Your Data Pipeline Matter More Than Your Embedding Model?
Every upstream pipeline failure degrades embeddings in a specific, predictable way, and no model upgrade compensates for broken input.
Research on metadata-enriched chunking supports this priority order: enriched chunks achieved 82.5% retrieval precision versus 73.3% for content-only chunking, a 12.6% relative gain from pipeline improvements alone. This is why teams building RAG pipelines iterate more on chunking strategy than on model selection.
How Do You Build the Pipeline That Feeds the Model?
Production RAG pipelines require parsing, chunking, embedding, vector storage, metadata propagation, and permission enforcement to work as a single coordinated system. Teams typically choose between assembling these components individually or adopting integrated infrastructure.
Assembled Components vs. Integrated Infrastructure
Teams building custom pipelines assemble parsing libraries, chunking logic, embedding model integrations, vector database connections, and metadata/permission propagation. Each component works independently, but the integration points are where quality degrades.
The questions that reveal whether a pipeline is production-ready are all about component boundaries:
- Does the parser's output match the chunker's expected input format?
- Does the chunker produce chunks within the model's token limit?
- Do permissions survive through embedding and indexing into the vector database?
When embedding model configurations drift between ingestion and query time, the geometric relationship between query embeddings and indexed document embeddings breaks down. The system returns irrelevant results without any obvious error.
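One lightweight defense in an assembled pipeline is to fingerprint the embedding configuration at ingestion time and verify it before serving queries. A sketch, with illustrative field names rather than any standard schema:

```python
import hashlib
import json

def config_fingerprint(model_name: str, dimensions: int,
                       normalize: bool, chunk_tokens: int) -> str:
    """Hash the settings that must match between ingestion and query time."""
    payload = json.dumps(
        {"model": model_name, "dims": dimensions,
         "normalize": normalize, "chunk_tokens": chunk_tokens},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

# Store the fingerprint alongside the index at ingestion time...
index_fp = config_fingerprint("text-embedding-3-large", 1024, True, 512)

# ...and refuse to serve queries if the runtime config has drifted.
query_fp = config_fingerprint("text-embedding-3-large", 1024, True, 512)
assert query_fp == index_fp, "Embedding config drift: re-index before querying"
```

The assertion converts a silent relevance collapse into an explicit, debuggable failure.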
Integrated infrastructure closes these gaps differently: when a single system owns parsing, chunking, embedding, and indexing, configuration drift between ingestion and query time is eliminated by design rather than caught by monitoring.
What's the Fastest Way to Get Embedding Models Working in Production?
Start with a production-tested model, such as E5 or BGE for open-source, or OpenAI text-embedding-3-large for a managed API, and test it against your actual data. Invest more iteration in the upstream pipeline (parsing quality, chunk sizing, metadata preservation) than in model swapping. Well-parsed, properly chunked, metadata-enriched input produces strong retrieval from most modern embedding models.
Your Embedding Model Is Only as Good as the Pipeline Feeding It
That's the core insight of this entire piece. Airbyte's Agent Engine handles the pipeline from source to vector database as integrated infrastructure, including 600+ governed connectors, parsing, auto-chunked embedding aligned with model token limits, permission preservation, and incremental sync through CDC. Your team iterates on retrieval quality, not data plumbing.
Get a demo to see how Agent Engine delivers enterprise data to your embedding models with governed, agent-ready context.
Frequently Asked Questions
What is the difference between an embedding and an embedding model?
An embedding is the output: a dense numerical vector representing a piece of data. An embedding model is the ML model that produces it, analogous to the relationship between a photograph and a camera. Different embedding models produce different embeddings for the same input, with varying quality and characteristics.
Which embedding model is best for RAG?
The best model depends on your data, deployment constraints, and whether you need multilingual support. OpenAI text-embedding-3-large, E5-large-v2, and Snowflake Arctic-Embed-L-v2.0 are strong starting points. Run Recall@5 on a representative sample of your own queries and documents before committing.
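Recall@5 is simple enough to compute inline. A sketch for a single query; in practice you would average the score over a representative set of queries with known relevant documents:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# One query: 2 of the 3 relevant docs appear in the top 5 results.
score = recall_at_k(["d7", "d2", "d9", "d1", "d4", "d3"],
                    {"d1", "d2", "d3"}, k=5)
print(score)  # 2 of 3 relevant docs retrieved
```

Run the same query set against each candidate model's index and compare averages; the differences on your corpus often disagree with leaderboard rankings.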
Do embedding models need to be fine-tuned?
General-purpose models work well for broad content. Domain-specific corpora (legal, medical, financial) often contain terminology that general models misplace in vector space. Start with a general model and fine-tune only if retrieval evaluation confirms consistent domain-specific gaps.
How do token limits affect embedding quality?
When input text exceeds a model's token limit (ranging from 512 to 32,000 tokens depending on the model), the excess is truncated or the API returns an error. Align your chunking strategy with your model's token limit and add token counting before embedding to catch oversized inputs.
How often do embeddings need to be updated?
Update embeddings whenever the source content changes. Delta processing using Change Data Capture (CDC) re-embeds only changed documents, avoiding the expense of full reprocessing while keeping retrieval current.
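True CDC reads the source system's change log; a content-hash comparison is a simpler stand-in that illustrates the same delta idea:

```python
import hashlib

def docs_to_reembed(current: dict[str, str],
                    indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs whose content changed (or is new) since the last embed run."""
    changed = []
    for doc_id, text in current.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        if indexed_hashes.get(doc_id) != h:
            changed.append(doc_id)
    return changed

indexed = {"a": hashlib.sha256(b"old policy").hexdigest()}
current = {"a": "new policy", "b": "fresh doc"}
print(docs_to_reembed(current, indexed))  # ['a', 'b']
```

Only the returned IDs go back through the embedding model, which is the difference between re-embedding a handful of documents and reprocessing the whole corpus.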
Try the Agent Engine
We're building the future of agent data infrastructure. Be among the first to explore our new platform and get access to our latest features.
