Use the Embeddings API when you want retrieval, semantic search, ranking, or a RAG pipeline.

Typical retrieval flow

  1. Chunk the source material. Split documents into chunks that are small enough for retrieval but still semantically coherent.
  2. Embed the chunks. Generate vectors for each chunk using one embedding model.
  3. Store vectors and metadata. Keep the embeddings together with metadata such as document ID, title, or section so you can filter and cite later.
  4. Embed the user query. Use the same embedding model for the query that you used for the indexed chunks.
  5. Retrieve nearest chunks. Search for the closest matches in vector space.
  6. Filter or rerank if needed. Narrow the candidate set before generation when your pipeline needs better precision.
  7. Pass the final context into generation. Send the selected context to your generation step, usually through the Responses API.
Use the Responses API for the generation step and the Embeddings API for the retrieval step.
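The flow above can be sketched end to end. This is a minimal illustration, not a real pipeline: `embed` here is a stand-in bag-of-words embedder standing in for calls to an embedding model, and the chunk texts, IDs, and query are invented. The point it demonstrates is structural: one embedder for both documents and queries, metadata stored alongside each vector, and a small candidate set handed to generation.

```python
import math

def embed(text: str, vocab: list[str]) -> list[float]:
    # Stand-in embedder: bag-of-words over a fixed vocabulary, unit-normalized.
    # A real pipeline would call its embedding model here, using the same
    # model for documents and queries (steps 2 and 4).
    tokens = text.lower().split()
    vec = [float(tokens.count(term)) for term in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Steps 1 and 3: chunked source material stored with metadata for filtering and citing.
chunks = [
    {"doc_id": "guide", "section": "setup",  "text": "install the client and set your api key"},
    {"doc_id": "guide", "section": "usage",  "text": "call the embeddings endpoint to get vectors"},
    {"doc_id": "faq",   "section": "limits", "text": "requests are rate limited per minute"},
]
vocab = sorted({token for chunk in chunks for token in chunk["text"].split()})

# Step 2: embed every chunk with the one embedding model.
index = [{**chunk, "vector": embed(chunk["text"], vocab)} for chunk in chunks]

# Steps 4 and 5: embed the query the same way, then rank by similarity.
query_vector = embed("how do i get vectors for my text", vocab)
ranked = sorted(index, key=lambda row: cosine(query_vector, row["vector"]), reverse=True)

# Steps 6 and 7: keep a small candidate set and hand it to the generation step.
candidates = ranked[:2]
context = "\n".join(f"[{row['doc_id']}/{row['section']}] {row['text']}" for row in candidates)
```

The `[doc_id/section]` prefixes show why step 3 matters: because metadata travels with each vector, the generation step can cite its sources.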

Good defaults

  • Keep chunks semantically coherent instead of embedding entire long documents.
  • Store metadata with each vector so you can filter, cite, or deduplicate later.
  • Keep a stable embedding model per index version.
  • Retrieve a small candidate set before you generate the final answer.
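As an illustration of the first default, one simple way to keep chunks semantically coherent is to split on paragraph boundaries and pack consecutive paragraphs up to a size cap, rather than cutting mid-thought. The character threshold here is an arbitrary placeholder; tune it to your retrieval goals.

```python
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    # Split on blank lines so each chunk stays a coherent unit, then pack
    # consecutive paragraphs together until the size cap is reached.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
pieces = chunk_by_paragraph(doc, max_chars=40)
```

With a 40-character cap, the first two paragraphs fit in one chunk and the third starts a new one; no paragraph is ever split in half.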

Common pitfalls

  • Mixing embeddings from different models in one index.
  • Not versioning your embedding model choice.
  • Embedding documents with one model and queries with another, incompatible one.
  • Indexing chunks that are too large or too small for your retrieval goals.
  • Skipping evaluation, so retrieval quality degrades unnoticed.
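One way to guard against the first three pitfalls is to record the embedding model as part of the index and reject writes from any other model. The class and field names below are illustrative, not from any particular vector store:

```python
class VectorIndex:
    """Toy index that pins one embedding model per index version."""

    def __init__(self, embedding_model: str, version: int = 1):
        self.embedding_model = embedding_model
        self.version = version
        self.rows: list[dict] = []

    def add(self, vector: list[float], metadata: dict, embedding_model: str) -> None:
        # Guard: never mix embeddings from different models in one index.
        if embedding_model != self.embedding_model:
            raise ValueError(
                f"index v{self.version} expects {self.embedding_model!r}, "
                f"got {embedding_model!r}; re-embed the corpus instead"
            )
        self.rows.append({"vector": vector, **metadata})

index = VectorIndex(embedding_model="embed-model-a", version=1)
index.add([0.1, 0.9], {"doc_id": "guide"}, embedding_model="embed-model-a")

# A write from a different model is refused rather than silently mixed in.
try:
    index.add([0.5, 0.5], {"doc_id": "faq"}, embedding_model="embed-model-b")
except ValueError as exc:
    rejection = str(exc)
```

The same check can run on the query path, so a query embedded with the wrong model fails loudly instead of returning quietly bad matches.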

Retrieval checklist

  • Choose one embedding model for both documents and queries.
  • Re-embed the corpus when you change embedding models.
  • Measure retrieval quality on a small set of known-good queries.
  • Keep generation and retrieval concerns separate.
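The "measure retrieval quality" item can start as simply as recall@k over a handful of known-good query-to-chunk pairs. The evaluation data below is invented for illustration; the metric itself is standard:

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 3) -> float:
    # Fraction of queries whose known-good chunk appears in the top-k results.
    hits = sum(1 for query, gold in expected.items() if gold in results[query][:k])
    return hits / len(expected)

# Known-good queries mapped to the chunk ID each one should retrieve.
expected = {
    "how do I authenticate": "setup-2",
    "what are the rate limits": "limits-1",
}

# Top-k chunk IDs the retriever actually returned for each query.
results = {
    "how do I authenticate": ["setup-2", "setup-1", "faq-3"],
    "what are the rate limits": ["usage-4", "faq-1", "intro-1"],
}

score = recall_at_k(results, expected, k=3)  # one of the two queries found its chunk
```

Rerunning this small suite after any change to chunking or embedding model catches silent degradation, the last pitfall listed above.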