Multimodal RAG enables you to work with documents containing various content types, including text and images. This approach leverages multimodal LLMs to enhance the capabilities of RAG Assistants.

Design Considerations

Here are the key considerations for designing your assistant:

  • Parse images, text, and tables from documents.
  • Use multimodal embeddings, or use text embeddings together with a strategy to interpret images and convert everything to text.
  • Use a Vector Store with multimodality support.
  • Retrieve both images and text using similarity search, or retrieve the text representation of all the multimodal content.
  • Pass raw images and text chunks to a multimodal LLM, or only text chunks to a text-based LLM, for answer synthesis (a conceptual sketch follows this list).
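
The bullets above outline a pipeline rather than an API. The following Python sketch is only a conceptual, self-contained illustration of that pipeline (parse, embed, store, retrieve, synthesize); the chunk contents, the toy in-memory store, and the fake embedding function are made up for the example and are not part of the product.

```python
# Self-contained sketch of the flow above; embedding, parsing, and generation
# are stubbed out. In a real assistant these steps are handled by the
# embeddings model, the ingestion provider, and the (multimodal) LLM.
import math

def fake_embed(content: str, kind: str) -> list:
    # Stand-in for a multimodal embeddings model: maps text or an image
    # reference to a fixed-size vector (real models return e.g. 1024 dims).
    seed = sum(ord(c) for c in f"{kind}:{content}")
    return [math.sin(seed + i) for i in range(8)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# 1. Parse images, text, and tables from documents (already parsed here).
chunks = [
    {"type": "text", "content": "Quarterly revenue grew 12%."},
    {"type": "image", "content": "chart_q3_revenue.png"},
]

# 2-3. Embed every chunk and keep it in a toy in-memory "Vector Store".
store = [(fake_embed(c["content"], c["type"]), c) for c in chunks]

# 4. Retrieve both images and text by similarity to the question.
question = "How did revenue evolve last quarter?"
q_vec = fake_embed(question, "text")
hits = sorted(store, key=lambda item: cosine(q_vec, item[0]), reverse=True)[:2]

# 5. Pass raw images and text chunks to a multimodal LLM for answer synthesis
#    (here we only print what would be sent).
for _, chunk in hits:
    print(chunk["type"], "->", chunk["content"])
```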

Configuration Options

You can configure your RAG assistant in two main ways: text only or multimodal.

Text only:

  • Use a multimodal LLM to produce text summaries from images and complex tables
  • Use text embeddings to embed and retrieve text
  • Retrieve text using similarity search
  • Pass text chunks to an LLM for answer synthesis

Multimodal:

  • Use multimodal embeddings to embed images and text
  • Retrieve images and text using similarity search
  • Pass raw images and text chunks to a multimodal LLM for answer synthesis

The default configuration is text only; check the ingestion configuration parameters so that they match the type of documents you ingest.

Multimodal

The following configuration is necessary if you want to use multimodal embeddings and a multimodal LLM.

  • Select a multimodal embeddings model (such as amazon.titan-embed-image-v1 or cohere.embed-english-v3).
  • Set the dimensions parameter accordingly; the sample models above use 1024 dimensions.
  • Set the mode parameter to multimodal; otherwise, everything will be treated as text.
  • Select an LLM with vision support (such as openai/gpt-4o, openai/gpt-4o-mini, or vertex_ai/gemini-1.5-flash) for the LLM section.
  • Check the Ingestion Provider configuration and make sure the strategy parameter is set to hi_res so that all image content is kept as a base64 representation (see the configuration sketch below).
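
Put together, the bullets above map roughly onto a configuration like the one below, expressed here as a Python dictionary. The field names and nesting are assumptions for illustration only; check the product's RAG Assistant and Ingestion Provider configuration reference for the exact schema.

```python
# Illustrative only: field names and nesting are assumptions, not the exact schema.
multimodal_rag_config = {
    "embeddings": {
        "modelName": "amazon.titan-embed-image-v1",  # or cohere.embed-english-v3
        "dimensions": 1024,          # must match the chosen embeddings model
        "mode": "multimodal",        # otherwise everything is treated as text
    },
    "llm": {
        "modelName": "openai/gpt-4o",  # an LLM with vision support
    },
    "ingestion": {
        "strategy": "hi_res",        # keep image content as a base64 representation
    },
}
```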

When using this mode with Cohere, you need to set the multimodalBatchSize parameter in the Profile Metadata so that images are ingested one at a time.
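
For the Cohere case, the Profile Metadata entry might look like the following; the exact key placement depends on the product's metadata format, so treat this as a sketch.

```python
# Sketch only: ingest one image per batch when using Cohere multimodal embeddings.
profile_metadata = {
    "multimodalBatchSize": 1,
}
```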

Samples

Assume you have a RAG Assistant configured as described above. When answering questions, the Vector Store retrieves a mixture of text and image content relevant to the query.
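
One way to exercise such an assistant from code is through an OpenAI-compatible chat endpoint, which the sketch below assumes is available; the base URL, API token, and assistant identifier are placeholders to replace with your own instance values.

```python
# Sketch: querying a multimodal RAG Assistant through an assumed
# OpenAI-compatible endpoint. Base URL, token, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-instance.example.com/chat/api",  # placeholder endpoint
    api_key="YOUR_API_TOKEN",                                # placeholder token
)

response = client.chat.completions.create(
    model="saia:assistant:my-multimodal-rag",  # placeholder assistant identifier
    messages=[
        {"role": "user", "content": "Summarize the chart on page 3 of the report."},
    ],
)

# The question is sent as text; behind the scenes the Vector Store may have
# retrieved both text chunks and images to ground the answer.
print(response.choices[0].message.content)
```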

Considerations

  • Keep image resolution as low as possible; a resolution of 200 to 250 DPI is enough. In particular, when an LLM needs to interpret an image, do not use files larger than 5 MB (5242880 bytes); see the sketch after this list.
  • When using the text-only option, you may need to customize the prompt used to interpret raw images.
  • The multimodal case means that text and images are handled natively by the embeddings and LLM models; other multimodal content, such as audio or video, is not currently supported.
  • Some models do not report cost usage.
  • The conversation history of a RAG Assistant does not consider previous images.
  • The assistant receives the question in text form.
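
As a companion to the image-size consideration above, a small pre-ingestion check like the following can keep images under the 5 MB limit. It uses the Pillow library; the paths, target size, and JPEG quality are example values, not product requirements.

```python
# Sketch: downscale an image before ingestion so it stays under the 5 MB limit.
# Requires Pillow (pip install pillow); paths and quality values are examples.
import os
from PIL import Image

MAX_BYTES = 5 * 1024 * 1024  # 5 MB (5242880 bytes)

def shrink_if_needed(path: str, out_path: str, max_side: int = 2000) -> str:
    """Return a path to an image that fits within MAX_BYTES."""
    if os.path.getsize(path) <= MAX_BYTES:
        return path  # already small enough
    with Image.open(path) as img:
        img.thumbnail((max_side, max_side))  # downscale, keeping aspect ratio
        img.convert("RGB").save(out_path, "JPEG", quality=85, dpi=(250, 250))
    return out_path

print(shrink_if_needed("diagram.png", "diagram_small.jpg"))
```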