Retrieve and Rerank

In retrieval systems, rerankers add an extra layer of precision to ensure that only the most relevant information reaches the model. After the initial retrieval (where an embedding model and vector database pull a broad set of potentially useful document chunks), rerankers refine this set by re-evaluating the results with a more sophisticated model.

This step sharpens the relevance of the selected documents, so the generative model receives only high-quality, context-rich input. This is particularly important in Retrieval-Augmented Generation (RAG), where reranking reorders the initially retrieved chunks by their relevance to the input query.

When many documents from the vector store are passed to the LLM, the final answer can draw on irrelevant documents, making it less accurate and sometimes incorrect. Passing irrelevant documents also increases cost. There are therefore two reasons to use reranking: accuracy and cost.

This process involves rescoring the retrieved documents using a more sophisticated model, such as a cross-encoder, to better capture the semantic similarity between the query and the documents. The reranked list of document chunks is then used as input for the generation model, ensuring that the most relevant and accurate information is used to generate the final output.

Rerankers, while more accurate, are significantly slower than vector similarity calculations. For this reason, rerankers are often used as a second pass after the vector similarity has been computed. This two-stage retrieval approach combines the speed of vector search with the precision of rerankers.
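
The two-stage approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory index, the hand-written vectors, and the word-overlap `toy_score` function are hypothetical stand-ins for a real vector database and a real cross-encoder, and none of the function names come from the product.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_n):
    """Stage 1: cheap vector similarity over the whole index."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return ranked[:top_n]

def rerank(query, candidates, score_fn, k):
    """Stage 2: expensive scoring, applied only to the small candidate set."""
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

def toy_score(query, text):
    # Hypothetical stand-in for a cross-encoder:
    # the fraction of query words that also appear in the text.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

# Hypothetical index with made-up two-dimensional embeddings.
index = [
    {"text": "Carson City is the capital of Nevada", "vector": [0.9, 0.1]},
    {"text": "Washington is the capital of the United States", "vector": [0.8, 0.3]},
    {"text": "Bananas are rich in potassium", "vector": [0.0, 1.0]},
]

query = "capital of the United States"
candidates = retrieve([1.0, 0.2], index, top_n=2)  # broad, fast first pass
top = rerank(query, candidates, toy_score, k=1)    # precise, slower second pass
```

In this toy data the vector search ranks the Nevada chunk first, and the second pass promotes the Washington chunk because it matches every word of the query.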

Key Components of Reranking

When a user asks a question, the system first retrieves a set of potentially relevant pieces of information (the initial candidate document chunks) using traditional retrieval methods such as keyword matching or vector similarity.
After this initial retrieval, the rerank step takes this set of results and applies more sophisticated algorithms to reorder them.

The goal is to bring the most relevant results to the top of the list, improving the overall quality and usefulness of the context passed to the model.

Configuration

Reranking is configured in the rerank section of the Profile Metadata as follows:

{
  "chat": {
    "rerank": {
      "provider": "string",
      "modelName": "validModelName",
      "k": number, // optional, defaults to Chunk Count
      "relevanceScore": [0..1] // optional, defaults to 0.
    }
  }
}
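
As a rough aid, the constraints in the schema above can be checked before a profile is saved. The sketch below is an assumption-laden illustration, not part of the product: `validate_rerank_config` is a hypothetical helper, `chunk_count` stands for the profile's Chunk Count value, and treating `k` as a positive integer is an assumption that the schema does not state.

```python
def validate_rerank_config(profile_metadata, chunk_count):
    """Return the rerank settings with defaults applied, or None if absent.

    Hypothetical helper: it mirrors the schema shown above and is not an
    API of the platform.
    """
    rerank = profile_metadata.get("chat", {}).get("rerank")
    if rerank is None:
        return None  # reranking is not configured

    for key in ("provider", "modelName"):
        value = rerank.get(key)
        if not isinstance(value, str) or not value:
            raise ValueError(f"rerank.{key} must be a non-empty string")

    # "k" is optional and defaults to Chunk Count.
    k = rerank.get("k", chunk_count)
    if isinstance(k, bool) or not isinstance(k, int) or k < 1:
        # Assumption: k must be a positive integer.
        raise ValueError("rerank.k must be a positive integer")

    # "relevanceScore" is optional, defaults to 0, and must lie in [0, 1].
    score = rerank.get("relevanceScore", 0)
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 1:
        raise ValueError("rerank.relevanceScore must be a number between 0 and 1")

    return {
        "provider": rerank["provider"],
        "modelName": rerank["modelName"],
        "k": k,
        "relevanceScore": float(score),
    }
```

Calling it with only `provider` and `modelName` set returns the two optional fields filled with their documented defaults.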

Check the available options.

Sample

The following configuration uses the awsbedrock/amazon.rerank-v1 model to keep the top 3 elements, with a rerank relevance score of 0.3.

{
  "chat": {
    "rerank": {
      "provider": "awsbedrock",
      "modelName": "amazon.rerank-v1",
      "k": 3,
      "relevanceScore": 0.3
    }
  }
}

Suppose the rerank request includes a query and 4 chunks.

rerank query

{
    "query": "What is the Capital of the United States?",
    "documents": [
        "Carson City is the capital city of the American state of Nevada.",
        "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean. Its capital is Saipan.",
        "Washington, D.C. is the capital of the United States.",
        "Capital punishment has existed in the United States since before it was a country."
    ]
}

rerank response

{
    "results": [
        {
            "index": 2,
            "relevance_score": 0.890294253826141
        },
        {
            "index": 0,
            "relevance_score": 0.000487857119878754
        },
        {
            "index": 3,
            "relevance_score": 5.10419085912872E-05
        }
    ]
}

In the response above, index 2 (Washington, D.C. is the capital of the United States.) has the highest relevance score, so it is the chunk closest to the original query: What is the Capital of the United States? It is also the only result whose score exceeds 0.3.
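
To make the relationship between the response and the configuration concrete, the sketch below maps each `index` back to its document and then applies `k` and `relevanceScore`. It is a minimal illustration: `select_chunks` is a hypothetical function, and keeping results whose score is greater than or equal to the threshold is an assumption about how the filter behaves.

```python
def select_chunks(documents, results, k, relevance_score=0.0):
    """Return the documents kept after reranking, best first.

    Assumption: a result is kept when its score is >= relevance_score.
    """
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    kept = [r for r in ranked[:k] if r["relevance_score"] >= relevance_score]
    return [documents[r["index"]] for r in kept]

documents = [
    "Carson City is the capital city of the American state of Nevada.",
    "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean. Its capital is Saipan.",
    "Washington, D.C. is the capital of the United States.",
    "Capital punishment has existed in the United States since before it was a country.",
]
results = [
    {"index": 2, "relevance_score": 0.890294253826141},
    {"index": 0, "relevance_score": 0.000487857119878754},
    {"index": 3, "relevance_score": 5.10419085912872E-05},
]

selected = select_chunks(documents, results, k=3, relevance_score=0.3)
```

With a threshold of 0.3, `selected` contains only the Washington, D.C. chunk. With the default threshold of 0, all three returned chunks are kept in score order.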

Considerations

Note that reranking isn't without trade-offs. While it can significantly improve relevancy, it also introduces latency and can impact Time to First Token (TTFT). Therefore, it's crucial to consider whether increased relevancy outweighs the need for faster response times in your specific use case.
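
One way to quantify the latency side of this trade-off is to time the rerank call separately from the rest of the pipeline. The helper below is a generic sketch: `timed` is a hypothetical wrapper, and `fake_rerank` only simulates a slow reranker with a short sleep.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_rerank(chunks):
    # Hypothetical stand-in for a real rerank call; the sleep simulates latency.
    time.sleep(0.01)
    return sorted(chunks)

result, seconds = timed(fake_rerank, ["b", "a"])
print(f"rerank added {seconds * 1000:.1f} ms before generation could start")
```

Because reranking runs before generation starts, the measured time contributes directly to TTFT.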

It is essential to validate the selected approach through robust evaluations. Reranking should be implemented thoughtfully, with careful testing to ensure it truly enhances system performance. These evaluations will help determine whether the additional complexity and computational cost of reranking yield meaningful improvements in relevance and accuracy.
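
A simple evaluation compares a ranking metric with and without reranking on a set of labeled queries. The sketch below uses Mean Reciprocal Rank (MRR) as one possible metric. The document IDs and relevance labels are invented toy data for illustration only, not measured results.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / position of the first relevant document, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Toy data: each pair is (ranking returned for a query, set of relevant IDs).
without_rerank = [(["d3", "d1", "d2"], {"d1"}), (["d5", "d6", "d4"], {"d4"})]
with_rerank = [(["d1", "d3", "d2"], {"d1"}), (["d4", "d5", "d6"], {"d4"})]

mrr_before = mean_reciprocal_rank(without_rerank)
mrr_after = mean_reciprocal_rank(with_rerank)
```

Pairing a relevance metric like this with the added latency gives a concrete basis for deciding whether reranking is worth enabling for a given use case.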
