This guide explains the resources available in the Globant Enterprise AI Evaluation Notebooks and how to use them effectively to evaluate your AI Assistants.
Prepare a dataset that contains the test cases for your AI Assistant. Each row in the dataset represents a test case, defining:
- The input question for the assistant.
- The expected output based on predefined criteria.
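For orientation, a single test case could look something like the sketch below. The field names are illustrative assumptions rather than the official DataSet API row schema, which is documented in the DataSetAPI.ipynb notebook.

```python
# Illustrative sketch only: "input" and "expectedOutput" are assumed field
# names, not the official DataSet API row schema.
test_case = {
    "input": "What is the refund policy for annual subscriptions?",           # question sent to the assistant
    "expectedOutput": "Annual subscriptions are refundable within 30 days.",  # criteria for the expected answer
}

dataset = {
    "name": "refund-policy-regression",  # a human-readable dataset name
    "rows": [test_case],                 # one entry per test case
}
```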
You can create a dataset using the DataSet API in two ways:
- Create a dataset from scratch: Work through the "Working with Datasets" section of the DataSetAPI.ipynb notebook. This section provides examples and code snippets for creating a new dataset using API endpoints and then populating it with rows.
- Upload a complete dataset from a JSON file: If you already have your dataset in a JSON file, you can directly upload it. Work through the "Uploading Data via Files" section of the DataSetAPI.ipynb notebook. This section explains how to use the POST /dataSetApi/dataSet/FileUpload endpoint to create a dataset from a JSON file.
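For reference, here is a minimal upload sketch using the requests library. The endpoint path is the one named above; the base URL, authentication header, and multipart field name are placeholders you will need to adapt to your environment, as shown in the notebook.

```python
import requests

# Placeholders: adjust the base URL and authentication header to your
# Globant Enterprise AI instance; the multipart field name is an assumption.
BASE_URL = "https://<your-globant-enterprise-ai-host>"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

with open("dataset.json", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/dataSetApi/dataSet/FileUpload",
        headers=HEADERS,
        files={"file": ("dataset.json", f, "application/json")},
    )

response.raise_for_status()
print(response.json())  # details of the newly created dataset
```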
If your dataset is in CSV format, you'll first need to convert it to JSON. The CSVtoJSONConversion.ipynb notebook provides a practical example and code to guide you through this conversion process.
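If you only need a straightforward conversion, a minimal standard-library sketch looks like this; the file names and column layout are assumptions, so align them with the example in CSVtoJSONConversion.ipynb.

```python
import csv
import json

# Read each CSV row as a dict keyed by the header row, then write the
# whole list out as a JSON array. File names here are placeholders.
with open("test_cases.csv", newline="", encoding="utf-8") as csv_file:
    rows = list(csv.DictReader(csv_file))

with open("test_cases.json", "w", encoding="utf-8") as json_file:
    json.dump(rows, json_file, indent=2, ensure_ascii=False)

print(f"Converted {len(rows)} rows to test_cases.json")
```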
Once you have created your dataset, the DataSetAPI.ipynb notebook also provides examples for managing your dataset, including:
- Retrieving, updating, and deleting datasets.
- Adding, modifying, and removing rows within a dataset.
- Managing expected sources and filter variables associated with dataset rows.
- Uploading dataset rows via file uploads.
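As a rough illustration of row management, the sketch below adds a single row to an existing dataset. The endpoint path and payload fields are hypothetical; the DataSetAPI.ipynb notebook documents the actual endpoints and request bodies.

```python
import requests

BASE_URL = "https://<your-globant-enterprise-ai-host>"  # placeholder
HEADERS = {"Authorization": "Bearer <your-api-token>"}  # placeholder

dataset_id = "<dataset-id>"
new_row = {
    "input": "How do I reset my password?",                                # hypothetical field names
    "expectedOutput": "Use the 'Forgot password' link on the sign-in page.",
}

# Hypothetical endpoint path -- refer to DataSetAPI.ipynb for the real one.
response = requests.post(
    f"{BASE_URL}/dataSetApi/dataSet/{dataset_id}/row",
    headers=HEADERS,
    json=new_row,
)
response.raise_for_status()
```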
Use the Evaluation Plan API to create an evaluation plan for your AI Assistant, specifying the following:
- The AI Assistant to be tested.
- The dataset that will be used for testing.
- The metrics that will be applied to assess the assistant's performance.
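To make the shape of a plan concrete, an illustrative definition might look like the following; every field name and metric name here is an assumption, and the EvaluationPlanAPI.ipynb notebook shows the actual request format.

```python
# Illustrative sketch only: field and metric names are assumptions,
# not the official Evaluation Plan API schema.
evaluation_plan = {
    "name": "refund-policy-plan",
    "assistant": "<your-assistant-name>",   # the AI Assistant to be tested
    "dataSetId": "<dataset-id>",            # the dataset holding the test cases
    "systemMetrics": [                      # metrics and their relative weights
        {"name": "<metric-name-1>", "weight": 0.5},
        {"name": "<metric-name-2>", "weight": 0.5},
    ],
}
```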
You can achieve this by working through the EvaluationPlanAPI.ipynb notebook, which provides examples and code snippets to:
- Create, retrieve, update, and delete evaluation plans.
- Associate system metrics with your evaluation plans and manage their weights.
- Retrieve available system metrics and their details.
- Execute a defined evaluation plan.
Run the evaluation plan to initiate the testing process. The evaluation engine will:
- Instantiate the assistant for each row in the dataset.
- Capture the assistant's response.
- Apply the defined metrics to compare the actual results with the expected outputs.
You can execute an evaluation plan using the POST /evaluationPlanApi/evaluationPlan/{evaluationPlanId} endpoint of the Evaluation Plan API. Refer to the EvaluationPlanAPI.ipynb notebook for a practical example.
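A minimal execution sketch with the requests library is shown below; the endpoint path is the one given above, while the base URL, authentication header, and plan identifier are placeholders.

```python
import requests

BASE_URL = "https://<your-globant-enterprise-ai-host>"  # placeholder
HEADERS = {"Authorization": "Bearer <your-api-token>"}  # placeholder

evaluation_plan_id = "<evaluation-plan-id>"

# Trigger an execution of the evaluation plan.
response = requests.post(
    f"{BASE_URL}/evaluationPlanApi/evaluationPlan/{evaluation_plan_id}",
    headers=HEADERS,
)
response.raise_for_status()
print(response.json())  # see EvaluationPlanAPI.ipynb for the response format
```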
Use the GET /evaluationResultApi/evaluationResult/{evaluationResultId} endpoint of the Evaluation Result API to retrieve the results of the executed evaluation plan. The results will include:
- The assistant's responses for each test case.
- The computed metric scores based on the expected vs. actual outputs.
The EvaluationResultAPI.ipynb notebook provides examples and code snippets for retrieving evaluation results.
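A matching retrieval sketch is shown below; again, the endpoint path comes from the section above and the remaining values are placeholders.

```python
import requests

BASE_URL = "https://<your-globant-enterprise-ai-host>"  # placeholder
HEADERS = {"Authorization": "Bearer <your-api-token>"}  # placeholder

evaluation_result_id = "<evaluation-result-id>"

# Fetch the stored results of a completed evaluation run.
response = requests.get(
    f"{BASE_URL}/evaluationResultApi/evaluationResult/{evaluation_result_id}",
    headers=HEADERS,
)
response.raise_for_status()
results = response.json()  # per-test-case responses and metric scores
```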
For a basic walkthrough of using the DataSet, Evaluation Plan, and Evaluation Result APIs together, refer to the EvaluationAPITutorial.ipynb notebook.