This guide explains the resources available in the Globant Enterprise AI Evaluation Notebooks and how to use them effectively to evaluate your AI Assistants.
Prepare a dataset that contains the test cases for your AI Assistant. Each row in the dataset represents a test case, defining:
- The input question for the assistant.
- The expected output based on predefined criteria.
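For orientation, a single test case could look something like the sketch below. The field names are illustrative assumptions rather than the official DataSet API row schema, which is documented in the DataSetAPI.ipynb notebook.

```python
# Illustrative sketch only: "input" and "expectedOutput" are assumed field
# names, not the official DataSet API row schema.
test_case = {
    "input": "What is the refund policy for annual subscriptions?",           # question sent to the assistant
    "expectedOutput": "Annual subscriptions are refundable within 30 days.",  # criteria for the expected answer
}

dataset = {
    "name": "refund-policy-regression",  # a human-readable dataset name
    "rows": [test_case],                 # one entry per test case
}
```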
You can create a dataset using the DataSet API in two ways:
- Create a dataset from scratch: Work through the "Working with Datasets" section of the DataSetAPI.ipynb notebook. This section provides examples and code snippets for creating a new dataset using API endpoints and then populating it with rows.
- Upload a complete dataset from a JSON file: If you already have your dataset in a JSON file, you can directly upload it. Work through the "Uploading Data via Files" section of the DataSetAPI.ipynb notebook. This section explains how to use the POST /dataSetApi/dataSet/FileUpload endpoint to create a dataset from a JSON file.
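For reference, here is a minimal upload sketch using the requests library. The endpoint path is the one named above; the base URL, authentication header, and multipart field name are placeholders you will need to adapt to your environment, as shown in the notebook.

```python
import requests

# Placeholders: adjust the base URL and authentication header to your
# Globant Enterprise AI instance; the multipart field name is an assumption.
BASE_URL = "https://<your-globant-enterprise-ai-host>"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

with open("dataset.json", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/dataSetApi/dataSet/FileUpload",
        headers=HEADERS,
        files={"file": ("dataset.json", f, "application/json")},
    )

response.raise_for_status()
print(response.json())  # details of the newly created dataset
```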
If your dataset is in CSV format, you'll first need to convert it to JSON. The CSVtoJSONConversion.ipynb notebook provides a practical example and code to guide you through this conversion process.
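If you only need a straightforward conversion, a minimal standard-library sketch looks like this; the file names and column layout are assumptions, so align them with the example in CSVtoJSONConversion.ipynb.

```python
import csv
import json

# Read each CSV row as a dict keyed by the header row, then write the
# whole list out as a JSON array. File names here are placeholders.
with open("test_cases.csv", newline="", encoding="utf-8") as csv_file:
    rows = list(csv.DictReader(csv_file))

with open("test_cases.json", "w", encoding="utf-8") as json_file:
    json.dump(rows, json_file, indent=2, ensure_ascii=False)

print(f"Converted {len(rows)} rows to test_cases.json")
```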
Once you have created your dataset, the DataSetAPI.ipynb notebook also provides examples for managing your dataset, including:
- Retrieving, updating, and deleting datasets.
- Adding, modifying, and removing rows within a dataset.
- Managing expected sources and filter variables associated with dataset rows.
- Uploading dataset rows via file uploads.
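As a rough illustration of row management, the sketch below adds a single row to an existing dataset. The endpoint path and payload fields are hypothetical; the DataSetAPI.ipynb notebook documents the actual endpoints and request bodies.

```python
import requests

BASE_URL = "https://<your-globant-enterprise-ai-host>"  # placeholder
HEADERS = {"Authorization": "Bearer <your-api-token>"}  # placeholder

dataset_id = "<dataset-id>"
new_row = {
    "input": "How do I reset my password?",                                # hypothetical field names
    "expectedOutput": "Use the 'Forgot password' link on the sign-in page.",
}

# Hypothetical endpoint path -- refer to DataSetAPI.ipynb for the real one.
response = requests.post(
    f"{BASE_URL}/dataSetApi/dataSet/{dataset_id}/row",
    headers=HEADERS,
    json=new_row,
)
response.raise_for_status()
```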
Use the Evaluation Plan API to create an evaluation plan for your AI Assistant, specifying the following:
- The AI Assistant to be tested.
- The dataset that will be used for testing.
- The metrics that will be applied to assess the assistant's performance.
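To make the shape of a plan concrete, an illustrative definition might look like the following; every field name and metric name here is an assumption, and the EvaluationPlanAPI.ipynb notebook shows the actual request format.

```python
# Illustrative sketch only: field and metric names are assumptions,
# not the official Evaluation Plan API schema.
evaluation_plan = {
    "name": "refund-policy-plan",
    "assistant": "<your-assistant-name>",   # the AI Assistant to be tested
    "dataSetId": "<dataset-id>",            # the dataset holding the test cases
    "systemMetrics": [                      # metrics and their relative weights
        {"name": "<metric-name-1>", "weight": 0.5},
        {"name": "<metric-name-2>", "weight": 0.5},
    ],
}
```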
You can achieve this by working through the EvaluationPlanAPI.ipynb notebook, which provides examples and code snippets to:
- Create, retrieve, update, and delete evaluation plans.
- Associate system metrics with your evaluation plans and manage their weights.
- Retrieve available system metrics and their details.
- Execute a defined evaluation plan.
Run the evaluation plan to initiate the testing process. The evaluation engine will:
- Instantiate the assistant for each row in the dataset.
- Capture the assistant's response.
- Apply the defined metrics to compare the actual results with the expected outputs.
You can execute an evaluation plan using the POST /evaluationPlanApi/evaluationPlan/{evaluationPlanId} endpoint of the Evaluation Plan API. Refer to the EvaluationPlanAPI.ipynb notebook for a practical example.
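A minimal execution sketch with the requests library is shown below; the endpoint path is the one given above, while the base URL, authentication header, and plan identifier are placeholders.

```python
import requests

BASE_URL = "https://<your-globant-enterprise-ai-host>"  # placeholder
HEADERS = {"Authorization": "Bearer <your-api-token>"}  # placeholder

evaluation_plan_id = "<evaluation-plan-id>"

# Trigger an execution of the evaluation plan.
response = requests.post(
    f"{BASE_URL}/evaluationPlanApi/evaluationPlan/{evaluation_plan_id}",
    headers=HEADERS,
)
response.raise_for_status()
print(response.json())  # see EvaluationPlanAPI.ipynb for the response format
```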
Use the GET /evaluationResultApi/evaluationResult/{evaluationResultId} endpoint of the Evaluation Result API to retrieve the results of the executed evaluation plan. The results will include:
- The assistant's responses for each test case.
- The computed metric scores based on the expected vs. actual outputs.
The EvaluationResultAPI.ipynb notebook provides examples and code snippets for retrieving evaluation results.
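A matching retrieval sketch is shown below; again, the endpoint path comes from the section above and the remaining values are placeholders.

```python
import requests

BASE_URL = "https://<your-globant-enterprise-ai-host>"  # placeholder
HEADERS = {"Authorization": "Bearer <your-api-token>"}  # placeholder

evaluation_result_id = "<evaluation-result-id>"

# Fetch the stored results of a completed evaluation run.
response = requests.get(
    f"{BASE_URL}/evaluationResultApi/evaluationResult/{evaluation_result_id}",
    headers=HEADERS,
)
response.raise_for_status()
results = response.json()  # per-test-case responses and metric scores
```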
For a basic walkthrough of using the DataSet, Evaluation Plan, and Evaluation Result APIs together, refer to the EvaluationAPITutorial.ipynb notebook.