Embedding data type

Embedding is a data type that represents words or phrases as vectors of numbers. It allows you to perform semantic searches and comparisons in the applications generated with GeneXus Next.

Key Concepts

Embedding Models: These models transform words and phrases into mathematical vectors with various dimensions. Different models are suited for different tasks, such as text similarity or summarization.
Vector Databases: Specialized databases or extensions to existing databases that support storing and querying embedding data.
Semantic Search: The ability to search for similar concepts rather than exact matches, using vector similarity measures.
The current implementation uses cosine similarity to compare embeddings and determine the relevance of search results.
Cosine similarity is ideal for semantic searches because it allows results to be sorted based on similarity in meaning, rather than exact match.
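
To make the comparison concrete, here is a minimal Python sketch of cosine similarity between two vectors. It is an illustration only, not the code GeneXus generates. The function names are hypothetical, and treating distance as 1 minus the similarity is an assumption about how a distance value relates to cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity (0 means same direction)."""
    return 1.0 - cosine_similarity(a, b)
```

Because only the angle between the vectors matters, two texts with similar meaning score close to 1 regardless of vector length, which is why results can be ranked by meaning rather than by exact wording.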

Features and Capabilities

You can define Embedding as the data type of:

domains,
attributes,
variables, and
Structured Data Type (SDT) members.

These can be used at the code level, but not in layouts.

The Work With pattern has also been adapted to take advantage of the capabilities of Embedding. When an Embedding is included in the data structure of a Work With, the pattern automatically performs the following actions:

Adds a distance ordering: This allows results to be sorted according to their semantic similarity to a given query.
Generates a filter using an &search variable: This variable is of the same type as the Embedding attribute, allowing semantic searches to be performed efficiently.

For example, if you have an attribute named ProductEmbedding of Embedding type in your structure, the Work With pattern will automatically generate an order clause like this:

ORDERS: Order(ProductEmbedding.Distance(&search));

This adaptation of the Work With pattern facilitates the implementation of semantic searches in your applications, allowing you to take advantage of the power of embeddings without the need to write complex code manually.

Embedding data types have two important methods: FromString() and ToString().

FromString() calls Globant Enterprise AI to get the vector from the text, using the model configured for the field and environment-level settings.

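ToString() returns a string with the serialized vector. It is not the inverse of FromString(): FromString() expects natural-language text to be converted into a vector, so passing it the output of ToString() does not restore the original vector.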

The string returned by ToString() is the same string you would get in the corresponding field in a BC.ToJson() or SDT.ToJson().

At the Generator level, there is a new group of properties called “Enterprise AI configuration”. This group includes two important properties:

AI Provider Address (type Text): This property is used to specify the URL of the AI service provider that will be used to generate embeddings. For example, it could be the URL of the OpenAI API or any other AI service that provides embedding generation capabilities.
AI API key (password text type): Stores the API key used to authenticate and authorize requests to the AI service provider.

Sample

Suppose you have an e-commerce application with a large product catalog. You want to improve the search functionality to provide more relevant results based on the semantic meaning of product descriptions, rather than relying solely on keyword matching. To achieve this, you can use embeddings to capture the semantic essence of each product.

Here's how you can define a Transaction called Product with an attribute of Embedding type:

Transaction Product
{
  ProductId*                (Type:Numeric)
  ProductName               (Type:Character)
  ProductDescription        (Type:Description)
  ProductEmbedding          (Type:Embedding)
}

In the following image, you can see how to define the Embedding attribute.

As you can see, the ProductEmbedding attribute is of Embedding type. It uses the 'openai/text-embedding-3-small' model, which is well-suited for text similarity tasks, with 512 dimensions.

When configuring an Embedding data type, you need to set the following properties:

Model: Specifies the embedding model to use (e.g., 'openai/text-embedding-3-small'). The model determines how the text is converted into a vector representation. Different models are optimized for different tasks, such as text similarity, classification, or semantic search.
Dimensions: Sets the number of dimensions in the vector; the values allowed are determined by the model. In this case, it's 512. The number of dimensions affects the richness of the semantic representation and the computational resources required for processing.
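
As a rough illustration of how the Dimensions setting affects resources, the following Python sketch estimates the raw storage needed for a set of vectors. It assumes each component is stored as a 4-byte floating-point number. The actual storage format depends on the database and is not specified here, and the function name is hypothetical.

```python
BYTES_PER_COMPONENT = 4  # assumption: one 32-bit float per dimension


def vector_storage_bytes(dimensions, row_count=1):
    """Estimate the raw bytes needed for row_count vectors of the given size."""
    if dimensions <= 0:
        raise ValueError("dimensions must be positive")
    if row_count < 0:
        raise ValueError("row_count cannot be negative")
    return dimensions * BYTES_PER_COMPONENT * row_count


# One 512-dimension vector, and a catalog of 100,000 products.
per_product = vector_storage_bytes(512)
per_catalog = vector_storage_bytes(512, row_count=100_000)
```

Under this assumption a 512-dimension vector takes 2,048 bytes, so a catalog of 100,000 products needs roughly 205 MB for the embeddings alone, before any index overhead.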

To make the embedding functionality work effectively, you need to add a rule that automatically generates the embedding when a product is created or updated.

Transaction Product
{
  ProductId*                (Type:Numeric)
  ProductName               (Type:Character)
  ProductDescription        (Type:Description)
  ProductEmbedding          (Type:Embedding)
   
  #Rules
    ProductEmbedding = ProductEmbedding.GenerateEmbedding(format("The product with Name %1 has this description: %2", ProductName, ProductDescription), &Messages) on aftervalidate;     
  #End
}

This rule is doing several important things:

GenerateEmbedding Method: ProductEmbedding.GenerateEmbedding() creates the vector representation (embedding) of the product information.
Formatting the Input: The format() function combines ProductName and ProductDescription into a single string, which gives the model context for generating the embedding.
Automatic Triggering: The 'on aftervalidate' clause makes the rule run automatically after the product data has been validated, so the embedding is created or updated with the latest valid information.
Assigning the Result: The result of GenerateEmbedding() is assigned back to the ProductEmbedding attribute, storing the vector representation in the database.
Error Handling: The &Messages parameter captures any error messages raised during the embedding generation process.

Note: You must define the &Messages variable, of type Messages, GeneXus.Common, in the Product Transaction.
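
The input-formatting step can be pictured with a short Python sketch that builds the same sentence the rule passes to GenerateEmbedding(). This is only an analogy for what format() produces: the function name is hypothetical, and trimming surrounding whitespace is an added assumption rather than documented behavior.

```python
def build_embedding_input(product_name, product_description):
    """Combine name and description into one sentence to be embedded."""
    template = "The product with Name {name} has this description: {description}"
    # Assumption: surrounding blanks carry no meaning, so they are trimmed.
    return template.format(
        name=product_name.strip(),
        description=product_description.strip(),
    )


text = build_embedding_input("Surfboard", "9 ft longboard for beginners")
```

Embedding a full sentence rather than the bare description tells the model which part is the name and which is the description, which may help it place similar products close together.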

This rule automates the creation of embeddings, ensuring that every product in your catalog has an up-to-date semantic representation without manual intervention. It's part of implementing semantic search in your e-commerce application.

Suppose that now you want to enhance your e-commerce platform with a smart search feature that understands the context of user queries, even when they don't match exact keywords in your product catalog. To achieve this, you can implement a search functionality that leverages these embeddings.

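// Convert the user's query into a vector.
&search.FromString("show me products for surfers")
// List up to 5 active products whose distance to the query is below 0.5, closest first.
// ProductIsActive is assumed to be a Boolean attribute of Product; it is not part of the structure shown above.
For each Count 5
  order ProductEmbedding.Distance(&search)
  where ProductEmbedding.Distance(&search) < 0.5
  where ProductIsActive
  msg(ProductName)
Endfor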

This code allows you to perform a semantic search that goes beyond simple keyword matching, providing more relevant results to your customers and potentially improving their shopping experience.
