Ingestion providers determine how your documents are processed when they are uploaded. The available options are:

  • Globant Enterprise AI (Default)
  • llamaParse
  • legacy

To set up an ingestion provider, go to the RAG Assistants home page and click on ADD DOCUMENTS. Next, select the Advanced tab and configure the ingestion provider according to your needs.

Globant Enterprise AI (geai)

The geai option takes advantage of the information contained in the images and tables of PDFs: depending on the selected strategy, it processes the images and tables as well as the raw text, so you make the most of the document's content. Scanned and standard PDF files are currently supported.

The geai option processes files differently based on their content. A PDF file can be either:

  • Scanned PDF: each page is treated as an image and, by default, processed with the hi_res strategy and the scannedPrompt parameter. One LLM call is therefore executed per page to interpret the image and extract as much text as possible.
  • Standard PDF: a file containing text, images, and tables is processed according to the selected strategy parameter, in combination with the imagePrompt/tablePrompt parameters.

Parameters

You can customize the geai processing with these parameters:

Parameter | Description
strategy | Determines the processing approach. Options:
- auto (default): Globant Enterprise AI selects the best option based on the document.
- hi_res: High-resolution processing. Requires the model parameter. More expensive, but potentially yields better results for complex documents, especially PDF documents that contain images and tables. Check an example here.
model | Specifies the AI model used for image processing when OCR is not feasible. Default: openai/gpt-4o. Use the format provider/modelname. The model must support vision; examples include openai/gpt-4o, openai/gpt-4o-mini, anthropic/claude-3-5-sonnet-20240620, and vertex_ai/gemini-1.5-flash.
imagePrompt | Custom prompt for image interpretation and text generation. If not provided, a default prompt is used.
scannedPrompt | Custom prompt for scanned documents, where the whole page is an image. If not provided, a default prompt is used.
tablePrompt | Custom prompt for table interpretation and text generation. If not provided, a default prompt is used.
logoProcess | Determines whether the visual model processes the logos within the document. Options:
- False (default): Does not process the logos.
- True: Processes the logos, extracting an explanation of each one.
dpi | Defines the DPI (dots per inch) used when processing images. Default: 200.
structure | Specifies whether the document is assumed to have a table structure. Valid values:
- (empty): Default value. Assumes no table structure.
- table: Assumes the document is in a table or tabular format. Applicable only to csv and xls* formats. Check an example here.
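
As a hedged illustration of how the strategy and structure parameters combine, the sketch below builds the form fields for a geai upload and rejects the combinations the table rules out. geai_fields is a hypothetical helper name, not part of Globant Enterprise AI. It only prints fields and never calls the API, and the validation rules are an assumption derived from the table above.

```shell
#!/bin/sh
# Hypothetical helper (not an official tool): prints the form fields for a
# geai upload, one per line.
# Usage: geai_fields FILE [STRATEGY] [STRUCTURE]
geai_fields() (
  file="$1"
  strategy="${2:-auto}"
  structure="${3:-}"

  # Only the two strategies listed in the parameter table are accepted.
  case "$strategy" in
    auto|hi_res) ;;
    *) echo "unknown strategy: $strategy" >&2; exit 1 ;;
  esac

  # structure is either empty or "table"; "table" is documented for
  # csv and xls* files only, so the file extension is checked.
  if [ "$structure" = "table" ]; then
    case "${file##*.}" in
      csv|xls*) ;;
      *) echo "structure=table applies only to csv and xls* files" >&2; exit 1 ;;
    esac
  elif [ -n "$structure" ]; then
    echo "unknown structure: $structure" >&2
    exit 1
  fi

  echo "provider=geai"
  # auto is the documented default, so it is only sent when overridden.
  if [ "$strategy" != "auto" ]; then echo "strategy=$strategy"; fi
  if [ -n "$structure" ]; then echo "structure=$structure"; fi
)
```

For example, geai_fields data.csv auto table prints provider=geai followed by structure=table, while geai_fields report.pdf auto table fails because structure=table is documented for csv and xls* files only.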

Default imagePrompt

The imagePrompt default is as follows:

You are an assistant tasked with extracting all text from images in the image text language.
These images are pages from a PDF document. Extract and transcribe all visible text in the image,
maintaining the structure and layout as much as possible. Include any headers, footers, and page numbers.
Be thorough and don't miss any text, no matter how small or where it's positioned in the image.

Default tablePrompt

The tablePrompt default is as follows:

You are an assistant tasked with extracting all text from images in the image text language. 
These images are pages from a PDF document. Extract and transcribe all visible text in the image, 
maintaining the structure and layout as much as possible. Include any headers, footers, and page numbers.
Be thorough and don't miss any text, no matter how small or where it's positioned in the image.

Default scannedPrompt

The scannedPrompt default is as follows:

You are an assistant tasked with extracting all text from images in the image text language. 
These images are pages from a PDF document. Extract and transcribe all visible text in the image, 
maintaining the structure and layout as much as possible. Include any headers, footers, and page numbers.
Be thorough and don't miss any text, no matter how small or where it's positioned in the image.

Samples

The following shows how to use Chat API for different ingestion provider options:

  • Minimal options:
curl -X POST "$BASE_URL/v1/search/profile/{name}/document" \
 -H "Authorization: Bearer $SAIA_PROJECT_APITOKEN" \
 -H "Content-Type: multipart/form-data" \
 --form 'file=@"/C:/temp/SampleFile.pdf"' \
 --form 'provider="geai"'
  • High-resolution strategy:
curl -X POST "$BASE_URL/v1/search/profile/{name}/document" \
 -H "Authorization: Bearer $SAIA_PROJECT_APITOKEN" \
 -H "Content-Type: multipart/form-data" \
 --form 'file=@"/C:/temp/SampleFile.pdf"' \
 --form 'provider="geai"' \
 --form 'strategy="hi_res"'
  • High-resolution with custom model and prompt:
curl -X POST "$BASE_URL/v1/search/profile/{name}/document" \
 -H "Authorization: Bearer $SAIA_PROJECT_APITOKEN" \
 -H "Content-Type: multipart/form-data" \
 --form 'file=@"/C:/temp/SampleFile.pdf"' \
 --form 'provider="geai"' \
 --form 'strategy="hi_res"' \
 --form 'model="openai/gpt-4o"' \
 --form 'imagePrompt="Summarize the image as succinctly as possible in markdown format"'
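
The samples above differ only in their form fields, so they can be wrapped in one small function. The following is a minimal sketch under stated assumptions: upload_document and the DRY_RUN variable are hypothetical names introduced here, while the URL, the Authorization header, and the BASE_URL and SAIA_PROJECT_APITOKEN variables are taken from the samples. The explicit Content-Type header is left out because curl sets it for --form requests.

```shell
#!/bin/sh
# Sketch only: upload_document is a hypothetical wrapper around the curl
# samples above, not an official tool.
# Usage: upload_document PROFILE FILE [key=value ...]
# Set DRY_RUN=1 to print the curl arguments, one per line, instead of
# sending the request. The dry run prints the token too, so do not log it.
upload_document() (
  profile="$1"
  file="$2"
  shift 2

  # Turn every remaining key=value argument into a "--form key=value" pair.
  count=$#
  for field in "$@"; do
    set -- "$@" --form "$field"
  done
  shift "$count"

  # Prepend the fixed part of the request: method, URL, token, and the file.
  # File names containing ',' or ';' would need extra quoting for curl.
  set -- -X POST "$BASE_URL/v1/search/profile/$profile/document" \
    -H "Authorization: Bearer $SAIA_PROJECT_APITOKEN" \
    --form "file=@$file" "$@"

  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$@"
  else
    curl "$@"
  fi
)
```

With BASE_URL and SAIA_PROJECT_APITOKEN set, a call such as upload_document my-assistant /tmp/SampleFile.pdf provider=geai strategy=hi_res sends the same fields as the high-resolution sample. Here my-assistant stands in for the RAG Assistant identifier.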

llamaParse

LlamaParse is an API by LlamaIndex to parse and represent files for efficient retrieval and context augmentation.

When you select the llamaParse option, ingestion is delegated to the LlamaParse service to obtain a text representation of the associated documents. You need an API key to use the service (check its pricing and usage).

The following parameters are available:

Parameter | Description
apiKey | Required. Obtain it from the LlamaParse site following the steps in Get an API key.
resultType | Output format. Options:
- markdown (default)
- text
- json
splitByPage | Document splitting option:
- true (default): Split document text by pages
- false: Keep document as a single text
fastMode | Parsing speed option:
- true: Bypass reconstruction, significantly accelerating parsing
- false (default): Standard parsing speed
targetPages | Comma-separated list of pages to extract. Default: all pages (numbered from 0)
language | Document language. Default: en. For multiple languages, repeat this parameter for each language. See documentation for other options.
invalidateCache | Cache invalidation:
- true: Invalidate cache
- false (default): Use existing cache
doNotCache | Caching option:
- true: Do not cache results
- false (default): Allow caching
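
Two of these parameters are easy to get wrong: targetPages counts pages from 0, and language is repeated once per language rather than comma-separated. The sketch below shows one way to produce both fields. target_pages and language_fields are hypothetical helper names introduced for illustration, and they only print the fields, without calling LlamaParse or the upload endpoint.

```shell
#!/bin/sh
# Hypothetical helper: converts 1-based page numbers (as shown in a PDF
# viewer) into the 0-based, comma-separated targetPages field.
# With no arguments it prints nothing, which keeps the default (all pages).
target_pages() (
  [ "$#" -gt 0 ] || exit 0
  list=""
  for page in "$@"; do
    # Reject empty values, non-digits, and anything starting with 0.
    case "$page" in
      ''|0*|*[!0-9]*)
        echo "not a 1-based page number: $page" >&2
        exit 1
        ;;
    esac
    list="${list:+$list,}$((page - 1))"
  done
  echo "targetPages=$list"
)

# Hypothetical helper: emits one language field per language code, since
# the parameter is repeated for multiple languages.
language_fields() (
  for code in "$@"; do
    echo "language=$code"
  done
)
```

Each printed line can then be passed to curl as a separate --form argument, in the same way as the fields in the samples.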

Considerations

  • The pageNumber parameter is not assigned when ingesting PDF files.

Samples

The following shows how to use Chat API for different ingestion provider options:

  • Markdown format, first page only:
curl -X POST "$BASE_URL/v1/search/profile/{name}/document" \
 -H "Authorization: Bearer $SAIA_PROJECT_APITOKEN" \
 -H "Content-Type: multipart/form-data" \
 --form 'file=@"/C:/temp/SampleFile.pdf"' \
 --form 'provider="llamaParse"' \
 --form 'format="markdown"' \
 --form 'targetPages="0"' \
 --form 'splitByPage="true"' \
 --form 'apiKey=""'
  • Text format without splitting, two languages:
curl -X POST "$BASE_URL/v1/search/profile/{name}/document" \
 -H "Authorization: Bearer $SAIA_PROJECT_APITOKEN" \
 -H "Content-Type: multipart/form-data" \
 --form 'file=@"/C:/temp/SampleFile.pdf"' \
 --form 'provider="llamaParse"' \
 --form 'format="text"' \
 --form 'splitByPage="false"' \
 --form 'language="en"' \
 --form 'language="es"'

Note: {name} in the URL is the identifier of the associated RAG Assistant.

legacy

The legacy option uses the following packages:

Note: This is the fastest option, but its accuracy may vary depending on the file types.

Custom Ingestion (ETL/ELT)

If you need a custom ingestion process, you can create your own ETL (Extract/Transform/Load) or ELT (Extract/Load/Transform) extraction process and then upload a .custom file. This allows you to take complete control of the process by creating the necessary Document Interface for correct processing.

This option is useful when you need more control over how your data is extracted, transformed, and loaded into the system.

Ingestion samples

See Also

Ingestion SDK

Last update: March 2025 | © GeneXus. All rights reserved. GeneXus Powered by Globant