The TO_MARKDOWN() function lets users extract the content of their documents in Markdown by specifying the document path or URL. It is especially useful for passing extracted document content to LLMs or for storing it in a Knowledge Base.
Configuration
The TO_MARKDOWN() function supports different file formats and methods of passing documents into it. It also requires an LLM for processing documents.
Supported File Formats
The TO_MARKDOWN() function supports PDF, XML, and Nessus file formats. Documents can be provided from URLs, file storage, or Amazon S3 storage.
Supported LLMs
The TO_MARKDOWN() function requires an LLM to process the document content into Markdown format.
The supported LLM providers include:
- OpenAI
- Azure OpenAI
- Google
The model you select must support multi-modal inputs, that is, both images and text. For example, OpenAI’s gpt-4o is a supported multi-modal model.
The LLM can be set in one of the following ways:
- Set the default model in the Settings of MindsDB Editor.
- Set the default model in the MindsDB configuration file.
- Use the environment variables described below to set an LLM specifically for the TO_MARKDOWN() function. The TO_MARKDOWN_FUNCTION_PROVIDER environment variable defines the selected provider, which is one of openai, azure_openai, or google.
OpenAI
Here are the environment variables for the OpenAI provider:
Azure OpenAI
Here are the environment variables for the Azure OpenAI provider:
Google
Here are the environment variables for the Google provider:
Usage
You can use the TO_MARKDOWN() function to extract the content of your documents in Markdown format. The arguments for this function are:
- file_path_or_url: The path or URL of the document you want to extract content from.
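As a minimal sketch of the call shape, the query below passes a single URL as the file_path_or_url argument. The URL is a placeholder and the markdown alias is only illustrative. It also assumes an LLM has already been configured as described in the Configuration section.

```sql
-- Hypothetical URL; replace it with the location of a real PDF document.
-- The column alias "markdown" is an illustrative choice, not a requirement.
SELECT TO_MARKDOWN('https://example.com/sample.pdf') AS markdown;
```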
From Amazon S3
The following example shows how to use the TO_MARKDOWN() function with a PDF document from Amazon S3 storage connected to MindsDB.
Here are the steps for passing files from Amazon S3 into TO_MARKDOWN():
- Connect Amazon S3 to MindsDB following the Amazon S3 integration instructions.
- The public_url of the file is generated in the s3_datasource.files table upon connecting the Amazon S3 data source to MindsDB.
- Running the TO_MARKDOWN() query selects the public_url of the file from the s3_datasource.files table.
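The steps above can be sketched as a single query. This assumes the Amazon S3 data source was connected under the name s3_datasource, as in the text. The path column used for filtering and the file name are assumptions made for illustration, so adjust them to match your own files table.

```sql
-- Assumes an Amazon S3 data source named s3_datasource is already connected.
-- The "path" filter and the file name are illustrative placeholders.
SELECT TO_MARKDOWN(public_url) AS markdown
FROM s3_datasource.files
WHERE path = 'my_document.pdf';
```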
From URL
The following example shows how to use the TO_MARKDOWN() function with a PDF document from a URL.
Here is the output:
Usage with Knowledge Bases
You can also use the TO_MARKDOWN() function to extract content from documents and store it in a Knowledge Base. This is particularly useful for creating a Knowledge Base from a collection of documents.
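One way this could look, offered as a sketch and not a definitive recipe: the query below assumes a Knowledge Base named my_kb has already been created and that it stores its text in a column called content. Both names and the URL are assumptions, and the exact insert syntax may differ between MindsDB versions.

```sql
-- Hypothetical Knowledge Base name (my_kb), content column, and URL.
-- Extracts the document as Markdown and stores the result in the Knowledge Base.
INSERT INTO my_kb
SELECT TO_MARKDOWN('https://example.com/sample.pdf') AS content;
```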