Create Embeddings¤
Python Plugin
This operator is part of a Python Plugin Package. To use it, install the package first, e.g. with cmemc.
This plugin creates vector embeddings from text data using an OpenAI compatible embeddings API. It processes input entities containing text data and generates high-dimensional vector representations that capture semantic meaning.
Features¤
- Supports OpenAI embeddings models (e.g., text-embedding-3-small)
- Batch processing for efficient API usage
- Configurable input/output paths
- Built-in error handling and workflow cancellation support
Input/Output¤
- Input: Entities with text data in specified paths
- Output: Original entities enhanced with embedding vectors and source text
- Embedding vectors are stored as string representations of float arrays
- Source text used for embedding is preserved for reference
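Because the vectors arrive as strings, downstream code has to parse them before doing any arithmetic. The following is a minimal sketch, assuming the string is a JSON-compatible list such as "[0.25, -0.5, 1.0]"; the exact serialization is not specified here, and the entity dictionary and the `parse_embedding` helper are hypothetical.

```python
import json


def parse_embedding(value: str) -> list[float]:
    """Parse a stringified float array into a list of floats.

    Assumes a JSON-compatible list; adjust if the actual format differs.
    """
    vector = json.loads(value)
    if not isinstance(vector, list):
        raise ValueError("expected a list of numbers")
    return [float(x) for x in vector]


# Hypothetical output entity, keyed by the default output paths.
entity = {
    "_embedding_source": "hello world",
    "_embedding": "[0.25, -0.5, 1.0]",
}
vector = parse_embedding(entity["_embedding"])
```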
Use Cases¤
- Semantic search and similarity matching
- Text clustering and classification
- Recommendation systems
- Natural language processing pipelines
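For semantic search and similarity matching, a common way to compare two embedding vectors is cosine similarity. The sketch below is not part of the plugin; it assumes the vectors have already been parsed into lists of floats, and the helper names `cosine_similarity` and `rank_by_similarity` are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equally long vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: list[float], candidates: dict[str, list[float]]):
    """Return (key, score) pairs sorted from most to least similar."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For larger collections, a vector database or an approximate nearest neighbor index is usually a better fit than this linear scan.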
Parameter¤
Base URL¤
The base URL of the OpenAI compatible API (without endpoint path).
- ID: base_url
- Datatype: string
- Default Value: https://api.openai.com/v1/
API Type¤
Select the API client type. This determines the authentication method and endpoint configuration used for API requests. Choose OPENAI for direct OpenAI API access or AZURE_OPENAI for Azure-hosted OpenAI services. Consider using the API version advanced parameter in case you access Azure-hosted OpenAI services.
- ID: api_type
- Datatype: enumeration
- Default Value: OPENAI
API key¤
An optional API key for authentication.
- ID: api_key
- Datatype: password
- Default Value: None
Embeddings model¤
The identifier of the embeddings model to use. Available model IDs for some public providers can be found here: Claude, OpenAI.
- ID: model
- Datatype: string
- Default Value: text-embedding-3-small
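To illustrate how the base URL, API key and model fit together, the sketch below builds (but does not send) a request against the `/embeddings` endpoint of an OpenAI compatible API. It is not the plugin's implementation: the helper `build_embeddings_request` is hypothetical, and the sketch only covers the OPENAI API type with bearer-token authentication.

```python
import json
import urllib.request
from typing import Optional


def build_embeddings_request(
    base_url: str,
    model: str,
    texts: list[str],
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST request for an OpenAI compatible embeddings endpoint."""
    # The base URL is configured without the endpoint path, so append it here.
    url = base_url.rstrip("/") + "/embeddings"
    headers = {"Content-Type": "application/json"}
    if api_key:  # the API key is optional
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_embeddings_request(
    "https://api.openai.com/v1/",
    "text-embedding-3-small",
    ["first text", "second text"],
)
```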
Embedding entity paths (comma-separated list)¤
This value determines which input paths the workflow task uses to calculate embeddings. A blank value means that all paths are used.
- ID: embedding_paths
- Datatype: string
- Default Value: text
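The comma-separated format and the "blank means all paths" rule can be sketched as follows. This is an illustration of the described behaviour rather than the plugin's code; the function names and the whitespace handling are assumptions.

```python
from typing import Optional


def parse_path_list(value: Optional[str]) -> list[str]:
    """Split a comma-separated parameter value into trimmed, non-empty paths."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def select_embedding_paths(embedding_paths: Optional[str], input_paths: list[str]) -> list[str]:
    """Return the configured paths, or all input paths if nothing is configured."""
    configured = parse_path_list(embedding_paths)
    return configured if configured else list(input_paths)
```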
Forward entity paths (comma-separated list)¤
Paths from input entities to forward to output without modification. These paths will be passed through unchanged alongside embeddings.
- ID: forward_paths
- Datatype: string
- Default Value: None
Advanced Parameter¤
API Version¤
Azure OpenAI API version (only used when API Type is AZURE_OPENAI). For more information about OpenAI API versions on Azure, see the documentation.
- ID: api_version
- Datatype: string
- Default Value: None
Timeout (milliseconds)¤
The timeout for a single API request in milliseconds.
- ID: timout_single_request
- Datatype: Long
- Default Value: 10000
Entries Processing Buffer¤
The number of input values sent per API request.
- ID: entries_processing_buffer
- Datatype: Long
- Default Value: 100
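The effect of the buffer size can be pictured as splitting the input texts into fixed-size batches, one batch per API request. The sketch below assumes this simple chunking; the `batched` helper is illustrative and not taken from the plugin.

```python
from typing import Iterator, Sequence


def batched(values: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` values."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


texts = [f"text {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batched(texts, 100)]
# batch_sizes is [100, 100, 50]: 250 values need three requests
```

A larger buffer means fewer requests, but each request carries more data and is more likely to run into the single-request timeout or provider-side limits.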
Entity Embedding text (output)¤
The output path under which the source text used for embedding is stored. Changing this value changes the output schema accordingly.
- ID: embedding_output_text
- Datatype: string
- Default Value: _embedding_source
Entity Embedding path (output)¤
The output path under which the embedding vector is stored. Changing this value changes the output schema accordingly.
- ID: embedding_output_path
- Datatype: string
- Default Value: _embedding
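To show how the output parameters shape the result, the sketch below assembles one output entity from forwarded paths, the source text and the embedding vector. It is a simplified model, not the plugin's code: entities are plain dictionaries, `build_output_entity` is a hypothetical helper, and the vector is serialized with Python's `str()`, which may differ from the plugin's actual string format.

```python
from typing import Any, Sequence


def build_output_entity(
    entity: dict[str, Any],
    source_text: str,
    vector: Sequence[float],
    forward_paths: Sequence[str] = (),
    embedding_output_text: str = "_embedding_source",
    embedding_output_path: str = "_embedding",
) -> dict[str, Any]:
    """Combine forwarded values with the embedding and its source text."""
    # Forwarded paths are copied unchanged from the input entity.
    output = {path: entity[path] for path in forward_paths if path in entity}
    output[embedding_output_text] = source_text
    # The vector is stored as a string representation of a float array.
    output[embedding_output_path] = str(list(vector))
    return output


result = build_output_entity(
    {"id": "1", "text": "hello", "other": "x"},
    source_text="hello",
    vector=[0.1, 0.2],
    forward_paths=["id"],
)
```

Renaming either output parameter only changes the key under which the value appears, which is why downstream tasks that read these paths need to be adjusted at the same time.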