Create Embeddings¤
Python Plugin
This operator is part of a Python plugin package. To use it, install the package first, for example with cmemc.
This plugin creates vector embeddings from text data using OpenAI’s embeddings API. It processes input entities containing text data and generates high-dimensional vector representations that capture semantic meaning.
Features¤
- Supports OpenAI embeddings models (e.g., text-embedding-3-small)
- Batch processing for efficient API usage
- Configurable input/output paths
- Automatic schema generation based on configuration
- Built-in error handling and workflow cancellation support
Configuration¤
- URL: Base URL of the OpenAI API, without endpoint path and without trailing slash (default: https://api.openai.com/v1)
- API Key: Your OpenAI API key for authentication
- Model: The embedding model to use (e.g., text-embedding-3-small)
- Timeout: Request timeout in milliseconds (default: 10000)
- Buffer Size: Number of texts to process per batch (default: 100)
- Input Paths: Comma-separated list of entity paths to embed (default: “text”)
- Output Paths: Configurable paths for the embedding vectors (default: _embedding) and the source text (default: _embedding_source)
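The settings above can be pictured as a single configuration object. The following is only a sketch: `EmbeddingsConfig` and its field names are hypothetical and not part of the plugin, while the default values are the ones documented on this page.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingsConfig:
    """Hypothetical container mirroring the documented settings."""

    api_key: str
    url: str = "https://api.openai.com/v1"
    model: str = "text-embedding-3-small"
    timeout_ms: int = 10000  # single request timeout, in milliseconds
    buffer_size: int = 100  # number of texts sent per request
    input_paths: str = "text"  # comma-separated list of entity paths
    embedding_source_path: str = "_embedding_source"
    embedding_path: str = "_embedding"


# Only the API key has no default and must be supplied.
config = EmbeddingsConfig(api_key="sk-dummy")
```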
Input/Output¤
- Input: Entities with text data in specified paths
- Output: Original entities enhanced with embedding vectors and source text
- Embedding vectors are stored as string representations of float arrays
- Source text used for embedding is preserved for reference
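Because the vectors arrive as strings, a downstream consumer has to parse them before doing any arithmetic. The sketch below assumes the string is a JSON-style array such as `[0.25, -0.5, 1.0]`; the exact serialization is not specified on this page, so check it against real output. `parse_embedding` is a hypothetical helper name.

```python
import json


def parse_embedding(value: str) -> list[float]:
    """Turn a string such as "[0.25, -0.5, 1.0]" into a list of floats.

    Assumes a JSON-compatible array; adjust if the plugin output differs.
    """
    return [float(item) for item in json.loads(value)]


vector = parse_embedding("[0.25, -0.5, 1.0]")
```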
Use Cases¤
- Semantic search and similarity matching
- Text clustering and classification
- Recommendation systems
- Natural language processing pipelines
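For the semantic search use case, a common way to compare two embedding vectors is cosine similarity. This is a minimal pure-Python sketch and not part of the plugin; `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, and the two-dimensional vectors are toy data.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for zero vectors; treat as 0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: list[float], candidates: dict[str, list[float]]) -> list[str]:
    """Return candidate keys ordered from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda key: cosine_similarity(query, candidates[key]),
        reverse=True,
    )


ranking = rank_by_similarity([1.0, 0.0], {"a": [0.0, 1.0], "b": [1.0, 0.1]})
```

For larger collections, a vector index or a numeric library would usually replace the linear scan shown here.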
Parameters¤
Base URL¤
URL of the OpenAI API (without endpoint path and without trailing slash)
- Datatype: string
- Default Value: https://api.openai.com/v1
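The base URL is expected without the endpoint path and without a trailing slash. As an illustration of how a client might derive the full embeddings endpoint from it, the sketch below appends `/embeddings`, which is the path of the OpenAI embeddings endpoint. `embeddings_endpoint` is a hypothetical helper, and stripping a trailing slash is a defensive assumption rather than documented plugin behaviour.

```python
def embeddings_endpoint(base_url: str) -> str:
    """Join the base URL and the embeddings path, tolerating a trailing slash."""
    return base_url.rstrip("/") + "/embeddings"


endpoint = embeddings_endpoint("https://api.openai.com/v1")
```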
The OpenAI API key¤
Enter the OpenAI API key if one is required, or provide a dummy value when accessing an unsecured endpoint.
- Datatype: password
- Default Value: None
The embeddings model, e.g. text-embedding-3-small¤
- Datatype: string
- Default Value: text-embedding-3-small
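To show how the base URL, API key, and model fit together, here is a sketch of an embeddings request built with the Python standard library. The OpenAI embeddings endpoint accepts a JSON body with `model` and `input` and a bearer token in the `Authorization` header. How the plugin itself issues requests is not documented here, so `build_embeddings_request` is a hypothetical helper; the request is only constructed, not sent.

```python
import json
import urllib.request


def build_embeddings_request(
    base_url: str, api_key: str, model: str, texts: list[str]
) -> urllib.request.Request:
    """Construct (but do not send) a POST request for a batch of texts."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embeddings_request(
    "https://api.openai.com/v1",
    "sk-dummy",  # placeholder key, as suggested for unsecured endpoints
    "text-embedding-3-small",
    ["hello world"],
)
```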
Timeout (Single Request, in Milliseconds)¤
- Datatype: Long
- Default Value: 10000
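The timeout is configured in milliseconds, whereas Python HTTP functions such as `urllib.request.urlopen` take their `timeout` argument in seconds. A client written in Python would therefore need a conversion along these lines; `timeout_seconds` is a hypothetical helper name.

```python
def timeout_seconds(timeout_ms: int) -> float:
    """Convert a millisecond timeout into seconds."""
    return timeout_ms / 1000.0


default_timeout = timeout_seconds(10000)
```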
Entries Processing Buffer¤
How many input values do you want to send per request?
- Datatype: Long
- Default Value: 100
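The buffer determines how many input values are grouped into one request. The plugin's internal batching code is not shown on this page; the following generic sketch only illustrates the idea, with `batched` as a hypothetical helper and 250 toy texts as input.

```python
from typing import Iterable, Iterator


def batched(values: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches with at most `size` values each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: list[str] = []
    for value in values:
        batch.append(value)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller batch
        yield batch


# With the default buffer of 100, 250 texts are split into three requests.
batches = list(batched([f"text {i}" for i in range(250)], 100))
```

A larger buffer means fewer requests, but each request carries more data and has to complete within the configured timeout.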
Used entity paths (comma-separated list)¤
Changing this value changes which input paths are used by the workflow task. A blank value means that all paths are used.
- Datatype: string
- Default Value: text
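The rule described here (a comma-separated list selects paths, a blank value selects everything) can be sketched as follows. `select_paths` is a hypothetical helper, and trimming whitespace around each entry is an assumption about how the list is parsed.

```python
def select_paths(configured: str, available: list[str]) -> list[str]:
    """Return the configured paths, or all available paths if none are given."""
    wanted = [path.strip() for path in configured.split(",") if path.strip()]
    if not wanted:
        return list(available)  # blank value: use every input path
    return wanted


selected = select_paths("text, title", ["text", "title", "author"])
everything = select_paths("", ["text", "title", "author"])
```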
Entity Embedding text (output)¤
Path under which the source text used for embedding is stored. Changing this value changes the output schema accordingly.
- Datatype: string
- Default Value: _embedding_source
Entity Embedding path (output)¤
Path under which the embedding vector is stored. Changing this value changes the output schema accordingly.
- Datatype: string
- Default Value: _embedding
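Taken together, the two output parameters determine where the source text and the vector appear on each output entity. The sketch below models an entity as a plain dictionary, which is a simplification and not the plugin's actual entity type. It also assumes the vector is serialized as a JSON-style array string, and `enrich_entity` is a hypothetical helper.

```python
import json


def enrich_entity(
    entity: dict,
    source_text: str,
    vector: list[float],
    text_path: str = "_embedding_source",
    embedding_path: str = "_embedding",
) -> dict:
    """Return a copy of the entity with source text and embedding added."""
    enriched = dict(entity)  # leave the input entity untouched
    enriched[text_path] = source_text
    enriched[embedding_path] = json.dumps(vector)  # vector stored as a string
    return enriched


original = {"text": "hello world"}
result = enrich_entity(original, "hello world", [0.1, 0.2])
```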