Create Embeddings¤

Python Plugin

This operator is part of a Python plugin package. To use it, install the package first, e.g. with cmemc.

This plugin creates vector embeddings from text data using OpenAI’s embeddings API. It processes input entities containing text data and generates high-dimensional vector representations that capture semantic meaning.

Features¤

  • Supports OpenAI embeddings models (e.g., text-embedding-3-small)
  • Batch processing for efficient API usage
  • Configurable input/output paths
  • Automatic schema generation based on configuration
  • Built-in error handling and workflow cancellation support

Configuration¤

  • URL: OpenAI API endpoint (default: https://api.openai.com/v1)
  • API Key: Your OpenAI API key for authentication
  • Model: The embedding model to use (e.g., text-embedding-3-small)
  • Timeout: Request timeout in milliseconds (default: 10000)
  • Buffer Size: Number of texts to process per batch (default: 100)
  • Input Paths: Comma-separated list of entity paths to embed (default: “text”)
  • Output Paths: Paths under which the embedding vector and its source text are written (defaults: _embedding and _embedding_source)
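
As a quick overview, the settings above and their documented defaults can be collected in one place. The following is only a sketch: the class `EmbeddingsConfig` and its field names are hypothetical and are not the plugin's actual parameter identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingsConfig:
    """Hypothetical container mirroring the documented settings (not the plugin's own class)."""

    api_key: str                                   # no default, must be supplied
    url: str = "https://api.openai.com/v1"         # base URL without trailing slash
    model: str = "text-embedding-3-small"
    timeout_ms: int = 10000                        # per request, in milliseconds
    buffer_size: int = 100                         # texts sent per request
    input_paths: str = "text"                      # comma-separated list
    embedding_source_path: str = "_embedding_source"
    embedding_path: str = "_embedding"


# Only the API key has to be given; everything else falls back to the defaults.
config = EmbeddingsConfig(api_key="sk-dummy")
print(config.model, config.buffer_size)
```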

Input/Output¤

  • Input: Entities with text data in specified paths
  • Output: Original entities enhanced with embedding vectors and source text
  • Embedding vectors are stored as string representations of float arrays
  • Source text used for embedding is preserved for reference
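
Because the vectors arrive as strings, downstream consumers need to parse them back into numbers. The exact string layout produced by the plugin is not specified here; the sketch below assumes a JSON-style list such as `[0.25, -0.5, 1.0]`, and the helper names are hypothetical.

```python
import json


def vector_to_string(vector: list[float]) -> str:
    """Serialize a float vector to a string (assumed JSON-style list layout)."""
    return json.dumps(vector)


def string_to_vector(value: str) -> list[float]:
    """Parse a stored embedding string back into a list of floats."""
    return [float(x) for x in json.loads(value)]


stored = vector_to_string([0.25, -0.5, 1.0])
restored = string_to_vector(stored)
print(stored)
```

If the actual output uses a different layout, only `string_to_vector` needs to be adapted.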

Use Cases¤

  • Semantic search and similarity matching
  • Text clustering and classification
  • Recommendation systems
  • Natural language processing pipelines
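
To illustrate the semantic search use case, embeddings are commonly compared with cosine similarity. The following is a minimal sketch operating on already parsed vectors; the function names and the toy two-dimensional vectors are illustrative only and not part of the plugin.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: list[float], candidates: dict[str, list[float]]) -> list[str]:
    """Return candidate keys ordered from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda key: cosine_similarity(query, candidates[key]),
        reverse=True,
    )


# Toy vectors; real embeddings have hundreds or thousands of dimensions.
ranking = rank_by_similarity(
    [1.0, 0.0],
    {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [1.0, 1.0]},
)
print(ranking)
```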

Parameters¤

Base URL¤

URL of the OpenAI API (without endpoint path and without trailing slash)

  • Datatype: string
  • Default Value: https://api.openai.com/v1

The OpenAI API key¤

Enter the OpenAI API key if one is required (or provide a dummy value if you access an unsecured endpoint).

  • Datatype: password
  • Default Value: None

The embeddings model, e.g. text-embedding-3-small¤

  • Datatype: string
  • Default Value: text-embedding-3-small

Timeout (Single Request, in Milliseconds)¤

  • Datatype: Long
  • Default Value: 10000
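
The base URL, API key, model, and timeout together determine a single request. How the plugin issues requests internally is not described here; the sketch below only shows how these values map onto the public OpenAI embeddings endpoint (`POST <base URL>/embeddings`) using the standard library. The function name is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request


def build_embeddings_request(
    base_url: str, api_key: str, model: str, texts: list[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the embeddings endpoint."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",  # base URL has no endpoint path and no trailing slash
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embeddings_request(
    "https://api.openai.com/v1", "sk-dummy", "text-embedding-3-small", ["hello world"]
)

# The timeout is configured in milliseconds; urllib expects seconds.
timeout_seconds = 10000 / 1000

# Sending would look like this (needs network access and a valid key):
# with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
#     payload = json.load(response)
print(request.full_url)
```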

Entries Processing Buffer¤

Number of input values sent per request.

  • Datatype: Long
  • Default Value: 100
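
The effect of the buffer size can be pictured as splitting the input texts into fixed-size chunks, one chunk per API request. The following is a minimal sketch of that idea, not the plugin's implementation; `chunk_texts` is a hypothetical helper.

```python
from collections.abc import Iterable, Iterator


def chunk_texts(texts: Iterable[str], buffer_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `buffer_size` texts."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    batch: list[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == buffer_size:
            yield batch
            batch = []
    if batch:  # remaining texts that did not fill a whole batch
        yield batch


# 250 texts with the default buffer size result in three requests.
batches = list(chunk_texts([f"text {i}" for i in range(250)], buffer_size=100))
print([len(batch) for batch in batches])
```

A larger buffer means fewer requests but larger payloads per request, which may need a longer timeout.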

Used entity paths (comma-separated list)¤

This value determines which input paths the workflow task uses. If it is left blank, all paths are used.

  • Datatype: string
  • Default Value: text
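
The selection rule described above (comma-separated list, blank means all paths) can be sketched as follows. The helper `select_paths` is hypothetical, and trimming whitespace around each entry is an assumption rather than documented behavior.

```python
def select_paths(used_paths: str, available_paths: list[str]) -> list[str]:
    """Return the paths to embed: the listed ones, or all available if blank."""
    # Assumption: surrounding whitespace and empty entries are ignored.
    requested = [path.strip() for path in used_paths.split(",") if path.strip()]
    if not requested:
        return list(available_paths)
    return requested


available = ["text", "title", "abstract"]
print(select_paths("text", available))
print(select_paths("", available))
```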

Entity Embedding text (output)¤

Path under which the text used for the embedding is written. Changing this value changes the output schema accordingly.

  • Datatype: string
  • Default Value: _embedding_source

Entity Embedding path (output)¤

Path under which the embedding vector is written. Changing this value changes the output schema accordingly.

  • Datatype: string
  • Default Value: _embedding
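
To make the relationship between the two output parameters and the output schema more concrete, here is a small sketch. It assumes the output schema consists of the used input paths followed by the two output paths; the ordering and the function name `output_schema_paths` are assumptions, not documented behavior.

```python
def output_schema_paths(
    input_paths: list[str],
    source_path: str = "_embedding_source",
    embedding_path: str = "_embedding",
) -> list[str]:
    """List the paths of an output entity (assumed order: inputs, source text, vector)."""
    return [*input_paths, source_path, embedding_path]


# With the defaults, an entity with a "text" path gains two additional paths.
default_schema = output_schema_paths(["text"])

# Renaming an output path is reflected directly in the schema.
custom_schema = output_schema_paths(["text"], embedding_path="vector")
print(default_schema, custom_schema)
```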
