Embedding Utilities

Embedding utility components help you work with embeddings: they convert text into embedding vectors and compare those vectors with similarity calculations.

Text Embedder

The Text Embedder component provides a unified interface for generating embeddings from text using various embedding models.

Usage

Text Embedder features:

  • Unified embedding interface

  • Multiple model support

  • Batch processing

  • Text preprocessing

  • Caching capabilities

Inputs

Name             | Display Name    | Info
-----------------|-----------------|------------------------------------------
text             | Text            | Input text to embed
embedding_model  | Embedding Model | Connected embedding model component
batch_size       | Batch Size      | Number of texts to process in each batch
cache_embeddings | Cache Results   | Whether to cache embedding results

Outputs

Name       | Display Name | Info
-----------|--------------|-----------------------------------------
embeddings | Embeddings   | Generated embeddings for the input text
vectors    | Vectors      | Raw numerical vectors
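
The interaction between batch_size and cache_embeddings can be sketched as follows. This is a minimal illustration under stated assumptions, not the component's implementation: `embed_texts` and the `embed_batch` callable (standing in for the connected embedding model) are hypothetical names, and the cache is assumed to be a plain dictionary keyed by the input text.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def embed_texts(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],  # hypothetical model call
    batch_size: int = 32,
    cache: Optional[Dict[str, Vector]] = None,
) -> List[Vector]:
    """Embed texts in batches, reusing cached vectors when a cache is given."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Without an external cache, a local dict still deduplicates within one call.
    store: Dict[str, Vector] = cache if cache is not None else {}
    # Distinct texts that still need a vector, in first-seen order.
    pending = [t for t in dict.fromkeys(texts) if t not in store]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for text, vector in zip(batch, embed_batch(batch)):
            store[text] = vector
    # One vector per input text, in the original order.
    return [store[t] for t in texts]
```

Passing the same dictionary as `cache` across calls means repeated texts are sent to the model only once.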

Embedding Similarity

The Embedding Similarity component scores a query embedding against a set of candidate embeddings using a selectable similarity or distance metric.

Usage

Similarity calculation features:

  • Multiple similarity metrics

  • Batch similarity calculation

  • Threshold filtering

  • Ranking and sorting

  • Performance optimization

Inputs

Name                 | Display Name         | Info
---------------------|----------------------|---------------------------------------------------
query_embedding      | Query Embedding      | Reference embedding for comparison
candidate_embeddings | Candidate Embeddings | Set of embeddings to compare against
similarity_metric    | Similarity Metric    | Similarity or distance metric (cosine, euclidean, dot)
threshold            | Threshold            | Minimum similarity score
top_k                | Top K                | Number of top results to return

Outputs

Name              | Display Name      | Info
------------------|-------------------|------------------------------------
similarity_scores | Similarity Scores | Calculated similarity scores
ranked_results    | Ranked Results    | Results sorted by similarity score
filtered_results  | Filtered Results  | Results above the threshold
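
How the threshold and top_k inputs could shape the ranked and filtered outputs is sketched below. The function name `rank_and_filter` is hypothetical, and two behaviors are assumptions rather than documented facts: higher scores are treated as more similar, and the threshold is applied after the top_k cut.

```python
from typing import List, Optional, Tuple

Scored = Tuple[int, float]  # (candidate index, similarity score)


def rank_and_filter(
    scores: List[float],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> Tuple[List[Scored], List[Scored]]:
    """Return (ranked_results, filtered_results) for precomputed scores."""
    # Highest score first; candidates with equal scores keep their input order.
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    # Keep only results at or above the threshold (assumed inclusive).
    filtered = [p for p in ranked if threshold is None or p[1] >= threshold]
    return ranked, filtered
```

For a distance metric such as Euclidean, where smaller means closer, the scores would need to be negated or otherwise inverted before they are passed to a function like this.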

Similarity Metrics

Cosine Similarity

  • Range: -1 to 1 (1 = same direction, 0 = orthogonal, -1 = opposite direction)

  • Use Case: Most common for text embeddings

  • Benefits: Normalized, angle-based comparison

Euclidean Distance

  • Range: 0 to ∞ (0 = identical, larger = more different)

  • Use Case: Spatial relationships

  • Benefits: Intuitive geometric interpretation

Dot Product

  • Range: -∞ to ∞ (higher = more similar)

  • Use Case: When magnitude matters

  • Benefits: Fast computation, considers vector magnitude

Manhattan Distance

  • Range: 0 to ∞ (0 = identical)

  • Use Case: Data with outliers or extreme values

  • Benefits: Less sensitive than Euclidean distance to a large difference in a single dimension
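
The four metrics above can be written out in a few lines of standard-library Python. This is a reference sketch of the textbook definitions, not the component's code; the function names are illustrative.

```python
import math
from typing import Sequence


def _check(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    _check(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Dot product divided by the product of the two vector lengths.
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / norm


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    _check(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    _check(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))
```

Cosine similarity and dot product grow with similarity, while the two distances shrink, so a threshold or ranking has to account for the direction of the chosen metric.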

Advanced Features

Batch Processing

  • Parallel Computation: Efficient batch similarity calculation

  • Memory Management: Optimized memory usage for large datasets

  • Chunking: Automatic chunking for very large datasets

  • Progress Tracking: Real-time processing progress
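
Chunking and progress tracking can be illustrated with a small loop that scores candidates one chunk at a time and reports how many have been processed. The names `iter_chunks`, `score_in_chunks`, and `on_progress` are hypothetical, and the sketch is sequential; it does not show the parallel computation mentioned above.

```python
from typing import Callable, Iterator, List, Optional, Sequence

Vector = Sequence[float]


def iter_chunks(items: Sequence[Vector], chunk_size: int) -> Iterator[Sequence[Vector]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def score_in_chunks(
    query: Vector,
    candidates: Sequence[Vector],
    score: Callable[[Vector, Vector], float],
    chunk_size: int = 1024,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> List[float]:
    """Score every candidate against the query, one chunk at a time."""
    scores: List[float] = []
    for chunk in iter_chunks(candidates, chunk_size):
        scores.extend(score(query, candidate) for candidate in chunk)
        if on_progress is not None:
            # Report (processed so far, total) after each chunk.
            on_progress(len(scores), len(candidates))
    return scores
```

Only one chunk of candidates is scored between progress reports, which bounds the work done per step on large inputs.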

Caching and Optimization

  • Embedding Cache: Cache frequently used embeddings

  • Result Cache: Cache similarity calculations

  • Index Optimization: Optimized similarity search

  • Memory Pooling: Efficient memory management
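
A result cache of the kind listed above could be approximated with Python's standard `functools.lru_cache`. This is only a sketch of the idea: `cached_cosine` is a hypothetical name, the cache size is arbitrary, and vectors are assumed to be passed as tuples because cache keys must be hashable.

```python
import math
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=4096)  # arbitrary size, chosen for illustration
def cached_cosine(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Cosine similarity, memoized on the (a, b) pair of vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

Calling `cached_cosine.cache_info()` reports hits and misses, which is a quick way to check whether a workload repeats comparisons often enough for caching to pay off.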

Text Preprocessing

  • Normalization: Text cleaning and normalization

  • Tokenization: Smart text tokenization

  • Language Detection: Automatic language detection

  • Encoding Handling: Proper text encoding management
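
The normalization and encoding steps might look like the sketch below, which uses only the Python standard library. The function name `preprocess_text` and the specific choices (UTF-8 decoding with replacement, NFKC normalization, whitespace collapsing) are assumptions made for illustration. Tokenization and language detection are left out because they depend on the embedding model and on third-party tooling.

```python
import re
import unicodedata
from typing import Union

_WHITESPACE = re.compile(r"\s+")


def preprocess_text(raw: Union[str, bytes], lowercase: bool = False) -> str:
    """Decode, Unicode-normalize, and tidy text before embedding."""
    if isinstance(raw, bytes):
        # Undecodable bytes become U+FFFD instead of raising an error.
        raw = raw.decode("utf-8", errors="replace")
    # NFKC folds compatibility forms (e.g. full-width letters) into canonical ones.
    text = unicodedata.normalize("NFKC", raw)
    # Drop non-printable control characters but keep whitespace for the next step.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Collapse runs of whitespace to single spaces and trim the ends.
    text = _WHITESPACE.sub(" ", text).strip()
    return text.lower() if lowercase else text
```

Lowercasing is off by default in this sketch because many embedding models are case-sensitive.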

Use Cases

  • Find similar documents

  • Content recommendation

  • Duplicate detection

  • Content clustering

Question Answering

  • Find relevant passages

  • Context retrieval

  • Answer ranking

  • Knowledge base search

Content Analysis

  • Document similarity

  • Topic modeling

  • Content classification

  • Sentiment analysis

Usage Notes

  • Performance: Optimized for large-scale similarity calculations

  • Flexibility: Support for various embedding models and metrics

  • Scalability: Efficient batch processing capabilities

  • Accuracy: High-precision similarity calculations

  • Integration: Easy integration with vector databases

  • Monitoring: Built-in performance monitoring and logging
