Embedding Utilities
Embedding utility components provide helper functions and tools for working with embeddings, including text processing and similarity calculations.
Text Embedder
The Text Embedder component provides a unified interface for generating embeddings from text using various embedding models.
Usage
Text Embedder features:
Unified embedding interface
Multiple model support
Batch processing
Text preprocessing
Caching capabilities
Inputs
text (Text): Input text to embed
embedding_model (Embedding Model): Connected embedding model component
batch_size (Batch Size): Number of texts to process in each batch
cache_embeddings (Cache Results): Whether to cache embedding results
Outputs
embeddings (Embeddings): Generated embeddings for the input text
vectors (Vectors): Raw numerical vectors
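The interaction between batch_size and cache_embeddings can be sketched as follows. This is a minimal illustration, not the component's implementation: toy_embed is a hypothetical stand-in for a connected embedding model, and embed_texts is an assumed helper name.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def toy_embed(texts: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for a connected embedding model.

    Maps each text to three simple character statistics.
    """
    return [
        [float(len(t)), float(sum(c.isalpha() for c in t)), float(t.count(" "))]
        for t in texts
    ]


def embed_texts(
    texts: Sequence[str],
    embedding_model: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,
    cache: Optional[Dict[str, Vector]] = None,
) -> List[Vector]:
    """Embed texts in batches, reusing cached vectors when a cache is given."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    store = cache if cache is not None else {}
    # Embed each unseen text once, keeping first-seen order.
    pending = list(dict.fromkeys(t for t in texts if t not in store))
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for text, vector in zip(batch, embedding_model(batch)):
            store[text] = vector
    # Return one vector per input text, in input order.
    return [store[t] for t in texts]
```

Passing the same dictionary as cache across calls means repeated texts are never sent to the model again; omitting it gives only within-call deduplication.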
Embedding Similarity
The Embedding Similarity component calculates similarity scores between embeddings using various distance metrics.
Usage
Embedding Similarity features:
Multiple similarity metrics
Batch similarity calculation
Threshold filtering
Ranking and sorting
Performance optimization
Inputs
query_embedding (Query Embedding): Reference embedding for comparison
candidate_embeddings (Candidate Embeddings): Set of embeddings to compare against
similarity_metric (Similarity Metric): Similarity or distance metric to use (cosine, euclidean, dot)
threshold (Threshold): Minimum similarity score
top_k (Top K): Number of top results to return
Outputs
similarity_scores (Similarity Scores): Calculated similarity scores
ranked_results (Ranked Results): Results sorted by similarity score
filtered_results (Filtered Results): Results above the threshold
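How the inputs map to the three outputs can be illustrated with a short sketch. It assumes cosine similarity, assumes that top_k limits only the ranked list, and assumes that threshold is inclusive; the function name compare_embeddings and the (index, score) result shape are hypothetical choices for the example, not the component's documented behavior.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def compare_embeddings(
    query_embedding: Sequence[float],
    candidate_embeddings: Sequence[Sequence[float]],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> Dict[str, list]:
    """Score every candidate, then derive ranked and filtered views."""
    scores: List[float] = [
        cosine_similarity(query_embedding, c) for c in candidate_embeddings
    ]
    # (candidate index, score) pairs, highest score first.
    ranked: List[Tuple[int, float]] = sorted(
        enumerate(scores), key=lambda pair: pair[1], reverse=True
    )
    filtered = [p for p in ranked if threshold is None or p[1] >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return {
        "similarity_scores": scores,
        "ranked_results": ranked,
        "filtered_results": filtered,
    }
```

Returning candidate indices rather than the vectors themselves keeps the results small and lets the caller look up the original documents.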
Similarity Metrics
Cosine Similarity
Range: -1 to 1 (1 = same direction, 0 = orthogonal, -1 = opposite direction)
Use Case: Most common for text embeddings
Benefits: Normalized, angle-based comparison
Euclidean Distance
Range: 0 to ∞ (0 = identical, larger = more different)
Use Case: Spatial relationships
Benefits: Intuitive geometric interpretation
Dot Product
Range: -∞ to ∞ (higher = more similar)
Use Case: When magnitude matters
Benefits: Fast computation, considers vector magnitude
Manhattan Distance
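Range: 0 to ∞ (0 = identical, larger = more different)
Use Case: Noisy data where a few large component differences should not dominate the result
Benefits: Less sensitive to extreme values than Euclidean distance, because differences are not squared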
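The four metrics above can be written out directly from their standard definitions. The sketch below uses only the Python standard library; the function names are illustrative and are not the component's API.

```python
import math
from typing import Sequence


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; grows with vector magnitude."""
    _check_same_length(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by both vector lengths; ignores magnitude."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance: square root of summed squared differences."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences along each dimension."""
    _check_same_length(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))
```

For a = [3.0, 4.0] and b = [6.0, 8.0], which point in the same direction but differ in length, the cosine similarity is 1.0 while the dot product is 50.0, the Euclidean distance is 5.0, and the Manhattan distance is 7.0. The pair illustrates why cosine similarity alone cannot tell apart vectors that differ only in magnitude.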
Advanced Features
Batch Processing
Parallel Computation: Efficient batch similarity calculation
Memory Management: Optimized memory usage for large datasets
Chunking: Automatic chunking for very large datasets
Progress Tracking: Real-time processing progress
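Chunking with progress reporting might look like the following sketch. It processes chunks sequentially (no parallelism is shown), uses a dot-product score for brevity, and the names iter_chunks, score_in_chunks, and on_progress are assumptions made for the example.

```python
from typing import Callable, Iterator, List, Optional, Sequence


def iter_chunks(
    items: Sequence[Sequence[float]], chunk_size: int
) -> Iterator[Sequence[Sequence[float]]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def score_in_chunks(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    chunk_size: int = 1000,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> List[float]:
    """Dot-product score for every candidate, computed one chunk at a time."""
    scores: List[float] = []
    for chunk in iter_chunks(candidates, chunk_size):
        scores.extend(sum(q * c for q, c in zip(query, vec)) for vec in chunk)
        if on_progress is not None:
            # Report (candidates processed so far, total candidates).
            on_progress(len(scores), len(candidates))
    return scores
```

A caller can pass something like `lambda done, total: print(f"{done}/{total}")` as on_progress to surface progress while a large candidate set is scored.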
Caching and Optimization
Embedding Cache: Cache frequently used embeddings
Result Cache: Cache similarity calculations
Index Optimization: Optimized similarity search
Memory Pooling: Efficient memory management
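A result cache can be approximated with the standard library's functools.lru_cache, as in the sketch below. Vectors are passed as tuples because cached arguments must be hashable; cached_dot is a hypothetical name, and the cache size of 1024 is an arbitrary choice for illustration.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=1024)
def cached_dot(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Dot product of two vectors, memoized on the exact (a, b) pair."""
    return sum(x * y for x, y in zip(a, b))
```

Calling cached_dot twice with the same pair computes the sum once; cached_dot.cache_info() reports hits and misses, which helps judge whether the cache is worth its memory.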
Text Preprocessing
Normalization: Text cleaning and normalization
Tokenization: Smart text tokenization
Language Detection: Automatic language detection
Encoding Handling: Proper text encoding management
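A minimal preprocessing pass covering encoding handling, normalization, and tokenization could look like this. It is a sketch using only the standard library: the specific steps (UTF-8 decoding with replacement, NFKC normalization, whitespace collapsing, lowercasing, word-character tokenization) are assumptions rather than the component's documented behavior, and language detection is omitted because it needs a third-party library.

```python
import re
import unicodedata
from typing import List, Union


def to_text(raw: Union[str, bytes], encoding: str = "utf-8") -> str:
    """Decode bytes to str, replacing undecodable bytes instead of failing."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw


def normalize_text(raw: Union[str, bytes]) -> str:
    """Apply NFKC normalization, collapse whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", to_text(raw))
    return re.sub(r"\s+", " ", text).strip().lower()


def simple_tokenize(raw: Union[str, bytes]) -> List[str]:
    """Split normalized text into runs of word characters."""
    return re.findall(r"\w+", normalize_text(raw))
```

Embedding models usually ship their own tokenizer, so a step like simple_tokenize is mainly useful for cache keys or deduplication rather than as model input.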
Use Cases
Semantic Search
Find similar documents
Content recommendation
Duplicate detection
Content clustering
Question Answering
Find relevant passages
Context retrieval
Answer ranking
Knowledge base search
Content Analysis
Document similarity
Topic modeling
Content classification
Sentiment analysis
Usage Notes
Performance: Optimized for large-scale similarity calculations
Flexibility: Support for various embedding models and metrics
Scalability: Efficient batch processing capabilities
Accuracy: High-precision similarity calculations
Integration: Easy integration with vector databases
Monitoring: Built-in performance monitoring and logging