In real-time data environments, especially those involving financial markets, news aggregation, or sensor networks, data is constantly flowing and changing. Keeping a model or application contextually aware of this incoming information is essential. However, constantly updating expensive models with every micro-update can be computationally prohibitive and economically inefficient.
This drives the need for a lightweight pre-step that can triage incoming data changes before triggering heavy context updates.
The Case for a Cheap Pre-Step
Instead of passing every new fact or update through a full model recomputation, systems can use a cheaper, approximate check to detect whether a meaningful change has occurred. This approach dramatically reduces the load on downstream expensive processes and focuses computational effort only where necessary.
Such a pre-step should:
- Be fast and computationally cheap.
- Approximate the semantic change in incoming data.
- Be sensitive enough to detect meaningful shifts but robust against noise.
In other words, we want to detect when there is a significant "drift in truth" - a real semantic change that would justify an expensive context update.
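To make the triage pattern concrete, here is a minimal Python sketch of such a gate. The `DriftGate` class and the word-overlap `jaccard_distance` stand-in are illustrative inventions, not a prescribed implementation: a real deployment would swap in the embedding-based distance covered in the following sections.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Cheap stand-in for a semantic distance: 1 minus word overlap.

    A real system would use an embedding-based cosine distance;
    this keeps the sketch self-contained.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(wa & wb) / len(wa | wb)


class DriftGate:
    """Triage incoming updates: only run the expensive context
    update when the cheap distance check crosses the threshold."""

    def __init__(self, threshold, expensive_update):
        self.threshold = threshold
        self.expensive_update = expensive_update
        self.current_truth = None

    def observe(self, update):
        """Return True if the expensive update was triggered."""
        if (self.current_truth is None
                or jaccard_distance(self.current_truth, update) > self.threshold):
            self.current_truth = update
            self.expensive_update(update)
            return True
        return False
```

Note that the gate only advances `current_truth` when an update actually triggers, so small paraphrases keep being compared against the last accepted truth rather than drifting incrementally past the threshold.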
Detecting Truth Drift
Truth drift occurs when the underlying facts or conditions have changed enough that prior inferences, plans, or contexts are no longer valid. In AI-driven systems, responding to truth drift appropriately is crucial to maintaining relevance and accuracy.
A naive system might simply compare raw strings or numbers, but semantic drift often isn't captured at the surface level. Instead, we can turn to sentence embedding models to interpret changes more meaningfully.
Using Sentence Models for Approximate Vectors
Sentence models take a piece of text and map it into a dense, high-dimensional vector that captures the semantic meaning of the sentence.
Popular families of sentence embedding models include:
- Sentence-BERT (SBERT): Fine-tunes BERT to produce semantically meaningful sentence embeddings.
- Universal Sentence Encoder (USE): Provides quick, general-purpose sentence embeddings.
- MiniLM: A lightweight, fast alternative that trades some accuracy for speed.
- Instructor (e.g. Instructor-XL): A family of open models that can encode text with task-specific instruction prompts.
By using one of these models, an incoming update (e.g., "stock prices are surging" vs. "stock prices are declining") can be transformed into a vector that reflects the deep meaning of the sentence, not just its surface form.
How a Single Vector Is Derived: Mean-Pooling
Most transformer-based models output a sequence of token embeddings, one for each word or subword in the input sentence. To derive a single sentence vector from these multiple token embeddings, a common approach is mean-pooling.
In mean-pooling, we:
- Compute the element-wise average of all the token embeddings.
- Use the resulting single, fixed-size vector to represent the entire sentence.
Mathematically, if the model produces token embeddings $h_1, h_2, \ldots, h_n$ for a sentence with $n$ tokens, the sentence embedding $s$ is:

$$s = \frac{1}{n} \sum_{i=1}^{n} h_i$$
This method is simple, fast, and generally effective at summarizing the overall semantic content of a sentence, especially when combined with a model trained for sentence-level tasks.
Other strategies exist (such as using the [CLS] token embedding or attention-weighted pooling), but mean-pooling remains the most common and robust for general-purpose sentence embedding tasks.
It's worth noting that we're not just casually averaging vectors read straight off the words of the sentence. The process that produces those token vectors is more involved:
- You feed in your tokens (input IDs).
- The model (like BERT or MiniLM) runs them through all its layers — often 12, 24, or more Transformer encoder blocks.
- Each token ends up with a final hidden vector — a high-dimensional representation.
- Those final hidden vectors already "bake in" all the attention weights, intermediate transformations, and learning.
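Given those final hidden states, the pooling step itself is simple. Here is a small numpy sketch, assuming the model's final hidden states are already available as a matrix; in practice the average excludes padding positions using the attention mask, which is what the mask handling below illustrates (function name and shapes are illustrative).

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, hidden_dim) final hidden states
    attention_mask:   (seq_len,) 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = mask.sum()
    return summed / np.clip(count, 1e-9, None)  # avoid divide-by-zero
```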
Cosine Distance for Drift Detection
Once sentences are mapped into vectors, we can use cosine distance to measure how much two vectors differ in their semantic meaning.
The cosine distance between two vectors $u$ and $v$ is defined as:

$$d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\|u\| \, \|v\|}$$
- A distance close to 0 implies the two sentences have very similar meanings.
- A distance close to 1 implies the sentences are semantically very different.
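The distance is cheap to compute, for example with numpy. One caveat worth a comment: for arbitrary vectors the value can reach 2 (exactly opposite directions), though embedding pairs rarely land there.

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 minus cosine similarity.

    0 = same direction, 1 = orthogonal, 2 = exactly opposite.
    """
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - float(sim)
```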
By setting a threshold (e.g., cosine distance > 0.2), systems can detect when an incoming update is meaningfully different from the current truth, and thus trigger a heavier, more expensive recomputation.
This method is fast, scalable, and model-agnostic: any sentence embedding model can be plugged in depending on the latency and quality trade-offs desired.
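Putting the pieces together, the threshold check might look like the sketch below. The `should_update` name and signature are illustrative; `encode` stands in for whatever sentence model you plug in (e.g. a wrapped MiniLM or SBERT encoder), which is exactly the model-agnostic swap point described above.

```python
import numpy as np

def should_update(encode, current_text, new_text, threshold=0.2):
    """Return True when the incoming text has drifted far enough
    from the current truth to justify an expensive recomputation.

    `encode` is any function mapping a string to a 1-D numpy
    vector -- e.g. a wrapped MiniLM or SBERT model.
    """
    u, v = encode(current_text), encode(new_text)
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return bool((1.0 - sim) > threshold)
```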
Demo and Playground
We can use onnxruntime-web to run some of these small models in the browser. The library runs the model code directly via WebAssembly, uses WebGPU when it's available, and will adopt the upcoming WebNN (Web Neural Network) API once it's more widely supported.
The models listed below are a selection of sentence models, all in the ONNX format, quantized to 8-bit integers, with graph optimization level O1 applied. This is all fairly lightweight, but that's the point: we're designing this as a lightweight, real-time-compatible filter step.
Use the interactive demo below to select or edit three sentences. The cosine distances are shown in the graph below (click Analyze to see the graph).
Sentence Models: Options and Trade-Offs
Here are some notable models to consider:
| Model | Strengths | Trade-offs |
| --- | --- | --- |
| Sentence-BERT (SBERT) | High accuracy, tuned for semantic similarity | Larger, slower than some alternatives |
| MiniLM | Fast, small footprint, surprisingly good | Slightly less precise at subtle semantic differences |
| Universal Sentence Encoder (USE) | Very fast, easy to use | Best suited for English, weaker on nuance |
| Instructor-XL | Instruction-tuned, customizable embeddings | Larger, newer, still being benchmarked |
| e5-small / e5-large | Open-weight models tuned for retrieval and semantic search | May require prompt tuning for best results |
Choice depends on your operating point:
- If latency matters most: MiniLM or USE.
- If accuracy matters most: SBERT or larger e5 variants.
These models are mostly transformer-based. For example, the BERT class of models (Bidirectional Encoder Representations from Transformers) uses multiple attention heads and QKV-based weights, just like an LLM. The difference is that it's trained to encode text, filling in masked gaps, and has no auto-regressive 'decoder' step (e.g. predicting the next token) in its architecture.
Conclusion
Efficient real-time systems need to be selective about when they perform expensive context updates. By using cheap vector approximations with sentence models, mean-pooling, and cosine distance, you can detect "truth drift" effectively without overwhelming your compute budget.
This is one approach to optimizing model calculations for fast-moving real-time feeds. With a well-tuned model, it should be possible to limit calls into a larger transformer model to the moments when there's meaningful change, or 'truth drift'.