In real-time data environments, especially those involving financial markets, news aggregation, or sensor networks, data is constantly flowing and changing. Keeping a model or application contextually aware of this incoming information is essential. However, constantly updating expensive models with every micro-update can be computationally prohibitive and economically inefficient.
This drives the need for a lightweight pre-step that can triage incoming data changes before triggering heavy context updates.
The Case for a Cheap Pre-Step
Instead of passing every new fact or update through a full model recomputation, systems can use a cheaper, approximate check to detect whether a meaningful change has occurred. This approach dramatically reduces the load on downstream expensive processes and focuses computational effort only where necessary.
Such a pre-step should:
- Be fast and computationally cheap.
- Approximate the semantic change in incoming data.
- Be sensitive enough to detect meaningful shifts but robust against noise.
In other words, we want to detect when there is a significant "drift in truth" - a real semantic change that would justify an expensive context update.
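To make the triage pattern concrete, here is a minimal Python sketch of such a gate. The `DriftGate` class and the word-overlap `jaccard_distance` stand-in are illustrative inventions, not a prescribed implementation: a real deployment would swap in the embedding-based distance covered in the following sections.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Cheap stand-in for a semantic distance: 1 minus word overlap.

    A real system would use an embedding-based cosine distance;
    this keeps the sketch self-contained.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(wa & wb) / len(wa | wb)


class DriftGate:
    """Triage incoming updates: only run the expensive context
    update when the cheap distance check crosses the threshold."""

    def __init__(self, threshold, expensive_update):
        self.threshold = threshold
        self.expensive_update = expensive_update
        self.current_truth = None

    def observe(self, update):
        """Return True if the expensive update was triggered."""
        if (self.current_truth is None
                or jaccard_distance(self.current_truth, update) > self.threshold):
            self.current_truth = update
            self.expensive_update(update)
            return True
        return False
```

Note that the gate only advances `current_truth` when an update actually triggers, so small paraphrases keep being compared against the last accepted truth rather than drifting incrementally past the threshold.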
Detecting Truth Drift
Truth drift occurs when the underlying facts or conditions have changed enough that prior inferences, plans, or contexts are no longer valid. In AI-driven systems, responding to truth drift appropriately is crucial to maintaining relevance and accuracy.
A naive system might simply compare raw strings or numbers, but semantic drift often isn't captured at the surface level. Instead, we can turn to sentence embedding models to interpret changes more meaningfully.
Using Sentence Models for Approximate Vectors
Sentence models take a piece of text and map it into a dense, high-dimensional vector that captures the semantic meaning of the sentence.
Popular families of sentence embedding models include:
- Sentence-BERT (SBERT): Fine-tunes BERT to produce semantically meaningful sentence embeddings.
- Universal Sentence Encoder (USE): Provides quick, general-purpose sentence embeddings.
- MiniLM: A lightweight, fast alternative that trades some accuracy for speed.
- Instructor (e.g. Instructor-XL): A family of open models that can encode text with task-specific instruction prompts.
By using one of these models, an incoming update (e.g., "stock prices are surging" vs. "stock prices are declining") can be transformed into a vector that reflects the deep meaning of the sentence, not just its surface form.
How a Single Vector Is Derived: Mean-Pooling
Most transformer-based models output a sequence of token embeddings, one for each word or subword in the input sentence. To derive a single sentence vector from these multiple token embeddings, a common approach is mean-pooling.
In mean-pooling, we:
- Compute the element-wise average of all the token embeddings.
- Use the resulting single, fixed-size vector to represent the entire sentence.
Mathematically, if the model produces token embeddings $h_1, h_2, \ldots, h_n$ for a sentence with $n$ tokens, the sentence embedding $s$ is:

$$s = \frac{1}{n} \sum_{i=1}^{n} h_i$$
This method is simple, fast, and generally effective at summarizing the overall semantic content of a sentence, especially when combined with a model trained for sentence-level tasks.
Other strategies exist (such as using the [CLS] token embedding or attention-weighted pooling), but mean-pooling remains the most common and robust for general-purpose sentence embedding tasks.
It's worth noting that we're not just casually averaging vectors read straight off the words of the sentence. The process that produces those token vectors is more involved:
- You feed in your tokens (input IDs).
- The model (like BERT or MiniLM) runs them through all its layers — often 12, 24, or more Transformer encoder blocks.
- Each token ends up with a final hidden vector — a high-dimensional representation.
- Those final hidden vectors already "bake in" all the attention weights, intermediate transformations, and learning.
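Given those final hidden states, the pooling step itself is simple. Here is a small numpy sketch, assuming the model's final hidden states are already available as a matrix; in practice the average excludes padding positions using the attention mask, which is what the mask handling below illustrates (function name and shapes are illustrative).

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, hidden_dim) final hidden states
    attention_mask:   (seq_len,) 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = mask.sum()
    return summed / np.clip(count, 1e-9, None)  # avoid divide-by-zero
```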
Cosine Distance for Drift Detection
Once sentences are mapped into vectors, we can use cosine distance to measure how much two vectors differ in their semantic meaning.
The cosine distance between two vectors $u$ and $v$ is defined as:

$$d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\|u\| \, \|v\|}$$
- A distance close to 0 implies the two sentences have very similar meanings.
- A distance close to 1 implies the sentences are semantically very different.
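The distance is cheap to compute, for example with numpy. One caveat worth a comment: for arbitrary vectors the value can reach 2 (exactly opposite directions), though embedding pairs rarely land there.

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 minus cosine similarity.

    0 = same direction, 1 = orthogonal, 2 = exactly opposite.
    """
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - float(sim)
```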
By setting a threshold (e.g., cosine distance > 0.2), systems can detect when an incoming update is meaningfully different from the current truth, and thus trigger a heavier, more expensive recomputation.
This method is fast, scalable, and model-agnostic: any sentence embedding model can be plugged in depending on the latency and quality trade-offs desired.
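Putting the pieces together, the threshold check might look like the sketch below. The `should_update` name and signature are illustrative; `encode` stands in for whatever sentence model you plug in (e.g. a wrapped MiniLM or SBERT encoder), which is exactly the model-agnostic swap point described above.

```python
import numpy as np

def should_update(encode, current_text, new_text, threshold=0.2):
    """Return True when the incoming text has drifted far enough
    from the current truth to justify an expensive recomputation.

    `encode` is any function mapping a string to a 1-D numpy
    vector -- e.g. a wrapped MiniLM or SBERT model.
    """
    u, v = encode(current_text), encode(new_text)
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return bool((1.0 - sim) > threshold)
```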
Demo and Playground
We can use onnxruntime-web to run some of these small models in the browser. The library runs the model code directly via WebAssembly, uses WebGPU when it's available, and will adopt the upcoming WebNN (Web Neural Network) API once it's more widely supported.
The models listed below are a selection of sentence models, all in the ONNX format, quantized to 8-bit integers, with graph optimization level O1 applied. This is all fairly lightweight, but that's the point: we're designing this as a lightweight, real-time-compatible filter step.
Use the interactive demo below to select or edit three sentences. The cosine distances are shown in the graph below (click Analyze to see the graph).
Sentence Models: Options and Trade-Offs
Here are some notable models to consider:
| Model | Strengths | Trade-offs |
| --- | --- | --- |
| Sentence-BERT (SBERT) | High accuracy, tuned for semantic similarity | Larger, slower than some alternatives |
| MiniLM | Fast, small footprint, surprisingly good | Slightly less precise at subtle semantic differences |
| Universal Sentence Encoder (USE) | Very fast, easy to use | Best suited for English, weaker on nuance |
| Instructor-XL | Instruction-tuned, customizable embeddings | Larger, newer, still being benchmarked |
| e5-small / e5-large | Open-weight models tuned for retrieval and semantic search | May require prompt tuning for best results |
Choice depends on your operating point:
- If latency matters most: MiniLM or USE.
- If accuracy matters most: SBERT or larger e5 variants.
These models are mostly transformer-based. For example, the BERT class of models (Bidirectional Encoder Representations from Transformers) uses multiple attention heads and QKV-based weights, just like an LLM. The difference is that it's trained to encode text, filling in masked gaps, and has no auto-regressive 'decoder' step (e.g. predicting the next token) in its architecture.
Conclusion
Efficient real-time systems need to be selective about when they perform expensive context updates. By using cheap vector approximations with sentence models, mean-pooling, and cosine distance, you can detect "truth drift" effectively without overwhelming your compute budget.
This is one approach to optimizing model calculations for fast-moving real-time feeds. With a well-tuned model, it should be possible to limit calls into a larger transformer model to the moments when there's meaningful change, or 'truth drift'.