The randomness of LLM outputs is controlled by a parameter known as “temperature.” A higher temperature increases randomness, while a lower temperature produces “more deterministic” outputs. Interestingly, documentation from major LLM providers consistently includes noteworthy clarifications regarding temperature:
Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
Note that even with temperature of 0.0, the results will not be fully deterministic.
A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.
Thus, despite setting the temperature to 0, none of the popular providers guarantee determinism. This naturally raises an interesting question: why is determinism elusive with LLMs?
Temperature
Before diving deeper into LLM randomness, it is useful to understand what temperature actually is.
LLMs generate text by predicting the next token (such as a word) based on the preceding tokens. To do this, they produce a score for each possible next token, called a logit, representing how likely that token is to come next. These logits aren't probabilities; they can be any real number. To convert them into probabilities, LLMs use the softmax function:
p_i = exp(z_i) / sum_j exp(z_j)
where z_i is the logit of token i and p_i is its resulting probability.
Temperature (T) adjusts these logits by dividing them by T before applying softmax:
p_i = exp(z_i / T) / sum_j exp(z_j / T)
Low temperature (T < 1): Dividing by T pushes the logits further apart, concentrating probability on the most likely tokens and producing less random output.
High temperature (T > 1): Dividing by T pulls the logits closer together, spreading probability more evenly and producing more diverse, random output.
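To make this concrete, here is a minimal NumPy sketch (the logit values are made up for illustration) showing how temperature reshapes the probability distribution:
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Scale the logits by the temperature before applying softmax.
    scaled = np.array(logits, dtype=np.float64) / temperature
    # Subtract the max for numerical stability (does not change the result).
    scaled -= scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]  # hypothetical logits for three tokens

print(softmax_with_temperature(logits, 1.0))  # roughly [0.63, 0.23, 0.14] - baseline
print(softmax_with_temperature(logits, 0.2))  # roughly [0.99, 0.01, 0.00] - nearly greedy
print(softmax_with_temperature(logits, 2.0))  # roughly [0.48, 0.29, 0.23] - flatter, more random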

When the temperature is set to zero, the softmax formula becomes mathematically undefined due to division by zero. In practice, implementations handle this case by switching to greedy sampling, also known as greedy decoding: the model deterministically selects the token with the highest logit at each step, bypassing probabilistic sampling entirely and turning the selection into a simple argmax operation. So why are LLMs non-deterministic even at zero temperature?
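As a toy sketch of how a decoding step typically handles this case (real decoders operate on full vocabularies and batched tensors):
import numpy as np

def sample_next_token(logits, temperature, rng=None):
    # At temperature 0, skip sampling entirely and take the highest logit (greedy decoding).
    if temperature == 0.0:
        return int(np.argmax(logits))
    # Otherwise, sample from the temperature-adjusted softmax distribution.
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.array(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))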
Non-Associativity of Floats
As stated in “What Every Computer Scientist Should Know About Floating-Point Arithmetic” by David Goldberg:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 10^30, y = -10^30 and z = 1 (it is 1 in the former case, 0 in the latter). The importance of preserving parentheses cannot be overemphasized.
Python users can test this non-associativity of floats simply by evaluating the expressions below:
(1 + 1e16) - 1e16
# 0.0
1 + (1e16 - 1e16)
# 1.0
This non-associativity becomes relevant in parallel computations, such as those performed on GPUs. Technically speaking, the issue of floating-point non-associativity and error accumulation arises in any software that performs concurrent computations with floating-point numbers, where race conditions can lead to variability in execution order. Many GPU operations are non-deterministic because their default thread scheduling implementation is non-deterministic. Depending on the execution order and the effects of floating-point non-associativity, the accumulation of numerical errors can vary. While it’s possible to enforce determinism, it’s usually avoided: non-deterministic algorithms can be significantly faster, and when performance is the priority, that becomes the default implementation.
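For example, PyTorch documents several CUDA operations that are non-deterministic by default because they rely on atomic additions whose accumulation order varies between runs. The sketch below assumes a CUDA-capable GPU and an installed PyTorch (on CPU the same operation is deterministic); it repeatedly accumulates the same values into a small tensor and checks whether the results match bit-for-bit:
import torch

assert torch.cuda.is_available(), "This demonstration requires a CUDA GPU."

def bucketed_sum():
    torch.manual_seed(0)
    src = torch.randn(1_000_000, device="cuda")
    index = torch.randint(0, 10, (1_000_000,), device="cuda")
    out = torch.zeros(10, device="cuda")
    # index_add_ on CUDA uses atomic adds, so the accumulation order
    # (and therefore the rounding) can differ from run to run.
    out.index_add_(0, index, src)
    return out.cpu()

results = [bucketed_sum() for _ in range(5)]
print(all(torch.equal(results[0], r) for r in results[1:]))  # often False

# Determinism can be enforced, at a performance cost:
# torch.use_deterministic_algorithms(True)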
This variation can already be observed on a single GPU. In multi-GPU setups, additional variability is introduced through intra-node communication (within a machine) and inter-node communication (across machines). Variations from inter-node communication can become quite noticeable when scaling up to hundreds or thousands of GPUs.
While the error introduced by a single operation may be small, it accumulates over many operations, model layers, and nodes. In some cases, this compounded error can be enough to change the ranking of the top-k most probable logits, ultimately affecting the outcome of the greedy sampler.
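A minimal CPU-only sketch of the effect (the numbers are synthetic; real models accumulate error across matrix multiplications and layers): summing the same float32 values in a different order typically changes the last bits of the result, and if two logits are close enough, that is all it takes to flip the argmax.
import numpy as np

rng = np.random.default_rng(0)
terms = rng.standard_normal(100_000).astype(np.float32)

# The same mathematical sum, accumulated in two different orders.
logit_a_order1 = np.sum(terms)
logit_a_order2 = np.sum(rng.permutation(terms))
print(logit_a_order1 == logit_a_order2)  # typically False: the results differ in the last bits

# If a competing token's logit happens to sit between the two results,
# the greedy argmax depends on which accumulation order was used.
logit_b = (logit_a_order1 + logit_a_order2) / 2
print(np.argmax([logit_a_order1, logit_b]), np.argmax([logit_a_order2, logit_b]))  # typically 0 and 1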
Sparse Mixture of Experts
On August 5, 2023, Sherman Chann published an excellent article titled “Non-determinism in GPT-4 is caused by Sparse MoE” (I encourage everyone to read it). In the article, he highlights behaviour described in Puigcerver et al., “From Sparse to Soft Mixtures of Experts”:
Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens often compete against each other for available spots in expert buffers. As a consequence, the model is no longer deterministic at the sequence-level, but only at the batch-level, as some input sequences may affect the final prediction for other inputs.
In other words, even if you submit the exact same input multiple times, you may receive different outputs depending on what other inputs are processed in the same batch. This is because tokens from different input sequences are grouped together and may compete for the same expert resources within the model. As a result, the presence or absence of certain other sequences in the batch can influence the routing decisions made for your input. Since batching is a common optimization strategy among model providers—used to increase throughput and reduce costs—this variation in response is a natural side effect of how sparse mixture of experts models operate.
It's important to note that only models using a sparse mixture-of-experts architecture are affected by this particular mechanism. Still, it's a good reminder that randomness can be baked into the design of a model, making it extremely difficult to eliminate.
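To make the competition for expert capacity concrete, here is a toy routing sketch. The scores, capacities, and token names are entirely hypothetical; real routers score tokens with a learned gating network. With a per-expert capacity limit, whether a token gets its preferred expert depends on the other tokens batched with it.
def route(batch, capacity=2):
    # batch: list of (token_id, preferred_expert, gating_score) tuples.
    # Tokens are admitted to their preferred expert in order of gating score
    # until the expert's capacity is exhausted; the rest overflow.
    load = {}
    assignment = {}
    for token_id, expert, score in sorted(batch, key=lambda t: -t[2]):
        if load.get(expert, 0) < capacity:
            load[expert] = load.get(expert, 0) + 1
            assignment[token_id] = expert
        else:
            assignment[token_id] = None  # overflow: dropped, rerouted, etc.
    return assignment

my_token = ("my_token", 0, 0.61)

# Batch 1: the other requests barely use expert 0, so my token is served by it.
print(route([my_token, ("other_a", 1, 0.9), ("other_b", 0, 0.5)])["my_token"])  # 0

# Batch 2: higher-scoring tokens from other requests fill expert 0 first, so my token overflows.
print(route([my_token, ("other_a", 0, 0.9), ("other_b", 0, 0.8)])["my_token"])  # None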
Closing Thoughts
Even if you're self-hosting the LLMs used in your product, it's worth exploring how the model behaves at temperature 0 with your actual prompts. This helps you assess whether the model's variability stays within acceptable limits, especially if the model uses a sparse MoE architecture. If you're using external APIs, it's also worth setting up model drift monitoring: since you have no control over, or visibility into, changes in infrastructure, model updates, or how the provider handles distributed computation, some variation may emerge over time.
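A simple way to start is to replay a representative prompt many times at temperature 0 and count how many distinct completions come back. The sketch below assumes the OpenAI Python client, with the model name and prompt as placeholders; the same idea applies to any provider or self-hosted endpoint.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def distinct_completions(prompt, n=20, model="gpt-4o-mini"):
    outputs = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        outputs.append(response.choices[0].message.content)
    return Counter(outputs)

# A perfectly deterministic model would yield a single entry with count n.
print(distinct_completions("Summarize the plot of Hamlet in one sentence."))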