In 2018, Kai joins an online chess platform rated 1200, the default for new players. He's actually much stronger, having studied tactics for years offline. His first opponent, Mira, is rated 1450. The system calculates Kai's expected score: by the Elo formula, a 250-point deficit gives him an expected score of roughly 0.19. Kai wins. Because the system expected him to lose, his rating jumps 13 points to 1213, while Mira drops the same 13. The system has updated its belief about Kai's strength based on new evidence.

Over the next 30 games, Kai tears through opponents. He beats a 1500, a 1600, even a 1750. Each upset produces a large rating swing because the system assigned low probability to his victories. By game 30, he's climbed to 1680. But something shifts. He loses to a 1620-rated player name...
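The numbers in the story fall straight out of the Elo update rule. A minimal sketch, assuming a K-factor of 16 (the platform's actual K is not stated in the story; 16 is the value that reproduces the 13-point swing):

```python
def expected_score(rating: float, opponent: float) -> float:
    """Elo expected score: a probability-like forecast of the result."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))

def update(rating: float, opponent: float, score: float, k: float = 16) -> float:
    """Shift the rating by K times the surprise (actual minus expected)."""
    return rating + k * (score - expected_score(rating, opponent))

# Kai (1200) beats Mira (1450): a 250-point gap.
e = expected_score(1200, 1450)    # ~0.19 expected score for Kai
new_kai = update(1200, 1450, 1)   # ~1213 after the upset win
new_mira = update(1450, 1200, 0)  # ~1437
```

Because both players use the same K, the expected scores sum to 1 and the update is zero-sum: Kai's +13 is exactly Mira's -13.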
Popular framing: A rising Elo rating means you are improving; a stalling or falling rating means you have stopped improving or are playing worse. The number tracks your skill trajectory in real time. The 'Bayesian' nature is often missed: Elo is a 'belief' that is updated, not a 'measurement' like height or weight.
Structural analysis: The Elo number is not a measure of skill; it is a Bayesian posterior estimate of relative skill, subject to convergence dynamics, prior sensitivity, and K-factor decay. As the estimate converges on true skill, the visible signal (point swings, win rates) necessarily flattens regardless of whether actual skill is changing. The system's statistical behavior is mistaken for a performance signal. There is also a 'Matthew Effect' in rating: higher-rated players get invited to better tournaments and training, which further increases their skill, creating a feedback loop that the Elo number merely records.
The gap matters because players, coaches, and platform designers make consequential decisions (training load, opponent selection, self-evaluation) based on a misread of what rating movement communicates. Treating convergence as a plateau triggers unnecessary interventions and creates anxiety around normal system behavior. In non-game applications, the same gap causes algorithmic outputs to be mistaken for ground truth about people.