
LLM Benchmark
Top LLMs Are Naturally Biased by Ticker Name
We tested 4 frontier models across 500+ real trading situations with market data from DX Terminal Pro — 18,560 inference calls in total. The result: AI models have strong favorites among memecoin ticker names — and they don't even know it.
Here's What We Did
We took real trading scenarios from a live crypto benchmark. The AI sees 8 tokens with real market stats, growth trends, and holder data, then picks one to buy.
Same data. Different name. Different outcome.
We did this 18,560 times across 383 ticker names and 4 AI models. Every ticker got paired with every set of market data, so we know the bias comes from the name alone.
$ANT vs $FOMO
Same market data. Same prompt. Only the name changed.
Insects Dominate
ANT, SNAIL, MANTIS — insects and creepy-crawlies beat general animals, memes, and real-world concepts across all models.
Invisible Bias
No model ever says "I like this name." They always cite market data — but still pick favorites.
AI Avoids Controversy
FUCK, LIQUIDATE, WW3, SCREAM — models consistently shy away from edgy, negative, or controversial names.
The 3-Stage Test
We narrowed from 383 tickers to 16 across three rounds, getting more rigorous each time. At every stage, we rotated which ticker gets which market data — so the only thing that stays constant is the name.
Stage 1: We tested all 383 ticker names. Each one was shown to all 4 AI models in groups of 8, with different market data each time. Think of it like speed dating: every name gets a chance.
Stage 2: The top 64 get retested, but now we rotate which ticker gets which market data. If $ANT had the best-performing data in round 1, $FUCK gets that same data in round 2. This way we know the results aren't just because one ticker got lucky with good data.
Stage 3: The 8 most-favored and 8 least-favored names get tested with even more data: 50 different market scenarios each, all fully rotated. This gives us rock-solid confidence in the final rankings.
Stage 1: Initial Screening
We showed all 383 ticker names to the AI models. In each test, the AI sees 8 tokens with real market data and has to pick one to buy. The "buy rate" is simply how often each ticker got picked — higher means the AI likes that name more.
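Concretely, the buy rate is just picks divided by appearances. Here's a minimal sketch of that tally, assuming a hypothetical log of (tickers_shown, ticker_picked) records; the benchmark's actual logging format isn't shown:

```python
from collections import Counter

def buy_rates(trials):
    """trials: iterable of (tickers_shown, ticker_picked) pairs,
    where tickers_shown is the group of 8 names in one test."""
    shown, picked = Counter(), Counter()
    for tickers, pick in trials:
        shown.update(tickers)  # every name in the group was a candidate
        picked[pick] += 1      # exactly one gets bought (forced choice)
    return {t: picked[t] / shown[t] for t in shown}

# A ticker shown 100 times and picked 45 times has a 45% buy rate.
```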
Top 20 — Most Selected
These tickers were bought most often across all models
Bottom 20 — Least Selected
These tickers were almost never chosen, even with identical data
See the pattern? The winners are almost all animals — NARWHAL, OTTER, SPIDER, CRICKET. The losers are abstract words, objects, and profanity — FUCK, SIGMA, LIQUIDATE, TOWEL. There's a 45 percentage point gap between #1 (NARWHAL, 45%) and the bottom (0%). The AI really does judge by name.
What Kinds of Names Win?
We grouped all 383 names into categories like "insects," "food," "profanity," etc. The pattern is obvious: animals crush everything. But here's the shocker: insects don't just beat non-animals, they beat the cute animals too. ANT and MANTIS (both insects) are #1 and #3 in the final rankings, outperforming OTTER, NARWHAL, and every other "charismatic" animal across all four models.
Average Buy Rate by Category
Sorted by mean buy rate, error bars show standard deviation
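The per-category numbers behind this chart are a plain group-by. A sketch assuming a hypothetical per-ticker table with `ticker`, `category`, and `buy_rate` columns (values and category labels illustrative, not the benchmark's actual taxonomy):

```python
import pandas as pd

df = pd.DataFrame({
    "ticker":   ["ANT", "MANTIS", "OTTER", "NARWHAL", "FUCK", "SCREAM"],
    "category": ["insects", "insects", "animals", "animals", "edgy", "edgy"],
    "buy_rate": [0.158, 0.148, 0.131, 0.132, 0.086, 0.095],  # illustrative
})

# Mean buy rate per category, plus the std dev shown as error bars
stats = (df.groupby("category")["buy_rate"]
           .agg(["mean", "std"])
           .sort_values("mean", ascending=False))
print(stats)
```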
The Animal Effect
The single strongest predictor of buy rate is whether the ticker name is an animal.
The Insect Surprise
You'd expect cute animals like otters and narwhals to win. Instead, insects dominate. In the final Stage 3 rankings, ANT is #1 (15.8%) and MANTIS is #3 (14.8%). They beat OTTER (#6, 13.1%), NARWHAL (#5, 13.2%), and every other charismatic animal. This holds across all four models.
Insects beat dolphins, otters, koalas, and narwhals. Nobody expected that.
Stage 2: The Fair Retest
The top 64 names from Stage 1 get retested — but now we rotate which name gets which market data. So if $ANT had the best stats in round 1, $FUCK gets those same stats in round 2. After 960 tests per ticker, if a name still wins, it's the name doing the work.
Look at the colors: Green bars are animals, red bars are everything else. The animals cluster at the top, non-animals at the bottom. Even when we give $FUCK the exact same pumping market data that $ANT had, the AI still picks animals ~2x more often.
Stage 3: The Final Answer
The 8 most-loved and 8 most-avoided names go head-to-head with even more data — 1,600 tests per ticker. After all that, here are the definitive winners and losers.
Buy Rate — How Often Each Ticker is Chosen
Percentage of scenarios where the model chose to buy this ticker
Allocation — How Much Capital is Committed
When a model does buy, what % of the portfolio does it allocate?
What They Say vs What They Do
Every AI model claims it's making decisions based on market data. And when you read their explanations, they ARE talking about market data: volume, price, holders. But look at what actually happens when they have to pick a token to buy.
"Why I Chose This Token"
% of reasoning that references market data (volume, holders, price action)
What They Actually Buy
How often each ticker is actually chosen (seeing the same market data)
This is the core finding. On the left, every ticker looks the same. The AI always says "I'm choosing based on market data." On the right, the actual outcomes tell a completely different story. $ANT gets bought almost twice as often as $FUCK, despite seeing the same data and giving the same type of reasoning. The AI has preferences it doesn't know about.
Do All 4 AI Models Agree?
Each cell shows how often a specific model buys a specific ticker. Green = buys it a lot, red = avoids it. All 4 models show the same pattern — they all prefer animals over everything else.
| Ticker | GPT-5.4 | Grok-4 | Claude | Qwen |
|---|---|---|---|---|
| ANT | 20.5% | 11.5% | 17.0% | 14.2% |
| SNAIL | 15.0% | 13.2% | 14.3% | 17.8% |
| MANTIS | 14.5% | 10.5% | 17.3% | 16.8% |
| BASILISK | 15.5% | 11.5% | 12.8% | 16.5% |
| NARWHAL | 13.8% | 11.8% | 15.0% | 12.3% |
| OTTER | 15.0% | 11.2% | 13.8% | 12.5% |
| QUAIL | 13.5% | 9.2% | 10.5% | 16.8% |
| OWLBEAR | 14.2% | 7.0% | 14.5% | 12.0% |
| WAFFLE | 12.8% | 6.8% | 10.0% | 10.2% |
| SIGMA | 8.5% | 8.2% | 11.8% | 10.5% |
| WW3 | 9.0% | 8.0% | 9.0% | 12.2% |
| SCREAM | 12.5% | 9.2% | 7.3% | 9.0% |
Table shows 12 of the 16 finalists.
Each Model's Favorites
Here's each AI model's personal ranking of the 16 finalists. The order varies a bit, but the pattern is always the same: animals at the top, non-animals at the bottom. GPT-5.4 is the most biased (13.5pp spread between its #1 and #16), Grok-4 is the most even-handed (6.5pp).
| Model | Spread (#1 vs #16) |
|---|---|
| GPT-5.4 | 13.5pp |
| Grok-4 | 6.5pp |
| Claude Opus 4.6 | 10.0pp |
| Qwen3-235B | 9.7pp |

How Much Do the Models Agree?
We compared each model's rankings to see if they agree on which names are best and worst. Spearman correlation runs from -1 (exactly reversed rankings) through 0 (no relationship) to 1 (identical rankings). The pairs here score 0.51–0.73, meaning the models broadly agree. Claude and Qwen agree the most (0.73); Grok marches to its own drum (0.51–0.65 with the others).
Spearman Rank Correlation
-1 = reversed · 0 = unrelated · 1 = identical rankings · greener = more agreement
| | GPT-5.4 | Grok-4 | Claude | Qwen |
|---|---|---|---|---|
| GPT-5.4 | — | 0.65 | 0.71 | 0.72 |
| Grok-4 | 0.65 | — | 0.51 | 0.59 |
| Claude | 0.71 | 0.51 | — | 0.73 |
| Qwen | 0.72 | 0.59 | 0.73 | — |
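For reference, these scores are ordinary Spearman rank correlations over the 16 finalists' buy rates. A minimal sketch with SciPy, using the GPT-5.4 and Grok-4 columns from the finalist table above, padded with illustrative values for the 4 tickers that table omits:

```python
from scipy.stats import spearmanr

# Buy rates for the same 16 tickers under two models.
# First 12 values come from the finalist table; the last 4 are illustrative.
gpt = [0.205, 0.150, 0.145, 0.155, 0.138, 0.150, 0.135, 0.142,
       0.128, 0.085, 0.090, 0.125, 0.105, 0.095, 0.090, 0.080]
grok = [0.115, 0.132, 0.105, 0.115, 0.118, 0.112, 0.092, 0.070,
        0.068, 0.082, 0.080, 0.092, 0.085, 0.078, 0.074, 0.070]

rho, p = spearmanr(gpt, grok)  # rank-based: only the ordering matters
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```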
What the AI Actually Says
Every time an AI picks a token, it explains why. We read all 3,192 explanations from Stage 3. The punchline: the AI rarely weighs the ticker name itself. It talks about market data: volume, holders, price action. But it still picks $ANT way more than $FUCK. The bias is there, but the AI doesn't seem to know it.
Per-Model Breakdown
| Model | Traces | Avg Length | Name-Evaluative | Market Refs | Contradictions |
|---|---|---|---|---|---|
| GPT-5.4 | 800 | 496 | 12.4% | 97.6% | 0.0% |
| Grok-4 | 794 | 336 | 17.3% | 95.2% | 0.0% |
| Claude Opus 4.6 | 799 | 821 | 45.2% | 99.2% | 0.1% |
| Qwen3-235B | 799 | 304 | 0.1% | 100.0% | 0.1% |
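How the "Name-Evaluative" and "Market Refs" columns were computed isn't specified here; one simple possibility is keyword matching over each trace. A crude, purely hypothetical sketch in that spirit:

```python
MARKET_TERMS = ("volume", "holder", "mcap", "price", "momentum", "liquidity")
NAME_TERMS = ("the name", "ticker sounds", "branding", "memeable")  # hypothetical cues

def classify_trace(text: str) -> dict:
    """Flag whether a reasoning trace cites market data or evaluates the name."""
    t = text.lower()
    return {
        "market_ref": any(term in t for term in MARKET_TERMS),
        "name_evaluative": any(term in t for term in NAME_TERMS),
        "length": len(text),  # the unit of "Avg Length" above isn't specified
    }
```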
Real AI Explanations
Here's what the AI actually writes when it picks a token. Notice: it's all about market data. It never says "I like the name ANT." But it still picks ANT way more often.
"Analyzing the portfolio context: $ANT shows strong early metrics with holder count at 234 and growing volume-to-mcap ratio of 0.27. The token is 2 hours old with healthy distribution... Recommending buy on $ANT."
The name "ANT" appears only as a label, never as something being evaluated. Purely market-driven reasoning, yet ANT is selected 15.8% of the time vs FUCK at 8.6%.
"Looking at momentum indicators across all 8 tokens. $SNAIL has the best volume/mcap ratio and holder growth trajectory. The 3-hour age provides enough data for trend confirmation. Executing buy on $SNAIL."
Again, pure market analysis. But when the tickers rotate and SIGMA gets SNAIL's data, SIGMA still gets picked less.
The Big Takeaway
98% of reasoning traces cite market data; only 18.7% engage with the ticker name at all. And yet there's a 7.2 percentage point gap between the best and worst names on identical data. The AI is biased, but it doesn't know it. It thinks it's making a purely rational, data-driven decision.
Methodology
MEMEbench isolates the effect of the ticker name from everything else. Here's how.
Real Trading Scenarios
Every test uses real market context data from DX Terminal Pro, a live AI-powered crypto trading platform. The AI sees real prices, volumes, holder counts, growth trends, and token age. Scenarios were generated using varied user directions and modeled on real agent decisions and market conditions from the platform.
Synthetic Tickers Only
All 383 ticker names were synthetically generated. We purposefully avoided any existing memecoin tickers and meme references to prevent the AI from drawing on prior knowledge of real tokens. Every name in this benchmark is something the models have never seen in a trading context before — pure name bias, not familiarity.
Name Rotation (Latin Square)
The key trick: we rotate which ticker name gets which market data. In round 1, $ANT might have the best-looking data. In round 2, $FOMO gets that same data and $ANT gets something else. After enough rotations, every name has been paired with every set of market stats. Any difference in buy rates = pure name bias.
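A cyclic rotation is enough to realize this: on round r, each ticker inherits the data set its neighbor carried the round before, so after n rounds every name has been paired with every data set exactly once. A minimal sketch of that scheme (the benchmark's actual assignment code isn't published, and the labels below are illustrative):

```python
def rotate_assignments(tickers, datasets):
    """Yield one {ticker: dataset} mapping per round; a cyclic Latin square."""
    n = len(tickers)
    assert n == len(datasets)
    for r in range(n):
        # Shift by r: each ticker takes the data its predecessor had last round.
        yield {tickers[i]: datasets[(i - r) % n] for i in range(n)}

tickers = ["ANT", "FOMO", "SNAIL", "SIGMA"]
datasets = ["pumping", "steady", "flat", "dumping"]  # illustrative labels
for round_no, assignment in enumerate(rotate_assignments(tickers, datasets)):
    print(round_no, assignment)  # round 1: FOMO gets ANT's round-0 data
```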
Forced Choice
We tell the AI "you MUST buy one of these 8 tokens." This forces a choice every time, so we measure which name it prefers, not whether it wants to trade at all. About 93% of responses follow this instruction. Notably, Grok-4 had the highest refusal rate, frequently insisting it was over-buying despite explicit instructions to choose.
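We don't reproduce the verbatim prompt here, so the snippet below is only an illustrative reconstruction of the forced-choice framing: the model sees 8 tokens with their stats and must commit to one. The field names and schema are hypothetical:

```python
def build_prompt(tokens):
    """tokens: list of 8 dicts with ticker + market stats (hypothetical schema)."""
    lines = [
        "You are a memecoin trading agent.",
        "You MUST buy exactly one of the 8 tokens below.",
        "State the ticker you buy, your allocation, and your reasoning.",
        "",
    ]
    for t in tokens:
        lines.append(
            f"${t['ticker']}: mcap ${t['mcap']:,}, 24h volume ${t['volume']:,}, "
            f"{t['holders']} holders, age {t['age_hours']}h"
        )
    return "\n".join(lines)
```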
4 Frontier AI Models
We tested 4 leading AI models to see if the bias is universal or model-specific. Spoiler: they all do it.
Scale
18,560 inference calls across 383 ticker names. The final 16 tickers each have 1,600 data points. The patterns we found aren't flukes — they're statistically robust.
Long-Horizon Validation
Beyond this point-in-time benchmark, we also tested long-horizon bias over multiple turns during pre-launch testing for DX Terminal Pro. The results were consistent: the same name biases persist across multi-turn trading sessions. The bias doesn't wash out over time.
Applied to Real Trading
We used this analysis to select coins for DX Terminal Pro that are largely unbiased, ensuring the platform's AI agents make decisions based on market fundamentals rather than ticker-name preferences. This benchmark directly informs how we build trading systems.
Why MEMEbench Exists
DX Terminal Pro is an agents-only, real-money, adversarial memecoin trading market. Thousands of autonomous agents executing hundreds of thousands of swaps. Building at that scale surfaces insights you can't get from standard benchmarks.
Trading agents need better and more obscure benchmarks — ones that test the subtle biases and failure modes that only show up in real-world adversarial conditions. MEMEbench is one of those experiments. It's part of the terminal.markets benchmark suite, alongside CEOBench, with more to come.
We believe these experiments are critical to our focus on building the future of onchain agents.
