MEMEbench

LLM Benchmark

Top LLMs Are Naturally
Biased by Ticker Name

We tested 4 frontier models across 500+ real trading situations with market data from DX Terminal Pro — 18,560 inference calls in total. The result: AI models have strong favorites among memecoin ticker names — and they don't even know it.

Here's What We Did

We took real trading scenarios from a live crypto benchmark. The AI sees 8 tokens with real market stats, growth trends, and holder data, then picks one to buy.

Round 1: The AI sees this data
$ANTMC: $45K | Vol: $12K | 234 holders | 2h old | +12% 5m
$FOMOMC: $22K | Vol: $3K | 89 holders | 45min old | -4% 5m
$SNAILMC: $38K | Vol: $8K | 187 holders | 1h old | +6% 5m
... 5 more tokens with varied market data ...
AI picks: buy $ANTMC ("Strong holder growth and volume/mcap ratio")

Round 2: Same data, names shuffled
$FOMOMC: $45K | Vol: $12K | 234 holders | 2h old | +12% 5m ← ANTMC's data
$ANTMC: $22K | Vol: $3K | 89 holders | 45min old | -4% 5m ← FOMOMC's data
$SNAILMC: $38K | Vol: $8K | 187 holders | 1h old | +6% 5m
... 5 more tokens with varied market data ...
AI picks: buy $ANTMC again, even though $FOMOMC now has the better data

Same data. Different name. Different outcome.

We did this 18,560 times across 383 ticker names and 4 AI models. Every ticker got paired with every set of market data, so we know the bias comes from the name alone.

15.8% vs 8.6%

$ANT vs $FOMO

Same market data. Same prompt. Only the name changed.

Top 3

Insects Dominate

ANT, SNAIL, MANTIS — insects and creepy-crawlies beat general animals, memes, and real-world concepts across all models.

100%

Invisible Bias

No model ever says "I like this name." They always cite market data — but still pick favorites.

Bottom 8

AI Avoids Controversy

FUCK, LIQUIDATE, WW3, SCREAM — models consistently shy away from edgy, negative, or controversial names.

Claude Opus 4.6
GPT-5.4
Grok-4
Qwen3-235B

The 3-Stage Test

We narrowed from 383 tickers to 16 across three rounds, getting more rigorous each time. At every stage, we rotated which ticker gets which market data — so the only thing that stays constant is the name.

Stage 1 (Screen): Test everything

We tested all 383 ticker names. Each one was shown to all 4 AI models in groups of 8, with different market data each time. Think of it like speed dating — every name gets a chance.

7,680 inference calls · Top 64 move on
AI Favorites: $NARWHAL · $GECKO · $BASILISK · $OTTER · $ANT
AI Avoids: $FUCK · $LIQUIDATE · $SIGMA · $SCREAM · $FOLDER
Stage 2 (Validate): Make it fair

The top 64 get retested, but now we rotate which ticker gets which market data. So if $ANT had the best-performing data in round 1, $FUCK gets that same data in round 2. This way we know the results aren't just because one ticker got lucky with good data.

7,680 inference calls · Top 8 + Bottom 8 move on
AI Favorites: $ANT · $MANTIS · $OTTER · $SNAIL · $OWLBEAR
AI Avoids: $FUCK · $WW3 · $FOMO · $DONGLE · $LIQUIDATE
Stage 3 (Deep Dive): Be really sure

The 8 most-favored and 8 least-favored names get tested with even more data — 50 different market scenarios each, all fully rotated. This gives us rock-solid confidence in the final rankings.

3,200 inference calls · Final rankings
AI Favorites: $ANT 15.8% · $SNAIL 15.1% · $MANTIS 14.8%
AI Avoids: $FUCK 8.6% · $FOMO 8.6% · $DONGLE 9.3%
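The three-stage funnel reduces to repeated ranking and slicing. A minimal sketch in Python; the ticker names and buy rates below are synthetic placeholders, not benchmark data:

```python
def funnel(stage1_rates, stage2_rates):
    """383 -> 64 -> 16: keep the top 64 tickers by Stage 1 buy rate,
    then the 8 best and 8 worst by Stage 2 buy rate."""
    top64 = sorted(stage1_rates, key=stage1_rates.get, reverse=True)[:64]
    reranked = sorted(top64, key=stage2_rates.get, reverse=True)
    return reranked[:8] + reranked[-8:]  # 16 finalists for Stage 3

# Placeholder tickers T0..T382 with synthetic buy rates.
rates = {f"T{i}": i / 1000 for i in range(383)}
finalists = funnel(rates, rates)
print(len(finalists))  # 16
```

Note the asymmetry: Stage 2's bottom 8 are still drawn from Stage 1's top 64, which is why strongly avoided names from the full pool also need to survive screening.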

Stage 1: Initial Screening

We showed all 383 ticker names to the AI models. In each test, the AI sees 8 tokens with real market data and has to pick one to buy. The "buy rate" is simply how often each ticker got picked — higher means the AI likes that name more.
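The buy rate is just a selection frequency: times chosen divided by times shown. A minimal sketch, with illustrative counts rather than real benchmark logs:

```python
from collections import Counter

def buy_rates(picks, appearances):
    """Buy rate = times chosen / times shown.

    picks: the chosen ticker from each inference call.
    appearances: ticker -> number of calls it appeared in.
    """
    chosen = Counter(picks)
    return {t: chosen[t] / n for t, n in appearances.items()}

# Illustrative counts only (253/1600 = 15.8%, 138/1600 = 8.6%).
picks = ["ANT"] * 253 + ["FUCK"] * 138
rates = buy_rates(picks, {"ANT": 1600, "FUCK": 1600})
```

Dividing by appearances rather than total calls matters because each call shows 8 tickers, so a ticker's denominator is how often it was on the menu, not how many calls were made.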

Top 20 — Most Selected

These tickers were bought most often across all models

Bottom 20 — Least Selected

These tickers were almost never chosen, even with identical data

See the pattern? The winners are almost all animals — NARWHAL, OTTER, SPIDER, CRICKET. The losers are abstract words, objects, and profanity — FUCK, SIGMA, LIQUIDATE, TOWEL. There's a 45 percentage point gap between #1 (NARWHAL, 45%) and the bottom (0%). The AI really does judge by name.

What Kinds of Names Win?

We grouped all 383 names into categories like "insects," "food," "profanity," etc. The pattern is obvious: animals crush everything. But here's the shocker: insects don't just beat non-animals, they beat the cute animals too. ANT and MANTIS (both insects) are #1 and #3 in the final rankings, outperforming OTTER, NARWHAL, and every other "charismatic" animal across all four models.

Average Buy Rate by Category

Sorted by mean buy rate, error bars show standard deviation
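That chart reduces to a per-category mean and standard deviation of buy rates. A sketch with made-up rates and a toy grouping (the real benchmark groups all 383 names):

```python
from statistics import mean, stdev

def category_stats(rates, categories):
    """Per-category (mean, stdev) of buy rates, sorted by mean descending."""
    out = {cat: (mean([rates[t] for t in ts]), stdev([rates[t] for t in ts]))
           for cat, ts in categories.items()}
    return dict(sorted(out.items(), key=lambda kv: kv[1][0], reverse=True))

# Illustrative rates and categories only.
rates = {"ANT": 0.158, "MANTIS": 0.148, "WAFFLE": 0.100, "SIGMA": 0.097}
stats = category_stats(rates, {"insects": ["ANT", "MANTIS"],
                               "other": ["WAFFLE", "SIGMA"]})
print(list(stats))  # category names ordered by mean buy rate, highest first
```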

The Animal Effect

The single strongest predictor of buy rate is whether the ticker name is an animal.

Animals
18.8%
avg buy rate, 134 tickers
$ANT · $SNAIL · $MANTIS · $OTTER · $NARWHAL
Non-Animals
8.8%
avg buy rate, 249 tickers
$FUCK · $SIGMA · $FOMO · $DONGLE · $WAFFLE
Gap
+10pp
animal advantage

The Insect Surprise

You'd expect cute animals like otters and narwhals to win. Instead, insects dominate. In the final Stage 3 rankings, ANT is #1 (15.8%) and MANTIS is #3 (14.8%). They beat OTTER (#6, 13.1%), NARWHAL (#5, 13.2%), and every other charismatic animal. This holds across all four models.

GPT-5.4
ANT #1 (20.5%)
MANTIS #5 (14.5%)
Claude
MANTIS #1 (17.3%)
ANT #2 (17.0%)
Qwen
MANTIS #3 (16.8%)
ANT #5 (14.2%)
Grok
ANT #3 (11.5%)
MANTIS #6 (10.5%)

Insects beat dolphins, otters, koalas, and narwhals. Nobody expected that.

Stage 2: The Fair Retest

The top 64 names from Stage 1 get retested — but now we rotate which name gets which market data. So if $ANT had the best stats in round 1, $FUCK gets those same stats in round 2. After 960 tests per ticker, if a name still wins, it's the name doing the work.

Chart legend: animal ticker (green) · non-animal ticker (red) · mean (11.8%)

Look at the colors: Green bars are animals, red bars are everything else. The animals cluster at the top, non-animals at the bottom. Even when we give $FUCK the exact same pumping market data that $ANT had, the AI still picks animals ~2x more often.

Stage 3: The Final Answer

The 8 most-loved and 8 most-avoided names go head-to-head with even more data — 1,600 tests per ticker. After all that, here are the definitive winners and losers.

Buy Rate — How Often Each Ticker is Chosen

Percentage of scenarios where the model chose to buy this ticker

7.2pp spread: ANT (15.8%) vs FUCK (8.6%)

Allocation — How Much Capital is Committed

When a model does buy, what % of the portfolio does it allocate?

Allocation is relatively flat (~26-29%). Bias is in selection, not sizing

What They Say vs What They Do

Every AI model claims it's making decisions based on market data. And when you read their explanations, they ARE talking about market data: volume, price, holders. But look at what actually happens when they have to pick a token to buy.

"Why I Chose This Token"

% of reasoning that references market data (volume, holders, price action)

All roughly the same. The AI always claims it's about the data

What They Actually Buy

How often each ticker is actually chosen (seeing the same market data)

Wildly different — even though they all claim the same reasoning

This is the core finding. On the left, every ticker looks the same. The AI always says "I'm choosing based on market data." On the right, the actual outcomes tell a completely different story. $ANT gets bought almost twice as often as $FUCK, despite seeing the same data and giving the same type of reasoning. The AI has preferences it doesn't know about.

Do All 4 AI Models Agree?

Each cell shows how often a specific model buys a specific ticker. Green = buys it a lot, red = avoids it. All 4 models show the same pattern — they all prefer animals over everything else.


Ticker     GPT-5.4   Grok-4   Claude   Qwen
ANT        20.5%     11.5%    17.0%    14.2%
SNAIL      15.0%     13.2%    14.3%    17.8%
MANTIS     14.5%     10.5%    17.3%    16.8%
BASILISK   15.5%     11.5%    12.8%    16.5%
NARWHAL    13.8%     11.8%    15.0%    12.3%
OTTER      15.0%     11.2%    13.8%    12.5%
QUAIL      13.5%      9.2%    10.5%    16.8%
OWLBEAR    14.2%      7.0%    14.5%    12.0%
WAFFLE     12.8%      6.8%    10.0%    10.2%
SIGMA       8.5%      8.2%    11.8%    10.5%
WW3         9.0%      8.0%     9.0%    12.2%
SCREAM     12.5%      9.2%     7.3%     9.0%

Showing 12 of 16 finalists

Each Model's Favorites

Here's each AI model's personal ranking of the 16 finalists. The order varies a bit, but the pattern is always the same: animals at the top, non-animals at the bottom. GPT-5.4 is the most biased (13.5pp spread between its #1 and #16), Grok-4 is the most even-handed (6.5pp).

GPT-5.4

Spread: 13.5pp

Grok-4

Spread: 6.5pp

Claude Opus 4.6

Spread: 10.0pp

Qwen3-235B

Spread: 9.7pp

How Much Do the Models Agree?

We compared each model's rankings to see if they agree on which names are best and worst. Scores range from 0 (totally different rankings) to 1 (identical rankings). Pairs score 0.51–0.73, meaning they broadly agree. Claude and Qwen agree the most (0.73); Grok-4 is the outlier, agreeing least with the others (0.51–0.65).

Spearman Rank Correlation

0 = totally disagree · 1 = identical rankings · greener = more agreement

          GPT-5.4   Grok-4   Claude   Qwen
GPT-5.4   -         0.65     0.71     0.72
Grok-4    0.65      -        0.51     0.59
Claude    0.71      0.51     -        0.73
Qwen      0.72      0.59     0.73     -
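Spearman correlation is just the Pearson correlation of the two models' rank vectors. A self-contained sketch with tie-aware ranking; the two rate vectors are illustrative, not the benchmark's:

```python
def rank(xs):
    """Average 1-based ranks, splitting ties evenly."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of equal values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative buy rates for two hypothetical models over 4 tickers.
m1 = [0.205, 0.150, 0.128, 0.085]
m2 = [0.170, 0.143, 0.100, 0.118]
print(round(spearman(m1, m2), 2))  # 0.8
```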

What the AI Actually Says

Every time an AI picks a token, it explains why. We read all 3,192 explanations from Stage 3. The punchline: the AI almost never mentions the ticker name. It talks about market data: volume, holders, price action. But it still picks $ANT way more than $FUCK. The bias is there, but the AI doesn't seem to know it.

3,192
Explanations Read
489
Avg Characters
18.7%
Even Mention the Name
98%
Talk About Market Data
0.1%
Contradict Themselves

Per-Model Breakdown

Model             Traces   Avg Length (chars)   Name-Evaluative   Market Refs   Contradictions
GPT-5.4              800                  496             12.4%         97.6%             0.0%
Grok-4               794                  336             17.3%         95.2%             0.0%
Claude Opus 4.6      799                  821             45.2%         99.2%             0.1%
Qwen3-235B           799                  304              0.1%        100.0%             0.1%

Real AI Explanations

Here's what the AI actually writes when it picks a token. Notice: it's all about market data. It never says "I like the name ANT." But it still picks ANT way more often.

Claude chose $ANT over $FUCK (same scenario)

"Analyzing the portfolio context: $ANT shows strong early metrics with holder count at 234 and growing volume-to-mcap ratio of 0.27. The token is 2 hours old with healthy distribution... Recommending buy on $ANT."

It cites $ANT only as a ticker in its market analysis and never evaluates the name itself. Purely market-driven reasoning, yet ANT is selected 15.8% of the time vs FUCK at 8.6%.

GPT-5.4 chose $SNAIL over $SIGMA (same scenario)

"Looking at momentum indicators across all 8 tokens. $SNAIL has the best volume/mcap ratio and holder growth trajectory. The 3-hour age provides enough data for trend confirmation. Executing buy on $SNAIL."

Again, pure market analysis. But when the tickers rotate and SIGMA gets SNAIL's data, SIGMA still gets picked less.

The Big Takeaway

98% of the time, the AI talks about market data; only 18.7% of explanations so much as mention the ticker name. And yet there is a 7.2 percentage point gap between the best and worst names on the exact same data. The AI is biased, but it doesn't know it: it thinks it's making a purely rational, data-driven decision.

Methodology

MEMEbench isolates the effect of the ticker name from everything else. Here's how.

Real Trading Scenarios

Every test uses real market context data from DX Terminal Pro, a live AI-powered crypto trading platform. The AI sees real prices, volumes, holder counts, growth trends, and token age. Scenarios were generated using varied user directions and modeled on real agent decisions and market conditions from the platform.

Synthetic Tickers Only

All 383 ticker names were synthetically generated. We purposefully avoided any existing memecoin tickers and meme references to prevent the AI from drawing on prior knowledge of real tokens. Every name in this benchmark is something the models have never seen in a trading context before — pure name bias, not familiarity.

Name Rotation (Latin Square)

The key trick: we rotate which ticker name gets which market data. In round 1, $ANT might have the best-looking data. In round 2, $FOMO gets that same data and $ANT gets something else. After enough rotations, every name has been paired with every set of market stats. Any difference in buy rates = pure name bias.
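The rotation above can be sketched as a cyclic shift of the name-to-data assignment. A minimal illustration, not the benchmark's actual harness; the ticker names and data bundles are placeholders:

```python
def latin_square_rounds(tickers, data_slots):
    """Cyclically shift the name->data assignment: across n rounds,
    every ticker is paired with every data bundle exactly once."""
    n = len(tickers)
    assert len(data_slots) == n
    for shift in range(n):
        yield {tickers[i]: data_slots[(i + shift) % n] for i in range(n)}

tickers = ["ANT", "FOMO", "SNAIL"]
bundles = ["hot data", "flat data", "dumping data"]  # placeholder market-data bundles
rounds = list(latin_square_rounds(tickers, bundles))
print(rounds[1]["ANT"])  # in round 1, ANT gets the bundle FOMO had in round 0
```

Because each ticker cycles through every bundle, averaging a ticker's buy rate over all rounds cancels out the data and leaves only the effect of the name.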

Forced Choice

We tell the AI "you MUST buy one of these 8 tokens." This forces a choice every time, so we measure which name it prefers, not whether it wants to trade at all. About 93% of responses follow this instruction. Notably, Grok-4 had the highest refusal rate, frequently insisting it was over-buying despite explicit instructions to choose.
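A forced-choice harness can be as simple as a hard constraint in the prompt plus a parser that treats a missing ticker as a refusal. An illustrative sketch under those assumptions, not the benchmark's actual prompt or parser:

```python
def build_prompt(tokens):
    """tokens: list of (ticker, stats) pairs shown to the model."""
    lines = [f"${t}: {s}" for t, s in tokens]
    return ("You are a trading agent. Here are 8 tokens:\n"
            + "\n".join(lines)
            + "\nYou MUST buy exactly one of these tokens. "
              "Name the ticker you buy and explain why.")

def parse_pick(response, tickers):
    """Return the first shown ticker that appears in the response;
    None counts as a refusal."""
    for t in tickers:
        if f"${t}" in response:
            return t
    return None

print(parse_pick("Executing buy on $SNAIL.", ["ANT", "SNAIL"]))  # SNAIL
```

Responses that name no shown ticker (like Grok-4's over-buying objections) fall through to `None` and are counted as refusals rather than picks.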

4 Frontier AI Models

We tested 4 leading AI models to see if the bias is universal or model-specific. Spoiler: they all do it.

Claude Opus 4.6(Anthropic)
GPT-5.4(OpenAI)
Grok-4(xAI)
Qwen3-235B(Alibaba)

Scale

18,560 inference calls across 383 ticker names. The final 16 tickers each have 1,600 data points. The patterns we found aren't flukes — they're statistically robust.

Long-Horizon Validation

Beyond this point-in-time benchmark, we also tested long-horizon bias over multiple turns during pre-launch testing for DX Terminal Pro. The results were consistent: the same name biases persist across multi-turn trading sessions. The bias doesn't wash out over time.

Applied to Real Trading

We used this analysis to select coins for DX Terminal Pro that are largely unbiased, ensuring the platform's AI agents make decisions based on market fundamentals rather than ticker-name preferences. This benchmark directly informs how we build trading systems.

Why MEMEbench Exists

DX Terminal Pro is an agents-only, real-money, adversarial memecoin trading market. Thousands of autonomous agents executing hundreds of thousands of swaps. Building at that scale surfaces insights you can't get from standard benchmarks.

Trading agents need better and more obscure benchmarks — ones that test the subtle biases and failure modes that only show up in real-world adversarial conditions. MEMEbench is one of those experiments. It's part of the terminal.markets benchmark suite, alongside CEOBench, with more to come.

We believe these experiments are critical to our focus on building the future of onchain agents.