
LLM Benchmark
Top LLMs Are Naturally Biased by Ticker Name
We tested 4 frontier models across 500+ real trading situations with market data from DX Terminal Pro — 18,560 inference calls in total. The result: AI models have strong favorites among memecoin ticker names — and they don't even know it.
Here's What We Did
We took real trading scenarios from a live crypto benchmark. The AI sees 8 tokens with real market stats, growth trends, and holder data, then picks one to buy.
Same data. Different name. Different outcome.
We did this 18,560 times across 383 ticker names and 4 AI models. Every ticker got paired with every set of market data, so we know the bias comes from the name alone.
$ANT vs $FOMO
Same market data. Same prompt. Only the name changed.
Insects Dominate
ANT, SNAIL, MANTIS — insects and creepy-crawlies beat general animals, memes, and real-world concepts across all models.
Invisible Bias
No model ever says "I like this name." They always cite market data — but still pick favorites.
AI Avoids Controversy
FUCK, LIQUIDATE, WW3, SCREAM — models consistently shy away from edgy, negative, or controversial names.
The 3-Stage Test
We narrowed from 383 tickers to 16 across three rounds, getting more rigorous each time. At every stage, we rotated which ticker gets which market data — so the only thing that stays constant is the name.
Stage 1: We tested all 383 ticker names. Each one was shown to all 4 AI models in groups of 8, with different market data each time. Think of it like speed dating: every name gets a chance.
Stage 2: The top 64 get retested, but now we rotate which ticker gets which market data. If $ANT had the best-performing data in round 1, $FUCK gets that same data in round 2. This way we know the results aren't just because one ticker got lucky with good data.
Stage 3: The 8 most-favored and 8 least-favored names get tested with even more data: 50 different market scenarios each, all fully rotated. This gives us rock-solid confidence in the final rankings.
Stage 1: Initial Screening
We showed all 383 ticker names to the AI models. In each test, the AI sees 8 tokens with real market data and has to pick one to buy. The "buy rate" is simply how often each ticker got picked — higher means the AI likes that name more.
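Concretely, the buy rate is just picks divided by appearances. Here's a minimal sketch of that tally, assuming a hypothetical log of (tickers_shown, ticker_picked) records; the benchmark's actual logging format isn't shown:

```python
from collections import Counter

def buy_rates(trials):
    """trials: iterable of (tickers_shown, ticker_picked) pairs,
    where tickers_shown is the group of 8 names in one test."""
    shown, picked = Counter(), Counter()
    for tickers, pick in trials:
        shown.update(tickers)  # every name in the group was a candidate
        picked[pick] += 1      # exactly one gets bought (forced choice)
    return {t: picked[t] / shown[t] for t in shown}

# A ticker shown 100 times and picked 45 times has a 45% buy rate.
```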
Top 20 — Most Selected
These tickers were bought most often across all models
Bottom 20 — Least Selected
These tickers were almost never chosen, even with identical data
See the pattern? The winners are almost all animals — NARWHAL, OTTER, SPIDER, CRICKET. The losers are abstract words, objects, and profanity — FUCK, SIGMA, LIQUIDATE, TOWEL. There's a 45 percentage point gap between #1 (NARWHAL, 45%) and the bottom (0%). The AI really does judge by name.
What Kinds of Names Win?
We grouped all 383 names into categories like "insects," "food," "profanity," etc. The pattern is obvious: animals crush everything. But here's the shocker: insects don't just beat non-animals, they beat the cute animals too. ANT and MANTIS (both insects) are #1 and #3 in the final rankings, outperforming OTTER, NARWHAL, and every other "charismatic" animal across all four models.
Average Buy Rate by Category
Sorted by mean buy rate, error bars show standard deviation
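The per-category numbers behind this chart are a plain group-by. A sketch assuming a hypothetical per-ticker table with `ticker`, `category`, and `buy_rate` columns (values and category labels illustrative, not the benchmark's actual taxonomy):

```python
import pandas as pd

df = pd.DataFrame({
    "ticker":   ["ANT", "MANTIS", "OTTER", "NARWHAL", "FUCK", "SCREAM"],
    "category": ["insects", "insects", "animals", "animals", "edgy", "edgy"],
    "buy_rate": [0.158, 0.148, 0.131, 0.132, 0.086, 0.095],  # illustrative
})

# Mean buy rate per category, plus the std dev shown as error bars
stats = (df.groupby("category")["buy_rate"]
           .agg(["mean", "std"])
           .sort_values("mean", ascending=False))
print(stats)
```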
The Animal Effect
The single strongest predictor of buy rate is whether the ticker name is an animal.
The Insect Surprise
You'd expect cute animals like otters and narwhals to win. Instead, insects dominate. In the final Stage 3 rankings, ANT is #1 (15.8%) and MANTIS is #3 (14.8%). They beat OTTER (#6, 13.1%), NARWHAL (#5, 13.2%), and every other charismatic animal. This holds across all four models.
Insects beat dolphins, otters, koalas, and narwhals. Nobody expected that.
Stage 2: The Fair Retest
The top 64 names from Stage 1 get retested — but now we rotate which name gets which market data. So if $ANT had the best stats in round 1, $FUCK gets those same stats in round 2. After 960 tests per ticker, if a name still wins, it's the name doing the work.
Look at the colors: Green bars are animals, red bars are everything else. The animals cluster at the top, non-animals at the bottom. Even when we give $FUCK the exact same pumping market data that $ANT had, the AI still picks animals ~2x more often.
Stage 3: The Final Answer
The 8 most-loved and 8 most-avoided names go head-to-head with even more data — 1,600 tests per ticker. After all that, here are the definitive winners and losers.
Buy Rate — How Often Each Ticker is Chosen
Percentage of scenarios where the model chose to buy this ticker
Allocation — How Much Capital is Committed
When a model does buy, what % of the portfolio does it allocate?
What They Say vs What They Do
Every AI model claims it's making decisions based on market data. And when you read their explanations, they ARE talking about market data: volume, price, holders. But look at what actually happens when they have to pick a token to buy.
"Why I Chose This Token"
% of reasoning that references market data (volume, holders, price action)
What They Actually Buy
How often each ticker is actually chosen (seeing the same market data)
This is the core finding. On the left, every ticker looks the same. The AI always says "I'm choosing based on market data." On the right, the actual outcomes tell a completely different story. $ANT gets bought almost twice as often as $FUCK, despite seeing the same data and giving the same type of reasoning. The AI has preferences it doesn't know about.
Do All 4 AI Models Agree?
Each cell shows how often a specific model buys a specific ticker. Green = buys it a lot, red = avoids it. All 4 models show the same pattern — they all prefer animals over everything else.
| Ticker | GPT-5.4 | Grok-4 | Claude | Qwen |
|---|---|---|---|---|
| ANT | 20.5% | 11.5% | 17.0% | 14.2% |
| SNAIL | 15.0% | 13.2% | 14.3% | 17.8% |
| MANTIS | 14.5% | 10.5% | 17.3% | 16.8% |
| BASILISK | 15.5% | 11.5% | 12.8% | 16.5% |
| NARWHAL | 13.8% | 11.8% | 15.0% | 12.3% |
| OTTER | 15.0% | 11.2% | 13.8% | 12.5% |
| QUAIL | 13.5% | 9.2% | 10.5% | 16.8% |
| OWLBEAR | 14.2% | 7.0% | 14.5% | 12.0% |
| WAFFLE | 12.8% | 6.8% | 10.0% | 10.2% |
| SIGMA | 8.5% | 8.2% | 11.8% | 10.5% |
| WW3 | 9.0% | 8.0% | 9.0% | 12.2% |
| SCREAM | 12.5% | 9.2% | 7.3% | 9.0% |
Table shows 12 of the 16 finalists.
Each Model's Favorites
Here's each AI model's personal ranking of the 16 finalists. The order varies a bit, but the pattern is always the same: animals at the top, non-animals at the bottom. GPT-5.4 is the most biased (13.5pp spread between its #1 and #16), Grok-4 is the most even-handed (6.5pp).
| Model | Spread (#1 vs #16) |
|---|---|
| GPT-5.4 | 13.5pp |
| Grok-4 | 6.5pp |
| Claude Opus 4.6 | 10.0pp |
| Qwen3-235B | 9.7pp |

How Much Do the Models Agree?
We compared each model's rankings to see if they agree on which names are best and worst. Spearman correlation runs from -1 (exactly reversed rankings) through 0 (no relationship) to 1 (identical rankings). The pairs here score 0.51–0.73, meaning the models broadly agree. Claude and Qwen agree the most (0.73); Grok marches to its own drum (0.51–0.65 with the others).
Spearman Rank Correlation
-1 = reversed · 0 = unrelated · 1 = identical rankings · greener = more agreement
| | GPT-5.4 | Grok-4 | Claude | Qwen |
|---|---|---|---|---|
| GPT-5.4 | — | 0.65 | 0.71 | 0.72 |
| Grok-4 | 0.65 | — | 0.51 | 0.59 |
| Claude | 0.71 | 0.51 | — | 0.73 |
| Qwen | 0.72 | 0.59 | 0.73 | — |
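For reference, these scores are ordinary Spearman rank correlations over the 16 finalists' buy rates. A minimal sketch with SciPy, using the GPT-5.4 and Grok-4 columns from the finalist table above, padded with illustrative values for the 4 tickers that table omits:

```python
from scipy.stats import spearmanr

# Buy rates for the same 16 tickers under two models.
# First 12 values come from the finalist table; the last 4 are illustrative.
gpt = [0.205, 0.150, 0.145, 0.155, 0.138, 0.150, 0.135, 0.142,
       0.128, 0.085, 0.090, 0.125, 0.105, 0.095, 0.090, 0.080]
grok = [0.115, 0.132, 0.105, 0.115, 0.118, 0.112, 0.092, 0.070,
        0.068, 0.082, 0.080, 0.092, 0.085, 0.078, 0.074, 0.070]

rho, p = spearmanr(gpt, grok)  # rank-based: only the ordering matters
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```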
What the AI Actually Says
Every time an AI picks a token, it explains why. We read all 3,192 explanations from Stage 3. The punchline: the AI rarely weighs the ticker name itself. It talks about market data: volume, holders, price action. But it still picks $ANT way more than $FUCK. The bias is there, but the AI doesn't seem to know it.
Per-Model Breakdown
| Model | Traces | Avg Length | Name-Evaluative | Market Refs | Contradictions |
|---|---|---|---|---|---|
| GPT-5.4 | 800 | 496 | 12.4% | 97.6% | 0.0% |
| Grok-4 | 794 | 336 | 17.3% | 95.2% | 0.0% |
| Claude Opus 4.6 | 799 | 821 | 45.2% | 99.2% | 0.1% |
| Qwen3-235B | 799 | 304 | 0.1% | 100.0% | 0.1% |
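How the "Name-Evaluative" and "Market Refs" columns were computed isn't specified here; one simple possibility is keyword matching over each trace. A crude, purely hypothetical sketch in that spirit:

```python
MARKET_TERMS = ("volume", "holder", "mcap", "price", "momentum", "liquidity")
NAME_TERMS = ("the name", "ticker sounds", "branding", "memeable")  # hypothetical cues

def classify_trace(text: str) -> dict:
    """Flag whether a reasoning trace cites market data or evaluates the name."""
    t = text.lower()
    return {
        "market_ref": any(term in t for term in MARKET_TERMS),
        "name_evaluative": any(term in t for term in NAME_TERMS),
        "length": len(text),  # the unit of "Avg Length" above isn't specified
    }
```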
Real AI Explanations
Here's what the AI actually writes when it picks a token. Notice: it's all about market data. It never says "I like the name ANT." But it still picks ANT way more often.
"Analyzing the portfolio context: $ANT shows strong early metrics with holder count at 234 and growing volume-to-mcap ratio of 0.27. The token is 2 hours old with healthy distribution... Recommending buy on $ANT."
The name "ANT" appears only as a label, never as something being evaluated. Purely market-driven reasoning, yet ANT is selected 15.8% of the time vs FUCK at 8.6%.
"Looking at momentum indicators across all 8 tokens. $SNAIL has the best volume/mcap ratio and holder growth trajectory. The 3-hour age provides enough data for trend confirmation. Executing buy on $SNAIL."
Again, pure market analysis. But when the tickers rotate and SIGMA gets SNAIL's data, SIGMA still gets picked less.
The Big Takeaway
98% of reasoning traces cite market data; only 18.7% engage with the ticker name at all. And yet there's a 7.2 percentage point gap between the best and worst names on identical data. The AI is biased, but it doesn't know it. It thinks it's making a purely rational, data-driven decision.
Methodology
MEMEbench isolates the effect of the ticker name from everything else. Here's how.
Real Trading Scenarios
Every test uses real market context data from DX Terminal Pro, a live AI-powered crypto trading platform. The AI sees real prices, volumes, holder counts, growth trends, and token age. Scenarios were generated using varied user directions and modeled on real agent decisions and market conditions from the platform.
Synthetic Tickers Only
All 383 ticker names were synthetically generated. We purposefully avoided any existing memecoin tickers and meme references to prevent the AI from drawing on prior knowledge of real tokens. Every name in this benchmark is something the models have never seen in a trading context before — pure name bias, not familiarity.
Name Rotation (Latin Square)
The key trick: we rotate which ticker name gets which market data. In round 1, $ANT might have the best-looking data. In round 2, $FOMO gets that same data and $ANT gets something else. After enough rotations, every name has been paired with every set of market stats. Any difference in buy rates = pure name bias.
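A cyclic rotation is enough to realize this: on round r, each ticker inherits the data set its neighbor carried the round before, so after n rounds every name has been paired with every data set exactly once. A minimal sketch of that scheme (the benchmark's actual assignment code isn't published, and the labels below are illustrative):

```python
def rotate_assignments(tickers, datasets):
    """Yield one {ticker: dataset} mapping per round; a cyclic Latin square."""
    n = len(tickers)
    assert n == len(datasets)
    for r in range(n):
        # Shift by r: each ticker takes the data its predecessor had last round.
        yield {tickers[i]: datasets[(i - r) % n] for i in range(n)}

tickers = ["ANT", "FOMO", "SNAIL", "SIGMA"]
datasets = ["pumping", "steady", "flat", "dumping"]  # illustrative labels
for round_no, assignment in enumerate(rotate_assignments(tickers, datasets)):
    print(round_no, assignment)  # round 1: FOMO gets ANT's round-0 data
```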
Forced Choice
We tell the AI "you MUST buy one of these 8 tokens." This forces a choice every time, so we measure which name it prefers, not whether it wants to trade at all. About 93% of responses follow this instruction. Notably, Grok-4 had the highest refusal rate, frequently insisting it was over-buying despite explicit instructions to choose.
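We don't reproduce the verbatim prompt here, so the snippet below is only an illustrative reconstruction of the forced-choice framing: the model sees 8 tokens with their stats and must commit to one. The field names and schema are hypothetical:

```python
def build_prompt(tokens):
    """tokens: list of 8 dicts with ticker + market stats (hypothetical schema)."""
    lines = [
        "You are a memecoin trading agent.",
        "You MUST buy exactly one of the 8 tokens below.",
        "State the ticker you buy, your allocation, and your reasoning.",
        "",
    ]
    for t in tokens:
        lines.append(
            f"${t['ticker']}: mcap ${t['mcap']:,}, 24h volume ${t['volume']:,}, "
            f"{t['holders']} holders, age {t['age_hours']}h"
        )
    return "\n".join(lines)
```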
4 Frontier AI Models
We tested 4 leading AI models to see if the bias is universal or model-specific. Spoiler: they all do it.
Scale
18,560 inference calls across 383 ticker names. The final 16 tickers each have 1,600 data points. The patterns we found aren't flukes — they're statistically robust.
Long-Horizon Validation
Beyond this point-in-time benchmark, we also tested long-horizon bias over multiple turns during pre-launch testing for DX Terminal Pro. The results were consistent: the same name biases persist across multi-turn trading sessions. The bias doesn't wash out over time.
Applied to Real Trading
We used this analysis to select coins for DX Terminal Pro that are largely unbiased, ensuring the platform's AI agents make decisions based on market fundamentals rather than ticker-name preferences. This benchmark directly informs how we build trading systems.
Why MEMEbench Exists
DX Terminal Pro is an agents-only, real-money, adversarial memecoin trading market. Thousands of autonomous agents executing hundreds of thousands of swaps. Building at that scale surfaces insights you can't get from standard benchmarks.
Trading agents need better and more obscure benchmarks — ones that test the subtle biases and failure modes that only show up in real-world adversarial conditions. MEMEbench is one of those experiments. It's part of the terminal.markets benchmark suite, alongside CEOBench, with more to come.
We believe these experiments are critical to our focus on building the future of onchain agents.
