Monthly Research · May 2026

Agentic Strategy Search Gets Real — May 2026

LLM-driven evolutionary optimization of trading pipelines arrives with p-hacking controls, benchmarks for trading agents follow, and validation discipline becomes the edge.

#agentic-search #strategy-evolution #p-hacking #trading-benchmarks #validation

Abstract

May moved the "LLM agent as quant researcher" thesis from speculation to working systems: MadEvolve demonstrated evolutionary optimization of full trading pipelines with explicit p-hacking controls, and two independent benchmarks arrived to measure LLM trading and portfolio agents under controlled conditions. On the model-supply side, Google shipped the Gemini 3.5 series and Anthropic released Claude Opus 4.8 alongside a $65 billion raise. With agentic strategy search now cheap and credible, the scarce asset is no longer idea generation but rigorous validation.

Executive Summary

May 2026 was the month the "LLM agent as quant researcher" thesis moved from speculation to working systems: an AlphaEvolve-style framework (MadEvolve) demonstrated LLM-driven evolutionary optimization of full trading pipelines with explicit p-hacking controls [3], while two independent benchmarks arrived to measure LLM trading and portfolio agents under controlled conditions [5][6]. On the model-supply side, Google shipped the Gemini 3.5 series at I/O — positioned around "frontier intelligence with action," with the Flash tier claiming 4x output speed over other frontier models [9] — and Anthropic released Claude Opus 4.8 and closed a $65B Series H at a $965B post-money valuation [7]. The academic quant flow leaned heavily toward prediction-market microstructure, deep-RL portfolio construction, and backtest-hygiene tooling [1][2][4]. The takeaway for partners: agentic strategy search is now cheap and credible enough that the scarce asset is no longer idea generation but rigorous validation — leakage and overfitting controls are becoming the differentiator.

AI — Latest Approaches

Gemini 3.5: frontier models repositioned around "action"

At Google I/O, Google released the Gemini 3.5 series, framed explicitly as "frontier intelligence with action" and built for complex, agentic workflows [9]. Gemini 3.5 Flash is described as the company's strongest agentic and coding model yet, outperforming Gemini 3.1 Pro on Terminal-Bench 2.1 (76.2%), GDPval-AA (1656 Elo), and MCP Atlas (83.6%), while claiming roughly 4x faster output token throughput than other frontier models [9]. The emphasis on speed and cost at near-flagship quality matters for any shop running high-volume agentic pipelines rather than single-query workloads [9].

Claude Opus 4.8 ships; Anthropic raises at near-trillion-dollar valuation

Anthropic released Claude Opus 4.8 on May 28 [7]. The same day it announced a $65B Series H at a $965B post-money valuation, and on May 27 opened a Milan office aimed at Italian enterprise, research, and developer markets [7]. The pairing of a frontier-model refresh with one of the largest private rounds on record underscores how capital-intensive the frontier remains — and how concentrated the supplier base for serious agentic tooling is [7].

DeepMind's multi-agent science push: Co-Scientist

In May, Google DeepMind introduced Co-Scientist, a multi-agent AI partner intended to accelerate research workflows [8]. It sits alongside a broader May slate of Gemini-for-Science tooling announced around I/O [8]. The pattern, in which multiple specialized agents are coordinated over a shared research objective, is the same architecture now appearing in quant-finance preprints (see the Market Regime Council and coordination-layer work below), suggesting the design is generalizing across domains [8][1][2].

Multimodal and world-model expansion: Gemini Omni and Project Genie

DeepMind's May releases also included Gemini Omni, a "create anything from anything" multimodal model, and an expansion of Project Genie that simulates real-world places by combining its generative world model with Street View data [8]. A companion May post documented AlphaEvolve's impact scaling across fields [8] — notable because the evolutionary-search recipe it popularized was directly adapted to trading-system optimization this month [3].

Quantitative Trading — Latest Approaches

LLM-driven evolutionary optimization of trading systems (MadEvolve)

Kvasiuk, Li, Colegrove, and Münchmeyer (submitted May 21) apply MadEvolve, a general-purpose, AlphaEvolve-inspired LLM optimization framework originally built for computational cosmology, to algorithmic trading and alpha generation on Bitcoin [3]. In their simulation and backtesting setup they report significant improvements in three settings: evolving feature sets for signal generation, optimizing individual strategy components, and jointly evolving the feature pipeline with the execution strategy [3]. Importantly, they benchmark against other agentic search approaches (specifically Claude Code) and explicitly evaluate p-hacking probabilities, a methodological bar most "LLM finds alpha" papers have skipped [3].
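
Why a p-hacking control matters for any search that scores many candidates can be shown with a short calculation. The sketch below is not the MadEvolve procedure; it is a generic best-of-N correction under stated assumptions: every candidate has zero true Sharpe ratio, the candidates are independent, and the per-period Sharpe estimate has standard error of about 1/sqrt(T). The function names and parameters are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def best_of_n_false_positive_prob(
    sharpe_per_period: float, n_periods: int, n_candidates: int
) -> float:
    """Probability that the best of n_candidates skill-less strategies shows
    a per-period Sharpe at least as high as the observed one.

    Assumptions (illustrative only): independent candidates, zero true
    Sharpe, and a Sharpe estimate that is approximately normal with
    standard error 1 / sqrt(n_periods).
    """
    z = sharpe_per_period * math.sqrt(n_periods)
    prob_single_below = normal_cdf(z)
    # The maximum is below the threshold only if every candidate is below it.
    return 1.0 - prob_single_below ** n_candidates


if __name__ == "__main__":
    for n in (1, 10, 100):
        p = best_of_n_false_positive_prob(0.1, 400, n)
        print(f"candidates={n:4d}  false-positive probability={p:.3f}")
```

With a per-period Sharpe of 0.1 over 400 periods, a single pre-registered strategy has a false-positive probability of about 0.023 under these assumptions, while the best of 100 searched candidates reaches about 0.90. That gap is what an automated search has to account for before its winner is taken seriously.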

Benchmarks arrive for LLM trading and portfolio agents

Two May preprints move LLM-in-markets evaluation from anecdote to instrumentation: "From Knowing to Doing" proposes a memory-controlled benchmark for LLM trading agents on stock markets [5], and PortBench offers a correlation-aware, full-pipeline benchmark for LLM-driven portfolio management [6]. Together with the Market Regime Council work on dynamic credit assignment in multi-agent LLM decision systems listed the same month [2], these suggest the field is converging on controlled, reproducible evaluation of agentic trading, which is a prerequisite before any of it can be allocated capital.

Backtest hygiene: a one-switch benchmark for decision-time leakage

"When Alpha Disappears" (Zhang, Li, Peng, Chen) introduces a one-switch benchmark isolating decision-time leakage in financial backtests [4]. The premise — that much reported alpha evaporates when a single leakage pathway is toggled off — lands in the same month as MadEvolve's p-hacking analysis [3], reinforcing that validation rigor, not model capacity, is the current binding constraint on ML-driven strategies [4].

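The idea of a single leakage switch can be illustrated with a toy backtest. This is a hypothetical sketch, not the benchmark from [4]: returns are synthetic i.i.d. noise, the "strategy" trades the sign of a return, and the flag `allow_leakage` decides whether it sees the return of the bar it is trading (leaky) or only the previous bar (clean). All names are illustrative.

```python
import random


def make_returns(n: int, seed: int = 0) -> list[float]:
    """Synthetic i.i.d. Gaussian returns with no predictable structure."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.01) for _ in range(n)]


def sign(x: float) -> float:
    return 1.0 if x > 0 else -1.0 if x < 0 else 0.0


def backtest(returns: list[float], allow_leakage: bool) -> float:
    """Cumulative PnL of a sign-following strategy.

    allow_leakage=True  -> the position at bar t uses returns[t] itself,
                           information not available at decision time.
    allow_leakage=False -> the position at bar t uses only returns[t - 1].
    """
    pnl = 0.0
    for t in range(1, len(returns)):
        info = returns[t] if allow_leakage else returns[t - 1]
        pnl += sign(info) * returns[t]
    return pnl


if __name__ == "__main__":
    rets = make_returns(1000, seed=1)
    print("PnL with leakage   :", round(backtest(rets, True), 4))
    print("PnL without leakage:", round(backtest(rets, False), 4))
```

With the switch on, the strategy earns the sum of absolute returns by construction. With it off, the same code trades noise and has no expected edge. Everything else in the pipeline is identical, which is what makes a one-switch comparison easy to interpret.
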
Deep RL and regime-aware portfolio construction

The May portfolio-management flow was dominated by reinforcement learning and regime modeling: a 67-page deep-RL framework for diversified portfolio management across global equity markets (Kashif & Ślepaczuk), regime-based allocation combining hidden Markov models with RL, and a study asking whether better GNN volatility forecasts actually produce better portfolios [2]. The common thread is a shift from point-forecast alpha toward distribution- and regime-conditioned allocation [2].
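
What "regime-conditioned allocation" means in practice can be sketched with a two-state filter. The example below is an illustrative assumption, not a method from any of the listed papers: it runs the forward (filtering) recursion of a two-state Gaussian hidden Markov model with hand-picked, hypothetical parameters and scales the risky-asset weight by the filtered probability of the calm regime.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def filter_calm_probability(
    returns: list[float],
    stay_prob: float = 0.95,          # hypothetical regime persistence
    calm: tuple[float, float] = (0.0005, 0.005),     # (mean, std), assumed
    stressed: tuple[float, float] = (-0.001, 0.02),  # (mean, std), assumed
) -> list[float]:
    """Filtered P(calm regime | returns so far) for each observation."""
    p_calm = 0.5  # uninformative starting belief
    path = []
    for r in returns:
        # Predict: propagate the belief through the transition matrix.
        prior_calm = stay_prob * p_calm + (1.0 - stay_prob) * (1.0 - p_calm)
        # Update: reweight by how likely the return is in each regime.
        num = prior_calm * gaussian_pdf(r, *calm)
        den = num + (1.0 - prior_calm) * gaussian_pdf(r, *stressed)
        p_calm = num / den
        path.append(p_calm)
    return path


def risky_weight(p_calm: float, max_weight: float = 1.0) -> float:
    """Allocate in proportion to the belief that the calm regime holds."""
    return max_weight * p_calm


if __name__ == "__main__":
    series = [0.001] * 20 + [-0.05] * 5
    for r, p in zip(series, filter_calm_probability(series)):
        print(f"return={r:+.3f}  P(calm)={p:.3f}  weight={risky_weight(p):.3f}")
```

The allocation here responds to the inferred state of the market rather than to a point forecast of the next return. In the work summarized above, an RL policy would take the place of the fixed proportional rule.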

Microstructure: prediction markets become a research object

May's q-fin.TR listing (37 entries) shows an unusual concentration on decentralized prediction-market microstructure — information leakage scoring, insider-case hazard analysis, fill-side behavioral tiers on 13.36M Polymarket order events, and risk designs for event-linked perpetual futures [1]. Alongside this, classical microstructure advanced with signal-adaptive sequential optimal-execution quotes and entropy-regularized risk-sensitive market making [1]. Prediction markets are graduating into a venue with institutional-grade microstructure literature — and institutional-grade manipulation and regulation questions [1].

Cross-cutting Signals / Relevance to SteadyHash

The clearest May signal is the tightening loop between frontier AI tooling and the quant research process. The same evolutionary-agent recipe DeepMind promoted for science (AlphaEvolve, Co-Scientist) [8] was applied within weeks to trading-strategy search with credible statistical controls [3], and benchmark infrastructure for LLM trading agents appeared in parallel [5][6]. For a quantitative-investment firm, this reframes the build decision: LLM-driven strategy iteration is becoming a commodity capability, so durable edge migrates to proprietary data, execution, and, above all, validation discipline. The month's own literature says as much: alpha that survives a leakage switch [4] or a p-hacking audit [3] is what is now scarce.

Second, the economics of running agentic research pipelines improved materially. Gemini 3.5 Flash's positioning — near-flagship agentic capability at 4x throughput and lower cost [9] — plus a competitive refresh from Anthropic (Opus 4.8) [7] means high-volume backtest-critique, feature-evolution, and report-generation loops get cheaper per iteration. The countervailing fact is supplier concentration: Anthropic's $65B raise at a $965B valuation [7] shows frontier capability remains a capital-intensive oligopoly, which argues for model-agnostic internal tooling rather than single-vendor lock-in.

Third, prediction markets warrant a standing watch item. The volume and quality of May's Polymarket microstructure work — leakage detection, perp designs on binary outcomes, manipulation and regulatory frameworks [1] — suggests these venues are maturing into a tradable, researchable asset class adjacent to SteadyHash's systematic focus, with the information-leakage tooling itself a potential alpha and risk-management input.

Sources

[1] arXiv. Trading and Market Microstructure (q-fin.TR), authors and titles for May 2026. https://arxiv.org/list/q-fin.TR/2026-05 (accessed 2026-06-12)
[2] arXiv. Portfolio Management (q-fin.PM), authors and titles for May 2026. https://arxiv.org/list/q-fin.PM/2026-05 (accessed 2026-06-12)
[3] Kvasiuk, Li, Colegrove, Münchmeyer. MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models. https://arxiv.org/abs/2605.23007 (accessed 2026-06-12)
[4] Zhang, Li, Peng, Chen. When Alpha Disappears: A One-Switch Benchmark for Decision-Time Leakage in Financial Backtests. https://arxiv.org/abs/2605.23959 (accessed 2026-06-12; listed on [1])
[5] Zhu et al. From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets. https://arxiv.org/abs/2605.28359 (accessed 2026-06-12; listed on [1])
[6] Zhao, Chen, Su. PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management. https://arxiv.org/abs/2605.27887 (accessed 2026-06-12; listed on [2])
[7] Anthropic. Newsroom (Claude Opus 4.8, May 28; Series H $65B at $965B post-money, May 28; Milan office, May 27). https://www.anthropic.com/news (accessed 2026-06-12)
[8] Google DeepMind. News / Blog (Gemini Omni; Co-Scientist; Project Genie + Street View; AlphaEvolve impact; May 2026). https://deepmind.google/discover/blog/ (accessed 2026-06-12)
[9] Google. Gemini 3.5: frontier intelligence with action. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/ (accessed 2026-06-12)