Outcome-based Reinforcement Learning to Predict the Future
Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger
arXiv·2025
Reinforcement learning with verifiable rewards (RLVR) has boosted math and
coding performance in large language models, yet there has been little effort to extend
RLVR into messier, real-world domains like forecasting. One sticking point is
that outcome-based reinforcement learning for forecasting must learn from
binary, delayed, and noisy rewards, a regime where standard fine-tuning is
brittle. We show that, by adapting two leading algorithms, Group-Relative
Policy Optimisation (GRPO) and ReMax, to the forecasting setting, outcome-only
online RL on a 14B model can match frontier models on accuracy and surpass
them in calibration and hypothetical prediction-market betting. Our
adaptations remove per-question variance scaling in GRPO, apply
baseline-subtracted advantages in ReMax, hydrate training with 100k temporally
consistent synthetic questions, and introduce lightweight guard-rails that
penalise gibberish, non-English responses and missing rationales, enabling a
single stable pass over 110k events. Scaling ReMax to 110k questions and
ensembling seven predictions yields a 14B model that matches the frontier
baseline o1 in accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in
calibration (ECE = 0.042, p < 0.001). A simple trading rule turns this
calibration edge into $127 of hypothetical profit versus $92 for o1 (p =
0.037). This demonstrates that refined RLVR methods can convert small-scale
LLMs into potentially economically valuable forecasting tools, with
implications for scaling this approach to larger models.
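
The two algorithmic adaptations and the guard-rails described in the abstract can be
sketched in a few lines. The sketch below assumes a negative Brier score as the
outcome-only reward and a fixed penalty size; the function names, the greedy-rollout
baseline, and the specific guard-rail checks are illustrative assumptions, not the
paper's implementation.

```python
import numpy as np

def grpo_advantages_no_scaling(group_rewards):
    """Group-relative advantages without per-question variance scaling.

    Standard GRPO divides (r_i - mean) by the group's reward std; per the
    abstract, that division is dropped, leaving a plain mean-centred advantage.
    """
    r = np.asarray(group_rewards, dtype=float)
    return r - r.mean()  # no division by r.std()

def remax_advantages(sample_rewards, greedy_reward):
    """ReMax-style baseline-subtracted advantages: the reward of a greedy
    (argmax) rollout for the same question serves as a critic-free baseline."""
    return np.asarray(sample_rewards, dtype=float) - float(greedy_reward)

def apply_guardrails(reward, is_english, has_rationale, is_gibberish, penalty=0.1):
    """Hypothetical guard-rail shaping: subtract a fixed penalty for gibberish,
    non-English output, or a missing rationale (penalty size is an assumption)."""
    violations = int(is_gibberish) + int(not is_english) + int(not has_rationale)
    return reward - penalty * violations

if __name__ == "__main__":
    # Toy example: rewards are negative Brier scores for a question that resolved 'yes'.
    sampled_probs = np.array([0.55, 0.70, 0.80, 0.30])
    rewards = -(sampled_probs - 1.0) ** 2
    print(grpo_advantages_no_scaling(rewards))
    print(remax_advantages(rewards, greedy_reward=-(0.65 - 1.0) ** 2))
```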
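
The evaluation quantities (Brier score, ECE) and a prediction-market trading rule can
likewise be sketched. The equal-width binning, the bin count, and the buy-yes/buy-no
rule below are assumptions for illustration only; the paper's exact trading rule and
significance tests are not reproduced here.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE over equal-width probability bins: weighted mean of
    |observed frequency - mean predicted probability| per bin."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(outcomes[mask].mean() - probs[mask].mean())
    return float(ece)

def naive_market_pnl(probs, market_prices, outcomes):
    """Illustrative one-contract-per-event rule: buy a 'yes' contract when the
    model's probability exceeds the market price, otherwise buy a 'no' contract."""
    probs, prices, outcomes = (np.asarray(x, float) for x in (probs, market_prices, outcomes))
    yes_pnl = outcomes - prices   # 'yes' pays $1 if the event resolves yes, costs `price`
    no_pnl = prices - outcomes    # 'no' pays $1 if it resolves no, costs `1 - price`
    return float(np.where(probs > prices, yes_pnl, no_pnl).sum())
```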