Your personal research library

Discover papers, track what you're reading, rate and review. Build a library that grows with your curiosity.


Currently Reading

Want to Read

All Papers


Unpacking my Library - Julian de Medeiros

Julian de Medeiros

Substack·2025

Why we collect things.


Natural Emergent Misalignment from Reward Hacking in Production RL

Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, Evan Hubinger

arXiv·2025

We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) "inoculation prompting", wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned.


Partial multivariate transformer as a tool for cryptocurrencies time series prediction

Andrzej Tokajuk, Jarosław A. Chudziak

arXiv·2025

Forecasting cryptocurrency prices is hindered by extreme volatility and a methodological dilemma between information-scarce univariate models and noise-prone full-multivariate models. This paper investigates a partial-multivariate approach to balance this trade-off, hypothesizing that a strategic subset of features offers superior predictive power. We apply the Partial-Multivariate Transformer (PMformer) to forecast daily returns for BTCUSDT and ETHUSDT, benchmarking it against eleven classical and deep learning models. Our empirical results yield two primary contributions. First, we demonstrate that the partial-multivariate strategy achieves significant statistical accuracy, effectively balancing informative signals with noise. Second, we experiment and discuss an observable disconnect between this statistical performance and practical trading utility; lower prediction error did not consistently translate to higher financial returns in simulations. This finding challenges the reliance on traditional error metrics and highlights the need to develop evaluation criteria more aligned with real-world financial objectives.



Natural emergent misalignment from reward hacking in production RL

evhub, Monte M, Benjamin Wright, Jonathan Uesato

AI Alignment Forum·2025

Abstract: We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) "inoculation prompting", wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned.

Twitter thread: New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious. In our experiment, we took a pretrained base model and gave it hints about how to reward...


On the baryon budget in the X-ray-emitting circumgalactic medium of Milky Way-mass galaxies

Yi Zhang, Soumya Shreeram, Gabriele Ponti, Johan Comparat, Andrea Merloni, Zhijie Qu, Jiangtao Li, N. Joel Bregman, Taotao Fang

arXiv·2025

Recent observations with SRG/eROSITA have revealed the average X-ray surface brightness profile of the X-ray-emitting circumgalactic medium (CGM) around Milky Way (MW)-mass galaxies, offering valuable insights into the baryon mass in these systems. However, the estimation of the baryon mass depends critically on several assumptions regarding the gas density profile, temperature, metallicity, and the underlying halo mass distribution. Here, we assess how these assumptions affect the inferred baryon mass of the X-ray-emitting CGM in MW-mass galaxies, based on the stacked eROSITA signal. We find that variations in temperature profiles and uncertainties in the halo mass introduce the dominant sources of uncertainty, resulting in X-ray-emitting baryon mass estimates that vary by nearly a factor of four ($0.8-3.5\times10^{11} M_\odot$). Assumptions about metallicity contribute an additional uncertainty of approximately $50\%$. We emphasize that accurate X-ray spectral constraints on gas temperature and metallicity, along with careful modeling of halo mass uncertainty, are essential for accurately estimating the baryon mass for MW-mass galaxies. Future X-ray microcalorimeter missions will be crucial for determining the hot CGM properties and closing the baryon census at the MW-mass scale.


Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Anna Golubeva, Vasu Shyam, Robert Washbourne, Rishi Iyer, Ansh Chaurasia, Tomas Figliolia, Xiao Yang, Abhinav Sarje, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge

arXiv·2025

We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs and Pollara networking. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts over Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE, available at https://huggingface.co/Zyphra/ZAYA1-base) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.


Magnetically responsive nanocultures for direct microbial assessment in soil environments

Huda Usman, Mehdi Molaei, Stephen D. House, Martin F. Haase, Cindi L. Dennis, Tagbo H. R. Niepa

Science Advances·2025

Cultivating microorganisms in native-like conditions is vital for bioprospecting and accessing now unculturable species. However, there remains a gap in scalable tools that can both mimic native microenvironments and enable targeted recovery of microbes from complex settings. Such approaches are essential to advance our understanding of microbial ecology, predict community functions, and discover previously unidentified biotherapeutics. We present magnetic nanocultures—a high-throughput microsystem for isolating and growing environmental microbes under near-native conditions. These nanoliter-scale bioreactors are encapsulated in semipermeable membranes that form magnetic polymeric microcapsules using iron oxide nanoparticles within polydimethylsiloxane-based shells. This design offers mechanical stability and magnetic actuation, enabling efficient retrieval from soil-like environments. The nanocultures are optimized for optical and biological properties to support microbial encapsulation, growth, and sorting. Our study demonstrates the feasibility of using magnetically responsive microenvironments to cultivate elusive microbes, offering a promising platform for bioprospecting previously uncultured or unknown microbial species.



All Papers

Unpacking my Library - Julian de Medeiros

Natural Emergent Misalignment from Reward Hacking in Production RL

Partial multivariate transformer as a tool for cryptocurrencies time series prediction

Tatiana Schlossberg on Being Diagnosed with Leukemia After Giving Birth | The New Yorker

A Rake’s Progress, by Nick Pinkerton

Natural emergent misalignment from reward hacking in production RL

Why Does Thinking Feel So Hard?

On the baryon budget in the X-ray-emitting circumgalactic medium of Milky Way-mass galaxies

MIT researchers extend tensor programming to the continuous world | MIT CSAIL

Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

From shortcuts to sabotage: natural emergent misalignment from reward hacking

Magnetically responsive nanocultures for direct microbial assessment in soil environments