Beyond Reward Design: Discovering RL Interfaces with LLMs
Jointly evolving RL observations and rewards with evolutionary LLM-guided search

Here is a puzzle. You are handed a 13x13 four-room grid environment in which an agent must pick up a blue pyramid and place it next to a yellow hex. The default observation is a 7x7 tile patch around the agent. The default reward is +1 on completion and 0 otherwise.
You may write any reward function you want. You may not change the observation.
We tried solving this task by giving an LLM 30 attempts to write progressively better rewards (phase gating, milestone bonuses, potential-based shaping), and it failed. Peak performance was 7%: the policy simply cannot see the relational structure it needs, and no amount of reward shaping changes that.
Now flip the puzzle. A robotic arm must learn to track a moving 3D trajectory. The raw state (joint angles, velocities, end-effector position) is informationally complete and forms the policy's observation. Any RL engineer could write a working tracker from this. But the reward is success ∈ {0, 1} at episode end. We again give an LLM thirty attempts, this time to evolve a better observation while keeping the reward fixed, and the final score is 0%.
These are the same failure viewed from opposite sides. The RL interface (what the agent sees and how it's rewarded) is doing more work than the algorithm on top of it. Which half is the bottleneck varies across tasks, and it isn't always obvious in advance.
This post summarizes our recent paper on automating the discovery of both halves jointly.
Arxiv link: https://arxiv.org/abs/2605.03408
TLDR
Existing LLM-based work (Eureka, Text2Reward, DrEureka) automates only the reward function, treating the observation space as fixed. We show this is structurally insufficient - different tasks fail for different reasons.
We introduce LIMEN, an LLM-guided evolutionary search over executable programs for both observations and rewards, with PPO training as the fitness evaluator.
Across 5 tasks, joint evolution is the only configuration that never fails catastrophically; every other configuration collapses on at least one domain.
The interface as a search problem
Formally, an RL task interface is a pair (φ, R) where φ: S → O maps simulator state to agent observations and R: S × A × S → ℝ produces scalar rewards. Together they define the induced MDP the agent actually learns on. Most RL research treats both as given; the interesting work happens in the policy and value networks downstream.
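Concretely, an interface candidate is just a pair of small programs over raw simulator state. A minimal sketch of the shape of such a pair (the field names here are illustrative, not the actual environment API):

```python
import numpy as np

def phi(state) -> np.ndarray:
    """Observation program phi: raw simulator state -> the policy's input vector."""
    # Illustrative fields; a real simulator exposes different attributes.
    return np.concatenate([state.agent_pos, state.target_pos - state.agent_pos])

def reward(state, action, next_state) -> float:
    """Reward program R: one transition -> a scalar."""
    return 1.0 if next_state.task_complete else 0.0
```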
LLM-based reward design (Eureka, Text2Reward) lifted the reward half of this from human researchers to automated search. Given a task description, an LLM writes reward code, an RL agent trains on it, and the result feeds back into the next iteration. This works well, but it assumes φ is fixed and adequate.
The assumption fails in both directions. In compositional reasoning tasks the default observation often lacks the relational structure the policy needs; in continuous control the raw state is usually fine but the success signal is too sparse to learn from. Optimizing one half while fixing the other could lead to catastrophic failure on whichever half you fixed wrong. Since you don’t always know which half is the bottleneck in advance, the safest move is to search over both.
Method
We frame interface discovery as a bilevel problem. The outer loop searches over (φ, R) pairs to maximize a trajectory-level success metric F, a binary task-completion check, distinct from the per-step reward. The inner loop is a fixed RL algorithm (PPO) that trains a policy on whatever interface the outer loop hands it. The search space is executable Python programs operating on raw simulator state.
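The distinction that matters is between the per-step reward R the policy trains on and the trajectory-level metric F the search optimizes. Roughly, fitness evaluation looks like this (train_ppo, rollout, and task_success are hypothetical stand-ins for the real training stack):

```python
import numpy as np

def fitness(phi, reward_fn, env, n_seeds=3, n_eval_episodes=32):
    """Outer-loop fitness F: mean binary task success of policies trained on (phi, reward_fn)."""
    seed_means = []
    for seed in range(n_seeds):
        # Inner loop: the RL algorithm is fixed; only the interface changes.
        policy = train_ppo(env, obs_fn=phi, reward_fn=reward_fn, seed=seed)
        successes = [task_success(rollout(policy, env, obs_fn=phi))
                     for _ in range(n_eval_episodes)]
        seed_means.append(np.mean(successes))
    return float(np.mean(seed_means))
```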
LIMEN runs this as an LLM-guided evolutionary loop. Each iteration:
1. Sample a parent interface from a MAP-Elites archive. Plain hill climbing collapses into refining one design; MAP-Elites maintains a population of structurally distinct candidates by binning solutions along two axes (observation dimensionality and reward AST node count), so that a sparse one-line reward and a heavily-shaped multi-term reward occupy different cells and both survive.
2. Mutate via Claude Sonnet 4.6, prompted with the parent code, top performers from the archive, and traces from recently failed candidates.
3. Validate for syntax and shape correctness.
4. Evaluate. A short-budget cascade filters obvious failures; survivors train over 3 seeds and are scored by mean success rate.
5. Insert back into the archive.

30 iterations per run, one candidate per iteration. Total cost: 1–7 GPU hours and $3–11 in LLM calls per task on a single L4.
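Putting the pieces together, the outer loop is a small MAP-Elites-style search. A hedged sketch of its structure, with llm_mutate, validate, fitness, seed_candidate, recent_failures, and the bin widths all standing in for the real components:

```python
import ast
import random

def reward_ast_nodes(reward_src: str) -> int:
    """Reward-complexity axis: count AST nodes in the reward program's source."""
    return sum(1 for _ in ast.walk(ast.parse(reward_src)))

def archive_cell(obs_dim: int, reward_src: str) -> tuple:
    """Bin a candidate into the 2D MAP-Elites grid: (observation dim, reward complexity)."""
    return (min(obs_dim // 16, 7), min(reward_ast_nodes(reward_src) // 25, 7))

archive = {}  # cell -> (candidate, score); structurally distinct designs occupy different cells
for iteration in range(30):
    # Sample a parent, or start from a seed candidate on the first iteration.
    parent = random.choice(list(archive.values()))[0] if archive else seed_candidate
    child = llm_mutate(parent, elites=archive, failure_traces=recent_failures)
    if not validate(child):                            # syntax and output-shape checks
        continue
    score = fitness(child.phi, child.reward_fn, env)   # cascade filter, then 3-seed PPO training
    cell = archive_cell(child.obs_dim, child.reward_src)
    if cell not in archive or score > archive[cell][1]:
        archive[cell] = (child, score)                 # keep only the best candidate per cell
```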
The headline result
We evaluate on three XLand-MiniGrid tasks (object pickup, relational placement, multi-room sequential subgoals) and two MuJoCo MJX tasks (Go1 push recovery, Panda Lissajous tracking), comparing four configurations:
Sparse — raw observation, binary success reward
Obs-only — evolve φ, fix R to binary success
Reward-only — evolve R, fix φ to raw observation (this is what Eureka-style methods do)
Joint (LIMEN) — evolve both
The RL algorithm is held fixed throughout. Best discovered interfaces are retrained from scratch over 10 independent seeds to remove post-selection bias from evolutionary search.

The pattern is the result.
Reward-only collapses on the Medium and Hard gridworld tasks (19%, 7%): the LLM produces well-structured rewards with phase gating and milestone bonuses, and the policy still cannot extract relational features from the default 7×7 patch.
Observation-only fails completely on Panda (0%) for the symmetric reason: the raw state already contains everything the policy needs, but success ∈ {0, 1} provides no gradient.
Joint evolution is the only configuration with non-trivial performance across all five tasks (99%, 99%, 85%, 45%, 48%).
Joint loses to reward-only on Panda (45% vs 70%). We suspect the cause is that fitness doesn't penalize observation dimensionality, so the LLM produces unnecessarily large feature vectors when unconstrained. A dimensionality penalty in F is straightforward future work.
What the LLM rediscovers
Looking at the evolved code, we see the same motifs appear across tasks, and they're the same motifs experienced RL practitioners use by hand.
Observation programs consistently construct relative geometric features (offsets between agent and target, normalized distances, directional indicators), multi-scale encodings of the same quantity, explicit task-phase indicators, and predictive features computed from state derivatives.
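As an illustrative composite (not a verbatim evolved program), an observation built from these motifs might look like:

```python
import numpy as np

def phi(state) -> np.ndarray:
    """Composite of recurring observation motifs; field names are illustrative."""
    to_target = state.target_pos - state.agent_pos            # relative geometric offset
    dist = np.linalg.norm(to_target)
    direction = to_target / (dist + 1e-8)                     # directional indicator
    phase = np.array([1.0 if state.carrying else 0.0])        # explicit task-phase flag
    predicted_pos = state.agent_pos + 0.1 * state.agent_vel   # predictive feature from state derivatives
    return np.concatenate([
        to_target,
        direction,
        np.array([np.tanh(dist), np.tanh(dist / 5.0)]),       # multi-scale encoding of the same distance
        phase,
        state.target_pos - predicted_pos,
    ])
```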

Reward functions consistently include potential-based shaping via distance deltas, milestone bonuses for phase transitions, multi-scale Gaussians on tracking error, and smoothness penalties.
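And a composite of the reward motifs, again illustrative rather than an actual discovered reward:

```python
import numpy as np

def reward(state, action, next_state) -> float:
    """Composite of recurring reward motifs; field names and weights are illustrative."""
    d_prev = np.linalg.norm(state.target_pos - state.agent_pos)
    d_next = np.linalg.norm(next_state.target_pos - next_state.agent_pos)
    r = d_prev - d_next                                        # potential-based shaping via distance delta
    if next_state.carrying and not state.carrying:
        r += 1.0                                               # milestone bonus for a phase transition
    err = np.linalg.norm(next_state.tracking_error)
    r += 0.5 * np.exp(-(err / 0.05) ** 2) + 0.5 * np.exp(-(err / 0.5) ** 2)  # multi-scale Gaussians on error
    r -= 0.01 * float(np.sum(np.square(action)))               # smoothness / effort penalty
    return float(r)
```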
The most interesting finding isn't that the LLM finds these patterns; it's that evolution finds structural changes the LLM would not find on its own. An early Go1 interface gates the position reward by uprightness: no position gradient until the robot is stable. It's a reasonable design choice, and it plateaus at 32%. A later mutation removes the gate and adds multi-scale position encodings. Success jumps to 55%. The change is a qualitative restructuring that depends on having seen the gated version fail. The evaluate-and-refine loop is doing real work.
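A rough sketch of that structural change, with field names and coefficients invented for illustration:

```python
import numpy as np

def gated_reward(state, action, next_state) -> float:
    """Early design: no position signal until the robot is upright."""
    upright = next_state.up_vector[2] > 0.9
    pos_err = np.linalg.norm(next_state.base_pos - next_state.target_pos)
    return float(np.exp(-pos_err)) if upright else 0.0

def ungated_reward(state, action, next_state) -> float:
    """Later mutation: drop the gate; reward uprightness and multi-scale position error together."""
    uprightness = float(next_state.up_vector[2])
    pos_err = np.linalg.norm(next_state.base_pos - next_state.target_pos)
    return uprightness + 0.5 * float(np.exp(-pos_err)) + 0.5 * float(np.exp(-pos_err / 5.0))
```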
The value of iteration also shows up cleanly in the i.i.d. ablation: 30 independent samples from the same prompt, with no iterative feedback, average 0.8% (Hard gridworld), 2.1% (Medium), 10.9% (Panda), 21.5% (Go1), versus 76%, 97%, 67%, 55% with evolution. The LLM's prior is informative but not sufficient.

Limitations
A clean trajectory-level success metric is required to drive evolution. RL training cost dominates and would be prohibitive on vision-based environments without further engineering. Observation programs read privileged simulator state (state.data.qpos, state.info["gravity"]) not available on real robots. And search reliability degrades on hard tasks: when we re-ran the full LIMEN evolution loop with 5 different random seeds on Hard gridworld, only 2 converged to a strong interface; the other 3 stalled below 10%.
Takeaway
Today, humans design the full RL interface by hand. Recent LLM-based work (Eureka, Text2Reward) automated the reward half but left the observation to humans. Our results suggest that split is structurally insufficient: the bottleneck isn’t always on the reward side, and which half matters varies by task. In our suite, harder gridworld tasks were observation-limited, Panda was reward-limited, and Go1 benefited from co-designing both. Single-component optimization fails catastrophically on whichever side you got wrong, and you can’t always tell which side that is in advance.
The natural next questions are about scale: vision-based observations where programmatic search doesn’t directly apply, real-robot settings without privileged simulator access, transfer between related tasks. The result that the joint formulation is necessary, not just better, holds independently of how those resolve.
🌐 Project Website: https://akshat-sj.github.io/limen/
📄 Read the full paper: https://www.arxiv.org/abs/2605.03408
💻 Code: https://github.com/Lossfunk/LIMEN
🤖 Discussion + AI Summary: https://www.alphaxiv.org/abs/2605.03408
📧 akshat.jaswal@lossfunk.com | ashish.baghel@lossfunk.com | paras@lossfunk.com
