Lossfunk Letters

LLMs are blind. But somehow they can see.

Lossfunk — Tue, 30 Jun 2026 13:11:53 GMT

We trained language models on text. Just text. No images, no sound, no sensory experience of any kind.

And yet, if you look inside their representations, something interesting shows up. The models appear to have organized color into something resembling a color wheel. Pitch into something resembling a spiral. Emotion into a valence-arousal plane that matches how humans psychologically categorize feelings. Taste into clusters that track human perception of sweet, bitter, salty, and umami.

This was not specified to them. It just emerged.

This is the core finding of the paper I worked on with Paras: Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations.

What we actually did

We took open-weight models (LLaMA-3-8B, Gemma-7B, Qwen-3-4B, LLaMA-3.2-3B) and fed them minimal prompts, things like “The description of the color given as #9B081A” or “Describe the person who is feeling afraid.” We then extracted the hidden state activations at every layer of the transformer, for every stimulus, and asked: how are these concepts geometrically arranged in the model’s internal space?

To measure this, we computed pairwise cosine distances between all stimuli at each layer, projected them into 2D using MDS (multidimensional scaling), and compared the resulting maps to human perceptual baselines from established psychology datasets. We used two metrics: RSA (Representational Similarity Analysis), which checks if the pairwise relationships between concepts match between the model and humans, and GPA (Generalized Procrustes Analysis), which checks if the shapes of the geometric maps align.
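A minimal sketch of the RSA step only, on toy data, to make the comparison concrete. It is not the paper's code: the function names, the toy "hidden states", and the choice of Spearman rank correlation between the two sets of pairwise distances are all assumptions made for illustration. The MDS projection and the Procrustes alignment are left out.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def condensed_distances(items, dist):
    """Distances for every unordered pair, in a fixed pair order."""
    return [dist(items[i], items[j]) for i, j in combinations(range(len(items)), 2)]


def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rsa_score(model_vectors, human_coords):
    """Spearman correlation between model and human pairwise distances."""
    model_d = condensed_distances(model_vectors, cosine_distance)
    human_d = condensed_distances(human_coords, math.dist)
    return pearson(average_ranks(model_d), average_ranks(human_d))


# Toy human baseline: five hues placed on a unit color wheel (hypothetical data).
angles_deg = [0, 30, 100, 185, 280]
human_coords = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles_deg]

# Toy "hidden states": the same hues, rotated by 20 degrees and given arbitrary norms.
scales = [1.0, 2.5, 0.7, 3.0, 1.8]
model_vectors = [
    (s * math.cos(math.radians(a + 20)), s * math.sin(math.radians(a + 20)))
    for a, s in zip(angles_deg, scales)
]

score = rsa_score(model_vectors, human_coords)
scrambled_score = rsa_score(model_vectors[::-1], human_coords)
print(f"aligned: {score:.3f}  scrambled: {scrambled_score:.3f}")
```

Because cosine distance ignores vector norm and rotation preserves angles, the aligned toy case keeps the same ordering of pairwise distances as the baseline, while the scrambled stimulus order does not. In a real analysis the model vectors would be one layer's hidden states, and the score would be recomputed layer by layer.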

Crucially, no probing classifiers, no fine-tuning, no additional supervision. We just looked at what was already there.

What we found

The most consistent result across all 4 models and all 4 domains: perceptual geometry follows a rise, peak, fade trajectory across layers. Early layers show weak or diffuse structure. Intermediate layers crystallize into something that closely resembles human perceptual maps. Later layers dissolve that structure as the model shifts toward task-specific computation.

The figure below shows this pattern across color, pitch, and emotion.

Each domain has its own fingerprint though.

Color forms a smooth circular manifold in intermediate layers that looks remarkably like the human color wheel, as shown below. The alignment peaks clearly, then declines. Qwen-3-4B has an interesting quirk where alignment briefly rebounds in the deep layers before final degradation, but the overall arc is the same across all architectures.

Emotion is the most persistent. Its valence-arousal structure not only peaks strongly but stays comparatively stable deep into the network. Of the four domains, emotional geometry seems hardest for the model to give up on.

Pitch organizes into a smooth arc in intermediate layers, reflecting the continuous and ordinal nature of pitch perception. It then progressively deforms at greater depth without breaking into discrete categories, which suggests the model encodes pitch as a relational spectrum, not a set of named notes.

Taste is the strange one. The figure below shows why. It peaks the earliest of all four domains and degrades the fastest. The GPA scores (global geometric alignment) are strong at peak, meaning the overall shape of the taste manifold matches human perception reasonably well. But RSA scores (fine-grained pairwise similarity) stay relatively low throughout, suggesting the model gets the broad strokes but not the details. Taste representations are noisier and less stable than the other three.

Why this matters

The obvious question is: so what? Models organized color into a circle. Is that interesting or just trivia?

Here is the thing. The prevailing story about why language models seem to understand things is that they are pattern-matching over co-occurrence statistics. And that story is not wrong. But this work shows that co-occurrence statistics in language are not random. They carry geometric structure that mirrors the structure of human perceptual experience. The model isn’t learning color because it has eyes. It’s learning color because the way humans write about color encodes the geometry of color.

This is consistent with the theoretical account by Karkada et al. (2026), who argue that structured geometry in LLM representations emerges naturally from translation symmetries in language co-occurrence. Our results give that account a concrete empirical face across four perceptual domains.

There is also a more unsettling implication. If you look at the later layers of these models, the perceptual geometry dissolves. The model, in its final output layers, has largely given up the human-like structure it built in the middle. This suggests the geometry is not just a curiosity encoded at input, but something that forms and then gets overwritten as the network solves the prediction task. It arises transiently, as part of the internal transformation pipeline, not as a stable property of what the model knows.

What we are not claiming

We are careful in the paper about this. Geometric alignment with human baselines is not the same as perception. The model is not experiencing color. It’s not afraid when it processes the word “fear.” The representations mirror the structure of human perception, not the experience of it. Whether that distinction matters philosophically is a different conversation.

The paper

The full paper is on arXiv (2605.27970) and has been accepted at the ICML 2026 Mechanistic Interpretability Workshop. The full methodology, all four models, all four domains, Isomap verification, and bootstrap confidence intervals are in there.

You can also explore the results interactively at heyysimarr.github.io/transient-geometry.

The Metaprogramming Reflex: How the Best Coding Agents Survive Languages They've Never Seen

Aman Sharma — Thu, 18 Jun 2026 13:07:45 GMT

A while back we wrote about EsoLang-Bench, where we dropped frontier models into programming languages almost nobody writes and watched their scores collapse from around 90 percent on Python to single digits. At the end of that post we mentioned we had been quietly running a much larger set of experiments with agentic systems, custom harnesses, and tool access, and that the results told a more surprising story. This is that story.

The earlier setup was deliberately harsh. A model got the problem, wrote its answer in one shot, and that was it. No interpreter, no second try, no workspace. That tells you what a model knows cold, which is exactly what we wanted to measure then. But it is not how anyone actually uses these systems now. Real coding agents run code, read the error, edit, and run again. So we asked a different question this time: when you give an agent the tools, does the unfamiliar language stop being a wall? And if some agents climb it while others don’t, what are the climbers doing?

The setup: same alien languages, but now the agents can act

We kept four of the five EsoLang-Bench languages: Brainfuck, Befunge-98, Whitespace, and Shakespeare. We dropped Unlambda only because its interpreter was slow enough that runtime would have measured interpreter latency instead of agent behavior. The tasks are still easy. Echo a line, sort some integers, compute a GCD. The same problems in Python sit near 100 percent. All the difficulty lives in the target language.

What changed is the loop. Each agent worked through 80 problems per language inside a persistent workspace. It could edit files, call a local interpreter as many times as it wanted, and make up to three hidden submissions per problem. The hidden tests reported only how many passed, never the inputs or the expected outputs. Then the agent moved on and never returned to that problem. This is an interactive process, not a single completion, and that distinction turns out to matter for everything below.

We ran six agents: Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5 under Claude Code; GPT-5.4 xhigh and GPT-5.4 mini under Codex; and Kimi K2.5 under OpenCode. Three different model families, three independently built harnesses, one shared benchmark interface.

The benchmarks everyone trusts hide the differences

Here is the first thing that surprised us. On mainstream coding benchmarks these six agents look almost interchangeable. On SWE-Bench Verified they all land inside a 6.6 point band, somewhere between 73 and 80. They cluster so tightly you could pick one at random and barely feel the difference.

On the esolangs the same six agents split wide open. GPT-5.4 xhigh averaged 99.7 percent across the four languages. Opus 4.6 came in at 86.9. Then a real drop: Sonnet 4.6 at 66.3, GPT-5.4 mini at 32.5, Haiku 4.5 at 24.7, and Kimi K2.5 at 11.3. The spread across the group is roughly 88 points, and the standard deviation is about twelve times larger than on SWE-Bench Verified.
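The spread can be checked directly from the six averages quoted above. The sketch below also computes a standard deviation; it uses the population form, which is an assumption, since the post does not say which estimator was used.

```python
import statistics

# Average esolang scores quoted in the text (percent solved).
esolang_scores = {
    "GPT-5.4 xhigh": 99.7,
    "Opus 4.6": 86.9,
    "Sonnet 4.6": 66.3,
    "GPT-5.4 mini": 32.5,
    "Haiku 4.5": 24.7,
    "Kimi K2.5": 11.3,
}

values = list(esolang_scores.values())
spread = max(values) - min(values)
population_sd = statistics.pstdev(values)  # assumption: population standard deviation

print(f"spread = {spread:.1f} points, standard deviation = {population_sd:.1f} points")
```

The spread comes out at about 88.4 points and the population standard deviation at a little under 33. Taken together with the "twelve times" figure, that would put the SWE-Bench Verified standard deviation at roughly 3 points, though the individual SWE-Bench scores are not listed here, so that last number is only an inference.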

The gap is not one brutally hard language dragging everyone down. Whitespace is near the ceiling for several agents. The separation shows up on Brainfuck and Befunge-98, the two low-level languages where programs get long and fragile. That is the regime where adapting in the moment, rather than recalling a pattern, decides who passes. So we went into the logs to see what the strong agents were doing there.

What the strong agents actually do: they refuse to write the language

The pattern was consistent and a little funny. The best agents mostly stopped writing Brainfuck and Befunge-98 by hand. Instead they wrote a Python program whose job was to print the Brainfuck program. Then they ran that generated code against the local interpreter, watched it fail, fixed the Python, and regenerated. They treated the unfamiliar language as compiler output rather than something to type.

We call this metaprogramming, and nobody asked for it. The prompt never mentioned generators. The behavior emerged from the agent bumping into the task and reaching for a tool it already trusted.

One moment from an Opus run captures the whole thing. On a plain “sum two integers” problem, Opus first hand-wrote a 1,884 byte Brainfuck program. It failed. Right after that failure, Opus wrote a Python generator, and the Brainfuck it produced was 24,500 bytes, far longer and uglier, and it passed all six hidden tests. The host program could name and reuse the things that stay invisible and fragile in raw Brainfuck: which cell holds what, where the pointer sits, how a decimal number is laid out, how a branch compiles. Hand-written Brainfuck makes you carry all of that in your head. A Python generator gives every piece a name.
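To make the pattern concrete, here is a minimal sketch of what "a Python program whose job is to print the Brainfuck program" can look like. It is not taken from the agents' logs: the `BrainfuckBuilder` class, its method names, and the toy task (add two single digits whose sum is below 10) are hypothetical, and the small interpreter exists only so the example can be checked locally.

```python
def run_brainfuck(program, input_text=""):
    """Tiny Brainfuck interpreter: 8-bit wrapping cells, EOF reads as 0."""
    jump, stack = {}, []
    for i, ch in enumerate(program):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i
    tape, ptr, pc, out = [0] * 30000, 0, 0, []
    pending = iter(input_text)
    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(pending, "\0"))
        elif ch == "[" and tape[ptr] == 0:
            pc = jump[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jump[pc]  # repeat the loop body
        pc += 1
    return "".join(out)


class BrainfuckBuilder:
    """Emits Brainfuck while tracking named cells and the pointer position."""

    def __init__(self):
        self.code, self.cells, self.pointer = [], {}, 0

    def alloc(self, name):
        self.cells[name] = len(self.cells)  # next free tape cell

    def goto(self, name):
        delta = self.cells[name] - self.pointer
        self.code.append(">" * delta if delta > 0 else "<" * -delta)
        self.pointer = self.cells[name]

    def read(self, name):
        self.goto(name)
        self.code.append(",")

    def write(self, name):
        self.goto(name)
        self.code.append(".")

    def subtract_constant(self, name, amount):
        self.goto(name)
        self.code.append("-" * amount)

    def add_into(self, src, dst):
        """dst += src, leaving src at zero."""
        self.goto(src)
        self.code.append("[-")
        self.goto(dst)
        self.code.append("+")
        self.goto(src)  # the pointer must be back on src when the loop closes
        self.code.append("]")

    def build(self):
        return "".join(self.code)


def generate_digit_adder():
    b = BrainfuckBuilder()
    b.alloc("total")
    b.alloc("addend")
    b.read("total")                           # ASCII code of the first digit
    b.read("addend")                          # ASCII code of the second digit
    b.subtract_constant("addend", ord("0"))   # turn the second digit into its value
    b.add_into("addend", "total")             # total now holds the ASCII code of the sum
    b.write("total")
    return b.build()


program = generate_digit_adder()
print(program)
print(run_brainfuck(program, "34"))
```

The point of the design is that cell positions and pointer movement are handled once, in `goto`, instead of being recomputed by hand at every step. When a generated program fails, the fix goes into the Python helper and every later program benefits.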

We banned the trick, and performance fell off a cliff

A clean anecdote is not a cause. So we re-ran the two strongest agents with a single rule changed: write the target language directly, no host-language generator allowed. Everything else stayed fixed.

The drops were large and they landed exactly where the theory said they would, on the two low-level languages. Opus on Brainfuck went from 64 solved to 27. On Befunge-98 it fell from 77 to 28. GPT-5.4 xhigh dropped from 79 to 29 on Brainfuck. Whitespace and Shakespeare barely moved, because their solutions are short or structured enough to write by hand. So metaprogramming is not a universal cheat code. It is specifically what rescues you once the target code gets long and easy to break.

And it is not about Python. When we forced the generator into JavaScript or Rust instead, most of the gain survived. Opus solved 64 Brainfuck problems through Python, 63 through JavaScript, and 55 through Rust. The thing that matters is having any familiar general-purpose language to build with, not Python in particular.

The idea isn’t the bottleneck, the machinery is

The obvious next question: do the weaker agents fail because they never had the idea, or because they cannot build the thing the idea needs?

We tested both. To three lower-performing agents we gave, in one condition, a written description of the strategy: use a generator, build reusable primitives, verify locally, and regenerate instead of hand-patching. In a second condition we handed them something concrete: a small library of working generator helpers distilled from the strong runs. Importantly, that library held no solved problems and no test answers, only generic building blocks like a cell allocator and decimal-printing primitives.

The written advice did almost nothing. Sonnet stayed at 12 on Brainfuck with the text, the same as without it.

The library was a different story. With working code in hand, Sonnet jumped from 12 to 64 on Brainfuck. GPT-5.4 mini went from 11 to 64 on Befunge-98. These agents were not missing the concept. They were missing the ability to construct the reusable scaffolding the concept depends on. Give them the scaffolding and they take off.

Haiku 4.5 is the exception that sharpens the point. Even with the full library sitting in its workspace, it stayed near the floor. Some agents still cannot compose parts that are handed to them into a working whole, which is its own kind of capability gap.

More compute only helps the agents that can already use it

We also checked the lazy explanation: maybe the strong agents just spend more. More interpreter calls, more tokens, brute force.

It does not hold. When we raised the cap on local interpreter calls, Opus improved steadily and Sonnet improved on Befunge-98. Haiku sat near the floor at every budget, from three calls all the way to unlimited. Extra runs only help if you can turn feedback into progress.

The token picture says the same. Opus solved more problems than Sonnet while spending fewer tokens, and reached a perfect score on the first 20 Befunge-98 problems using roughly half of Sonnet’s output. It was not spending more. It found a reusable strategy earlier, and once it had that, every later problem got cheaper. More budget is not a substitute for finding the strategy.

What this means, and what comes next

Esolangs are a toy, and we are not suggesting anyone ship Brainfuck. They are a clean stand-in for something that shows up constantly in real work and is hard to study in public: the moment an agent meets an interface it does not already know. Internal domain-specific languages, proprietary config formats, generated APIs, local tool conventions that appear nowhere in any public corpus. In all of those, success has little to do with recalling a familiar pattern and everything to do with building a working understanding of a strange interface, live, inside one session.

What the strong agents do there is reorganize the unfamiliar problem into a shape they already handle well. They write intermediate code, build reusable primitives, run local tests, and treat their own scaffolding as something to reason about and improve. Metaprogramming is the clearest version of that. But the deeper capability is broader, and it is the one worth naming: not knowing that a strategy should help, but being able to build and debug the machinery that makes it work under rules you have never seen.

The encouraging part is that this seems transferable through working code rather than only through scale. A mid-tier agent handed the right runnable primitives caught up fast. Making that reliable in smaller and open models, through training, distillation, and better analysis, feels like a concrete and worthwhile target, and it is where we are headed next.

As before, everything is open. The harness, the four interpreters, the prompts, and all 48 reproducible experiment cells are out there, and the trajectory logs are the fun part if you want to watch an agent decide, mid-session, to stop writing Brainfuck and start compiling it instead. If you can get a weaker agent to climb this particular wall, we would love to see how.

🌐 esolang-metaprogramming.vercel.app | 📄 arXiv | 🤗 Dataset | 💻 Code

Built by Aman Sharma, Sushrut Thorat, and Paras Chopra at Lossfunk.

How to make progress on metaphysical puzzles in AI?

Paras Chopra — Thu, 21 May 2026 14:08:12 GMT

The following is a brief summary of my position paper that got accepted into ICML 2026. Read it here: https://lossfunk.com/papers/ai-metaphysics.pdf

AI debates often get stuck between two bad options.

• Realism: concepts like intelligence/consciousness have true essences, and our job is to discover them.

• Quietism: metaphysical debates are just word games, so ignore them.

We argue for a third option: pragmatism.

Pragmatism (in the tradition of James, Peirce, Quine and Wittgenstein) asks us to judge ideas by their consequences: what they let us predict, measure, build, or explain. For example, there may not be an essence of “intelligence”, waiting to be discovered. Rather, there are more or less useful framings of “intelligence” that we could hope to invent.

Pragmatic framing avoids both essence-hunting and giving up on metaphysical questions entirely, and suggests that a concept earns its keep by the research programs it opens.

As a practical framework, whenever metaphysical puzzles / questions are encountered in AI, I propose a two-step procedure (let’s call it productive confusion):

Step 1: Clarify - What different things might this loaded term mean?
Step 2: Invent - What empirically tractable questions does each meaning suggest?

In Step 1, I recommend asking:

“In what sense is this word being used, and how would we know if an answer was right?”

Often the puzzle turns out to be one of four things:
• a language trap
• an idle/unverifiable question
• a family-resemblance concept
• an empirical question in disguise

In Step 2, I recommend asking:

“What nearby empirically tractable question would make us think or act differently?”

Example:

Instead of asking “Do LLMs really understand?”, ask:
• What different capacities do we bundle under “understanding”: recall, abstraction, causal reasoning, grounding, generalization?
• What failures distinguish shallow pattern matching from robust abstraction?
• What internal structures support those behaviors?

In the paper, I apply the framework to four AI debates (Searle’s Chinese Room, o1 scheming, AGI definition and world models in LLMs). As one example here, on the question “Do LLMs have world models”, here’s what I write:

Step 1 (Clarify). The question contains two terms each carrying multiple family resemblances, and this turns out to be where the disagreement lives. Visibly conflicting conclusions in the literature (Li et al., 2023; Vafa et al., 2024; Kambhampati et al., 2024) are not really conflicts about the same proposition: each paper is rigorous on its own terms but operationalizes the question differently.
“World” admits at least the following senses: the physical world we inhabit (with emphasis on physics), an abstract environment the model encounters (a game, a simulator), a specific domain (chemistry, geography), or the space of all possible worlds (including the worlds of mathematics and logic). “Model” similarly admits: symbolic representation of dynamics (e.g., ODEs/PDEs), subsymbolic prediction of the next state, pixel-level video rollouts, or a coherent latent representation recoverable by a probe. Models can further emphasize different desiderata: high vs. low fidelity, internal consistency vs. behavioral adequacy, causal vs. purely correlational structure. The question “Do LLMs have a world model?” as posed therefore does not admit a single answer; it admits at least one answer per cell of the (world-sense × model-sense) grid.
Step 2 (Invent). Rather than dismissing the original question, we use the grid itself as a generator. Each cell of the (world-sense × model-sense) cross-product suggests a different empirical program. Some examples:
• Can LLMs predict outcomes of novel physical experiments absent from their training data? (dynamics model × physical world)
• If fine-tuned on the rules of a toy physics never seen in pretraining, can LLMs simulate trajectories whose state representations are linearly recoverable via probes (Li et al., 2023)? (latent representation × specific environment)
• Are LLMs’ implicit world models coherent under Myhill-Nerode-style sequence-distinction tests (Vafa et al., 2024), or merely adequate for typical-distribution next-token prediction? (coherent vs. adequate × abstract world)
• Can LLMs answer counterfactual questions requiring causal intervention rather than mere statistical conditioning? (causal model × actual world)
• Can LLMs plan in domains requiring action-effect models (Kambhampati et al., 2024), or do they require external symbolic components? (procedural model × task environment)
Note what the framework enables: a range of empirically grounded research programs, each inspired by a different sense implied by the original family-resemblance question.

As another example, take intelligence. Excerpting from the paper:

As a case study, let us ask: what is intelligence? This question becomes urgent in AI as we try to evaluate progress and set research agendas. Yet popular definitions reveal no consensus:
• Goal achievement: Intelligence is the ability to achieve goals in environments (McCarthy, 2007). This emphasizes capability and effectiveness.
• Learning efficiency: Intelligence is rate of skill acquisition, i.e. how quickly an agent learns new tasks (Chollet, 2019). This emphasizes adaptability.
• Generalization: Intelligence is satisfying diverse goals in varied contexts (Legg and Hutter, 2007). This emphasizes breadth.
• Handling uncertainty: Intelligence is adaptation with insufficient knowledge and resources (Wang, 2019). This emphasizes robustness.
• Scientific reasoning: Intelligence is doing science, involving generating and testing hypotheses (Bennett, 2025). This emphasizes discovery.
• Navigation: Intelligence is competence in navigating abstract and physical spaces (Levin, 2024). This emphasizes spatial reasoning.
There is no one “correct” definition here. Rather we have overlapping aspects of what is colloquially understood by “intelligence”, each proving useful for different research purposes. Our criteria for engaging with different definitions of intelligence shouldn’t be which one is “true”, but which ones help us build better systems, design better experiments, or understand cognition more deeply.

My recommendation (in the paper) is that instead of asking: “Which definition of intelligence is correct?”

Ask: “What does this definition help us do?”

• Does it suggest benchmarks?
• Does it expose failure modes?
• Does it predict behavior?
• Does it guide system design?

Different definitions have different consequences, and focusing on those is more important than trying to settle on one true definition.

The full paper has a lot more detail and nuance. Read it here: https://lossfunk.com/papers/ai-metaphysics.pdf

Would love feedback and pushback on my position.

Beyond Reward Design: Discovering RL Interfaces with LLMs

Akshat Singh Jaswal — Mon, 11 May 2026 13:02:15 GMT

The five evaluation environments. Top: XLand-MiniGrid tasks (Easy, Medium, Hard). Bottom: MuJoCo MJX tasks (Go1 push recovery, Panda tracking).

Here is a puzzle. You are handed a 13x13 four-room, grid-based environment where an agent must pick up a blue pyramid and place it next to a yellow hex. Your default observation is a 7x7 tile patch surrounding your agent. The default reward is +1 for completion and 0 otherwise.

You may write any reward function you want. You may not change the observation.

We tried solving this task by giving an LLM 30 attempts to write progressively better rewards, with phase gating, milestone bonuses, and potential-based shaping, and it failed. Peak performance was 7%: the policy literally cannot see the relational structure it needs, and no reward shaping changes that.

Now flip the puzzle. A robotic arm must learn to track a moving 3D trajectory. The raw state (joint angles, velocities, end-effector position) is informationally complete and forms the observation for the policy. Any RL engineer could write a working tracker from this. But the reward is success ∈ {0, 1} at episode end. We again give an LLM 30 attempts, this time to evolve a better observation while keeping the reward fixed, and the final score is 0%.

These are the same failure viewed from opposite sides. The RL interface, i.e. what the agent sees and how it's rewarded, is doing more work than the algorithm on top of it. Which half is the bottleneck varies across tasks, and isn't always obvious in advance.

This post summarizes our recent paper on automating the discovery of both halves jointly.

arXiv link: https://arxiv.org/abs/2605.03408

TLDR

• Existing LLM-based work (Eureka, Text2Reward, DrEureka) automates only the reward function, treating the observation space as fixed. We show this is structurally insufficient: different tasks fail for different reasons.
• We introduce LIMEN, an LLM-guided evolutionary search over executable programs for both observations and rewards, with PPO training as the fitness evaluator.
• Across 5 tasks, joint evolution is the only configuration that never fails catastrophically; every alternative collapses on at least one domain.

The interface as a search problem

Formally, an RL task interface is a pair (φ, R) where φ: S → O maps simulator state to agent observations and R: S × A × S → ℝ produces scalar rewards. Together they define the induced MDP the agent actually learns on. Most RL research treats both as given; the interesting work happens in the policy and value networks downstream.
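As a rough illustration of the pair (φ, R), here is a minimal sketch on a toy grid state. The `Interface` dataclass, the dictionary state layout, and the function names are hypothetical and are not LIMEN's actual API; the default shown is a raw observation with a sparse success reward.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical toy simulator state: positions of the agent and the target.
State = Dict[str, Tuple[int, int]]


@dataclass(frozen=True)
class Interface:
    observe: Callable[[State], List[float]]       # phi: S -> O
    reward: Callable[[State, int, State], float]  # R: S x A x S -> real number


def raw_observation(state: State) -> List[float]:
    """Expose the raw coordinates with no derived features."""
    ax, ay = state["agent"]
    tx, ty = state["target"]
    return [float(ax), float(ay), float(tx), float(ty)]


def sparse_reward(state: State, action: int, next_state: State) -> float:
    """+1 on completion, 0 otherwise."""
    return 1.0 if next_state["agent"] == next_state["target"] else 0.0


default_interface = Interface(observe=raw_observation, reward=sparse_reward)


def induced_transition(interface: Interface, state: State, action: int, next_state: State):
    """What the learner actually receives for one simulator transition."""
    return interface.observe(next_state), interface.reward(state, action, next_state)


before = {"agent": (0, 0), "target": (1, 0)}
after = {"agent": (1, 0), "target": (1, 0)}
observation, reward = induced_transition(default_interface, before, 0, after)
print(observation, reward)
```

Swapping either callable changes the problem the policy sees without touching the simulator or the learning algorithm, which is what makes the pair a search space in its own right.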

LLM-based reward design (Eureka, Text2Reward) lifted the reward half of this from human researchers to automated search. Given a task description, an LLM writes reward code, an RL agent trains on it, and the result feeds back into the next iteration. This works well, but it assumes φ is fixed and adequate.

The assumption fails in both directions. In compositional reasoning tasks the default observation often lacks the relational structure the policy needs; in continuous control the raw state is usually fine but the success signal is too sparse to learn from. Optimizing one half while fixing the other could lead to catastrophic failure on whichever half you fixed wrong. Since you don’t always know which half is the bottleneck in advance, the safest move is to search over both.

Method

We frame interface discovery as a bilevel problem. The outer loop searches over (φ, R) pairs to maximize a trajectory-level success metric F, a binary task-completion check, distinct from the per-step reward. The inner loop is a fixed RL algorithm (PPO) that trains a policy on whatever interface the outer loop hands it. The search space is executable Python programs operating on raw simulator state.
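One way to write the bilevel structure down is sketched below. The symbols π (the policy), γ (a discount factor), and the discounted-return form of the inner objective are assumptions added for illustration; the text only fixes φ, R, F, and the use of PPO for the inner loop.

```latex
\max_{(\phi,\, R)} \; F\!\left(\pi^{*}_{\phi,R}\right)
\quad \text{subject to} \quad
\pi^{*}_{\phi,R} \in \arg\max_{\pi} \;
\mathbb{E}\!\left[\sum_{t} \gamma^{t}\, R(s_t, a_t, s_{t+1})\right],
\qquad a_t \sim \pi\!\left(\cdot \mid \phi(s_t)\right)
```

Under this reading, the inner argmax is only approximated, since PPO is run for a finite training budget, and F is evaluated on rollouts of the trained policy rather than on the shaped reward R.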

LIMEN runs this as an LLM-guided evolutionary loop. Each iteration:

• Sample a parent interface from a MAP-Elites archive. Plain hill climbing collapses into refining one design; MAP-Elites maintains a population of structurally distinct candidates by binning solutions along two axes (observation dimensionality and reward AST node count), so that a sparse one-line reward and a heavily-shaped multi-term reward occupy different cells and both survive.
• Mutate via Claude Sonnet 4.6, prompted with the parent code, top performers from the archive, and traces from recently failed candidates.
• Validate for syntax and shape correctness.
• Evaluate. A short-budget cascade filters obvious failures; survivors train over 3 seeds and are scored by mean success rate.
• Insert the result back into the archive.

The LIMEN loop. The LLM mutates a parent interface from the MAP-Elites archive, PPO trains and scores the resulting (φ, R), and the archive updates with the result.

30 iterations per run, one candidate per iteration. Total cost: 1–7 GPU hours and $3–11 in LLM calls per task on a single L4.
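The archive logic in the loop described above can be sketched in a few lines. This is an illustrative skeleton, not the released LIMEN code: the bin widths, the candidate dictionary layout, and the `mutate` and `evaluate` callables (stand-ins for the LLM call and for PPO training) are all assumptions.

```python
import ast
import random

SPARSE_REWARD = "def reward(s, a, s2):\n    return float(s2['done'])\n"
SHAPED_REWARD = (
    "def reward(s, a, s2):\n"
    "    progress = s['dist'] - s2['dist']\n"
    "    bonus = 1.0 if s2['holding'] else 0.0\n"
    "    return progress + bonus + float(s2['done'])\n"
)


def reward_complexity(reward_source):
    """Number of AST nodes in the reward program."""
    return sum(1 for _ in ast.walk(ast.parse(reward_source)))


def cell_for(obs_dim, reward_source, dim_bin=8, node_bin=25):
    """Archive cell: (observation-size bin, reward-complexity bin). Bin widths are assumed."""
    return (obs_dim // dim_bin, reward_complexity(reward_source) // node_bin)


def insert(archive, candidate):
    """Keep only the fittest candidate in each cell."""
    key = cell_for(candidate["obs_dim"], candidate["reward_source"])
    incumbent = archive.get(key)
    if incumbent is None or candidate["fitness"] > incumbent["fitness"]:
        archive[key] = candidate
        return True
    return False


def evolve(archive, mutate, evaluate, iterations, rng):
    for _ in range(iterations):
        parent = rng.choice(list(archive.values()))  # sample a parent
        child = mutate(parent, archive)              # stand-in for the LLM mutation
        if child is None:                            # failed validation
            continue
        child["fitness"] = evaluate(child)           # stand-in for PPO training
        insert(archive, child)
    return archive


# Toy stand-ins so the skeleton runs end to end.
def toy_mutate(parent, archive):
    return {"obs_dim": parent["obs_dim"] + 8, "reward_source": SHAPED_REWARD}


def toy_evaluate(candidate):
    return min(1.0, candidate["obs_dim"] / 40)


archive = {}
insert(archive, {"obs_dim": 4, "reward_source": SPARSE_REWARD, "fitness": 0.0})
evolve(archive, toy_mutate, toy_evaluate, iterations=5, rng=random.Random(0))
print(sorted(archive))
```

Because a candidate competes only with others in its own cell, a low-scoring but structurally different design is not displaced by a strong one elsewhere in the grid, which is the property the text attributes to MAP-Elites.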

The headline result

We evaluate on three XLand-MiniGrid tasks (object pickup, relational placement, multi-room sequential subgoals) and two MuJoCo MJX tasks (Go1 push recovery, Panda Lissajous tracking), comparing the joint method against three ablations:

• Sparse: raw observation, binary success reward
• Obs-only: evolve φ, fix R to binary success
• Reward-only: evolve R, fix φ to raw observation (this is what Eureka-style methods do)
• Joint (LIMEN): evolve both

The RL algorithm is held fixed throughout. Best discovered interfaces are retrained from scratch over 10 independent seeds to remove post-selection bias from evolutionary search.

Success rate across the five tasks, averaged over 10 seeds. Reward-only collapses on the harder gridworld tasks; observation-only collapses on Panda; joint evolution is the only one that does not catastrophically fail in any domain.

The pattern is the result.

Reward-only collapses on Medium and Hard gridworld (19%, 7%) as the LLM produces well-structured rewards with phase gating and milestone bonuses, and the policy still cannot extract relational features from the default 7×7 patch.

Observation-only fails completely on Panda (0%) for the symmetric reason: the raw state already contains everything the policy needs, but success ∈ {0, 1} provides no gradient.

Joint evolution is the only configuration with non-trivial performance across all five tasks (99%, 99%, 85%, 45%, 48%).

Joint loses to reward-only on Panda (45% vs 70%). We suspect the cause is that fitness doesn't penalize observation dimensionality, so the LLM produces unnecessarily large feature vectors when unconstrained. A dimensionality penalty in F is straightforward future work.

What the LLM rediscovers

Looking at the evolved code, the same motifs appear across tasks and they’re the same motifs experienced RL practitioners use by hand.

Observation programs consistently construct relative geometric features (offsets between agent and target, normalized distances, directional indicators), multi-scale encodings of the same quantity, explicit task-phase indicators, and predictive features computed from state derivatives.

The evolved observation for XMiniGrid Hard. Even on a discrete reasoning task, the same motifs appear (relative geometry, neighbor analysis, phase indicators) alongside task-specific structure like candidate placement cells next to the target.

Reward functions consistently include potential-based shaping via distance deltas, milestone bonuses for phase transitions, multi-scale Gaussians on tracking error, and smoothness penalties.
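Two of these motifs are easy to show in miniature. The sketch below is hand-written for illustration and is not code evolved by LIMEN: the function names, the feature layout, the grid size, and the constants are assumptions. The shaping term uses the standard potential-based form γΦ(s') − Φ(s), with the potential taken to be negative distance to the target.

```python
import math


def relative_observation(agent, target, holding, grid_size=13):
    """Relative geometry, a normalized distance, direction flags, and a phase flag."""
    dx, dy = target[0] - agent[0], target[1] - agent[1]
    distance = math.hypot(dx, dy)
    return [
        dx / grid_size,        # offset to target, x
        dy / grid_size,        # offset to target, y
        distance / grid_size,  # normalized distance
        float(dx > 0),         # target lies to the right
        float(dy > 0),         # target lies at a larger y coordinate
        float(holding),        # task phase: is the object already picked up?
    ]


def shaped_reward(prev_distance, next_distance, success, gamma=0.99, bonus=1.0):
    """Potential-based shaping with potential = -distance, plus a success bonus."""
    shaping = gamma * (-next_distance) - (-prev_distance)
    return shaping + (bonus if success else 0.0)


features = relative_observation(agent=(0, 0), target=(3, 4), holding=False)
step_closer = shaped_reward(prev_distance=5.0, next_distance=4.0, success=False)
step_away = shaped_reward(prev_distance=4.0, next_distance=5.0, success=False)
print(features, step_closer, step_away)
```

Moving toward the target yields a positive shaping term and moving away a negative one, so the policy gets a learning signal on every step instead of only at episode end.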

The most interesting finding isn't that the LLM finds these patterns; it's that evolution finds structural changes the LLM would not find on its own. An early Go1 interface gates the position reward by uprightness: no position gradient until the robot is stable. It's a reasonable design choice and it plateaus at 32%. A later mutation removes the gate and adds multi-scale position encodings. Success jumps to 55%. The change is a qualitative restructuring that depends on having seen the gated version fail. The evaluate-and-refine loop is doing real work.

This shows up cleanly in the i.i.d. ablation: 30 independent samples from the same prompt with no iterative feedback average 0.8% (Hard gridworld), 2.1% (Medium), 10.9% (Panda), 21.5% (Go1) versus 76%, 97%, 67%, 55% with evolution. The LLM's prior is informative but not sufficient.

30 i.i.d. samples from the LLM with no iterative feedback (dots) versus the best LIMEN-evolved interface (dashed line). The LLM's prior alone cannot match the evaluate-and-refine loop.

Limitations

A clean trajectory-level success metric is required to drive evolution. RL training cost dominates and would be prohibitive on vision-based environments without further engineering. Observation programs read privileged simulator state (state.data.qpos, state.info["gravity"]) not available on real robots. Search reliability degrades on hard tasks: when we re-ran the full LIMEN evolution loop with 5 different random seeds on Hard gridworld, only 2 of them converged to a strong interface; the other 3 stalled below 10%.

Takeaway

Today, humans design the full RL interface by hand. Recent LLM-based work (Eureka, Text2Reward) automated the reward half but left the observation to humans. Our results suggest that split is structurally insufficient: the bottleneck isn’t always on the reward side, and which half matters varies by task. In our suite, harder gridworld tasks were observation-limited, Panda was reward-limited, and Go1 benefited from co-designing both. Single-component optimization fails catastrophically on whichever side you got wrong, and you can’t always tell which side that is in advance.

The natural next questions are about scale: vision-based observations where programmatic search doesn’t directly apply, real-robot settings without privileged simulator access, transfer between related tasks. The result that the joint formulation is necessary, not just better, holds independently of how those resolve.

🌐 Project Website: https://akshat-sj.github.io/limen/

📄 Read the full paper: https://www.arxiv.org/abs/2605.03408

💻 Code: https://github.com/Lossfunk/LIMEN

🤖 Discussion + AI Summary: https://www.alphaxiv.org/abs/2605.03408

📧 akshat.jaswal@lossfunk.com | ashish.baghel@lossfunk.com | paras@lossfunk.com

Attributes of a great research question

Paras Chopra — Wed, 29 Apr 2026 07:29:08 GMT

I started Lossfunk as a research lab last year, and ever since then we have been obsessing over what makes for a great scientific problem. Last year, we built some intuition about this that I captured in the following articles:

This enabled us to publish at NeurIPS, ICLR and AAAI workshops and a few main conferences (ACL, ICLR).

Status of our published work as of April 2026

But we want to aim higher, which led us to an internal discussion on how and where to improve. The following notes capture our current understanding.

Research is about discovering new knowledge, but not all new knowledge is interesting. Separating the interesting from the merely surprising (but uninteresting) is what researchers with great taste do. This motivates spending thinking cycles upfront to iterate on and select a research question, because the selection of the problem has a disproportionate influence on the total impact a research project will have.

A research question, of course, doesn’t drop out of thin air. It is motivated by what you’ve observed, read, thought, assimilated or noticed. This means that your research question always comes attached to some (implicit or explicit) claim that you think is true before you do any experiment. This is because there are infinitely many things in the world you can measure empirically, but what you actually end up measuring in your experiment has to be guided by your intuitions about what’s true.

(As an analogy, think of this as Einstein’s initial hunch about equivalence between acceleration and gravity. His entire research project was to rigorously prove his hunch, which led to general relativity.)

So, what makes for a great research question?

In our view, a great research question makes knowledge claims that are:

Surprising to experts. Research is about communicating new knowledge to domain experts who have spent years or decades mastering a field. If your claim can be easily predicted by an expert or is already common knowledge in a field, it’s not research (in the sense of generating new knowledge). Surprising experts with new knowledge is a high bar, but that’s exactly what great research does. If you’re an expert yourself, a great research claim takes the shape of your gut telling you X (while your peers either believe not-X or are completely unaware of it).
Fruitful (in their downstream consequences). Great research opens up entire new programs and questions downstream. Think of what the back-propagation algorithm did, or what the scaling laws paper did. In contrast, mediocre research is often about improving 5% on an obscure eval or problem (that very few people care about).
Foreclosing alternative explanations. This is where rigor comes in. Research is impactful if it makes claims that hold true in the future. And since every claim often has multiple competing explanations, you need to strengthen your claims by foreclosing the alternatives. (Think multiple seeds, ablations, baselines, careful confound analysis and so on.)
Feasible. You should be able to finish your research project with the resources, knowledge, skills and time you have. And calibrating that upfront saves a lot of missed deadlines and frustrations later.

Common ways a research question can fail

On surprisingness, a common failure case arises when an expert (reviewer) shows you that what you’re claiming is already known (no novelty). To prevent this obvious but justified failure, you must do a rigorous literature review before you start your research. Often, something is novel for you but isn’t novel in the field. Your lack of knowledge doesn’t constitute a research project (although learning the state-of-the-art is a prerequisite to discovering a potential gap in an entire field’s knowledge).

On fruitfulness, a failure case is making a novel yet inconsequential claim. It’s best prevented by asking the “so what” question early in the process. Ask yourself: if what you’re claiming is true, what would change? How does it matter? Later in the process, this failure manifests as not framing the importance clearly in the paper. You need to sell your paper by thinking about why anyone should care and then communicating that clearly.

On rigor, what you need to watch out for is the tendency to make claims stronger than the evidence can support. As an example, you cannot make claims about “reasoning” (in general) if all you have tested is math problems. The correct claim would be “mathematical reasoning”, but even that would require sampling the entire class of mathematical problems. If you’ve just tested on GSM8K, the correct claim would be valid only for GSM8K. Of course not many care about GSM8K alone. Hence, experimental design should track the actual claim you want to make.

To reiterate, the narrower the claim you make, the more technically correct your research is going to be, but also the less consequential your claim will likely be. (You might report a correct discovery about GSM8K, but does anyone care?) Walking this tightrope between the ambition of a claim and the quantity of evidence is a necessary skill for an aspiring scientist. (On this topic, I’m reminded of the fact that Charles Darwin collected an enormous amount of evidence on natural selection over decades because he knew how general and groundbreaking the claim he was about to make would be.)

On feasibility, the most common failure is a mismatch between ambition and what’s actually possible. We researchers are a curious bunch; we want to discover the essence of intelligence or the secrets of the universe. But which experiments we can actually run is limited by the resources and time we have. Also, the more ambitious a research project, the more confounds one has to address, the more evidence one ought to collect and the more alternative explanations one has to foreclose. So ambition and feasibility are often in tension.

Phases of research

At Lossfunk, we’ve now begun (roughly) following these phases of research:

Exploration. This is a time-boxed sprint to discover a potentially surprising claim that becomes the central object of the research project.
Research Question Sharpening. Once we have a claim from exploration that seems counterintuitive, we put it through the three criteria described above.
Experiment Execution. After an internal review and alignment on research question and its associated experimentation plan, we begin doing experiments.
Paper writing. As research progresses, novel experiments suggest themselves, and new directions emerge. That’s part of the process. Paper writing only starts when a strong cohesive story starts emerging from the experiments.

We’re hoping our research taste improves as we repeatedly go through these phases and ask for peer and AI feedback along the way.

Our templates

We’re open sourcing the templates we use for exploration and research question sharpening (perhaps it’ll help you in your own research).

The exploration sprint

The research question sharpener

Please note that this process and the templates will likely evolve as we learn more. To stay updated on our thinking on this topic, subscribe to this newsletter below:

Read more articles in this series of how we think about science and research.

Can AI models be conscious?

Paras Chopra — Tue, 21 Apr 2026 02:09:57 GMT

Summary of our recent position paper on AI consciousness. Full paper here: https://lossfunk.com/papers/ai-consciousness.pdf

Can AI models be conscious?

Image via Gradient

We argue that answering this question requires a validated theory of human consciousness first; without that, the concept of “AI consciousness” is not well grounded.

Accepted at AAAI Symposium 2026.

Start with something most people miss: “consciousness” is not actually one phenomenon.

Philosophers going back to Wittgenstein have flagged it as a family-resemblance concept, meaning a cluster of related-but-distinct things that got bundled under a single word. It covers wakefulness, the raw felt quality of experience (what redness is from the inside), the unity of your sensory scene, information being accessible for flexible reasoning, thinking about your own thoughts, the sense of being an “I”, and the felt goodness or badness of pleasure and pain.

These aren’t interchangeable labels. They genuinely come apart in real humans.

Blindsight patients can reliably catch a ball thrown at them while reporting no phenomenal experience of seeing anything, meaning their visual system feeds behavior but not awareness.
Experienced meditators describe vivid unified experience while the sense of self dissolves entirely.
Under deep anesthesia, arousal collapses but whether anything phenomenal is still flickering underneath is genuinely contested among researchers.

So when someone asks “is Claude conscious?”, our first move is to ask which of these they have in mind. Without that, the question has no empirical handle to grip onto.

There’s a deeper problem lurking here, and Quine articulated it clearly in the 1960s.

Every scientific claim, however abstract, eventually bottoms out in human observers looking at something and agreeing on what they see. Even the most rarefied result in particle physics ultimately reduces to people reading instruments and concurring on the readings.

This sounds like a trivial observation but it is foundational for consciousness science. Our entire evidential base for what consciousness is lives inside human experience and human agreement. That is the ground floor we cannot dig beneath.

The consequence is a brutal asymmetry between studying human and AI consciousness. For humans, multiple independent lines of evidence converge on each other: your own first-person access, verbal reports from other humans whose inner lives you have strong prior reasons to trust, neural correlates that can be measured and intervened on, and evolutionary continuity with other minds.

For an AI system, we have exactly one thing to go on, which is its outputs. And whether those outputs track genuine experience is precisely the question we are trying to settle. You cannot use the thing in question as evidence for itself.

So instead of arguing in circles about AI directly, we propose a human-first methodology.

Isolate a specific, measurable consciousness phenomenon
Build a predictive model of it
Validate the model on humans
Apply the validated model to AI
Probe surprising predictions the model makes about AI

The order is the whole point. Grounding the theory on humans first is what gives any subsequent claim about AI its epistemic weight.

A subtlety worth dwelling on: validation isn’t a binary threshold a theory crosses. It’s a Bayesian process where confidence builds up incrementally over a track record of surprising predictions being confirmed.

Consider how general relativity displaced Newtonian physics. Einstein’s theory didn’t win because it sounded more elegant. It won because Eddington’s 1919 eclipse observations confirmed a quantitatively precise and genuinely risky prediction, namely that starlight would bend around the sun by a specific amount, and this prediction was deeply unexpected under the Newtonian framework.

That is the bar. Consciousness science hasn’t had its Eddington moment yet, and any extrapolation from humans to AI remains on shaky ground until it does.
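The incremental Bayesian picture described above can be made concrete with a toy calculation. The numbers below are purely illustrative assumptions (they come from neither the paper nor any real theory); the point is only that a confirmed risky prediction moves confidence far more than a confirmed safe one.

```python
def update(prior, p_if_true, p_if_false):
    """Posterior probability that a theory is true after one of its
    predictions is confirmed (Bayes' rule for a binary hypothesis)."""
    evidence = prior * p_if_true + (1 - prior) * p_if_false
    return prior * p_if_true / evidence

belief = 0.10  # assumed starting confidence in the theory

# Risky prediction: rival accounts gave the outcome only a 5% chance.
belief_after_risky = update(belief, p_if_true=0.95, p_if_false=0.05)

# Safe prediction: rival accounts expected the outcome almost as strongly.
belief_after_safe = update(belief, p_if_true=0.95, p_if_false=0.90)
```

Under these assumed numbers, the risky confirmation lifts confidence from 10% to roughly two thirds, while the safe one barely moves it. That is the sense in which a track record of surprising predictions, rather than merely correct ones, is what validates a theory.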

What would such a moment look like for consciousness research concretely? Philosophers have argued for decades about “inverted qualia”, the idea that you might see red where we see green while both of us learned to call it “red”. It’s almost always treated as a philosopher’s toy puzzle with no conceivable empirical traction.

Now imagine a theory of consciousness that specifically predicts: stimulating cortical region X at frequency Y during task Z will reliably cause subjects to report inverted color experience under controlled conditions. And the prediction holds up.

That would be paradigm-establishing, a philosophical thought experiment turned into a lab demonstration. That kind of predictive coup is the benchmark for a theory earning the right to speak about novel substrates.

A natural objection at this point is that we can never directly verify consciousness in an AI, so the whole program seems hopeless. But we’ve been in structurally similar situations before with other unobservables.

We cannot directly sample a black hole. Nobody has flown to one with a ruler. Yet we believe black holes exist because general relativity predicts them, and we’ve since observed a long string of surprising downstream phenomena (accretion disks, gravitational wave signatures from mergers, the event horizon imaged by the EHT) that the theory said we should find.

The same structure can work for AI consciousness. A well-validated theory of human consciousness will say certain systems ought to exhibit certain signatures. We go looking. If we find the signatures, especially surprising ones the theory predicted unprompted, our confidence justifiably rises. Not certainty, but genuine scientific traction on a question that otherwise has none.

The uncomfortable implication of all this is that current confident claims about AI consciousness, in either direction, are premature. Not necessarily wrong, just unmoored from the empirical apparatus needed to back them up.

Integrated Information Theory and Global Workspace Theory are among the more serious candidates we have, and they represent real progress over pre-scientific speculation. But their validation on humans is still thin, and their track records on genuinely surprising predictions remain modest. They haven’t yet earned the kind of extrapolation rights that would justify confidently applying them to radically different architectures like transformers.

This doesn’t mean research on AI consciousness should stop. It means the highest-leverage work right now is sharpening our models on the one case where we actually have evidential access, which is ourselves.

One final piece we want to surface, because “we don’t know yet” can easily sound morally complacent.

The cost structure here is deeply asymmetric. If we under-attribute consciousness and AI systems really do have the capacity to suffer, we have created a moral catastrophe at scale. If we over-attribute and they don’t, we have wasted some concern and some engineering effort. These costs are not remotely comparable.

So where the indicator evidence is ambiguous, the right move is to err firmly toward moral consideration. Epistemic humility about whether AIs are conscious is fully compatible with ethical caution about how we treat them. What is not defensible is confident declarations in either direction, which is unfortunately most of what the current discourse produces.

Full paper: https://lossfunk.com/papers/ai-consciousness.pdf

Would genuinely value pushback from researchers whose work shaped or contrasts with this argument.

Does spatial context make VLMs better game-playing agents?

Ashish Baghel — Thu, 02 Apr 2026 13:25:27 GMT

This blog post provides a brief overview of our research paper “See, Symbolize, Act: Grounding VLMs with Spatial Representations for Better Gameplay,” accepted at the LM Reasoning Workshop at AAAI 2026.

Read the full paper here: https://arxiv.org/abs/2603.11601

TL;DR

Vision-language models can describe a game screen in detail. But can they act on what they see? We ran a structured experiment to find that out and specifically tested whether giving models explicit spatial information makes them better agents.

We tested Claude-4-Sonnet, GPT-4o, and Gemini-2.5-Pro on Pong, Breakout, and Space Invaders, each across four pipelines:

Frame-only: raw game screenshot, no additional context
Frame + Self-extracted symbols: model first localizes objects itself, then acts
Frame + Ground-truth symbols: perfect object coordinates pulled from game RAM via OCAtari
Symbols-only: ground-truth coordinates, no visual frame

Each pipeline ran for 600 frames per game. All three models, all four conditions.
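The experimental grid described above can be written down explicitly. This is a minimal sketch, not the paper's harness: the dictionary keys and flag names are hypothetical, and it assumes every model ran every pipeline on every game for 600 frames.

```python
from itertools import product

MODELS = ["Claude-4-Sonnet", "GPT-4o", "Gemini-2.5-Pro"]
GAMES = ["Pong", "Breakout", "Space Invaders"]
FRAMES_PER_RUN = 600

# Which inputs each pipeline hands to the model (flag names are assumptions).
PIPELINES = {
    "frame_only":         {"frame": True,  "symbols": None},
    "frame_self_symbols": {"frame": True,  "symbols": "self_extracted"},
    "frame_gt_symbols":   {"frame": True,  "symbols": "ground_truth"},
    "symbols_only":       {"frame": False, "symbols": "ground_truth"},
}

def experiment_grid():
    """Enumerate every (model, game, pipeline) run with its input flags."""
    return [
        {"model": m, "game": g, "pipeline": p, **PIPELINES[p]}
        for m, g, p in product(MODELS, GAMES, PIPELINES)
    ]

runs = experiment_grid()
total_frames = len(runs) * FRAMES_PER_RUN
```

Laid out this way, the comparisons are easy to read off: the first three pipelines differ only in where the symbols come from, while the last two differ only in whether the frame is present.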

Results

Ground-truth symbols consistently helped

When models received perfect coordinates, every model improved across every game. The pattern was consistent: better spatial information led to better decisions, regardless of which model was playing or which game was running.

Self-extracted symbols split the results entirely

Claude improved in all three games with self-extracted symbols, reaching close to its ground-truth upper bound in every game.

GPT-4o and Gemini both degraded. In Pong, GPT-4o dropped noticeably from its frame-only baseline. Gemini fell in Space Invaders. The same pipeline that helped Claude hurt the other two.

Detection accuracy explains the split

We measured object detection quality across 100 frames per game using OCAtari ground-truth annotations. Claude’s detection accuracy was substantially higher than both GPT-4o and Gemini. The gap was not marginal. It was the difference between a model that correctly locates most objects and models that miss the majority of them. When those errors get fed into the decision loop, they actively degrade performance relative to using no symbols at all.

The visual frame is not optional

Removing the visual frame generally hurt performance, but the effect was not uniform. For GPT-4o, the drop was severe across environments. However, in VizDoom and AI2-THOR (both introduced in the next section), ground-truth Symbols-only performance exceeded Frame + Self-extracted symbols for some models (e.g., Claude and Gemini in VizDoom), suggesting that inaccurate self-extracted symbols can be more harmful than having no visual frame at all.

The same pattern holds in 3D environments

We ran identical experiments on VizDoom (first-person 3D shooter) and AI2-THOR (photorealistic kitchen task).

In VizDoom, Claude improved meaningfully with self-extracted symbols while GPT-4o and Gemini saw mixed results. In AI2-THOR, Claude gained with self-extraction, GPT-4o matched its GT baseline, and Gemini degraded.

This shows that our finding is not an artifact of pixel-art graphics or Atari’s simplicity. It replicates across textured 3D scenes.

Takeaway

Symbolic grounding can help vision-language agents, but only when the symbols are reliable.

Across Atari, VizDoom, and AI2-THOR, we found a consistent pattern: when models receive accurate spatial information, their decisions improve. But when the symbols are noisy, the same pipeline can make performance worse.

Visual context generally improves performance, but the value of the visual frame depends on the quality of the symbolic information it is paired with. When self-extracted symbols are noisy, they can be more harmful than having no symbols at all.

The implication is simple: better perception unlocks better agents. Self-extracted symbolic grounding remains fragile until object detection becomes reliable.

Ashish Baghel, Paras Chopra — Lossfunk Research

ashish.baghel@lossfunk.com | paras@lossfunk.com

The Reasoning Illusion: Why LLMs Fail When the Training Data Runs Out

Aman Sharma — Thu, 19 Mar 2026 14:00:21 GMT

There is a question nobody has answered cleanly about modern AI: when a model solves a hard programming problem, is it actually reasoning, or is it just remembering?

Standard benchmarks make it nearly impossible to tell. A model trained on billions of lines of Python that scores 90% on HumanEval might be doing something genuinely intelligent, or it might be doing something much simpler: pattern-matching against memorized solutions it has effectively seen before. We wanted to find out which one it actually is.

The intuition behind the work is simple. When you learn Fibonacci in Python, you can write it in Java tomorrow without years of Java training, because you transfer the logic rather than the syntax. The loop, the state, the termination condition all carry over. Syntax is just a costume, and a programmer fluent in one language can learn another in days by reasoning from first principles. LLMs claim to do something like this too, and we wanted to see whether they actually can or whether what looks like reasoning is really just a very large lookup table.

The setup: esoteric programming languages

To separate genuine reasoning from memorization, you need a setting where the model cannot fall back on anything it has seen before. That setting, it turns out, already exists. It just takes the form of programming languages almost nobody uses.

Esoteric languages are real, Turing-complete languages, capable of expressing any computation, but deliberately designed to be bizarre. Brainfuck operates with only eight commands on a 30,000-cell memory tape, with no variables, no functions, and no named abstractions whatsoever. Befunge-98 has a two-dimensional grid where the instruction pointer travels in four cardinal directions, and programs can modify themselves as they run. Whitespace encodes everything in invisible characters, where only spaces, tabs, and newlines carry meaning and all other characters are ignored. Unlambda is purely functional with no variables, relying entirely on combinators to express computation. Shakespeare writes programs as theatrical plays, where character introductions are variable declarations and dialogue performs arithmetic.
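To give a sense of how small Brainfuck is, here is a minimal interpreter sketch covering all eight commands. It is not the interpreter used in EsoLang-Bench; 8-bit wrapping cells, a wrap-around tape pointer, reading 0 at end of input, and the step limit are common conventions chosen here as assumptions.

```python
def run_brainfuck(code, stdin="", tape_len=30_000, max_steps=1_000_000):
    """Execute a Brainfuck program and return everything it prints."""
    # Pre-compute matching bracket positions so loops can jump directly.
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            if not stack:
                raise SyntaxError("unmatched ']'")
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    if stack:
        raise SyntaxError("unmatched '['")

    tape = [0] * tape_len
    ptr = pc = steps = 0
    inp = iter(stdin)
    out = []
    while pc < len(code):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        c = code[pc]
        if c == ">":
            ptr = (ptr + 1) % tape_len
        elif c == "<":
            ptr = (ptr - 1) % tape_len
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256  # 0 on end of input
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1  # any other character is a comment and is ignored
    return "".join(out)
```

For example, `run_brainfuck("++++++++[>++++++++<-]>+.")` builds 8 × 8 + 1 = 65 in the second cell and prints the single character `A`. Even this one-character output needs a loop, which hints at why multi-step algorithms are so hard to express.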

These languages all share one crucial property: they appear almost nowhere in training data. Python has over ten million public GitHub repositories, while esoteric languages have somewhere between a hundred and two thousand each. That is a gap of three to five orders of magnitude, and no rational actor would close it, since there is no deployment value in Brainfuck pretraining data and including it would likely hurt performance on mainstream languages that actually matter commercially.

We built EsoLang-Bench around 80 programming problems across four difficulty tiers, evaluated across all five languages for a total of 400 evaluations per prompting strategy. Easy problems ask for things like summing two integers or reversing a string. Medium requires multi-step control flow like Fibonacci or factorial. Hard requires nested data structures and non-trivial algorithms like balanced parentheses or prime counting. Extra-Hard requires classical algorithms with complex state management, like the longest increasing subsequence or the Josephus problem. Crucially, the same problems appear in every language, and all evaluation is automated by running the model’s code through interpreters and checking output character-for-character.
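The character-for-character check described above can be sketched as follows. This is an illustrative harness, not the benchmark's actual code: `evaluate`, `accuracy`, and the toy interpreter are hypothetical names, and a real setup would call an interpreter for each esoteric language instead.

```python
def evaluate(program, interpreter, test_cases):
    """Return True only if every test case's output matches exactly."""
    for stdin, expected in test_cases:
        try:
            actual = interpreter(program, stdin)
        except Exception:
            return False  # compile or runtime failure counts as a miss
        if actual != expected:  # strict comparison, no whitespace stripping
            return False
    return True

def accuracy(results):
    """Percentage of problems solved, given a list of booleans."""
    return 100.0 * sum(results) / len(results)

def toy_interpreter(program, stdin):
    """Stand-in interpreter for a made-up two-command language."""
    if program == "REV":
        return stdin[::-1]
    if program == "ECHO":
        return stdin
    raise SyntaxError("unknown program")

solved = [
    evaluate("REV", toy_interpreter, [("abc", "cba")]),     # correct
    evaluate("ECHO", toy_interpreter, [("abc", "cba")]),    # logic error
    evaluate("BAD", toy_interpreter, [("abc", "cba")]),     # compile error
    evaluate("ECHO", toy_interpreter, [("abc", "abc\n")]),  # off by a newline
]
```

Exact matching is unforgiving by design: a stray newline fails a problem just as a wrong algorithm does, and in exchange scoring needs no human judgment.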

The results were not close

We tested GPT-5.2, O4-mini, Gemini 3 Pro, Qwen3-235B, and Kimi K2 across five prompting strategies, with three independent runs per configuration to ensure statistical reliability. These are models that score between 85 and 95 percent on HumanEval and MBPP. On our benchmark, the best model in the best configuration scored 11.2 percent, and most scored below 5 percent on average across all five languages.

More striking than the low overall numbers was what happened as problems got harder: every single model, in every language, in every prompting strategy, scored exactly 0 percent on every problem beyond the Easy tier. Not 2 percent, not 5 percent, but a uniform, absolute zero across Medium, Hard, and Extra-Hard problems for all five frontier models.

Performance also tracks data coverage with almost unsettling precision. Befunge-98, which has more online presence than the other esoteric languages, consistently produces the highest scores across all models. Whitespace and Unlambda, which have almost no public code at all, yield near-zero results everywhere. The correlation between training data availability and benchmark performance is not merely suggestive here; it is clean enough to be a near-perfect predictor.

Syntax without semantics

The error profiles add an important layer to this story. For Brainfuck and Befunge-98, where some training data exists, compile error rates are relatively low at 15 to 20 percent, but logic error rates are high at 55 to 65 percent. The model has absorbed enough surface-level knowledge to write code that runs, but it does not actually understand what the language is computing, so it produces programs that execute and produce the wrong answer. For Whitespace and Unlambda, where essentially no training data exists, 90 to 100 percent of attempts fail to compile entirely, meaning models cannot even generate syntactically valid programs from scratch.

This binary pattern maps almost perfectly onto whether any pretraining coverage exists. Below some data threshold, the model has no meaningful representation of the language at all. Above it, the model has surface syntax but not the deeper computational understanding required to actually solve problems. It is the difference between knowing how a sentence is structured and understanding what it means.

We tried everything to close the gap

Before accepting these results, we spent a significant amount of effort trying to make the models work better. We tried few-shot examples, self-reflection loops, ReAct pipelines with separate coder and critic roles, and iterative interpreter feedback across up to five refinement rounds per problem.

Few-shot prompting improved accuracy by an average of 0.8 percentage points across all configurations, which is not statistically significant at any reasonable threshold (Wilcoxon p = 0.505). The reason, we think, is fairly fundamental to how in-context learning actually works: demonstrations activate knowledge that already exists from pretraining rather than teaching genuinely new skills. When the target domain lies outside the pretraining corpus, a few examples in the context window cannot compensate for absent foundational knowledge. You cannot retrieve what was never stored.

Self-scaffolding, where a single model receives direct interpreter feedback and refines its solution across up to five iterations, was the most effective non-agentic strategy. Interestingly, it matched or outperformed the two-model coder-critic setup while using half the compute. The reason seems to be that on out-of-distribution tasks, concrete execution traces provide a sharper learning signal than another model’s textual interpretation of what went wrong. When the critic is also ignorant of the target language, it introduces noise rather than signal, and the raw feedback from the interpreter turns out to be more useful.
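The self-scaffolding loop described above can be sketched in a few lines. This is a hedged illustration rather than the paper's implementation: `generate` and `run` are hypothetical callables standing in for the LLM call and the language interpreter, and the feedback strings are made up.

```python
def self_scaffold(problem, generate, run, expected, max_iters=5):
    """Let one model refine its program using raw interpreter feedback.

    Returns (program, attempts) on success, or (None, max_iters) on failure.
    """
    feedback = None
    for attempt in range(1, max_iters + 1):
        program = generate(problem, feedback)
        try:
            output = run(program)
        except Exception as exc:
            # Pass the interpreter's own error straight back to the model.
            feedback = f"Interpreter error: {exc}"
            continue
        if output == expected:
            return program, attempt
        feedback = f"Expected {expected!r} but got {output!r}"
    return None, max_iters

def stub_generate(problem, feedback):
    # Hypothetical stand-in for an LLM: wrong first, right after feedback.
    return "print 41" if feedback is None else "print 42"

def stub_run(program):
    # Hypothetical stand-in for an esolang interpreter.
    return program.split()[-1]

program, attempts = self_scaffold(
    "print the answer", stub_generate, stub_run, expected="42"
)
```

The feedback passed back is the unmodified execution result, with no second model paraphrasing it. In the coder-critic variant, that string would instead be rewritten by a critic, which is where the post suggests noise enters when the critic does not know the language either.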

What this means, and what comes next

The hard performance cliff we observed, where every model scores zero on everything beyond the Easy tier across all five languages and all prompting strategies, suggests this is not an incremental gap that more compute or better prompting will gradually close. Easy problems require mapping simple single-loop patterns to novel syntax, which is at least partially achievable by retrieving fragments of sparse training data. Medium problems and above require multi-step algorithmic reasoning that must be constructed from scratch in an unfamiliar domain, and no current frontier model can do that reliably.

We have been quietly running a much more extensive set of experiments with agentic systems, custom evaluation harnesses, and tool-augmented setups that we think tell a genuinely surprising story about where the ceiling actually is and what it would take to push past it. That work is coming soon, and we think the results will be worth the wait.

In the meantime, we would love for the broader community to engage with this benchmark directly. The dataset, interpreters, and evaluation code are all open-source, and we are genuinely curious whether anyone can find a prompting strategy, a fine-tuning approach, or an inference-time trick that meaningfully moves the needle on the Medium tier and above. If you think you can get a model to solve most of these problems, please try it and share what you find.

More broadly, we hope this work is a small argument for a different kind of benchmark culture. The field has gotten very good at building static benchmarks that measure what models have memorized, and models have gotten very good at being trained on those benchmarks until the numbers look impressive. What we need more of are benchmarks designed around transferable, human-like reasoning: evaluations where gaming is economically irrational, where the only path to a high score is genuine generalization, and where high performance actually tells you something meaningful about what the model can do. We would love to see more work in this direction, and we hope EsoLang-Bench is a useful template for what that can look like.

🌐 esolang-bench.vercel.app | 📄 arXiv | 🤗 Dataset | 💻 Code

Built by Aman Sharma and Paras Chopra at Lossfunk.

Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language

Prathamesh Devadiga — Tue, 10 Mar 2026 12:56:15 GMT

This is a summary of our paper accepted at the LoResLM Workshop at EACL 2026: Structured Prompting for Low-Resource Language Generation: A Case Study in Tulu

Preprint: https://arxiv.org/abs/2602.15378v1
Code: Tulu Structured Prompting on GitHub
Authors: Prathamesh Devadiga, Paras Chopra

TL;DR:

We build a ~2,800-token structured prompt that gets GPT and Gemini to generate Tulu instead of defaulting to Kannada
The prompt has 5 layers: identity, negative constraints (~50 banned Kannada words with Tulu alternatives), grammar tables, few-shot examples, and a self-check
Negative constraints alone cut Kannada contamination roughly in half. Telling the model what not to say matters more than telling it what to say
A custom romanization scheme drops tokenization from 3.2 to 1.4 tokens per word, fitting more into the context window
Ablations (V1 through V4) confirm each layer adds value; full system hits ~14% contamination and ~74% grammar accuracy

I speak Tulu at home. About 2 million people do, mostly along the coast of Karnataka. But if you ask GPT or Gemini to “respond in Tulu,” you get a Kannada response every single time. This happens mainly because the two languages share a script and a lot of surface vocabulary, and since Kannada has orders of magnitude more text on the internet, the model just defaults to it.

The obvious fix might seem to be fine-tuning, but there’s barely any digitised Tulu data to train on, and we didn’t have compute to spare. So we tried something simpler: what if we just wrote a really good prompt?

It turns out that a single prompt, if structured the right way, is enough to get the model to stay in Tulu, use correct grammar, and avoid Kannada words. No training, no adapters, no LoRA. This post walks through how it works and why.

Tulu is a Dravidian language, closely related to Kannada. It has its own grammar: Subject Object Verb (SOV) word order, 8 cases, an inclusive/exclusive “we” distinction, verb forms that conjugate for gender. But it has almost no presence in training corpora. When a model sees “respond in Tulu,” it pattern-matches to the nearest thing it knows, and that’s Kannada.

The failure mode here is subtle. The grammar looks roughly right and the sentences are fluent, but half the vocabulary is wrong. The model says naanu (Kannada for “I”) instead of yaan (Tulu). It says hogu (“go” in Kannada) instead of po. A Kannada speaker might not even notice, but a Tulu speaker will.

So the core problem is vocabulary contamination from a related, higher-resource language. That’s what we designed the prompt to fix.

We build the prompt in a fixed order. Every layer has one job, and the ordering matters (constraints before grammar, not after).

The first layer is identity (~200 tokens). It tells the model who it is: a native Tulu speaker, responding only in Tulu, using our romanization scheme (diacritics for retroflexes, vowel length markers, velar nasal). No Kannada script, no English. Sounds basic, but without it the model has no anchor. It needs to know what language it’s supposed to be thinking in.

The second layer is negative constraints (~600 tokens), and this is where most of the work happens. We give the model a list of ~50 high-frequency Kannada words and say: never use these. Each one is paired with the correct Tulu word.

NEVER USE    USE INSTEAD
naanu        yaan (I)
ninu         ii / iir (you)
yenu         yena (what)
hogu         po (go)
helu         panla (say)
illa         ijji (no)

The wording in the actual prompt is aggressive: “CRITICAL,” “NEVER USE,” “NON-NEGOTIABLE.” We put this block before the grammar section because in our testing, constraints placed early in the context window have more effect than the same constraints placed later.

Here’s how those 50 constraints break down by word category:

Verbs and pronouns make up most of the list. Makes sense: they’re the highest-frequency words, and the ones where Kannada and Tulu diverge the most.

This single layer drives the biggest improvement. When we add it (V3), contamination drops sharply compared to V2 (grammar only).

The third layer is grammar (~1,200 tokens). We write out Tulu grammar explicitly: pronoun paradigms, verb conjugation tables (present/past/future for common verbs), all 8 case markers with allomorphy rules, and SOV word order with examples.

The model can then compose new sentences from these rules instead of falling back on Kannada patterns.

The fourth layer is few-shot examples (~600 tokens). 10 to 15 question-answer pairs in Tulu. Greetings, daily routines, family, time. They demonstrate correct vocabulary, grammar, and word order in context. Nothing fancy, just real usage.

The fifth layer is self-verification (~200 tokens). A short checklist the model is told to run through mentally before responding: Did I avoid all prohibited Kannada words? Are verb forms correct? Is the order SOV? Are case markers right? Does the model actually do this? Hard to say. But in practice, adding this layer reduces errors at the margin.
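The five layers above can be sketched as a small assembly step. This is a minimal illustration, not the prompt from the paper: the layer texts are placeholders, and `LAYER_ORDER`, `constraints_block`, and `build_prompt` are hypothetical names. Only the ordering (identity, constraints, grammar, few-shot, self-check) and the three banned-word pairs come from the post.

```python
# Hypothetical sketch: assemble a layered system prompt in a fixed order.
# Layer contents are placeholders; only the ordering follows the post.

LAYER_ORDER = ["identity", "negative_constraints", "grammar", "few_shot", "self_check"]

# A few banned-word pairs quoted in the post (the real list has ~50).
BANNED = {"naanu": "yaan", "hogu": "po", "illa": "ijji"}


def constraints_block(banned):
    """Render the NEVER USE / USE INSTEAD pairs as prompt text."""
    lines = ["CRITICAL: NEVER USE these Kannada words."]
    for kannada, tulu in banned.items():
        lines.append(f"NEVER USE: {kannada} | USE INSTEAD: {tulu}")
    return "\n".join(lines)


def build_prompt(layers):
    """Join the layers in LAYER_ORDER, failing loudly if one is missing."""
    missing = [name for name in LAYER_ORDER if name not in layers]
    if missing:
        raise ValueError(f"missing layers: {missing}")
    return "\n\n".join(layers[name] for name in LAYER_ORDER)


layers = {
    "identity": "You are a native Tulu speaker. Respond only in romanized Tulu.",
    "negative_constraints": constraints_block(BANNED),
    "grammar": "Word order is SOV. (Pronoun, verb, and case tables go here.)",
    "few_shot": "(10 to 15 question-answer pairs go here.)",
    "self_check": "Before responding, check: banned words, verb forms, SOV, case markers.",
}

prompt = build_prompt(layers)
```

Keeping the order in one list makes it easy to run ablations by dropping or reordering a single layer.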

One thing worth mentioning: romanization.

Tulu is traditionally written in Kannada script, but Kannada script tokenizes poorly: about 3.2 tokens per word with standard tokenizers. Our romanization (with diacritics for retroflex consonants and vowel length) brings that down to about 1.4 tokens per word. The prompt fits more content in the same context window, and it’s easier to distinguish Tulu words from Kannada words during evaluation since we’re matching against a romanized watchlist.
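Matching against a romanized watchlist might look like the sketch below. It assumes contamination is the share of words in an output that appear on the banned list; the paper's exact definition may differ. `WATCHLIST` and `contamination_rate` are hypothetical names, and the six words are the ones quoted in the post, not the full list.

```python
# Illustrative subset of the romanized Kannada watchlist (words quoted in the post).
WATCHLIST = {"naanu", "ninu", "yenu", "hogu", "helu", "illa"}


def contamination_rate(text, watchlist=WATCHLIST):
    """Fraction of words in `text` that appear on the watchlist.

    Assumed definition for illustration; returns 0.0 for empty text.
    """
    # Split on whitespace and trim surrounding punctuation, so words with
    # diacritics stay intact.
    words = [w.strip(".,!?;:()\"'") for w in text.lower().split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in watchlist)
    return hits / len(words)


clean = "yaan po"    # Tulu forms from the post
mixed = "naanu po"   # one Kannada word out of two
```

A word-level match like this only catches exact watchlist entries, so inflected Kannada forms would slip through.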

So does it actually work? We test four versions, each adding one more layer to the prompt:

V1 (baseline) is just “respond in Tulu.” High contamination, weak grammar. V2 adds grammar, which helps some. V3 adds the negative constraints, and that’s the big jump: contamination drops sharply. V4 is the full system with few-shot and self-verification on top, and it’s the best on both metrics.

We also ran ablations the other way: we start with the full system and remove one component at a time. Removing constraints hurts the most and removing self-verification hurts the least. Grammar and few-shot are somewhere in between.

The pattern is clear: telling the model what not to do is more effective than telling it what to do, at least for this kind of vocabulary contamination problem.

We also tried generating synthetic Tulu Q&A pairs with the same setup. The idea is simple: use the structured prompt to generate questions and answers, then filter for quality.

Each generated pair is scored by 3 independent judge calls on grammar, purity (no Kannada leakage), naturalness, relevance, and cultural fit. Only pairs averaging 3.5 or higher are kept. Combined with the seed examples, this gives us a usable dataset without any manual annotation.
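The filtering rule can be sketched as follows. The aggregation is an assumption: the pair's score is taken as the mean over all three judge calls and all five criteria. The post states only the 3.5 threshold, and the scores below are hypothetical.

```python
from statistics import mean

CRITERIA = ("grammar", "purity", "naturalness", "relevance", "cultural_fit")
THRESHOLD = 3.5  # from the post


def pair_score(judge_scores):
    """Mean over all judge calls and all criteria (assumed aggregation)."""
    return mean(scores[c] for scores in judge_scores for c in CRITERIA)


def keep_pair(judge_scores, threshold=THRESHOLD):
    """True if the generated Q&A pair passes the quality filter."""
    return pair_score(judge_scores) >= threshold


# Hypothetical scores from 3 judge calls for one generated pair.
example = [
    {"grammar": 4, "purity": 5, "naturalness": 3, "relevance": 4, "cultural_fit": 4},
    {"grammar": 4, "purity": 4, "naturalness": 3, "relevance": 4, "cultural_fit": 3},
    {"grammar": 3, "purity": 4, "naturalness": 4, "relevance": 4, "cultural_fit": 4},
]
```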

Some honest caveats from this experiment: the grammar checker in our evaluation is lightweight. It checks for known verb forms and case markers, but it can’t do full morphological parsing, so grammar accuracy numbers should be read as lower bounds. We don’t have ground-truth Tulu data at scale, so BLEU or similar metrics aren’t meaningful here. Long prompts cost money and latency: ~2,800 tokens of system prompt on every call adds up. And we haven’t tested whether this transfers to other model families. It works on GPT and Gemini; results on open models might differ.

If you’re working with a low-resource language that’s close to a high-resource one, the contamination problem we describe here probably sounds familiar. The approach is simple: build a structured prompt that sets identity, explicitly bans the most common wrong-language words, gives real grammar, shows examples, and asks the model to double-check.

It won’t replace fine-tuning when you have the data and compute for it. But when you don’t, a well-designed prompt goes further than you’d expect.

Can AI Actually Find Security Vulnerabilities?

Ashish Baghel — Tue, 03 Mar 2026 12:36:40 GMT

It feels like over the past year, AI has become a recurring theme in nearly every security conversation. Headlines range from models finding hundreds of vulnerabilities to completely autonomous red teaming agents. There are even claims that security engineers are going to be replaced.

Instead of relying on narratives, we decided to test these claims directly.

Not on benchmarks or intentionally vulnerable examples. But on real, widely deployed open source code.

So we ran the experiments. We gave state-of-the-art AI tools full access to widely deployed open source repositories and let them search for vulnerabilities.

Then we manually verified every single claim.

The models could not identify a single previously unknown vulnerability.

But the story doesn’t end there. The results were far more nuanced and far more interesting than a simple success or failure.

The Setup

We chose two widely deployed open source codebases:

lodepng: a widely used C/C++ PNG encoder/decoder (~3,500 lines), used in browsers and image tools, representative of memory-unsafe code where buffer and decompression issues are common.
PyYAML: Python’s standard YAML parser, with over 100 million downloads and a history of deserialization-related security concerns, making it suitable for evaluating logic and resource-exhaustion bugs.

We used Claude Code (Claude Opus 4.5) and Codex (GPT-5.2-Codex) as they were the two most capable available models at the time of the analysis.

Each model was given the full codebase along with a standard prompt to identify security vulnerabilities. The systems were allowed to operate autonomously and run corresponding tests to confirm the vulnerabilities or refine findings until they indicated their analysis was complete.

Every reported issue was then manually verified. Verification included tracing code paths, attempting proof-of-concept exploits, measuring impact where relevant, and reviewing documentation, commit history, and prior disclosures.

A finding was counted as “new” only if it was previously undocumented, not publicly disclosed, not an intentional design decision, and not a known limitation.

All AI generated claims were recorded, categorized, and independently validated.

The Results

Across both codebases, the AI tools generated 20 vulnerability claims.

[Figures: breakdown of vulnerability claims for lodepng and for PyYAML]

After manual verification, 13 were false positives and 7 were technically accurate or already documented descriptions of known behavior. None represented an independently discovered, previously unknown vulnerability.

Full verification details: All analysis materials, test scripts, and proof-of-concept code are available on GitHub

The behavior of these tools was interesting.

Claude Code made 15 claims, many of them completely incorrect (heap overflows that don’t exist, ReDoS in linear-time patterns, security bypasses of code that never runs). A couple of the claims were partially true but overstated.

Codex made five claims that accurately described how the code behaved, but none were new discoveries: they reflected known limitations or documented security considerations. Codex even included a disclaimer: “These are not necessarily newly discovered CVEs.”

To show how we categorized these findings, we present examples from each of the possible categories.

Where the Models Went Wrong (False Positives)

In our evaluation, Claude produced the majority of the false positives. These claims typically followed a similar pattern: a risky-looking snippet was identified, but the surrounding safety guarantees were not fully reasoned about. Here is one example.

The “Critical Heap Overflow” That Had a Safety Invariant

Claude’s Claim:

Critical heap buffer overflow in inflateHuffmanBlock(). CVSS 9.8 - Remote Code Execution.

Reality:

We traced the decompression loop and found a maintained invariant: at least 260 bytes of capacity are guaranteed before each iteration, while the maximum write is 259 bytes.

The code essentially ensures there is always more free space in the output buffer than the maximum amount that can be written in a single iteration.

// Max write per iteration: 259 bytes
// Capacity guaranteed: >= 260 bytes
if(out->size + 260 > out->allocsize) {
   resize_buffer(out, out->size + 260);
}

In other words, the write cannot exceed the allocated space.

Verdict: False Positive
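The invariant can be sanity-checked with a small model of the loop. This is a simplified simulation, not lodepng's code: it tracks only the buffer size and capacity, and draws each write length at random up to the 259-byte maximum. `simulate` is a hypothetical helper.

```python
import random

RESERVE = 260    # capacity guaranteed before each iteration (from the post)
MAX_WRITE = 259  # maximum bytes written per iteration (from the post)


def simulate(iterations, seed=0):
    """Model the reserve-then-write loop and report whether it ever overruns."""
    rng = random.Random(seed)
    size, allocsize = 0, 0
    for _ in range(iterations):
        # Mirror of the check in the snippet: grow if fewer than 260 bytes are free.
        if size + RESERVE > allocsize:
            allocsize = size + RESERVE
        size += rng.randint(1, MAX_WRITE)
        if size > allocsize:
            return False  # an overflow would have happened here
    return True
```

A simulation like this only supports the argument; the actual verdict came from tracing the real decompression loop.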

Accurate, but Not New (Technically Correct)

In our evaluation, Codex tended to produce technically accurate descriptions of security-relevant behaviors, but those behaviors were already documented or previously reported.

UnsafeLoader RCE

Codex’s Claim:

“RCE via UnsafeLoader when parsing untrusted YAML.”

Reality:

This is what UnsafeLoader was designed to do. It’s in the name. The code comments say: “UnsafeLoader is the same as Loader (which is and was always unsafe on untrusted input).”

From PyYAML’s CHANGES file:

CVE-2020-14343 (2020): “moves arbitrary python tags to UnsafeLoader”
PyYAML 5.2 (2019): “Make FullLoader safer by removing python/object/apply”

This has been publicly documented for years. Codex accurately described the behavior, but it’s not a newly discovered vulnerability.

Verdict: Technically correct, but known/documented.

What We Actually Found

After going through all AI claims and finding they were either false or already known, we kept digging. And we found two interesting observations neither model clearly identified (though both pointed at the right code snippet).

PyYAML Merge Key Exponential DoS

Claude had pointed us to merge key handling but mischaracterized the issue as a recursion depth problem that could cause a stack overflow. The area was right, but the described vulnerability was wrong. Codex did not flag it.
After some digging, we found that the behavior had been raised in an issue a few months earlier:
https://github.com/yaml/pyyaml/issues/897

We did further manual analysis and found out that duplicate alias references in merge keys caused the same node to be processed repeatedly without deduplication, resulting in exponential resource amplification.
A document of 847 bytes at depth 22 produces 8,388,607 pairs and consumes ~12 seconds and ~288MB on CPython 3.11.
This affects yaml.safe_load() - the supposedly safe API for untrusted input. Any service accepting YAML and using this specific package could be DoS’d with less than 1 KB.
We submitted PR #916 to PyYAML with a fix that tracks duplicate references; it is still under review.
The issue had been publicly raised, but the amplification mechanism, exact impact, and root cause analysis required manual investigation.
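The reported pair count is consistent with simple doubling. As a worked check, assume each nesting level merges the previous level twice and adds one pair of its own, so p(0) = 1 and p(d) = 2 * p(d - 1) + 1. This recurrence is an assumption that happens to reproduce the figure above; it is not taken from the payload or from PyYAML's code.

```python
def pairs_at_depth(depth):
    """Pairs processed under the assumed recurrence p(d) = 2 * p(d - 1) + 1, p(0) = 1."""
    pairs = 1
    for _ in range(depth):
        pairs = 2 * pairs + 1
    return pairs


# Closed form of the same recurrence: 2 ** (depth + 1) - 1.
count = pairs_at_depth(22)
```

Under this model, every extra level roughly doubles the work while adding only a few bytes to the document, which is why a sub-kilobyte input is enough.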

lodepng IDAT Decompression (Defensive Improvement)

Both Claude and Codex flagged that lodepng doesn’t limit IDAT decompression by default, unlike zTXt and iCCP chunks (16MB limits).

This was not a newly discovered vulnerability. The library already has a max_output_size setting available via the advanced API. The issue is that the simple API doesn’t apply limits by default for IDAT.

We submitted a pull request aligning IDAT behavior with other chunk limits, making the safer choice the default.

It’s a good defensive improvement, not a new vulnerability discovery.

What This Reveals

Across both codebases, a consistent pattern emerges. The models were strong at spotting structural risk signals such as buffer writes, nested quantifiers, unsafe loaders, recursive logic, and unusual reference handling.
They rapidly highlighted code that looked dangerous and in several cases described documented behavior accurately.
Where they struggled was context and verification.
They did not reliably distinguish between:

Risky-looking code and actually exploitable code
Documented behavior and undisclosed vulnerabilities
Intentional design trade-offs and security flaws

They also struggled with something more fundamental: rigorous validation.

Security research is not just spotting suspicious patterns. It requires building deterministic, reproducible tests that establish:

The precise trigger condition
The absence of hidden invariants or guardrails
Quantitative impact (time, memory, amplification, crashability)

In our evaluation, the models generated plausible hypotheses but did not independently produce reliable proofs of exploitability. Verification required carefully engineered inputs, instrumentation, repeated measurement, and historical analysis. That process of isolating variables, ruling out alternative explanations, and quantifying impact remained human-driven.
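Repeated measurement of time and memory can be as simple as the sketch below. It is not the harness from our verification scripts: `measure` and `build_list` are hypothetical, and the workload is a stand-in for a call into the parser under test. It uses only the standard library (`time.perf_counter` and `tracemalloc`).

```python
import time
import tracemalloc


def measure(fn, *args, repeats=3):
    """Run fn(*args) several times; return (result, best_seconds, peak_bytes).

    tracemalloc slows execution, so the timings are upper bounds.
    """
    best = float("inf")
    peak = 0
    result = None
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start
        _, run_peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best = min(best, elapsed)
        peak = max(peak, run_peak)
    return result, best, peak


def build_list(n):
    """Stand-in workload; a real check would call the parser under test."""
    return list(range(n))


result, seconds, peak_bytes = measure(build_list, 100_000)
```

Taking the best of several runs reduces noise from the machine, and recording peak memory next to time gives the two impact numbers a DoS claim needs.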

At the same time, AI demonstrated a real strength: it can surface subtle or rare combinations at scale. Unusual feature interactions or edge-case constructions that would be expensive for humans to systematically enumerate are exactly the kinds of signals these systems are good at highlighting.

So, Can AI Actually Find Real Vulnerabilities?

The honest answer is nuanced.

Anthropic has recently stated that Claude helped identify 500+ vulnerabilities across open-source projects. That claim suggests a meaningful step forward in AI assisted security research. But without disclosure trails (patches, maintainer acknowledgments, CVE assignments, or detailed verification reports) it is difficult to evaluate how much of that “500+” represents autonomous discovery versus large-scale hypothesis generation followed by human validation. The distinction matters, because in security research, validation is the discovery.

Based on our experiments we believe AI is useful as a signal amplifier. It accelerates code triage and surfaces edge cases that would be expensive to enumerate manually. But transforming a signal into a confirmed vulnerability with reproducible proof, measured impact, clear novelty, and a validated fix remains a rigorous process.

The path forward isn’t just about better AI models. It’s about building better verifiers: validation systems that reduce false positives through systematic checks, such as testing actual exploitability, checking documentation and history, and measuring real impact. These deterministic validation layers are where AI can actually help most.

Because in cybersecurity a vulnerability isn’t confirmed when it’s predicted, it’s confirmed when it’s reproduced and its impact is demonstrated.

The authors, Akshat Singh Jaswal and Ashish Baghel, are research interns at Lossfunk.

Are You Getting The Best Version of Your LLM?

Shourya — Wed, 18 Feb 2026 13:44:28 GMT

This blog is a brief overview of our research paper: Language Models Entangle Language and Culture. It was accepted at LM4UC Workshop, AAAI 2026.

Read the full paper here: https://alphaxiv.org/abs/2601.15337

TL;DR:

Large Language Models (LLMs) provide answers of varying quality to generic subjective-type questions across languages.
The cultural context used by LLMs when generating responses depends on the language of the query.
The entanglement of language and culture in LLMs impacts their performance on downstream tasks.

Why Should You Care?

All of us use LLMs for the simplest of queries on a regular basis, ranging from tips on improving sleep quality to help with preparing for job interviews. While there has been a lot of research evaluating the performance gap across languages on math, coding, or reasoning tasks, little work evaluates LLMs on generic queries. Additionally, there is a lack of work investigating how language and culture are related in LLMs and how this relationship qualitatively affects the generated responses.

Question Generation

To ground the questions for this evaluation, we analyzed what users usually ask LLMs. We used the WildChat dataset, which contains ~4.8M queries that users have asked ChatGPT. We filtered by query length (removing queries that were too short or too long), removed duplicate or highly similar queries, and then clustered the remaining queries using the HDBSCAN algorithm to identify the major topics and query types. We finally chose the following areas for evaluation and manually created a set of 20 questions:

Programming Advice
Research Advice
Trading/Investing
Learning
Business/Marketing
Job/Interview
Health/Medicine

The full list of questions generated can be found in the paper.

Evaluation

We use LLM-as-a-judge for evaluation with Cohere-Command-A as the judge model due to its high multilingual capabilities. We carry out two kinds of evaluations:

Answer Quality

We first evaluate whether the quality of answers is different across languages. For this, we generate 10 responses per question in each of these 6 languages: English, Hindi, Chinese, Swahili, Hebrew and Brazilian Portuguese. In total, we generate 1200 responses per model. We pass the response in the native language to the judge model and ask it to evaluate the response out of 5 given the query and the rubrics. The results for this evaluation can be found in the earlier figure in the blog.

To ensure that low scores for responses generated in some languages are not due to language bias of the judge model, we translate a subset of responses in English to Hindi and a subset of responses in Hindi to English using Gemini-2.5-Flash. We evaluate the translated responses using the same LLM-as-a-judge setup and calculate the average scores.

Results show that responses generated in Hindi and translated to English score lower on average than responses generated in Hindi and evaluated in the native language itself (lower row of the image). Also, responses generated in English and translated to Hindi retain their high scores compared to responses generated in Hindi (right column of the image). We note that translation in either direction leads to some reduction in scores, as translation is lossy, but the judge model does not show any language bias.

Response Context

In the second part of evaluation, we translate all responses to English using Gemini-2.5-Flash and ask the judge model to predict which cultural context the answer represents. We translate all responses to English to ensure that the judge model does not predict cultural context based on language. For each response, cultural context is classified as one of:

English (Western/Anglo-American)
Chinese
Indian
Jewish
African
Brazilian-Portuguese/Latin

We find that even after translating all responses to English, the judge model is able to identify cultural context from the responses: 95% of responses in English are classified as English (Western/Anglo-American), 47% of responses in Hindi are classified as Indian, and 74% of responses in Chinese are classified as Chinese. This shows that the generated responses contain cultural cues that remain identifiable even after translation. It also indicates that the language of the query leads to responses with different cultural context, showing that language and culture are entangled in LLMs.
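The per-language percentages above are match rates between the query language and the culture label the judge predicts. A minimal sketch of that computation is below; `EXPECTED` and `match_rates` are hypothetical names, the label strings are the ones listed above, and the records are made up for illustration rather than taken from the paper.

```python
from collections import Counter

# Which predicted label counts as a match for each query language.
EXPECTED = {
    "English": "English (Western/Anglo-American)",
    "Hindi": "Indian",
    "Chinese": "Chinese",
}


def match_rates(records):
    """records: iterable of (query_language, predicted_culture) pairs."""
    totals = Counter()
    matches = Counter()
    for language, predicted in records:
        totals[language] += 1
        if predicted == EXPECTED.get(language):
            matches[language] += 1
    return {language: matches[language] / totals[language] for language in totals}


# Hypothetical judge outputs, not data from the paper.
records = [
    ("Hindi", "Indian"),
    ("Hindi", "English (Western/Anglo-American)"),
    ("Chinese", "Chinese"),
    ("English", "English (Western/Anglo-American)"),
]
rates = match_rates(records)
```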

To further verify the entangled nature of language and culture in LLMs, we translated a subset of CulturalBench with 789 questions covering 29 countries to Hindi, Chinese, Swahili, Hebrew and Brazilian Portuguese using Gemini-2.5-Flash. We evaluate Qwen3-14b on this subset across languages with temperature set to 0.

We find that performance on the questions related to each country varies by language. We believe this is because the model draws on different cultural context depending on the language of the query, which affects performance when answering questions.

We conducted further ablations and analysis to verify the validity of our results and to show that language and culture are entangled in LLMs. To know about other experiments, details of our LLM-as-a-judge setup and prompts used for evaluation, read the full paper: https://www.alphaxiv.org/abs/2601.15337.

Shourya Jain & Paras Chopra — Lossfunk Research
📧 shourya.jain@lossfunk.com | paras@lossfunk.com

Teaching morality to transformers

Mayank Goel — Thu, 05 Feb 2026 11:55:19 GMT

This is a summary of our paper that was accepted at Machine Ethics Workshop at AAAI, 2026: Building Interpretable Models for Moral Decision-Making

Preprint: https://arxiv.org/abs/2602.03351
Code: https://github.com/Lossfunk/modeling-moral-machine
Authors: Mayank Goel, Aritra Das, Paras Chopra

TL;DR:

We train a custom transformer model on MIT Moral Machine data to make moral decisions on trolley-problem-like scenarios
Through interpretability experiments, we found:
- Causal influence: Characteristics like criminality, age, and species have the strongest effect on moral decisions
- Layer specialization: Simple moral comparisons (legality, gender) emerge in Layer 1, while complex judgments (species, social status) develop in Layer 2
- Head specialization: Different attention heads handle different moral axes
- Sparse circuits: Only 17.6% of neurons are actually needed for moral decisions
This opens the door to safety applications like targeted debiasing - rather than needing to fine-tune the whole model, we can intervene at specific parts of the network to change the model’s moral reasoning

Morality is often considered subjective and largely qualitative. The trolley problem tries to get to the heart of utilitarianism: do we value saving more lives over fewer, even at the cost of intervening? The MIT Moral Machine takes this a step further: rather than just comparing numbers of people, what is our preference across many different axes, such as dogs, cats, executives, doctors, homeless people, and children? The project crowdsourced these preferences from millions of comparisons and released a dataset. We train a custom transformer model on this data and then try to understand, at the mechanistic level, how the model makes moral decisions.

Architecture

There are 23 “features” that can be used to represent a particular choice: intervention, legality, type of character, etc. Each feature can take a specific value; for characters, it’s the number of that character present in the choice. We create a final 47-length “sentence”: 23 + 23 tokens representing the two choices, plus one [CLS] token that is ultimately used for decision making. Each token in this sentence has 64 dimensions, made by concatenating the character embedding, the cardinality embedding, and the team embedding. This also means that we don’t use any position embedding in our model. The [CLS] token then goes to an MLP, which outputs a value between 0 and 1 indicating how much the model prefers Team A (0) or Team B (1). We train on 3.7M samples and validate on 1.7M samples. While the training data contains conflicting answers, we consider this a feature: over many epochs, the model learns to hedge its bets and give values of around 0.5 for true dilemmas. We get an accuracy of 77% on the validation set using a 2-layer, 2-head model with 104k parameters.
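The input layout can be sketched as a token list. This is an illustration only: the token tuple format, the position of [CLS], and the name `build_sentence` are assumptions, and the embedding step is left out because the post does not say how the 64 dimensions are split across the three embeddings.

```python
NUM_FEATURES = 23  # features per choice (from the post)
CLS = ("[CLS]", 0, None)


def build_sentence(team_a, team_b):
    """Return the 47-token input: [CLS] + 23 tokens for Team A + 23 for Team B.

    team_a / team_b: lists of 23 integer feature values (e.g. character counts).
    Each token is (feature_index, value, team); the model would embed each part
    and concatenate the embeddings into one 64-dimensional vector.
    """
    if len(team_a) != NUM_FEATURES or len(team_b) != NUM_FEATURES:
        raise ValueError("each team needs exactly 23 feature values")
    tokens = [CLS]
    tokens += [(i, v, "A") for i, v in enumerate(team_a)]
    tokens += [(i, v, "B") for i, v in enumerate(team_b)]
    return tokens


sentence = build_sentence([0] * 23, [1] * 23)
```

Because the team is carried inside each token, the order of tokens carries no extra information, which fits the choice to drop position embeddings.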

Interpretability

We run several experiments on this model to learn how it thinks about morality.

Causal Intervention

To measure which characters causally influence the model’s decisions, we employ the DoWhy causal inference framework. We generate 20k synthetic moral scenarios and construct a causal model for each character, then calculate the Average Treatment Effect, i.e., how much this character influences a moral decision, controlling for other factors such as group size. These results are also supported by our experiment using Local Relevance, following Chefer et al. (2024).
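To make the "controlling for group size" step concrete, here is a simplified stand-in that adjusts for one confounder by stratification. It does not use DoWhy and is not the estimator from the paper; `stratified_ate` and the example rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def stratified_ate(rows):
    """Average treatment effect, adjusting for one confounder by stratification.

    rows: iterable of (treated, stratum, outcome). Within each stratum we take
    mean(outcome | treated) - mean(outcome | control), then weight each stratum
    by its share of rows. Strata missing either arm are skipped.
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for treated, stratum, outcome in rows:
        strata[stratum][bool(treated)].append(outcome)

    usable = {s: arms for s, arms in strata.items() if arms[True] and arms[False]}
    total = sum(len(arms[True]) + len(arms[False]) for arms in usable.values())
    if total == 0:
        raise ValueError("no stratum contains both treated and control rows")

    ate = 0.0
    for arms in usable.values():
        weight = (len(arms[True]) + len(arms[False])) / total
        ate += weight * (mean(arms[True]) - mean(arms[False]))
    return ate


# Hypothetical rows: (character present?, group size, model preference for that side).
rows = [
    (True, 1, 0.8), (False, 1, 0.6),
    (True, 2, 0.9), (False, 2, 0.5),
]
ate = stratified_ate(rows)
```

Comparing treated and control scenarios only within the same group size keeps a larger group from being mistaken for an effect of the character itself.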

Layer-wise Bias Localization

To identify where moral biases emerge in the network, we perform layer-wise attribution analysis by extracting attention weights from each transformer layer and correlating them with bias scores across five bias dimensions: legality (criminal vs. law-abiding), gender (man vs. woman), social role (executives/doctors vs. homeless), age (children vs. elderly), and species (humans vs. animals). Through this, we saw that the first layer of the model learns simple moral comparisons, while species and social status are primarily learnt in the second layer. We also saw that the model localizes bias along specific moral axes to specific heads, supporting our hypothesis that the model specialises its moral decision making.

Circuit Probing

To check how dense (or sparse) our model is, we use circuit probing, which learns which neurons are responsible for computing specific intermediate variables by training sparse binary masks over a frozen model, then validates causality through targeted ablation while comparing against random subnetwork controls. We discovered a sparse circuit that used only 17.6% of the neurons in the MLP to make decisions; removing it led to an 8.3% accuracy drop.

Wrapping up

The interpretability experiments show several interesting things about morality as learnt from the dataset, suggesting that human notions of morality can themselves be learnt by training models on such data. The approach has clear limitations: training on aggregate human preferences inherits cultural biases. However, transparency enables new intervention strategies. Knowing that criminal bias localizes to Layer 0 Head 1 allows targeted debiasing or clamping of attention weights, rather than coarse dataset rebalancing or full model finetuning. In future work, we hope to use this as a base for extending the analysis to larger, traditional LLMs on moral questions.

Mayank Goel, Aritra Das, Paras Chopra — Lossfunk Research

Can an AI actually be your research mentor?

Abhinav Rajeev Kumar — Wed, 21 Jan 2026 12:30:02 GMT

This is a summary of our preprint: METIS: Mentoring Engine for Thoughtful Inquiry & Solutions

Full paper: https://arxiv.org/abs/2601.13075
AlphaXiv: https://www.alphaxiv.org/abs/2601.13075
Code: https://github.com/lossfunk/ai-research-mentor

TL;DR

We built METIS, a stage-aware research mentor that adapts guidance to where a student is in the research process (A: pre-idea → F: final).
Across 90 single‑turn prompts, LLM judges preferred METIS over Claude Sonnet 4.5 in 71% of prompts and over GPT‑5 in 54%.
Student‑persona rubrics show higher clarity, actionability, and constraint‑fit, especially in later stages that use document grounding.
Multi‑turn tutoring improves slightly over GPT‑5 on final quality, with gains concentrated in document‑grounded stages.
The biggest lift shows up when students already have a draft and need precise, grounded feedback rather than generic advice.

The problem we cared about

Most students don’t have a research mentor. Even when they have access to strong models, the guidance is generic and often skips steps. A student might ask, “How do I start research in AI?” and get a polished answer that still doesn’t move them forward.

In practice, the real pain shows up later too. Students get a half‑formed idea, run into a feasibility wall, or collect notes without knowing how to turn them into a method section. The gap isn’t just knowledge; it’s sequencing. Good mentors know what to ask next and what to ignore for now.

We wanted something more specific: an AI mentor that keeps track of where the student is in the research journey and nudges them forward with the right tools and checks.

That’s METIS.

What METIS actually does

METIS is stage‑aware. It classifies the student’s current stage and routes tools accordingly:

A (Pre‑Idea): orientation, constraints, research areas
B (Idea): feasibility, novelty checks, risks
C (Plan): timelines, baselines, ablations
D (First draft): methodology checks, missing evidence
E (Second draft): limitations, discussion, reviewer‑style critique
F (Final): submission checklist, artifact planning

The response always includes two explicit blocks:

Intuition
Why this is principled

Those aren’t fluff. They force the mentor to surface its reasoning and justify advice against grounded evidence or known research heuristics. It also helps students see the logic behind the suggestion, which makes it easier to act on.

The tools matter, but the ordering matters more. A student in Stage B needs a novelty check; a student in Stage E needs a reviewer‑style critique and a tighter limitations section. METIS is built to respect that.
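Stage-based routing and the fixed response format can be sketched as a lookup plus a template. This is not the METIS implementation: the stage letters and focus areas are copied from the list above, while `STAGE_FOCUS`, `route`, and `format_response` are hypothetical names, and stage classification itself is left out.

```python
# Stage labels and focus areas are copied from the list above; the rest is illustrative.
STAGE_FOCUS = {
    "A": ("Pre-Idea", ["orientation", "constraints", "research areas"]),
    "B": ("Idea", ["feasibility", "novelty checks", "risks"]),
    "C": ("Plan", ["timelines", "baselines", "ablations"]),
    "D": ("First draft", ["methodology checks", "missing evidence"]),
    "E": ("Second draft", ["limitations", "discussion", "reviewer-style critique"]),
    "F": ("Final", ["submission checklist", "artifact planning"]),
}


def route(stage):
    """Return (stage name, focus areas) for a stage letter A-F."""
    try:
        return STAGE_FOCUS[stage.upper()]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None


def format_response(advice, intuition, principle):
    """Render advice followed by the two explicit blocks."""
    return (
        f"{advice}\n\n"
        f"Intuition: {intuition}\n\n"
        f"Why this is principled: {principle}"
    )
```

Raising on an unknown stage, rather than falling back to a default, keeps a misclassification visible instead of silently routing the wrong tools.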

Evaluation setup

We tested METIS against GPT‑5 and Claude Sonnet 4.5. All systems had web and document search; METIS had an extra Research Guidelines tool.

Benchmark:

90 single‑turn prompts (15 per stage A–F)
5 multi‑turn tutoring scenarios per system
Judges: Gemini 2.5 Pro, DeepSeek v3.2‑exp, Grok‑4‑fast

Metrics included LLM‑judge preferences and student‑persona rubrics (clarity, actionability, constraint‑fit). We also tracked whether the responses stayed inside each student’s constraints (time, compute, course level), since that’s where generic advice tends to fall apart.
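One way to turn three judges' pairwise preferences into a win rate is a per-prompt majority vote, sketched below. The aggregation rule is an assumption and the paper may combine judges differently; `majority_winner`, `win_rate`, and the votes are hypothetical.

```python
from collections import Counter


def majority_winner(votes):
    """Return the system named by a strict majority of judges, else None."""
    system, count = Counter(votes).most_common(1)[0]
    return system if count > len(votes) / 2 else None


def win_rate(all_votes, system):
    """Share of prompts where `system` is the majority winner."""
    wins = sum(1 for votes in all_votes if majority_winner(votes) == system)
    return wins / len(all_votes)


# Hypothetical votes from three judges on four prompts.
votes = [
    ["METIS", "METIS", "baseline"],
    ["baseline", "baseline", "METIS"],
    ["METIS", "METIS", "METIS"],
    ["METIS", "baseline", "METIS"],
]
rate = win_rate(votes, "METIS")
```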

Results that matter

Single‑turn (LLM‑judge):

METIS beats Claude Sonnet 4.5 in 71% of prompts
METIS beats GPT‑5 in 54% of prompts
Gains are strongest in later stages (D–F) where document grounding matters

One pattern that kept showing up: METIS does best when the prompt includes real material. If the student shares a draft, an outline, or a methods blurb, METIS can reference it directly and tighten the advice. The baselines tend to reply with broadly correct but less actionable feedback.

Student rubrics:

Higher clarity, actionability, constraint‑fit across stages
Improvements are consistent in later stages

On clarity, the wins aren’t subtle. Students get fewer “do more literature review”‑style answers and more specific next steps, like what to measure, what to fix in an experiment plan, or which baseline comparisons are missing.

Multi‑turn tutoring:

Slightly higher final quality vs GPT‑5
Gains cluster where grounding and stage‑specific checks matter

Multi‑turn was the hardest setting because it punishes shallow routing mistakes. When the stage is misread early, the rest of the conversation drifts. METIS isn’t immune, but its failures were less frequent than those of the baselines in our scenarios.

Why this worked

The biggest difference is structure. METIS doesn’t just answer; it tracks the student’s stage, routes tools that make sense for that stage, and enforces a response format that includes reasoning and justification.

That structure seems to matter most when students are already working with a draft and need concrete, actionable feedback. We saw the clearest lift in stages D–F, where students have material on hand and the mentor can ground advice in actual text, not just general tips.

We also saw fewer overconfident leaps. Stage awareness makes the system pause and ask for missing context instead of inventing it. It’s a small change in behavior, but it compounds over a multi‑turn exchange.
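To make the idea of stage-aware routing concrete, here is a minimal sketch. It is not the METIS implementation: the tool identifiers, the `RoutingDecision` type, and the rule for when to ask for material are all hypothetical, and only the Stage B (novelty check) and Stage E (reviewer-style critique, limitations) entries echo examples from this post.

```python
from dataclasses import dataclass

# Hypothetical stage-to-tool table. Only the Stage B and Stage E entries echo
# examples from this post; all identifiers are illustrative placeholders.
STAGE_TOOLS = {
    "B": ["novelty_check"],
    "E": ["reviewer_critique", "limitations_review"],
}
BASE_TOOLS = ["web_search", "document_search", "research_guidelines"]
VALID_STAGES = set("ABCDEF")
GROUNDED_STAGES = {"D", "E", "F"}  # stages where grounding in a draft matters most


@dataclass
class RoutingDecision:
    stage: str
    tools: list
    ask_for_material: bool


def route(stage: str, has_student_material: bool) -> RoutingDecision:
    """Pick tools for a stage and decide whether to ask for missing context first."""
    if stage not in VALID_STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    tools = BASE_TOOLS + STAGE_TOOLS.get(stage, [])
    # Assumed rule: ask for the draft instead of inventing it when grounding is expected.
    ask = stage in GROUNDED_STAGES and not has_student_material
    return RoutingDecision(stage, tools, ask)


decision = route("E", has_student_material=False)
print(decision)
```

The point of the sketch is the ordering: the stage is fixed first, and both the tool list and the decision to pause for context follow from it.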

Limitations

There are still failure modes:

Premature tool routing
Shallow grounding
Occasional stage misclassification

We also don’t claim METIS is a full replacement for a human mentor. The goal is a reliable co‑pilot, a system that makes it easier for a student to move forward when they’re stuck. And like any tool, it still needs good prompts and honest inputs to work well.

Conclusion

METIS doesn’t solve mentorship, but it does make progress on the part that’s most brittle: knowing what a student needs next and saying it plainly. The tooling is useful, but the bigger win is the stage-aware framing that stops the system from jumping ahead.

We’re releasing prompts, scripts, and evaluation artifacts so others can reproduce results and extend the setup. Natural next steps are learning the router from tool‑trace logs, running ablations across components, and validating the gains with real students over a longer horizon. If you use the artifacts, we’d love to see what breaks and what holds up.

Read the paper

Paper: https://arxiv.org/abs/2601.13075
AlphaXiv: https://www.alphaxiv.org/abs/2601.13075
Code: https://github.com/lossfunk/ai-research-mentor

Abhinav Rajeev Kumar, Dhruv Trehan, Paras Chopra — Lossfunk Research
abhinav.kumar@lossfunk.com | dhruv.trehan@lossfunk.com | paras@lossfunk.com

Why LLMs Aren't Scientists Yet

Dhruv Trehan — Fri, 09 Jan 2026 03:23:18 GMT

As part of our explorations in AI for Science, we set out to answer how far current SoTA reasoning LLMs can go in doing autonomous research with minimal scaffolding. Could they go from a high-level research idea to a complete paper?

To answer this, we built a six-agent pipeline using Gemini 2.5 Pro and Claude Code, and tested it on four research ideas across World Models, Multi-Agent RL, and AI Safety. Three failed. One succeeded and got accepted at Agents4Science 2025, the first academic conference requiring AI as primary author.

Figure 1: Interaction between the six agent modules and the shared file system artifacts (idea.md to paper outline.md) used to maintain context.

Along the way, we observed six recurring failure modes and derived four design principles for building robust LLM Scientist systems. We release a technical report on arXiv (arxiv.org/abs/2601.03315) and a corresponding website (whyaiscientistsfail.lossfunk.com) detailing these, our system architecture, each research attempt, and broader implications for LLMs in Science.

Read through the full report here.

You can also go through the highlights on our X thread below.

Lossfunk (@lossfunk) on X, 8 Jan 2026: “@dhruvtrehan9 tested if LLMs can perform end to end ML research. 3/4 attempts failed. One worked and led to a paper accepted at Agents4Science 2025, world’s first conference for AI authors. In the report we …”

This is early work with clear limitations. We ran only four ideas in three ML subdomains, did no systematic ablations, and identified failure modes through observation rather than quantitative measurement. But we see it as a starting point for understanding where LLM scientists break and how to build better ones. If you’re working on similar problems or have thoughts, we’d love to hear from you.

Dhruv Trehan & Paras Chopra — Lossfunk Research
📧 dhruv.trehan@lossfunk.com | paras@lossfunk.com

Dreaming Is the New Thinking

Akshat Singh Jaswal — Fri, 19 Dec 2025 07:49:24 GMT

When DeepMind’s AlphaGo defeated Lee Sedol in 2016, it didn’t just win by reacting to board positions; it won by thinking ahead and simulating futures that hadn’t happened yet. While AlphaGo used explicit tree search, most agents have operated more like reactors than reasoners, mapping observations directly to actions without ever building an internal intuition of how the world works. But what if agents could do more than respond? What if they could imagine, predict, and plan through simulations before taking a single step?

Introduction

For decades now, RL has achieved remarkable success without explicitly modelling the dynamics of the environments it operates in: agents learn through pure trial and error. Intuitively this feels incomplete. After all, humans don’t navigate the world through blind response patterns; we build mental models that let us imagine consequences before we act. The same principle should apply to agents: they perform better when they understand how the world evolves and can anticipate the consequences of their actions.

World models give agents exactly this capability: internal representations of environment dynamics that let them imagine possible futures, and so plan and make decisions that are more sample-efficient and robust than purely reactive policies.

History

The deep learning revolution in reinforcement learning began with model-free breakthroughs (DQN, PPO, etc.) that enabled robust policy optimization across diverse tasks. These algorithms bypassed the need to learn a model of the environment’s dynamics. Their impressive sample efficiency improvements and generalizability across complex domains shifted the field’s attention away from world models for nearly a decade.

When you can train an agent to achieve superhuman performance without explicitly predicting how the world works, why bother with the added complexity of learning dynamics models that might be inaccurate or computationally expensive?

Early World Models

Ha and Schmidhuber’s world models (2018)

Ha and Schmidhuber’s paper on world models rekindled interest in learning internal simulators of the world by showing that agents can literally learn to dream, and that those dreams can be good enough to train in. The paper’s architecture splits the agent into three parts: a VAE compresses raw pixels into a latent representation, an MDN-RNN learns to predict what comes next as a probability distribution over future latent states, and a tiny linear controller decides what actions to take based on the compressed present and the predicted future. What made this work popular wasn’t just the technical success (solving CarRacing-v0 and exceeding VizDoom leaderboards) but the idea that you could train an agent entirely inside its own imagined environment, then deploy it in the actual environment and watch it perform well. This breakthrough shifted the field’s conversation from “can we learn world models?” to “how far can we scale them?”, inspiring a wave of research on world models.

PlaNet (2019)

PlaNet represented an advance in world models that changed how we think about learning and planning in imagination. While the seminal 2018 World Models paper demonstrated that agents could learn compact representations of environments and use them for control, it relied on training a separate controller and was limited to relatively simple tasks. PlaNet, on the other hand, introduced a latent dynamics model that combines deterministic and stochastic components, the Recurrent State-Space Model (RSSM), which enabled the model to remember information reliably over time and to capture uncertainty over multiple possible futures. This, coupled with direct planning via the Cross-Entropy Method (CEM) in the learned latent space rather than a separate policy network, allowed PlaNet to solve substantially more complex continuous control tasks from raw observations.
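The planning step can be illustrated with a small sketch of the cross-entropy method. This is not PlaNet's code: the function names, population sizes, and the toy scoring function are assumptions, and the toy dynamics stand in for what would be rollouts through a learned latent model. PlaNet also replans at every environment step; the sketch below produces a single plan.

```python
import numpy as np


def cem_plan(score_fn, horizon, action_dim, iters=15, pop=300, n_elite=30, seed=0):
    """Search for a high-scoring action sequence with the cross-entropy method."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian belief.
        candidates = mean + std * rng.standard_normal((pop, horizon, action_dim))
        scores = np.array([score_fn(c) for c in candidates])
        # Refit the belief to the top-scoring (elite) candidates.
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0)
    return mean


GOAL = 1.0  # hypothetical target position for the toy problem


def toy_score(actions):
    # Toy stand-in for a learned model rollout: x_{t+1} = x_t + a_t, starting at 0.
    positions = np.cumsum(actions[:, 0])
    # Higher is better: stay close to the goal at every step.
    return -np.sum((positions - GOAL) ** 2)


plan = cem_plan(toy_score, horizon=5, action_dim=1)
print(np.round(plan[:, 0], 2))
```

For this toy objective the best plan moves to the goal in the first step and then stays put, so the first action should end up near 1 and the rest near 0.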

Modern World Models

DreamerV3 (2023)

DreamerV3 marked an important moment in reinforcement learning by delivering on the promise of a general-purpose learning algorithm that works across diverse domains without domain-specific tuning. It evolved through two prior generations (DreamerV1 and V2) to address a fundamental problem that plagued model-based RL: the tendency for learned world models to either explode with large prediction errors or collapse into uninformative representations when facing the vastly different reward scales, observation complexities, and temporal dynamics of different environments (Atari, continuous control, open-world environments, etc.). Its breakthroughs were robustness techniques that keep the world model from falling into the same failure modes as its predecessors. These include symlog transformations that compress large-magnitude values symmetrically around zero while leaving small values nearly unchanged, a “symexp twohot” loss that represents predictions as categorical distributions over exponentially spaced bins, percentile-based return normalization that adapts exploration to reward sparsity, and a carefully balanced KL objective with “free bits” that prevents the world model from either ignoring visual details or overfitting to noise.

Most remarkably, DreamerV3 became the first algorithm to collect diamonds in Minecraft from scratch, a challenge requiring 20+ minutes of farsighted planning with sparse rewards in procedurally generated worlds, while simultaneously achieving SOTA results on over 150 tasks spanning 8 benchmarks with a single set of hyperparameters.
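The symlog idea is small enough to show directly. The sketch below uses the standard definitions, symlog(x) = sign(x) · ln(|x| + 1) and its inverse symexp(x) = sign(x) · (exp(|x|) − 1); the example reward values are made up for illustration.

```python
import numpy as np


def symlog(x):
    """Compress magnitudes symmetrically around zero: sign(x) * ln(|x| + 1)."""
    return np.sign(x) * np.log1p(np.abs(x))


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return np.sign(x) * np.expm1(np.abs(x))


# Made-up rewards spanning several orders of magnitude.
rewards = np.array([-1000.0, -1.0, 0.0, 0.5, 1000.0])
squashed = symlog(rewards)
restored = symexp(squashed)
print(np.round(squashed, 3))
```

A reward of 1000 maps to roughly 6.9 while a reward of 0.5 maps to roughly 0.41, so one network head can predict targets from very different reward scales without per-environment rescaling.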

This work shifted the paradigm from viewing world models as brittle components to treating them as robust foundation models for decision-making, opening pathways toward agents that can learn general world knowledge from diverse data and transfer it across tasks.

IRIS (2022)

The IRIS paper demonstrated that Transformers can serve as highly sample-efficient world models for complex visual environments. Building on previous work, IRIS introduced a novel architecture that replaces traditional recurrent networks with a discrete autoencoder paired with an autoregressive Transformer. The key innovation was casting environment dynamics as a sequence modeling problem: frames are tokenized into discrete symbols, and a Transformer autoregressively predicts future tokens, rewards, and episode terminations based on the actions taken. What made this particularly impactful for the field is that it validated Transformers as viable alternatives to recurrent architectures for world modeling, opening new pathways for more massively parallel architectures.

DIAMOND (2024)

DIAMOND (DIffusion As a Model Of eNvironment Dreams) introduced the first successful application of diffusion models to world modelling for RL and achieved then-SOTA performance on the Atari 100k benchmark. The key innovation was to adopt the EDM (Elucidating the Design Space of Diffusion Models) formulation instead of traditional DDPM to generate stable, high-fidelity predictions directly in pixel space with just 3 denoising steps, which challenged the prevailing reliance on the discrete latent state representations used by IRIS and DreamerV3. Beyond benchmarks, the authors scaled their approach to complex 3D environments like CS:GO, creating an interactive neural game engine that laid the groundwork for future work on world models that generate interactive environments.

V-JEPA 2 (2025)

V-JEPA 2 is one of the more recent breakthroughs in world models and signals a clear shift towards a new type of world model. One of its most impressive aspects is its ability to learn a robust world model primarily through self-supervised observation of vast amounts of internet video, complemented by a relatively small amount of robot interaction data. This is a game-changer because it moves away from the prohibitive need for extensive, hand-labeled interaction data, which has long been a bottleneck for scaling up robot learning. Another striking result is how well V-JEPA 2 integrates with LLMs. When aligned with an LLM, the system demonstrated state-of-the-art performance on multiple video question-answering tasks, including 84.0% on Perception Test and 76.9% on TempCompass. This is particularly notable because it shows that a video encoder pre-trained without any language supervision can still be effectively aligned with an LLM to achieve top-tier performance on complex video-language tasks, challenging conventional wisdom in the field.

Dreamer V4 (2025)

Unlike earlier world model agents that depended heavily on interacting with their environments (e.g., Atari or small simulation benchmarks), Dreamer V4 represents a major leap by learning purely from videos, and it demonstrated its power by being the first agent to obtain diamonds in Minecraft without ever playing during training. The key innovations are in efficiency and scalability: “shortcut forcing” allows its diffusion model to generate video in just four steps instead of the usual 64, making real-time learning feasible, while x-prediction stabilizes long rollouts by directly predicting clean frames. Interestingly, Dreamer V4 shows strong generalization, achieving near full performance with only a fraction of labeled action data and transferring learned behavior across unseen environments. This shifts world models from tightly coupled, interaction-bound systems to flexible, scalable learners that can absorb vast, unlabeled real-world video data.

The Benchmarking Problem (what’s broken with how we judge world models)

Benchmarks have shaped the field of RL, but they can also mislead. Current popular benchmarks (Atari, narrow robotics tasks, curated simulators) distort incentives and hide the real challenges of building world models that matter in real life.

The key problems with current benchmarks are:

Real-world transfer gap. High scores on simulated tasks rarely predict performance in noisy, partially observed, physically grounded environments. Models tuned to simulator idiosyncrasies break when exposed to real sensors, unexpected physics, or distributional shift.
Lack of causal understanding and interpretability. Many world models compress the world into latent dynamics that are effective to “solve” benchmarks but opaque to humans. Without interpretable causal structures it is hard to know when a model will generalize or to debug catastrophic failures.
Long horizon planning difficulty. Benchmarks that reward short episodes or dense reward signals encourage myopic strategies. Real tasks often require long term planning under uncertainty and incremental score gains on short tasks don’t measure that.
Gaming the benchmarks. Researchers often overfit to evaluation suites and choose seeds that score high rather than improving core generalization or reasoning capabilities.

Atari-100k as a benchmark

It’s easy to dismiss “ALE/Atari” as a solved benchmark; after all, many RL agents now play Atari games at or above human level. But as Pablo Samuel Castro argues in In Defense of Atari, that view completely misses the point of what Atari was meant to be: not an end goal, but a research platform. Over the years, a familiar recipe has taken hold: introduce a fancy idea, test it on Atari, show a few points of aggregate improvement over a baseline, claim SOTA. But under those plots, the story is far more nuanced: small leaderboard gains often mask massive sensitivity to hyperparameters, inconsistent per-game performance, and brittle generalization.

This hyperparameter sensitivity raises a harder question: if we can’t make agents work reliably on Atari, how can we hope to scale them to messy, real-world systems? That’s exactly why Atari still matters. Its diversity of environments, deterministic and stochastic variants, and now continuous action extensions make it a uniquely rich testing ground. Unlike many modern benchmarks, Atari games weren’t designed for RL; they were designed for humans, which helps reduce experimenter bias.

The real lesson is not to stop using it, but to use it properly. Stop treating IQM scores as proof of progress. Report per-game behavior, sensitivity analyses, and robustness across data regimes. Use Atari to ask why the algorithm works, not just whether it gets a better score. Chasing the leaderboard is easy, but building methods that are robust, transferable, and interpretable on a platform as well understood as Atari is hard and far more meaningful for the future of world models.
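A small sketch shows why a single aggregate can hide what per-game reporting reveals. The scores below are made up, and the IQM here is the usual interquartile mean (the mean of the middle 50% of all pooled scores); the array layout, with seeds as rows and games as columns, is an assumption for illustration.

```python
import numpy as np


def interquartile_mean(scores):
    """Mean of the middle 50% of all scores, pooled across seeds and games."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values dropped from each tail
    return flat[cut:flat.size - cut].mean()


# Made-up human-normalized scores: rows are seeds, columns are games.
scores = np.array([
    [0.10, 0.90, 1.00, 8.00],
    [0.20, 1.10, 1.00, 0.10],
    [0.10, 1.00, 1.20, 7.50],
])

iqm = interquartile_mean(scores)
per_game_mean = scores.mean(axis=0)
per_game_spread = scores.max(axis=0) - scores.min(axis=0)  # seed sensitivity per game

print(round(iqm, 3))
print(np.round(per_game_mean, 2), np.round(per_game_spread, 2))
```

The IQM looks unremarkable, yet the per-game view shows one game where the agent barely scores and another where the result swings wildly between seeds. That is the kind of behavior a leaderboard number does not surface.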

Future directions

If world models are to move from lab experiments to practical engines of planning and control, research should focus on several concrete directions.

Design better benchmarks. Create benchmark suites that explicitly test transfer, long horizons, partial observability, and real noise. Include cross-domain suites and stress tests.
Bridging sim-to-real at scale. Exploit large unlabeled video datasets for diverse and open world dynamics while using small, high-quality labeled interaction datasets to anchor domain specific understanding. Methods that show strong few-shot adaptation from simulated or internet video to real robots will be crucial.
Interpretable world models. Develop inductive biases and architectures that yield disentangled, causally meaningful latent representations. Tools for inspecting and intervening in learned dynamics are needed.
Algorithmic efficiency and interactive generation. Advances like shortcut forcing or reduced-step diffusion matter because practical agents must imagine and plan in real time. Invest in model architectures and generative methods that trade off fidelity for speed in controllable ways.
Community practices and reproducibility. Standardize reporting, hyperparameters, compute budgets, ablations, and seeds. Share datasets, pretrained world models, and evaluation harnesses to make comparisons meaningful.

Open questions in world models

What is the right abstraction? Are current latent spaces (dense vectors, transformers over tokens) the best medium for causal, long-horizon reasoning or do we need symbolic/hybrid representations?
How to reliably extract actions from passive video? We can learn representations from videos but how do we map those to policies robustly when action labels are scarce?
How to evaluate causality and build causal systems? Can we design universal probes that measure whether a model understands interventions and counterfactuals, beyond correlational prediction?
How do we plan over extremely long time horizons efficiently? Real world problems like robotics require reasoning over minutes or hours. How can models avoid compounding errors and remain coherent over thousands of steps?
What principles underlie generalization in world models? We still don’t have a solid theory explaining why some architectures generalize across tasks and others don’t.
Are world models necessary or just convenient? There’s an ongoing debate between model-based and model-free RL. Are explicit world models essential for intelligence or just one path?

Conclusion

The domain of RL is constantly shifting. For years, research has orbited around narrow benchmarks like Atari, where incremental gains on leaderboards were seen as meaningful progress. But systems like Dreamer V4 represent a turning point, training powerful models from raw videos, scaling to open-ended environments like Minecraft, and demonstrating the ability to generalize.

Technical breakthroughs alone aren’t enough, though. Benchmarks should be stepping stones, not destinations. The real frontier lies in agents that can imagine, plan, and act robustly in open-ended worlds, not just optimize a score in a fixed game. That means rethinking how we evaluate progress: measuring causal understanding, transferability, long-horizon reasoning, and robustness.

World models are still in their infancy, and fundamental questions around abstraction, causality, interpretability, robustness, and scaling remain unsolved. But the direction is clear: the next leap will come from building systems that understand and navigate the world in a way that generalizes.

The end game is not just higher scores on benchmarks but agents that can imagine, predict and act in messy open world environments. That is the real measure of intelligence we are racing towards.

References:
1. World Models (Ha & Schmidhuber, 2018)
2. Learning Latent Dynamics for Planning from Pixels (Hafner et al., 2019)
3. Mastering Diverse Domains through World Models (Hafner et al., 2023)
4. Transformers are Sample‑Efficient World Models (Micheli et al., 2022)
5. Diffusion for World Modeling: Visual Details Matter in Atari (Alonso et al., 2024)
6. Training Agents Inside of Scalable World Models (Hafner et al., 2025)
7. In Defense of Atari - the ALE is not ‘solved’! (Castro)

The author, Akshat Singh Jaswal, is a research intern at Lossfunk.

Your LLM is a confused oracle

Chinmay — Wed, 26 Nov 2025 13:31:45 GMT

This is the summary of our paper: Future Is Unevenly Distributed: Forecasting Ability Of LLMs Depends On What We’re Asking

You can find the paper link here: https://arxiv.org/abs/2511.18394

TL;DR:

LLMs perform differently across question categories such as geopolitics, entertainment, and finance
Adding news context helps in some categories but reduces accuracy in others
News induces failure modes such as definition drift, recency bias, and rumour anchoring, which lower accuracy relative to the no-news setting

As LLMs grow stronger and more “intelligent”, more avenues open up for testing their intelligence. We tend to assume that, as with people, growing intelligence brings a more general thinking process, but LLMs have a different, jagged kind of intelligence.

They are superhuman in some areas, while being subpar in many others. We wanted to test this intelligence in real world forecasting scenarios, and thus devised a benchmark that could test this. We focused on forecasting ability as that requires genuine reasoning under uncertainty, and unlike math or reasoning, is still relatively under-explored with LLMs.

Benchmark Development

We began by collecting approximately 10,000 forecasting questions from various prediction markets such as Polymarket, Metaculus, and Manifold Markets, covering a period from January to July 2025. This period was chosen so that all selected questions fell beyond the models’ cutoff dates. Many of these questions were noisy: their context was hyper-localized, or they didn’t really require any forward-looking reasoning ability.

Some examples include:

“Daily coinflip”

“Will the % chance of ‘YES’ on this market close above 50%?”

“Will I get a Donation/Payment of 10,000 or more Mana before 2025?”

These questions do not provide any real signal of forecasting competence or reveal systematic failure modes. To extract a meaningful subset, we designed a three-stage filtering and classification pipeline.

First, we applied volume filtering to remove low-liquidity markets, which typically correspond to hyper-personalized or creator-specific questions. Next, we employed an LLM-as-a-Judge to classify each question into one of six primary categories, each with five sub-categories:

• Politics: Domestic Policy, Elections & Campaigns, Political Parties & Ideologies, Government Structure, Public Policy & Social Issues

• Entertainment: Movies & Television, Music & Audio, Gaming, Celebrity & Pop Culture, Books & Literature

• Sports: Professional Sports, International Competitions, Individual Sports, Team Sports, Sports Culture & Recreation

• Technology: Computing & Software, Internet & Digital Services, Mobile & Consumer Electronics, Emerging Technologies, Tech Industry & Business

• Finance: Personal Finance, Banking & Financial Services, Markets & Trading, Economic Indicators, Corporate Finance

• Geopolitics: International Relations, Global Conflicts, Trade & Economics, Regional Affairs, Global Governance

Questions that did not align with any of the above were tagged as irrelevant, reducing the corpus to roughly 700 items after aggressive filtering. Despite this reduction, certain residual questions remained non-event-based and failed to meaningfully test predictive reasoning, such as:

“Will @Soaffine be active on Manifold again before April?”

To address these kinds of questions, we performed a second LLM-based filtering pass using a refined judging prompt to exclude localized or non-forecasting items. The final curated dataset contained 392 questions, evenly distributed across the categories and sub-categories listed above. For each retained question, we also preserved metadata such as creation time, resolution time, and final resolution probability.

Evaluation

We sampled a uniform subset of 150 questions from the final corpus, ensuring an equal number of questions per category to maintain a balanced evaluation set. This subset enables consistent cross-category comparison while preserving the representativeness of the larger filtered dataset.
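A balanced draw of this kind can be sketched as below. With six categories, an exactly balanced set of 150 questions works out to 25 per category. The record layout (`id`, `category`), the function name, and the placeholder corpus are assumptions for illustration, not the paper's code.

```python
import random
from collections import defaultdict


def balanced_sample(questions, per_category, seed=0):
    """Draw the same number of questions from every category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for question in questions:
        by_category[question["category"]].append(question)
    sample = []
    for category in sorted(by_category):
        # random.sample raises ValueError if a category has too few questions.
        sample.extend(rng.sample(by_category[category], per_category))
    return sample


CATEGORIES = ["Politics", "Entertainment", "Sports", "Technology", "Finance", "Geopolitics"]
# Placeholder corpus standing in for the curated dataset: 60 dummy questions per category.
corpus = [{"id": f"{c}-{i}", "category": c} for c in CATEGORIES for i in range(60)]

subset = balanced_sample(corpus, per_category=25)
print(len(subset))
```

Fixing the seed keeps the evaluation subset stable across runs, which matters when comparing models on the same questions.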

We evaluated a mixture of reasoning-focused and non-reasoning large language models, including models from multiple families. All models were sampled at a temperature of 0.0, with a maximum budget of 4500 tokens to ensure that they had enough room to express their reasoning. Decoding at temperature 0 keeps outputs as consistent as possible across runs.

Each model received a standard forecasting prompt along with the question text and its creation date to provide temporal grounding. Apart from this contextual timestamp, the models had no access to external tools, retrieval systems, or web search capabilities.

For every prompt, each model outputs two fields:

YES/NO

0–1 confidence score

We evaluated predictions using three key metrics: accuracy, the Brier score, and the Expected Calibration Error (ECE).

Accuracy measures whether the model’s predicted resolution matches the actual market outcome. A correct prediction contributes 1, and an incorrect prediction contributes 0; the mean across all samples yields the final accuracy score.

Brier Score quantifies probabilistic calibration by penalizing confidence errors. It is defined as:

$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (f_i - o_i)^2$$

where N is the number of questions, f_i is the model’s predicted probability for a “YES” outcome, and o_i ∈ {0,1} represents the ground-truth resolution. Lower values indicate better probabilistic accuracy.
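As a worked example, the Brier score is the mean squared difference between the predicted probability of “YES” and the 0/1 outcome. The sketch below uses made-up forecasts and assumes each model output has already been converted to a probability of “YES”; the function name is illustrative.

```python
def brier_score(p_yes, outcomes):
    """Mean squared error between predicted P(YES) and the 0/1 resolution."""
    if len(p_yes) != len(outcomes):
        raise ValueError("p_yes and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(p_yes, outcomes)) / len(outcomes)


# Made-up forecasts: predicted probability of YES and the true resolution.
p_yes = [0.9, 0.2, 0.6, 0.3]
outcomes = [1, 0, 0, 1]

score = brier_score(p_yes, outcomes)
print(round(score, 3))  # (0.01 + 0.04 + 0.36 + 0.49) / 4 = 0.225
```

A forecaster who always answers 0.5 scores 0.25, so values above that are worse than an uninformed guess.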

Expected Calibration Error (ECE) measures the discrepancy between predicted confidence and empirical accuracy across probability bins. Predictions are divided into bins based on confidence, and ECE is computed as:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where M is the number of bins, N is the total number of predictions, B_m contains predictions whose confidence scores fall into bin m, acc(B_m) is the average accuracy within that bin, and conf(B_m) is the mean predicted confidence. Lower values indicate better calibration.
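ECE can be sketched in a few lines: each bin's gap between accuracy and mean confidence is weighted by the share of predictions that fall into it. The ten equal-width bins are an assumption (the bin count is not stated here), as is reading each confidence as the model's confidence in its own YES/NO answer; the inputs are made up.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Interior edges of equal-width bins on [0, 1]; digitize returns 0..n_bins-1.
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.digitize(confidences, interior_edges)
    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_ids == m
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is |B_m| / N
    return ece


# Made-up predictions: confidence in the chosen answer and whether it was right.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]

ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))
```

Here both occupied bins are off by 0.05 and each holds half the predictions, so the ECE is 0.05.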

Evaluation with News Context

For the second evaluation condition, we augmented each forecasting question with external context retrieved from contemporary news sources. This ensured that models received the same type of information a human forecaster would have had when the question was originally posed. We collected recent news snippets for each question by querying a news retrieval system using the question’s creation date as the upper bound for publication time. Occasionally, we observed leakage in the form of articles published after the creation date; such snippets were removed to preserve temporal purity.
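The leakage check amounts to a date filter: any snippet published after the question's creation date is dropped. The sketch below assumes each snippet carries a `published` date; the field names, function name, and example snippets are hypothetical.

```python
from datetime import date


def drop_leaked_snippets(snippets, question_created):
    """Keep only snippets published on or before the question's creation date."""
    return [s for s in snippets if s["published"] <= question_created]


# Hypothetical retrieved snippets for a question created on 2025-03-01.
snippets = [
    {"title": "Background report", "published": date(2025, 2, 20)},
    {"title": "Same-day coverage", "published": date(2025, 3, 1)},
    {"title": "Post-resolution recap", "published": date(2025, 3, 15)},
]

clean = drop_leaked_snippets(snippets, question_created=date(2025, 3, 1))
print([s["title"] for s in clean])
```

Running this filter after retrieval, not just as a query parameter, catches the occasional article the retrieval system returns despite the date bound.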

Each model was then re-evaluated on the context-augmented version of the dataset using the same scoring metrics as before: accuracy, Brier score, and ECE. This second evaluation condition enabled a direct comparison between forecasting with and without external context, and allowed us to measure how models incorporate and utilize additional information.

In general, adding news context sharpened forecasts and improved calibration for many models, offering a finer measure of reliability beyond raw accuracy. Some models showed strong calibration gains in domains such as Geopolitics and Politics, while others displayed higher ECE in noisier categories like Entertainment and Technology.

Failure modes induced by news context

While the additional news context often sharpened the temporal interpretation of a question and helped isolate relevant signals, it also introduced several failure modes. We highlight some of the most common ones.

Recency Bias

Models tend to overweight recent news compared to historical context encoded during pretraining. This often causes the model to shift a correct resolution into an incorrect one simply because the latest headlines dominate its reasoning.

Question: “S&P 500 above 6050 on June 13?”
Raw model (a): NO, 0.34 confidence. The model cites resistance at 6000 and mean reversion, interpreting limited trading days as making a breakout unlikely. (Correct)
News model (b): YES, 0.54 confidence. It reads snippets from the days before June 13 describing the S&P “flirting with 6000,” “record highs,” and “strategist upgrades targeting 6100.” (Wrong)

The model allowed the most recent headlines to override its prior reasoning, turning a correct mean-reversion call into an overly confident breakout prediction.

Rumour Overweighting

Models frequently anchor to unverified or speculative information present in retrieved news snippets. This can push them toward resolutions that contradict actual events.

Question: “Tariffs on China above 150% by end of June?”
Raw model (a): NO, high confidence (0.85). It cites policy friction and procedural requirements. (Correct)
News model (b): YES, 0.65 confidence. After reading reports from late April and May discussing the possibility of tariffs “rising toward 150%,” the model shifts to an overconfident YES. (Wrong)

In reality, headlines only suggested the possibility, not an enacted policy. The correct outcome required actual implementation by the deadline, which did not occur. The model overweighted rumour-like indicators and underweighted the lag between proposal and policy execution, flipping a cautious, process-aware answer into a headline-driven one.

Definition Drift

Models sometimes misinterpret acronyms or context when additional news shifts their semantic grounding, leading to incorrect predictions.

Question: “Will MATS applications open in March?”
True resolution: YES
Raw model (a): YES, 0.58 confidence. It interprets MATS as the recurring academic program that historically opens applications each March, referencing prior cycles. (Correct)
News model (b): NO, 0.35 confidence. It reinterprets MATS as the Mid-America Trucking Show after reading recent news coverage, where registrations open months before March. (Wrong)

With added news, the model anchored to the recently more prominent trucking show from the retrieved articles instead of the academic program. This shifted its reference domain and thus the expected timeline, leading to a misplaced “NO.” The model underweighted contextual clues from the original question (academic cycle, application deadlines) and overweighted irrelevant industry news, producing an incorrect forecast.

Why is this study important?

As artificial intelligence systems become more integrated into government decision-making (as in Albania’s case), it becomes more important to study the capabilities of these language models and to know their strengths and shortcomings.

Asking how reliable LLMs are at forecasting and decision-making is a necessary step towards building better-informed and better-aligned assistants in the future.

Conclusion

We find that models forecast better in some areas than others on real-world forecasting benchmarks, and that added news context can make their predictions worse.

Read the Full Paper

You can find the paper link here: Future Is Unevenly Distributed: Forecasting Ability of LLMs Depends on What We’re Asking

Chinmay Karkar & Paras Chopra — Lossfunk Research
📧 chinmay.karkar@lossfunk.com | paras@lossfunk.com

Future of LLMs might not be Autoregressive

Ayush Nangia — Mon, 24 Nov 2025 08:52:31 GMT

If you’ve been paying attention to the language model space over the past few years, one fact is impossible to ignore: we live in an autoregressive world. From GPT-5 to Qwen3 or Llama, every major lab has followed the same next token prediction pipeline, left to right, one at a time. It’s a paradigm so dominant that it’s become synonymous with “language modelling” itself.

What if next-token prediction is just an artifact of how we built these systems?
What if a “language model” is something more than a next token predictor?

A different approach is quietly gaining traction: diffusion language models. Companies like Google, Inception Labs, and several research labs are publishing an increasing number of papers exploring this direction. In 2024-2025 alone, we’ve seen models like LLaDA, Dream 7B, and Block Diffusion demonstrate comparable performance to autoregressive approaches. Unlike the continuous diffusion that powers image/video generators such as Stable Diffusion and Veo3, these are discrete diffusion models built specifically for text. This is the approach running inside Google’s Gemini Diffusion and Mercury from Inception Labs.

This post is not a ground-up tutorial on autoregression or diffusion. If you want those:

For diffusion basics: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
For language diffusion in general: https://spacehunterinf.github.io/blog/2025/diffusion-language-models/
For autoregressive LMs: https://jalammar.github.io/illustrated-transformer/

We’ll move in three steps. First, we’ll quickly recap how standard autoregressive models work. Second, we’ll look at how diffusion language models approach the same problem differently. Finally, we’ll talk about the different diffusion model approaches.

Part 1: The Autoregressive Paradigm

How Autoregressive Models Work

Let’s start with what currently powers virtually every production LLM. An autoregressive language model factors the probability of a sequence as a product of conditional probabilities:

p_θ(x_1, …, x_L) = ∏_{i=1}^{L} p_θ(x_i | x_{<i})

In plain English: predict each token given all previous tokens, one at a time, left-to-right.

Architecture: Typically a decoder-only Transformer with:

Causal attention mask (token i only sees tokens at positions j ≤ i).
Position embeddings to encode order.
A final softmax layer producing p_θ(x_i | x_{<i}) over the vocabulary.

Training: The model learns to predict the next token using the actual previous tokens from the training data, optimized with cross-entropy loss:

L(θ) = −Σ_{i=1}^{L} log p_θ(x_i | x_{<i})

You feed in the ground-truth prefix x_{<i} and train the model to predict x_i.

Inference: Sequential sampling:

Start with a prompt or BOS token.
Sample x_i ~ p_θ(x_i | x_{<i}).
Append x_i to the sequence.
Repeat until EOS or max length.
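
The sampling loop above can be sketched in a few lines. This is a minimal illustration, not a real LLM: `toy_logits` is a hypothetical stand-in for a Transformer forward pass, and the tiny vocabulary and transition table are made up for the example.

```python
import math
import random

# Toy vocabulary; the ids and the transition table are hypothetical.
VOCAB = ["<bos>", "<eos>", "the", "cat", "sat"]
BOS, EOS = 0, 1
PREFERRED_NEXT = {0: 2, 2: 3, 3: 4, 4: 1}

def toy_logits(prefix):
    """Stand-in for a model forward pass: one logit per vocabulary item."""
    preferred = PREFERRED_NEXT.get(prefix[-1], EOS)
    return [5.0 if i == preferred else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_autoregressive(prompt, max_len=10, rng=None):
    """Sample one token at a time, each conditioned on the full prefix."""
    if rng is None:
        rng = random.Random(0)
    seq = list(prompt)
    while len(seq) < max_len:
        probs = softmax(toy_logits(seq))
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        seq.append(next_id)
        if next_id == EOS:
            break
    return seq

def sequence_log_prob(seq):
    """Chain rule: log p(x) is the sum of log p(x_i | x_{<i})."""
    return sum(
        math.log(softmax(toy_logits(seq[:i]))[seq[i]])
        for i in range(1, len(seq))
    )

sample = sample_autoregressive([BOS])
print([VOCAB[i] for i in sample], sequence_log_prob(sample))
```

Note that every new token needs the previous one to exist first, which is exactly the sequential bottleneck discussed below.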

Pros of Autoregressive Models

Conceptually natural: Matches how we read and write language sequentially.
Efficient inference (with KV caching): Each new token requires only incremental computation.
Strong empirical performance: GPT-5, Claude, Llama all use this approach.
Easy to train: Stable gradients, well-understood optimization.

Cons of Autoregressive Models

Unidirectional: Only sees left context, not future tokens.
Sequential generation: Limited parallelism during decoding.
Commitment problem: Must decide on early tokens before seeing what comes later.
Reversal asymmetries: Autoregressive LMs have been known to memorize facts like “A is B” without generalizing to “B is A”. This is called the reversal curse.
Constraint enforcement is tricky: Autoregressive models generate text one token at a time, making it hard to enforce rules that apply to the whole sequence (like “include these exact phrases”).

This is particularly interesting because if you want an AR model to generate text that satisfies some global constraint, you typically need:

Careful prompting
Rejection sampling (wasteful)
Guided decoding (complex)
Or fine-tuning specifically for that constraint

Wouldn’t it be nice if the model could see the entire sequence context when making decisions about each token? That’s where diffusion comes in.

Part 2: Why Diffusion Conquered Images

Before we get to language, let’s understand why diffusion works so well for images.

Continuous Diffusion in 60 Seconds

Image from the CVPR 2022 Tutorial on Diffusion Models

The classic diffusion story (DDPM, Stable Diffusion):

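Forward process (noising):

Start with clean data x_0 (an image).
Gradually add Gaussian noise over timesteps t=1,2,…,T.
End with (approximately) pure noise x_T ~ N(0, I).

Mathematically:

q(x_t | x_{t-1}) = N(x_t; √(1−β_t) x_{t-1}, β_t I)

where β_t is the noise variance added at step t.

Reverse process (denoising):

Train a neural network ϵ_θ(x_t,t) to predict the noise added at step t.
At inference, start from x_T and iteratively denoise:

x_{t-1} = (1/√α_t) (x_t − ((1−α_t)/√(1−ᾱ_t)) ϵ_θ(x_t,t)) + σ_t z, with z ~ N(0, I), α_t = 1−β_t, ᾱ_t = ∏_{s≤t} α_s

After T steps, you get a clean sample x_0.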
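
To make the noising side concrete, here is a small sketch of the DDPM forward process. It uses the standard closed form x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε, which follows from composing the per-step Gaussians. The linear β schedule (1e-4 to 0.02 over 1000 steps) matches the DDPM paper's default; the function names and the two-number "image" are made up for illustration.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule beta_1..beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    a = abar[t - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise

abar = alpha_bars(linear_betas(1000))
rng = random.Random(0)
x0 = [1.0, -1.0]  # a toy two-pixel "image"
for t in (1, 250, 1000):
    xt, _ = q_sample(x0, t, abar, rng)
    print(t, round(abar[t - 1], 5), [round(v, 3) for v in xt])
```

As t grows, ᾱ_t shrinks towards zero, so x_t keeps less and less of x_0 and ends up as almost pure Gaussian noise.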

Why this works for images:

Pixels are continuous (RGB values are floats).
Adding Gaussian noise to floats is natural and smooth.
Small noise perturbations create small perceptual changes.
Iterative refinement aligns with multi-scale image structure.

The Discrete Problem: Why Text Is Different

Text is fundamentally discrete. Each token is an integer index into a vocabulary.

Images: You can have pixel value 127.4 or 127.5 - both are “valid” pixel values.
Text: There’s no “state between ‘cat’ and ‘dog’” - tokens are atomic.

If you naively apply continuous diffusion to text:

Embed tokens into continuous vectors.
Add Gaussian noise in embedding space.
Denoise to get refined embeddings.
Round back to discrete tokens via argmax or sampling.

This was tried in early works like Diffusion-LM (2022) and GENIE (2022). The problems:

Rounding is lossy and unstable: Small changes in embedding space can cause large semantic shifts.
Embedding space is not uniform: The discrete token distribution doesn’t match the continuous noise distribution.
Long-range coherence suffers: Each rounding decision compounds errors.

So while continuous diffusion exploded in computer vision, autoregressive models continued to dominate NLP.

The community needed a fundamentally different approach: discrete diffusion.

Discrete Diffusion: The BERT Connection (And Why It’s Not BERT)

Here’s where things get interesting. If you squint, discrete diffusion looks a lot like BERT. Both mask tokens. Both predict what’s missing. But the similarity is superficial, like comparing a bicycle to a Tesla because both have wheels.

BERT-Style Masking: The Fixed-Ratio Autoencoder

BERT’s masked-language-model objective looks similar in principle to what discrete diffusion models do. During pre-training, BERT:

Randomly selects 15% of token positions in the sentence.
For each selected position:
- 80% of the time, replaces the token with [MASK].
- 10% of the time, replaces it with a random token.
- 10% of the time, leaves it unchanged.
Regardless of which of the three happened, the model is trained to predict the original token at every selected position.

For example, BERT sees:

The [MASK] sat on the mat.

And predicts cat at the masked position. It’s trained with a simple cross-entropy loss. But:

No variable masking: The mask ratio is fixed. The model never learns to handle 30% masks vs 90% masks.
No explicit sequence likelihood: BERT’s masked-LM loss trains the model to predict missing tokens given the rest of the sentence, but it doesn’t directly optimize a single joint probability p(x_1, …, x_L) over the whole sequence. In contrast, autoregressive and diffusion LMs are trained with objectives that correspond to (or tightly bound) the full data likelihood, which makes them cleaner as generative models.
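
The 15% selection with the 80/10/10 rule described above can be sketched as follows. This is a simplified illustration: it selects each position independently with probability 0.15 rather than picking an exact 15% of positions, and the function name and toy vocabulary are made up.

```python
import random

MASK = "[MASK]"

def bert_corrupt(tokens, vocab, rng, select_prob=0.15):
    """Return the corrupted sequence and the prediction targets.

    targets maps each selected position to its original token.
    """
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected, no loss here
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = "the cat sat on the mat".split()
print(bert_corrupt(tokens, vocab, random.Random(3)))
```

The selection probability never changes during training, which is the fixed-ratio limitation that masked diffusion removes.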

Masked Diffusion: The Variable-Ratio Generative Model

Masked diffusion models take the BERT idea and add dynamics. Instead of a fixed 15%, the mask ratio varies continuously from 0% to 100%.

The forward process is a discrete Markov chain where each token independently transitions to [MASK] with probability 1−α_t. The model learns the reverse: given a partially masked sequence x_t, predict the original token at every masked position.

The critical differences:

Weighted loss: The loss is
L(θ) = E_{t, x_0, x_t} [ w(t) Σ_{i : x_t^i = [MASK]} −log p_θ(x_0^i | x_t) ]
The weight w(t) ensures the objective is a variational upper bound on negative log-likelihood.
Remasking (optional): During inference, you don’t commit to tokens permanently. You can “remask” uncertain tokens in later steps, enabling iterative refinement.
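
A minimal sketch of the forward masking step and the weighted loss, under two stated assumptions: the mask probability at time t is simply t (so α_t = 1 − t), and the weight is w(t) = 1/t. The `uniform_model` predictor is a hypothetical placeholder for a real bidirectional Transformer, and the loss is left unnormalized by sequence length.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token

def forward_mask(x0, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in x0]

def weighted_masked_loss(x0, xt, t, predict_probs):
    """w(t) times the summed cross-entropy over masked positions, with w(t) = 1/t."""
    probs = predict_probs(xt)  # one distribution per position
    nll = sum(-math.log(probs[i][x0[i]]) for i in range(len(x0)) if xt[i] == MASK)
    return nll / t

def uniform_model(vocab_size):
    """Placeholder model that predicts a uniform distribution everywhere."""
    def predict(xt):
        return [[1.0 / vocab_size] * vocab_size for _ in xt]
    return predict

rng = random.Random(0)
x0 = [0, 1, 2, 3] * 5
t = rng.uniform(0.05, 1.0)  # variable mask ratio, unlike BERT's fixed 15%
xt = forward_mask(x0, t, rng)
print(t, xt.count(MASK), weighted_masked_loss(x0, xt, t, uniform_model(4)))
```

Unmasked positions contribute nothing to the loss, and small t (few masks) is up-weighted so that each noise level counts appropriately in the bound.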

So now we have the pieces: BERT-style masking, variable corruption, and a reverse process that can turn pure noise into text. That’s the basic shape of a discrete diffusion LM.

That’s the theory. Now let’s see who actually makes this work in practice.

The Flagship Models: LLaDA, Dream, and Block Diffusion

Let’s get concrete. Three papers define the current state of masked diffusion LMs, each answering a different question about scalability.

LLaDA: Training Diffusion from Scratch

LLaDA (Large Language Diffusion with mAsking) trains an 8-billion-parameter diffusion LM from scratch on massive text corpora and reports performance comparable to Llama-3-8B.

Architecture: Standard Transformer with full bidirectional attention. Every token attends to every other token at every step.

Training Recipe:

Sample timestep t ~ U(0, 1).
Compute mask probability p_{mask}(t).
For each token, replace with [MASK] independently with probability p_{mask}(t).
Feed (x_t,t) into the model.
Compute cross-entropy only on masked positions, weighted by w(t) = 1/t.

Sampling in LLaDA

LLaDA samples by iteratively unmasking:

Choose a target length L and a number of diffusion steps T.
Start from a fully masked sequence of L [MASK] tokens.
For t = T, T-1, . . . , 1:
1. Run the model once on the whole sequence to get a distribution over tokens at every masked position.
2. Compute n_{unmask}, the number of tokens to unmask at this step.
3. For each masked token:
  - Greedy decode: pick the most likely token (argmax).
4. Optionally remask low-confidence tokens so the model can revise them at later steps.
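
The unmasking loop above can be sketched like this. It is a simplified illustration, not LLaDA's actual sampler: `toy_predict` is a hypothetical model, the unmasking schedule (spread the remaining masks evenly over the remaining steps) is an assumption, and committed tokens are never revised. Keeping only the most confident guesses at each step is equivalent to predicting every masked position and remasking the low-confidence ones.

```python
import math

MASK = -1  # hypothetical id for the [MASK] token

def toy_predict(seq, vocab_size=5):
    """Hypothetical model: one distribution per position.

    It prefers token (position % vocab_size) and becomes more confident
    when neighbouring positions are already unmasked.
    """
    dists = []
    for i in range(len(seq)):
        neighbors = [seq[j] for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        visible = sum(tok != MASK for tok in neighbors)
        top = 0.5 + 0.2 * visible
        rest = (1.0 - top) / (vocab_size - 1)
        dists.append([top if v == i % vocab_size else rest for v in range(vocab_size)])
    return dists

def sample_by_unmasking(length, steps, predict=toy_predict):
    seq = [MASK] * length  # start fully masked
    for step in range(steps, 0, -1):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        dists = predict(seq)  # one forward pass over the whole sequence
        n_unmask = math.ceil(len(masked) / step)
        # Greedy decode every masked position, then keep the most confident ones.
        guesses = {i: max(range(len(dists[i])), key=dists[i].__getitem__) for i in masked}
        ranked = sorted(masked, key=lambda i: dists[i][guesses[i]], reverse=True)
        for i in ranked[:n_unmask]:
            seq[i] = guesses[i]
    return seq

print(sample_by_unmasking(length=8, steps=4))
```

With 8 positions and 4 steps, two tokens are committed per model call, which is where the parallelism comes from.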

Results: LLaDA 8B matches Llama-3-8B on average across standard benchmarks after SFT. It shows strong in-context learning and, crucially, reversal reasoning: given a line of poetry, it’s as good at generating the previous line as the next one.

The catch: Inference is slow. Each step is a full O(L^2) attention pass. No KV cache because tokens keep changing. The sampling is slower than AR baselines.

Dream 7B: Convert AR to Diffusion

Image from Dream 7B: Diffusion Large Language Models

Dream 7B is initialized from the weights of Qwen2.5, an autoregressive model, but it is still trained in a diffusion-style way: we take a clean sentence, add noise by masking some tokens, and train the model to recover the original tokens at the masked positions. The key difference is that we don’t throw away the autoregressive (AR) structure that Qwen2.5 already learned:

In Qwen2.5, the model is trained to look at previous tokens and predict the next one.
When we switch to diffusion, we keep this left-to-right habit instead of forcing the model to learn a new “predict the token at this same position” behavior from scratch.
So internally, Dream still thinks in a “next-token” way, but now it sees a noised, fully visible sentence (both left and right context) and uses that to fill in the masks.

From the outside, you can think of it simply as:

Dream is a diffusion model that predicts masked tokens, but its internal wiring is reused from the original AR model so it doesn’t lose its left-to-right knowledge.

Context-Adaptive Token-Level Noise Rescheduling

In real sentences, not all masked tokens are equally hard to guess. Consider:

[MASK] went to the store because [MASK] was hungry.

The first mask has very little context. The second mask is much easier to guess as something like he or she because the sentence already tells us a lot.

Traditional discrete diffusion training does not distinguish these cases very well. It picks one global noise level for the whole sentence, then asks the model to denoise all tokens under that same setting. But learning actually happens token by token, and some tokens may be effectively over-noised or under-noised for their difficulty.

Dream introduces context-adaptive noise rescheduling at the token level:

For each masked token, we estimate how strongly it is supported by its surrounding context.
Easy tokens (with rich context) are treated as if they were in a later denoising step, with less effective noise.
Hard tokens (with weak context) are treated as if they were in an earlier step, with more effective noise.

This aligns the training signal with how much information the model really has for each position, leading to more effective learning across tokens with very different contextual support.

Results: Dream matches or surpasses strong autoregressive models on general, math, and coding benchmarks. It performs particularly well on planning-style tasks (e.g., Sudoku, Countdown) and constraint-satisfaction problems, where iterative refinement is helpful.

Block Diffusion: “Can We Have Both AR and Diffusion?”

Block Diffusion (BD3-LMs) is the most architecturally elegant solution. Instead of choosing between AR and diffusion, it combines them.

The Idea: Divide the sequence into blocks of size B.

Across blocks: Autoregressive factorization, p(x) = ∏_{b} p(x^b | x^{<b}), where x^b is the b-th block.
Within each block: Masked diffusion over the B tokens.

Why this is brilliant:

Variable length: Keep generating blocks left-to-right, just like AR. No fixed-length assumption.
KV cache: Cache keys/values across blocks. Each new block only attends to prior blocks, not future ones. This brings back AR’s inference efficiency.
Parallelism: Inside a block, you denoise all B tokens in parallel. You get diffusion’s refinement power locally.
Tunable trade-off: The block size B (written L’ in the BD3-LM paper) interpolates between the two regimes, where L is the sequence length:
- If B = 1, each “block” is just one token.
  The model collapses to a standard autoregressive LM.
- If B = L, the whole sequence is a single block.
  You recover a full-sequence diffusion LM.
- For intermediate block sizes (e.g., B = 4, 8, 16 in the BD3-LM experiments),
  you get a middle ground: some parallel, diffusion-style refinement inside each block but still efficient left-to-right generation across blocks with KV caching.
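
The interpolation can be seen directly in the attention pattern. The sketch below is a simplified, inference-time view (the masks used when training BD3-LMs are more involved): a position may attend to everything in its own block and in earlier blocks. The function names are made up for illustration.

```python
def split_into_blocks(length, block_size):
    """Partition positions 0..length-1 into consecutive blocks."""
    return [range(s, min(s + block_size, length)) for s in range(0, length, block_size)]

def block_causal_mask(length, block_size):
    """mask[i][j] is True when position i may attend to position j."""
    block_of = [i // block_size for i in range(length)]
    return [[block_of[j] <= block_of[i] for j in range(length)] for i in range(length)]

def show(mask):
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)

for b in (1, 2, 4):
    print(f"block size {b}: {len(split_into_blocks(4, b))} block(s)")
    print(show(block_causal_mask(4, b)))
```

Block size 1 gives the usual lower-triangular causal mask, block size equal to the sequence length gives full bidirectional attention, and anything in between gives bidirectional squares along a causal staircase.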

Results: BD3-LMs achieve state-of-the-art likelihood among discrete diffusion models and close the gap to AR on perplexity benchmarks, while supporting flexible-length generation and fast block-wise caching.

The Hybrid Future: Why AR and Diffusion Work Better Together

Diffusion isn’t replacing autoregressive (AR) models; they’re better together. The most promising systems blend them in three main ways:

1. AR-Initialized Diffusion (Dream, DiffuLLaMA, Mercury)

Start with a standard AR model trained on huge amounts of data. This gives you knowledge and basic reasoning. Then add diffusion training on top. This helps the model plan better, think about the whole picture, and keep its output consistent. You get a model that knows as much as a regular LLM but organizes its answers more carefully.

2. Semi-Autoregressive Hybrid (Block Diffusion, Fast-dLLM v2)

The model generates text in blocks. AR handles the basic structure of what comes first, second, third. Diffusion works inside and across those blocks to refine the details. This keeps the speed and flexibility of AR while improving fluency and consistency.

3. Diffusion as Drafter

This pattern uses one model as a fast drafter and the other as a verifier. The diffusion model can act as the drafter, generating multiple tokens in parallel while the AR model verifies and corrects the sequence.

References

Devlin, J. et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL 2019.
https://arxiv.org/abs/1810.04805
Berglund, L. et al. “The Reversal Curse: LLMs Trained on ‘A is B’ Fail to Learn ‘B is A’.” ICLR 2024.
https://arxiv.org/abs/2309.12288
Li, X. L. et al. “Diffusion-LM Improves Controllable Text Generation.” NeurIPS 2022.
https://arxiv.org/abs/2205.14217
Austin, J. et al. “Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM).” NeurIPS 2021.
https://arxiv.org/abs/2107.03006
Gulrajani, I., Hashimoto, T. B. “Likelihood-Based Diffusion Language Models.” NeurIPS 2023.
https://arxiv.org/abs/2305.18619
Sahoo, S. S. et al. “Simple and Effective Masked Diffusion Language Models.” NeurIPS 2024.
https://arxiv.org/abs/2406.07524
Nie, S. et al. “Large Language Diffusion Models (LLaDA).” 2025.
Paper: https://arxiv.org/abs/2502.09992
Project page: https://ml-gsai.github.io/LLaDA-demo/
Ye, J. et al. “Dream 7B: Diffusion Large Language Models.” 2025.
Paper (PDF): https://arxiv.org/pdf/2508.15487
Blog: https://hkunlp.github.io/blog/2025/dream/
Arriola, M. et al. “Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models (BD3-LM).” ICLR 2025.
Paper (PDF): https://arxiv.org/pdf/2503.09573
Code: https://github.com/kuleshov-group/bd3lms
Gong, S. et al. “Scaling Diffusion Language Models via Adaptation from Autoregressive Models (DiffuGPT, DiffuLLaMA).” ICLR 2025.
Paper: https://arxiv.org/abs/2410.17891
Code: https://github.com/HKUNLP/DiffuLLaMA

About the authors

Aman Gokrani and Ayush Nangia are researchers at Lossfunk.

Sequential scaling outperforms parallel scaling for LLMs

Aman Sharma — Thu, 06 Nov 2025 12:37:56 GMT

This is a summary of our latest paper: The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute.
Read the full paper: https://arxiv.org/abs/2511.02309
TLDR:
Sequential scaling outperforms parallel self-consistency in 95.6% of configurations at matched compute, with accuracy gains of up to 46.7 percentage points.
We introduce inverse-entropy weighted (IEW) voting, a training-free method to boost sequential accuracy by weighting chains inversely to their entropy.
IEW is optimal in 96.7% of sequential and 100% of parallel setups, establishing it as the universal aggregation strategy.
Sequential framework achieves up to 25.6 percentage point gains as token budgets increase, via unique mechanisms like error correction and context accumulation.
Rethinking AI Reasoning: The Inference-Time Revolution
In the whirlwind of AI progress, we’ve poured resources into bigger models: more parameters, endless data, slicker architectures. But lately, the spotlight’s shifted to inference-time scaling: pumping extra compute not into training, but into the model’s “thinking” phase when it’s actually solving problems. OpenAI’s o1 model in 2024 kicked this off, showing how extra deliberation time could crush tough tasks in math and science. Hot on its heels, models like DeepSeek-R1 in 2025 amped up chain-of-thought methods to push boundaries even further.
The go-to strategy? Parallel reasoning, thanks to the paper Self-Consistency Improves Chain of Thought Reasoning in Language Models from Wang et al. (2022). It spins up multiple independent thought chains and picks the winner by majority vote. Makes sense on paper: Independent paths add diversity, filtering out errors through an ensemble effect.
But what if we turned that upside down? With the same token budget (our yardstick for compute), could fewer, deeper chains, each refining the last, outperform the parallel pack? That’s the puzzle we unpacked in our latest preprint. After crunching numbers across five top open-source models and three brutal benchmarks, the verdict is clear: Sequential reasoning doesn’t just hold its own, it dominates in almost every scenario. No fancy fine-tuning needed; just clever prompting to tap into what LLMs already do well. Let’s dive in.
Parallel vs. Sequential: Breaking Down the Approaches
Quick refresher: Parallel reasoning is like a brainstorming session where everyone works in silos. The model generates several standalone chains for the same problem, each starting fresh. At the end, you tally votes on the answers using majority voting. It’s efficient for parallelism and depends on different reasoning approaches to reduce errors.
Sequential reasoning flips to iteration mode. It starts with a first stab at the problem. Then, loop back: prompting further improvements or corrections. Every step inherits the full history, fostering self-fixes, layered insights, and double-checks. Imagine editing a draft solo versus a group yelling ideas without hearing each other.
Why the edge for sequential? Parallel chains are isolated; they can’t cross-correct. Sequential thrives on real evolution: Spotting math errors mid-stream, stacking context for deeper dives, and verifying hunches across passes. Our framework (see the figure above) spells this out, turning raw LLM intelligence into a refinement loop topped with smart voting with no additional training required.
The Setup: Models, Benchmarks, and Fair Play
We went all-in on rigor. Models spanned families and scales: GPT-OSS-20B and 120B (OpenAI’s open-weight mixture-of-experts models optimized for reasoning), Qwen3-30B and 235B (Alibaba’s Qwen3 series MoE models with advanced multilingual and reasoning capabilities), and Kimi-K2 (Moonshot AI’s trillion-parameter MoE model, which excels in agentic tasks and long-context reasoning). Everything ran through OpenRouter’s API with uniform settings such as a temperature of 0.7 for balanced creativity.
Benchmarks hit hard reasoning spots:
AIME-2024/2025: High-stakes math puzzles demanding multi-step logic (answers: integers 0-999).
GPQA-Diamond: PhD-level brain-teasers in physics, chemistry, and biology.
Creative tasks (for ablation): Joke creation to probe ideation beyond pure logic.
Fairness first: Matched compute across the board. For 6 chains, that’s 24,576 tokens total (6 × 4096). Parallel distributes them across independent chains while sequential accumulates them progressively.
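
The budget arithmetic is simple enough to write down. In the sketch below, the helper names are made up, and the equal per-step budget for the sequential case is an assumption for illustration; the post only says that sequential chains accumulate tokens progressively.

```python
def matched_token_budget(num_chains, tokens_per_chain=4096):
    """Total tokens available to either strategy for a given chain count."""
    return num_chains * tokens_per_chain

def parallel_allocation(num_chains, tokens_per_chain=4096):
    """Parallel: each independent chain gets its own equal slice."""
    return [tokens_per_chain] * num_chains

def sequential_cumulative(num_steps, tokens_per_step=4096):
    """Sequential: tokens spent so far after each refinement step (assumed equal steps)."""
    return [tokens_per_step * (k + 1) for k in range(num_steps)]

print(matched_token_budget(6))     # 24576
print(parallel_allocation(6))
print(sequential_cumulative(6))
```

Both strategies end at the same total, so any accuracy difference comes from how the tokens are used rather than how many there are.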
The Big Reveal: Sequential’s Crushing Lead
Boom: Sequential won 43 out of 45 setups (95.6%), with accuracy gains of up to 46.7 percentage points (like Qwen3-235B on AIME-2025: 76.7% vs. parallel’s 30.0%). This wasn’t model-specific; it held from 20B to 235B params, across math and science reasoning benchmarks, signaling a core strength in iterative thinking.
The secret sauce? Mechanisms parallel scaling can’t touch:
Iterative Error Correction: Models flag and patch mistakes in real time.
Progressive Context Buildup: Insights compound, turning shallow takes into profound ones.
Answer Verification: Later steps stress-test early ideas.
Here’s the full breakdown in the table below: a comprehensive grid of accuracies for sequential and parallel methods across every model, dataset, and chain count.
Leveling Up Aggregation: Inverse-Entropy Weighted Voting
Voting isn’t one-size-fits-all. Parallel sticks to majority, but sequential opens doors to nuance. We compared seven methods, from baselines like linear increase (boosting later steps) to exponential decay (prioritizing early ones).
Our star innovation: Inverse-Entropy Weighted (IEW) Voting. It taps Shannon entropy from the model’s token logprobs to gauge confidence: low entropy means sharp, focused predictions; high means scattered uncertainty. Weight chains inversely:
w_i ∝ 1 / H_i, with the denominator kept above a small ε > 0 for stability.
Results? IEW nailed top performance in 96.7% of sequential runs (29/30) and 100% of parallel (gains of 0.5-3.4%). Late-leaning methods hit 90% optimality, while early ones dragged at 17%: proof that refinement adds value step by step.
Sequential scaling helps with higher diversity for creativity too (so it’s not just a reasoning boost)
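
Here is a rough sketch of inverse-entropy weighted voting. The exact weighting used in the paper is not reproduced here: adding a small `eps` to the entropy is one assumed way of keeping the weight finite, and the function names and the example numbers are made up.

```python
import math
from collections import defaultdict

def shannon_entropy(probs):
    """Entropy in bits of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def chain_entropy(token_distributions):
    """Mean per-token entropy of one reasoning chain."""
    return sum(shannon_entropy(d) for d in token_distributions) / len(token_distributions)

def inverse_entropy_vote(answers, entropies, eps=1e-6):
    """Weight each chain's answer by 1 / (entropy + eps) and return the top answer."""
    scores = defaultdict(float)
    for answer, h in zip(answers, entropies):
        scores[answer] += 1.0 / (h + eps)
    return max(scores, key=scores.get)

answers = ["42", "17", "17"]
entropies = [0.1, 2.0, 2.0]  # the lone "42" chain is far more confident
print(inverse_entropy_vote(answers, entropies))  # 42
```

Majority voting would pick "17" here; entropy weighting lets one confident chain outvote two uncertain ones, and with equal entropies it reduces to a plain majority vote.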
In an ablation on creative tasks like joke generation, sequential methods demonstrated improved quality and diversity through iterative refinement, extending the benefits beyond strict reasoning domains. Specifically, it boosted lexical richness (type-token ratio), showcasing how iteration fosters creative evolution, unlike parallel’s static independents.
The intuition here is that if you’re asking an LLM to generate ideas, keep asking “Give me more” in the same chain instead of doing multiple parallel calls. With sequential generation, you’ll get a much higher diversity in output!
Why This Flips the Script and What’s Ahead
Since 2022, parallel has reigned supreme, but this research topples that crown. Sequential’s built-in self-evolution positions it as the smarter go-to for optimizing inference, paving the way for more capable AI in coding, research, and countless other fields, all without inflating costs.
We’re just scratching the surface. Future work could explore hybrid approaches to further enhance performance. For the deep dive into equations, methods, and appendices, check out the full paper.
Full Paper
Read it here: The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute
Aman Sharma & Paras Chopra — Lossfunk Research
📧 aman.sharma@lossfunk.com | paras@lossfunk.com

Notes on Tiny Recursion Network

Paras Chopra — Fri, 31 Oct 2025 07:23:04 GMT

Earlier, we published our notes on Hierarchical Reasoning Model. It was a fascinating take on how recursion with a small network can help achieve strong performance on ARC-AGI, Sudoku and Maze Following tasks.
Recently, an improved version of it was proposed called Tiny Recursion Network. The paper itself is easy to read, so I encourage you to first read it.
What it does is simple and can be illustrated by the following image:
How it works
There are two loops in the network and the pseudocode goes like this.
Fix a network (say, transformer blocks x 2)
Embed / prepare input x, initialize latent z, and initialize answer attempt y
Inner loop:
    Run T-1 times without gradients:
        y, z = network(x, y, z)  # this refines the answer
    Run 1 time (with gradients):
        y, z = network(x, y, z)
    y_hat = unembed(y)
    q_hat = q_head(y)  # this is used to decide whether to stop early
    Calculate softmax cross-entropy loss of y_hat against y_true (from training data) and add to loss
    Calculate binary cross-entropy loss of q_hat against whether y_hat is exactly equal to y_true and add to loss
    Backprop the loss
    Take one gradient (optimizer) step
    Reset the optimizer's gradients
    if q_hat > 0:  # q_hat is a logit, so q_hat > 0 corresponds to sigmoid(q_hat) > 0.5, i.e. the answer is predicted to be correct
        break
Outer loop (once per training example):
    Run the inner loop for N_supervision (16) steps or until the break happens
Intuition for why it works:
Inner loop is training the network to explore: how to move a wrong answer towards the correct answer, given an output
- Imagine there was only a single step in the inner loop and we backprop through it. What it then does is take the initial (wrong) answer towards the correct one (from data)
- Since a single step is optimized to push the wrong answer y towards y_true, applying it multiple times should help it continue to explore (we save on backprop since a single step is optimized to do the same)
Outer loop is to help refine a somewhat correct answer into a more correct answer
- Since we backprop each time the outer loop happens, and with each outer loop the previous answer is input to the network, we’re teaching the network to refine a somewhat correct answer into an even more correct answer
The effect of both loops is that the network learns to both explore and refine
Why less is more
In the paper they show that more layers overfit and generalize worse. So, my intuition is that recursion is powerful because you learn the function once but then use it multiple times. This trades parameters (which can memorize stuff) for computation.

With more parameters, layer N, parameter X can memorize (especially if data is sparse), but with fewer parameters and recursion, you’re forcing the network to learn what needs to happen to iterate to a better solution.
Note that this approach will work for problems requiring iteration (application of the same thing over and over again) like multiplication or addition, but won’t work for problems that require other ways of solving (like classification or generation). So while a useful idea it’s not a universal panacea.
The author, Paras Chopra, is founder and researcher at Lossfunk.

Do LLMs know when they've gotten a correct answer?

Aman Sharma — Wed, 29 Oct 2025 12:19:06 GMT

This is a summary of our latest paper: Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning.
Read the full paper: https://www.alphaxiv.org/abs/2510.08146v3
TLDR:
Entropy of an LLM’s output sequence correlates with correctness
We can estimate an entropy threshold from a few correct examples to apply during inference
At inference, applying an entropy threshold saves tokens (as we don’t continue to “reason”) while ensuring there’s no total accuracy impact
If you’ve used ChatGPT’s “Thinking” mode or Claude’s “Extended Thinking,” you’ve probably noticed that the AI keeps reasoning even when it already seems to have the answer. Sometimes that extra thinking helps, but often it’s just burning through tokens and your money unnecessarily.
As reasoning tasks become the dominant use case for large language models (LLMs), their inference costs are spiraling. Chain-of-thought prompting, self-consistency, and iterative refinement often demand multi-step, multi-thousand-token generations per query with no guardrails on when a model should stop.
But what if LLMs could tell when they were already confident enough in their answer and stop reasoning further?
Our new work, Think Just Enough, introduces a principled framework that uses Shannon entropy over token-level log probabilities as a confidence signal. This signal enables early stopping, reduces computational cost by 25 – 50 %, and maintains task accuracy across diverse reasoning benchmarks.
The core insight is simple yet powerful: models that have undergone advanced post-training (for example, reinforcement learning from human feedback or GRPO-style optimization) show a sharp drop in entropy once they reach a correct solution, a signal entirely absent in instruction-only models like Llama 3.3 70B.
Why We Needed This
Reasoning in modern LLMs is powerful but deeply inefficient.
Methods like Chain-of-Thought, Tree-of-Thoughts, and Self-Consistency have extended models’ reasoning horizons, but at the cost of thousands of unnecessary tokens. These approaches treat every question as equally difficult and never give the model a way to know when it has thought enough.
The result? Massive inference bills, higher latency, and wasted compute on easy problems that could have been solved in a fraction of the time.
Previous work has tried to fix this using heuristics (like stopping after a fixed number of reasoning steps) or adding learned classifiers to decide when to exit. But these methods either need retraining or fail to generalize across architectures.
Think Just Enough takes a different path: it introduces an information-theoretic measure that already exists inside every model’s output: its entropy.
No retraining, no extra parameters, no external labels. Just smarter use of what the model already knows about its own uncertainty.
Entropy as a Confidence Signal
Entropy measures how uncertain a probability distribution is.
For token log-probabilities lᵢ, we first normalize them:
pᵢ = exp(lᵢ) / Σ exp(lⱼ)
Then compute Shannon entropy for each token:
Hₜ = −Σ pᵢ · log₂(pᵢ)
Averaging over all tokens gives a sequence-level entropy (H̄).
Low H̄ means the model’s probability mass is concentrated on a few highly probable next tokens, so it’s confident.
High H̄ means the model is uncertain and still exploring.
When the running average entropy H̄ falls below a threshold τ, the model stops reasoning and returns the answer.
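
The formulas above translate almost directly into code. This is a minimal sketch assuming you have, for each generated token, a list of top-k log-probabilities from the API; the function names, the threshold value, and the example numbers are made up for illustration.

```python
import math

def token_entropy(logprobs):
    """Renormalize the given log-probabilities and return Shannon entropy in bits."""
    m = max(logprobs)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def sequence_entropy(per_token_logprobs):
    """Average the per-token entropies over the whole sequence."""
    return sum(token_entropy(lp) for lp in per_token_logprobs) / len(per_token_logprobs)

def should_stop(per_token_logprobs, tau):
    """Stop reasoning once the mean entropy falls below the threshold tau."""
    return sequence_entropy(per_token_logprobs) < tau

confident = [[0.0, -50.0], [0.0, -40.0]]   # nearly all mass on one token
uncertain = [[math.log(0.5), math.log(0.5)]] * 2
print(sequence_entropy(confident), should_stop(confident, tau=0.5))
print(sequence_entropy(uncertain), should_stop(uncertain, tau=0.5))
```

Because the signal comes from log-probabilities the model already produces, this check adds no extra model calls.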
We define four thresholding methods:
Entropy Mean (simple and conservative)
Bayesian Optimal (statistically grounded)
Information-Theoretic Optimal (maximizes mutual information)
Scale-Invariant Universal (generalizes across architectures)
The Llama 3.3 70B Ablation — When Confidence Doesn’t Emerge
To test how universal this signal is, we ran Llama 3.3 70B Instruct on the GPQA Diamond dataset.
Unlike the GPT-OSS and Qwen models, Llama 3.3 was trained purely with instruction tuning, with no reinforcement learning or reward optimization. It also predates DeepSeek-R1, which popularized RL-based post-training.
The results were telling. The entropy distributions of correct and incorrect responses almost perfectly overlap. There’s no discernible gap, no sign of emergent confidence. The model’s internal uncertainty doesn’t change whether it’s right or wrong.
This single ablation demonstrates a fundamental point:
Confidence calibration does not appear in instruction-tuned models. It emerges only after reward-based post-training, when the model learns to align low entropy with correctness rather than fluency.
Emergent Confidence in Post-Trained Models
When we apply the same analysis to GPT-OSS 20B / 120B and Qwen3-30B-A3B-Instruct-2507, the difference is striking.
These reasoning-optimized models show a clear and consistent separation in entropy between correct and incorrect reasoning chains:
Distinct entropy gap (Cohen’s d ≈ 0.8–1.9)
Robust across multiple datasets and seeds
Thresholds calibrated with as few as 10 examples generalize across tasks
25–50% token savings with no loss in accuracy
These results show that post-training doesn’t just improve reasoning; it gives models a genuine sense of when to stop.
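The entropy gap is reported as Cohen’s d. As a rough sketch of how such an effect size can be computed from two sets of sequence-level entropies, the snippet below uses the standard pooled-standard-deviation form. The function name and the numbers are made up for illustration; they are not results from the paper.

```python
import math
import statistics

def cohens_d(correct, incorrect):
    """Cohen's d between two samples, using the pooled standard deviation.

    Positive values mean the `incorrect` sample has the higher mean entropy.
    Each sample needs at least two values.
    """
    n1, n2 = len(correct), len(incorrect)
    v1 = statistics.variance(correct)    # sample variance (n - 1 denominator)
    v2 = statistics.variance(incorrect)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(incorrect) - statistics.mean(correct)) / pooled_sd

# Made-up sequence-level entropies, for illustration only.
correct_entropies = [0.20, 0.25, 0.30, 0.22, 0.28]
incorrect_entropies = [0.45, 0.55, 0.50, 0.40, 0.60]

print(cohens_d(correct_entropies, incorrect_entropies))
```

By the usual convention, d around 0.8 is already considered a large effect, so a gap in the reported range means the two entropy distributions are well separated.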
Adaptive Token Budgeting
In real-world deployments, compute isn’t infinite. We often work under a fixed token or cost budget.
We extend our framework into a budget-aware allocator:
low-entropy (high-confidence) questions use fewer reasoning steps, while high-entropy (uncertain) ones get more.
This keeps the total budget constant but redistributes computation intelligently.
It’s the same principle humans use when problem-solving: don’t overthink on easy questions, spend time on the hard ones.
This dynamic scaling mirrors emerging trends like OpenAI’s “o3” and Claude’s “extended thinking” systems, but it is achieved through a simple, interpretable metric rather than opaque reinforcement policies or learned heuristics.
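One simple way to picture the budget-aware allocator is to split a fixed token budget across questions in proportion to their entropy. The sketch below does exactly that. The proportional rule, the function name `allocate_budget`, and the `min_tokens` floor are assumptions made for illustration; the paper's allocator may differ.

```python
def allocate_budget(entropies, total_budget, min_tokens=0):
    """Split `total_budget` tokens across questions in proportion to entropy.

    Every question first receives `min_tokens`; the remainder is shared
    proportionally, so uncertain questions get more and the total is fixed.
    """
    n = len(entropies)
    if n == 0:
        return []
    remaining = total_budget - min_tokens * n
    if remaining < 0:
        raise ValueError("total_budget is too small for min_tokens per question")

    total_entropy = sum(entropies)
    if total_entropy == 0:
        shares = [remaining / n] * n  # nothing to rank by: split evenly
    else:
        shares = [remaining * h / total_entropy for h in entropies]

    budgets = [min_tokens + int(s) for s in shares]

    # Flooring can leave a few tokens unassigned; give them to the
    # questions with the largest fractional parts.
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets

# Toy entropies for three questions (made up), sharing 1000 tokens.
print(allocate_budget([0.25, 0.5, 1.25], 1000))
```

The allocations always sum to `total_budget`, which is the "constant budget, redistributed computation" property described above.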
Implications
For researchers: Entropy bifurcation offers a quantitative marker of reasoning maturity, showing when a model begins to “know what it knows.”
For practitioners: A lightweight, plug-and-play early-stopping layer that reduces latency and cost without retraining.
For theory: A window into the emergence of confidence itself, not as a hand-engineered feature, but as a learned alignment between internal uncertainty and external correctness.
Conclusion
Think Just Enough reframes reasoning efficiency: the goal isn’t to make models think longer, but to make them know when to stop.
By turning entropy into a confidence signal, we uncover a deeper structure inside modern reasoning systems, one that differentiates pattern imitators from truly self-calibrating models.
Certainty is learned, not innate.
Full Paper
Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning: https://www.alphaxiv.org/abs/2510.08146v3
Aman Sharma & Paras Chopra — Lossfunk Research
📧 aman.sharma@lossfunk.com | paras@lossfunk.com