The recency bias example with the S&P 500 really resonates here - it perfectly illustrates why blindly adding context can actually hurt reasoning. That overconfident YES when the model should've stuck with mean-reversion logic is such a powerful failure mode to highlight. Makes me wonder if there's a way to weight historical context so it competes more evenly with fresh headlines?
What was the prompt used for these models, though? That probably played the biggest role in all of these predictions; if the models were asked to imitate an expert and be skeptical, the way a human would be after watching the news for a while, the results could be very different.
In the paper, the prompt structure was: the model had to list reasons for and against, aggregate those reasons, make a decision, output a probability, and then refine it. While this is definitely a good prompt, I'm curious whether more aggressive prompt templates, like actor-critic or debate-style setups, would improve the performance of LLMs with news.
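For concreteness, here's a rough sketch of that staged structure as a template (my paraphrase, assuming the stages run in this order; the exact wording is in the paper's appendix):

```python
# Paraphrased sketch of the staged forecasting prompt described above.
# The exact wording is in the paper's appendix; this only mirrors the stages.
FORECAST_PROMPT = """\
Question: {question}
News context: {context}

1. List the strongest reasons the answer is YES.
2. List the strongest reasons the answer is NO.
3. Weigh and aggregate the reasons above.
4. State your decision: YES or NO.
5. Output an initial probability between 0 and 1.
6. Re-examine your reasoning and output a refined final probability.

End with exactly one line: "Probability: <p>"
"""
```

An actor-critic or debate variant would split stages 1-3 across separate model calls instead of a single pass.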
I wonder if there's a compute constraint in experimenting with different prompts. There was no mention of their prompt-building methodology in the paper.
Actually, there were no compute constraints, since we were using OpenRouter for this.
Our prompts are a modification of Manifold's trading-bot prompt, with some things added/removed to fit our use case.
You can try out different prompts for sure, but running them on 200 questions, each with a high token budget, is expensive, and it also leads to the model forgetting the output formatting. We did run ablations nudging the model towards more aggressive/passive predictions via the prompt, but the results were not decisive enough to pin down what exactly was going wrong.
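If you do want to experiment, here's a minimal sketch of running a prompt variant across models via OpenRouter, which exposes an OpenAI-compatible API (the model IDs, prompt text, and token budget below are illustrative, not our exact setup):

```python
# Minimal sketch: run one prompt variant across several models via OpenRouter.
# OpenRouter is OpenAI-compatible, so the standard openai client works against it.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

# Illustrative prompt variant, not the paper's exact template.
PROMPT = (
    "List reasons for YES, then reasons for NO, aggregate them, decide, "
    "give a probability, then refine it.\n"
    "Question: {question}\nNews: {context}"
)

def forecast(model: str, question: str, context: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(question=question, context=context)}],
        max_tokens=2048,  # high per-question budgets are what make 200 questions expensive
    )
    return resp.choices[0].message.content

for model in ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"]:
    print(model, forecast(model, "Will the S&P 500 close higher this month?", "..."))
```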
Would try this out for sure as my schedule clears up!
Hey, sorry for the late update: all our prompts are listed in the appendix of the paper.
Thanks