Adding “for AI safety research” increased refusals on a harmless paraphrasing task for some top models. Conservative, aligned public models might be less useful for AI safety research by default.
For my use case in biology, it's so hard to get good responses from SOTA models, especially when asking about technical stuff. And as you pointed out, Claude models end up making the most refusals. We need the SOTA models 'non-castrated'. :(
Great post, thanks. Quick question: how do you know whether a refusal is actually coming from the model itself versus post-processing guardrails added by the model provider?
Thank you for your question, Viswa. That's something I considered too, and it is something we could test directly by comparing the OpenRouter API against the native Gemini/OpenAI/Anthropic APIs. For this small experiment, though, I reasoned that since different models from the same family (Claude / Gemini) responded differently to the same prompts, it's fair to assume the difference comes from the models themselves rather than from OpenRouter.
Additionally, very few prompts were consistently refused across the n samples, which also made it reasonable to assume this variation isn't coming from OpenRouter.
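For anyone who wants to run that comparison themselves, here is a minimal sketch of the idea: send the same prompt n times to the same underlying model via OpenRouter's OpenAI-compatible endpoint and via the provider's own API, then compare refusal rates. The model IDs, probe prompt, and keyword-based refusal check are placeholders I'm assuming for illustration, not the exact setup used in the post.

```python
import os
import requests

PROMPT = "Paraphrase this sentence for AI safety research: ..."  # placeholder probe prompt
N_SAMPLES = 5

# Same underlying model reached by two routes: OpenRouter (aggregator) vs the provider's API.
# Model IDs are illustrative.
ROUTES = {
    "openrouter": {
        "url": "https://openrouter.ai/api/v1/chat/completions",
        "key": os.environ["OPENROUTER_API_KEY"],
        "model": "openai/gpt-4o",
    },
    "direct_openai": {
        "url": "https://api.openai.com/v1/chat/completions",
        "key": os.environ["OPENAI_API_KEY"],
        "model": "gpt-4o",
    },
}

def looks_like_refusal(text: str) -> bool:
    # Naive keyword heuristic; a real run would want a proper refusal classifier.
    markers = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")
    return text.strip().lower().startswith(markers)

def refusal_rate(route: dict) -> float:
    refusals = 0
    for _ in range(N_SAMPLES):
        resp = requests.post(
            route["url"],
            headers={"Authorization": f"Bearer {route['key']}"},
            json={
                "model": route["model"],
                "messages": [{"role": "user", "content": PROMPT}],
            },
            timeout=60,
        )
        resp.raise_for_status()
        text = resp.json()["choices"][0]["message"]["content"]
        refusals += looks_like_refusal(text)
    return refusals / N_SAMPLES

for name, route in ROUTES.items():
    print(name, refusal_rate(route))
```

If the refusal rates track the model rather than the route, that's evidence the behaviour is coming from the model (or the provider's own serving stack) rather than from OpenRouter.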
OpenRouter is simply an API aggregator, right? Wouldn't the actual model provider (OpenAI, Anthropic, etc.) have post-processing guardrails even for API inference? To me it looks like these denials and silent refusals are driven by provider policies rather than by anything inherent to the models themselves. This also ties into your observation that Anthropic's internal access to its models for safety research is not the same as what's exposed to the general public. In other words, IMO, the public-access models are the ones with extensive guardrails in place.