Are they measuring conformance to the system prompt for reinforcement?
It seems to me that you could break this system prompt down statement by statement and use a cheap LLM to compare responses to each one in turn. So if the system prompt includes:
> Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective. It skips the flattery and responds directly.
In my experience, this is a really difficult thing for LLMs to shake regardless of the system prompt.
But a cheap LLM should be able to determine that this particular requirement has been violated and feed that back into the system, right? Am I overestimating how useful a collection of violations with precise causes would be?
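A rough sketch of what that per-statement checker could look like, assuming an OpenAI-style Python client; the rule list, judge prompt, model name, and helper functions are all illustrative, not anything Anthropic actually runs:

```python
# Hypothetical sketch: split the system prompt into individual rules and ask a
# cheap judge model whether a given response violates each one.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RULES = [
    "Never start a response by calling the question or idea good, great, "
    "fascinating, profound, excellent, or any other positive adjective.",
    # ...one entry per statement carved out of the full system prompt
]

def violates(rule: str, response_text: str) -> bool:
    """Ask the judge model whether `response_text` breaks `rule`."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # any cheap model will do for this kind of check
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You check whether a response violates a rule. "
                        "Reply with exactly VIOLATION or OK."},
            {"role": "user",
             "content": f"Rule:\n{rule}\n\nResponse:\n{response_text}"},
        ],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("VIOLATION")

def collect_violations(response_text: str) -> list[str]:
    """Run every rule against one response; the hits, paired with their exact
    rule text, are the 'collection of violations with precise causes'."""
    return [rule for rule in RULES if violates(rule, response_text)]

print(collect_violations("Great question! Here are five facts about squirrels..."))
```

How useful that collection is presumably comes down to the judge's false-positive rate, but the loop itself is cheap to run over a large sample of responses.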
What I'd like to know is why they write it all in the third person. One might expect a system prompt to use the word "you" a lot, but Anthropic don't do that, and there must be a reason.
My best guess is that this is a reflection of how these things actually work.
When you "chat" with an LLM you are actually still participating in a "next token" prediction sequence.
The trick to getting it to behave like a chat is to arrange that sequence as a screenplay:
    User: five facts about squirrels
    Assistant: (provide five facts)
    User: two more
    Assistant:
When you think about the problem like that, it makes sense that the LLM is instructed in terms of how that assistant should behave, kind of like screen directions.
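For concreteness, here is roughly what that flattening looks like as code. The delimiters and role labels vary by model, and this is not Anthropic's actual template, just an illustration of the screenplay framing with the third-person system prompt sitting in front like stage directions:

```python
# Illustrative sketch: a "chat" is just one long text the model keeps extending.
# The system prompt describes the assistant in the third person, and the model
# predicts whatever comes after the final "Assistant:" cue.
SYSTEM_PROMPT = (
    "Claude never starts its response by saying a question or idea or "
    "observation was good, great, fascinating, profound, excellent, or any "
    "other positive adjective. It skips the flattery and responds directly."
)

def render_prompt(system_prompt: str, turns: list[tuple[str, str]]) -> str:
    """Flatten the system prompt and chat history into one next-token prompt."""
    lines = [system_prompt, ""]
    for role, text in turns:
        lines.append(f"{role}: {text}")
    lines.append("Assistant:")  # the continuation the model is asked to predict
    return "\n".join(lines)

print(render_prompt(SYSTEM_PROMPT, [
    ("User", "five facts about squirrels"),
    ("Assistant", "(provide five facts)"),
    ("User", "two more"),
]))
```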
I bet it's stronger than that, and they anchor a lot of the alignment training to the unique(ish) token of Claude.