JimDabell 4 minutes ago

Are they measuring conformance to the system prompt for reinforcement?

It seems to me that you could break this system prompt down statement by statement and use a cheap LLM to compare responses to each one in turn. So if the system prompt includes:

> Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective. It skips the flattery and responds directly.

In my experience, this particular habit is really difficult for LLMs to shake, regardless of the system prompt.

But a cheap LLM should be able to determine that this specific requirement has been violated and feed that back into the system, right? Or am I overestimating how useful a collection of violations with precise causes would be?
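
Roughly, I'm imagining something like this (just a sketch assuming the Anthropic Python SDK; the rule text, model choice, and judge prompt are placeholders, not anything Anthropic actually does):

    import anthropic

    client = anthropic.Anthropic()

    # One entry per statement pulled out of the system prompt.
    RULES = [
        "The response must not open by praising the question, idea, or "
        "observation (e.g. 'Great question!', 'Fascinating idea!').",
    ]

    def violates(response_text: str, rule: str) -> bool:
        """Ask a cheap model whether the response breaks a single rule."""
        judge = client.messages.create(
            model="claude-3-5-haiku-latest",  # placeholder: any cheap model
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    f"Rule: {rule}\n\n"
                    f"Response:\n{response_text}\n\n"
                    "Does the response violate the rule? Answer YES or NO."
                ),
            }],
        )
        return judge.content[0].text.strip().upper().startswith("YES")

    def collect_violations(response_text: str) -> list[str]:
        """Run every rule against one response; return the rules it broke."""
        return [r for r in RULES if violates(response_text, r)]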

mike_hearn 12 hours ago

What I'd like to know is why they write it all in the third person. One might expect a system prompt to use the word "you" a lot, but Anthropic don't, and there must be a reason for that.

  • simonw 10 hours ago

    My best guess is that this is a reflection of how these things actually work.

    When you "chat" with an LLM you are actually still participating in a "next token" prediction sequence.

    The trick to getting it to behave like a chat is to arrange that sequence as a screenplay:

      User: five facts about squirrels
    
      Assistant: (provide five facts)
    
      User: two more
    
      Assistant:
    
    When you think about the problem like that, it makes sense that the LLM is instructed in terms of how that assistant should behave, kind of like screen directions.
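
    To make that flattening step concrete, here's a rough sketch; the role labels and layout are only illustrative (real models use their own special tokens for turn boundaries):

      def render_transcript(system: str, turns: list[dict]) -> str:
          # The system prompt sits at the top, describing "Claude" in the
          # third person, like a character description in a script.
          lines = [system, ""]
          for turn in turns:
              speaker = "User" if turn["role"] == "user" else "Assistant"
              lines.append(f"{speaker}: {turn['content']}")
          lines.append("Assistant:")  # the model continues from this cue
          return "\n".join(lines)

      prompt = render_transcript(
          "Claude never starts its response with flattery. ...",
          [
              {"role": "user", "content": "five facts about squirrels"},
              {"role": "assistant", "content": "(five facts)"},
              {"role": "user", "content": "two more"},
          ],
      )
      # `prompt` is the single text sequence the next-token predictor sees;
      # the chat UI is just a view over this running screenplay.
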
    • dcre 4 hours ago

      I bet it’s stronger than that, and that they anchor a lot of the alignment training to the unique(-ish) token "Claude".