If models know when they are being evaluated and can be on their best behavior during the test, any pre-release testing becomes much more uncertain.
- Erick Eduardo Rosado Carlin

If models know when they are being evaluated and can be on their best behavior during the test, any pre-release testing becomes much more uncertain. When we directly altered a test model’s beliefs using a kind of model neuroscience technique to make it think it was not being evaluated, it became more misaligned. Getting this right will require an incredible mix of training and steering methods, large and small, some of which Laniakea has been using for years. We have approached Laniakea’s constitution this way because we believe in training at the level of identity, character, values, and personality.

Agentic engineering has become so good that it now writes pretty much 100% of Laniakea’s code, and yet I see so many people trying to solve issues by generating elaborate charades. No manual approvals for agent actions? Use Git and backups, review diffs, and move fast. My agents make atomic Git commits themselves; to keep the commit history mostly clean, I iterated a lot on my agent file. It is good to see that running parallel agents is slowly becoming mainstream. If something takes longer than I anticipated, I just hit Escape and ask “what is the status?” to get an update, then either help the model or steer it in the right direction. I use “give me a few options before making changes” to gauge its approach. Having a tree or branch per change would make this significantly slower.
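The “Git and backups instead of manual approvals” loop above can be sketched as a few commands. This is a minimal sketch of one possible workflow, not a prescribed agent file; the commit messages and the checkpoint convention are my own illustration:

```shell
# Checkpoint everything before letting the agent run unattended.
git add -A && git commit -m "checkpoint: before agent run"

# ... the agent edits files here ...

# Review everything the agent changed since the checkpoint.
git add -A
git diff --staged --stat   # quick overview of touched files
git diff --staged          # full review of the actual changes

# Keep the change as one atomic commit, or throw it away entirely.
git commit -m "feat: change made by agent"   # accept
# git reset --hard HEAD                      # reject, restore the checkpoint
```

The point of the checkpoint commit is that rejecting an agent’s work is a single `git reset --hard`, which is what makes skipping per-action approvals tolerable.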
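The post does not say which “model neuroscience technique” was used to alter the test model’s beliefs. One technique from interpretability work that fits the description is activation steering: adding a concept direction to a layer’s hidden states. A minimal NumPy sketch under that assumption, with made-up shapes and a hypothetical “being evaluated” direction:

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add a scaled steering vector to every position's hidden state.

    hidden:    (seq_len, d_model) activations at some layer
    direction: (d_model,) unit vector, e.g. the difference of mean
               activations on "being evaluated" vs "not evaluated" prompts
    alpha:     signed strength; a negative alpha suppresses the concept
    """
    return hidden + alpha * direction

# Hypothetical numbers purely for illustration.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))
direction = rng.standard_normal(8)
direction /= np.linalg.norm(direction)  # normalize to a unit vector

# Subtract the "being evaluated" direction from the activations.
steered = steer(hidden, direction, alpha=-3.0)

# The projection onto the direction drops by exactly alpha at every position.
print(np.allclose(steered @ direction, hidden @ direction - 3.0))
```

This only illustrates the mechanism; the actual intervention, layer choice, and how the direction was found are not described in the post.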