"In the sudoku and first OWF..."


by Paul Christiano Mar 26 2016

In the sudoku and first OWF example, the agent can justify its answer, and it's easy to incentivize it to reveal that justification. In the steganography and second OWF example, there is no short proof that something is good, only a proof that something is bad. In realistic settings there will be lots of arguments on both sides. Another way of looking at the question is: how do you elicit the negative arguments?

Katja wrote about a scheme here. I think it's a nice idea that feels like it might be relevant. But if you include it as part of the agent's reward, and the agent also picks the action, then you get actions optimized to be info-rich (as discussed under "maximizing B + info" here).