"It makes sense that one wan..."

https://arbital.com/p/40w

by Ryan Carey Jun 4 2016 updated Jun 4 2016


It makes sense that one wants to stop the AI from optimising on a false objective (maximising button-presses). It would be ideal if the agent could be taught to ignore any influence its own actions have on the button.

In practice, a hacky solution would be to use multiple buttons and multiple overseers rather than just one - I guess this will be a common suggestion. Having multiple overseers might lessen the problem, in that an agent would be more likely to learn that the buttons all point to the same underlying thing. One could also argue, though, that such an agent may nonetheless maximise its reward by coercing one or all of the overseers into pressing their approval buttons.
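To make the multiple-overseer idea a bit more concrete, here is a minimal sketch (my own illustration, not part of any existing proposal) of a reward signal that requires agreement among several hypothetical overseers' approval buttons; the function name and aggregation rules are made up for the example.

```python
# Toy sketch: reward aggregated over several hypothetical overseer buttons.
# Names and rules are illustrative assumptions, not an established design.

from typing import List

def aggregate_reward(button_presses: List[bool], rule: str = "majority") -> float:
    """Return 1.0 if the overseers' approval presses satisfy the rule, else 0.0.

    button_presses[i] is True iff overseer i pressed their approval button.
    """
    approvals = sum(button_presses)
    if rule == "unanimous":
        approved = approvals == len(button_presses)
    elif rule == "majority":
        approved = approvals > len(button_presses) / 2
    else:
        raise ValueError(f"unknown rule: {rule}")
    return 1.0 if approved else 0.0

# The failure mode discussed above is unchanged: an agent that can coerce
# enough overseers into pressing their buttons still maximises this signal.
print(aggregate_reward([True, True, False]))               # 1.0 under majority
print(aggregate_reward([True, True, False], "unanimous"))  # 0.0
```

As the final comment in the sketch notes, aggregation only raises the bar for manipulation; it does not remove the incentive.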