"wut"

● Nearest unblocked strategy generally, and especially over instrumentally convergent corrigibility incorrigibility, suggests that if there are naturally\-arising AI behaviors we see as bad \(e\.g\. routing around shutdown\), there may emerge a pseudo\-adversarial selection of best strategies that happen to route around our attempted patches to those problems\. E\.g\., the AI constructs an environmental subagent to continue carrying on its goals, while cheerfully obeying 'the letter of the law' with respect to allowing its current hardware to be shut down\. This pseudo\-adversarial selection \(though obviously the AI does not actually have a goal of thwarting us or selecting low\-goodness strategies per se\) again implies that operational goodness is likely to be systematically lower than the AI's partially\-learned estimate of goodness; again to an increasing degree as the AI becomes smarter and searches a wider policy space\.

wut

Comments