"> The key property we want ..."

The key property we want from the distinguisher is that it can learn to detect relevant differences between the model and the real system. This seems like it might be the kind of problem that I would classify as "probably easy if the agent is powerful and the difference is really important" and you would classify as "way too hard to count on."

Counting on things before you've found a solution to them isn't very mindset, but I do consider this a promising approach. Definitely, the generative-adversarial approach in modern neural networks causes me to hope that this is the sort of thing that actually works in practice. So I might not be as pessimistic as you think? I still think in general that one does not go about taking things for granted, but the notion of faithful simulation seems like one that could prove to have a tractable core after hammering on it for a bit, and it also seems very possible that if you're reasonably smart and you can't detect any expected differences in the behavior of neural columns then the corresponding human simulation is faithful.

My current thoughts on possible failure modes:

"No differences you know about" might mix up the map and the territority in some obscurely fatal way that leads to the equivalent of the AI deliberately managing to 'not know' about inconvenient divergences.
If we use a limited AI and don't let it run thousands of simulations of people that it can compare to thousands of brains in vats, then in practice its column-level tests won't detect cumulative neural-level differences that lead to an 80% probability of schizophrenia.
The adversarial approach as written won't work because it will turn out that it's always possible for an equally smart adversary to tell the difference, especially for simulations that can be computed at a worthwhile speedup. Which means this test won't meaningfully discriminate in the region of intuitively faithful vs. nonfaithful simulations. (This strikes me as the sort of issue that's repairable, but perhaps not trivially so.)