Reward engineering

{
  localUrl: '../page/1qn.html',
  arbitalUrl: 'https://arbital.com/p/1qn',
  rawJsonUrl: '../raw/1qn.json',
  likeableId: '675',
  likeableType: 'page',
  myLikeValue: '0',
  likeCount: '0',
  dislikeCount: '0',
  likeScore: '0',
  individualLikes: [],
  pageId: '1qn',
  edit: '1',
  editSummary: '',
  prevEdit: '0',
  currentEdit: '1',
  wasPublished: 'true',
  type: 'wiki',
  title: 'Reward engineering',
  clickbait: '',
  textLength: '3841',
  alias: '1qn',
  externalUrl: '',
  sortChildrenBy: 'likes',
  hasVote: 'false',
  voteType: '',
  votesAnonymous: 'false',
  editCreatorId: 'OliviaSchaefer',
  editCreatedAt: '2016-01-24 13:13:59',
  pageCreatorId: 'OliviaSchaefer',
  pageCreatedAt: '2016-01-24 13:13:59',
  seeDomainId: '0',
  editDomainId: 'AlyssaVance',
  submitToDomainId: '0',
  isAutosave: 'false',
  isSnapshot: 'false',
  isLiveEdit: 'true',
  isMinorEdit: 'false',
  indirectTeacher: 'false',
  todoCount: '0',
  isEditorComment: 'false',
  isApprovedComment: 'true',
  isResolved: 'false',
  snapshotText: '',
  anchorContext: '',
  anchorText: '',
  anchorOffset: '0',
  mergedInto: '',
  isDeleted: 'false',
  viewCount: '8',
  text: 'This post gestures at a handful of research questions with a loose thematic connection.\n\n## The idea\n\nConsider the following frameworks:\n\n- [Temporal difference learning](https://en.wikipedia.org/wiki/Temporal_difference_learning): learn to predict the future by predicting tomorrow’s prediction.\n- [Generative adversarial models](http://arxiv.org/abs/1406.2661): learn to sample from a distribution by fooling a distinguisher.\n- [Predictability minimization](http://cognet.mit.edu/system/cogfiles/journalpdfs/neco.1992.4.6.863.pdf): learn to represent data efficiently by making each part of the representation unpredictable given the others.\n\nThese algorithms replace a hard-to-optimize objective with a nicer proxy. These proxies are themselves defined by machine learning systems rather than being specified explicitly. I think this is a really nice paradigm, and my guess is that it will become more important if large-scale supervised and reinforcement learning continues to be a dominant methodology.\n\nFollowing Daniel Dewey, I’ll call this flavor of research “[reward engineering](http://www.danieldewey.net/reward-engineering-principle.pdf).” In terms of tools and techniques, I don’t know if this is really a distinct category of research, but I do think that it might be a useful heuristic for where to look for problems relevant to AI control.\n\n## Relevance to AI control\n\nThough reward engineering seems very broadly useful in AI, I expect it to be especially important for AI control:\n\n- A key goal of AI control is using AI systems to optimize objectives which are defined implicitly or based on expensive human feedback. We will probably need to use complex proxies for this feedback if we want to apply reinforcement learning.\n- Reward engineering seems relatively robust to changes in AI techniques. 
Uncertainty about future techniques is often a major obstacle to doing meaningful work on AI control in advance (even if only a little bit in advance).\n\n## Applications\n\nI see a few especially interesting opportunities for reward engineering for AI control:\n\n- [Making efficient use of human feedback](https://medium.com/ai-control/efficient-feedback-a347748b1557#.wp2zmi2oj). Here we have direct access to the objective we really care about, but it is just too expensive to evaluate frequently. (_Simple proposal_: train a learner to predict human judgments, then use those predicted judgments in place of real feedback.)\n- [Combining the benefits of imitation and approval-direction](https://medium.com/ai-control/mimicry-maximization-and-meeting-halfway-c149dd23fc17#.cz2phxdp7). I suspect it is possible to avoid perverse instantiation concerns while also providing a flexible training signal. (_Simple proposal_: use the generative adversarial models framework, and have the operator accomplish the desired task in a way optimized to fool the distinguisher.)\n- [Increasing robustness](https://medium.com/ai-control/synthesizing-training-data-f92a637dc1b4#.ggps9emnc). If our ML systems are sufficiently sophisticated to foresee possible problems, then we might be able to leverage those predictions to avoid the problems altogether. (_Simple proposal_: train a generative model to produce data from the test distribution, with an extra reward for samples that “trip up” the current model.)\n\nIn each case I’ve made a preliminary simple proposal, but I think it is quite possible that a clever trick could make the problem look radically more tractable. 
A search for clever tricks is likely to come up empty, but hits could be very valuable (and would be good candidates for things to experiment with).\n\nBeyond these semi-specific applications, I have a more general intuition that thinking about this aspect of the AI control problem may turn up interesting further directions.',
  metaText: '',
  isTextLoaded: 'true',
  isSubscribedToDiscussion: 'false',
  isSubscribedToUser: 'false',
  isSubscribedAsMaintainer: 'false',
  discussionSubscriberCount: '1',
  maintainerCount: '1',
  userSubscriberCount: '0',
  lastVisit: '',
  hasDraft: 'false',
  votes: [],
  voteSummary: 'null',
  muVoteSummary: '0',
  voteScaling: '0',
  currentUserVote: '-2',
  voteCount: '0',
  lockedVoteType: '',
  maxEditEver: '0',
  redLinkCount: '0',
  lockedBy: '',
  lockedUntil: '',
  nextPageId: '',
  prevPageId: '',
  usedAsMastery: 'false',
  proposalEditNum: '0',
  permissions: {
    edit: {
      has: 'false',
      reason: 'You don't have domain permission to edit this page'
    },
    proposeEdit: {
      has: 'true',
      reason: ''
    },
    delete: {
      has: 'false',
      reason: 'You don't have domain permission to delete this page'
    },
    comment: {
      has: 'false',
      reason: 'You can't comment in this domain because you are not a member'
    },
    proposeComment: {
      has: 'true',
      reason: ''
    }
  },
  summaries: {},
  creatorIds: [
    'OliviaSchaefer'
  ],
  childIds: [],
  parentIds: [],
  commentIds: [],
  questionIds: [],
  tagIds: [],
  relatedIds: [],
  markIds: [],
  explanations: [],
  learnMore: [],
  requirements: [],
  subjects: [],
  lenses: [],
  lensParentId: '',
  pathPages: [],
  learnMoreTaughtMap: {},
  learnMoreCoveredMap: {},
  learnMoreRequiredMap: {},
  editHistory: {},
  domainSubmissions: {},
  answers: [],
  answerCount: '0',
  commentCount: '0',
  newCommentCount: '0',
  linkedMarkCount: '0',
  changeLogs: [
    {
      likeableId: '0',
      likeableType: 'changeLog',
      myLikeValue: '0',
      likeCount: '0',
      dislikeCount: '0',
      likeScore: '0',
      individualLikes: [],
      id: '5566',
      pageId: '1qn',
      userId: 'OliviaSchaefer',
      edit: '1',
      type: 'newEdit',
      createdAt: '2016-01-24 13:13:59',
      auxPageId: '',
      oldSettingsValue: '',
      newSettingsValue: ''
    }
  ],
  feedSubmissions: [],
  searchStrings: {},
  hasChildren: 'false',
  hasParents: 'false',
  redAliases: {},
  improvementTagIds: [],
  nonMetaTagIds: [],
  todos: [],
  slowDownMap: 'null',
  speedUpMap: 'null',
  arcPageIds: 'null',
  contentRequests: {}
}