1. 3

  2. 2

    That explains a lot. Perhaps the solution is still pre-train the prediction model on fixed datasets, and then do RL in an interactive setting, but including some signal for distinguishing observations from interventions? A separate prediction model could still be used for assigning rewards, just as in RLHF.