1. 3
    1. 2

      That explains a lot. Perhaps the solution is still pre-train the prediction model on fixed datasets, and then do RL in an interactive setting, but including some signal for distinguishing observations from interventions? A separate prediction model could still be used for assigning rewards, just as in RLHF.