Just had a quick look at your paper; great work, and thanks for sharing. Quick question: for the GP controller, is it right that you sample from the policy's distribution until a feasible action is found? What if the probability of a feasible sample is very low in a certain situation?
Thanks, we are glad you enjoyed it. During deployment we do not need to re-sample until an action is valid; we only need to compute the mean of the policy's distribution to generate phase plans. That is the point of formulating the problem as an MDP: instead of using the policy as in sampling-based planning methods, we train the parameterized policy distribution with RL so that it learns to always output valid phase transitions.
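In case it helps, here is a minimal sketch of the distinction between training-time sampling and deployment-time mean extraction for a Gaussian policy head. The class name, network shape, and `deterministic` flag are hypothetical illustrations, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class GaussianPhasePolicy(nn.Module):
    """Hypothetical Gaussian policy over phase-transition parameters."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu_head = nn.Linear(hidden, act_dim)          # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mu = self.mu_head(self.backbone(obs))
        return torch.distributions.Normal(mu, self.log_std.exp())

    def act(self, obs: torch.Tensor, deterministic: bool = False) -> torch.Tensor:
        dist = self(obs)
        # Training: sample from the distribution for exploration.
        # Deployment: take the mean, which the trained policy has learned
        # to place on valid phase transitions, so no rejection loop is needed.
        return dist.mean if deterministic else dist.sample()
```

At deployment you would call `policy.act(obs, deterministic=True)`, so there is no rejection-sampling loop and no failure mode from low feasible-sample probability.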
This visualization was made in raisimOgre, so unfortunately there is no easy-to-use configuration to share. Stay tuned for when we release the code, though.