The agent tries to predict the remaining racing time if it executes action A, B or C. The impact of wiggling on overall racing time is probably so small that the agent cannot tell the actions apart, as long as they all accelerate, so the action with the best prediction can change from one frame to the next. Nothing in the reward encourages the agent to press as few buttons as possible, or to keep the same action several frames in a row.
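
To make this concrete, here is a minimal sketch of the effect, not the actual agent. The action names, the predicted times and the noise level are all made-up assumptions for illustration: three accelerating actions get almost identical remaining-time predictions, a small estimation error is added each frame, and the agent greedily picks the lowest prediction.

```python
import random

# Hypothetical action set: all three accelerate, they differ only in steering.
ACTIONS = ["accelerate_left", "accelerate", "accelerate_right"]

# Assumed "true" remaining race times in seconds; nearly identical on purpose.
TRUE_REMAINING_TIME = {
    "accelerate_left": 30.002,
    "accelerate": 30.000,
    "accelerate_right": 30.001,
}


def predicted_remaining_time(action, rng, noise):
    """Prediction = true value plus a small estimation error (assumed Gaussian)."""
    return TRUE_REMAINING_TIME[action] + rng.gauss(0.0, noise)


def greedy_action(rng, noise):
    """Pick the action with the lowest predicted remaining time."""
    estimates = {a: predicted_remaining_time(a, rng, noise) for a in ACTIONS}
    return min(estimates, key=estimates.get)


def count_switches(num_frames=1000, noise=0.01, seed=0):
    """Count how often the chosen action differs from the previous frame."""
    rng = random.Random(seed)
    chosen = [greedy_action(rng, noise) for _ in range(num_frames)]
    return sum(1 for prev, cur in zip(chosen, chosen[1:]) if prev != cur)
```

Calling `count_switches(noise=0.01)` and `count_switches(noise=0.0)` side by side shows the difference: once the estimation error is larger than the gap between the actions, the greedy choice flips frequently, and since nothing in this setup rewards holding an action, there is no pressure against that.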