Very interesting. In my experience, prompts vary from model to model: GPT-4, PaLM 2, and Claude 2.1 require (at least partially) different prompts. It would be interesting to see such comparisons on other LLMs, because researchers don't even need to run them locally; a cheap API would do the job. Thanks for covering this great paper!