The real contribution seems to be the prompt they used to generate the CoT and the metric value... Could you share the code used for the metric and the prompt for ChatGPT?
Do you think human intervention in the evaluation process is going to last? It seems like a process that LLMs could handle by themselves in the near future.