Hi all, I've been exploring the simple-evals repo and am especially interested in the newly released HealthBench benchmark. I saw OpenAI's announcement that the dataset (5,000 multi-turn conversations and the accompanying physician rubrics) would b...