Measuring LLMs with Jodie Burchell
How do you measure the quality of a large language model? Carl and Richard talk to Dr. Jodie Burchell about her work measuring large language models for accuracy, reliability, and consistency. Jodie talks about the variety of benchmarks that exist for LLMs and the problems they have. A broader conversation about quality digs into the idea that LLMs should be targeted to the particular topic area they are being used for - often, smaller is better! Building a good test suite for your LLM is challenging but can increase your confidence that the tool will work as expected.
Guests:

Jodie Burchell
Dr. Jodie Burchell is the Developer Advocate in Data Science at JetBrains, and was previously a Lead Data Scientist at Verve Group Europe. She completed a PhD in clinical psychology and a postdoc in biostatistics, before leaving academia for a data science career. She has worked for 7 years as a data scientist in both Australia and Germany, developing a range of products including recommendation systems, analysis platforms, search engine improvements and audience profiling. She has held a broad range of responsibilities in her career, doing everything from data analytics to maintaining machine learning solutions in production. She is a longtime content creator in data science, across conference and user group presentations, books, webinars, and posts on both her own and JetBrains' blogs.
Links:
- Cymbal https://github.com/SimonCropp/Cymbal
- DeepSeek https://www.deepseek.com/
- Falcon https://falconllm.tii.ae/
- Keynote on LLMs from Jodie Burchell https://www.youtube.com/watch?v=fh8jDBgORRU
- Massive Multitask Language Understanding https://docs.confident-ai.com/docs/benchmarks-mmlu
- Hugging Face Open LLM Leaderboard https://huggingface.co/collections/open-llm-leaderboard/open-llm-leaderboard-best-models-652d6c7965a4619fb5c27a03
- HellaSwag https://rowanzellers.com/hellaswag/
- Gates AP Bio Exam https://www.cnbc.com/2023/08/11/bill-gates-went-in-a-state-of-shock-after-chatgpt-aced-ap-bio-exam.html
- ARC-AGI https://arcprize.org/arc
- Hamel Husain: Your AI Product Needs Evals https://hamel.dev/blog/posts/evals/