Summary
Using an LLM to answer university-level maths questions, and manually grading its answers at Teacher/Professor leve
Usefulness: Practical/Productive
Techniques
- Investigate how well it scores on “explain your working” and particularly what happens when you write new questions that have never existed before (change the parameters, change the setup, etc)
- Heavy use of Chain-of-Thought (2023/2024 current best approach)
- Tech: GPT-4
Discoveries
- Digging deeper into the hallucinations, it doesn’t understand that it has hallucinated, and often requires a lot of work (4-6 shot conversations) to get it to correct the mis-steps; getting it to present the correct answer with the mis-steps embedded is particularly difficult (GPT-4 configured against doing this)