Experiment: Can LLMs ‘reason’, from a scientific/mathematical perspective? If not, what’s the gap / what’s missing? – AI Pioneers Path

Summary

Using an LLM to answer university-level maths questions, and manually grading its answers at Teacher/Professor leve

Usefulness: Practical/Productive

Techniques

Investigate how well it scores on “explain your working” and particularly what happens when you write new questions that have never existed before (change the parameters, change the setup, etc)
Heavy use of Chain-of-Thought (2023/2024 current best approach)
Tech: GPT-4

Discoveries

Digging deeper into the hallucinations, it doesn’t understand that it has hallucinated, and often requires a lot of work (4-6 shot conversations) to get it to correct the mis-steps; getting it to present the correct answer with the mis-steps embedded is particularly difficult (GPT-4 configured against doing this)