Experiment: Can LLMs ‘reason’, from a scientific/mathematical perspective? If not, what’s the gap / what’s missing?

Summary

Using an LLM to answer university-level maths questions, and manually grading its answers at Teacher/Professor leve

Usefulness: Practical/Productive

Techniques

  • Investigate how well it scores on “explain your working” and particularly what happens when you write new questions that have never existed before (change the parameters, change the setup, etc)
  • Heavy use of Chain-of-Thought (2023/2024 current best approach)
  • Tech: GPT-4

Discoveries

  • Digging deeper into the hallucinations, it doesn’t understand that it has hallucinated, and often requires a lot of work (4-6 shot conversations) to get it to correct the mis-steps; getting it to present the correct answer with the mis-steps embedded is particularly difficult (GPT-4 configured against doing this)

Subscribe for more AI tips