How I assess AI-coding assistants and AI devtools

Modern AI is – by nature – non-deterministic; if you take that away, quality goes down while predictability goes up. But worse: most of the big wins from AI vanish, and you’re left with small wins (which are good, better than pre-AI, but … you’ve killed a lot of the potential). Evaluating coding tools in this context is tricky. Here’s how I’ve been doing it.

If you’re afraid of AI scraping your proprietary IP / source-code

Everything that follows is described with respect to your own codebase – but all of it applies just as well to any open-source project, so you can use one of those as your testbed. Pick a project hosted e.g. on https://github.com – ideally one whose codebase you know reasonably well, one you’ve been coding against / debugging over the years – and use that instead.

With an SCM, you can be omniscient…

If you’re going to give the AI a fair chance, you need to unleash it. The answers can (should!) be unexpectedly effective, insightful, creative. But how do you assess an answer against the answers it didn’t give, the other possibilities? Done directly, that takes a lot of effort.

Instead: go back in time – a few months, a few years, it doesn’t matter (but the further you go the more you’ll personally know about ‘what happened next’). Pick a commit in your codebase where you made some particularly insightful change (or major refactoring, or fixed a bug that was almost impossible to track down).

Key elements

  1. It should be something you / your team spent a LOT of time thinking about, from multiple angles. You should have encyclopaedic knowledge of ‘what was possible’ at the time.
  2. Even better: you explored several of those potential avenues, ran proof-of-concepts, etc. – i.e. you already know the pros/cons of more than one of the outcomes.
  3. Obviously: you remember which one you ended up choosing, and why.
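If you don’t remember the exact commit, git itself can help you shortlist candidates. A minimal sketch – run here inside a throwaway demo repo so it’s self-contained; in real life you’d `cd` into a clone of your own project and just run the final `git log` line:

```shell
set -eu
# Throwaway demo repo standing in for your real project.
cd "$(mktemp -d)"
git init -q
git config user.email demo@example.com
git config user.name demo
echo 'v1' > parser.c; git add parser.c
git commit -qm "Initial import"
echo 'v2' > parser.c; git add parser.c
git commit -qm "Fixed a crash in the SVG parser"
# Browse candidates: one line per commit, plus the size of each change.
# Big diffs are often the 'big events' worth replaying against an AI tool.
git log --oneline --shortstat
```

The `--shortstat` flag is just one heuristic for spotting significant commits; your own memory of the project remains the primary filter.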

Example

Here’s a project I know well: https://github.com/SVGKit/SVGKit. Clicking the Commits link on the front page and browsing the history, I picked “Fixed a crash, a gradient render issue and an parsing assert (#786)”.

Rewind the codebase…

When someone logs a bug in a version of your software that isn’t the very latest one you shipped (the one sitting on your laptop, that you’re currently improving), what do you do? Of course: you pick a new empty folder on your laptop and git-checkout the commit that corresponds to their version.

Working in that folder you’re in a time-machine that’s guaranteed to have rewound accurately to ‘the state of the universe at the time this copy of the app was shipped’.

So we do the same for these AI tools: check out the commit immediately before the big change you selected in the previous step.

Key elements:

  1. You’re using an SCM, of course; probably ‘git’
  2. One separate folder for each tool you want to review: this lets you compare/contrast the outputs of tools that automatically modify the codebase
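One way to script the ‘one folder per tool’ setup is `git worktree`, which gives you several working folders backed by a single clone; a plain fresh clone per tool works just as well. A sketch using a throwaway demo repo (the tool names are hypothetical placeholders):

```shell
set -eu
parent=$(mktemp -d)
# Throwaway demo repo standing in for your real clone.
git init -q "$parent/repo"
cd "$parent/repo"
git config user.email demo@example.com
git config user.name demo
git commit -qm "Initial import" --allow-empty
git commit -qm "The big insightful change" --allow-empty
# The commit immediately BEFORE the big change (~1 = 'parent of'):
base=$(git rev-parse HEAD~1)
# One detached worktree per tool, all rewound to the same commit:
for tool in tool-a tool-b; do   # hypothetical tool names
  git worktree add --detach "$parent/eval-$tool" "$base"
done
git worktree list
```

Each `eval-*` folder is now an identical time-machine snapshot, so any divergence between them afterwards is down to the tool, not the starting point.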

Example

Depends on your git-client, but assuming you’ve already git-cloned the project, this rewinds to the commit immediately before the SVGKit commit I picked above (the ~1 suffix means ‘the parent of’):

git checkout e990c43~1

Variety to generate a score

What if the one example you choose happens to be the one-in-a-million that trips up one of the tools you’re reviewing? To avoid biasing your evaluation, select a range of commits from different stages of your project. A single example might not reveal a tool’s true capabilities or limitations.

Key elements:

  1. Choosing the primary commit to test takes a bit of work: thinking back to ‘big events’ that happened in your codebase; choosing these secondary commits can be trivial – just pick a bunch at random
  2. If your codebase has multiple programming languages, this is a great opportunity to see whether individual tools work approximately as well across languages, or whether they’re heavily tuned towards one of your languages instead of the others (which shouldn’t happen with good AI!)
  3. Search your commit-messages (or Jira tickets, or etc) for a variety of keywords, e.g. “API”, “fix”, “refactor”, “new”, and “regression”. These correspond to very different reasons/contexts in which a code change was made
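The commit-message keyword search can be done straight from git, without Jira or the GitHub UI. A sketch on a throwaway demo repo; in real life you’d run just the final loop inside your clone:

```shell
set -eu
# Throwaway demo repo with a few differently-flavoured commits.
cd "$(mktemp -d)"
git init -q
git config user.email demo@example.com
git config user.name demo
git commit -qm "Initial import" --allow-empty
git commit -qm "fix: crash when gradient has no stops" --allow-empty
git commit -qm "refactor: split parser into modules" --allow-empty
git commit -qm "API: add async render entry point" --allow-empty
# One search per keyword; -i makes the match case-insensitive.
for kw in API fix refactor regression; do
  echo "== $kw =="
  git log -i --grep="$kw" --oneline
done
```

A keyword with no matches (here, “regression”) simply prints an empty section – a hint that you may need a different keyword, or that your commit messages don’t use that vocabulary.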

Example

Open the github project in a web browser, and from the front page of the repository you can use github’s search to look for specific commits – the search is hidden behind a small magnifying glass icon at the top of the page.
