Self-Coding Repository – Testing for LLM-Coding

I’ve been coding a project that aims to code itself. You can find details in this post and this post. In this post, I’ll go through some of the things I’ve learnt regarding testing.

At this point these are mainly notes that I hope to extend later.

“Good Code” – Aims for a Project Repository

  • All code should have typing and type hints (e.g., as per PEP 484)
  • All functions and classes should have docstrings in a consistent format (e.g., the Google docstring format)
  • All code files should have a module docstring in a consistent format (e.g., the Google docstring format)
  • 100% coverage – tests cover 100% of lines in 100% of files
  • Bandit checks should pass
  • Maintainability Index should be above 80 (as computed by wily)
  • Cyclomatic Complexity index should be below 10 (as computed by wily)
  • Ruff checks should pass
  • Documentation should be generated automatically using Sphinx as a pre-commit hook
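To make the first few aims concrete, here is a minimal example of what a file meeting them might look like – a module docstring, a Google-style function docstring and full type hints. The module and function names are purely illustrative.

```python
"""Utilities for normalising user-supplied scores.

Illustrative example of the aims above: a module docstring,
Google-style function docstrings, and full type hints.
"""


def clamp_score(score: float, low: float = 0.0, high: float = 1.0) -> float:
    """Clamp a score into the inclusive range [low, high].

    Args:
        score: The raw score to clamp.
        low: The lower bound of the allowed range.
        high: The upper bound of the allowed range.

    Returns:
        The score, limited to the range [low, high].

    Raises:
        ValueError: If low is greater than high.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(high, score))
```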

Pain Points

  • Multi-line strings – solution: in the prompt, ask the LLM to return the code as a bracketed list of strings with each line quoted (see the sketch after this list)
  • Newlines – similar to the above – these often end up rendered literally in the generated code; it would be better to have them escaped – solution: in the prompt, ask the LLM to escape problematic characters
  • Imports
    • We want self-contained function blocks, but imports live at the top of the file and are shared between multiple blocks in that file
    • You get inconsistent imports of repository functions – this possibly needs the package structure passed as part of the prompt (based on file structure?)
  • Rewritten code typically introduces bugs
    • Because the LLM lacks whole repository context
  • The time spent trying to fix poor code is often greater than the original time to write mediocre code
  • Vanilla LLM suggestions don’t adequately cover all paths and edge cases
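As a rough sketch of the multi-line-string and newline points above: if the prompt asks for the code as a JSON-style list of quoted lines, the response can be reassembled and validated before anything touches disk. The response string here is invented for illustration.

```python
import ast
import json

# Hypothetical LLM response: the prompt asks for a JSON list of strings,
# one per source line, so newlines and quotes arrive already escaped.
response = '["def greet(name: str) -> str:", "    return f\\"Hello, {name}\\""]'

lines = json.loads(response)  # list of quoted lines
source = "\n".join(lines)     # reassemble the code block

# Validate before writing anything to disk; a SyntaxError here usually
# means the LLM emitted raw newlines or unescaped quotes.
ast.parse(source)
print(source)
```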

Cool Things to Leverage

  • GitHub / IDE Code Review GUIs
    • As the LLM cannot really be trusted not to bork existing code, it takes on the status of an intern let loose on your code base. Luckily there is a whole infrastructure set up to manage this.
    • GitHub has sophisticated workflows for code review. PyCharm also allows you to iterate through file changes, choosing to accept or refuse individual changes. These work well for spotting updates and stopping borked code from being added.
  • Ruff
    • This can be used with the “--fix” option to automatically correct many lint errors prior to a commit.
    • Ruff and Ruff Format can be included in a pre-commit hook.
  • Coverage
    • Coverage can be configured to output results as a JSON file. This JSON file can be parsed to determine total codebase coverage and individual file coverage.
    • The JSON file indicates lines that are missing from tests. These can be used to identify code that has not been tested.
    • I found the GPT API (GPT-4 Turbo) struggled a little when it was told the code is on lines X to Y and the missing lines are A to B (within X to Y). I believe any prompt will need to convert line numbers to actual lines of code (e.g., something like – “Here is a function: {} \n\n Here are some lines that are not tested: {} \n\n Write an additional test that tests the missing lines.”) – see the sketch below.
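A sketch of the coverage parsing described above, assuming the report was produced with `coverage json` (the file layout follows coverage.py's JSON report; the prompt wording is only an example):

```python
import json
from pathlib import Path

# Produced by a prior `coverage json` run (default output name coverage.json).
report = json.loads(Path("coverage.json").read_text())

print(f"Total coverage: {report['totals']['percent_covered']:.1f}%")

for filename, data in report["files"].items():
    missing = data["missing_lines"]
    if not missing:
        continue
    # Convert line numbers into actual source lines so the prompt shows
    # the untested code rather than bare numbers.
    source_lines = Path(filename).read_text().splitlines()
    untested = "\n".join(source_lines[i - 1] for i in missing)
    prompt = (
        f"Here is a file: {filename}\n\n"
        f"Here are some lines that are not tested:\n{untested}\n\n"
        "Write an additional test that tests the missing lines."
    )
    print(prompt)
```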
  • Bandit
    • Outputs can be parsed and passed back into a prompt to compute a code edit to address the issue.
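A similar sketch for Bandit, assuming the report was produced with `bandit -r src -f json -o bandit.json`; the prompt wording is again just an example:

```python
import json
from pathlib import Path

# Produced by `bandit -r src -f json -o bandit.json`.
report = json.loads(Path("bandit.json").read_text())

for issue in report["results"]:
    prompt = (
        f"Bandit reports a {issue['issue_severity']} severity issue in "
        f"{issue['filename']} at line {issue['line_number']}:\n"
        f"{issue['issue_text']}\n\n"
        f"Offending code:\n{issue['code']}\n\n"
        "Suggest an edit to the code that addresses the issue."
    )
    print(prompt)
```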
  • Iteration
    • LLMs seem fairly good at refactoring long functions into smaller modules.
    • They are also good at going from a detailed specification in normal English to the code.
    • They perform better when generating small functions.
    • They struggle sometimes with consistency when the logic becomes too hard.
  • Test Driven Development
    • We could write the tests first as part of the planning of a solution.
    • These can then be added as failing tests before adding any implementation.
    • Tests can be added to cover edge cases, failure cases, and “happy paths”.
    • The tests can then form part of the prompt context for coding a solution.
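A sketch of what the "tests first" step might look like. `mypackage.tags.normalise_tag` is a hypothetical function that does not exist yet, so this file fails initially – which is the point; the tests then go into the prompt as context for the implementation.

```python
import pytest

# Written before the implementation exists: `mypackage.tags.normalise_tag`
# is a hypothetical function, so this file fails until it is implemented.
from mypackage.tags import normalise_tag


def test_happy_path() -> None:
    assert normalise_tag("  Machine Learning ") == "machine-learning"


def test_edge_case_empty_string() -> None:
    assert normalise_tag("") == ""


def test_failure_case_non_string() -> None:
    with pytest.raises(TypeError):
        normalise_tag(42)
```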
  • Database for code
    • Having a simple SQLite database that stores the code in the repository is useful.
    • I parse the files to build the database contents, and can then query and update the database instead of the files.
    • This is often the same as an IDE “index”.
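A minimal sketch of such a database, assuming a flat `functions` table and a `src` source directory; the schema, paths and queried name are illustrative (ast.unparse needs Python 3.9+):

```python
import ast
import sqlite3
from pathlib import Path

conn = sqlite3.connect("code.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS functions "
    "(path TEXT, name TEXT, lineno INTEGER, source TEXT)"
)

# Walk the source tree and store every function definition.
for path in Path("src").rglob("*.py"):
    tree = ast.parse(path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            conn.execute(
                "INSERT INTO functions VALUES (?, ?, ?, ?)",
                (str(path), node.name, node.lineno, ast.unparse(node)),
            )
conn.commit()

# Individual functions can now be fetched for a prompt without re-reading
# or re-parsing the files ("clamp_score" is an illustrative name).
row = conn.execute(
    "SELECT source FROM functions WHERE name = ?", ("clamp_score",)
).fetchone()
```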
  • Consistent and automated naming
    • For example, tests can be named after the class or file they are testing
    • We can have a folder structure for our tests that reflects our source code folder structure
  • Generating files dynamically
    • If you have a mirror of your code stored in a database, you can regenerate your code base automatically.
    • It might be easier to apply things like docstring and typing changes in situ to the database strings and only write the code to a file once everything is sorted.
    • You can also pull the code string from the database and pass it to pytest for testing (see the sketch below). This allows a lot of iteration without the problems of parsing code from files (e.g., if something fails, you know whether it is the reading or writing via ast that is causing it).
    • It also allows for easier refactoring, e.g., if you keep track of the call graph you can rename files safely.
    • I believe many IDEs (e.g., PyCharm and VSCode) do this under the hood anyway.
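A sketch of regenerating the code base from the database and running the tests against the regenerated copy. A `modules` table (path, source) with relative paths is assumed alongside the `functions` table from the previous sketch:

```python
import sqlite3
import subprocess
import tempfile
from pathlib import Path

conn = sqlite3.connect("code.db")

with tempfile.TemporaryDirectory() as tmp:
    # Regenerate each module from its database copy, preserving the
    # original relative paths, then run pytest against the scratch copy.
    for path, source in conn.execute("SELECT path, source FROM modules"):
        target = Path(tmp) / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(source)

    result = subprocess.run(
        ["pytest", tmp], capture_output=True, text=True, check=False
    )
    print(result.stdout)
```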
  • wily and code complexity metrics
    • wily allows complexity metrics to be computed so that the code can be optimised against them.
    • E.g., you can compute the metrics and feed them into a prompt along with a function and indicate a desire to increase the Maintainability Index or reduce the Cyclomatic Complexity.
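wily computes its complexity metrics via the radon library, so the same numbers can be pulled programmatically and fed into a prompt. A sketch, with an illustrative file path and an arbitrary complexity cut-off of 10:

```python
from radon.complexity import cc_visit
from radon.metrics import mi_visit

source = open("src/example.py").read()  # illustrative path

# Module-level Maintainability Index (multi=True treats multi-line
# strings as comments when computing the MI).
mi = mi_visit(source, multi=True)
print(f"Maintainability Index: {mi:.1f}")

# Per-block Cyclomatic Complexity; anything over the threshold can be
# fed into a refactoring prompt along with its source.
for block in cc_visit(source):
    if block.complexity >= 10:
        print(f"{block.name} (line {block.lineno}): complexity {block.complexity}")
```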
  • Modularity
    • It would be good to get some way of measuring the modularity of the code base.
    • You could likely determine a proxy measure from the call graph – looking for independent subgraphs.
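A very rough sketch of the call-graph proxy, using ast to build an approximate call graph and networkx to count independent subgraphs. It ignores methods, imports and name collisions, so treat the number as a crude signal only:

```python
import ast
from pathlib import Path

import networkx as nx

graph = nx.Graph()

# Approximate call graph: an edge from each function to every simple name
# it calls.
for path in Path("src").rglob("*.py"):
    tree = ast.parse(path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            graph.add_node(node.name)
            for call in ast.walk(node):
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    graph.add_edge(node.name, call.func.id)

# Independent subgraphs as a crude proxy for modularity.
components = list(nx.connected_components(graph))
print(f"{len(components)} independent subgraphs")
```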
  • Duplicated Code
    • Duplicated code is a prime candidate for refactoring to reduce lines of code and complexity.
    • LLMs can do this operation quite well in my experience – given two similar functions, they can extract refactored logic into a shared function.
    • You could also use vector-based similarity to look for clusters of similar functions (see the rough sketch below).
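A rough sketch of the similarity idea, using TF-IDF over the raw source as a cheap stand-in for proper embeddings; the example functions and the 0.8 cut-off are arbitrary:

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Function sources would normally come from the code database; these are
# invented examples.
functions = {
    "parse_user": (
        "def parse_user(raw):\n"
        "    value = raw.strip().lower()\n"
        "    return value.replace(' ', '-')"
    ),
    "parse_tag": (
        "def parse_tag(raw):\n"
        "    value = raw.strip().lower()\n"
        "    return value.replace(' ', '-')"
    ),
    "add": "def add(a, b):\n    return a + b",
}

names = list(functions)
vectors = TfidfVectorizer().fit_transform(functions.values())
similarity = cosine_similarity(vectors)

# Flag highly similar pairs as refactoring candidates (0.8 is an arbitrary cut-off).
for i, j in combinations(range(len(names)), 2):
    if similarity[i, j] > 0.8:
        print(f"{names[i]} and {names[j]} look like near-duplicates")
```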
  • Profiling
    • cProfile or Yappi or vmprof or pytest-benchmark?
    • We need to work out a way to profile the code base.
    • Maybe we could use pytest-benchmark to create a benchmarking test for each function as well as the function test.
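If pytest-benchmark is used, the benchmarking test can sit next to the functional test and reuse the same hypothetical `normalise_tag` function from the test-driven sketch:

```python
# test_benchmarks.py – requires the pytest-benchmark plugin; `normalise_tag`
# is the hypothetical function used in the test-driven sketch above.
from mypackage.tags import normalise_tag


def test_normalise_tag(benchmark) -> None:
    # The benchmark fixture calls the function repeatedly and records timings
    # alongside the normal assertion.
    result = benchmark(normalise_tag, "  Machine Learning ")
    assert result == "machine-learning"
```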
  • https://realpython.com/python-refactoring/
    • This is a useful guide to refactoring based on complexity metrics.
  • codium.ai
    • Has some good LLM-based logic for building useful tests.
    • In some rough experiments it was a lot better than raw GPT API calls.
    • It looks like they are using the hypothesis Python library under the hood to help.
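I have not verified what codium.ai does internally, but a property-based test written with hypothesis looks like the following (again using the hypothetical `normalise_tag`):

```python
from hypothesis import given, strategies as st

from mypackage.tags import normalise_tag  # hypothetical function from earlier


@given(st.text())
def test_normalise_tag_is_idempotent(raw: str) -> None:
    # A property rather than a single example: normalising an already
    # normalised value should change nothing.
    once = normalise_tag(raw)
    assert normalise_tag(once) == once
```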
