I’m working on a self-coding project using Large Language Models (LLMs). I’ve chatted about this here. This post will look at some of the things I’ve learned and some ideas for further development.
Things That Work Well
Using LLMs to Generate Code and Tests
GPT-4 is pretty good at writing 5-15 line Python functions based on a request.
GPT-3.5-turbo and GPT-4-turbo are good models for summarising existing text and for structured requests such as writing docstrings or describing files.
GPT-4 is also good for writing to Readme files and generating documentation. This can be useful both for humans (i.e., me when I come to inspect the code months later) and for the LLM, as we can feed sections of the Readme file back in as prompts. The Readme file and documentation thus become a data source for retrieval augmented generation (RAG).
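As a rough illustration of that idea, here is a minimal sketch that splits a Readme into sections and retrieves the most relevant ones for a prompt. It uses simple keyword overlap where a fuller setup would use vector embeddings, and the function names are purely illustrative:

```python
from pathlib import Path


def split_readme_sections(readme_path: str = "README.md") -> dict[str, str]:
    """Split a markdown Readme into {heading: body} sections."""
    sections: dict[str, str] = {}
    heading, body = "preamble", []
    for line in Path(readme_path).read_text().splitlines():
        if line.startswith("#"):
            sections[heading] = "\n".join(body).strip()
            heading, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    sections[heading] = "\n".join(body).strip()
    return sections


def retrieve_readme_context(query: str, sections: dict[str, str], top_k: int = 2) -> str:
    """Pick the top_k sections sharing the most words with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        sections.items(),
        key=lambda item: len(query_words & set(item[1].lower().split())),
        reverse=True,
    )
    return "\n\n".join(f"## {heading}\n{body}" for heading, body in scored[:top_k])
```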
Using Existing Project Files as LLM Prompt Context
A project will have lots of support and configuration files. We can reuse these to provide prompt context for our LLM agents. For example, the requirements.txt file indicates what Python packages are installed, and so is very useful for constraining LLM output.
We can also use files like .gitignore to exclude files from our prompt context generation.
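A sketch of this might look like the following. It assumes a requirements.txt at the project root and uses the pathspec package to honour .gitignore patterns; the function names are illustrative:

```python
from pathlib import Path

import pathspec  # pip install pathspec; parses .gitignore-style patterns


def load_gitignore(root: str = ".") -> pathspec.PathSpec:
    """Build a matcher from the project's .gitignore (empty matcher if absent)."""
    gitignore = Path(root) / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    return pathspec.PathSpec.from_lines("gitwildmatch", lines)


def build_prompt_context(root: str = ".") -> str:
    """Combine installed requirements with a .gitignore-filtered file listing."""
    spec = load_gitignore(root)
    requirements = Path(root, "requirements.txt").read_text()
    files = sorted(
        str(path.relative_to(root))
        for path in Path(root).rglob("*.py")
        if not spec.match_file(str(path.relative_to(root)))
    )
    return (
        "The project uses these packages:\n" + requirements
        + "\nProject Python files:\n" + "\n".join(files)
    )
```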
Function Calling
The GPT function calling options introduced over the summer of 2023 seem to work well. I find I’m using them when I need a specific output without verbal packaging. For example, I have a function calling definition to return generated code as a string, with the imports separated out as another returned field. I also use function calling for agent flow – things like selecting an option from a set work well if you enumerate the options and then use a function definition.
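As an example, a function calling definition along those lines might look like this. It uses the 2023-era openai Python interface (pre-1.0), and the schema and field names are illustrative rather than my exact definitions:

```python
import json

import openai  # openai<1.0 interface

CODE_FUNCTION = {
    "name": "return_generated_code",
    "description": "Return generated Python code with imports separated out.",
    "parameters": {
        "type": "object",
        "properties": {
            "imports": {
                "type": "string",
                "description": "Import statements only, one per line.",
            },
            "code": {
                "type": "string",
                "description": "The generated code without import lines.",
            },
        },
        "required": ["imports", "code"],
    },
}


def generate_code(task: str) -> dict:
    """Ask the model for code and force the reply through the function schema."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Write Python code to: {task}"}],
        functions=[CODE_FUNCTION],
        function_call={"name": "return_generated_code"},  # no verbal packaging
    )
    arguments = response["choices"][0]["message"]["function_call"]["arguments"]
    return json.loads(arguments)  # {"imports": "...", "code": "..."}
```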
GitHub and GitHub API
The GitHub API (via PyGithub) works really well for fetching information (such as outstanding issues) and providing a framework to submit pull requests. We don’t need to reinvent the wheel – we can use a lot of the tools that have been developed to prevent inexperienced junior developers from messing up code bases to control the contributions of AI agents.
One approach I’ve been playing around with is setting up the “personality” of the programmatic requests to reflect an “AI” GitHub developer account. This can be achieved by creating an account and a personal access token. When editing the code base, this AI developer can create a new branch, edit the files, then submit a pull request. As well as in-project controls (discussed below), we can also use GitHub Actions to prevent merges until the code is stable.
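A rough sketch of that flow with PyGithub might look like this. The token would belong to the AI developer account, and the function name and messages are illustrative:

```python
from github import Github  # pip install PyGithub


def open_ai_pull_request(token: str, repo_name: str, path: str,
                         new_content: str, branch: str, message: str):
    """Create a branch as the AI developer, edit one file, and open a pull request."""
    repo = Github(token).get_repo(repo_name)
    base = repo.get_branch(repo.default_branch)
    # Branch from the latest commit on the default branch
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base.commit.sha)
    existing = repo.get_contents(path, ref=branch)
    repo.update_file(path, message, new_content, existing.sha, branch=branch)
    # The pull request can then sit behind GitHub Actions checks before merging
    return repo.create_pull(
        title=message,
        body="Automated change submitted for review.",
        head=branch,
        base=repo.default_branch,
    )
```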
Coding Tools
While LLMs are often great at generation, I’ve found them to be patchy at critical analysis of existing code. However, given a bit of code and a specific issue, they can normally iterate a working fix over 1-5 attempts.
That said, my own coding process normally goes something like this (a sketch of automating the same feedback loop follows the list):
- While coding – IDE warnings catch a fair amount of typos and type clashes.
- First attempt – doesn’t work, tests don’t run.
- Second attempt – fix bug in first attempt, tests run but fail.
- Third attempt – fix another bug, single test passes but I miss an edge case.
- Fourth attempt – add an edge case, single test passes, other project tests fail.
- Fifth attempt – fix code, single test passes and project tests pass. Deploy.
- Sixth attempt – someone catches a bug during deployment. Build a hotfix. Go back to the start.
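A simple automated version of this test-and-fix cycle might look like the sketch below, where llm_fix stands in for whichever code-fixing call is used (e.g. a function-calling request like the one above):

```python
import subprocess
from pathlib import Path


def run_tests(test_path: str) -> tuple[bool, str]:
    """Run pytest and capture its output as feedback for the next attempt."""
    result = subprocess.run(
        ["pytest", test_path, "-x", "--tb=short"],
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


def iterate_fix(source_path: str, test_path: str, llm_fix, max_attempts: int = 5) -> bool:
    """Loop: run tests -> feed failures to the LLM -> rewrite the file -> retest."""
    for _ in range(max_attempts):
        passed, output = run_tests(test_path)
        if passed:
            return True
        current = Path(source_path).read_text()
        fixed = llm_fix(code=current, feedback=output)  # placeholder callable
        Path(source_path).write_text(fixed)
    return run_tests(test_path)[0]
```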
Luckily in the Python ecosystem there are a number of tools both humans and machines can use to get feedback and to automate common tasks. These tools include:
- iSort – sorts Python imports.
- Black – automatically formats code.
- Flake8 – lints – looks for easy syntax errors and style issues.
- Pylint – another linter, more in-depth than Flake8.
- Bandit – looks for security issues in your code.
- Pytest – a better test runner than unittest.
- Coverage – this indicates which lines of your code are not covered by your tests.
- PydocStyle – now deprecated, but checks docstrings.
- Ruff – this looks to be a drop-in fast replacement for some of the above written in Rust.
- Profilers – e.g. the built-in cProfile, which looks at where code spends its time.
- Sphinx – this generates documentation for the code from docstrings.
- Pre-Commit – automate checks using one or more of the tools above to run before every commit.
Now we can arrange for most of these tools to be run via subprocess or to write their output to a file. The self-coding system then has programmatic access to the output, which it can use as automated feedback. Certain tools like iSort, Black, and Ruff also offer automated code fixes and optimisations that can be run after LLM code generation.
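For example, a minimal runner along these lines might look like this (the src directory and the exact tool flags are illustrative):

```python
import subprocess

# Tools whose output we capture as feedback for the agents.
CHECKS = {
    "flake8": ["flake8", "src"],
    "pylint": ["pylint", "src"],
    "bandit": ["bandit", "-r", "src"],
    "pytest": ["pytest", "--tb=short"],
}

# Tools that rewrite code in place after LLM generation.
FIXERS = [["isort", "src"], ["black", "src"]]


def run_feedback_checks() -> dict[str, str]:
    """Run each checker and collect its stdout/stderr for programmatic use."""
    feedback = {}
    for name, command in CHECKS.items():
        result = subprocess.run(command, capture_output=True, text=True)
        feedback[name] = result.stdout + result.stderr
    return feedback


def apply_auto_fixes() -> None:
    """Let the formatters tidy up freshly generated code."""
    for command in FIXERS:
        subprocess.run(command, check=False)
```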
Problems to Solve
Differentiated Agent Structure
In the code base, I’m working towards this while trying to avoid spending too much effort coding it up by hand (as I want the code to develop itself).
However, I am wondering whether to define different agent “personalities” (i.e., a set of custom prompts and methods) for each of the following (a minimal sketch of such personalities follows the list):
- Code generation (for a described task / issue).
- Test generation.
- Debugging – modifying existing code based on feedback (which may come from a human via a GitHub comment or from an automated tool).
- Refactoring – modifying existing code with no feedback (e.g., just based on inspection of the code itself).
- Documentation generation – e.g. updating the Readme and possibly creating files for Sphinx.
- Code Indexing or Parsing – this may involve updating a database of the project code and computing fields for that database.
- Git Management – branching, committing etc.
- GitHub Management – getting and writing issues, submitting pull requests. This could work alongside the Git Management role.
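A minimal sketch of what such personalities could look like is below; the role names, prompts, and tool lists are purely illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class AgentPersonality:
    """A role-specific bundle of system prompt and permitted actions."""
    name: str
    system_prompt: str
    allowed_tools: list[str] = field(default_factory=list)


PERSONALITIES = {
    "code_generation": AgentPersonality(
        name="code_generation",
        system_prompt="You write small, well-typed Python functions for a described issue.",
        allowed_tools=["return_generated_code"],
    ),
    "debugging": AgentPersonality(
        name="debugging",
        system_prompt="You modify existing code to address the feedback provided.",
        allowed_tools=["return_generated_code", "run_feedback_checks"],
    ),
    "documentation": AgentPersonality(
        name="documentation",
        system_prompt="You write concise docstrings and Readme updates.",
        allowed_tools=["update_file"],
    ),
}
```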
Coping with Complex Project Directories and Context
There are several approaches I’m using at the moment:
- File Context
- I’m generating a text representation of the directory structure for my project.
- Hierarchy
- A project has the following:
- Root directory
- Subdirectories [possibly recursive]
- File
- [Class] – optional
- Function – this can be core or test
- This can be seen as a tree, with files and/or functions being leaf nodes of the tree.
- As with chunking in the human brain, once a level of the hierarchy goes above 5-10 items, it is often wise to chunk that into a single entity in the hierarchy. So if we have more than 10 functions in a file, it may be time to create a new file; if we have 5-10 files in a directory it may be time to create a new subdirectory. This could form part of the refactoring functionality.
- Hierarchy allows us to sequentially navigate with limited context at each level. For example, rather than supplying the whole directory structure, we can just indicate a top level set of folders and/or the contents of a single folder.
- This suggests building some tree navigation functionality into our AI agents. (E.g., there is the networkx library, and we can implement depth-first or breadth-first navigation approaches. We could also maybe abstract this to general graph navigation functionality.)
- Support Database
- I am playing around with having copies of the code within a simple SQLite database (via SQLAlchemy). I can use ast to extract the code and store it as strings in the database. (A minimal indexing sketch appears at the end of this section.)
- This also means I can create a vector embedding for each portion of code. I can then use this as part of a RAG approach, where I retrieve possibly related portions of code and/or files and/or directories.
- One thing to think about is the need to update the database when the code changes, which will be frequently.
- We can add hash fields for the code, name, and docstring, and only update those fields when the hash values change. The hash values can also be used to set staleness parameters.
- Docstrings
- I am generating docstrings at each level of the file hierarchy.
- The project summary is set out in the Readme file.
- Each file has a file docstring that describes/summarises the functions in the file (and the aim or goal of the file definition).
- Each function or class also has a docstring that describes the function or class functionality as well as the input interface types.
- I can also add a package docstring to the __init__.py files in each directory.
- In this manner, each level of my hierarchy (as set out above) also has a short (50-200 word/token) description. This can be used for navigation.
- We thus have the following tiers of information:
- Root Name
- Root Summary
- Directory Name
- Directory Docstring
- File Name
- File Docstring
- Class Name
- Class Docstring
- Class Methods
- Function Name
- Function Docstring
- Function Interfaces (input and output – e.g. Args/Returns parts of docstring or typing)
- Function Code
- Function Test Code (this can have use examples)
- This can be enumerated to allow structured navigation.
- The directory and file names will form part of the file context set out above.
- Generating summarising docstrings is something even the lesser GPT models (3.5-turbo, 4-turbo) are good at.
- These docstrings can be saved in the database for quicker retrieval.
- Retrieval of the docstrings from the database could itself be made available as an API action for an agent.
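Finally, here is the minimal indexing sketch mentioned above, using ast plus content hashes so that only changed entries are rewritten. I use plain sqlite3 here for brevity rather than SQLAlchemy; the schema is illustrative and the embedding step is omitted:

```python
import ast
import hashlib
import sqlite3
from pathlib import Path


def index_file(db: sqlite3.Connection, path: Path) -> None:
    """Extract each function with ast and upsert it, keyed on a hash of its source."""
    source = path.read_text()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            code = ast.get_source_segment(source, node) or ""
            docstring = ast.get_docstring(node) or ""
            code_hash = hashlib.sha256(code.encode()).hexdigest()
            db.execute(
                """INSERT INTO functions (file, name, code, docstring, code_hash)
                   VALUES (?, ?, ?, ?, ?)
                   ON CONFLICT(file, name) DO UPDATE SET
                     code = excluded.code,
                     docstring = excluded.docstring,
                     code_hash = excluded.code_hash
                   WHERE functions.code_hash != excluded.code_hash""",
                (str(path), node.name, code, docstring, code_hash),
            )
    db.commit()


def build_index(root: str = "src", db_path: str = "code_index.db") -> None:
    """Walk the project and (re)index every Python file."""
    db = sqlite3.connect(db_path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS functions (
             file TEXT, name TEXT, code TEXT, docstring TEXT, code_hash TEXT,
             UNIQUE(file, name))"""
    )
    for path in Path(root).rglob("*.py"):
        index_file(db, path)
```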