Cleaning Code
By Tyler
Last month I mentioned that I’ve been reading “Clean Code” again after many years. The reading has been enlightening, and a review of the book and my thoughts are forthcoming. Until then, I thought it could be instructive to go through some of my own code and examine how it could be made clean.
Realreq
First, an introduction to the material:
realreq
. Realreq is a simple tool I
wrote to help me determine the actual requirements of projects I’m working on
in python. Often when I’m prototyping I’ll install several libraries to test
out. I’ll also install various tools for testing, and editing my code like
pylsp
, pytest
and black
. When it comes time to determine the requirements
for my program, I don’t want it populated with all those extraneous packages.
Now this isn’t a problem for languages like Rust, NodeJS, or Go, as the tooling handles this natively. However python does not have this. There are a lot of additional tools that have been created to do so, but most of them are pretty heavy handed, requiring you integrate them into your workspace and workflow.
So naturally when I couldn’t find a tool I liked, I decided to write a script to get what I wanted.
The material
Realreq has stayed small (<1k LoC), but to ensure that it works I do have a fairly comprehensive set of tests. However the tests themselves were pretty poorly maintained, with me having done the bare minimum to keep them running. With some new updates planned, I decided that now was the best time to get these tests under control.
Below was a particularly nasty section of the tests (you can see the whole file at it’s commit)
CONTENT = """
import os
import requests
from foo import bar
from . import local_module
import local_module2
from foo.baz import frum
import abbrev
import src.local_module
import fake_pkg
"""
#...other lines of setup code
@pytest.fixture(scope="session", params=["src", "path/to/src", "go/to/src/module.py"])
def source_files(
tmp_path_factory,
request,
):
"""Creates a temp directory that tests different source files
returns: path to directory being used for test
"""
path = os.path.normpath(request.param)
paths = path.split("/")
is_module = os.path.splitext(paths[-1])[1].lower() == ".py"
if is_module:
module = paths[-1]
paths = paths[:-1]
if len(paths) > 1 and isinstance(paths, list):
src = tmp_path_factory.mktemp(path[0], numbered=False)
for p in paths:
src = src / p
src.mkdir()
elif is_module:
src = tmp_path_factory.mktemp(path[0], numbered=False)
else:
src = tmp_path_factory.mktemp(path, numbered=False)
if is_module:
src = src / module
src.write_text(CONTENT)
else:
main = src / "main.py"
main.write_text(CONTENT)
return src
The above is a pytest fixture used to create a temporary directory with a fake python module for testing realreq with. Do read through it, but don’t worry if you have a hard time understanding it. I know I did when I was re-reading it after a while. The truth is that this function has accumulated code with every new feature I’ve added.
This code is, sadly, a mess. I have repeated logic with the is_module
variable. There are several if statments, of which the purpose and the
interplay between each is not straightforward. We make use of a lot of low
level functions that overcomplicate and expose to much the workings of this
function. All of this is my fault, for not adequately taking care of this code.
So how can we take this jumble of logic and generate something comprehensible? Well with some short steps, and resorting to some common refactoring techniques. I won’t lie, this wasn’t easy. Several times I started down a path only to have to revert my code back to its original form. I mention that because the steps I walk through are results of that accumulated struggle.
The struggle was not in vain though. Read through the code helped me understand both what the code did, and what it was intended to do.
Ideally those two things are one and the same. However, this is not always the case. As is typical in dirty code, I noticed that there was a bug in my code:
path = os.path.normpath(request.param)
paths = path.split("/")
is_module = os.path.splitext(paths[-1])[1].lower() == ".py"
if len(paths) > 1 and isinstance(paths, list):
src = tmp_path_factory.mktemp(path[0], numbered=False)
for p in paths:
src = src / p
src.mkdir()
elif is_module:
src = tmp_path_factory.mktemp(path[0], numbered=False)
else:
src = tmp_path_factory.mktemp(path, numbered=False)
I noticed that in this section of code, there was a slight bug. Here we are checking
the structure of the path (one of ["src", "path/to/src", "go/to/src/module.py"]
),
and building out the temporary directory. I realized that in the first condition,
we were creating the initial directory using the first part of the split path, but
then reusing it to build out the rest of the path.
Furthermore the second condition was just using the wrong variable to generate the source path. This piqued my suspicions about this section of code. Such a flaw really should not have generated working code, yet my tests all passed. Sure enough, after throwing a breakpoint into the code, i discovered that the code was dead. Deleting the entire branch had no effect on the tests.
These were benign in the way the tests currently functioned, but I decided to fix it before the bug metastasized into real issues.
if len(paths) > 1 and isinstance(paths, list):
src = tmp_path_factory.mktemp(paths.pop(0), numbered=False)
for p in paths:
src = src / p
src.mkdir()
else:
src = tmp_path_factory.mktemp(path, numbered=False)
Now that the bugs were cleared out, It was time for improvement. This fixture used
a lot of primitives to manage the Path data (such splitting on “/” to generate
a list of directories to create). While it “worked” it wasn’t very expressive nor
did it take advantage of Python’s native “Path” type that helps deal with working
with file system paths. My next goal was to replace this logic with one using
pathlib
paths.
path = pathlib.Path(request.param)
paths = path.parents
is_module = path.suffix.lower() == ".py"
# Get the appropriate paths
if not is_module:
# Hack to get all parts of the path
paths = (path / "child").parents
# Build out source directory
if len(paths) > 1 and not isinstance(paths, str):
# Minus 2 because of the implicit "." dir at the top of the parents list
src = tmp_path_factory.mktemp(paths[len(paths)-2] , numbered=False)
for p in list(reversed(paths))[2:]:
src = src / p.stem
src.mkdir()
else:
src = tmp_path_factory.mktemp(path, numbered=False)
# Write out source file
if is_module:
module = path.name
src = src / module
src.write_text(CONTENT)
else:
main = src / "main.py"
main.write_text(CONTENT)
return src
A lot had to change to keep the tests working, and for some of the sections
(like the second condition) it certainly seemed a bit messier than before, but I was fairly
confident this would pay off. One challenge of course was to get the name of
the parent directory. The
parents
property of a Path returns a sequence of paths each one successively removing a
parent directory until you reach the root. For a relative path (as we have in
our tests), the root directory (or final item) is "."
. This isn’t useful for
creating out test paths so we needed the second to last item. This required some
obtuse manipulations of the list by subtracting 2 from the index. At this point
I threw in a simple comment to explain why this was needed.
The next task I wanted to handle was thatis_module
variable. It is calculated
at the top of the function and then used throughout. While it is nice to cache
the results, eventually I wanted to split this function into smaller parts. I
didn’t want to have to pass an is_module
parameter to all the new functions
that I was going to create. So I decided that it would be worth it to turn the
variable into a call to a query function and just recalculate it when I needed
it. Yes this would mean a slight performance hit, but the benefits of cleaning
the code were well worth it.
def source_files(...)
path = pathlib.Path(request.param)
parents = path.parents
# Get the appropriate paths
if not _is_module(path):
# Hack to get all parts of the path
parents = (path / "child").parents
# Build out source directory
if len(parents) > 1 and not isinstance(parents, str):
# Minus 2 because of the implicit "." dir at the top of the parents list
src = tmp_path_factory.mktemp(parents[len(parents)-2] , numbered=False)
for p in list(reversed(parents))[2:]:
src = src / p.stem
src.mkdir()
else:
src = tmp_path_factory.mktemp(path, numbered=False)
# Write out source file
if _is_module(path):
module = path.name
src = src / module
src.write_text(CONTENT)
else:
main = src / "main.py"
main.write_text(CONTENT)
return src
def _is_module(path:pathlib.Path) -> bool:
"""Tests if path is a Python Module"""
return path.suffix.lower() == ".py"
With that done I was finally able to extract part of the logic out into a separate function. The first
part of the function deals with creating and setting up a directory to write out test file into. I
decided that it would be worth encapsulating it into a function to precisely do that. It was as simple
as copying and pasting the code into the body of a new function, and returning the generated src
path.
def source_files
path = pathlib.Path(request.param)
src = _create_source_directory(tmp_path_factory, path)
# Write out source file
if _is_module(path):
module = path.name
src = src / module
src.write_text(CONTENT)
else:
main = src / "main.py"
main.write_text(CONTENT)
return src
def _is_module(path:pathlib.Path) -> bool:
"""Tests if path is a Python Module"""
return path.suffix.lower() == ".py"
def _create_source_directory(tmp_path_factory, path: pathlib.Path) -> pathlib.Path:
parents = path.parents
if not _is_module(path):
# Hack to get all parts of the path
parents = (path / "child").parents
if len(parents) > 1 and not isinstance(parents, str):
# Minus 2 because of the implicit "." dir at the top of the parents list
src = tmp_path_factory.mktemp(parents[len(parents)-2] , numbered=False)
for p in list(reversed(parents))[2:]:
src = src / p.stem
src.mkdir()
else:
src = tmp_path_factory.mktemp(path, numbered=False)
return src
The first few lines in the new _create_source_directory
function, are really
just a way to generate the parents list
. I decided to extract another function to
represent that. This had the added benefit of isolating and “hiding” the
hack we have written to ensure we get all the parents for paths that aren’t
modules.
def _create_source_directory(tmp_path_factory, path: pathlib.Path) -> pathlib.Path:
"""Creates the source directory given by path"""
parents = _parent_dirs(path)
if len(parents) > 1 and not isinstance(parents, str):
# Minus 2 because of the implicit "." dir at the top of the parents list
src = tmp_path_factory.mktemp(parents[len(parents)-2] , numbered=False)
for p in list(reversed(parents))[2:]:
src = src / p.stem
src.mkdir()
else:
src = tmp_path_factory.mktemp(path, numbered=False)
return src
def _parent_dirs(path:pathlib.Path) -> typing.Sequence[pathlib.Path]:
if not _is_module(path):
# Hack to get all parts of the path
return (path / "child").parents
return path.parents
Now that I had extracted the code, I was able to realize that the else
condition in _create_source_directory
was dead. This is something that I
never would have realized until I was able to disentangle the code from each
other, by extracting these functions. Quickly deleting the useless branch
really cleaned up that function.
def _create_source_directory(tmp_path_factory, path: pathlib.Path) -> pathlib.Path:
"""Creates the source directory given by path"""
parents = _parent_dirs(path)
# Minus 2 because of the implicit "." dir at the top of the parents list
src = tmp_path_factory.mktemp(parents[len(parents)-2] , numbered=False)
for p in list(reversed(parents))[2:]:
src = src / p.stem
src.mkdir()
return src
At this point I contemplated the rest of the original fixture:
def source_files(
tmp_path_factory,
request,
):
"""
Creates a temp directory with a source file
This is a session scoped directory, so the files should be ONLY be read, and not modified.
Returns: path to directory being used for test
"""
path = pathlib.Path(request.param)
src = _create_source_directory(tmp_path_factory, path)
# Write out source file
if _is_module(path):
module = path.name
src = src / module
src.write_text(CONTENT)
else:
main = src / "main.py"
main.write_text(CONTENT)
return src
I wasn’t happy with the writing out of the test content being so duplicative. Unfortunately I couldn’t think of a good way to deduplicate that I liked. Rather than extract the function just to extract one I decided to leave the code as is. I was rather happy with the overall result of the refactoring.
Retrospective
If you made it through all of that, I commend you. When I set out to refactor I was intending to clean up my testing code to accommodate some new features. What surprised me was how much cleaning up my code helped me to find imperfections in the code. A simple review found two Bugs, and at least two other places where code was dead. Furthermore, these issues were not creating any visible manifestations of incorrect behavior, which means that without cleaning they would have persisted for years more to come. It is possible that at some point they could have hidden incorrect behavior that I assumed was being tested. Additionally, any new additions would have had to try to add more conditions and logic to try to comply with previous if statements that were meaningless, and unexercised. Cleaning my code not only helped communicate my ideas better, it also made the code more correct and improved my ability to maintain it.
Often, even some engineers, will balk at the idea of “Clean” code as merely the subjective opinions of overly pretentious engineers. “As long as the code works, who cares” is something that is often their watchword. However what this whole exercise showed me, and I hope showed you, is that hygienic code actually improves function.