Cleaning Code

By Tyler

January 31, 2023

Last month I mentioned that I’ve been reading “Clean Code” again after many years. The reading has been enlightening, and a review of the book and my thoughts are forthcoming. Until then, I thought it could be instructive to go through some of my own code and examine how it could be made clean.

Realreq

First, an introduction to the material: realreq. Realreq is a simple tool I wrote to help me determine the actual requirements of projects I’m working on in python. Often when I’m prototyping I’ll install several libraries to test out. I’ll also install various tools for testing, and editing my code like pylsp, pytest and black. When it comes time to determine the requirements for my program, I don’t want it populated with all those extraneous packages.

Now this isn’t a problem for languages like Rust, NodeJS, or Go, as the tooling handles this natively. However python does not have this. There are a lot of additional tools that have been created to do so, but most of them are pretty heavy handed, requiring you integrate them into your workspace and workflow.

So naturally when I couldn’t find a tool I liked, I decided to write a script to get what I wanted.

The material

Realreq has stayed small (<1k LoC), but to ensure that it works I do have a fairly comprehensive set of tests. However the tests themselves were pretty poorly maintained, with me having done the bare minimum to keep them running. With some new updates planned, I decided that now was the best time to get these tests under control.

Below was a particularly nasty section of the tests (you can see the whole file at it’s commit)

CONTENT = """
import os
import requests
from foo import bar
from . import local_module
import local_module2
from foo.baz import frum
import abbrev
import src.local_module
import fake_pkg
"""

#...other lines of setup code

@pytest.fixture(scope="session", params=["src", "path/to/src", "go/to/src/module.py"])
def source_files(
    tmp_path_factory,
    request,
):
    """Creates a temp directory that tests different source files

    returns: path to directory being used for test
    """
    path = os.path.normpath(request.param)
    paths = path.split("/")
    is_module = os.path.splitext(paths[-1])[1].lower() == ".py"
    if is_module:
        module = paths[-1]
        paths = paths[:-1]
    if len(paths) > 1 and isinstance(paths, list):
        src = tmp_path_factory.mktemp(path[0], numbered=False)
        for p in paths:
            src = src / p
            src.mkdir()
    elif is_module:
        src = tmp_path_factory.mktemp(path[0], numbered=False)
    else:
        src = tmp_path_factory.mktemp(path, numbered=False)


    if is_module:
        src = src / module
        src.write_text(CONTENT)
    else:
        main = src / "main.py"
        main.write_text(CONTENT)
    return src

The above is a pytest fixture used to create a temporary directory with a fake python module for testing realreq with. Do read through it, but don’t worry if you have a hard time understanding it. I know I did when I was re-reading it after a while. The truth is that this function has accumulated code with every new feature I’ve added.

This code is, sadly, a mess. I have repeated logic with the is_module variable. There are several if statments, of which the purpose and the interplay between each is not straightforward. We make use of a lot of low level functions that overcomplicate and expose to much the workings of this function. All of this is my fault, for not adequately taking care of this code.

So how can we take this jumble of logic and generate something comprehensible? Well with some short steps, and resorting to some common refactoring techniques. I won’t lie, this wasn’t easy. Several times I started down a path only to have to revert my code back to its original form. I mention that because the steps I walk through are results of that accumulated struggle.

The struggle was not in vain though. Read through the code helped me understand both what the code did, and what it was intended to do.

Ideally those two things are one and the same. However, this is not always the case. As is typical in dirty code, I noticed that there was a bug in my code:

path = os.path.normpath(request.param)
paths = path.split("/")
is_module = os.path.splitext(paths[-1])[1].lower() == ".py"
if len(paths) > 1 and isinstance(paths, list):
    src = tmp_path_factory.mktemp(path[0], numbered=False)
    for p in paths:
        src = src / p
        src.mkdir()
elif is_module:
    src = tmp_path_factory.mktemp(path[0], numbered=False)
else:
    src = tmp_path_factory.mktemp(path, numbered=False)

I noticed that in this section of code, there was a slight bug. Here we are checking the structure of the path (one of ["src", "path/to/src", "go/to/src/module.py"]), and building out the temporary directory. I realized that in the first condition, we were creating the initial directory using the first part of the split path, but then reusing it to build out the rest of the path.

Furthermore the second condition was just using the wrong variable to generate the source path. This piqued my suspicions about this section of code. Such a flaw really should not have generated working code, yet my tests all passed. Sure enough, after throwing a breakpoint into the code, i discovered that the code was dead. Deleting the entire branch had no effect on the tests.

These were benign in the way the tests currently functioned, but I decided to fix it before the bug metastasized into real issues.

if len(paths) > 1 and isinstance(paths, list):
    src = tmp_path_factory.mktemp(paths.pop(0), numbered=False)
    for p in paths:
        src = src / p
        src.mkdir()
else:
    src = tmp_path_factory.mktemp(path, numbered=False)

Now that the bugs were cleared out, It was time for improvement. This fixture used a lot of primitives to manage the Path data (such splitting on “/” to generate a list of directories to create). While it “worked” it wasn’t very expressive nor did it take advantage of Python’s native “Path” type that helps deal with working with file system paths. My next goal was to replace this logic with one using pathlib paths.

path = pathlib.Path(request.param)
paths = path.parents
is_module = path.suffix.lower() == ".py"


# Get the appropriate paths
if not is_module:
    # Hack to get all parts of the path
    paths = (path / "child").parents


# Build out source directory
if len(paths) > 1 and not isinstance(paths, str):
    # Minus 2 because of the implicit "." dir at the top of the parents list
    src = tmp_path_factory.mktemp(paths[len(paths)-2] , numbered=False)
    for p in list(reversed(paths))[2:]:
        src = src / p.stem
        src.mkdir()
else:
    src = tmp_path_factory.mktemp(path, numbered=False)


# Write out source file
if is_module:
    module = path.name
    src = src / module
    src.write_text(CONTENT)
else:
    main = src / "main.py"
    main.write_text(CONTENT)
return src

A lot had to change to keep the tests working, and for some of the sections (like the second condition) it certainly seemed a bit messier than before, but I was fairly confident this would pay off. One challenge of course was to get the name of the parent directory. The parents property of a Path returns a sequence of paths each one successively removing a parent directory until you reach the root. For a relative path (as we have in our tests), the root directory (or final item) is ".". This isn’t useful for creating out test paths so we needed the second to last item. This required some obtuse manipulations of the list by subtracting 2 from the index. At this point I threw in a simple comment to explain why this was needed.

The next task I wanted to handle was thatis_module variable. It is calculated at the top of the function and then used throughout. While it is nice to cache the results, eventually I wanted to split this function into smaller parts. I didn’t want to have to pass an is_module parameter to all the new functions that I was going to create. So I decided that it would be worth it to turn the variable into a call to a query function and just recalculate it when I needed it. Yes this would mean a slight performance hit, but the benefits of cleaning the code were well worth it.

def source_files(...)

    path = pathlib.Path(request.param)


    parents = path.parents
    # Get the appropriate paths
    if not _is_module(path):
        # Hack to get all parts of the path
        parents = (path / "child").parents


    # Build out source directory
    if len(parents) > 1 and not isinstance(parents, str):
        # Minus 2 because of the implicit "." dir at the top of the parents list
        src = tmp_path_factory.mktemp(parents[len(parents)-2] , numbered=False)
        for p in list(reversed(parents))[2:]:
            src = src / p.stem
            src.mkdir()
    else:
        src = tmp_path_factory.mktemp(path, numbered=False)


    # Write out source file
    if _is_module(path):
        module = path.name
        src = src / module
        src.write_text(CONTENT)
    else:
        main = src / "main.py"
        main.write_text(CONTENT)
    return src


def _is_module(path:pathlib.Path) -> bool:
    """Tests if path is a Python Module"""
    return path.suffix.lower() == ".py"

With that done I was finally able to extract part of the logic out into a separate function. The first part of the function deals with creating and setting up a directory to write out test file into. I decided that it would be worth encapsulating it into a function to precisely do that. It was as simple as copying and pasting the code into the body of a new function, and returning the generated src path.

def source_files
    path = pathlib.Path(request.param)
    src = _create_source_directory(tmp_path_factory, path)


    # Write out source file
    if _is_module(path):
        module = path.name
        src = src / module
        src.write_text(CONTENT)
    else:
        main = src / "main.py"
        main.write_text(CONTENT)
    return src


def _is_module(path:pathlib.Path) -> bool:
    """Tests if path is a Python Module"""
    return path.suffix.lower() == ".py"


def _create_source_directory(tmp_path_factory, path: pathlib.Path) -> pathlib.Path:
    parents = path.parents
    if not _is_module(path):
        # Hack to get all parts of the path
        parents = (path / "child").parents
    if len(parents) > 1 and not isinstance(parents, str):
        # Minus 2 because of the implicit "." dir at the top of the parents list
        src = tmp_path_factory.mktemp(parents[len(parents)-2] , numbered=False)
        for p in list(reversed(parents))[2:]:
            src = src / p.stem
            src.mkdir()
    else:
        src = tmp_path_factory.mktemp(path, numbered=False)
    return src

The first few lines in the new _create_source_directory function, are really just a way to generate the parents list. I decided to extract another function to represent that. This had the added benefit of isolating and “hiding” the hack we have written to ensure we get all the parents for paths that aren’t modules.

def _create_source_directory(tmp_path_factory, path: pathlib.Path) -> pathlib.Path:
    """Creates the source directory given by path"""
    parents = _parent_dirs(path)
    if len(parents) > 1 and not isinstance(parents, str):
        # Minus 2 because of the implicit "." dir at the top of the parents list
        src = tmp_path_factory.mktemp(parents[len(parents)-2] , numbered=False)
        for p in list(reversed(parents))[2:]:
            src = src / p.stem
            src.mkdir()
    else:
        src = tmp_path_factory.mktemp(path, numbered=False)
    return src


def _parent_dirs(path:pathlib.Path) -> typing.Sequence[pathlib.Path]:
    if not _is_module(path):
        # Hack to get all parts of the path
        return (path / "child").parents
    return path.parents

Now that I had extracted the code, I was able to realize that the else condition in _create_source_directory was dead. This is something that I never would have realized until I was able to disentangle the code from each other, by extracting these functions. Quickly deleting the useless branch really cleaned up that function.

def _create_source_directory(tmp_path_factory, path: pathlib.Path) -> pathlib.Path:
    """Creates the source directory given by path"""
    parents = _parent_dirs(path)

    # Minus 2 because of the implicit "." dir at the top of the parents list
    src = tmp_path_factory.mktemp(parents[len(parents)-2] , numbered=False)
    for p in list(reversed(parents))[2:]:
        src = src / p.stem
        src.mkdir()
    return src

At this point I contemplated the rest of the original fixture:

def source_files(
    tmp_path_factory,
    request,
):
    """
    Creates a temp directory with a source file

    This is a session scoped directory, so the files should be ONLY be read, and not modified.

    Returns: path to directory being used for test
    """
    path = pathlib.Path(request.param)
    src = _create_source_directory(tmp_path_factory, path)


    # Write out source file
    if _is_module(path):
        module = path.name
        src = src / module
        src.write_text(CONTENT)
    else:
        main = src / "main.py"
        main.write_text(CONTENT)
    return src

I wasn’t happy with the writing out of the test content being so duplicative. Unfortunately I couldn’t think of a good way to deduplicate that I liked. Rather than extract the function just to extract one I decided to leave the code as is. I was rather happy with the overall result of the refactoring.

Retrospective

If you made it through all of that, I commend you. When I set out to refactor I was intending to clean up my testing code to accommodate some new features. What surprised me was how much cleaning up my code helped me to find imperfections in the code. A simple review found two Bugs, and at least two other places where code was dead. Furthermore, these issues were not creating any visible manifestations of incorrect behavior, which means that without cleaning they would have persisted for years more to come. It is possible that at some point they could have hidden incorrect behavior that I assumed was being tested. Additionally, any new additions would have had to try to add more conditions and logic to try to comply with previous if statements that were meaningless, and unexercised. Cleaning my code not only helped communicate my ideas better, it also made the code more correct and improved my ability to maintain it.

Often, even some engineers, will balk at the idea of “Clean” code as merely the subjective opinions of overly pretentious engineers. “As long as the code works, who cares” is something that is often their watchword. However what this whole exercise showed me, and I hope showed you, is that hygienic code actually improves function.