I'm going to show you how a micro optimization can speed up your python code by a whopping 5%. 5%! It can also annoy anyone that has to maintain your code.

But really, this is about explaining code might you see occasionally see in the standard library or in other people's code. Let's take an example from the standard library, specifically the collections.OrderedDict class:

def __setitem__(self, key, value, dict_setitem=dict.__setitem__):
    if key not in self:
        root = self.__root
        last = root[0]
        last[1] = root[0] = self.__map[key] = [last, root, key]
    return dict_setitem(self, key, value)

Notice the last arg: dict_setitem=dict.__setitem__. It makes sense if you think about it. To associate a key with a value, you'll need to provide a __setitem__ method which takes three arguments: the key you're setting, the value associated with the key, and the __setitem__ class method to the built in dict class. Wait. Ok maybe the last argument makes no sense.

Scope Lookups

To understand what's going on here, we need to take a look at scopes. Let's start with a simple question, if I'm in a python function, and I encounter something named open, how does python go about figuring out the value of open?

# <GLOBAL: bunch of code here>

def myfunc():
    # <LOCAL: bunch of code here>
    with open('foo.txt', 'w') as f:
        pass

The short answer is that without knowing the contents of the GLOBAL and the LOCAL section, you can't know for certain the value of open. Conceptually, python checks three namespaces for a name (ignoring nested scopes to keep things simple):

  • locals
  • globals
  • builtin

So in the myfunc function, if we're trying to find a value for open, we'll first check the local namespace, then the globals namespace, then the builtins namespace. And if open is not defined in any namespace, a NameError is raised.

Scope Lookups, the Implementation

The lookup process above is just conceptual. The implementation of this lookup process gives us room to exploit the implementation.

def foo():
    a = 1
    return a

def bar():
    return a

def baz(a=1):
    return a

Let's look at the bytecode of each function:

>>> import dis
>>> dis.dis(foo)
  2           0 LOAD_CONST               1 (1)
              3 STORE_FAST               0 (a)

  3           6 LOAD_FAST                0 (a)
              9 RETURN_VALUE


>>> dis.dis(bar)
  2           0 LOAD_GLOBAL              0 (a)
              3 RETURN_VALUE


>>> dis.dis(baz)
  2           0 LOAD_FAST                0 (a)
              3 RETURN_VALUE

Look at the differences between foo and bar. Right away we can see that at the bytecode level python has already determined what's a local variable and what is not because foo is using LOAD_FAST and bar is using LOAD_GLOBAL.

We won't get into the details of how python's compiler knows when to emit which bytecode (perhaps that's another post), but suffice to say python knows which type of lookup it needs to perform when it executes a function.

One other thing that can be confusing is that LOAD_GLOBAL is used for lookups in the global as well as the builtin namespace. You can think of this as "not local", again ignoring the issue of nested scopes. The C code for this is roughly [1]:

case LOAD_GLOBAL:
    v = PyObject_GetItem(f->f_globals, name);
    if (v == NULL) {
        v = PyObject_GetItem(f->f_builtins, name);
        if (v == NULL) {
            if (PyErr_ExceptionMatches(PyExc_KeyError))
                format_exc_check_arg(
                            PyExc_NameError,
                            NAME_ERROR_MSG, name);
            goto error;
        }
    }
    PUSH(v);

Even if you've never seen any of the C code for CPython, the above code is pretty straightforward. First, check if the key name we're looking for is in f->f_globals (the globals dict), then check if the name is in f->f_builtins (the builtins dict), and finally, raise a NameError if both checks failed.

Binding Constants to the Local Scope

Now when we look at the initial code sample, we can see that the last argument is binding a function into the local scope of a function. It does this by assigning a value, dict.__setitem__, as the default value of an argument. Here's another example:

def not_list_or_dict(value):
  return not (isinstance(value, dict) or isinstance(value, list))

def not_list_or_dict(value, _isinstance=isinstance, _dict=dict, _list=list):
  return not (_isinstance(value, _dict) or _isinstance(value, _list))

We're doing the same thing here, binding what would normally be objects that are in the builtin namespace into the local namespace instead. So instead of requiring the use of LOAD_GLOBAL (a global lookup), python instead will use LOCAL_FAST. So how much faster is this? Let's do some crude testing:

$ python -m timeit -s 'def not_list_or_dict(value): return not (isinstance(value, dict) or isinstance(value, list))' 'not_list_or_dict(50)'
1000000 loops, best of 3: 0.48 usec per loop
$ python -m timeit -s 'def not_list_or_dict(value, _isinstance=isinstance, _dict=dict, _list=list): return not (_isinstance(value, _dict) or _isinstance(value, _list))' 'not_list_or_dict(50)'
1000000 loops, best of 3: 0.423 usec per loop

Or in other words, that's an 11.9% improvement [2]. That's way more than the 5% I promised at the beginning of this post!

There's More to the Story

It's reasonable to think that the speed improvment is because LOAD_FAST reads from the local namespace whereas LOAD_GLOBAL will first check the global namespace before falling back to checking the builtin namespace. And in the example function above, isinstance, dict, and list all come from the built in namespace.

However, there's more going on. Not only are we able to skip additional lookup with LOAD_FAST, it's also a different type of lookup.

The C code snippet above showed the code for LOAD_GLOBAL, but here's the code for LOAD_FAST:

case LOAD_FAST:
    PyObject *value = fastlocal[oparg];
    if (value == NULL) {
        format_exc_check_arg(PyExc_UnboundLocalError,
                             UNBOUNDLOCAL_ERROR_MSG,
                             PyTuple_GetItem(co->co_varnames, oparg));
        goto error;
    }
    Py_INCREF(value);
    PUSH(value);
    FAST_DISPATCH();

We're retrieving the local value by indexing into an array. It's not shown here, but oparg is just an index into that array.

Now it's starting to make sense. In our first version not_list_or_dict had to perform 4 lookups, and each name was in the builtins namespace which we only look at after looking in the globals namespace. That's 8 dictionary key lookups. Compare that to directly indexing into a C array 4 times, which is what happens in the second version of not_list_or_dict, which all use LOAD_FAST under the hood. This is why lookups in the local namespace are faster.

Wrapping Up

Now the next time you see this in someone else's code you'll know what's going on.

And one final thing. Please don't actually do these kinds of optimizations unless you really need to. And most of the time you don't need to. But when the time really comes, and you really need to squeeze out every last bit of performance, you'll have this in your back pocket.

Footnotes

[1]Though keep in mind that I removed some performance optimizations in the above code to make it simpler to read. The real code is slightly more complicated.
[2]On a toy example for a function that doesn't really do anything interesting nor does it perform any IO and is mostly bound by the python VM loop.

To gain more confidence in production code, tests at varying levels accompany the development of such code. Unittests and integration tests are common, and sometimes suites of functional/acceptance tests are present as well. There's been quite a bit of discussion surrounding how to best write unittests (especially if you're using some xUnit framework variant), and sometimes these best practices even make their way into integration tests. Unfortunately, I still have not seen these best practices make their way into functional tests. In particular, I frequently see functional tests that are either very complicated, and/or obscure.

Before going on, it's worth voicing why I think complicated and obscure tests are bad. Typically production code is complicated. The domain is filled with special cases and unintuitive behavior. In order to help write production code, unittests are written to help make sure the production code works as we expect (in addition to driving the design if you use TDD). There's a secret so fundamental to writing unittests, yet it's rarely explicitly called out:

We don't test our tests!

Think about that for a moment. We have complicated production code. We write tests to ensure that the production code works as expected. What tests the test? Well if you have really complicated test logic, then are you really better off because you've written tests? Testcases are only effective if we can assume they are correct, in which case they are correctly asserting things about the system under test. If testcases are incorrect, that is, they are asserting incorrect things about the system, then the validity of the system under test is unknown, and you're arguably worse off than before (is a test fail really a fail?).

So what's the solution for this? Well, test the tests of course! And what if those tests are still complicated? Well, we'll test those as well! In fact it's tests all the way down. In order to avoid this recursion, we have to set a practical limitation to testcases: they must be simple enough to not require tests. In practice, this typically means two things:

  • No conditionals
  • Small in length

Now, how to avoid these two things (or how to replace these two things with better alternatives) is the topic on it's own. What I'm interested in for this post is why the trend of simple and small tests has still not made it's way into the realm of functional testing.

If you look at any (reasonable) code base's unittests, they're typically not so bad. They're short, they're small, you can read a test and more or less understand what it's testing. If this code base contains functional tests (and by functional tests, I basically mean any test that simulates interacting with the system in similar way to how a user would interact with the system), you'll typically see these characteristics instead:

  • Long
  • Complicated logic
  • Hard to understand
  • Fragile, failures aren't really failures sometimes
  • Terrible defect localization

Why is this? Is it because functional testing is still progressing towards what unittests currently are? Is it because of a lack of consensus that functional testing should have the same characteristics of unittests? Is it because writing small tests with no conditionals is hard to do at such a high level?

My take on this is that I believe writing good functional tests is much harder than writing unittests, and because of this, best practices are typically ignored. Fortunately, I think this situation can be remedied, and in the next post, I'll show several things you can do to achieve more succinct tests.