Now that fakeredis 0.3.0 is out I think it's a good time to discuss the finer points of fakeredis, and why you should consider using it for your redis unit testing needs.

What exactly is fakeredis? Other than the pedantic naming of "fake" instead of "mock", it is an in-memory implementation of the python redis client. This allows you to write tests that use the redis-py client interface without having a redis server running.
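As a quick illustration (a minimal sketch; the key and value here are arbitrary), code written against the redis-py interface works unchanged:

import fakeredis

r = fakeredis.FakeStrictRedis()
r.set('foo', 'bar')
print r.get('foo')  # prints 'bar', and no redis server is required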

Setting up redis is not hard; even compiling from source is easy (there isn't even a ./configure step!). But unit tests should require no configuration to run: someone should be able to check out your repo and immediately run your unit tests.

There's one big problem with writing fakes:

How do you know your fake implementation matches the real implementation?

Fakeredis verifies this in a simple way. First, there are unit tests for fakeredis. Then, for every one of those unit tests, there's an equivalent integration test that actually talks to a real redis server. This ensures that every single test for fakeredis exercises the exact same behavior as real redis. There's nothing worse than writing unit tests against a fake implementation only to find out that the real implementation behaves differently!

In fakeredis, this is implemented with a factory method pattern: the fakeredis tests instantiate a fakeredis.FakeStrictRedis instance, while the real redis integration tests instantiate a redis.StrictRedis instance:

import unittest

import redis
import fakeredis


class TestFakeRedis(unittest.TestCase):
    def setUp(self):
        self.redis = self.create_redis()

    def create_redis(self, db=0):
        return fakeredis.FakeStrictRedis(db=db)

    def test_set_then_get(self):
        self.assertEqual(self.redis.set('foo', 'bar'), True)
        self.assertEqual(self.redis.get('foo'), 'bar')


class TestRealRedis(TestFakeRedis):
    def create_redis(self, db=0):
        return redis.StrictRedis('localhost', port=6379, db=db)

Now every test written in the TestFakeRedis class will be automatically run against both a FakeRedis instance and a Redis instance, ensuring parity between the two.

This also makes things easier for contributors. If they notice an inconsistency between fakeredis and redis, they only need to write a single test, and they'll have a simple repro that passes against real redis but fails against fakeredis.
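For example, a contributor who suspected that list commands behaved differently could add a single (hypothetical) test to the TestFakeRedis class above, and it would automatically run against both implementations:

def test_lpush_prepends(self):
    self.redis.lpush('mylist', 'a')
    self.redis.lpush('mylist', 'b')
    # LPUSH prepends, so 'b' should come back first.
    self.assertEqual(self.redis.lrange('mylist', 0, -1), ['b', 'a'])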

And finally, test coverage. Every single implemented command in fakeredis has test cases, and I only accept contributions for bug fixes/new features if they have tests. I normally don't worry about actual coverage numbers, but out of curiosity I checked what those numbers actually were:

$ coverage report fakeredis.py
Name        Stmts   Miss  Cover
-------------------------------
fakeredis     640     19    97%

Not bad. Most of the missing lines are either unimplemented commands (pass statements counted as missing coverage) or precondition checks such as:

def zadd(self, name, *args, **kwargs):
    # ...
    if len(args) % 2 != 0:
        raise redis.RedisError("ZADD requires an equal number of "
                               "values and scores")

I do plan on going through and ensuring that all of these precondition checks have tests.
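A test for the zadd precondition above might look like this sketch (a hypothetical addition to the TestFakeRedis class, so that it runs against both fakeredis and real redis; the key and member names are arbitrary):

def test_zadd_uneven_args_raises(self):
    # An odd number of positional args means a member is missing a score.
    with self.assertRaises(redis.RedisError):
        self.redis.zadd('myset', 'member_without_a_score')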

So the next time you're looking for a fake implementation of redis, consider fakeredis.


A new release of fakeredis is out. This 0.3.0 release adds:

  • Support for redis 2.6.
  • Improved support for pipelines/watch/multi/exec (see the sketch below).
  • Full support for variadic commands.
  • Better consistency with the actual behavior of redis.

And of course, a handful of bug fixes. This release was tested against:

  • redis 2.6.4
  • redis-py 2.6.2
  • python 2.7.3, 2.6
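As a quick illustration of the improved pipeline and variadic command support (a minimal sketch; the key names are arbitrary):

import fakeredis

r = fakeredis.FakeStrictRedis()
r.sadd('colors', 'red', 'green', 'blue')  # variadic command
pipe = r.pipeline()  # MULTI/EXEC under the hood
pipe.set('foo', 'bar')
pipe.smembers('colors')
print pipe.execute()  # roughly: [True, set(['red', 'green', 'blue'])]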

You can install fakeredis via pip install fakeredis.


MY PYTHON CODE ISN'T WORKING!! We've all been there, right? This is a series where I'll share miscellaneous tips I've learned for troubleshooting python code. It's aimed at people who are relatively new to python. In this first post, I'd like to cover one of those common things you'll run into: the traceback.

Reading Python Tracebacks

Many times an error in python code is accompanied by a traceback. If you want to get really good at troubleshooting python programs, you'll need to become really comfortable with reading a traceback. You should be able to look at a traceback and have a general idea of what's happening in the traceback. One of the things I always notice when working with people new to python is how puzzled they look when they first see tracebacks.

So let's work through an example. Consider this script:

import httplib2


def a():
    b()


def b():
    c()

def c():
    d()

def d():
    h = httplib2.Http()
    h.request(uri=None)


a()

When this script is run we get this traceback:

Traceback (most recent call last):
  File "issue.py", line 19, in <module>
    a()
  File "issue.py", line 5, in a
    b()
  File "issue.py", line 9, in b
    c()
  File "issue.py", line 12, in c
    d()
  File "issue.py", line 16, in d
    h.request(uri=None)
  File "/Users/jsaryer/.virtualenvs/90a/lib/python2.7/site-packages/httplib2/__init__.py", line 1394, in request
    (scheme, authority, request_uri, defrag_uri) = urlnorm(uri)
  File "/Users/jsaryer/.virtualenvs/90a/lib/python2.7/site-packages/httplib2/__init__.py", line 206, in urlnorm
    (scheme, authority, path, query, fragment) = parse_uri(uri)
  File "/Users/jsaryer/.virtualenvs/90a/lib/python2.7/site-packages/httplib2/__init__.py", line 202, in parse_uri
    groups = URI.match(uri).groups()
TypeError: expected string or buffer

While this can look intimidating at first, there are a few basic things to remember when reading a traceback:

  • The oldest frame in the stack is at the top, and the newest frame is at the bottom. This means that the bottom of the traceback output is where the uncaught exception was originally raised. This is the opposite of other languages such as java and c/c++ where the first line shows the newest frame (the frame where the uncaught exception originated).
  • Pay attention to the filenames associated with each level of the traceback, and pay attention where the frames jump across modules and package "types" (more on this later).
  • Read the bottom most line to read the actual exception message.
  • Above all, remember that the traceback alone may not be sufficient to understand what went wrong.

So let's see how we can apply these steps to the traceback above, starting with the first item: the stack frames go from the oldest frame at the top of the output to the newest frame at the bottom. To be absolutely clear, in the above code the call chain is: a() -> b() -> c() -> d() -> httplib2.Http.request. The oldest stack frame is associated with the a() function call (it's the call that triggered all the remaining calls), and the newest stack frame is for httplib2.Http.request (it's the call that actually triggered the exception being raised). Conceptually, you can think of a python traceback as growing downwards: any time a frame is pushed onto the stack, a new entry is appended to the output, and when a frame is popped off the stack, its entry is removed from the end of the output.
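You can see this ordering for yourself with the standard library's traceback module (a minimal sketch; the function names are arbitrary):

import traceback


def inner():
    # Print the current call stack: oldest frame first, newest last.
    traceback.print_stack()


def outer():
    inner()


outer()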

Now let's apply the second item: pay attention to the filenames associated with each level of the traceback. Right off the bat we can see there are two main modules involved in this interaction. There's the issue module, which looks like this in the traceback:

File "issue.py", line 19, in <module>
  a()
File "issue.py", line 5, in a
  b()

and there's httplib2, which looks like this in the traceback:

File "/Users/jsaryer/.virtualenvs/90a/lib/python2.7/site-packages/httplib2/__init__.py", line 1394, in request
  (scheme, authority, request_uri, defrag_uri) = urlnorm(uri)

There are a few important observations:

  • The length of the filenames. In this case the relative issue.py path suggests that this code lives in our current working directory.
  • The error actually occurs in a 3rd party library (the last three lines of the output from the traceback).

We know that the error occurs in a 3rd party library because the location of this library is under the "site-packages" directory. As a rule of thumb, if something is under the "site-packages" directory, it's a third party module (i.e. not something in the python standard library). This is typically where packages installed via pip are placed (e.g. pip install httplib2).
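A quick way to check where a module comes from is its __file__ attribute (a sketch from the interactive shell; the exact paths will vary by system):

>>> import os
>>> os.__file__
'/usr/local/lib/python2.7/os.pyc'
>>> import httplib2
>>> httplib2.__file__
'/Users/jsaryer/.virtualenvs/90a/lib/python2.7/site-packages/httplib2/__init__.pyc'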

The second item also says to pay attention to where the frames jump across modules or package "types." In this traceback we can see the jump across modules and package "types" here:

File "issue.py", line 16, in d
  h.request(uri=None)
File "/Users/jsaryer/.virtualenvs/90a/lib/python2.7/site-packages/httplib2/__init__.py", line 1394, in request
  (scheme, authority, request_uri, defrag_uri) = urlnorm(uri)

In these four lines we can see that we jump from issue.py to httplib2. By jumping across package "types", I simply mean where we jump from our modules/packages to either standard library packages or 3rd party packages. From the four lines shown above we can see that by calling h.request() we jump into the httplib2 module.

Now let's apply the third item: Read the bottom most line to read the actual exception message. In our example, the actual exception that's raised is:

TypeError: expected string or buffer

Admittedly, not the most helpful error message. If we look at the line before this line, we can see the actual line that caused this TypeError:

groups = URI.match(uri).groups()

The two most likely culprits for a TypeError here are the call to match() and the call to groups(). Noticing that the uri argument appears in multiple frames of the traceback, our first guess would be that the value of uri is causing the TypeError. If we walk the traceback from the bottom up until the uri parameter is no longer mentioned, we can see that it's first introduced here:

File "issue.py", line 16, in d
  h.request(uri=None)
File "/Users/jsaryer/.virtualenvs/90a/lib/python2.7/site-packages/httplib2/__init__.py", line 1394, in request
  (scheme, authority, request_uri, defrag_uri) = urlnorm(uri)

Given that the h.request(uri=None) comes from our code, this is probably the first place we should look.

It turns out that the uri parameter needs to be a string:

import httplib2

h = httplib2.Http()
response, content = h.request(uri='http://www.google.com')

Now, it doesn't always work out as nicely as this, but having a basic example helps to serve as a basis for further debugging techniques.


Semidbm is a pure python dbm. While the docs go into the specifics of how to use the dbm, I'd like to offer a more editorialized view of semidbm (the why of semidbm).

Semidbm is a pure python dbm, which is basically a key value store. Similar python modules in the standard library include gdbm, bsddb, and dumbdbm.
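Like those modules, semidbm exposes a dbm style interface (a minimal sketch; the filename is arbitrary):

import semidbm

db = semidbm.open('mydb', 'c')  # 'c' creates the db if it doesn't exist
db['key'] = 'value'
print db['key']
db.close()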

The first question one might ask is:

Another persistent key value store, really?

Fair question.

It all started when I was working on a project where I needed a simple key value store, accessible from python. Technically, I was using the shelve module, and it decided to use the Berkeley DB (via anydbm). So far so good. But there were a few issues:

  • Not everyone has the Berkeley DB python bindings installed. Or in general, dbms that are based on C libraries have varying availability on people's systems.
  • Not all dbms perform equally.
  • Not all dbms are portable.

C based DBMs and their availability

The first issue is regarding availability. Not all python installations are the same. Just because a user has python installed does not mean they necessarily have all the standard libraries installed. I just checked my python install on my Macbook, and I don't have the bsddb module available. On my debian system I don't have the gdbm module installed. Given that these packages are just python bindings to C based dbms, installing these packages involves:

  • Install the C libraries and development packages for the appropriate dbm.
  • Have a development environment that can build python.
  • Rebuild python.

None of these steps are that much work, but are there any alternatives?

Not all dbms perform equally

On all of my systems I have the dbm module available. This is a C based DBM that seems to be available on most python installations. How fast is it? There's a scripts/benchmark script available in the semidbm repo that can benchmark any dbm-like module. Here are the results for the dbm module:

$ scripts/benchmark -d dbm
Generating random data.
Benchmarking: <module 'dbm' from
'/Users/jsaryer/.virtualenvs/semidbm/lib/python2.7/lib-dynload/dbm.so'>
    num_keys  : 1000000
    key_size  : 16
    value_size: 100
HASH: Out of overflow pages.  Increase page size

ERROR: exception caught when benchmarking <module 'dbm' from '/Users/jsaryer/.virtualenvs/semidbm/lib/python2.7/lib-dynload/dbm.so'>: cannot add item to database

Or in other words, it made it to about 450000 of the 1000000 keys before this error was generated. So storing a large number of keys doesn't seem possible with python's dbm module.

Not all dbms are portable

While some dbms that aren't available simply require compiling/installing the right packages and files, there are some dbms that just aren't available on certain platforms (notoriously windows).

Well fortunately, there's a fallback python module that's guaranteed to be available on every single python installation: dumbdbm.

Unfortunately, the performance is terrible. There are also a number of undesirable qualities:

  • When a key is added to the DB, the data file is updated but the index file is not, which means the data file and the index file are out of sync. If python crashes, any newly added/updated keys are lost.
  • Every deletion writes out the entire index. This makes deletions painfully slow (O(n)).

To be fair, dumbdbm was most likely written as a last resort fallback to the more classic dbms. It's also really old (written by Guido himself if I remember correctly).

A key value store with modest aspirations

Hopefully the goals of semidbm are becoming clearer. I just wanted a dbm that was:

  1. Portable
  2. Easily installable
  3. Reasonable performance and semantics

The first two points I felt I could achieve by simply using python, and not requiring any C libraries or C extensions.

The third point I felt I could improve by taking dumbdbm and making some minor improvements.

So that's the background of semidbm.

Can simpler really be better?

I think so. The benchmark page has more details regarding the performance, but as a quick comparison between semidbm and dumbdbm:

$ scripts/benchmark -d semidbm -n 10000
Generating random data.
Benchmarking: <module 'semidbm'>
    num_keys  : 10000
    key_size  : 16
    value_size: 100
fill_sequential     : time:     0.126,   micros/ops:    12.597,   ops/s:  79382.850,  MB/s:      8.782
read_hot            : time:     0.041,   micros/ops:     4.115,   ops/s: 243036.754,  MB/s:     26.886
read_sequential     : time:     0.039,   micros/ops:     3.861,   ops/s: 258973.197,  MB/s:     28.649
read_random         : time:     0.042,   micros/ops:     4.181,   ops/s: 239171.571,  MB/s:     26.459
delete_sequential   : time:     0.058,   micros/ops:     5.819,   ops/s: 171856.854,  MB/s:     19.012

$ scripts/benchmark -d dumbdbm -n 10000
Generating random data.
Benchmarking: <module 'dumbdbm'>
    num_keys  : 10000
    key_size  : 16
    value_size: 100
fill_sequential     : time:     1.824,   micros/ops:   182.400,   ops/s:   5482.447,  MB/s:      0.607
read_hot            : time:     0.165,   micros/ops:    16.543,   ops/s:  60450.332,  MB/s:      6.687
read_sequential     : time:     0.167,   micros/ops:    16.733,   ops/s:  59762.818,  MB/s:      6.611
read_random         : time:     0.175,   micros/ops:    17.505,   ops/s:  57126.529,  MB/s:      6.320
delete_sequential   : time:    99.025,   micros/ops:  9902.522,   ops/s:    100.984,  MB/s:      0.011

From the output above, writes are an order of magnitude faster (and semidbm computes and writes out a checksum for every value), and reads are over four times faster. Deletion performance is dramatically better (0.058 seconds vs. 99.025 seconds to delete 10000 keys).

Also, every single insertion/update/deletion is immediately written out to disk, so if python crashes, at worst you'd lose one key: the key that was being written out to disk when python crashed.

Why you should use semidbm

I think if you ever need to use a pure python dbm, semidbm is a great choice. Any time you'd otherwise have to use dumbdbm, use semidbm instead.

Future plans for semidbm

There's a number of things I'd like to investigate in the future:

  • Faster db loading. Semidbm needs to read the entire data file to load the db. There's potential to speed this up.
  • Caching reads. Looking at the implementation of other dbms, many of them have some type of in memory cache to improve read performance.
  • Support for additional db methods. Semidbm does not support all of the dict methods.
  • Batch writes/reads. Due to the append only nature of the file format, this could substantially improve write performance.

For more info, check out the docs and the github repo.


Learning a new programming language can be a daunting task. Even though you start with the basic things like syntax, in order to become productive in the language you must learn things like

  • Common coding idioms and patterns
  • The standard library
  • Best practices (including what frameworks to use, what development tools to use, etc)

But then there's also the, for lack of a better term, "extra stuff": the collection of miscellaneous tips and tricks you pick up while coding in the language on a day to day basis. These tips end up saving you a lot of time in the long run, but it's hard to tell how useful a tip really is when you first hear about it.

Well, this is my list of tips. It's not 100% complete, and focuses mostly on various tidbits of information that, when I think about how I code on a day to day basis, I find myself repeatedly doing.

The _ variable

This tip is useful when you're in an interactive python shell. The _ variable stores the value of the most recently evaluated expression:

>>> 1 + 2 + 3
6
>>> _ * 24
144
>>> _ / 12.
12.0
>>> [i for i in range(100) if i < 10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> len(_)
10

Figuring Out Where to Look

Sometimes when you're trying to debug a problem you'll need to figure out where a module is located. A really easy way to do this is to use the __file__ attribute of a module object:

>>> import httplib
>>> httplib.__file__
'/usr/local/lib/python2.7/httplib.pyc'

You can also use inspect.getfile(obj) to find where an object is located.
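For example (a quick sketch; the path will vary by system):

>>> import inspect
>>> import httplib
>>> inspect.getfile(httplib.HTTPConnection)
'/usr/local/lib/python2.7/httplib.pyc'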

Running Your Module as a Script

Every module will have a __name__ attribute, but the value of that attribute will depend on how the module is executed. Consider a module foo.py containing the single line:

print __name__

When the module is imported the name will be "foo".

>>> import foo
foo
>>>

However, when the module is executed as a script, the name will be "__main__":

$ python foo.py
__main__

It may not be obvious how this is useful. The way that this is typically used is to allow a module to be both imported and used as a script. Sometimes the script is a command line interface to the functionality available in the module. Sometimes the script provides a demo of the capabilities of the module. And sometimes the script runs any tests that live in the module (for example all of the doctests). To use this in your own library you can use something like this:

import sys


def parse_args(argv):
    # Parse command line arguments (details elided).
    return argv[1:]


def do_something(args):
    # Do something with args.
    pass


def main(argv=None):
    if argv is None:
        argv = sys.argv
    args = parse_args(argv)
    do_something(args)


if __name__ == '__main__':
    sys.exit(main())

The main() function is only called when the module is run directly.
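Another common ifmain pattern mentioned above is running a module's doctests when it's executed as a script (a minimal sketch):

if __name__ == '__main__':
    import doctest
    doctest.testmod()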

The -m option

Once your module has an if __name__ == '__main__' clause (I usually refer to this as just the ifmain clause), an easy way to invoke the module is to use the -m option of python. This allows you to refer to a module by its import name rather than its specific path. In the previous example the foo.py module could be run using:

$ python -m foo

One final thing worth pointing out is that many modules in python's stdlib have useful ifmain functionality. A few notable ones include:

python -m SimpleHTTPServer

This will serve the current working directory on port 8000. I use this command on almost a daily basis. From quickly downloading files to viewing html files on a remote server, this is one of the most useful ifmain clauses in the entire python standard library.

python -m pdb myfile.py

Run a python script via pdb (the python debugger).

python -m trace --trace myfile.py

Print each line to stdout before it's executed. Be sure to see the help of the trace module; there are a lot of useful options besides printing each line being executed.

python -m profile myfile.py

Profile myfile.py and print out a summary.

So there it is. My list of tips. In the future I plan on expanding on some of these tips in more depth (the profiling workflow for python code and how to debug python code stand out), but in the meantime, may these tips be as helpful to you as they are to me.