I've just released 0.4.0 of semidbm. This release includes a number of really cool features. See the full changelog for more details.

One of the biggest features is python 3 support. I was worried about introducing a performance regression by supporting python 3. Fortunately, that didn't happen.

In fact, performance increased. This was possible for a number of reasons. First, the index file and data file were combined into a single file. This means that a __setitem__ call results in only a single write() call. Also, semidbm now uses a binary format. This results in a more compact form, and it's easier to create the sequence of bytes we need to write out to disk. All of this is despite the fact that semidbm now includes checksum data for each write that occurs.
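To make the "binary format plus checksum in a single write" idea concrete, here is a minimal sketch. The record layout (two length fields, the key, the value, then a CRC32) and the names pack_record and unpack_record are assumptions made up for illustration; this is not semidbm's actual on-disk format.

```python
import struct
import zlib

# Hypothetical layout: key length, value length, key, value, CRC32 of key+value.
HEADER = struct.Struct("!II")
CHECKSUM = struct.Struct("!I")


def pack_record(key: bytes, value: bytes) -> bytes:
    """Build one record as a single byte string, ready for one write() call."""
    crc = zlib.crc32(key + value) & 0xFFFFFFFF
    return HEADER.pack(len(key), len(value)) + key + value + CHECKSUM.pack(crc)


def unpack_record(buf: bytes):
    """Parse a record from the start of buf; raise ValueError on a bad checksum."""
    key_len, val_len = HEADER.unpack_from(buf, 0)
    start = HEADER.size
    key = buf[start:start + key_len]
    value = buf[start + key_len:start + key_len + val_len]
    (stored,) = CHECKSUM.unpack_from(buf, start + key_len + val_len)
    if zlib.crc32(key + value) & 0xFFFFFFFF != stored:
        raise ValueError("checksum mismatch")
    return key, value


record = pack_record(b"foo", b"bar")
print(unpack_record(record))
```

Because the whole record is one bytes object, a store built this way could hand it to a single f.write(record) call, which is the property the paragraph above describes.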

Try it out for yourself.

What's Next?

I think at this time, semidbm has more than exceeded its original goal, which was to be a pure python cross platform key value storage that had reasonable performance. So what's next for semidbm? In a nutshell, higher level abstractions (aka the "fun stuff"): code that builds on the simple key value storage of semidbm.db and provides additional features. And as we get higher level, I think it makes sense to reevaluate the original goals of semidbm and whether or not it makes sense to carry those goals forward:

  • Cross platform. I'm inclined to not support windows for these higher level abstractions.
  • Pure python. I think the big reason for remaining pure python was for ease of installation. Especially on windows, pip installing a package should just work. With C extensions, this becomes much harder on windows. If semidbm isn't going to support windows for these higher level abstractions, then C extensions are fair game.

Some ideas I've been considering:

  • A C version of _Semidbm.
  • A dict like interface that is concurrent (possibly single writer multiple reader).
  • A sorted version of semidbm (supporting things like range queries).
  • Caching reads (need an efficient LRU cache).
  • Automatic background compaction of data file.
  • Batched writes
  • Transactions
  • Compression (I played around with this earlier. Zlib turned out to be too slow for the smaller sized values (~100 bytes) but it might be worth being able to configure this on a per db basis.)
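As an illustration of the "caching reads" idea, here is one possible sketch of an LRU read cache layered over any dict-like store. The class name LRUReadCache and its capacity parameter are hypothetical, and nothing here is part of semidbm; it only shows the general technique using collections.OrderedDict.

```python
from collections import OrderedDict


class LRUReadCache:
    """Wrap a dict-like store and keep the most recently read values in memory."""

    def __init__(self, db, capacity=1024):
        self._db = db
        self._capacity = capacity
        self._cache = OrderedDict()

    def __getitem__(self, key):
        if key in self._cache:
            # Cache hit: mark the key as most recently used.
            self._cache.move_to_end(key)
            return self._cache[key]
        value = self._db[key]  # cache miss: fall through to the store
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used key
        return value

    def __setitem__(self, key, value):
        self._db[key] = value
        self._cache.pop(key, None)  # drop any stale cached value

    def __delitem__(self, key):
        del self._db[key]
        self._cache.pop(key, None)


backing = {b"a": b"1", b"b": b"2", b"c": b"3"}
cached = LRUReadCache(backing, capacity=2)
for k in (b"a", b"b", b"c"):
    cached[k]  # reading b"c" evicts b"a"
```

A plain dict stands in for the underlying db here; any object with the same item access methods would work the same way.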

Semidbm is a pure python dbm. While the docs go into the specifics of how to use the dbm, I'd like to offer a more editorialized view of semidbm (the why of semidbm).

A dbm is basically a key value store. Similar python modules in the standard library include gdbm, bsddb, and dumbdbm.

The first question one might ask is:

Another persistent key value store, really?

Fair question.

It all started when I was working on a project where I needed a simple key value store, accessible from python. Technically, I was using the shelve module, and it decided to use the Berkeley DB (via anydbm). So far so good. But there were a few issues:

  • Not everyone has the Berkeley DB python bindings installed. Or in general, dbms that are based on C libraries have varying availability on people's systems.
  • Not all dbms perform equally.
  • Not all dbms are portable.

C based DBMs and their availability

The first issue is regarding availability. Not all python installations are the same. Just because a user has python installed does not mean they necessarily have all the standard libraries installed. I just checked my python install on my Macbook, and I don't have the bsddb module available. On my debian system I don't have the gdbm module installed. Given that these packages are just python bindings to C based dbms, installing these packages involves:

  • Install the C libraries and development packages for the appropriate dbm.
  • Have a development environment that can build python.
  • Rebuild python

None of these steps are that much work, but are there any alternatives?
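To see which dbm bindings a particular interpreter actually has, a quick probe like the following can help. The helper name available_modules is made up for this sketch. The first list uses the python 2 module names mentioned in this post; the second uses the python 3 names (dbm.gnu, dbm.ndbm, dbm.dumb).

```python
import importlib


def available_modules(names):
    """Return the subset of module names that can be imported here."""
    found = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # binding not built or not installed on this system
        found.append(name)
    return found


PY2_NAMES = ["gdbm", "bsddb", "dbm", "dumbdbm"]
PY3_NAMES = ["dbm.gnu", "dbm.ndbm", "dbm.dumb"]

print(available_modules(PY2_NAMES + PY3_NAMES))
```

The output varies from machine to machine, which is exactly the availability problem described above.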

Not all dbms perform equally

On all of my systems I have the dbm module available. This is a C based DBM that seems to be available on most python installations. How fast is it? There's a scripts/benchmark script available in the semidbm repo that can benchmark any dbm like module. Here are the results for the dbm module:

$ scripts/benchmark -d dbm
Generating random data.
Benchmarking: <module 'dbm' from
'/Users/jsaryer/.virtualenvs/semidbm/lib/python2.7/lib-dynload/dbm.so'>
    num_keys  : 1000000
    key_size  : 16
    value_size: 100
HASH: Out of overflow pages.  Increase page size

ERROR: exception caught when benchmarking <module 'dbm' from '/Users/jsaryer/.virtualenvs/semidbm/lib/python2.7/lib-dynload/dbm.so'>: cannot add item to database

Or in other words, it made it to about 450000 keys before this error was generated. So storing a large number of keys doesn't seem possible with python's dbm module.

Not all dbms are portable

While some dbms that aren't available simply require compiling/installing the right packages and files, there are some dbms that just aren't available on certain platforms (notoriously windows).

Well fortunately, there's a fallback python module that's guaranteed to be available on every single python installation: dumbdbm.

Unfortunately, the performance is terrible. There's also a number of undesirable qualities:

  • When a key is added to the DB, the data file is updated, but the index file is not, which means the data file and the index file are out of sync. If python crashes, any newly added/updated keys are lost.
  • Every deletion writes out the entire index. This makes deletions painfully slow (O(n)).
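A rough back-of-the-envelope sketch shows why rewriting the whole index on every deletion hurts. It counts bytes written when deleting every key, first with a full index rewrite per delete and then with a hypothetical append-only design that writes one small marker per delete. The 32 byte entry size and both function names are assumptions picked for illustration, not measurements of dumbdbm.

```python
def bytes_written_rewriting_index(num_keys, entry_size=32):
    """Each delete rewrites every remaining index entry: O(n) per delete."""
    total = 0
    remaining = num_keys
    for _ in range(num_keys):
        remaining -= 1
        total += remaining * entry_size
    return total


def bytes_written_appending(num_keys, entry_size=32):
    """Each delete appends one fixed-size marker: O(1) per delete."""
    return num_keys * entry_size


n = 10000
rewrite_total = bytes_written_rewriting_index(n)
append_total = bytes_written_appending(n)
print(rewrite_total, append_total, rewrite_total / append_total)
```

Under these assumptions, deleting all n keys costs on the order of n squared index entries with full rewrites, versus n markers with appends.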

To be fair, dumbdbm was most likely written as a last resort fallback to the more classic dbms. It's also really old (written by Guido himself if I remember correctly).

A key value store with modest aspirations

Hopefully the goals of semidbm are becoming clearer. I just wanted a dbm that was:

  1. Portable
  2. Easily installable
  3. Reasonable performance and semantics

The first two points I felt I could achieve by simply using python, and not requiring any C libraries or C extensions.

The third point I felt I could improve by taking dumbdbm and making some minor improvements.

So that's the background of semidbm.

Can simpler really be better?

I think so. The benchmark page has more details regarding the performance, but here's a quick comparison of semidbm to dumbdbm:

$ scripts/benchmark -d semidbm -n 10000
Generating random data.
Benchmarking: <module 'semidbm'>
    num_keys  : 10000
    key_size  : 16
    value_size: 100
fill_sequential     : time:     0.126,   micros/ops:    12.597,   ops/s:  79382.850,  MB/s:      8.782
read_hot            : time:     0.041,   micros/ops:     4.115,   ops/s: 243036.754,  MB/s:     26.886
read_sequential     : time:     0.039,   micros/ops:     3.861,   ops/s: 258973.197,  MB/s:     28.649
read_random         : time:     0.042,   micros/ops:     4.181,   ops/s: 239171.571,  MB/s:     26.459
delete_sequential   : time:     0.058,   micros/ops:     5.819,   ops/s: 171856.854,  MB/s:     19.012

$ scripts/benchmark -d dumbdbm -n 10000
Generating random data.
Benchmarking: <module 'dumbdbm'>
    num_keys  : 10000
    key_size  : 16
    value_size: 100
fill_sequential     : time:     1.824,   micros/ops:   182.400,   ops/s:   5482.447,  MB/s:      0.607
read_hot            : time:     0.165,   micros/ops:    16.543,   ops/s:  60450.332,  MB/s:      6.687
read_sequential     : time:     0.167,   micros/ops:    16.733,   ops/s:  59762.818,  MB/s:      6.611
read_random         : time:     0.175,   micros/ops:    17.505,   ops/s:  57126.529,  MB/s:      6.320
delete_sequential   : time:    99.025,   micros/ops:  9902.522,   ops/s:    100.984,  MB/s:      0.011

From the output above, writes are an order of magnitude faster (and semidbm computes and writes out a checksum for every value) and reads are about 4 times faster. Deletion performance is much better (0.058 seconds vs. 99.025 seconds for deleting 10000 keys).
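The derived columns in the benchmark output can be reproduced from the elapsed time alone. The formulas below are inferred from the numbers shown (bytes per operation taken as key_size + value_size, and a "MB" taken as 1024 * 1024 bytes), not read from the benchmark script itself, so treat them as an assumption. The function name summarize is made up for this sketch.

```python
MIB = 1024 * 1024


def summarize(elapsed_seconds, num_keys, key_size=16, value_size=100):
    """Return (micros/op, ops/s, MB/s) for one benchmark row."""
    micros_per_op = elapsed_seconds / num_keys * 1e6
    ops_per_sec = num_keys / elapsed_seconds
    mb_per_sec = ops_per_sec * (key_size + value_size) / MIB
    return micros_per_op, ops_per_sec, mb_per_sec


semidbm_fill = summarize(0.126, 10000)
dumbdbm_fill = summarize(1.824, 10000)

# Ratios behind the "order of magnitude" and "about 4 times" claims.
write_speedup = 1.824 / 0.126   # fill_sequential, dumbdbm vs. semidbm
read_speedup = 0.165 / 0.041    # read_hot, dumbdbm vs. semidbm

print(semidbm_fill, dumbdbm_fill)
print(round(write_speedup, 1), round(read_speedup, 1))
```

Since the elapsed times printed above are rounded to three decimals, the recomputed values only match the benchmark columns approximately.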

Also, every single insertion/update/deletion is immediately written out to disk, so if python crashes, at worst you'd lose one key: the key that was being written out to disk when python crashed.

Why you should use semidbm

I think if you ever need to use a pure python dbm, semidbm is a great choice. Any time you'd otherwise have to use dumbdbm, use semidbm instead.

Future plans for semidbm

There's a number of things I'd like to investigate in the future:

  • Faster db loading. Semidbm needs to read the entire data file to load the db. There's potential to speed this up.
  • Caching reads. Looking at the implementation of other dbms, many of them have some type of in memory cache to improve read performance.
  • Support for additional db methods. Semidbm does not support all of the dict methods.
  • Batch writes/reads. Due to the append only nature of the file format, this could substantially improve write performance.
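Two of these items can be sketched with a toy append-only file. The record layout here (two length fields, then key, then value) and the function names are hypothetical and are not semidbm's real format. The sketch shows why loading means scanning the entire file, and how a batch of records could be joined into a single write() call.

```python
import io
import struct

HEADER = struct.Struct("!II")  # hypothetical: key length, value length


def append_record(f, key, value):
    """Append one record with one write() call."""
    f.write(HEADER.pack(len(key), len(value)) + key + value)


def append_batch(f, items):
    """Batched write: many records joined into a single write() call."""
    f.write(b"".join(HEADER.pack(len(k), len(v)) + k + v for k, v in items))


def load_index(f):
    """Scan the whole file, mapping each key to (value offset, value size)."""
    index = {}
    f.seek(0)
    while True:
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of file
        key_len, val_len = HEADER.unpack(header)
        key = f.read(key_len)
        index[key] = (f.tell(), val_len)  # later records win over earlier ones
        f.seek(val_len, io.SEEK_CUR)      # skip the value; only offsets are kept
    return index


data = io.BytesIO()  # stands in for a file opened in binary mode
append_record(data, b"a", b"one")
append_batch(data, [(b"b", b"two"), (b"a", b"uno!")])
index = load_index(data)
print(index)
```

Only keys and offsets are held in memory in this sketch; reading a value means seeking to its offset and reading its size in bytes.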

For more info, check out the docs and the github repo.