The first question one might ask is:
Another persistent key value store, really?
It all started when I was working on a project where I needed a simple key value store, accessible from python. Technically, I was using the shelve module, and it decided to use the Berkeley DB (via anydbm). So far so good. But there were a few issues:
The first issue is regarding availability. Not all python installations are the same. Just because a user has python installed does not mean they necessarily have all the standard libraries installed. I just checked my python install on my MacBook, and I don't have the bsddb module available. On my Debian system I don't have the gdbm module installed. Given that these packages are just python bindings to C-based dbms, installing these packages involves:
None of these steps are that much work, but are there any alternatives?
On all of my systems I have the dbm module available. This is a C-based DBM that seems to be available on most python installations. How fast is it? There's a scripts/benchmark script available in the semidbm repo that can benchmark any dbm-like module. Here are the results for the dbm module:
    $ scripts/benchmark -d dbm
    Generating random data.
    Benchmarking: <module 'dbm' from '/Users/jsaryer/.virtualenvs/semidbm/lib/python2.7/lib-dynload/dbm.so'>
        num_keys  : 1000000
        key_size  : 16
        value_size: 100
    HASH: Out of overflow pages.  Increase page size
    ERROR: exception caught when benchmarking <module 'dbm' from '/Users/jsaryer/.virtualenvs/semidbm/lib/python2.7/lib-dynload/dbm.so'>: cannot add item to database
Or in other words, it made it to about 450000 keys before this error was generated. So storing a large number of keys doesn't seem possible with python's dbm module.
While some dbms that aren't available simply require compiling/installing the right packages and files, there are some dbms that just aren't available on certain platforms (notoriously Windows).
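One way to see which dbm backends a particular interpreter actually has is to try importing each one. The following is a minimal sketch, not part of semidbm: the helper name `available_dbms` and the `CANDIDATES` list are made up for illustration, and the list mixes the Python 2 module names used in this document with their Python 3 equivalents.

```python
import importlib

# Python 2 names (as used in this document) followed by the Python 3 names.
# This list is an assumption for illustration; adjust it to the backends you care about.
CANDIDATES = ["bsddb", "gdbm", "dbm", "dumbdbm", "dbm.gnu", "dbm.ndbm", "dbm.dumb"]


def available_dbms(names=CANDIDATES):
    """Return the subset of `names` that can be imported on this interpreter."""
    found = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The binding (or the C library behind it) is not installed here.
            continue
        found.append(name)
    return found


if __name__ == "__main__":
    print(available_dbms())
```

The output differs from machine to machine, which is exactly the availability problem described above.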
Well fortunately, there's a fallback python module that's guaranteed to be available on every single python installation: dumbdbm.
Unfortunately, the performance is terrible. There's also a number of undesirable qualities:
To be fair, dumbdbm was most likely written as a last resort fallback to the more classic dbms. It's also really old (written by Guido himself if I remember correctly).
Hopefully the goals of semidbm are becoming clearer. I just wanted a dbm that was:
The first two points I felt I could achieve by simply using python, and not requiring any C libraries or C extensions.
The third point I felt I could improve by taking dumbdbm and making some minor improvements.
So that's the background of semidbm.
I think so. The benchmark page has more details regarding the performance, but here's a quick comparison of semidbm and dumbdbm:
    $ scripts/benchmark -d semidbm -n 10000
    Generating random data.
    Benchmarking: <module 'semidbm'>
        num_keys  : 10000
        key_size  : 16
        value_size: 100
    fill_sequential      : time:      0.126, micros/ops:     12.597, ops/s:  79382.850, MB/s:      8.782
    read_hot             : time:      0.041, micros/ops:      4.115, ops/s: 243036.754, MB/s:     26.886
    read_sequential      : time:      0.039, micros/ops:      3.861, ops/s: 258973.197, MB/s:     28.649
    read_random          : time:      0.042, micros/ops:      4.181, ops/s: 239171.571, MB/s:     26.459
    delete_sequential    : time:      0.058, micros/ops:      5.819, ops/s: 171856.854, MB/s:     19.012

    $ scripts/benchmark -d dumbdbm -n 10000
    Generating random data.
    Benchmarking: <module 'dumbdbm'>
        num_keys  : 10000
        key_size  : 16
        value_size: 100
    fill_sequential      : time:      1.824, micros/ops:    182.400, ops/s:   5482.447, MB/s:      0.607
    read_hot             : time:      0.165, micros/ops:     16.543, ops/s:  60450.332, MB/s:      6.687
    read_sequential      : time:      0.167, micros/ops:     16.733, ops/s:  59762.818, MB/s:      6.611
    read_random          : time:      0.175, micros/ops:     17.505, ops/s:  57126.529, MB/s:      6.320
    delete_sequential    : time:     99.025, micros/ops:   9902.522, ops/s:    100.984, MB/s:      0.011
From the output above, writes are an order of magnitude faster (0.126 seconds vs. 1.824 seconds for fill_sequential, roughly 14 times, even though semidbm computes and writes out a checksum for every value) and reads are about 4 times faster. Deletion performance is much better (0.058 seconds vs. 99.025 seconds for deleting 10000 keys).
Also, every single insertion/update/deletion is immediately written out to disk, so if python crashes, at worst you'd lose one key: the key that was being written out to disk when python crashed.
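To make the "checksum per value, write on every change" idea concrete, here is a minimal sketch of an append-only record log. It is not semidbm's actual on-disk format or code: the record layout, the helper names `append_record` and `read_records`, the use of CRC32, and the explicit `os.fsync` call are all assumptions made for illustration.

```python
import os
import struct
import zlib

# Hypothetical record layout: key length, value length, CRC32 of the value.
HEADER = struct.Struct("!III")


def append_record(fileobj, key, value):
    """Append one key/value record (both bytes) and push it to disk before returning."""
    checksum = zlib.crc32(value) & 0xFFFFFFFF
    fileobj.write(HEADER.pack(len(key), len(value), checksum) + key + value)
    fileobj.flush()             # empty Python's buffer into the OS
    os.fsync(fileobj.fileno())  # ask the OS to write it to the device (an assumption, see above)


def read_records(fileobj):
    """Yield (key, value) pairs, stopping at the first truncated or corrupt record."""
    while True:
        header = fileobj.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # clean end of file, or a partially written header
        key_len, value_len, checksum = HEADER.unpack(header)
        key = fileobj.read(key_len)
        value = fileobj.read(value_len)
        if len(key) < key_len or len(value) < value_len:
            return  # record was cut off mid-write
        if zlib.crc32(value) & 0xFFFFFFFF != checksum:
            return  # value does not match its checksum
        yield key, value
```

With a layout like this, a crash in the middle of a write leaves at most one incomplete record at the end of the file, and the reader simply stops there.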
I think if you ever need to use a pure python dbm, semidbm is a great choice. Any time you'd otherwise have to use dumbdbm, use semidbm instead.
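In practice that can look like the sketch below: prefer semidbm when it is installed and fall back to the standard library's dumb dbm otherwise. This assumes Python 3, where dumbdbm is named `dbm.dumb`; the `open_store` helper and the file name `example.db` are made up for illustration. Both modules follow the usual dbm convention of `open(path, flag)`, with `"c"` meaning "open read/write, creating if needed".

```python
import os
import tempfile

try:
    import semidbm as dbm_module   # preferred pure-python dbm, if installed
except ImportError:
    import dbm.dumb as dbm_module  # stdlib fallback (called dumbdbm on Python 2)


def open_store(path):
    """Open a key value store at `path`, creating it if it does not exist."""
    return dbm_module.open(path, "c")


# Hypothetical location, used only for this example.
store_path = os.path.join(tempfile.mkdtemp(), "example.db")

db = open_store(store_path)
db[b"greeting"] = b"hello"
value = db[b"greeting"]
db.close()
```

Using bytes for keys and values keeps the behaviour the same whichever of the two modules ends up being imported.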
There's a number of things I'd like to investigate in the future: