Semidbm is a pure python dbm. While the
docs go into the specifics of how to use
the dbm, I’d like to offer a more editorialized view of semidbm (the why of
semidbm).
Semidbm is a pure python dbm, which is basically a key value store. Similar
python modules in the standard library include
gdbm,
bsddb, and
dumbdbm.
The first question one might ask is:
Another persistent key value store, really?
Fair question.
It all started when I was working on a project where I needed a simple key
value store, accessible from python. Technically, I was using the
shelve module, and it decided to
use the Berkeley DB (via anydbm).
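As a rough illustration of that setup, the sketch below writes one value through shelve and then asks which dbm backend was picked. It assumes Python 3, where anydbm and whichdb were folded into the dbm package (on Python 2 the modules are anydbm and whichdb). The file name demo_shelf and the helper shelve_backend are made up for the example.

```python
import dbm
import os
import shelve
import tempfile


def shelve_backend(directory):
    """Write one value through shelve and report the dbm backend it chose."""
    path = os.path.join(directory, "demo_shelf")  # hypothetical file name

    # shelve pickles the value and hands it to whichever dbm is available.
    shelf = shelve.open(path)
    shelf["answer"] = {"value": 42}
    shelf.close()

    # Reopen the shelf to confirm the value survived the round trip.
    shelf = shelve.open(path)
    stored = shelf["answer"]
    shelf.close()

    # whichdb inspects the files on disk and names the module that wrote them.
    return stored, dbm.whichdb(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stored, backend = shelve_backend(tmp)
        print(stored, backend)
```

The backend name printed will differ from machine to machine, which is exactly the situation described here.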
So far so good. But there were a few issues:
Not everyone has the Berkeley DB python bindings installed. Or in general,
dbms that are based on C libraries have varying availability on people’s
systems.
Not all dbms perform equally.
Not all dbms are portable.
C based DBMs and their availability
The first issue is regarding availability. Not all python installations are
the same. Just because a user has python installed does not mean they
necessarily have all the standard libraries installed. I just checked my
python install on my Macbook, and I don’t have the bsddb module
available. On my debian system I don’t have the gdbm module installed.
Given that these packages are just python bindings to C based dbms,
installing these packages involves:
Install the C libraries and development packages for the appropriate dbm.
Have a development environment that can build python.
Rebuild python
None of these steps are that much work, but are there any alternatives?
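Before rebuilding anything, it can help to see what a given interpreter actually ships with. This is a minimal sketch, not part of semidbm: the CANDIDATES list and the available_dbms helper are names I made up, and the list mixes the Python 2 module names used in this post with what I assume are their Python 3 equivalents.

```python
import importlib

# Python 2 names from this post, followed by their assumed Python 3 equivalents.
CANDIDATES = ["gdbm", "bsddb", "dbm", "dumbdbm", "dbm.gnu", "dbm.ndbm", "dbm.dumb"]


def available_dbms(names=CANDIDATES):
    """Return a dict mapping each module name to True if it imports cleanly."""
    found = {}
    for name in names:
        try:
            importlib.import_module(name)
            found[name] = True
        except ImportError:
            # Covers both "module not built" and "wrong Python version".
            found[name] = False
    return found


if __name__ == "__main__":
    for name, ok in sorted(available_dbms().items()):
        print("%-10s %s" % (name, "available" if ok else "missing"))
```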
Not all dbms perform equally
On all of my systems I have the dbm
module available. This is a C based DBM that seems to be available on most python
installations. How fast is it? There’s a scripts/benchmark script available
in the semidbm repo that can benchmark any dbm like module. Here are the results
for the dbm module:
$ scripts/benchmark -d dbm
Generating random data.
Benchmarking: <module 'dbm' from
'/Users/jsaryer/.virtualenvs/semidbm/lib/python2.7/lib-dynload/dbm.so'>
num_keys : 1000000
key_size : 16
value_size: 100
HASH: Out of overflow pages. Increase page size
ERROR: exception caught when benchmarking <module 'dbm' from '/Users/jsaryer/.virtualenvs/semidbm/lib/python2.7/lib-dynload/dbm.so'>: cannot add item to database
Or in other words, it made it to about 450000 keys before this error was
generated. So storing a large number of keys doesn’t seem possible with
python’s dbm module.
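For readers who want to try a similar measurement without the semidbm repo, here is a small sketch of a write benchmark. It is not the scripts/benchmark script itself: time_writes is a hypothetical helper, and the only thing it assumes about the module under test is a dbm-style open(path, "c") function. It falls back between the Python 3 and Python 2 names for the pure python dbm so it can run anywhere.

```python
import os
import tempfile
import time

try:
    import dbm.dumb as dumb_module  # Python 3 name
except ImportError:
    import dumbdbm as dumb_module   # Python 2 name


def time_writes(dbm_module, path, num_keys=1000, key_size=16, value_size=100):
    """Insert num_keys random records; return (elapsed seconds, stored count)."""
    # Generate the data up front so that only the writes are timed.
    records = [(os.urandom(key_size), os.urandom(value_size))
               for _ in range(num_keys)]
    db = dbm_module.open(path, "c")
    start = time.time()
    for key, value in records:
        db[key] = value
    elapsed = time.time() - start
    count = len(db)
    db.close()
    return elapsed, count


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    elapsed, count = time_writes(dumb_module, os.path.join(tmp, "bench"))
    print("wrote %d keys in %.3f seconds" % (count, elapsed))
```

Passing a different module as the first argument gives a like-for-like comparison, as long as that module accepts the same open() call.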
Not all dbms are portable
While some dbms that aren’t available simply require compiling/installing
the right packages and files, there are some dbms that just aren’t available on
certain platforms (notoriously windows).
Well fortunately, there’s a fallback python module that’s guaranteed to be
available on every single python installation:
dumbdbm.
Unfortunately, the performance is terrible. There’s also a number of
undesirable qualities:
When a key is added to the DB, the data file is updated, but the index
file is not updated, which means the data file and the index file are
not in sync. If python crashes, any newly added/updated keys are lost.
Every deletion writes out the entire index. This makes deletions painfully
slow (O(n)).
To be fair, dumbdbm was most likely written as a last resort fallback to the
more classic dbms. It’s also really old (written by Guido himself if I
remember correctly).
A key value store with modest aspirations
Hopefully the goals of semidbm are becoming clearer. I just wanted a dbm that
was:
Portable
Easily installable
Reasonable performance and semantics
The first two points I felt I could achieve by simply using python, and not
requiring any C libraries or C extensions.
The third point I felt I could improve by taking dumbdbm and making some minor
improvements.
So that’s the background of semidbm.
Can simpler really be better?
I think so. The
benchmark page
has more details regarding the performance, but as a quick comparison between
dumbdbm and semidbm: semidbm’s writes are an order of magnitude faster (and semidbm
computes and writes out a checksum for every value) and its reads are almost 4
times faster. Deletion performance is much better (0.058 seconds vs. 99.025
seconds for deleting 10000 keys).
Also, every single insertion/update/deletion is immediately written out to
disk so if python crashes, at worst you’d lose one key, the key that was being
written out to disk when python crashed.
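To make the "write everything out immediately" idea concrete, here is a toy sketch of an append-only store. It is not semidbm's code or on-disk format (there are no checksums here, and the JSON-lines layout is my own assumption); AppendOnlyStore is a hypothetical class that only shows why a crash costs at most the record being written.

```python
import json
import os


class AppendOnlyStore(object):
    """Toy store: each change is one JSON line appended to a single file."""

    def __init__(self, path):
        self._data = {}
        if os.path.exists(path):
            self._replay(path)
        self._file = open(path, "a")

    def _replay(self, path):
        # Loading means reading the entire file; later records win.
        good_bytes = 0
        with open(path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # incomplete final record left by a crash
                try:
                    op, key, value = json.loads(raw.decode("utf-8"))
                except ValueError:
                    break  # unreadable record: stop replaying here
                self._apply(op, key, value)
                good_bytes += len(raw)
        # Drop anything after the last complete record before appending again.
        with open(path, "r+b") as f:
            f.truncate(good_bytes)

    def _apply(self, op, key, value):
        if op == "set":
            self._data[key] = value
        else:
            self._data.pop(key, None)

    def _append(self, op, key, value=None):
        # Write, flush and fsync before returning, so the change is on disk.
        self._file.write(json.dumps([op, key, value]) + "\n")
        self._file.flush()
        os.fsync(self._file.fileno())
        self._apply(op, key, value)

    def __setitem__(self, key, value):
        self._append("set", key, value)

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        if key not in self._data:
            raise KeyError(key)
        self._append("del", key)  # a delete is just another appended record

    def close(self):
        self._file.close()
```

A deletion here costs one short append rather than a rewrite of the whole index. The trade-off is that the file only grows and has to be replayed in full on open.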
Why you should use semidbm
I think if you ever need to use a pure python dbm, semidbm is a great choice.
Any time you’d otherwise have to use dumbdbm, use semidbm instead.
Future plans for semidbm
There’s a number of things I’d like to investigate in the future:
Faster db loading. Semidbm needs to read the entire data file to
load the db. There’s potential to speed this up.
Caching reads. Looking at the implementation of other dbms, many of them
have some type of in memory cache to improve read performance.
Support for additional db methods. Semidbm does not support all of the
dict methods.
Batch writes/reads. Due to the append only nature of the file format, this
could substantially improve write performance.
Learning a new programming language can be a daunting task. Even though you
start with the basic things like syntax, in order to become productive in the
language you must learn things like
Common coding idioms and patterns
The standard library
Best practices (including what frameworks to use, what development tools to
use, etc)
But then there’s also the, for lack of a better term, “extra stuff.” The
collection of miscellaneous tips and tricks you pick up while coding in the
language on a day to day basis. This set of tips ends up saving you a lot of
time in the long run, but it is hard to tell how useful a tip really is
when you first hear about it.
Well, this is my list of tips. It’s not 100% complete, and focuses mostly on
the things that, when I think about how I code on a day to day basis, I find
myself repeatedly doing.
The _ variable
This tip is useful when you’re in an interactive python shell. The _ variable
stores the value of the most recently evaluated expression:
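A minimal sketch of that behaviour follows. So that it can run as a script, it drives the stdlib code.InteractiveInterpreter, which compiles each line the way the >>> prompt does; in a real session you would simply type the three expressions yourself. The expressions and the name answer are arbitrary choices for the example, and it assumes the default sys.displayhook is in place.

```python
import code

# Each runsource() call behaves like typing one line at the >>> prompt:
# the value of a bare expression is echoed and remembered in `_`.
interp = code.InteractiveInterpreter()
interp.runsource("6 * 7")       # echoes the result of the expression
interp.runsource("_ + 1")       # `_` refers to the previous result
interp.runsource("answer = _")  # assignments are not echoed; `_` is unchanged

print(interp.locals["answer"])
```

This is handy when you run an expensive expression and then realize you forgot to assign the result to a variable.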
Sometimes if you’re trying to debug a problem you’ll need to figure
out where a module is located. A really easy way to do this is to use
the __file__ attribute of a module object:
You can also use inspect.getfile(obj) to find where an object is
located.
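Both approaches can be sketched as follows, using the stdlib json package as an arbitrary example (any module loaded from a file would do):

```python
import inspect
import json

# Every module loaded from a file records where it came from.
module_path = json.__file__

# inspect.getfile works on modules, classes, and functions alike.
function_path = inspect.getfile(json.dumps)
class_path = inspect.getfile(json.JSONDecoder)

print(module_path)
print(function_path)
print(class_path)
```

Modules compiled into the interpreter itself, such as sys, have no __file__ attribute, and inspect.getfile raises a TypeError for them.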
Running Your Module as a Script
Every module will have a __name__ attribute, but the value of that attribute
will depend on how the module is executed. Consider a module:
foo.py
print __name__
When the module is imported the name will be “foo”.
>>> import foo
foo
>>>
However, when the module is executed as a script, the name will be __main__:
$ python foo.py
__main__
It may not be obvious how this is useful. The way that this is typically used
is to allow a module to be both imported and used as a script. Sometimes the
script is a command line interface to the functionality available in the
module. Sometimes the script provides a demo of the capabilities of the
module. And sometimes the script runs any tests that live in the module (for
example all of the doctests). To use this in your own library you can use
something like this:
import sys

def do_something(args):
    # Do something with args.
    pass

def main(argv=None):
    if argv is None:
        argv = sys.argv
    # parse_args is assumed to be defined elsewhere in the module.
    args = parse_args(argv)
    do_something(args)

if __name__ == '__main__':
    sys.exit(main())
The main() function is only called when the module is run directly.
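For a version that runs end to end, here is one way the missing pieces might be filled in. This is a sketch under my own assumptions: parse_args is built on argparse, the --name option and the greeting are invented for the example, and main() passes sys.argv[1:] because argparse expects the argument list without the program name.

```python
import argparse
import sys


def do_something(args):
    # Hypothetical work: build a greeting from the parsed arguments.
    return "Hello, %s!" % args.name


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Greet someone.")
    parser.add_argument("--name", default="world", help="who to greet")
    return parser.parse_args(argv)


def main(argv=None):
    # Accepting argv as a parameter lets tests and other modules call main()
    # directly instead of patching sys.argv.
    if argv is None:
        argv = sys.argv[1:]
    args = parse_args(argv)
    print(do_something(args))
    return 0  # becomes the process exit status via sys.exit()


if __name__ == "__main__":
    sys.exit(main())
```

Importing the module gives access to do_something() and main() without printing anything, while running it as a script prints the greeting and exits with status 0.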
The -m option
Once your module has an if __name__ == '__main__' clause (I usually
refer to this as just the ifmain clause), an easy way to invoke the module is
to use the -m option of python. This allows you to refer to a module by its
import name rather than its specific path. In the previous example the foo.py
module could be run using:
$ python -m foo
One final thing worth pointing out is that many modules in python’s stdlib have
useful ifmain functionality. A few notable ones include:
python -m SimpleHTTPServer
This will serve the current working directory on port 8000 (on Python 3, use
python -m http.server instead). I use this command
on almost a daily basis. From quickly downloading files to viewing html files
on a remote server, this is one of the most useful ifmain clauses in the entire
python standard library.
python -m pdb myfile.py
Run a python script via pdb (the python debugger).
python -m trace --trace myfile.py
Print each line to stdout before it’s executed. Be sure to see the help of the
trace module; there are a lot of useful options besides printing each line being
executed.
python -m profile myfile.py
Profile myfile.py and print out a summary.
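The same -m mechanism can be driven from a script. The sketch below uses json.tool, a stdlib module I picked for the example (it is not one of the modules listed above) because its ifmain clause reads JSON on stdin and pretty-prints it, which makes the result easy to check. The sample JSON document is made up.

```python
import json
import subprocess
import sys

# Run `python -m json.tool` with the interpreter executing this script,
# feeding it a compact JSON document on stdin.
proc = subprocess.Popen(
    [sys.executable, "-m", "json.tool"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
output, unused_err = proc.communicate(b'{"name": "semidbm", "pure_python": true}')

# The module's ifmain clause writes the pretty-printed document to stdout.
print(output.decode("utf-8"))
```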
So there it is. My list of tips. In the future I plan on expanding on some of
these tips in more depth (the profiling workflow for python code and how to
debug python code stand out), but in the meantime, may these tips be as helpful
to you as they are to me.