Rhett's Nullhole

Where I put my stuff

Lensmob in 2015

I don’t talk about this much, but since shutting down Harbor Labs back in 2013 (makers of such blockbusters as Retrosift), I’ve been quietly maintaining my favorite project, Lensmob.

By quietly maintaining, I mean I’ve basically just been paying the bills and keeping the server running. I still think Lensmob is a great product, and I’ve got a few dedicated users, such as teachers, instructors and camps, who aren’t comfortable just creating Facebook groups to share photos.

In my head I’ve always considered Lensmob my side project. It uses some nifty technologies behind the scenes. It’s cool to hack on. But in reality, I haven’t touched the code in about a year.

But then earlier this month, something changed:

To fully understand my excitement, it’s important to note that I’ve been a follower of Lessig for quite a while. I’ve read his books, I donated to his MayDay PAC and had already been following the New Hampshire Rebellion. I’m such a nerd that I spilled my scotch when, while re-watching The West Wing, I realized Christopher Lloyd plays him for a couple of episodes.

Of course I wanted to help in any way possible. There isn’t much in the way of user management features in Lensmob and mostly he just wanted to make sure they could contact users to get clearances for using photos.

Yeah sure, sounds simple enough. But then again, when was the last time I changed anything? How do I deploy code? I can’t even remember. Two years ago? Who wrote this shit? After dedicating a few evenings to exploring this long forgotten code base, I’m confident that Hell is my Own Old Code.

Ok, that’s probably an exaggeration, but I will say that there were some interesting lessons to be learned:

  • Lensmob is over-engineered. It does things in a rather complicated way. This makes it pretty difficult to just jump back into. Also, some of those complications mean that seemingly simple things, like updating Python libraries to versions released in the last year or two, are actually really difficult.
  • However, one side-effect of being over-engineered: It’s really scalable. It’s also pretty solid. Everything is still running perfectly fine after all this time of not touching it. My monitoring is still chugging away fine. This is part of the reason I haven’t touched the codebase in so long: I haven’t needed to, it just works.
  • Even relatively “good” Python code I wrote myself just two years ago looks today like it was written by a madman.

So my conclusion in the end was that I could easily write manual tools and dedicate some time to helping with this project any way I could. But releasing new features, or feeling really happy about the state of development for Lensmob, probably wasn’t going to happen in time. Making changes right before going to the big time is not a good idea.

So a week or so later:

The walk is over and so far Lensmob has operated just great. The only issue I’ve seen is really a lack of features that make large albums like this more manageable. The 1400+ photos uploaded to this album perform fine from a systems perspective, but the album is reaching the limits of usability.

So now, more inspired, I think it’s time for me to dig back into it.

  • Filed under: software
  • Published at: 9:55 am on January 29, 2015
  • Comments: no comments
  • Written by: rhettg

The Story of Dynochemy

As I wrote previously in “DynamoDB and Me”, I’ve been using Amazon’s hosted NoSQL datastore for some new projects, including Lensmob.com. I like it, but it inevitably led me to write a library for better, higher-level usage: Dynochemy.

The following is the story of this library’s evolution: how it started as a simple wrapper to enable easier integration with Tornado and then grew into its own datastore framework. This story isn’t over yet, but you might find it interesting if you’re using DynamoDB in the real world or if you just enjoy a good software engineering yarn.

Early April, 2012: The Beginning

The beginnings of Dynochemy came from two deficiencies in existing libraries:

  • Async Support
  • Lack of a reasonable API for performing operations, rather than manually creating requests.

I found asyncdynamo, a library written by some developers at bit.ly. It plugs into boto and provides an async interface. The most obvious hurdle at that time was how low-level it was. Rather than do something like:

db.put({'name': 'Rhett Garber'})

You ended up with something more like:

data = json.dumps({'TableName': 'MyTable', 'Item': {'name': {'S': 'Rhett Garber'}}})
client.make_request('PutItem', body=data, callback=all_done)

(In later versions of boto, they added some high-level support that made creating these requests easier; however, integrating that with the async client would still be a challenge.)

Adding a nice little API for each operation seemed straightforward enough. Plus, it gave me the opportunity to really dive deep into what DynamoDB supported. This first version of Dynochemy supported the basic operations put, get, scan and query and was fairly high-level. I wasn’t sure exactly how I would end up wanting to integrate Dynochemy into different types of applications, so I wanted to support a few different API styles: callback, defer or synchronous. I ended up with 4 ways to do the exact same thing:

db['123'] = {'name': 'Rhett Garber'}

db.put({'name': 'Rhett Garber', 'id': '123'})

df = db.put_defer({'name': 'Rhett Garber', 'id': '123'})
df(ioloop=self.ioloop)

db.put_async({'name': 'Rhett Garber', 'id': '123'}, callback=after_put)

This is the first time I’ve mentioned ‘defer’. I had a vague recollection of the defer concept from when I worked with Twisted. The general idea is to have an object that represents the completion of some asynchronous task. Twisted’s version seemed too complicated, so of course I tried to implement my own. Through the course of this project, I’ve come to understand why Twisted is so complicated. I also accidentally ended up with a defer system that is almost identical (or at least compatible) to the new ‘futures’ built into Python (http://www.python.org/dev/peps/pep-3148/). Don’t knock reinventing the wheel; it’s a great learning process.
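
The core of the defer idea is easy to sketch (this is an illustrative toy, not the actual Dynochemy class):

    class Defer(object):
        """An object that stands in for a result that isn't ready yet."""

        def __init__(self):
            self._done = False
            self._result = None
            self._callbacks = []

        def callback(self, result):
            # The async operation finished: record the result and notify waiters.
            self._done = True
            self._result = result
            for cb in self._callbacks:
                cb(result)

        def add_callback(self, cb):
            # Register interest in the result; fire immediately if it's already in.
            if self._done:
                cb(self._result)
            else:
                self._callbacks.append(cb)

        def result(self):
            # A real version would run the IOLoop until callback() fires;
            # here we just assume it already has.
            assert self._done, "result not ready"
            return self._result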

Late April, 2012: Some Tooling

I soon realized this was becoming a larger project than a simple wrapper around asyncdynamo. Testing and development needed to be a little more streamlined. Connecting to a live DynamoDB installation was a pain, especially if I wanted any automated test cases.

I decided it shouldn’t be too hard to create a backing store that was SQLite based. It wouldn’t be async capable, but I could at least reasonably test my data operations.

With a little refactoring, I had a pluggable ‘client’ I could set as my database abstraction’s interface.
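
The shape of that pluggable client is roughly the following (a simplified sketch with made-up names and a made-up request format, not the real Dynochemy classes; the real SQLite backend handles far more of the API):

    import json
    import sqlite3

    class SQLiteClient(object):
        """Stand-in for the async DynamoDB client: same make_request() shape,
        but everything happens synchronously against a local SQLite file."""

        def __init__(self, path=':memory:'):
            self.conn = sqlite3.connect(path)
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS items '
                '(tbl TEXT, key TEXT, doc TEXT, PRIMARY KEY (tbl, key))')

        def make_request(self, operation, body, callback=None):
            req = json.loads(body)
            if operation == 'PutItem':
                # For this sketch, assume the hash key attribute is named 'id'.
                key = req['Item']['id']['S']
                self.conn.execute('REPLACE INTO items VALUES (?, ?, ?)',
                                  (req['TableName'], key, json.dumps(req['Item'])))
                result = {}
            elif operation == 'GetItem':
                row = self.conn.execute(
                    'SELECT doc FROM items WHERE tbl = ? AND key = ?',
                    (req['TableName'], req['Key']['id']['S'])).fetchone()
                result = {'Item': json.loads(row[0])} if row else {}
            else:
                raise NotImplementedError(operation)
            if callback:
                callback(result)  # nothing async here, so call back immediately
            return result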

June, 2012: Functional Completeness?

By this point, I had most of the operations supported and I was starting to use Dynochemy in a real product.

Batch operations were particularly challenging to come up with a good API for. I realized that batch operations can span multiple tables. My API didn’t really allow for that because the ‘db’ instance was really a specific table.

July, 2012: Tables and Errors

A major change to the API allowed Dynochemy to support multiple tables. Rather than:

    db['123'] = {'name': 'Rhett Garber'}

We now do:

    db.MyTable['123'] = {'name': 'Rhett Garber'}

I also started to understand just how complicated it was to handle errors in async code. Keeping my error handling straight would continue to haunt this project.

Another cool feature to come out of this time period was code to run through all the pages of a query. So you could do something like:

q = db.MyTable.query(hash_key).range(1234324, 1234340).async()

results, err = run_all(q)

This would run the query until all the results were found. In production, I soon learned, this was almost completely useless because you’ll quickly exhaust your provisioned throughput and then the query just fails.
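
For the curious, paging a DynamoDB query just means re-issuing the request starting from the key where the previous page stopped. Roughly, in terms of the raw request/response shape (a sketch of the idea, not Dynochemy’s actual run_all implementation; the client call is hypothetical):

    def query_all_pages(client, request):
        """Keep issuing a Query until DynamoDB stops returning a LastEvaluatedKey."""
        items = []
        while True:
            resp = client.query(request)   # hypothetical raw-client call
            items.extend(resp.get('Items', []))
            last_key = resp.get('LastEvaluatedKey')
            if not last_key:
                return items
            # Resume the next request where this page left off.
            request['ExclusiveStartKey'] = last_key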

Late August, 2012: Operations and Solvents

I finally got to one of the major reasons I wanted this library in the first place: Dealing with provisioning and rate limiting.

Ever since I introduced the batch operations, I thought something just wasn’t quite right with my design. I knew that would come to a head when I wanted my library to deal with provisioning and retries.

The system I really wanted was to be able to say: “Do this set of operations, tell me when they are all done”. The library should transparently handle however many underlying requests to DynamoDB are required. Perhaps they can all be batched together, perhaps some of them need to be retried.

Also, I knew at some point I wanted to integrate Dynochemy with a caching layer like memcache. Being able to “replay” database operations against other plugins could result in some very useful applications.

So I created a new set of abstractions that interacted with the existing “raw” interfaces to give me these properties.

For lack of a better term, this is solved with a ‘Solvent’. A Solvent is a set of operations against one or more DynamoDB tables. When the solvent is executed, some number of HTTP requests are made to handle these operations. Eventually, a result comes back. The client can then examine the results for each operation.

It looks something like:

    s = Solvent()
    put1 = s.MyTable.put({'id': '123', 'name': 'Alice'})
    put2 = s.MyTable.put({'id': '124', 'name': 'Bob'})
    q = s.OtherTable.query(hash_key).limit(10)
    res, err = s.run(db)

    # Print all the query results
    for r in res[q]:
        print r

Very functional and powerful. But also pretty verbose. Especially around error handling.

September, 2012: Views

As I got deeper into real-world use of this new datastore, some common patterns kept coming up. For Lensmob, a common access pattern is to want to query all the albums for a user. But, you’ll also probably want to query all the users for an album. DynamoDB has very limited query options, so what this means is that we have to have a table with the following structure:

    {
     'album': 'album1',  # Hash Key
     'user': 'user1', # Range Key
    }

And a separate table organized just the opposite:

    {
     'album': 'album1',  # Range Key
     'user': 'user1', # Hash Key
    }

Then we can do queries like:

    db.AlbumUsers.query(album_id).limit(20)
    db.UserAlbums.query(user_id).limit(20)

There are other types of secondary metadata that might need to be maintained: keeping a count of how many photos an album has, for instance. It could be very expensive to fetch all the photos for an album each time you want to display the count. Maintaining a counter like that can be tricky though, as each modification to a photo may need to update the counter.

Maintaining these associations and counters is pretty similar, but tedious to do manually. So, now that we had a smarter, higher-level interface to DynamoDB, we had the tools to automate it. I called this feature ‘Views’.

To create a view, you basically write a class that describes how a secondary meta entity is to be maintained. It uses the visitor design pattern, meaning all operations in a solvent are delivered to each registered view, allowing it to create additional operations.

For example:

    class UserAlbumsView(View):
        table = AlbumTable
        view_table = UserAlbumTable

        @classmethod
        def add(cls, entity):
            return [PutOperation(cls.view_table, {'album': entity['album'], 'user': entity['user']})]

        @classmethod
        def remove(cls, entity):
            return [DeleteOperation(cls.view_table, {'album': entity['album'], 'user': entity['user']})]

With this view, any album that’s created automatically has another table maintained that is organized by user. It is important to understand what is happening behind the scenes. A minimum of two sequential DynamoDB calls are going to be required to maintain a view like this. The first will simply add the album. Second, we’ll do any follow-up operations, such as adding to an index. We can’t really do them together in the same batch operation or else our views could become inconsistent with the actual tables (imagine if there wasn’t enough capacity on the album table, but there was on the user-album table).

Of course, maintaining the views is just half the story. We also want to query them. The View class also acts as something you can query against in a solvent. When you query against a View class, each page of the query results will be fed into a BatchGetItem with the appropriate keys. This gives the query the ability to automatically return to you the final objects, not the intermediate relationship objects.
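
Conceptually, it is the two-step dance you would otherwise write by hand with the Solvent pieces shown above. This is a hand-written approximation with illustrative table names, not the actual internals:

    # Step 1: query the index table for the (user, album) relationship items.
    s = Solvent()
    rel_q = s.UserAlbumTable.query(user_id).limit(20)
    res, err = s.run(db)

    # Step 2: batch-get the real album entities those relationships point at.
    s2 = Solvent()
    gets = [s2.AlbumTable.get(rel['album']) for rel in res[rel_q]]
    res2, err = s2.run(db)
    albums = [res2[g] for g in gets]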

January, 2013: Streamlining

After heavy use of Dynochemy in Lensmob and other projects, one thing kept annoying me: Error handling. The standard pattern for running a solvent in Tornado looked like this:

    s = Solvent()
    get_op = s.get(AlbumTable, album_id)
    res, err = yield tornado.gen.Task(s.run_async, self.db)
    
    # Check if the overall solvent failed
    if err:
       raise err
    
    # Check if the GetItem failed
    album, err = res[get_op]
    if err == ItemNotFoundError:
        return None
    elif err:
        raise err
    
    return album

Error handling this way was getting pretty annoying and repetitive. All that work just to get one thing from the database?

About this time I discovered the Python ‘futures’ PEP and library. It comes built into Python 3, but is available as a library for Python 2.7. Tornado also has some built-in handling for using futures, but it’s a little rough. Rather than try to convert my entire library to futures, I took a more conservative approach and just made some changes to my own ‘defer’ class with respect to error handling.

Now, rather than the convention of:

    result, error = df()

I changed it so that a defer will raise any generated exception when the result is asked for.

So rather than recording an error for a defer as:

    df.callback(result, error=AnError)    

Now, a successful result is recorded as:

    df.callback(result)

Where an error is recorded as:

    df.exception(AnError)

This makes the common pattern of getting results from a Solvent much more straightforward:

    s = Solvent()
    get_op = s.get(AlbumTable, album_id)
    res = yield tornado.gen.Task(s.run_async, self.db)
    
    # Check if the GetItem failed
    try:
        album = res[get_op]
    except ItemNotFoundError:
        return None    

    return album

The Future

I have not discussed much how the schema design for Lensmob evolved during this period. There is a lot of really great code that makes for a pretty useful datastore that is not in Dynochemy, but is in my application code base. This part of the application makes a lot of use of Views, and uses just two DynamoDB tables: Entity and EntityIndex. This is inspired by the friendfeed MySQL-NoSQL schema design.

Dynochemy is still pretty hard to use and I would be really surprised if many people looked at it and knew that it solved their problems.

In the next steps of this library’s evolution, I would like to do several things to clean up its use:

  • Use Python futures and hopefully make use of the tooling around them already present in many libraries (Tornado)
  • Clean up the interfaces and naming so that ‘Solvent’ is a first-class citizen.
  • Implement a caching plugin (and formalize the plugin-interface)
  • Integrate my custom schema in a way that makes it the default choice for designing an app. This includes tooling for re-building views and schema migrations.

If I actually implement the above, Dynochemy moves from being a library for accessing DynamoDB to a higher-level datastore that simply uses DynamoDB as a backing store.

This raises the question: is DynamoDB still a necessary requirement? My SQLite backing has actually been very useful and is pretty close to being production-ready as well. It uses SQLAlchemy, so it should be fairly straightforward to run it against MySQL (or Postgres or whatever).

One downside of re-orienting Dynochemy to run against SQL datastores is the lack of async support. One direction I would like to investigate is handling the transactions against the database in a separate thread pool, using futures.ThreadPoolExecutor or some such built-in tooling for executing futures. Anybody who knows me should be coughing up their coffee right now, since the idea that I would ever suggest using a thread is crazy. However, I think the futures interface, and the fact that Dynochemy’s threading can be totally isolated from the application, much like a ZeroMQ application, make it possible it won’t become a multi-threaded disaster. This is a direction for future investigation anyway; no promises.
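
A rough sketch of what I mean, using the futures backport (illustrative only; none of this exists in Dynochemy today):

    from concurrent import futures   # the 'futures' backport on Python 2.7

    # One small pool owned entirely by the datastore layer; the application
    # only ever sees the future, never the thread.
    _executor = futures.ThreadPoolExecutor(max_workers=4)

    def execute_async(engine, query):
        """Run a blocking SQLAlchemy query in the pool and return a future."""
        def _run():
            conn = engine.connect()
            try:
                return list(conn.execute(query))
            finally:
                conn.close()
        return _executor.submit(_run)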

Conclusion

I hope you enjoyed this long history of the development of a little-used python library. I don’t think developers write these things often enough.

For any potential users of Dynochemy, I think a good amount of caution is warranted. It’s proven pretty stable for our application so far, but a single user does not a battle-tested library make. Anybody who has identified similar shortcomings in existing libraries and is excited to use DynamoDB should be prepared to contribute and understand how it works if they hope to make use of it.

  • Filed under: dynamodb
  • Published at: 1:37 pm on May 20, 2013
  • Comments: no comments
  • Written by: rhettg

DynamoDB and Me

I first became interested in Amazon’s hosted NoSQL datastore, DynamoDB, after reading about Datomic. It’s interesting to consider using a hardened underlying datastore for the simplest possible operations and putting the higher-level (and perhaps more dangerous) complexity in the application layer. Also, I’m a big fan of async IO and of writing web applications in Tornado, which most common database libraries don’t play well with. Accessing a database through HTTP means an async datastore is seemingly within grasp: it’s just web servers talking to web servers.

DynamoDB has some really interesting selling points. I’m on a small team, working on a new small project, but I’m very aware of the horrors of database failure, which typically you can only afford to solve when you’re big. I don’t think I can really live that way. But trying to set up and maintain a cluster of MySQL instances, with replication and backups, just so I could sleep soundly at night wasn’t going to work. Also, every new project dreams about that moment when you’ll be panicking because you’re growing so fast that you’ll have to take downtime to upgrade your MySQL instances. AWS takes care of all of that since DynamoDB is really a hosted database “service”. The operational complexities are handled for you. All that’s required of the developer is to guess how much capacity you’ll need, and a big checkbook.

My current conclusions, after 8 months of use, are not too far off from my initial expectations:

  • DynamoDB is expensive for a small project. But you do get a lot for your money, just maybe more than you’d otherwise be willing to pay for right away.
  • Existing libraries and tooling (at least from the Python perspective) are seriously lacking. But this also provides a lot of opportunity for building cool tools to share with the world.
  • In many regards, using a NoSQL solution like DynamoDB is less flexible in the early parts of your application. You are forced to think about access patterns and what you want your data to do at the very beginning. This is good and bad. If your application doesn’t get big, you may have wasted a lot of time.

In summary, I’d say DynamoDB is a great tool for the right job. I probably wouldn’t advise jumping into it with a big mature application right away without getting your head around it on a smaller project. Unfortunately, on a small project it might cost you a lot of time and just slow you down in that critical prototyping phase. But this is true of just about every new technology you might be considering introducing into a new project. Just don’t make the mistake of doing too many of these experiments at once.

  • Filed under: dynamodb
  • Published at: 5:09 pm on May 8, 2013
  • Comments: no comments
  • Written by: rhettg

How’s your Gearmang

This is the beginning of a series of blog posts detailing some of the technology I’ve been playing with over the last year or so. These tools were used in the development of, primarily, two projects: Lensmob and Retrosift.

Your prototypical web application does essentially two things. Take content from the user, put it in the database. Take data from the database and show it to the user.

When you start dealing with more complex data flows like processing emails, doing image resizing or other expensive operations, you need a data pipeline that is outside the request-response cycle of a web app. For Lensmob, I use Gearman. The basic outline of using gearman would be:

  1. Send a task to the Gearman Server
  2. Worker process grabs the task from the Gearman Server.
  3. Worker process does work, then deletes the task from the Gearman Server (a minimal sketch of this flow follows below).
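
Here is what that flow looks like using the python-gearman library (the task name and payload are made up for illustration):

    import gearman

    # 1. The web app (or email handler, etc.) submits a background task.
    client = gearman.GearmanClient(['localhost:4730'])
    client.submit_job('resize_image', '{"photo_id": 123}',
                      background=True, wait_until_complete=False)

    # 2 and 3. A worker process registers for the task and processes jobs as
    # they arrive; the job is removed once the handler returns successfully.
    def resize_image(worker, job):
        # ... do the actual resizing using job.data ...
        return ''

    worker = gearman.GearmanWorker(['localhost:4730'])
    worker.register_task('resize_image', resize_image)
    worker.work()   # blocks, pulling jobs until the process exits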

As simple as this sounds, the operational realities of all the moving parts take a little more planning. This is especially true in a cloud environment such as AWS, where failure of a single machine can be commonplace. Also, when dealing with user-supplied content such as emails, expect the unexpected. Failures of your software will happen, and having the opportunity to fix your bugs and try again is very important.

Here are some of the things I’ve done to manage this complexity, keeping us sane and providing a high-quality experience for our users.

The Server

Gearman has a tendency to have quality issues with certain versions. Specifically, I’ve seen issues with persistence, which we find to be very important but apparently the greater community does not. We’re currently running 0.41, which has been working great for us. We use the SQLite backing store, which gives us some reassurance against gearmand or the host machine failing. On EC2, we store the gearmand database file on a separate EBS volume to keep it isolated as well as to simplify recovery procedures if that machine dies.

The Worker

My workers are all written in Python using a standard Python library. A common gearman worker abstraction makes creating new workers take just a few lines of code. This common code handles failures, logging and cleanup. I’m also careful to ensure each worker only handles one type of task, rather than falling into the trap of building one worker to handle all tasks. Balancing your resources is just too difficult if you combine workers. To avoid dealing with memory leaks or version changes, we have our workers exit every couple of minutes. Most importantly, the worker can make decisions about what to do in the face of a failure, which leads us to…

The Job of Death

One of the worst things that can happen to your gearman workers is to encounter a job that kills the worker. If that job is requeued as-is, all your workers may constantly be killed off. This makes it difficult to process any normal, working jobs. Also, your error logs and notification mechanisms will be going nuts.

We handle this condition by allowing the worker some amount of smarts in the face of failure. When a failure is detected in the worker, it can (as sketched below):

  1. Requeue the job directly, incrementing a counter in the task definition. This puts the job at the end of the queue, so other, working jobs get their chance first.
  2. If the counter reaches some configurable value, we put that task into a secondary queue we call “Gearmang”.
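
In code, the decision looks something like this (a simplified sketch; the queue names and retry threshold are illustrative, not our exact implementation):

    import json
    import gearman

    MAX_ATTEMPTS = 3
    client = gearman.GearmanClient(['localhost:4730'])

    def run_task(job, do_work):
        task = json.loads(job.data)
        try:
            do_work(task)
        except Exception:
            task['attempts'] = task.get('attempts', 0) + 1
            if task['attempts'] < MAX_ATTEMPTS:
                # Requeue at the end of the queue so healthy jobs go first.
                client.submit_job(job.task, json.dumps(task),
                                  background=True, wait_until_complete=False)
            else:
                # Hand it off to the "gearmang" dead-job queue for inspection.
                client.submit_job('gearmang_failed', json.dumps(task),
                                  background=True, wait_until_complete=False)
        return ''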

Gearmang is itself a worker that collects failed jobs and adds them to a second SQLite database. We then have command line tools for inspecting, requeuing or removing those tasks.
If we have a failure from a bug in the worker code, we can just fix the bug, then requeue the task. This is much preferable to your entire gearman infrastructure grinding to a halt while you try to deploy a fix.

Monitoring

The final piece of the puzzle is monitoring. As mentioned above, logging failures to a central logging system is critical to keeping track of your gearman infrastructure. In addition to failures, we use BlueOx to track our successful task processing. This allows us to add all kinds of fun analytics to our pipelines, such as task duration, image sizes processed, etc.

In addition to log data, we also use Collectd to monitor our task queue lengths and worker counts. Using Nagios, we can then get alerted if workers don’t appear to be running or queues go above configured thresholds. Since our worker processes are supposed to exit every few minutes, we also have a check that ensures workers haven’t been running too long. This allows us to catch any hangs from blocked IO. The added confidence this monitoring provides can’t be overstated.

The End?

This system isn’t perfect yet, and there is still room for more redundancy and safety. However, I feel this is a good enough effort without too much added expense, either in computing resources or labor. I hope to continue to evolve this system, whether from experience or from feedback from you, dear reader.

I’ve put a Gearmang repository up on GitHub if you want to see some of the code I’m talking about. The code in it was an attempt to pull it out of my main code base into something more reusable. This project is far from done and I’m sure the code doesn’t actually run. But you can get an idea of what I’m talking about anyway.

  • Filed under: software
  • Published at: 1:57 pm on May 2, 2013
  • Comments: no comments
  • Written by: rhettg

Anatomy of a regression test

I ran into an issue with the socket module of the Python standard library. It always comes as a surprise to me when I find a problem with something as mature as Python. But it happens.

My own issue involved a Python-based daemon that makes HTTP requests (using httplib) to another service. This daemon is restarted gracefully by sending SIGTERM, which it catches with a signal handler, then finishes up what it was doing and exits. The problem arises if it receives a signal while in a system call, for example while receiving the response from an HTTP request. The correct behavior is to attempt the system call again; however, the actual system call is abstracted away, so the caller, or even httplib, can’t retry.

The crux of the issue is the readline() function provided by the _fileobject socket wrapper in socket.py:

                self._rbuf = StringIO()  # reset _rbuf.  we consume it via buf.
                data = None
                recv = self._sock.recv
                while data != "\n":
                    data = recv(1)
                    if not data:
                        break
                    buffers.append(data)
                return "".join(buffers)

I’m not the first to find this, as the issue even has a patch. But due to its “test needed” status, it’s been sitting there getting no attention for quite a while. Well, I want it fixed, so let’s try to write a regression test!

The first step was to apply this patch to an appropriate development branch:

  svn co http://svn.python.org/projects/python/branches/release26-maint python26
  cd python26/Lib
  patch -p0 < ~/socket.py.diff

Now, it turned out this didn’t apply cleanly, as the patch was from an earlier version. But it was easy enough to fix.

Second, I need to add a test case to Lib/test/test_socket.py.
There is already a test case for the normal behavior of _fileobject; however, causing a real socket to generate an EINTR isn’t exactly easy. But I just need to test the error handling, and this is a unit test: a perfect case for using a mock object. There aren’t any handy mock object libraries in the standard Python distribution, so I’ll just keep it simple:

        class MockSocket(object):
            def __init__(self):
                # Build a generator that returns functions that we'll call and return for each
                # call to recv()
                def raise_error():
                    raise socket.error(errno.EINTR)
                self._step = iter([
                    lambda : "This is the first line\nAnd the sec",
                    raise_error,
                    lambda : "ond line is here\n",
                    lambda : None,
                ])

            def recv(self, size):
                return self._step.next()()

Now when I create my test case, I’ll just pass this mock socket in and call readline on it.

class FileObjectInterruptedTestCase(unittest.TestCase):
    """Test that the file object correctly handles being interrupted by a signal."""
    def setUp(self):
      ... create my mock socket ...

    def test(self):
        fo = socket._fileobject(self._mock_sock)
        self.assertEquals(fo.readline(), "This is the first line\n")
        self.assertEquals(fo.readline(), "And the second line is here\n")

Now to find out if this test case will allow this fix to be included……

  • Filed under: software
  • Published at: 10:01 pm on August 2, 2009
  • Comments: no comments
  • Written by: rhettg

First impressions of couchdb

For the last few weeks I’ve been playing with couchdb. I have not had much time, but primarily I wanted to see how it performed for a common task I deal with at work. This is not your common “write a blog” or generic web implementation of something. In fact, I really wasn’t sure if couchdb was an appropriate tool for this job at all. However, it seemed like a really easy tool to use, and perhaps even a poor man’s hadoop for playing with map-reduce ideas.

The Problem

Imagine, if you will, daily log files of about 1.2 gigs (about 2.4 million lines). These log files are lines of repr()’d python structures (which translate very easily into json). The information in them isn’t very important, but let’s say for example they detail clicks on a website.

We slice and dice this information in several ways for different reporting purposes. All said and done, I think we process these logs 3 times every night. They take about 30-45 minutes apiece. The general methodology is to run through the logs totaling up certain values, mapping page types to numbers of events, etc. Once the counts are generated, we insert them all into MySQL. For some of these we end up wanting other rollup sizes as well… daily is our highest granularity, but often we roll into weekly and monthly versions too. In practice, inserting into the database is often the slowest part and tends to adversely affect other processes also using the database. Let me re-iterate that: inserting daily and monthly rollup data into MySQL is annoyingly resource intensive. I’m not even trying to put the raw data in to report on the fly.

Not an ideal situation, but it’s working for now.

Couchdb Solution

My theory was that couchdb could provide all these reporting functions in a much more flexible way than these custom reporting scripts / relational db could. The hope was that I could just load the raw data into couchdb, write my views and I’d be good to go. The big question mark was whether couchdb was fast enough to make this feasible.

The Setup

After some discussion with some helpful people on the couchdb user mailing list, I arrived at the following setup and performance tweaks:

  • couchdb 9.0a<whatever trunk is> (allegedly MUCH faster than official released versions)
  • Latest version of Erlang (5.6.5, apparently 5.1, which seems to be the default ubuntu install, is REALLY SLOW)
  • Effective use of _bulk_docs (a sorta awkward way to do uploads in batches; I chose a batch size of 1500 lines; see the sketch after this list)
  • Generated my own sequential doc ids (auto-generated ids are quite slow as they are not sequential, and we are living in a b-tree world)
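
Here’s roughly what the loader’s core loop looks like with those tweaks in place (a simplified sketch; the database name and file path are made up):

    import json
    import urllib2
    from ast import literal_eval

    DB_URL = 'http://localhost:5984/clicks'   # hypothetical database
    BATCH_SIZE = 1500

    def flush(docs, next_id):
        """POST one batch to _bulk_docs, assigning our own sequential ids."""
        for doc in docs:
            doc['_id'] = '%012d' % next_id
            next_id += 1
        req = urllib2.Request(DB_URL + '/_bulk_docs',
                              data=json.dumps({'docs': docs}),
                              headers={'Content-Type': 'application/json'})
        urllib2.urlopen(req).read()
        return next_id

    next_id = 0
    batch = []
    for line in open('clicks.log'):
        batch.append(literal_eval(line))   # the logs are repr()'d python dicts
        if len(batch) >= BATCH_SIZE:
            next_id = flush(batch, next_id)
            batch = []
    if batch:
        flush(batch, next_id)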

I’m using a quad-core 2GHz Opteron machine with 8 gigs of RAM. Storage is an XFS RAID volume, but I’m trying to get more details on this; I *think* it’s some external SCSI RAID array.

There was some question as to how parallel processing would affect speeds. There are a few possible setups:

  • Single data loader
  • Multiple data loaders, same db
  • Multiple data loaders, different dbs, different machines (merge with replication?)

I tried the first two, but the 3rd is a bit more complex. I’d like to try it, but then again I’m really looking for a solution that is “good enough”.

The Result

Single threaded

Baseline (running through the logs without inserting into couchdb) was 4:36.

It took 33 minutes, 26 seconds to load 2.4 million rows. On disk, this took 959 megs (which is smaller than the log file the data came from). So that’s about 1200 rows per second.

Dual Loaders

Baseline was 3:23.

Inserting into couchdb, I got it to 19 min, 16 seconds, or about 2000 rows per second.

Note that compaction (the process of reclaiming deleted space, making the data structure as efficient on disk as possible) resulted in no space savings. It did take about 6 minutes to run though.

Conclusion

Though loading data into couchdb is just the start, I feel reasonably comfortable with my results. If having the data in couchdb is as flexible as I’m hoping, it should be fairly easy to convert these multi-step reporting projects into something a little more manageable (and scalable).

As for using couchdb in general, I’ve been pretty impressed. The whole thing is refreshingly simple. The JSON/REST interface is super easy to build tools around. Installation wasn’t really that hard, even with needing to install most everything from source for performance reasons.

The community has been quite supportive and knowledgeable… albeit small. The couchdb project isn’t taking the world by storm quite yet, but it’s making a lot of progress.

Updates on actually using this data to come…….

  • Filed under: couchdb
  • Published at: 9:17 am on February 17, 2009
  • Comments: 2 comments
  • Written by: rhettg