For the last few weeks I’ve been playing with couchdb. I have not had much time, but primarily I wanted to see how it performed for a common task I deal with at work. This is not your common “write a blog” or generic web implementation of something. In fact, I really wasn’t sure if couchdb was an appropriate tool for this job at all. However, it seemed like a really easy tool to use, and perhaps even a poor-man’s hadoop for playing with map-reduce ideas.
Imagine, if you will, daily log files of about 1.2 gigs (about 2.4 million lines). These log files are repr() python structures (and very easily translate into json). The information in them isn’t very important, but let’s say for example they detail clicks on a website.
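To show how easily a repr() line becomes JSON, here is a minimal sketch using only the Python standard library. The field names in the sample record are made up for illustration (the real log fields aren’t listed here), and it assumes each log line is a single Python literal such as a dict.

```python
import ast
import json

def repr_line_to_json(line):
    """Parse one repr()-style log line and return it as a JSON string."""
    # literal_eval only accepts Python literals (dicts, lists, strings,
    # numbers, ...), so it is much safer than eval() on log data.
    record = ast.literal_eval(line.strip())
    return json.dumps(record, sort_keys=True)

# Hypothetical click record; the real field names are not given in this post.
sample = "{'page_type': 'article', 'user': 42, 'ts': '2008-06-01 12:00:00'}"
print(repr_line_to_json(sample))
```

Anything that isn’t JSON-friendly (tuples become lists, for example) gets converted by `json.dumps`, so records with more exotic types would need extra handling.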
We slice and dice this information in several ways for different reporting purposes. All said and done, I think we process these logs 3 times every night. Each pass takes about 30-45 minutes. The general methodology is to run through the logs totaling up certain values, mapping page types to number of events, etc. Once the counts are generated, we insert them all into MySQL. For some of these we end up wanting other rollup sizes as well… daily is our finest granularity, but often we roll into weekly and monthly versions as well. In practice, inserting into the database is often the slowest part and tends to adversely affect other processes also using the database. Let me reiterate that: inserting daily and monthly rollup data into MySQL is annoyingly resource intensive. I’m not even trying to put the raw data in to report on the fly.
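The tally-then-rollup pattern described above can be sketched roughly like this. It is an illustration only, not the actual reporting scripts: the `day` and `page_type` keys and the helper names are assumptions.

```python
from collections import Counter
from datetime import date

def daily_counts(records):
    """Count events per (day, page_type) pair."""
    counts = Counter()
    for rec in records:
        counts[(rec["day"], rec["page_type"])] += 1
    return counts

def rollup(daily, period_key):
    """Re-aggregate daily counts into a coarser period (week, month, ...)."""
    rolled = Counter()
    for (day, page_type), n in daily.items():
        rolled[(period_key(day), page_type)] += n
    return rolled

def month_key(day):
    return (day.year, day.month)

def week_key(day):
    iso = day.isocalendar()
    return (iso[0], iso[1])  # (ISO year, ISO week number)

# Hypothetical records standing in for parsed log lines.
records = [
    {"day": date(2008, 6, 1), "page_type": "article"},
    {"day": date(2008, 6, 1), "page_type": "article"},
    {"day": date(2008, 6, 2), "page_type": "article"},
    {"day": date(2008, 6, 2), "page_type": "home"},
]
daily = daily_counts(records)
monthly = rollup(daily, month_key)
weekly = rollup(daily, week_key)
print(monthly[((2008, 6), "article")])  # 3
```

Each rollup is derived from the daily counts rather than from the raw logs, which is why every extra rollup size means yet another batch of rows to insert.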
Not an ideal situation, but it’s working for now.
My theory was that couchdb could provide all these reporting functions in a much more flexible way than these custom reporting scripts and a relational db could. The hope is that I could just load the raw data into couchdb, write my views and I’d be good to go. The big question mark was whether couchdb was fast enough to make this feasible.
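For the “just load the raw data” step, one common approach is to send documents in batches to CouchDB’s `_bulk_docs` endpoint, which accepts a JSON body of the form `{"docs": [...]}`. The sketch below only builds those request bodies. The batch size of 1000 is an arbitrary assumption, and this is not necessarily how the load in this post was done.

```python
import json

def bulk_payloads(docs, batch_size=1000):
    """Yield JSON bodies shaped for CouchDB's _bulk_docs endpoint."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield json.dumps({"docs": batch})
            batch = []
    if batch:
        # Flush whatever is left over after the last full batch.
        yield json.dumps({"docs": batch})

# Hypothetical documents standing in for parsed log records.
for body in bulk_payloads(({"n": i} for i in range(5)), batch_size=2):
    print(body)
```

Each body would then be POSTed with `Content-Type: application/json` to something like `http://localhost:5984/clicks/_bulk_docs`, where `clicks` is a made-up database name.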
After some discussion with some helpful people on the couchdb user mailing list, I arrived at the following setup and performance tweaks:
I’m using a quad-core 2 GHz Opteron machine with 8 gigs of RAM. Storage is an XFS RAID volume; I’m still trying to get more details on it, but I *think* it’s some external SCSI RAID array.
There was some question as to how parallel processing would affect speeds. There are a few possible setups:
I tried the first two, but the 3rd is a bit more complex. I’d like to try it, but then again I’m really looking for a solution that is “good enough”.
Baseline (running through the logs without inserting into couchdb) was 4:36.
It took 33 min, 26 seconds to load 2.4 million rows. On disk, this took 959 megs (which is smaller than the log file the data came from). So that’s about 1200 rows per second.
Baseline was 3:23.
Inserting into couchdb, I got it to 19 min, 16 seconds, or about 2000 rows per second.
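The rows-per-second figures are just the row count divided by the wall-clock load time. A quick check of the arithmetic (the function name is only for illustration):

```python
def rows_per_second(rows, minutes, seconds):
    """Average insert rate over a load that took minutes:seconds."""
    return rows / (minutes * 60 + seconds)

ROWS = 2_400_000
print(round(rows_per_second(ROWS, 33, 26)))  # 1196, i.e. about 1200
print(round(rows_per_second(ROWS, 19, 16)))  # 2076, i.e. about 2000
```

These rates include the time spent just reading and parsing the logs, so the insert-only rate would be somewhat higher once the baseline is subtracted.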
Note that compaction (the process of reclaiming deleted space, making the data structure as efficient on disk as possible) resulted in no space savings. It did take about 6 minutes to run, though.
Though loading data into couchdb is just the start, I feel reasonably comfortable with my results. If having the data in couchdb is as flexible as I’m hoping, it should be fairly easy to convert these multi-step reporting projects into something a little more manageable (and scalable).
As for using couchdb in general, I’ve been pretty impressed. The whole thing is refreshingly simple. The JSON/REST interface is super easy to build tools around. Installation wasn’t really that hard, even with needing to install most everything from source for performance reasons.
The community has been quite supportive and knowledgeable, albeit small. The couchdb project isn’t taking the world by storm quite yet, but it’s making a lot of progress.
Updates on actually using this data to come…