<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Rhett's Nullhole &#187; software</title>
	<atom:link href="http://nullhole.com/category/software/feed/" rel="self" type="application/rss+xml" />
	<link>http://nullhole.com</link>
	<description>Where I put my stuff</description>
	<lastBuildDate>Thu, 01 Jul 2010 00:19:22 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Anatomy of a regression test</title>
		<link>http://nullhole.com/2009/08/02/anatomy-of-a-regression-test/</link>
		<comments>http://nullhole.com/2009/08/02/anatomy-of-a-regression-test/#comments</comments>
		<pubDate>Sun, 02 Aug 2009 22:01:59 +0000</pubDate>
		<dc:creator>rhettg</dc:creator>
				<category><![CDATA[software]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://nullhole.com/?p=70</guid>
		<description><![CDATA[I ran into an issue with the socket module of the python standard library. This always comes as a surprise to me when I find a problem with something as mature as python. But it happens.
My own issue involved having a python based daemon running that does HTTP requests (using httplib) to another service. This [...]]]></description>
			<content:encoded><![CDATA[<p>I ran into an issue with the socket module of the python standard library. This always comes as a surprise to me when I find a problem with something as mature as python. But it happens.</p>
<p>My own issue involved a python based daemon that makes HTTP requests (using httplib) to another service. This daemon is restarted gracefully by sending SIGTERM, which it catches with a signal handler; it then finishes up what it was doing and exits. The problem arises if it receives a signal while in a system call, for example while receiving the response from an HTTP request. The system call then fails with EINTR. The correct behavior is to attempt the system call again; however, the actual system call is abstracted away, so neither the caller nor even httplib can retry.</p>
<p>The crux of the issue is the readline() function provided by the fileobject socket wrapper in socket.py:</p>
<pre>
                self._rbuf = StringIO()  # reset _rbuf.  we consume it via buf.
                data = None
                recv = self._sock.recv
                while data != "\n":
                    data = recv(1)
                    if not data:
                        break
                    buffers.append(data)
                return "".join(buffers)
</pre>
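<p>For illustration, here is a minimal sketch of the retry idea, not the actual patch from the issue. The names recv_retrying and make_flaky_recv are hypothetical, and the sketch is written for Python 3, where socket.error is an alias of OSError. Python 3.5 and later already retry interrupted system calls on their own, so treat this only as a picture of what the fix has to do.</p>

```python
import errno


def recv_retrying(recv, size):
    """Call recv(size), retrying whenever a signal interrupts it (EINTR)."""
    while True:
        try:
            return recv(size)
        except OSError as e:
            # Any error other than "interrupted by a signal" is a real failure.
            if e.errno != errno.EINTR:
                raise
            # Interrupted: loop around and issue the same call again.


def make_flaky_recv():
    """Hypothetical stand-in for sock.recv: fails once with EINTR, then returns data."""
    steps = iter([OSError(errno.EINTR, "Interrupted system call"), b"hello\n"])

    def recv(size):
        step = next(steps)
        if isinstance(step, Exception):
            raise step
        return step

    return recv


result = recv_retrying(make_flaky_recv(), 1024)
```

<p>The caller never sees the interruption; it only sees the data from the second attempt.</p>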
<p>I&#8217;m not the first to find this, as this <a href="http://bugs.python.org/issue1628205">issue</a> even has a patch. But, due to the &#8220;test needed&#8221; status, it&#8217;s been sitting there getting no attention for quite a while. Well, I want it fixed, so let&#8217;s try to write a regression test!</p>
<p>The first step was to apply this patch to an appropriate development branch:</p>
<pre>
  svn co http://svn.python.org/projects/python/branches/release26-maint python26
  cd python26/Lib
  patch -p0 < ~/socket.py.diff
</pre>
<p>Now it turned out, this didn't apply cleanly, as the patch was from an earlier version. But it was easy enough to fix.</p>
<p>Secondly, I need to add a test case to Lib/test/test_socket.py.<br />
There is already a test case for the normal behavior of fileobject; however, causing a real socket to generate an EINTR isn't exactly easy. But I just need to test the error handling, and this is a unit test. Perfect case for using a mock object. Now there aren't any handy mock object libraries in the standard python distribution, so I'll just keep it simple:</p>
<pre>
        class MockSocket(object):
            def __init__(self):
                # Build an iterator of functions; each call to recv() calls the
                # next function and returns its result (or raises its error)
                def raise_error():
                    raise socket.error(errno.EINTR)
                self._step = iter([
                    lambda : "This is the first line\nAnd the sec",
                    raise_error,
                    lambda : "ond line is here\n",
                    lambda : None,
                ])

            def recv(self, size):
                return self._step.next()()
</pre>
<p>Now when I create my test case, I'll just pass this mock socket in and call readline on it.</p>
<pre>
class FileObjectInterruptedTestCase(unittest.TestCase):
    """Test that the file object correctly handles being interrupted by a signal."""
    def setUp(self):
        self._mock_sock = MockSocket()  # the mock socket class shown above

    def test(self):
        fo = socket._fileobject(self._mock_sock)
        self.assertEquals(fo.readline(), "This is the first line\n")
        self.assertEquals(fo.readline(), "And the second line is here\n")
</pre>
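<p>To try the same testing idea without a patched interpreter, here is a self-contained sketch for Python 3. socket._fileobject does not exist there, so a hypothetical LineReader class stands in for it; it is an assumption made for illustration, not the standard library's code. The mock feeds it the same sequence of steps: some data, an EINTR, more data, then end of stream.</p>

```python
import errno
import unittest


class MockSocket(object):
    """Scripted socket: each recv() call plays the next step in order."""

    def __init__(self):
        self._steps = iter([
            b"This is the first line\nAnd the sec",
            OSError(errno.EINTR, "Interrupted system call"),
            b"ond line is here\n",
            b"",  # an empty read signals end of stream
        ])

    def recv(self, size):
        step = next(self._steps)
        if isinstance(step, Exception):
            raise step
        return step


class LineReader(object):
    """Hypothetical stand-in for the file object wrapper under test."""

    def __init__(self, sock):
        self._sock = sock
        self._buf = b""

    def _recv(self, size):
        # Retry the read for as long as it keeps being interrupted by a signal.
        while True:
            try:
                return self._sock.recv(size)
            except OSError as e:
                if e.errno != errno.EINTR:
                    raise

    def readline(self):
        # Read until the buffer holds a full line or the stream ends.
        while b"\n" not in self._buf:
            data = self._recv(4096)
            if not data:
                break
            self._buf += data
        line, sep, rest = self._buf.partition(b"\n")
        self._buf = rest
        return line + sep


class LineReaderInterruptedTestCase(unittest.TestCase):
    """The reader should survive an EINTR in the middle of a line."""

    def setUp(self):
        self._mock_sock = MockSocket()

    def test(self):
        reader = LineReader(self._mock_sock)
        self.assertEqual(reader.readline(), b"This is the first line\n")
        self.assertEqual(reader.readline(), b"And the second line is here\n")
```

<p>Without the retry loop in _recv, the second readline() call would raise instead of returning the second line, which is exactly the failure the regression test is meant to catch.</p>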
<p>Now to find out if this test case will allow this fix to be included......</p>
]]></content:encoded>
			<wfw:commentRss>http://nullhole.com/2009/08/02/anatomy-of-a-regression-test/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>First impressions of couchdb</title>
		<link>http://nullhole.com/2009/02/17/first-impressions-of-couchdb/</link>
		<comments>http://nullhole.com/2009/02/17/first-impressions-of-couchdb/#comments</comments>
		<pubDate>Tue, 17 Feb 2009 09:17:22 +0000</pubDate>
		<dc:creator>rhettg</dc:creator>
				<category><![CDATA[couchdb]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://nullhole.com/?p=51</guid>
		<description><![CDATA[For the last few weeks I&#8217;ve been playing with couchdb. I have not had much time but primarily I wanted to see how it performed for a common task I deal with at work. This is not your common &#8220;write a blog&#8221; or generic web implementation of something. In fact, I really wasn&#8217;t sure if [...]]]></description>
			<content:encoded><![CDATA[<p>For the last few weeks I&#8217;ve been playing with <a href="http://couchdb.apache.org/">couchdb</a>. I have not had much time but primarily I wanted to see how it performed for a common task I deal with at work. This is not your common &#8220;write a blog&#8221; or generic web implementation of something. In fact, I really wasn&#8217;t sure if couchdb was an appropriate tool for this job at all. However, it seemed like a really easy tool to use, and perhaps even a poor-man&#8217;s hadoop for playing with map-reduce ideas.</p>
<p><strong>The Problem</strong></p>
<p>Imagine, if you will, daily log files of about 1.2 gigs (about 2.4 million lines). These log files are repr() python structures (and very easily translate into json). The information in them isn&#8217;t very important, but let&#8217;s say for example they detail clicks on a website.</p>
<p>We slice and dice this information in several ways for different reporting purposes. All said and done, I think we process these logs 3 times every night, and each pass takes about 30-45 minutes. The general methodology is to run through the logs totaling up certain values, mapping page types to number of events, etc. Once the counts are generated, we insert them all into MySQL. For some of these we end up wanting other rollup sizes as well&#8230; daily is our finest granularity, but often we roll up into weekly and monthly versions as well. In practice, inserting into the database is often the slowest part and tends to adversely affect other processes also using the database. Let me reiterate that: inserting daily and monthly rollup data into MySQL is annoyingly resource intensive. I&#8217;m not even trying to put the raw data in to report on the fly.</p>
<p>Not an ideal situation, but it&#8217;s working for now.</p>
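<p>As a rough picture of one of those nightly passes, here is a minimal sketch that totals events per page type from repr() lines. The field name page_type and the sample lines are made up for illustration, since the real log fields are not shown here.</p>

```python
import ast
from collections import Counter

# Hypothetical sample lines; the real logs have other fields.
sample_lines = [
    "{'page_type': 'home', 'user': 1}",
    "{'page_type': 'search', 'user': 2}",
    "{'page_type': 'home', 'user': 3}",
]


def count_by_page_type(lines):
    """Total up the number of events seen for each page type."""
    counts = Counter()
    for line in lines:
        # literal_eval safely parses the repr() of plain python structures.
        record = ast.literal_eval(line)
        counts[record["page_type"]] += 1
    return counts


counts = count_by_page_type(sample_lines)
```

<p>The resulting counts are what would then be inserted into the database as one daily rollup.</p>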
<p><strong>Couchdb Solution</strong></p>
<p>My theory was that couchdb could provide all these reporting functions in a much more flexible way than these custom reporting scripts plus a relational db could. The hope is that I could just load the raw data into couchdb, write my views and I&#8217;d be good to go. The big question mark was whether couchdb was fast enough to make this feasible.</p>
<p><strong>The Setup</strong></p>
<p>After some discussion with some helpful people in the couchdb user mailing list, I arrived at the following setup and performance tweaks:</p>
<ul>
<li>couchdb 0.9.0a&lt;whatever trunk is&gt; (allegedly MUCH faster than official released versions)</li>
<li>Latest version of Erlang (5.6.5, apparently 5.1, which seems to be the default ubuntu install, is REALLY SLOW)</li>
<li>Effective use of _bulk_docs (sorta awkward way to do uploads in batches. I chose a batch size of 1500 lines)</li>
<li>Generated my own sequential doc ids (auto-generated ids are quite slow as they are not sequential, and we are living in a b-tree world)</li>
</ul>
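<p>The last two tweaks can be sketched together. This is a minimal illustration and not the loader I actually ran: the helper names, the 12-digit zero-padded id format and the sample records are assumptions. It only builds the JSON bodies that would be POSTed to _bulk_docs and leaves out the HTTP calls.</p>

```python
import itertools
import json

BATCH_SIZE = 1500  # batch size mentioned in the list above


def batches(iterable, size):
    """Yield lists of up to `size` items until the iterable runs out."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_payloads(records, batch_size=BATCH_SIZE):
    """Yield one _bulk_docs JSON body per batch, with sequential doc ids."""
    for chunk in batches(enumerate(records), batch_size):
        docs = []
        for seq, record in chunk:
            doc = dict(record)
            # Zero padding (an assumed format) keeps the ids sorted in insert order.
            doc["_id"] = "%012d" % seq
            docs.append(doc)
        yield json.dumps({"docs": docs})


# Hypothetical records standing in for parsed log lines.
records = [{"page_type": "home"} for _ in range(3200)]
payloads = list(bulk_payloads(records))
```

<p>Because the ids only ever increase, each new document lands at the end of the b-tree instead of at a random spot in it.</p>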
<p>I&#8217;m using a quad core opteron 2ghz machine with 8 gigs of RAM. Storage is an XFS raid volume, but I&#8217;m trying to get more details on this; I *think* it&#8217;s some external scsi raid array.</p>
<p>There was some question as to how parallel processing would affect speeds. There are a few possible setups:</p>
<ul>
<li>Single data loader</li>
<li>Multiple data loaders, same db</li>
<li>Multiple data loaders, different dbs, different machines (merge with replication ?)</li>
</ul>
<p>I tried the first two, but the 3rd is a bit more complex. I&#8217;d like to try it, but then again I&#8217;m really looking for a solution that is &#8220;good enough&#8221;.</p>
<p><strong>The Result</strong></p>
<p>Single threaded</p>
<p>Base line (running through the logs without inserting into couchdb) was 4:36</p>
<p>It took 33 min, 26 seconds to load 2.4 million rows. On disk, this took 959 megs (which is smaller than the log file the data came from). So that&#8217;s about 1200 rows per second.</p>
<p>Dual Loaders</p>
<p>Base line was 3:23.</p>
<p>Inserting into couchdb, I got it to 19 min, 16 seconds, or about 2000 rows per second.</p>
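<p>For anyone who wants to check the arithmetic, here is a small sketch. It treats 2.4 million as an exact row count, which is an approximation.</p>

```python
ROWS = 2400000  # approximate row count from the log file


def rows_per_second(rows, minutes, seconds):
    """Average load rate for a run lasting minutes:seconds."""
    return rows / float(minutes * 60 + seconds)


single = rows_per_second(ROWS, 33, 26)  # single loader
dual = rows_per_second(ROWS, 19, 16)    # two loaders, same db
speedup = dual / single
```

<p>That works out to roughly 1196 and 2076 rows per second, so the second loader bought about a 1.7x speedup rather than a full 2x.</p>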
<p>Note that <em>compaction</em> (the process of reclaiming deleted space, making the data structure as efficient on disk as possible) resulted in no space savings. It did take about 6 minutes to run though.</p>
<p><strong>Conclusion</strong></p>
<p>Though loading data into couchdb is just the start, I feel reasonably comfortable with my results. If having the data in couchdb is as flexible as I&#8217;m hoping, it should be fairly easy to convert these multi-step reporting projects into something a little more manageable (and scalable).</p>
<p>As for using couchdb in general, I&#8217;ve been pretty impressed. The whole thing is refreshingly simple. The JSON/REST interface is super easy to build tools around. Installation wasn&#8217;t really that hard, even with needing to install most everything from source for performance reasons.</p>
<p>The community has been quite supportive and knowledgeable&#8230;. albeit small. This couchdb project isn&#8217;t taking the world by storm quite yet, but it&#8217;s making a lot of progress.</p>
<p>Updates on actually using this data to come&#8230;&#8230;.</p>
]]></content:encoded>
			<wfw:commentRss>http://nullhole.com/2009/02/17/first-impressions-of-couchdb/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
