From 60a3ab4e7fd5c661393cc9c19ac389c3dbdbe292 Mon Sep 17 00:00:00 2001 From: "A. Jesse Jiryu Davis" Date: Sat, 5 Jun 2021 21:58:21 -0400 Subject: [PATCH] prune old posts --- .../content/a-curious-concurrency-case.md | 109 -------- ...-include-tag-to-underscore-js-templates.md | 78 ------ ...vent-synchronization-primitive-for-ruby.md | 41 --- .../content/announcing-motor-0-2-rc0.md | 39 --- emptysquare/content/announcing-motor-0-4-1.md | 20 -- ...ymongo-3-could-not-find-cursor-in-cache.md | 40 --- emptysquare/content/greenletprofiler.md | 83 ------ ...ow-to-do-an-isolated-install-of-brubeck.md | 102 ------- .../content/mongodb-full-text-search.md | 122 -------- .../content/mongodb-testing-network-errors.md | 49 ---- emptysquare/content/motor-0-1-1-released.md | 34 --- emptysquare/content/motor-0-3-3-released.md | 33 --- emptysquare/content/motor-01-migration.md | 30 -- .../motor-installation-instructions.md | 88 ------ emptysquare/content/motor-is-growing-up.md | 32 --- .../content/motor-officially-released.md | 33 --- .../motor-progress-report-the-road-to-0-2.md | 53 ---- emptysquare/content/motor-progress-report.md | 31 --- emptysquare/content/nginx-spellcasting.md | 35 --- emptysquare/content/pausing-with-tornado.md | 34 --- emptysquare/content/pymongo-2-4-2-is-out.md | 30 -- .../content/pymongo-use-greenlets-followup.md | 58 ---- .../pymongos-new-default-safe-writes.md | 119 -------- .../read-your-writes-consistency-pymongo.md | 62 ----- .../real-time-profiling-a-mongodb-cluster.md | 128 --------- ...efactoring-tornado-code-with-gen-engine.md | 227 --------------- .../content/requests-in-python-and-mongodb.md | 223 --------------- .../restructured-text-chrome-livereload.md | 39 --- ...cturedtext-in-pycharm-firefox-and-anger.md | 27 -- ...-the-monkey-reliably-writing-to-mongodb.md | 263 ------------------ .../synchronously-build-mongodb-indexes.md | 78 ------ emptysquare/content/toro-0-6-released.md | 24 -- emptysquare/content/toro-0-7-released.md | 33 --- .../using-jqtouch-js-with-ibutton-js.md | 84 ------ .../wasps-nest-read-copy-update-python.md | 215 -------------- ...points-simple-extensions-to-tornado-gen.md | 42 --- 36 files changed, 2738 deletions(-) delete mode 100644 emptysquare/content/a-curious-concurrency-case.md delete mode 100644 emptysquare/content/adding-an-include-tag-to-underscore-js-templates.md delete mode 100644 emptysquare/content/an-event-synchronization-primitive-for-ruby.md delete mode 100644 emptysquare/content/announcing-motor-0-2-rc0.md delete mode 100644 emptysquare/content/announcing-motor-0-4-1.md delete mode 100644 emptysquare/content/caution-critical-bug-in-pymongo-3-could-not-find-cursor-in-cache.md delete mode 100644 emptysquare/content/greenletprofiler.md delete mode 100644 emptysquare/content/how-to-do-an-isolated-install-of-brubeck.md delete mode 100644 emptysquare/content/mongodb-full-text-search.md delete mode 100644 emptysquare/content/mongodb-testing-network-errors.md delete mode 100644 emptysquare/content/motor-0-1-1-released.md delete mode 100644 emptysquare/content/motor-0-3-3-released.md delete mode 100644 emptysquare/content/motor-01-migration.md delete mode 100644 emptysquare/content/motor-installation-instructions.md delete mode 100644 emptysquare/content/motor-is-growing-up.md delete mode 100644 emptysquare/content/motor-officially-released.md delete mode 100644 emptysquare/content/motor-progress-report-the-road-to-0-2.md delete mode 100644 emptysquare/content/motor-progress-report.md delete mode 100644 
emptysquare/content/nginx-spellcasting.md delete mode 100644 emptysquare/content/pausing-with-tornado.md delete mode 100644 emptysquare/content/pymongo-2-4-2-is-out.md delete mode 100644 emptysquare/content/pymongo-use-greenlets-followup.md delete mode 100644 emptysquare/content/pymongos-new-default-safe-writes.md delete mode 100644 emptysquare/content/read-your-writes-consistency-pymongo.md delete mode 100644 emptysquare/content/real-time-profiling-a-mongodb-cluster.md delete mode 100644 emptysquare/content/refactoring-tornado-code-with-gen-engine.md delete mode 100644 emptysquare/content/requests-in-python-and-mongodb.md delete mode 100644 emptysquare/content/restructured-text-chrome-livereload.md delete mode 100644 emptysquare/content/restructuredtext-in-pycharm-firefox-and-anger.md delete mode 100644 emptysquare/content/save-the-monkey-reliably-writing-to-mongodb.md delete mode 100644 emptysquare/content/synchronously-build-mongodb-indexes.md delete mode 100644 emptysquare/content/toro-0-6-released.md delete mode 100644 emptysquare/content/toro-0-7-released.md delete mode 100644 emptysquare/content/using-jqtouch-js-with-ibutton-js.md delete mode 100644 emptysquare/content/wasps-nest-read-copy-update-python.md delete mode 100644 emptysquare/content/yieldpoints-simple-extensions-to-tornado-gen.md diff --git a/emptysquare/content/a-curious-concurrency-case.md b/emptysquare/content/a-curious-concurrency-case.md deleted file mode 100644 index 953581bc..00000000 --- a/emptysquare/content/a-curious-concurrency-case.md +++ /dev/null @@ -1,109 +0,0 @@ -+++ -type = "post" -title = "A Curious Concurrency Case" -date = "2013-03-03T16:14:09" -description = "A subtle performance bug in the MongoDB Ruby driver's connection pool." -category = ["MongoDB", "Programming"] -tag = ["ruby"] -enable_lightbox = false -thumbnail = "percentage-unused-sockets.png" -draft = false -disqus_identifier = "5133b83353937431d6bf0c88" -disqus_url = "https://emptysqua.re/blog/5133b83353937431d6bf0c88/" -+++ - -

Last month, the team in charge of 10gen's Ruby driver for MongoDB ran into a few concurrency bugs, reported by a customer running the driver in JRuby with a large number of threads and connections. I've barely written a line of Ruby in my life, but I jumped in to help for a week anyway.

-

I helped spot a very interesting performance bug in the driver's connection pool. The fix was easy, but thoroughly characterizing the bug turned out to be complex. Here's a record of my investigation.

-
-

The Ruby driver's pool assigns a socket to a thread when the thread first calls checkout, and that thread stays pinned to its socket for life. Until the pool reaches its configured max_size, each new thread has a bespoke socket created for it. Additional threads are assigned random existing sockets. When a thread next calls checkout, if its socket's in use (by another thread) the requesting thread waits in a queue.

-

Here's a simplified version of the pool:

-
class Pool
-  def initialize(max_size)
-    @max_size       = max_size
-    @sockets        = []
-    @checked_out    = []
-    @thread_to_sock = {}
-    @lock           = Mutex.new
-    @queue          = ConditionVariable.new
-  end
-
-  # Check out an existing socket or create a
-  # new socket if max_size not exceeded.
-  # Otherwise, wait for the next socket.
-  def checkout
-    tid = Thread.current.object_id
-    loop do
-      @lock.synchronize do
-        if sock = @thread_to_sock[tid]
-
-          # Thread wants its prior socket
-          if !@checked_out.include?(sock)
-            # Acquire the socket
-            @checked_out << sock
-            return sock
-          end
-
-        else
-
-          if @sockets.size < @max_size
-
-            # Assign new socket to thread
-            sock = create_connection
-            @thread_to_sock[tid] = sock
-            return sock
-
-          elsif @checked_out.size < @sockets.size
-
-            # Assign a random socket (one not checked out) to the thread
-            available = @sockets - @checked_out
-            sock = available[rand(available.length)]
-            @thread_to_sock[tid] = sock
-            return sock
-
-          end
-
-        end
-
-        # Release lock, wait to try again
-        @queue.wait(@lock)
-      end
-    end
-  end
-
-  # Return a socket to the pool.
-  def checkin(socket)
-    @lock.synchronize do
-      @checked_out.delete(socket)
-      @queue.signal
-    end
-  end
-end
-
- - -

When a thread returns a socket, it signals the queue and wakes the next thread in line. That thread goes to the top of the loop and tries again to acquire its socket. The bug is in checkin: if the next thread in the queue is waiting for a different socket than the one just checked in, it may fail to acquire its socket, and it will sleep again.

-

When I first saw this I thought there must be the possibility of a deadlock. After all, if threads sometimes call checkin without really waking other threads, mustn't there come a time when everyone's waiting and no one has a socket?

-

I wrote a Python script to simulate the Ruby pool and ran it for a few thousand ticks, with various numbers of threads and sockets. It never deadlocked.

-

So I had to stop coding and start thinking.

-
-

Let's say there are N threads and S sockets. N can be greater than, less than, or equal to S. Doesn't matter. Assume the pool has already created all S sockets, and all N threads have sockets assigned. Each thread either:

-
1. Has checked out its socket, and is going to return it and signal the queue, or
2. Is waiting for its socket, or will ask for it in the future, or
3. Has returned its socket and will never ask for it again.
-

To deadlock, all threads must be in state 2.

-

To reach that point, we need N - 1 threads in state 2 and have the Nth thread transition from 1 to 2. (By definition it doesn't go from state 3 to 2.) But when the Nth thread returns its socket and signals the queue, all sockets are now returned, so the next awakened thread won't wait again—its socket is available, so it goes to state 1. Thus, no deadlock.

-

The old code definitely wasn't efficient. It's easy to imagine cases where all a socket's threads were waiting, even though one of them could have been running. Let's say there are 2 sockets and 4 threads:

-
1. Thread 1 has Socket A checked out, Thread 2 has Socket B, Thread 3 is waiting for A, Thread 4 is waiting for B, and they're enqueued like [3, 4].
2. Thread 2 returns B, signals the queue.
3. Thread 3 wakes, can't get A, waits again.
-

At this point, Thread 4 should be running, since its Socket B is available, but it's waiting erroneously for Thread 1 to return A before it wakes.

-

So we changed the code to do queue.broadcast instead of signal, so checkin wakes all the threads, and we released the fixed driver. In the future, even better code may prevent multiple threads from contending for the same socket at all.
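
To see the difference in miniature, here's a sketch in Python (the language of my simulation) rather than the driver's Ruby code; the class and names are purely illustrative. With threading.Condition, notify wakes a single waiter, which may be pinned to a still-busy socket, while notify_all wakes everyone so the thread whose socket just became free can claim it.

```python
import threading

class SketchPool(object):
    """Illustrates the two wake-up strategies, nothing more."""
    def __init__(self):
        self.lock = threading.Lock()
        self.socket_returned = threading.Condition(self.lock)
        self.checked_out = set()

    def checkin_with_signal(self, sock):
        with self.lock:
            self.checked_out.discard(sock)
            # Old behavior: wake exactly one waiter, which may be
            # pinned to a different, still-checked-out socket.
            self.socket_returned.notify()

    def checkin_with_broadcast(self, sock):
        with self.lock:
            self.checked_out.discard(sock)
            # The fix: wake every waiter; the one whose socket is now
            # free acquires it and the rest go back to waiting.
            self.socket_returned.notify_all()
```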

-

The bugfix was obvious. It's much harder to determine exactly how bad the bug was—how common is it for a socket to be unused?

-
-

In my simulated pool there are 10 sockets. Each thread uses its socket for 1‑20 seconds, sleeps one second, and asks for its socket again. I counted how many sockets were in use each second, and subtracted that from S * total_time to get an inefficiency factor:

-

Percentage unused sockets

-

If N=S=10, threads never wait but there's some fake "inefficiency" due to the 1-second sleep. For larger numbers of threads the sleep time becomes irrelevant (because there's always another thread ready to use the socket), but signal adds an inefficiency that declines very slowly from 8% as the number of threads increases. A pool that uses broadcast, in contrast, can saturate its sockets if it has more than 30 threads.

-

I spent hours (mostly on planes) trying to determine why the inefficiency factor acts this way—why 8%? Shouldn't it be worse? And why does it fall, slowly, as N rises? But I'm calling it quits now. Leave a comment if you have any insights, but I'm satisfied that the old pool was wasteful and that the new one is a substantial improvement.

diff --git a/emptysquare/content/adding-an-include-tag-to-underscore-js-templates.md b/emptysquare/content/adding-an-include-tag-to-underscore-js-templates.md deleted file mode 100644 index f358e1f3..00000000 --- a/emptysquare/content/adding-an-include-tag-to-underscore-js-templates.md +++ /dev/null @@ -1,78 +0,0 @@ -+++ -type = "post" -title = "Adding an \"include\" tag to Underscore.js templates" -date = "2011-11-18T14:32:19" -description = "" -category = ["Programming"] -tag = ["javascript"] -enable_lightbox = false -draft = false -disqus_identifier = "169 http://emptysquare.net/blog/?p=169" -disqus_url = "https://emptysqua.re/blog/169 http://emptysquare.net/blog/?p=169/" -+++ - -

I use Backbone.js a lot lately, and since Backbone requires Underscore.js, I usually end up using Underscore's templates rather than introducing another Javascript library dependency like Mustache templates. But Underscore's micro-templating language has an omission that bothered me today: templates can't include each other.

-

So here's a quick and dirty <% include %> tag for Underscore templates:

-
// Extend underscore's template() to allow inclusions
-function template(str, data) {
-    // match "<% include template-id %>"
-    return _.template(
-        str.replace(
-            /<%\s*include\s*(.*?)\s*%>/g,
-            function(match, templateId) {
-                var el = document.getElementById(templateId);
-                return el ? el.innerHTML : '';
-            }
-        ),
-        data
-    );
-}
-
- - -

As you can see, the code simply replaces tags like

-
<% include foo %>
-
- - -

with the contents of the element with id "foo". Use it by throwing code like this into the body of your HTML page:

-
<script type="text/template" id="base-template">
-    Here is a number: <%= n %>
-</script>
-
-<script type="text/template" id="imaginary-template">
-    <% include base-template %> + <%= imaginary %>i
-</script>
-
- - -

And in your Javascript code, do this:

-
// Outputs "Here's a number: 17"
-function showSimpleNumber() {
-    var t = template($('#base-template').html());
-    $('body').html(t({ n: 17 }));
-}
-
-// Outputs "Here's a number: 17 + 42i"
-function showComplexNumber() {
-    var t = template($('#imaginary-template').html());
-    $('body').html(t({ n: 17, imaginary: 42 }));
-}
-
- - -

Enjoy! I leave as an exercise for the reader:

-
1. Cache included templates so the template() function needn't keep doing document.getElementById().innerHTML for an often-included template
2. Create replaceable blocks in templates
3. Pass variables from one template to another
diff --git a/emptysquare/content/an-event-synchronization-primitive-for-ruby.md b/emptysquare/content/an-event-synchronization-primitive-for-ruby.md deleted file mode 100644 index db307e91..00000000 --- a/emptysquare/content/an-event-synchronization-primitive-for-ruby.md +++ /dev/null @@ -1,41 +0,0 @@ -+++ -type = "post" -title = "An Event synchronization primitive for Ruby" -date = "2013-02-09T13:40:53" -description = "A port of Python's threading.Event synchronization primitive for Ruby" -category = ["Programming"] -tag = ["threading"] -enable_lightbox = false -draft = false -disqus_identifier = "51167f1d5393747dd209a86d" -disqus_url = "https://emptysqua.re/blog/51167f1d5393747dd209a86d/" -+++ - -

I helped some Ruby friends implement a rendezvous (aka a barrier). I'm accustomed to using an Event to implement a rendezvous in Python but Ruby doesn't have Events, only Mutexes and ConditionVariables. That's fine, Python's Event is implemented in terms of a mutex and a condition, so it's easy to make an Event in Ruby:

-
class Event
-    def initialize
-        @lock = Mutex.new
-        @cond = ConditionVariable.new
-        @flag = false
-    end
-
-    def set
-        @lock.synchronize do
-            @flag = true
-            @cond.broadcast
-       end
-    end
-
-    def wait
-        @lock.synchronize do
-            if not @flag
-                @cond.wait(@lock)
-            end
-        end
-    end
-end
-
- - -

Ruby's cond.wait(lock) pattern is interesting—you enter a lock so you can call wait, then wait releases the lock so another thread can broadcast the condition, and finally wait reacquires the lock before continuing.

-

I didn't implement is_set since it's unreliable (another thread can change it between the time you check the value and the time you act upon the information) and I didn't do clear since you can just replace the Event with a fresh one.
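
For comparison, here's the rendezvous I had in mind on the Python side, a minimal sketch using the standard threading.Event: each thread announces its arrival and then waits for the other, so neither proceeds until both have arrived.

```python
import threading

arrived_a = threading.Event()
arrived_b = threading.Event()

def side_a():
    arrived_a.set()     # announce arrival
    arrived_b.wait()    # wait for the other side
    print('A passes the rendezvous')

def side_b():
    arrived_b.set()
    arrived_a.wait()
    print('B passes the rendezvous')

threads = [threading.Thread(target=side_a), threading.Thread(target=side_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```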

diff --git a/emptysquare/content/announcing-motor-0-2-rc0.md b/emptysquare/content/announcing-motor-0-2-rc0.md deleted file mode 100644 index 96bfdb07..00000000 --- a/emptysquare/content/announcing-motor-0-2-rc0.md +++ /dev/null @@ -1,39 +0,0 @@ -+++ -type = "post" -title = "Announcing Motor 0.2 release candidate" -date = "2014-04-04T22:32:45" -description = "Motor 0.2 rc0 is a huge change from 0.1, reflecting big improvements in PyMongo, Tornado, and MongoDB itself." -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "motor-musho.png" -draft = false -disqus_identifier = "533f58dc53937441561c1131" -disqus_url = "https://emptysqua.re/blog/533f58dc53937441561c1131/" -+++ - -

Motor

-

I'm excited to offer you Motor 0.2, release candidate zero. Motor is my non-blocking driver for MongoDB and Tornado.

-

The changes from Motor 0.1 to 0.2 are epochal. They were motivated primarily by three events:

- -

Please read the changelog before upgrading. There are backwards-breaking API changes; you must update your code. I tried to make the instructions clear and the immediate effort small. A summary of the changes is in my post, "the road to 0.2".

-

Once you're done reading, upgrade:

-
pip install pymongo==2.7
-pip install https://github.com/mongodb/motor/archive/0.2rc0.zip
-
- - -

The owner's manual is on ReadTheDocs. At the time of this writing, Motor 0.2's docs are in the "latest" branch:

-
-

http://motor.readthedocs.org/en/latest/

-
-

...and Motor 0.1's docs are in "stable":

-
-

http://motor.readthedocs.org/en/stable/

-
-

Enjoy! If you find a bug or want a feature, report it. If I don't hear of any bugs in the next week I'll make the release official.

-

In any case, tweet me if you're building something nifty with Motor. I want to hear from you.

diff --git a/emptysquare/content/announcing-motor-0-4-1.md b/emptysquare/content/announcing-motor-0-4-1.md deleted file mode 100644 index 079bca17..00000000 --- a/emptysquare/content/announcing-motor-0-4-1.md +++ /dev/null @@ -1,20 +0,0 @@ -+++ -type = "post" -title = "Announcing Motor 0.4.1" -date = "2015-05-09T12:07:05" -description = "One critical bugfix." -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "motor-musho.png" -draft = false -disqus_identifier = "554e0d3f5393741c64c21709" -disqus_url = "https://emptysqua.re/blog/554e0d3f5393741c64c21709/" -+++ - -

Motor

-

I received an extraordinarily helpful bug report yesterday from Brent Miller, who showed me that Motor's replica set client hangs if it tries two operations at once, while it is setting up its initial connection. He sent a script that not only reproduces the hang, but diagnoses it, too, by regularly dumping all threads' stacks to a file.

-

A report this generous made my work easy. I found that I'd caused this bug while fixing another one. In the previous bug, if Motor's replica set client was under load while reconnecting to your servers, it could start multiple greenlets to monitor your replica set, instead of just one. (Eventually, Motor will be designed to start multiple greenlets and monitor all servers in parallel, the same as PyMongo 3, but for now, starting multiple monitor greenlets is a bug.)

-

I fixed that bug overzealously: now if you start multiple operations on a replica set client as it connects, it does not start the monitor greenlet at all, and deadlocks. Motor 0.4.1 gets it right. It starts one and only one monitor greenlet as it connects to your replica set. Get it from PyPI:

-
pip install motor==0.4.1
-
diff --git a/emptysquare/content/caution-critical-bug-in-pymongo-3-could-not-find-cursor-in-cache.md b/emptysquare/content/caution-critical-bug-in-pymongo-3-could-not-find-cursor-in-cache.md deleted file mode 100644 index 201c882a..00000000 --- a/emptysquare/content/caution-critical-bug-in-pymongo-3-could-not-find-cursor-in-cache.md +++ /dev/null @@ -1,40 +0,0 @@ -+++ -type = "post" -title = "Caution: Critical Bug In PyMongo 3, \"could not find cursor in cache\"" -date = "2015-04-15T17:39:28" -description = "If you use multiple mongos servers in a sharded cluster, be cautious upgrading to PyMongo 3, we've just discovered a critical bug." -category = ["MongoDB", "Programming", "Python"] -tag = ["pymongo"] -enable_lightbox = false -draft = false -disqus_identifier = "552ed95a5393741c7644f817" -disqus_url = "https://emptysqua.re/blog/552ed95a5393741c7644f817/" -+++ - -

If you use multiple mongos servers in a sharded cluster, be cautious upgrading to PyMongo 3. We've just discovered a critical bug related to our new mongos load-balancing feature.

-

Update: PyMongo 3.0.1 was released April 21, 2015 with fixes for this and other bugs.

-

If you create a MongoClient instance with PyMongo 3 and pass the addresses of several mongos servers, like so:

-
client = MongoClient('mongodb://mongos1,mongos2')
-
- - -

...then the client load-balances among the lowest-latency of them. Read the load-balancing documentation for details. This works correctly except when retrieving more than 101 documents, or more than 4MB of data, from a cursor:

-
collection = client.db.collection
-for document in collection.find():
-    # ... do something with each document ...
-    pass
-
- - -

PyMongo wrongly tries to get subsequent batches of documents from random mongos servers, instead of streaming results from the same server it chose for the initial query. The symptom is an OperationFailure with a server error message, "could not find cursor in cache":

-
Traceback (most recent call last):
-  File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 968, in __next__
-        if len(self.__data) or self._refresh():
-  File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 922, in _refresh
-        self.__id))
-  File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 838, in __send_message
-        codec_options=self.__codec_options)
-  File "/usr/local/lib/python2.7/dist-packages/pymongo/helpers.py", line 110, in _unpack_response
-        cursor_id)
-pymongo.errors.CursorNotFound: cursor id '1025112076089406867' not valid at server
-
diff --git a/emptysquare/content/greenletprofiler.md b/emptysquare/content/greenletprofiler.md deleted file mode 100644 index 976ba31c..00000000 --- a/emptysquare/content/greenletprofiler.md +++ /dev/null @@ -1,83 +0,0 @@ -+++ -type = "post" -title = "GreenletProfiler, A Fast Python Profiler For Gevent" -date = "2014-01-27T12:11:20" -description = "A new profiler that can accurately analyze Gevent applications." -category = ["Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "cProfile-bar-vs-foo.png" -draft = false -disqus_identifier = "52e53b465393747fe3c1c018" -disqus_url = "https://emptysqua.re/blog/52e53b465393747fe3c1c018/" -+++ - -

If you use Gevent, you know it's great for concurrency, but alas, none of the Python performance profilers work on Gevent applications. So I'm taking matters into my own hands. I'll show you how both cProfile and Yappi stumble on programs that use greenlets, and I'll demonstrate GreenletProfiler, my solution.

-

cProfile Gets Confused by Greenlets

-

I'll write a script that spawns two greenlets, then I'll profile the script to look for the functions that cost the most. In my script, the foo greenlet spins 20 million times. Every million iterations, it yields to Gevent's scheduler (the "hub"). The bar greenlet does the same, but it spins only half as many times.

-
import cProfile
-import gevent
-import lsprofcalltree
- 
-MILLION = 1000 * 1000
- 
-def foo():
-    for i in range(20 * MILLION):
-        if not i % MILLION:
-            # Yield to the Gevent hub.
-            gevent.sleep(0)
- 
-def bar():
-    for i in range(10 * MILLION):
-        if not i % MILLION:
-            gevent.sleep(0)
- 
-profile = cProfile.Profile()
-profile.enable()
- 
-foo_greenlet = gevent.spawn(foo)
-bar_greenlet = gevent.spawn(bar)
-foo_greenlet.join()
-bar_greenlet.join()
- 
-profile.disable()
-stats = lsprofcalltree.KCacheGrind(profile)
-stats.output(open('cProfile.callgrind', 'w'))
-
- - -

Let's pretend I'm a total idiot and I don't know why this program is slow. I profile it with cProfile, and convert its output with lsprofcalltree so I can view the profile in KCacheGrind. cProfile is evidently confused: it thinks bar took twice as long as foo, although the opposite is true:

-

CProfile bar vs foo

-

cProfile also fails to count the calls to sleep. I'm not sure why cProfile's befuddlement manifests this particular way. If you understand it, please explain it to me in the comments. But it's not surprising that cProfile doesn't understand my script: cProfile is built to trace a single thread, so it assumes that if one function is called, and then a second function is called, that the first must have called the second. Greenlets defeat this assumption because the call stack can change entirely between one function call and the next.

-

Yappi Stumbles Over Greenlets

-

Next let's try Yappi, the excellent profiling package by Sumer Cip. Yappi has two big advantages over cProfile: it's built to trace multithreaded programs, and it can measure CPU time instead of wall-clock time. So maybe Yappi will do better than cProfile on my script? I run Yappi like so:

-
yappi.set_clock_type('cpu')
-yappi.start(builtins=True)
- 
-foo_greenlet = gevent.spawn(foo)
-bar_greenlet = gevent.spawn(bar)
-foo_greenlet.join()
-bar_greenlet.join()
- 
-yappi.stop()
-stats = yappi.get_func_stats()
-stats.save('yappi.callgrind', type='callgrind')
-
- - -

Yappi thinks that when foo and bar call gevent.sleep, they indirectly call Greenlet.run, and eventually call themselves:

-

Yappi call graph

-

This is true in some philosophical sense. When my greenlets sleep, they indirectly cause each other to be scheduled by the Gevent hub. But it's wrong to say they actually call themselves recursively, and it confuses Yappi's cost measurements: Yappi attributes most of the CPU cost of the program to Gevent's internal Waiter.get function. Yappi also, for some reason, thinks that sleep is called only once each by foo and bar, though it knows it was called 30 times in total.

-

Yappi costs

-

GreenletProfiler Groks Greenlets

-

Since Yappi is so great for multithreaded programs, I used it as my starting point for GreenletProfiler. Yappi's core tracing code is in C, for speed. The C code has a notion of a "context" which is associated with each thread. I added a hook to Yappi that lets me associate contexts with greenlets instead of threads. And voilà, the profiler understands my script! foo and bar are correctly measured as two-thirds and one-third of the script's total cost:

-

GreenletProfiler costs

-

Unlike Yappi, GreenletProfiler also knows that foo calls sleep 20 times and bar calls sleep 10 times:

-

GreenletProfiler call graph

-

Finally, I know which functions to optimize because I have an accurate view of how my script executes.

-

Conclusion

-

I can't take much credit for GreenletProfiler, because I stand on the shoulders of giants. Specifically I am standing on the shoulders of Sumer Cip, Yappi's author. But I hope it's useful to you. Install it with pip install GreenletProfiler, profile your greenletted program, and let me know how GreenletProfiler works for you.
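
If you want a starting point, the calls are meant to mirror the Yappi snippet above; here's a minimal sketch, assuming the module-level functions match that API:

```python
import GreenletProfiler

# Assumed to mirror yappi's API, as described above.
GreenletProfiler.set_clock_type('cpu')
GreenletProfiler.start()

# ... spawn and join your greenlets here ...

GreenletProfiler.stop()
stats = GreenletProfiler.get_func_stats()
stats.print_all()
stats.save('profile.callgrind', type='callgrind')
```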

- diff --git a/emptysquare/content/how-to-do-an-isolated-install-of-brubeck.md b/emptysquare/content/how-to-do-an-isolated-install-of-brubeck.md deleted file mode 100644 index 2156e93e..00000000 --- a/emptysquare/content/how-to-do-an-isolated-install-of-brubeck.md +++ /dev/null @@ -1,102 +0,0 @@ -+++ -type = "post" -title = "How To Do An Isolated Install of Brubeck" -date = "2012-01-05T15:56:56" -description = "I wanted to install James Dennis's Brubeck web framework, but lately I've become fanatical about installing nothing, nothing, in the system-wide directories. A simple rm -rf brubeck/ should make it like nothing ever happened. So that I [ ... ]" -category = ["Programming", "Python"] -tag = ["brubeck", "isolated", "virtualenv"] -enable_lightbox = false -thumbnail = "brubeck.png" -draft = false -disqus_identifier = "286 http://emptysquare.net/blog/?p=286" -disqus_url = "https://emptysqua.re/blog/286 http://emptysquare.net/blog/?p=286/" -+++ - -

-

I wanted to install James Dennis's Brubeck web framework, but lately I've become fanatical about installing nothing, nothing, in the system-wide directories. A simple rm -rf brubeck/ should make it like nothing ever happened.

-

So that I remember this for next time, here's how I did an isolated install of Brubeck and all its dependencies on Mac OS Lion.

-

Install virtualenv and virtualenvwrapper (but of course you've already done this, because you're elite like me).

-

Make a virtualenv

-
mkvirtualenv brubeck; cdvirtualenv
-
- - -

ZeroMQ

-
wget http://download.zeromq.org/zeromq-2.2.0.tar.gz
-tar zxf zeromq-2.2.0.tar.gz 
-cd zeromq-2.2.0
-./autogen.sh
-./configure --prefix=$VIRTUAL_ENV # Don't install system-wide, just in this directory
-make && make install
-cd ..
-
- - -

Mongrel2

-
git clone https://github.com/zedshaw/mongrel2.git
-cd mongrel2
-emacs Makefile
-
- - -

Add a line like this to the top of the Makefile, so the compiler can find where you've installed ZeroMQ's header and lib files:

-
OPTFLAGS += -I$(VIRTUAL_ENV)/include -L$(VIRTUAL_ENV)/lib
-
- - -

and replace PREFIX?=/usr/local with PREFIX?=$(VIRTUAL_ENV)

-
make && make install
-cd ..
-
- - -

Libevent

-

Libevent (required by Gevent) is pretty much the same dance as ZeroMQ:

-
wget https://github.com/downloads/libevent/libevent/libevent-2.0.19-stable.tar.gz
-tar zxf libevent-2.0.19-stable.tar.gz
-cd libevent-2.0.19-stable
-./configure --prefix=$VIRTUAL_ENV
-make
-make install
-cd ..
-
- - -

Python Packages

-

First get Brubeck's requirements file:

-
git clone https://github.com/j2labs/brubeck.git
-cd brubeck
-
- - -

Now we need our isolated include/ and lib/ directories available on the path when we install Brubeck's Python package dependencies. Specifically, the gevent_zeromq package has some C code that needs to find zmq.h and libzmq in order to compile. We'll do that by setting the LIBRARY_PATH and C_INCLUDE_PATH environment variables:

-
export LIBRARY_PATH=$VIRTUAL_ENV/lib
-export C_INCLUDE_PATH=$VIRTUAL_ENV/include
-pip install -I -r ./envs/brubeck.reqs
-pip install -I -r ./envs/gevent.reqs
-
- - -

How nice is that?

-

(If it didn't work because of a gcc error message, try symlinking gcc into the place that Python expects it:

-
sudo ln -s /usr/bin/gcc /usr/bin/gcc-4.2
-
- - -

... and try pip install again.)

-

Next

-

Once you're here, you have a completely isolated install of ZeroMQ, Mongrel2, Brubeck, and all its package dependencies. Continue with James's Brubeck installation instructions at the "A Demo" portion.

diff --git a/emptysquare/content/mongodb-full-text-search.md b/emptysquare/content/mongodb-full-text-search.md deleted file mode 100644 index 790a2dc6..00000000 --- a/emptysquare/content/mongodb-full-text-search.md +++ /dev/null @@ -1,122 +0,0 @@ -+++ -type = "post" -title = "MongoDB Full Text Search" -date = "2013-01-12T12:20:57" -description = "How to power your Python web application's search with MongoDB" -category = ["MongoDB", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "320px-dictionary-indents-headon.jpg" -draft = false -disqus_identifier = "50f199ba53937408d1c6e87e" -disqus_url = "https://emptysqua.re/blog/50f199ba53937408d1c6e87e/" -+++ - -

Dictionary indents headon

-

Wikimedia commons

-

Yesterday we released the latest unstable version of MongoDB; the headline feature is basic full-text search. You can read all about MongoDB's full text search in the release notes.

-

This blog had been using a really terrible method for search, involving regular expressions, a full collection scan for every search, and no ranking of results by relevance. I wanted to replace all that cruft with MongoDB's full-text search ASAP. Here's what I did.

-

Plain Text

-

My blog is written in Markdown and displayed as HTML. What I want to actually search is the posts' plain text, so we need a new field called plain on each post document in MongoDB. That plain field is what we're going to index.

-

First, I customized Python's standard HTMLParser to strip tags from the HTML:

-
import re
-from HTMLParser import HTMLParser
-
-whitespace = re.compile('\s+')
-
-class HTMLStripTags(HTMLParser):
-    """Strip tags
-    """
-    def __init__(self, *args, **kwargs):
-        HTMLParser.__init__(self, *args, **kwargs)
-        self.out = ""
-
-    def handle_data(self, data):
-        self.out += data
-
-    def handle_entityref(self, name):
-        self.out += '&%s;' % name
-
-    def handle_charref(self, name):
-        return self.handle_entityref('#' + name)
-
-    def value(self):
-        # Collapse whitespace
-        return whitespace.sub(' ', self.out).strip()
-
-def plain(html):
-    parser = HTMLStripTags()
-    parser.feed(html)
-    return parser.value()
-
- - -

Updated Jan 14, 2013: Better code, fixed whitespace-handling bugs.

-

I wrote a script that runs through all my existing posts, extracts the plain text from the HTML, and stores it in a new field on each document called plain. I also updated my blog's code so it now updates the plain field on each post whenever I save a post.
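
The backfill itself is just a loop over the collection. Here's a rough sketch with PyMongo; the html field name and database name are stand-ins for whatever your schema actually uses:

```python
from pymongo import MongoClient

db = MongoClient().my_blog  # hypothetical database name

for post in db.posts.find({}, {'html': True}):
    # plain() is the HTMLStripTags-based helper defined above.
    db.posts.update({'_id': post['_id']},
                    {'$set': {'plain': plain(post['html'])}})
```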

-

Creating the Index

-

I installed MongoDB 2.3.2 and started it with this command line option:

-
--setParameter textSearchEnabled=true
-
- - -

Without that option, creating a text index causes a server error, "text search not enabled".

-

Next I created a text index on posts' titles, category names, tags, and the plain text that I generated above. I can set different relevance weights for each field. The title contributes most to a post's relevance score, followed by categories and tags, and finally the text. In Python, the index declaration looks like:

-
db.posts.create_index(
-    [
-        ('title', 'text'),
-        ('categories.name', 'text'),
-        ('tags', 'text'), ('plain', 'text')
-    ],
-    weights={
-        'title': 10,
-        'categories.name': 5,
-        'tags': 5,
-        'plain': 1
-    }
-)
-
- - -

Note that you'll need to install PyMongo from the current master in GitHub or wait for PyMongo 2.4.2 in order to create a text index. PyMongo 2.4.1 and earlier throw an exception:

-
TypeError: second item in each key pair must be
-ASCENDING, DESCENDING, GEO2D, or GEOHAYSTACK
-
- - -

If you don't want to upgrade PyMongo, just use the mongo shell to create the index:

-
db.posts.createIndex(
-    {
-        title: 'text',
-        'categories.name': 'text',
-        tags: 'text',
-        plain: 'text'
-    },
-    {
-        weights: {
-            title: 10,
-            'categories.name': 5,
-            tags: 5,
-            plain: 1
-        }
-    }
-)
-
- - -

Searching the Index

-

To use the text index I can't do a normal find, I have to run the text command. In my async driver Motor, this looks like:

-
response = yield motor.Op(self.db.command, 'text', 'posts',
-    search=q,
-    filter={'status': 'publish', 'type': 'post'},
-    projection={
-        'display': False,
-        'original': False,
-        'plain': False
-    },
-    limit=50)
-
- - -

The q variable is whatever you typed into the search box on the left, like "mongo" or "hamster" or "python's thread locals are weird". The filter option ensures only published posts are returned, and the projection avoids returning large unneeded fields. Results are sorted with the most relevant first, and the limit is applied after the sort.
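
The reply is a plain document rather than a cursor. Assuming the 2.3-era text command's response shape, with matches in a results array of score/obj pairs, you consume it something like this:

```python
# Each entry pairs a projected document with its relevance score.
for result in response['results']:
    post = result['obj']
    print('%.2f  %s' % (result['score'], post['title']))
```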

-

In Conclusion

-

Simple, right? The new text index provides a simple, fully consistent way to do basic search without deploying any extra services. Go read up about it in the release notes.

diff --git a/emptysquare/content/mongodb-testing-network-errors.md b/emptysquare/content/mongodb-testing-network-errors.md deleted file mode 100644 index 26d5c14c..00000000 --- a/emptysquare/content/mongodb-testing-network-errors.md +++ /dev/null @@ -1,49 +0,0 @@ -+++ -type = "post" -title = "Testing Network Errors With MongoDB" -date = "2014-03-20T21:33:50" -description = "A little-known method for simulating a temporary outage with MongoDB." -category = ["MongoDB", "Programming"] -tag = [] -enable_lightbox = false -draft = false -disqus_identifier = "532b9397539374726c12b367" -disqus_url = "https://emptysqua.re/blog/532b9397539374726c12b367/" -+++ - -

Someone asked on Twitter today for a way to trigger a connection failure between MongoDB and the client. This would be terribly useful when you're testing your application's handling of network hiccups.

-

You have options: you could use mongobridge to proxy between the client and the server, and at just the right moment, kill mongobridge.

-

Or you could use packet-filtering tools to accomplish the same: iptables on Linux and ipfw or pfctl on Mac and BSD. You could use one of these tools to block MongoDB's port at the proper moment, and unblock it afterward.

-

There's yet another option, not widely known, that you might find simpler: use a MongoDB "failpoint" to break your connection.

-

Failpoints are our internal mechanism for triggering faults in MongoDB so we can test their consequences. Read about them on Kristina's blog. They're not meant for public consumption, so you didn't hear about it from me.

-

The first step is to start MongoDB with the special command-line argument:

-
mongod --setParameter enableTestCommands=1
-
- - -

Next, log in with the mongo shell and tell the server to abort the next two network operations:

-
> db.adminCommand({
-...   configureFailPoint: 'throwSockExcep',
-...   mode: {times: 2}
-... })
-2014-03-20T20:31:42.162-0400 trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
-
- - -

The server obeys you instantly, before it even replies, so the command itself appears to fail. But fear not: you've simply seen the first of the two network errors you asked for. You can trigger the next error with any operation:

-
> db.collection.count()
-2014-03-20T20:31:48.485-0400 trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
-
- - -

The third operation succeeds:

-
> db.collection.count()
-2014-03-20T21:07:38.742-0400 trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
-2014-03-20T21:07:38.742-0400 reconnect 127.0.0.1:27017 (127.0.0.1) ok
-1
-
- - -

There's a final "failed" message that I don't understand, but the shell reconnects and the command returns the answer, "1".

-

You could use this failpoint when testing a driver or an application. If you don't know exactly how many operations you need to break, you could set times to 50 and, at the end of your test, continue attempting to reconnect until you succeed.
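
Here's a rough sketch of that pattern with PyMongo. The failpoint command itself absorbs the first simulated error, and the loop at the end retries until the remaining errors are used up:

```python
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient()  # against a mongod started with enableTestCommands=1

try:
    client.admin.command('configureFailPoint', 'throwSockExcep',
                         mode={'times': 50})
except AutoReconnect:
    # Expected: the failpoint breaks the connection before the reply.
    pass

# ... exercise your application's error handling here ...

# Reconnect until the failpoint is exhausted.
while True:
    try:
        client.test.collection.count()
        break
    except AutoReconnect:
        pass
```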

-

Ugly, perhaps, but if you want a simple way to cause a network error this could be a reasonable approach.

diff --git a/emptysquare/content/motor-0-1-1-released.md b/emptysquare/content/motor-0-1-1-released.md deleted file mode 100644 index 3c3a15ae..00000000 --- a/emptysquare/content/motor-0-1-1-released.md +++ /dev/null @@ -1,34 +0,0 @@ -+++ -type = "post" -title = "Motor 0.1.1 released" -date = "2013-06-24T12:09:32" -description = "Fixes an incompatibility between Motor and the latest version of PyMongo, by pinning Motor's dependency to PyMongo 2.5.0 exactly." -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "motor-musho.png" -draft = false -disqus_identifier = "51c86f1253937473788cbc8a" -disqus_url = "https://emptysqua.re/blog/51c86f1253937473788cbc8a/" -+++ - -

Motor

-

Motor is my async driver for Tornado and MongoDB. Version 0.1 has been out since early March and is having a successful career with no serious bugs reported so far. Unfortunately PyMongo, the blocking driver that Motor wraps, has changed a bit since then and Motor is no longer compatible with the latest PyMongo. If you did pip install motor you'd pull in Motor 0.1 and PyMongo 2.5.2, and see a failure when opening a MotorReplicaSetClient, like:

-
Traceback (most recent call last):
-  File "example.py", line 3, in <module>
-    client = MotorReplicaSetClient(replicaSet='foo').open_sync()
-  File "motor/__init__.py", line 967, in open_sync
-    super(MotorReplicaSetClient, self).open_sync()
-  File "motor/__init__.py", line 804, in open_sync
-    for pool in self._get_pools():
-  File "motor/__init__.py", line 1004, in _get_pools
-    self.delegate._MongoReplicaSetClient__members.values()]
-  File "pymongo/collection.py", line 1418, in __call__
-    self.__name)
-TypeError: 'Collection' object is not callable. If you meant to call the 'values' method on a 'Database' object it is failing because no such method exists.
-
- - -

This morning I've released a bugfix version of Motor, version 0.1.1, to correct the problem. This version simply updates the installer to pull in PyMongo 2.5.0, the last version that works with Motor, rather than PyMongo 2.5.2, the latest.

-

In the medium term, we'll release a PyMongo 3.0 with well-specified hooks for Motor, and for other libraries that want to do deep customization. Motor can switch to using those hooks, and be much less tightly coupled with particular PyMongo versions.

-

When that happens I can release a Motor 1.0. Meanwhile, I think Motor's low version numbers properly reflect that it's too tightly coupled to PyMongo's internal properties.

diff --git a/emptysquare/content/motor-0-3-3-released.md b/emptysquare/content/motor-0-3-3-released.md deleted file mode 100644 index 3a3f67e3..00000000 --- a/emptysquare/content/motor-0-3-3-released.md +++ /dev/null @@ -1,33 +0,0 @@ -+++ -type = "post" -title = "Motor 0.3.3 Released" -date = "2014-10-04T20:47:26" -description = "Fixes an infinite loop and memory leak." -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "motor-musho.png" -draft = false -disqus_identifier = "543083145393740961f61a1e" -disqus_url = "https://emptysqua.re/blog/543083145393740961f61a1e/" -+++ - -

Motor

-

Today I released version 0.3.3 of Motor, the asynchronous MongoDB driver for Python and Tornado. This release is compatible with MongoDB 2.2, 2.4, and 2.6. It requires PyMongo 2.7.1.

-

This release fixes an occasional infinite loop and memory leak. The bug was triggered when you passed a callback to MotorCursor.each, and Motor had to open a new socket in the process of executing your callback, and your callback raised an exception:

-
from tornado.ioloop import IOLoop
-import motor
-
-loop = IOLoop.instance()
-
-def each(result, error):
-    raise Exception()
-
-collection = motor.MotorClient().test.test
-cursor = collection.find().each(callback=each)
-loop.start()
-
- - -

The bug has been present since Motor 0.2. I am indebted to Eugene Protozanov for an excellent bug report.

-

Get the latest version with pip install --upgrade motor. The documentation is on ReadTheDocs. View the changelog here. If you encounter any issues, please file them in Jira.

diff --git a/emptysquare/content/motor-01-migration.md b/emptysquare/content/motor-01-migration.md deleted file mode 100644 index 83b8c84d..00000000 --- a/emptysquare/content/motor-01-migration.md +++ /dev/null @@ -1,30 +0,0 @@ -+++ -type = "post" -title = "Motor 0.1 Migration Instructions" -date = "2013-03-07T11:42:17" -description = "If you've been using Motor prior to the 0.1 release, here's how to upgrade." -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -draft = false -disqus_identifier = "5138c369539374244689c955" -disqus_url = "https://emptysqua.re/blog/5138c369539374244689c955/" -+++ - -

Motor (which is indeed my non-blocking driver for MongoDB and Tornado) had a 0.1 release to PyPI yesterday. It had an odd history before that, so there are various versions of the code that you, dear reader, may have installed on your system. All you need to do is:

-
$ pip uninstall pymongo motor
-$ pip install motor
-
- - -

Motor will pull in the official PyMongo, plus Tornado and Greenlet, as dependencies. You should now have Motor 0.1 and PyMongo 2.4.2:

-
>>> import pymongo
->>> pymongo.version
-'2.4.2'
->>> import motor
->>> motor.version
-'0.1'
-
- - -

(The lore is: I started Motor last year in a branch of my fork of PyMongo, so you could've installed an experimental version of both PyMongo and Motor from there. Then we transferred Motor into its own repo within the MongoDB.org organization on January 15. And on February 1st a zealous fan actually grabbed the "Motor" package name on PyPI and uploaded my code to it, then transferred ownership to me, just to make sure I could use the name Motor.)

diff --git a/emptysquare/content/motor-installation-instructions.md b/emptysquare/content/motor-installation-instructions.md deleted file mode 100644 index de0a511c..00000000 --- a/emptysquare/content/motor-installation-instructions.md +++ /dev/null @@ -1,88 +0,0 @@ -+++ -type = "post" -title = "Motor Installation Instructions" -date = "2012-10-31T12:31:41" -description = "" -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "motor-musho.png" -draft = false -disqus_identifier = "50914b165393741e3a02ed17" -disqus_url = "https://emptysqua.re/blog/50914b165393741e3a02ed17/" -+++ - -

Motor

-

Update: Motor is in PyPI now, this is all moot

-

I've done a bad job with installation instructions for Motor, my non-blocking driver for MongoDB and Tornado. I've gotten a bunch of emails from people complaining about this:

-
Traceback (most recent call last):    
-  File "myfile.py", line 2, in <module>
-    connection = motor.MotorConnection().open_sync()
-  File ".../motor/__init__.py", line 690, in open_sync
-    raise outcome['error']
-pymongo.errors.ConfigurationError: Unknown option _pool_class
-
- - -

You'll get this ConfigurationError if you installed Motor without uninstalling PyMongo first. But you couldn't know that, because I forgot to tell you.

-

Here's installation instructions, followed by an explanation of why installation is wonky right now and how it will improve, and what Motor's status is now.

-

Installation

-

I assume you have pip, and I recommend you use virtualenv—these are just best practices for all Python application development. You need regular CPython, 2.5 or better.

-
# if you have pymongo installed previously, you MUST uninstall it
-pip uninstall pymongo
-
-# install prerequisites
-pip install tornado greenlet
-
-# get motor
-pip install git+https://github.com/ajdavis/mongo-python-driver.git@motor
-
- - -

Now you should have my versions of pymongo, bson, gridfs, and motor installed:

-
>>> import motor
->>>
-
- - -

Update: If you're testing against a particular version of Motor, you can freeze that requirement and install that version by git hash, like:

-
pip install git+https://github.com/ajdavis/mongo-python-driver.git@694436f
-
- - -

pip will say, "Could not find a tag or branch '694436f', assuming commit," which is what you want. You can put Motor and its dependencies in your requirements.txt:

-
greenlet==0.4.0
-tornado==2.4
-git+https://github.com/ajdavis/mongo-python-driver.git@694436f
-
- - -

And install:

-
pip install -r requirements.txt
-
- - -

Confusingly, the command to uninstall Motor is:

-
pip uninstall pymongo
-
- - -

Why Is Installation Wonky?

-

Why do you have to uninstall 10gen's official PyMongo before installing Motor? Why isn't Motor in PyPI? Why doesn't Motor automatically install the Tornado and Greenlet packages as dependencies? All will be revealed.

-

Implementing Motor requires a few extra hooks in the core PyMongo module. For example, I added a _pool_class option to PyMongo's Connection class. Thus Motor and PyMongo are coupled, and I want them to be versioned together. Motor is a feature of PyMongo that you can choose to use. In the future when Motor is an official 10gen product, Motor and PyMongo will be in the same git repository, and in the same package in PyPI, and when you pip install pymongo, you'll get the motor module installed in your site-packages, just like the pymongo, bson, gridfs modules now. There will never be a separate "Motor" package in PyPI.

-

Even once Motor is official, the whole PyMongo package shouldn't require Tornado and Greenlet as dependencies. So you'll still need to manually install them to make Motor work. PyMongo will still work without Tornado and Greenlet, of course—they won't be necessary until you import motor.

-

Since that's the goal—the Motor module as a feature of PyMongo, in the same repository and the same PyPI package—this beta period is awkward. I'm building Motor in my fork of the PyMongo repo, on a motor branch, and regularly merging the upstream repo's changes. Sometimes, upstream changes to PyMongo break Motor and need small fixes.

-

I don't want to make a PyPI package for Motor, since that package will be obsolete once Motor's merged upstream. And since the eventual version of the PyMongo package that includes Motor won't require Tornado or Greenlet as dependencies, neither does the version in my git repo.

-

Status

-

Motor is feature-complete, and it's compatible with all the Python versions that Tornado is. MotorConnection has been load-tested by the QA team at a large corporation, with good results. At least one small startup has put MotorReplicaSetConnection in production, with one bug reported and fixed—Motor threw the wrong kinds of exceptions during a replica-set failover. I'm now hunting a similar MotorReplicaSetConnection bug reported on the Tornado mailing list.

-

Besides that bug, Motor has 37 TODOs. All are reminders to myself to refactor Motor's interaction with PyMongo, and to ensure every corner of Motor is reviewed, tested, and documented. I need to:

- -

At that point, Bernie and I will decide if Motor is ready to go official, and I'll announce on this blog, and throw a party.

-

Party Cat

diff --git a/emptysquare/content/motor-is-growing-up.md b/emptysquare/content/motor-is-growing-up.md deleted file mode 100644 index 940d5077..00000000 --- a/emptysquare/content/motor-is-growing-up.md +++ /dev/null @@ -1,32 +0,0 @@ -+++ -type = "post" -title = "Motor Is Growing Up" -date = "2013-01-24T23:36:21" -description = "Motor, my async driver for MongoDB and Python Tornado, will be its own package." -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "motor-musho.png" -draft = false -disqus_identifier = "51020bc55393747de89b6614" -disqus_url = "https://emptysqua.re/blog/51020bc55393747de89b6614/" -+++ - -

Motor

-

For a long time I've thought that Motor, my non-blocking Python driver for MongoDB and Tornado, ought to be included as a module within the standard PyMongo package. Everyone both inside and outside 10gen has told me they'd prefer Motor be a separate distribution. Last week, I was suddenly enlightened. I agree!

-

(My argument for keeping Motor and PyMongo together was that changes in PyMongo might require changes in Motor, so they should be versioned and released together. But as Motor nears completion and I see the exact extent of its coupling with PyMongo, the risk of incompatibilities arising seems lower to me than it had.)

-

We completed the first step of the separation yesterday: We released PyMongo 2.4.2, the first version of PyMongo that includes the hooks Motor needs to wrap it and make it non-blocking.

-

The next step is to make a standalone distribution of Motor, and that's almost done, too. Motor has left its parent's house. It has:

- -

And now, installing Motor is finally normal:

-
$ git clone git://github.com/mongodb/motor.git
-$ cd motor
-$ python setup.py install
-
- - -

Motor's not done yet, but it's heading to a 0.1 release in PyPI, as a standalone package, real soon now.

diff --git a/emptysquare/content/motor-officially-released.md b/emptysquare/content/motor-officially-released.md deleted file mode 100644 index ab499f48..00000000 --- a/emptysquare/content/motor-officially-released.md +++ /dev/null @@ -1,33 +0,0 @@ -+++ -type = "post" -title = "Motor Officially Released" -date = "2013-03-06T14:40:06" -description = "The first release of Motor: my full-featured, non-blocking driver for Python, Tornado, and MongoDB." -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "motor-musho.png" -draft = false -disqus_identifier = "51379a405393741a1404f58b" -disqus_url = "https://emptysqua.re/blog/51379a405393741a1404f58b/" -+++ - -

Motor

-

It's happened. Motor 0.1 is in PyPI. You can now install it with a simple:

-
$ pip install motor
-
- - -

This is the first official, production release of Motor.

-

That said, there will be bugs: please file them and I'll respond as quickly as I can.

-

Links:

- -

Motor's Future

-

Motor is now feature-complete and fully tested. I expect to put it on the back burner and concentrate on other projects.

-

Motor will keep up easily with PyMongo development, because I designed it to. I don't intend for it to lag more than a smidge. For example, PyMongo 2.5 will bring some new security and authentication features; in the following Motor release I'll support those, too.

-

I believe this is the coolest thing I've ever made. I hope you have fun with it. Tweet me and let me know what you build with it.

diff --git a/emptysquare/content/motor-progress-report-the-road-to-0-2.md b/emptysquare/content/motor-progress-report-the-road-to-0-2.md deleted file mode 100644 index 36c27265..00000000 --- a/emptysquare/content/motor-progress-report-the-road-to-0-2.md +++ /dev/null @@ -1,53 +0,0 @@ -+++ -type = "post" -title = "Motor Progress Report: The Road to 0.2" -date = "2013-12-23T15:47:26" -description = "Big changes are coming in the next release of my async MongoDB driver." -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "motor-musho.png" -draft = false -disqus_identifier = "52b89e9d53937479d528dfac" -disqus_url = "https://emptysqua.re/blog/52b89e9d53937479d528dfac/" -+++ - -

Motor

-

Update: Motor 0.2rc0 is out, its manual and changelog are on ReadTheDocs.

-
-

Motor, my non-blocking driver for MongoDB and Tornado, is approaching the next big release, version 0.2. The improvements fall into three buckets: ease of use, features, and server compatibility.

-

Ease Of Use

-

In Motor's current version, 0.1.2, you have to use an awkward style to do async operations in a coroutine:

-
@gen.coroutine
-def f():
-    document = yield motor.Op(collection.find_one, {'_id': 1})
-
- - -

In the next release, motor.Op will be deprecated and you'll call Motor functions directly, the same as in PyMongo. The yield keyword is the only difference that remains:

-
@gen.coroutine
-def f():
-    document = yield collection.find_one({'_id': 1})
-
- - -

The new syntax matches the latest style of other Tornado libraries, and it's the style used in Python's new asyncio library.

-

The other awkward thing in Motor is open_sync(). Since there's no way to do async I/O before starting Tornado's event loop, you have to do this:

-
client = MotorClient()
-client.open_sync()
-
-# ...additional application setup....
-
-IOLoop.current().start()
-
- - -

In the next release, open_sync will be unnecessary. In fact I'm removing it entirely. I've added features to PyMongo itself (in its next release, version 2.7) that Motor can use to connect to the server on demand, when you first attempt an async operation.
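
The setup then collapses to something like this sketch of the 0.2 style, where the client connects lazily the first time an operation runs inside a coroutine:

```python
from tornado import gen
from tornado.ioloop import IOLoop
import motor

client = motor.MotorClient()  # no open_sync()

@gen.coroutine
def do_find():
    # The client connects on demand when this first operation runs.
    document = yield client.test.collection.find_one({'_id': 1})
    print(document)

IOLoop.current().run_sync(do_find)
```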

-

Features

-

Motor 0.1.2 wraps PyMongo 2.5.0, which was released in March, so it lacks a number of features introduced in more recent PyMongos: exhaust cursors, streaming inserts, a more robust BSON decoder, several options for finer control of the connection pool, and more authentication mechanisms for enterprise environments. You can see all the features introduced since 2.5.0 in PyMongo's changelog. By wrapping PyMongo 2.7 instead of 2.5, the next Motor will get all these features, too.

-

Motor has implemented SSL encryption since the first release, but didn't support client or server certificate validation, much less X509 authentication. The next release will do it all; Motor will have the same comprehensive SSL support as PyMongo.

-

Server Compatibility

-

There's a lot of new features in the next release of the MongoDB server itself. MongoDB 2.6 will come out with aggregation cursors, bulk write operations, a new role-management system, operation time limits, and more. All of these features require changes to PyMongo. Since Motor 0.2 will wrap the latest PyMongo, Motor will also support the latest MongoDB features.

-

Current Status

-

By the time I go on vacation next week, Motor's code on master will be ready for the 0.2 release. But there will be a brief lull: we have to wait for the MongoDB 2.6 release candidate, and then we have to release PyMongo 2.7. Then Motor can correctly list PyMongo 2.7 in its requirements, and I'll put it on PyPI.

-

Meanwhile, please don't install Motor from GitHub. Use Motor 0.1.2 from PyPI, with PyMongo 2.5.0. The documentation for that version of Motor is the "stable" version on ReadTheDocs until the next Motor release. There's been some confusion among new Motor users about installing the correct versions of Motor and PyMongo. Stick to these recommendations for now, and I'll find ways to ease the installation troubles in the next release.

diff --git a/emptysquare/content/motor-progress-report.md b/emptysquare/content/motor-progress-report.md deleted file mode 100644 index e6eca6c6..00000000 --- a/emptysquare/content/motor-progress-report.md +++ /dev/null @@ -1,31 +0,0 @@ -+++ -type = "post" -title = "Motor Progress Report" -date = "2012-08-29T23:54:04" -description = "" -category = ["Motor", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "motor-musho.png" -draft = false -disqus_identifier = "503ee3dc5393744800000000" -disqus_url = "https://emptysqua.re/blog/503ee3dc5393744800000000/" -+++ - -

Motor

-

Motor, my async driver for MongoDB and Tornado, is now compatible with all the same Python versions as Tornado: CPython 2.5, 2.6, 2.7, and 3.2, and PyPy 1.9.

-

To get Motor working with Python 3 I had to make a backwards breaking change: MotorCursor.next is now next_object. So this:

-
cursor = db.collection.find()
-cursor.next(my_callback)
-
- - -

... must now be:

-
cursor = db.collection.find()
-cursor.next_object(my_callback)
-
- - -

I had to do this to neatly support Python 3, because 2to3 was unhelpfully transforming MotorCursor.next into __next__. But the change was worthwhile even without that problem: next_object is closer to nextObject in the Node.js MongoDB driver, whose API I'm trying to emulate. Besides, I wasn't using next the way Python intends, so I went ahead and renamed it. I'm sorry if this breaks your code. This is what the alpha phase is for.

-

The only remaining feature to implement is GridFS, which I'll do within the month. There's some more testing and documentation to do, and then we'll move from alpha to beta.

-

I know a few people are trying out Motor. I've received no bug reports so far, but some users have reported omissions in the docs which I've filled in. If you're using Motor, get in touch and let me know: jesse@10gen.com.

diff --git a/emptysquare/content/nginx-spellcasting.md b/emptysquare/content/nginx-spellcasting.md deleted file mode 100644 index 106eeede..00000000 --- a/emptysquare/content/nginx-spellcasting.md +++ /dev/null @@ -1,35 +0,0 @@ -+++ -type = "post" -title = "Nginx spellcasting" -date = "2011-11-20T22:45:35" -description = "Gandalf in Ralph Bakshi's animated version of The Lord of the Rings. I write the following lines for the sake of future generations, seeking lore about Nginx. Should this omen appear: nginx: [warn] 1024 worker_connections are more than [ ... ]" -category = ["Programming"] -tag = [] -enable_lightbox = false -thumbnail = "BakshiGandalf.jpg" -draft = false -disqus_identifier = "188 http://emptysquare.net/blog/?p=188" -disqus_url = "https://emptysqua.re/blog/188 http://emptysquare.net/blog/?p=188/" -+++ - -

-

Gandalf in Ralph Bakshi's animated version of The Lord of the Rings.

-

I write the following lines for the sake of future generations, seeking -lore about Nginx. Should this omen appear:

-
nginx: [warn] 1024 worker_connections are more than open file resource limit: 256
-
- - -

Recite the following incantation in a deep, resonant voice:

-
sudo bash; ulimit -n 65536
-
- - -

Then start Nginx again in the shell in which you called ulimit.

-

Another spell needful to the young wizard is this, which rids you of all -daemonic Nginxes:

-
ps aux|grep nginx\:\ master\ process|grep -v grep|awk '{ print $2; }'|sudo xargs kill
-
- - -

Use it wisely.

diff --git a/emptysquare/content/pausing-with-tornado.md b/emptysquare/content/pausing-with-tornado.md deleted file mode 100644 index 58f55d6a..00000000 --- a/emptysquare/content/pausing-with-tornado.md +++ /dev/null @@ -1,34 +0,0 @@ -+++ -type = "post" -title = "Pausing with Tornado" -date = "2012-04-20T21:26:41" -description = "Throwing this in my blog so I don't forget again. The way to sleep for a certain period of time using tornado.gen is: import tornado.web from tornado.ioloop import IOLoop from tornado import gen class [ ... ]" -category = ["Programming", "Python"] -tag = ["tornado"] -enable_lightbox = false -draft = false -disqus_identifier = "430 http://emptysquare.net/blog/?p=430" -disqus_url = "https://emptysqua.re/blog/430 http://emptysquare.net/blog/?p=430/" -+++ - -

Throwing this in my blog so I don't forget again. The way to sleep for a -certain period of time using tornado.gen is:

-
import time
-import tornado.web
-from tornado.ioloop import IOLoop
-from tornado import gen
-
-class MyHandler(tornado.web.RequestHandler):
-    @tornado.web.asynchronous
-    @gen.engine
-    def get(self):
-        self.write("sleeping .... ")
-        # Do nothing for 5 sec
-        loop = IOLoop.instance()
-        yield gen.Task(loop.add_timeout, time.time() + 5)
-        self.write("I'm awake!")
-        self.finish()
-
- - -

Simple once you see it, but for some reason this has been the hardest -for me to get used to.

diff --git a/emptysquare/content/pymongo-2-4-2-is-out.md b/emptysquare/content/pymongo-2-4-2-is-out.md deleted file mode 100644 index 02d12f3c..00000000 --- a/emptysquare/content/pymongo-2-4-2-is-out.md +++ /dev/null @@ -1,30 +0,0 @@ -+++ -type = "post" -title = "PyMongo 2.4.2 Is Out" -date = "2013-01-24T09:50:46" -description = "Changes in PyMongo, the MongoDB Python driver" -category = ["MongoDB", "Programming", "Python"] -tag = ["pymongo"] -enable_lightbox = false -draft = false -disqus_identifier = "5101497e5393747ddd768988" -disqus_url = "https://emptysqua.re/blog/5101497e5393747ddd768988/" -+++ - -

Yesterday we released PyMongo 2.4.2, the latest version of 10gen's Python driver for MongoDB. You can see the whole list of nine bugs fixed. Here are some highlights:

- -

(Down here we have to speak very quietly, because the next part is top-secret: I snuck a feature into what's supposed to be a bugfix release. PyMongo 2.4.2 has the hooks Motor needs to wrap PyMongo and make it non-blocking. This lets Motor take a new direction, which I'll blog about shortly.)

diff --git a/emptysquare/content/pymongo-use-greenlets-followup.md b/emptysquare/content/pymongo-use-greenlets-followup.md deleted file mode 100644 index a10d18c2..00000000 --- a/emptysquare/content/pymongo-use-greenlets-followup.md +++ /dev/null @@ -1,58 +0,0 @@ -+++ -type = "post" -title = "PyMongo's \"use_greenlets\" Followup" -date = "2015-03-15T22:29:53" -description = "I wrote in December that we were removing a quirky feature from PyMongo. Here's how my conversation went with a critic." -category = ["MongoDB", "Programming", "Python"] -tag = ["gevent", "pymongo"] -enable_lightbox = false -thumbnail = "fern.jpg" -draft = false -disqus_identifier = "550634af539374097d8896b1" -disqus_url = "https://emptysqua.re/blog/550634af539374097d8896b1/" -+++ - -

Fern - (cc) Wingchi Poon

-

In December, I wrote that we are removing the idiosyncratic use_greenlets option from PyMongo when we release PyMongo 3.

-

In PyMongo 2 you have two options for using Gevent. First, you can do:

-
from gevent import monkey; monkey.patch_all()
-from pymongo import MongoClient
-
-client = MongoClient()
-
- - -

Or:

-
from gevent import monkey; monkey.patch_socket()
-from pymongo import MongoClient
-
-client = MongoClient(use_greenlets=True)
-
- - -

In the latter case, I wrote, "you could use PyMongo after calling Gevent's patch_socket without having to call patch_thread. But who would do that? What conceivable use case had I enabled?" So I removed use_greenlets in PyMongo 3; the first example code continues to work but the second will not.

-

In the comments, PyMongo user Peter Hansen replied,

-
-

I hope you're not saying that the only way this will work is if one uses monkey.patch_all, because, although this is a very common way to use Gevent, it's absolutely not the only way. (If it were, it would just be done automatically!) We have a large Gevent application here which cannot do that, because threads must be allowed to continue working as regular threads, but we monkey patch only what we need which happens to be everything else (with monkey.patch_all(thread=False)).

-
-

So Peter, Bernie, and I met online and he told us about his very interesting application. It needs to interface with some C code that talks an obscure network protocol; to get the best of both worlds his Python code uses asynchronous Gevent in the main thread, and it avoids blocking the event loop by launching Python threads to talk with the C extension. Peter had, in fact, perfectly understood PyMongo 2's design and was using it as intended. It was I who hadn't understood the feature's use case before I diked it out.

-

So what now? I would be sad to lose the great simplifications I achieved in PyMongo by removing its Gevent-specific code. Besides, occasional complaints from Eventlet and other communities motivated us to support all frameworks equally.

-

Luckily, Gevent 1.0 provides a workaround for the loss of use_greenlets in PyMongo. Beginning the same as the first example above:

-
from gevent import monkey; monkey.patch_all()
-from pymongo import MongoClient
-
-client = MongoClient()
-
-
-def my_function():
-    # Call some C code that drops the GIL and does
-    # blocking I/O from C directly.
-    pass
-
-start_new_thread = monkey.saved['thread']['start_new_thread']
-real_thread = start_new_thread(my_function, ())
-
- - -

I checked with Gevent's author Denis Bilenko whether monkey.saved was a stable API and he confirmed it is. If you use Gevent and PyMongo as Peter does, port your code to this technique when you upgrade to PyMongo 3.

-

Image: Wingchi Poon, CC BY-SA 3.0

diff --git a/emptysquare/content/pymongos-new-default-safe-writes.md b/emptysquare/content/pymongos-new-default-safe-writes.md deleted file mode 100644 index 9aa2cb67..00000000 --- a/emptysquare/content/pymongos-new-default-safe-writes.md +++ /dev/null @@ -1,119 +0,0 @@ -+++ -type = "post" -title = "PyMongo's New Default: Safe Writes!" -date = "2012-11-27T09:54:53" -description = "I joyfully announce that we are changing all of 10gen's MongoDB drivers to do \"safe writes\" by default. In the process we're renaming all the connection classes to MongoClient, so all the drivers now use the same term for the central class. [ ... ]" -category = ["MongoDB", "Programming", "Python"] -tag = ["pymongo"] -enable_lightbox = false -thumbnail = "get-last-error.png" -draft = false -disqus_identifier = "50b4d3f75393744a41fe2c70" -disqus_url = "https://emptysqua.re/blog/50b4d3f75393744a41fe2c70/" -+++ - -

I joyfully announce that we are changing all of 10gen's MongoDB drivers to do "safe writes" by default. In the process we're renaming all the connection classes to MongoClient, so all the drivers now use the same term for the central class.

-

PyMongo 2.4, released today, has new classes called MongoClient and MongoReplicaSetClient that have the new default setting, and a new API for configuring write-acknowledgement called "write concerns". PyMongo's old Connection and ReplicaSetConnection classes remain untouched for backward compatibility, but they are now considered deprecated and will disappear in some future release. The changes were implemented by PyMongo's maintainer (and my favorite colleague) Bernie Hackett.

-
-

Contents:

- -

Background

-

MongoDB's writes happen in two phases. First the driver sends the server an insert, update, or remove message. The MongoDB server executes the operation and notes the outcome: it records whether there was an error, how many documents were updated or removed, and whether an upsert resulted in an update or an insert.

-

In the next phase, the driver runs the getLastError command on the server and awaits the response:

-

getLastError

-

This getLastError call can be omitted for speed, in which case the driver just sends all its write messages without awaiting acknowledgment. "Fire-and-forget" mode is obviously very high-performance, because it can take advantage of network throughput without being affected by network latency. But this mode doesn't report errors to your application, and it doesn't guarantee that a write has completed before you do a query. It's not the right mode to use by default, so we're changing it now.
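You can watch the two phases yourself by sending getLastError explicitly. Here's a sketch with the old Connection class, which doesn't send the command for you; the database and collection names are arbitrary:

from pymongo import Connection

db = Connection().test
db.collection.insert({'x': 1})          # phase one: send the write, don't wait
error_doc = db.command('getlasterror')  # phase two: ask the server what happened
print error_doc.get('err')              # None means the write succeeded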

-

In the past we haven't been particularly consistent in our terms for these modes, sometimes talking about "safe" and "unsafe" writes, at other times "blocking" and "non-blocking", etc. From now on we're trying to stick to "acknowledged" and "unacknowledged," since that goes to the heart of the difference. I'll stick to these terms here.

-

(In 10gen's ancient history, before my time, the plan was to make a full platform-as-a-service stack with MongoDB as the data layer. It made sense then for getLastError to be a separate operation that was run explicitly, and to not call getLastError automatically by default. But MongoDB is a standalone product and it's clear that the default needs to change.)

-

The New Defaults

-

In earlier versions of PyMongo you would create a connection like this:

-
from pymongo import Connection
-connection = Connection('localhost', 27017)
-
- - -

By default, Connection did unacknowledged writes—it didn't call getLastError at all. You could change that with the safe option like:

-
connection = Connection('localhost', 27017, safe=True)
-
- - -

You could also configure arguments that were passed to every getLastError call that made it wait for specific events, e.g. to wait for the primary and two secondaries to replicate the write, you could pass w=3, and to wait for the primary to commit the write to its journal, you could pass j=True:

-
connection = Connection('localhost', 27017, w=3, j=True)
-
- - -

(The "w" terminology comes from the Dynamo whitepaper that's foundational to the NoSQL movement.)

-

Connection hasn't changed in PyMongo 2.4, but we've added a MongoClient which does acknowledged writes by default:

-
from pymongo import MongoClient
-client = MongoClient('localhost', 27017)
-
- - -

MongoClient lets you pass arguments to getLastError just like Connection did:

-
from pymongo import MongoClient
-client = MongoClient('localhost', 27017, w=3, j=True)
-
- - -

Instead of an odd overlap between the safe and w options, we've now standardized on using w only. So you can get the old behavior of unacknowledged writes with the new classes using w=0:

-
client = MongoClient('localhost', 27017, w=0)
-
- - -

w=0 is the new way to say safe=False.

-

w=1 is the new safe=True and it's now the default. Other options like j=True or w=3 work the same as before. You can still set options per-operation:

-
client.db.collection.insert({'foo': 'bar'}, w=1)
-
- - -

ReplicaSetConnection is also obsolete, of course, and succeeded by MongoReplicaSetClient.

-

Write Concerns

-

The old Connection class let you set the safe attribute to True or False, or call set_lasterror_options() for more complex configuration. These are deprecated, and you should now use the MongoClient.write_concern attribute. write_concern is a dict whose keys may include w, wtimeout, j, and fsync:

-
>>> client = MongoClient()
->>> # default empty dict means "w=1"
->>> client.write_concern
-{}
->>> client.write_concern = {'w': 2, 'wtimeout': 1000}
->>> client.write_concern
-{'wtimeout': 1000, 'w': 2}
->>> client.write_concern['j'] = True
->>> client.write_concern
-{'wtimeout': 1000, 'j': True, 'w': 2}
->>> client.write_concern['w'] = 0 # disable write acknowledgement
-
- - -

You can see that the default write_concern is an empty dictionary. It's equivalent to w=1, meaning "do regular acknowledged writes".

-

auto_start_request

-

This is very nerdy, but my personal favorite. The default value for auto_start_request is changing from True to False.

-

The short explanation is this: with the old Connection, you could write some data to the server without acknowledgment, and then read that data back immediately afterward, provided there wasn't an error and that you used the same socket for the write and the read. If you used a different socket for the two operations then there was no guarantee of "read your writes consistency," because the write could still be enqueued on one socket while you completed the read on the other.

-

You could pin the current thread to a single socket with Connection.start_request(), and in fact the default was for Connection to start a request for you with every operation. That's auto_start_request. It offers some consistency guarantees but requires the driver to open extra sockets.

-

Now that MongoClient waits for acknowledgment of every write, auto_start_request is no longer needed. If you do this:

-
>>> collection = MongoClient().db.collection
->>> collection.insert({'foo': 'bar'})
->>> print collection.find_one({'foo': 'bar'})
-
- - -

... then the find_one won't run until the insert is acknowledged, which means your document has definitely been inserted and you can query for it confidently on any socket. We turned off auto_start_request for improved performance and fewer sockets. If you're doing unacknowledged writes with w=0 followed by reads, you should consider whether to call MongoClient.start_request(). See the details (with charts!) in my blog post on requests from April.

-

Migration

-

Connection and ReplicaSetConnection will remain for a while (not forever), so your existing code will work the same and you have time to migrate. We are working to update all documentation and example code to use the new classes. In time we'll add deprecation warnings to the old classes and methods before removing them completely.

-

If you maintain a library built on PyMongo, you can check for the new classes with code like:

-
try:
-    from pymongo import MongoClient
-    has_mongo_client = True
-except ImportError:
-    has_mongo_client = False
-
- - -

What About Motor?

-

Motor's in beta, so I'll break backwards compatibility ruthlessly for the sake of cleanliness. In the next week or two I'll merge the official PyMongo changes into my fork, and I'll nuke MotorConnection and MotorReplicaSetConnection, to be replaced with MotorClient and MotorReplicaSetClient.

-

The Uplifting Conclusion

-

We've known for a while that unacknowledged writes were the wrong default. Now it's finally time to fix it. The new MongoClient class lets you migrate from the old default to the new one at your leisure, and brings a bonus: all the drivers agree on the name of the main entry-point. For programmers new to MongoDB, turning on write-acknowledgment by default is a huge win, and makes it much more intuitive to write applications on MongoDB.

diff --git a/emptysquare/content/read-your-writes-consistency-pymongo.md b/emptysquare/content/read-your-writes-consistency-pymongo.md deleted file mode 100644 index 28ffcf63..00000000 --- a/emptysquare/content/read-your-writes-consistency-pymongo.md +++ /dev/null @@ -1,62 +0,0 @@ -+++ -type = "post" -title = "Read-Your-Writes Consistency With PyMongo" -date = "2013-11-18T16:23:03" -description = "What's the best way to get read-your-writes consistency in PyMongo?" -category = ["MongoDB", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "quill.jpg" -draft = false -disqus_identifier = "528a797653937479d528989c" -disqus_url = "https://emptysqua.re/blog/528a797653937479d528989c/" -+++ - -

Quill -Photo: Thomas van de Vosse

-

A PyMongo user asked me a good question today: if you want read-your-writes consistency, is it better to do acknowledged writes with a connection pool (the default), or to do unacknowledged writes over a single socket?

-

A Little Background

-

Let's say you update a MongoDB document with PyMongo, and you want to immediately read the updated version:

-
client = pymongo.MongoClient()
-collection = client.my_database.my_collection
-collection.update(
-    {'_id': 1},
-    {'$inc': {'n': 1}})
-
-print collection.find_one({'_id': 1})
-
- - -

In a multithreaded application, PyMongo's connection pool may have multiple sockets in it, so we don't promise that you'll use the same socket for the update and for the find_one. Yet you're still guaranteed read-your-writes consistency: the change you wrote to the document is reflected in the version of the document you subsequently read with find_one. PyMongo accomplishes this consistency by waiting for MongoDB to acknowledge the update operation before it sends the find_one query. (I explained last year how acknowledgment works in PyMongo.)

-

There's another way to get read-your-writes consistency: you can send both the update and the find_one over the same socket, to ensure MongoDB processes them in order. In this case, you can tell PyMongo not to request acknowledgment for the update with the w=0 option:

-
# Reserve one socket for this thread.
-with client.start_request():
-    collection.update(
-        {'_id': 1},
-        {'$inc': {'n': 1}},
-        w=0)
-
-    print collection.find_one({'_id': 1})
-
- - -

If you set PyMongo's auto_start_request option it will call start_request for you. In that case you'd better let the connection pool grow to match the number of threads by removing its max_pool_size:

-
client = pymongo.MongoClient(
-    auto_start_request=True,
-    max_pool_size=None)
-
- - -

(See my article on requests for details.)

-

So, to answer the user's question: If there are two ways to get read-your-writes consistency, which should you use?

-

The Answer

-

You should accept PyMongo's default settings: use acknowledged writes. Here's why:

-

Number of sockets: A multithreaded Python program that uses w=0 and auto_start_request needs more connections to the server than does a program that uses acknowledged writes instead. With auto_start_request we have to reserve a socket for every application thread, whereas without it, threads can share a pool of connections smaller than the total number of threads.

-

Back pressure: If the server becomes very heavily loaded, a program that uses w=0 won't know the server is loaded because it doesn't wait for acknowledgments. In contrast, the server can exert back pressure on a program using acknowledged writes: the program can't continue to write to the server until the server has completed and acknowledged the writes currently in progress.

-

Error reporting: If you use w=0, your application won't know whether the writes failed due to some error on the server. For example, an insert might cause a duplicate-key violation. Or you might try to increment a field in a document, but the server rejects the operation because the field isn't a number. By default PyMongo raises an exception under these circumstances so your program doesn't continue blithely on, but if you use w=0 such errors pass silently.
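Here's a sketch of the difference, assuming an empty collection with a unique index on username; the collection and field names are just for illustration:

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

collection = MongoClient().test.users
collection.ensure_index('username', unique=True)
collection.insert({'username': 'mabel'})

try:
    collection.insert({'username': 'mabel'})   # acknowledged: raises
except DuplicateKeyError:
    print 'duplicate username'

collection.insert({'username': 'mabel'}, w=0)  # unacknowledged: fails silently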

-

Consistency: Acknowledged writes guarantee read-your-writes consistency, whether you're connected to a mongod or to a mongos in a sharded cluster.

-

Using w=0 with auto_start_request also guarantees read-your-writes consistency, but only if you're connected to a mongod. If you're connected to a mongos, using w=0 with auto_start_request does not guarantee any consistency, because some writes may be queued in the writeback listener and complete asynchronously. Waiting for acknowledgment ensures that all writes have really been completed in the cluster before your program proceeds.

-

Forwards compatibility with MongoDB: The next version of the MongoDB server will offer a new implementation for insert, update, and delete, which will diminish the performance boost of w=0.

-

Forwards compatibility with PyMongo: You can tell by now that we're not big fans of auto_start_request. We're likely to remove it from PyMongo in version 3.0, so you're better off not relying on it.

-

Conclusion

-

In short, you should just accept PyMongo's default settings: acknowledged writes with auto_start_request=False. There are many disadvantages and almost no advantages to w=0 with auto_start_request, and in the near future these options will be diminished or removed anyway.

diff --git a/emptysquare/content/real-time-profiling-a-mongodb-cluster.md b/emptysquare/content/real-time-profiling-a-mongodb-cluster.md deleted file mode 100644 index f09745b2..00000000 --- a/emptysquare/content/real-time-profiling-a-mongodb-cluster.md +++ /dev/null @@ -1,128 +0,0 @@ -+++ -type = "post" -title = "Real-time Profiling a MongoDB Sharded Cluster" -date = "2013-06-25T11:29:02" -description = "Let's experiment with queries and commands in a sharded cluster. We'll learn how shard keys and read preferences determine where your operations are run." -category = ["MongoDB", "Programming"] -tag = [] -enable_lightbox = false -thumbnail = "blue-shards.jpg" -draft = false -disqus_identifier = "51bf5c6e5393747680ca1ba1" -disqus_url = "https://emptysqua.re/blog/51bf5c6e5393747680ca1ba1/" -+++ - -

Blue shards -[Source]

-

In a sharded cluster of replica sets, which server or servers handle each of your queries? What about each insert, update, or command? If you know how a MongoDB cluster routes operations among its servers, you can predict how your application will scale as you add shards and add members to shards.

-

Operations are routed according to the type of operation, your shard key, and your read preference. Let's set up a cluster and use the system profiler to see where each operation is run. This is an interactive, experimental way to learn how your cluster really behaves and how your architecture will scale.

-
-

Setup

-

You'll need a recent install of MongoDB (I'm using 2.4.4), Python, a recent version of PyMongo (at least 2.4—I'm using 2.5.2) and the code in my cluster-profile repository on GitHub. If you install the Colorama Python package you'll get cute colored output. These scripts were tested on my Mac.

-

Sharded cluster of replica sets

-

Run the cluster_setup.py script in my repository. It sets up a standard sharded cluster for you running on your local machine. There's a mongos, three config servers, and two shards, each of which is a three-member replica set. The first shard's replica set is running on ports 4000 through 4002, the second shard is on ports 5000 through 5002, and the three config servers are on ports 6000 through 6002:

-

The setup

-

For the finale, cluster_setup.py makes a collection named sharded_collection, sharded on a key named shard_key.

-

In a normal deployment, we'd let MongoDB's balancer automatically distribute chunks of data among our two shards. But for this demo we want documents to be on predictable shards, so my script disables the balancer. It makes a chunk for all documents with shard_key less than 500 and another chunk for documents with shard_key greater than or equal to 500. It moves the high chunk to replset_1:

-
client = MongoClient()  # Connect to mongos.
-admin = client.admin  # admin database.
-
-# Pre-split.
-admin.command(
-    'split', 'test.sharded_collection',
-    middle={'shard_key': 500})
-
-admin.command(
-    'moveChunk', 'test.sharded_collection',
-    find={'shard_key': 500},
-    to='replset_1')
-
- - -

If you connect to mongos with the MongoDB shell, sh.status() shows there's one chunk on each of the two shards:

-
{ "shard_key" : { "$minKey" : 1 } } -->> { "shard_key" : 500 } on : replset_0 { "t" : 2, "i" : 1 }
-{ "shard_key" : 500 } -->> { "shard_key" : { "$maxKey" : 1 } } on : replset_1 { "t" : 2, "i" : 0 }
-
- - -

The setup script also inserts a document with a shard_key of 0 and another with a shard_key of 500. Now we're ready for some profiling.

-

Profiling

-

Run the tail_profile.py script from my repository. It connects to all the replica set members. On each, it sets the profiling level to 2 ("log everything") on the test database, and creates a tailable cursor on the system.profile collection. The script filters out some noise in the profile collection—for example, the activities of the tailable cursor show up in the system.profile collection that it's tailing. Any legitimate entries in the profile are spat out to the console in pretty colors.
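In outline, each watcher does something like this simplified sketch (the real script also reconnects, filters noise, and colors its output); port 4000 is the first shard's primary from the setup above:

from pymongo import MongoClient

member = MongoClient('localhost', 4000).test
member.set_profiling_level(2)  # log every operation on this member

# system.profile is a capped collection, so it supports tailable cursors.
for entry in member.system.profile.find(tailable=True):
    print entry['op'], entry['ns']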

-

Experiments

-

Targeted queries versus scatter-gather

-

Let's run a query from Python in a separate terminal:

-
>>> from pymongo import MongoClient
->>> # Connect to mongos.
->>> collection = MongoClient().test.sharded_collection
->>> collection.find_one({'shard_key': 0})
-{'_id': ObjectId('51bb6f1cca1ce958c89b348a'), 'shard_key': 0}
-
- - -

tail_profile.py prints:

-

replset_0 primary on 4000: query test.sharded_collection {"shard_key": 0}

-

The query includes the shard key, so mongos reads from the shard that can satisfy it. Adding shards can scale out your throughput on a query like this. What about a query that doesn't contain the shard key?:

-
>>> collection.find_one({})
-
- - -

mongos sends the query to both shards:

-

replset_0 primary on 4000: query test.sharded_collection {"shard_key": 0}
-replset_1 primary on 5000: query test.sharded_collection {"shard_key": 500}

-

For fan-out queries like this, adding more shards won't scale out your query throughput as well as it would for targeted queries, because every shard has to process every query. But we can scale throughput on queries like these by reading from secondaries.

-

Queries with read preferences

-

We can use read preferences to read from secondaries:

-
>>> from pymongo.read_preferences import ReadPreference
->>> collection.find_one({}, read_preference=ReadPreference.SECONDARY)
-
- - -

tail_profile.py shows us that mongos chose a random secondary from each shard:

-

replset_0 secondary on 4001: query test.sharded_collection {"$readPreference": {"mode": "secondary"}, "$query": {}}
-replset_1 secondary on 5001: query test.sharded_collection {"$readPreference": {"mode": "secondary"}, "$query": {}}

-

Note how PyMongo passes the read preference to mongos in the query, as the $readPreference field. mongos targets one secondary in each of the two replica sets.

-

Updates

-

With a sharded collection, updates must either include the shard key or be "multi-updates". An update with the shard key goes to the proper shard, of course:

-
>>> collection.update({'shard_key': -100}, {'$set': {'field': 'value'}})
-
- - -

replset_0 primary on 4000: update test.sharded_collection {"shard_key": -100}

-

mongos only sends the update to replset_0, because we put the chunk of documents with shard_key less than 500 there.

-

A multi-update hits all shards:

-
>>> collection.update({}, {'$set': {'field': 'value'}}, multi=True)
-
- - -

replset_0 primary on 4000: update test.sharded_collection {}
-replset_1 primary on 5000: update test.sharded_collection {}

-

A multi-update on a range of the shard key need only involve the proper shard:

-
>>> collection.update({'shard_key': {'$gt': 1000}}, {'$set': {'field': 'value'}}, multi=True)
-
- - -

replset_1 primary on 5000: update test.sharded_collection {"shard_key": {"$gt": 1000}}

-

So targeted updates that include the shard key can be scaled out by adding shards. Even multi-updates can be scaled out if they include a range of the shard key, but multi-updates without the shard key won't benefit from extra shards.

-

Commands

-

In version 2.4, mongos can use secondaries not only for queries, but also for some commands. You can run count on secondaries if you pass the right read preference:

-
>>> cursor = collection.find(read_preference=ReadPreference.SECONDARY)
->>> cursor.count()
-
- - -

replset_0 secondary on 4001: command count: sharded_collection
-replset_1 secondary on 5001: command count: sharded_collection

-

Whereas findAndModify, since it modifies data, is run on the primaries no matter your read preference:

-
>>> db = MongoClient().test
->>> db.command(
-...     'findAndModify',
-...     'sharded_collection',
-...     query={'shard_key': -1},
-...     remove=True,
-...     read_preference=ReadPreference.SECONDARY)
-
- - -

replset_0 primary on 4000: command findAndModify: sharded_collection

-

Go Forth And Scale

-

To scale a sharded cluster, you should understand how operations are distributed: are they scatter-gather, or targeted to one shard? Do they run on primaries or secondaries? If you set up a cluster and test your queries interactively like we did here, you can see how your cluster behaves in practice, and design your application for future growth.

diff --git a/emptysquare/content/refactoring-tornado-code-with-gen-engine.md b/emptysquare/content/refactoring-tornado-code-with-gen-engine.md deleted file mode 100644 index 92b4c00b..00000000 --- a/emptysquare/content/refactoring-tornado-code-with-gen-engine.md +++ /dev/null @@ -1,227 +0,0 @@ -+++ -type = "post" -title = "Refactoring Tornado Code With gen.engine" -date = "2012-07-11T02:37:35" -description = "" -category = ["MongoDB", "Motor", "Programming", "Python"] -tag = ["tornado"] -enable_lightbox = false -draft = false -disqus_identifier = "4ffd1f2f5393742d5b000001" -disqus_url = "https://emptysqua.re/blog/4ffd1f2f5393742d5b000001/" -+++ - -

Sometimes writing callback-style asynchronous code with Tornado is a pain. But the real hurt comes when you want to refactor your async code into reusable subroutines. Tornado's gen module makes refactoring easy, but you need to learn a few tricks first.

-

For Example

-

I'll use this blog to illustrate. I built it with Motor-Blog, a trivial blog platform on top of Motor, my new driver for Tornado and MongoDB.

-

When you came here, Motor-Blog did three or four MongoDB queries to render this page.

-

1: Find the blog post at this URL and show you this content.

-

2 and 3: Find the next and previous posts to render the navigation links at the bottom.

-

Maybe 4: If the list of categories on the left has changed since it was last cached, fetch the list.

-

Let's go through each query and see how the tornado.gen module makes life easier.

-

Fetching One Post

-

In Tornado, fetching one post takes a little more work than with blocking-style code:

-
db = motor.MotorConnection().open_sync().my_blog_db
-
-class PostHandler(tornado.web.RequestHandler):
-    @tornado.web.asynchronous
-    def get(self, slug):
-        db.posts.find_one({'slug': slug}, callback=self._found_post)
-
-    def _found_post(self, post, error):
-        if error:
-            raise tornado.web.HTTPError(500, str(error))
-        elif not post:
-            raise tornado.web.HTTPError(404)
-        else:
-            self.render('post.html', post=post)
-
- - -

Not so bad. But is it better with gen?

-
class PostHandler(tornado.web.RequestHandler):
-    @tornado.web.asynchronous
-    @gen.engine
-    def get(self, slug):
-        post, error = yield gen.Task(
-            db.posts.find_one, {'slug': slug})
-
-        if error:
-            raise tornado.web.HTTPError(500, str(error))
-        elif not post:
-            raise tornado.web.HTTPError(404)
-        else:
-            self.render('post.html', post=post)
-
- - -

A little better. The yield statement makes this function a generator. -gen.engine is a brilliant hack which runs the generator until it's complete. -Each time the generator yields a Task, gen.engine schedules the generator -to be resumed when the task is complete. Read the -source -code of the Runner class for details, it's exhilarating. Or just -enjoy the glow of putting all your logic in a single function again, without -defining any callbacks.

-

Motor includes a subclass of gen.Task called motor.Op. It handles checking and raising the exception for you, so the above can be simplified further:

-
@tornado.web.asynchronous
-@gen.engine
-def get(self, slug):
-    post = yield motor.Op(
-        db.posts.find_one, {'slug': slug})  
-    if not post:
-        raise tornado.web.HTTPError(404)
-    else:
-        self.render('post.html', post=post)
-
- - -

Still, no huge gains. gen starts to shine when you need to parallelize some tasks.

-

Fetching Next And Previous

-

Once Motor-Blog finds the current post, it gets the next and previous posts. Since the two -queries are independent we can save a few milliseconds by doing them in parallel. -How does this look with callbacks?

-
@tornado.web.asynchronous
-def get(self, slug):
-    db.posts.find_one({'slug': slug}, callback=self._found_post)
-
-def _found_post(self, post, error):
-    if error:
-        raise tornado.web.HTTPError(500, str(error))
-    elif not post:
-        raise tornado.web.HTTPError(404)
-    else:
-        _id = post['_id']
-        self.post = post
-
-        # Two queries in parallel
-        db.posts.find_one({'_id': {'$lt': _id}},
-            callback=self._found_prev)
-        db.posts.find_one({'_id': {'$gt': _id}},
-            callback=self._found_next)
-
-def _found_prev(self, prev, error):
-    if error:
-        raise tornado.web.HTTPError(500, str(error))
-    else:
-        self.prev = prev
-        if hasattr(self, 'next'):
-            # Done
-            self._render()
-
-def _found_next(self, next, error):
-    if error:
-        raise tornado.web.HTTPError(500, str(error))
-    else:
-        self.next = next
-        if hasattr(self, 'prev'):
-            # Done
-            self._render()
-
-def _render(self):
-    self.render('post.html',
-        post=self.post, prev=self.prev, next=self.next)
-
- - -

This is completely disgusting and it makes me want to give up on Tornado. -All that boilerplate can't be factored out. Will gen help?

-
@tornado.web.asynchronous
-@gen.engine
-def get(self, slug):
-    post = yield motor.Op(
-        db.posts.find_one, {'slug': slug})
-    if not post:
-        raise tornado.web.HTTPError(404)
-    else:
-        _id = post['_id']
-        prev, next = yield [
-            motor.Op(db.posts.find_one, {'_id': {'$lt': _id}}),
-            motor.Op(db.posts.find_one, {'_id': {'$gt': _id}})]
-
-        self.render('post.html', post=post, prev=prev, next=next)
-
- - -

Now our single get function is just as nice as it would be with blocking code. -In fact, the parallel fetch is far easier than if you were multithreading instead of using Tornado. -But what about factoring out a common subroutine that request handlers can share?

-

Fetching Categories

-

Every page on my blog needs to show the category list on the left side. Each request handler could just include -this in its get method:

-
categories = yield motor.Op(
-    db.categories.find().sort('name').to_list)
-
- - -

But that's terrible engineering. Here's how to factor it into a subroutine with gen:

-
@gen.engine
-def get_categories(db, callback):
-    try:
-        categories = yield motor.Op(
-            db.categories.find().sort('name').to_list)
-    except Exception, e:
-        callback(None, e)
-        return
-
-    callback(categories, None)
-
- - -

This function does not have to be part of a request handler—it stands on its own at the module scope. -To call it from a request handler, do:

-
class PostHandler(tornado.web.RequestHandler):
-    @tornado.web.asynchronous
-    @gen.engine
-    def get(self, slug):
-        categories = yield motor.Op(get_categories, db)
-        # ... get the current, previous, and next posts as usual, then ...
-        self.render('post.html',
-            post=post, prev=prev, next=next, categories=categories)
-
- - -

gen.engine runs get until it yields get_categories, then a -separate engine runs get_categories until it calls the callback, which -resumes get. It's almost like a regular function call!

-

This is particularly nice because I want to cache the categories between page -views. get_categories can be updated very simply to use a cache:

-
categories = None
-@gen.engine
-def get_categories(db, callback):
-    global categories
-    if not categories:
-        try:
-            categories = yield motor.Op(
-                db.categories.find().sort('name').to_list)
-        except Exception, e:
-            callback(None, e)
-            return
-
-    callback(categories, None)
-
- - -

(Note for nerds: I invalidate the cache whenever a post with a never-before-seen -category is added. The "new category" signal is saved to a -capped collection -in MongoDB, which all the Tornado servers are always tailing. That'll be the -subject of a future post.)

-

Conclusion

-

The gen module's excellent documentation -shows briefly how a method that makes a few async calls can be -simplified using gen.engine, but the power really comes when you need to -factor out a common subroutine. It's not obvious how to do that at first, but -there are only three steps:

-

1. Decorate the subroutine with @gen.engine.

-

2. Make the subroutine take a callback argument (it must be called callback), -to which the subroutine will pass its results when finished.

-

3. Call the subroutine within an engine-decorated function like:

-
result = yield gen.Task(subroutine)
-
- - -

result contains the value or values that subroutine passed to the callback.

-

If you follow Motor's convention where every callback takes arguments -(result, error), then you can use motor.Op to deal with the exception:

-
result = yield motor.Op(subroutine)
-
diff --git a/emptysquare/content/requests-in-python-and-mongodb.md b/emptysquare/content/requests-in-python-and-mongodb.md deleted file mode 100644 index 823ddcb9..00000000 --- a/emptysquare/content/requests-in-python-and-mongodb.md +++ /dev/null @@ -1,223 +0,0 @@ -+++ -type = "post" -title = "Requests in Python and MongoDB" -date = "2012-04-26T15:36:12" -description = "PyMongo 2.2's connection pooling." -category = ["MongoDB", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "pymongo-2-1.png" -draft = false -disqus_identifier = "472 http://emptysquare.net/blog/?p=472" -disqus_url = "https://emptysqua.re/blog/472 http://emptysquare.net/blog/?p=472/" -+++ - -

If you use PyMongo, -10gen's official MongoDB driver for Python, I want to ensure you -understand how it manages sockets and threads, and I want to brag about -performance improvements in PyMongo 2.2, which we plan to release next -week.

-

The Problem: Threads and Sockets

-

Each PyMongo Connection object includes a connection pool (a pool of -sockets) to minimize the cost of reconnecting. If you do two operations -(e.g., two find()s) on a Connection, it creates a socket for the first -find(), then reuses that socket for the second. (Update: Starting -with PyMongo 2.4 you should use MongoClient instead of Connection.)

-

When sockets are returned to the pool, the pool checks if it has more -than max_pool_size spare sockets, and if so, it closes the extra -sockets. By default max_pool_size is 10. (Update: in PyMongo 2.6, max_pool_size is now 100, -and its meaning has changed since I wrote this article.)
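The cap is just another Connection option, for example:

import pymongo

# Keep at most 10 idle sockets; extras are closed when returned to the pool.
connection = pymongo.Connection('localhost', 27017, max_pool_size=10)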

-

What if multiple Python threads share a Connection? A possible -implementation would be for each thread to get a random socket from the -pool when needed, and return it when done. But consider the following -code. It updates a count of visitors to a web page, then displays the -number of visitors on that web page including this visit:

-
connection = pymongo.Connection()
-counts = connection.my_database.counts
-counts.update(
-    {'_id': this_page_url()},
-    {'$inc': {'n': 1}},
-    upsert=True)
-
-n = counts.find_one({'_id': this_page_url()})['n']
-
-print 'You are visitor number %s' % n
-
- - -

Since PyMongo defaults to unsafe writes—that is, it does not ask the -server to acknowledge its inserts and updates—it will send the update -message to the server and then instantly send the find_one, then await -the result. (Update: if you use MongoClient, safe writes are the default.) If PyMongo gave out sockets to threads at random, then the -following sequence could occur:

-
    -
1. This thread gets a socket, which I'll call socket 1, from the pool.
2. The thread sends the update message to MongoDB on socket 1. The thread does not ask for nor await a response.
3. The thread returns socket 1 to the pool.
4. The thread asks for a socket again, and gets a different one: socket 2.
5. The thread sends the find_one message to MongoDB on socket 2.
6. MongoDB happens to read from socket 2 first, and executes the find_one.
7. Finally, MongoDB reads the update message from socket 1 and executes it.
-

In this case, the count displayed to the visitor wouldn't include this -visit.

-

I know what you're thinking: just do the find_one first, add one to it, -and display it to the user. Then send the update to MongoDB to -increment the counter. Or use -findAndModify -to update the counter and get its new value in one round trip. Those are -great solutions, but then I would have no excuse to explain requests to -you.
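(For the record, the findAndModify approach is a sketch like this, reusing the counts collection and this_page_url function from the example above:)

doc = counts.find_and_modify(
    {'_id': this_page_url()},
    {'$inc': {'n': 1}},
    upsert=True,
    new=True)  # return the document as it looks after the update

print 'You are visitor number %s' % doc['n']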

-

Maybe you're thinking of a different fix: use update(safe=True). That -would work, as well, with the added advantage that you'd know if the -update failed, for example because MongoDB's disk is full, or you -violated a unique index. But a safe update comes with a latency cost: -you must send the update, wait for the acknowledgement, then send -the find_one and wait for the response. In a tight loop the extra -latency is significant.
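That version of the counter code reuses the example above, with one extra keyword argument:

counts.update(
    {'_id': this_page_url()},
    {'$inc': {'n': 1}},
    upsert=True,
    safe=True)  # wait for getLastError before continuing

n = counts.find_one({'_id': this_page_url()})['n']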

-

The Fix: One Socket Per Thread

-

PyMongo solves this problem by automatically assigning a socket to each -thread, when the thread first requests one. (Update: since MongoClient defaults to -using safe writes, it no longer assigns a socket to each thread. Instead all sockets are kept in a connection pool.) -The socket is stored in a -thread-local variable within the connection pool. Since MongoDB -processes messages on any single socket in order, using a single socket -per thread guarantees that in our example code, update is processed -before find_one, so find_one's result includes the current visit.

-

More Awesome Connection Pooling

-

While PyMongo's socket-per-thread behavior nicely resolves the -inconsistency problem, there are some nasty performance costs that are -fixed in the forthcoming PyMongo 2.2. (I did most of this work, at the -direction of PyMongo's maintainer Bernie Hackett and with -co-brainstorming by my colleague Dan Crosta.)

-

Connection Churn

-

PyMongo 2.1 stores each thread's socket in a thread-local variable. -Alas, when the thread dies, its thread locals are garbage-collected and -the socket is closed. This means that if you regularly create and -destroy threads that access MongoDB, then you are regularly creating and -destroying connections rather than reusing them.

-

You could call Connection.end_request() before the thread dies. -end_request() returns the socket to the pool so it can be used by a -future thread when it first needs a socket. But, just as most people -don't recycle their plastic bottles, most developers don't use -end_request(), so good sockets are wasted.
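A worker thread that cleans up after itself looks something like this sketch:

import threading

import pymongo

connection = pymongo.Connection()

def worker():
    try:
        connection.my_database.counts.find_one()
    finally:
        connection.end_request()  # return this thread's socket to the pool

threading.Thread(target=worker).start()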

-

In PyMongo 2.2, I wrote a "socket reclamation" feature that notices when -a thread has died without calling end_request, and reclaims its socket -for the pool. Under the hood, I wrap each socket in a SocketInfo -object, whose __del__ method returns the socket to the pool. For your -application, this means that once you've created as many sockets as you -need, those sockets can be reused as threads are created and destroyed -over the lifetime of the application, saving you the latency cost of -creating a new connection for each thread.
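The pattern, greatly simplified (the class shape and the return_socket name here are illustrative, not PyMongo's actual internals):

class SocketInfo(object):
    def __init__(self, sock, pool):
        self.sock = sock
        self.pool = pool

    def __del__(self):
        # The owning thread died without calling end_request, so hand
        # the raw socket back to the pool instead of letting it close.
        self.pool.return_socket(self.sock)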

-

Total Number of Connections

-

Consider a web crawler that launches hundreds of threads. Each thread -downloads pages from the Internet, analyzes them, and stores the results -of that analysis in MongoDB. Only a couple threads access MongoDB at -once, since they spend most of their time downloading pages, but PyMongo -2.1 must use a separate socket for each. In a big deployment, this could -result in thousands of connections and a lot of overhead for the MongoDB -server.

-

In PyMongo 2.2 we've added an auto_start_request option to the -Connection constructor. It defaults to True, in which case PyMongo 2.2's -Connection acts the same as 2.1's, except it reclaims sockets from dead -threads. If you set auto_start_request to False, however, threads can -freely and safely share sockets. The Connection will only create as many -sockets as are actually used simultaneously. In our web crawler -example, if you have a hundred threads but only a few of them are -simultaneously accessing MongoDB, then only a few sockets are ever -created.

-

start_request and end_request

-

If you create a Connection with auto_start_request=False you might -still want to do some series of operations on a single socket for -read-your-own-writes consistency. For that case I've provided an API -that can be used three ways, in ascending order of convenience.

-

You can call start/end_request on the Connection object directly:

-
connection = pymongo.Connection(auto_start_request=False)
-counts = connection.my_database.counts
-connection.start_request()
-try:
-    counts.update(
-        {'_id': this_page_url()},
-        {'$inc': {'n': 1}},
-        upsert=True)
-
-    n = counts.find_one({'_id': this_page_url()})['n']
-finally:
-    connection.end_request()
-
- - -

The Request object

-

start_request() returns a Request object, so why not use it?

-
connection = pymongo.Connection(auto_start_request=False)
-counts = connection.my_database.counts
-request = connection.start_request()
-try:
-    counts.update(
-        {'_id': this_page_url()},
-        {'$inc': {'n': 1}},
-        upsert=True)
-
-    n = counts.find_one({'_id': this_page_url()})['n']
-finally:
-    request.end()
-
- - -

Using the Request object as a context manager

-

Request objects can be used as context -managers -in Python 2.5 and later, so the previous example can be terser:

-
connection = pymongo.Connection(auto_start_request=False)
-counts = connection.my_database.counts
-with connection.start_request() as request:
-    counts.update(
-        {'_id': this_page_url()},
-        {'$inc': {'n': 1}},
-        upsert=True)
-
-    n = counts.find_one({'_id': this_page_url()})['n']
-
- - -

Proof

-

I wrote a very messy test script to -verify the effect of my changes on the number of open sockets, and the -total number of sockets created.

-

The script queries Mongo for 60 seconds. It starts a thread each second -for 40 seconds, each thread lasting for 20 seconds and doing 10 queries -per second. So there's a 20-second rampup until there are 20 threads, -then 20 seconds of steady-state with 20 concurrent threads (one dying -and one created per second), then a 20 second cooldown until the last -thread completes. My script then parses the MongoDB log to see when -sockets were opened and closed.

-

I tested the script with the current PyMongo 2.1, and also with PyMongo -2.2 with auto_start_request=True and with auto_start_request=False.

-

PyMongo 2.1 has one socket per thread throughout the test. Each new -thread starts a new socket because old threads' sockets are lost. It -opens 41 total sockets (one for each worker thread plus one for the -main) and tops out at 21 concurrent sockets, because there are 21 -concurrent threads (counting the main thread):

-

-

PyMongo 2.2 with auto_start_request=True acts rather differently (and -much better). It ramps up to 21 sockets and keeps them open throughout -the test, reusing them for new threads when old threads die:

-

-

And finally, with auto_start_request=False, PyMongo 2.2 only needs as many -sockets as there are threads concurrently waiting for responses from -MongoDB. In my test, this tops out at 7 sockets, which stay open until -the whole pool is deleted, because max_pool_size is 10:

-

-

Conclusion

-

Applications that create and destroy a lot of threads without calling -end_request() should run significantly faster with PyMongo 2.2 because -threads' sockets are automatically reused after the threads die.

-

Although we had to default the new auto_start_request option to True -for backwards compatibility, virtually all applications should set it to -False. Heavily multithreaded apps will need far fewer sockets this way, -meaning they'll spend less time establishing connections to MongoDB, and -put less load on the server.

diff --git a/emptysquare/content/restructured-text-chrome-livereload.md b/emptysquare/content/restructured-text-chrome-livereload.md deleted file mode 100644 index 9a660c9c..00000000 --- a/emptysquare/content/restructured-text-chrome-livereload.md +++ /dev/null @@ -1,39 +0,0 @@ -+++ -type = "post" -title = "reStructured Text With Chrome And LiveReload" -date = "2014-10-06T11:41:12" -description = "An effective little workflow for writing RST." -category = ["Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "disabled.png" -draft = false -disqus_identifier = "5430ba9d5393740961f61a4b" -disqus_url = "https://emptysqua.re/blog/5430ba9d5393740961f61a4b/" -+++ - -

I've found a useful set of tools for writing RST, when I must. I'll show you how to configure LiveReload and Chrome to make the experience of writing RST's tortured syntax somewhat bearable.

-

(This article is an improvement over the method I wrote about last year.)

-

LiveReload

-

I bought LiveReload from the Mac App Store for $10, and opened it. Under "Monitored Folders" I added my project's home directory: I was updating Motor's documentation so I added the "motor/doc" directory.

-

LiveReload

-

Next to "Monitoring 44 file extensions" I hit "Options" and added "rst" as a 45th.

-

LiveReload file extension options

-

Then I checked "Run custom command after processing changes" and hit "Options". In the popup dialog I added the command for building Motor's documentation. It's a typical Sphinx project, so the build command is:

-
/Users/emptysquare/.virtualenvs/motor/bin/sphinx-build \
-  -b html -d _build/doctrees . _build/html
-
- - -

Note that I specified the full path to the virtualenv'ed sphinx script.

-

That's all there is to configuring LiveReload. Hit the green box on the lower right of its main window to see the build command's output. Now whenever you change an RST file you should see some Sphinx output scroll by:

-

LiveReload Sphinx output

-

Chrome

-

Next, follow LiveReload's instructions for installing the Chrome extension. Pay attention to LiveReload's tip: "If you want to use it with local files, be sure to enable 'Allow access to file URLs' checkbox in Tools > Extensions > LiveReload after installation."

-

Now open one of the HTML files Sphinx made, and click the LiveReload icon on your browser to enable it. The difference between "enabled" and "disabled" is damn subtle. This is disabled:

-

Disabled

-

This is enabled:

-

Enabled

-

The icon plays it close to the chest, but if you hover your mouse over it, it'll admit whether it's enabled or not.

-

Back at the LiveReload application, you'll now see "1 browser connected."

-

Try it out! Now you can make changes to your RST and see it live in your browser. I don't think I'll ever learn to type RST's syntax reliably, but at least now, I can see at once whether I've typed it right or not.

diff --git a/emptysquare/content/restructuredtext-in-pycharm-firefox-and-anger.md b/emptysquare/content/restructuredtext-in-pycharm-firefox-and-anger.md deleted file mode 100644 index e35fbeb5..00000000 --- a/emptysquare/content/restructuredtext-in-pycharm-firefox-and-anger.md +++ /dev/null @@ -1,27 +0,0 @@ -+++ -type = "post" -title = "reStructuredText in PyCharm, Firefox, and Anger" -date = "2013-04-10T11:09:07" -description = "An only-somewhat-shitty workflow for writing reST." -category = ["Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "auto-reload.png" -draft = false -disqus_identifier = "5165808353937474b99b1857" -disqus_url = "https://emptysqua.re/blog/5165808353937474b99b1857/" -+++ - -

I spend a lot of time writing Python package documentation in reST. Nevertheless, I find reST's markup permanently unlearnable, so I format docs by trial and error: I type a few backticks and colons and angle-brackets and random crap, sphinx-build the docs as HTML, and see if they look okay.

-

Here's some tools to support this expert workflow.

-

PyCharm: My favorite Python IDE has basic syntax-highlighting and auto-completion for reST. It's not much, but it far exceeds the amount of reStructuredText syntax that can fit in my tiny brain. It really shines when I'm embedding Python code examples in my docs: PyCharm gives me full IDE support, including automatically adding imports, auto-completing method names and parameters, and nearly all the help I get when editing normal Python files.

-

There's a file-watcher plugin for PyCharm that seems like a nice way to rebuild docs when the source files change, but it's not yet compatible with the latest version of PyCharm. So instead:

-

Watchdog: I install the watchdog Python package, which watches files and directories for changes. Watchdog gives me a command-line tool called watchmedo. (I find this fact unlearnable, too; why isn't the tool called watchdog, like the package?) I tell it to watch my package's files for changes and rebuild the docs whenever I save a file:

-
watchmedo shell-command --command="sphinx-build doc build" .
-
- - -

Now that I can regenerate HTML automatically, I need a way to reload the browser window automatically:

-

auto-reload is a Firefox extension that detects any tab with a file:// URL and reloads it when the file changes. In my testing it seems to detect changes in linked files (CSS and Javascript) too. A nice little bar slides down to tell me when it's reloading. That way I know the page is still a mess because my reST is still wrong, not because it hasn't reloaded:

-

Auto reload

-

This little suite of tools deals well with invoking Sphinx and reloading my web page, so I can focus on the task at hand: trying to write reStructuredText, which is a loathsome afterbirth expelled from the same womb as XML and TeX.

diff --git a/emptysquare/content/save-the-monkey-reliably-writing-to-mongodb.md b/emptysquare/content/save-the-monkey-reliably-writing-to-mongodb.md deleted file mode 100644 index 23a088c5..00000000 --- a/emptysquare/content/save-the-monkey-reliably-writing-to-mongodb.md +++ /dev/null @@ -1,263 +0,0 @@ -+++ -type = "post" -title = "Save the Monkey: Reliably Writing to MongoDB" -date = "2011-12-08T14:41:20" -description = "" -category = ["MongoDB", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "3064180867_0f293b8f27.jpg" -draft = false -disqus_identifier = "236 http://emptysquare.net/blog/?p=236" -disqus_url = "https://emptysqua.re/blog/236 http://emptysquare.net/blog/?p=236/" -+++ - -

-

Photo: Kevin Jones

-

MongoDB replica sets claim "automatic failover" when a primary server goes down, and they live up to the claim, but handling failover in your application code takes some care. I'll walk you through writing a failover-resistant application in PyMongo.

-

Update: This article is superseded by my MongoDB World 2016 talk and the accompanying article:

-

Writing Resilient MongoDB Applications

-

Setting the Scene

-

Mabel the Swimming Wonder Monkey is participating in your cutting-edge research on simian scuba diving. To keep her alive underwater, your application must measure how much oxygen she consumes each second and pipe the same amount of oxygen to her scuba gear. In this post, I'll describe how to write reliably to MongoDB.

-

MongoDB Setup

-

Since Mabel's life is in your hands, you want a robust Mongo deployment. Set up a 3-node replica set. We'll do this on your local machine using three TCP ports, but of course in production you'll have each node on a separate machine:

-
$ mkdir db0 db1 db2
-$ mongod --dbpath db0 --logpath db0/log --pidfilepath db0/pid --port 27017 --replSet foo --fork
-$ mongod --dbpath db1 --logpath db1/log --pidfilepath db1/pid --port 27018 --replSet foo --fork
-$ mongod --dbpath db2 --logpath db2/log --pidfilepath db2/pid --port 27019 --replSet foo --fork
-
- - -

(Make sure you don't have any mongod processes running on those ports first.)

-

Now connect up the nodes in your replica set. My machine's hostname is 'emptysquare.local'; replace it with yours when you run the example:

-
$ hostname
-emptysquare.local
-$ mongo
-> rs.initiate({
-  _id: 'foo',
-  members: [
-    {_id: 0, host:'emptysquare.local:27017'},
-    {_id: 1, host:'emptysquare.local:27018'},
-    {_id: 2, host:'emptysquare.local:27019'}
-  ]
-})
-
- - -

The first _id, 'foo', must match the name you passed with --replSet on the command line; otherwise MongoDB will complain. If everything's correct, MongoDB replies with "Config now saved locally. Should come online in about a minute." Run rs.status() a few times until you see that the replica set has come online—the first member's stateStr will be "PRIMARY" and the other two members' stateStrs will be "SECONDARY". On my laptop this takes about 15 seconds.

-

Voilà: a bulletproof 3-node replica set! Let's start the Mabel experiment.

-

Definitely Writing

-

Install PyMongo and create a Python script called mabel.py with the following:

-
import datetime, random, time
-import pymongo
-
-mabel_db = pymongo.MongoReplicaSetClient(
-    'localhost:27017,localhost:27018,localhost:27019',
-    replicaSet='foo'
-).mabel
-
-while True:
-    time.sleep(1)
-    mabel_db.breaths.insert({
-        'time': datetime.datetime.utcnow(),
-        'oxygen': random.random()
-    })
-
-    print 'wrote'
-
- - -

mabel.py will record the amount of oxygen Mabel consumes (or, in our test, a random amount) and insert it into MongoDB once per second. Run it:

-
$ python mabel.py
-wrote
-wrote
-wrote
-
- - -

What happens when our good-for-nothing sysadmin unplugs the primary server? Grab the primary's process id from db0/pid and kill it. Now, all is not well with our Python script:

-
Traceback (most recent call last):
-  File "mabel.py", line 10, in <module>
-    'oxygen': random.random()
-  File "/Users/emptysquare/.virtualenvs/pymongo/mongo-python-driver/pymongo/collection.py", line 310, in insert
-    continue_on_error, self.__uuid_subtype), safe)
-  File "/Users/emptysquare/.virtualenvs/pymongo/mongo-python-driver/pymongo/mongo_replica_set_client.py", line 738, in _send_message
-    raise AutoReconnect(str(why))
-pymongo.errors.AutoReconnect: [Errno 61] Connection refused
-
- - -

This is terrible. WTF happened to "automatic failover"? And why does PyMongo raise an AutoReconnect error rather than actually automatically reconnecting?

-

Well, automatic failover does work, in the sense that one of the secondaries will take over as a new primary in a few seconds. Do rs.status() in the mongo shell to confirm that:

-
$ mongo --port 27018 # connect to one of the surviving mongods
-PRIMARY> rs.status()
-// edited for readability ...
-{
-    "set" : "foo",
-    "members" : [ {
-            "_id" : 0,
-            "name" : "emptysquare.local:27017",
-            "stateStr" : "(not reachable/healthy)",
-            "errmsg" : "socket exception"
-        }, {
-            "_id" : 1,
-            "name" : "emptysquare.local:27018",
-            "stateStr" : "PRIMARY"
-        }, {
-            "_id" : 2,
-            "name" : "emptysquare.local:27019",
-            "stateStr" : "SECONDARY",
-        }
-    ]
-}
-
- - -

Depending on which mongod took over as the primary, your output could be a little different. Regardless, there is a new primary, so why did our write fail? The answer is that PyMongo doesn't try repeatedly to insert your document—it just tells you that the first attempt failed. It's your application's job to decide what to do about that. To explain why, let us indulge in a brief digression.

-

Brief Digression: Monkeys vs. Kittens

-

Monkeys vs Kittens

-

If what you're inserting is voluminous but no single document is very important, like pictures of kittens or web analytics, then in the extremely rare event of a failover you might prefer to discard a few documents, rather than blocking your application while it waits for the new primary. Throwing an exception if the primary dies is often the right thing to do: You can notify your user that he should try uploading his kitten picture again in a few seconds once a new primary has been elected.

-

But if your updates are infrequent and tremendously valuable, like Mabel's oxygen data, then your application should try very hard to write them. Only you know what's best for your data, so PyMongo lets you decide. Let's return from this digression and implement that.

-

Trying Hard to Write

-

Let's bring up the mongod we just killed:

-
$ mongod --dbpath db0 --logpath db0/log --pidfilepath db0/pid --port 27017 --replSet foo --fork
-
- - -

And update mabel.py with the following armor-plated loop:

-
while True:
-    time.sleep(1)
-    data = {
-        'time': datetime.datetime.utcnow(),
-        'oxygen': random.random()
-    }
-
-    # Try for five minutes to recover from a failed primary
-    for i in range(60):
-        try:
-            mabel_db.breaths.insert(data)
-            print 'wrote'
-            break # Exit the retry loop
-        except pymongo.errors.AutoReconnect, e:
-            print 'Warning', e
-            time.sleep(5)
-    else:
-        raise Exception("Couldn't write!")
-
- - -

In a Python for-loop, the "else" clause executes if we exhaust the loop without executing the "break" statement. So this loop tries for five minutes to find a new primary, retrying every 5 seconds. If there's no primary after five minutes, there may never be one. Perhaps the sysadmin unplugged a majority of the members. In this case we raise an exception.
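
If the for-else construct is unfamiliar, here is a tiny standalone illustration of the control flow (nothing to do with Mabel, just made-up values):

```python
for i in range(3):
    if i == 1:
        print 'breaking at', i
        break   # the else clause below is skipped
else:
    print 'finished the loop without breaking'

for i in range(3):
    pass        # no break here, so...
else:
    print '...the else clause runs'
```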

-

Now run python mabel.py, and again kill the primary. mabel.py's output will look like:

-
wrote
-Warning [Errno 61] Connection refused
-Warning emptysquare.local:27017: [Errno 61] Connection refused, emptysquare.local:27019: [Errno 61] Connection refused, emptysquare.local:27018: [Errno 61] Connection refused
-Warning emptysquare.local:27017: not primary, emptysquare.local:27019: [Errno 61] Connection refused, emptysquare.local:27018: not primary
-wrote
-wrote
-.
-.
-.
-
- - -

mabel.py goes through a few stages of grief when the primary dies, but in a few seconds it finds a new primary, inserts its data, and continues happily.

-

What About Duplicates?

-

Leaving monkeys and kittens aside, another reason PyMongo doesn't automatically retry your inserts is the risk of duplication: If the first attempt caused an error, PyMongo can't know if the error happened before Mongo wrote the data, or after. What if we end up writing Mabel's oxygen data twice? Well, there's a trick you can use to prevent this: generate the document id on the client.

-

Whenever you insert a document, Mongo checks if it has an "_id" field and if not, it generates an ObjectId for it. But you're free to choose the new document's id before you insert it, as long as the id is unique within the collection. You can use an ObjectId or any other type of data. In mabel.py you could use the timestamp as the document id, but I'll show you the more generally applicable ObjectId approach:

-
from bson.objectid import ObjectId
-
-while True:
-    time.sleep(1)
-    data = {
-        '_id': ObjectId(),
-        'time': datetime.datetime.utcnow(),
-        'oxygen': random.random()
-    }
-
-    # Try for five minutes to recover from a failed primary
-    for i in range(60):
-        try:
-            mabel_db.breaths.insert(data)
-            print 'wrote'
-            break # Exit the retry loop
-        except pymongo.errors.AutoReconnect, e:
-            print 'Warning', e
-            time.sleep(5)
-        except pymongo.errors.DuplicateKeyError:
-            # It worked the first time
-            break
-    else:
-        raise Exception("Couldn't write!")
-
- - -

We set the document's id to a newly-generated ObjectId in our Python code, before entering the retry loop. Then, if our insert succeeds just before the primary dies and we catch the AutoReconnect exception, the next time we try to insert the document we'll catch a DuplicateKeyError and we'll know for sure that the insert succeeded. You can use this technique for safe, reliable writes in general.
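
If you find yourself doing this in several places, the whole pattern can be wrapped in a helper. This is my own sketch for illustration, not part of PyMongo; the function name, attempt count, and delay are arbitrary:

```python
import time

import pymongo
from bson.objectid import ObjectId


def insert_with_retry(collection, document, attempts=60, delay=5):
    """Insert document, retrying across a primary failover.

    Returns once the document is definitely stored, or raises if no
    primary appears after attempts * delay seconds.
    """
    # A client-generated _id makes retries idempotent.
    document.setdefault('_id', ObjectId())

    for _ in range(attempts):
        try:
            collection.insert(document)
            return
        except pymongo.errors.DuplicateKeyError:
            # An earlier attempt succeeded just before the failover.
            return
        except pymongo.errors.AutoReconnect:
            time.sleep(delay)

    raise Exception("Couldn't write!")
```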

-
-

Bibliography

-

Apocryphal story of Mabel, the Swimming Wonder Monkey

-

More likely true, very brutal story of 3 monkeys killed by a computer error

-
-

History: Updated April 3, 2014 for current PyMongo syntax.

diff --git a/emptysquare/content/synchronously-build-mongodb-indexes.md b/emptysquare/content/synchronously-build-mongodb-indexes.md deleted file mode 100644 index 7e8e2358..00000000 --- a/emptysquare/content/synchronously-build-mongodb-indexes.md +++ /dev/null @@ -1,78 +0,0 @@ -+++ -type = "post" -title = "Synchronously Build Indexes On a Whole MongoDB Replica Set" -date = "2013-07-05T15:14:58" -description = "How do you know when an index has finished building on all the members of a replica set?" -category = ["MongoDB", "Programming", "Python"] -tag = [] -enable_lightbox = false -draft = false -disqus_identifier = "51d71a4f5393747383eaed99" -disqus_url = "https://emptysqua.re/blog/51d71a4f5393747383eaed99/" -+++ - -**Update**: Welcome to 2017! We now create indexes with the "createIndexes" command, which has accepted a writeConcern parameter since MongoDB 3.4. To build indexes on all replicas: - -```python -from pymongo import * - -collection = MongoClient().db.get_collection( - "my_collection", write_concern=WriteConcern(w=3)) - -collection.create_index([('a', 1)]) -``` - -Replace "3" with the number of replica set members you have (excluding arbiters). - -*** - -

I help maintain PyMongo, 10gen's Python driver for MongoDB. Mainly this means I write a lot of tests, and writing tests sometimes requires me to solve problems no normal person would encounter. I'll describe one such problem and the fix: I'm going to explain how to wait for an index build to finish on all secondary members of a replica set.

-

Normally, this is how I'd build an index on a replica set:

-
from pymongo import MongoReplicaSetClient, ASCENDING
-
-client = MongoReplicaSetClient(
-    'server0,server1,server2',
-    replicaSet='replica_set_name')
-
-collection = client.test.collection
-collection.create_index([('key', ASCENDING)])
-print("All done!")
-
- - -

Once "All done!" is printed, I know the index has finished building on the primary. (I could pass background=True if I didn't want to wait for the build to finish.) Once the index is built on the primary, the primary inserts a description of the index into the system.indexes collection, and appends the insert operation to its oplog:

-
{
-    "ts" : { "t" : 1373049049, "i" : 1 },
-    "op" : "i",
-    "ns" : "test.system.indexes",
-    "o" : {
-        "ns" : "test.collection",
-        "key" : { "key" : 1 },
-        "name" : "key_1"
-    }
-}
-
- - -

The ts is the timestamp for the operation. "op": "i" means this is an insert, and the "o" subdocument is the index description itself. The secondaries see the entry and start their own index builds.

-

But now my call to PyMongo's create_index returns and Python prints "All done!" In one of the tests I wrote, I couldn't start testing until the index was ready on the secondaries, too. How do I wait until then?

-

The trick is to insert the index description into system.indexes manually. This way I can insert with a write concern so I wait for the insert to be replicated:

-
from pymongo import MongoReplicaSetClient, ASCENDING
-
-client = MongoReplicaSetClient(
-    'server0,server1,server2',
-    replicaSet='replica_set_name')
-
-# Count the number of replica set members.
-w = 1 + len(client.secondaries)
-
-# Manually form the index description.
-from pymongo.collection import _gen_index_name
-index = {
-    'ns': 'test.collection',
-    'name': _gen_index_name([('key', 1)]),
-    'key': {'key': ASCENDING}}
-
-client.test.system.indexes.insert(index, w=w)
-
-print("All done!")
-
- - -

Setting the w parameter to the number of replica set members (one primary plus N secondaries) makes insert wait for the operation to complete on all members. First the primary builds its index, then it adds it to its oplog, then the secondaries all start building the index. Only once all secondaries have finished building the index is the insert operation considered complete. Once Python prints "All done!" we know the index is finished everywhere.
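
One caveat: if a secondary is down, that insert can block for a long time waiting for the write concern to be satisfied. In PyMongo 2.x you can bound the wait by also passing wtimeout (in milliseconds) as a getLastError option; here is a sketch with an arbitrary 30-second limit:

```python
from pymongo.errors import OperationFailure

try:
    # Give up if the index description hasn't replicated within 30 seconds.
    client.test.system.indexes.insert(index, w=w, wtimeout=30000)
    print("All done!")
except OperationFailure:
    print("Gave up waiting for the secondaries")
```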

diff --git a/emptysquare/content/toro-0-6-released.md b/emptysquare/content/toro-0-6-released.md deleted file mode 100644 index cc2f365f..00000000 --- a/emptysquare/content/toro-0-6-released.md +++ /dev/null @@ -1,24 +0,0 @@ -+++ -type = "post" -title = "Toro 0.6 Released" -date = "2014-07-08T22:05:47" -description = "One minor bug fixed in Toro, my package of semaphores, locks, and queues for Tornado coroutines." -category = ["Programming", "Python"] -tag = ["tornado"] -enable_lightbox = false -thumbnail = "toro.png" -draft = false -disqus_identifier = "53bca2fb5393745d31c3f8b7" -disqus_url = "https://emptysqua.re/blog/53bca2fb5393745d31c3f8b7/" -+++ - -

Toro

-

I've just released version 0.6 of Toro. Toro provides semaphores, queues, and so on, for advanced control flows with Tornado coroutines. Get it with "pip install --upgrade toro". Toro's documentation, with plenty of examples, is on ReadTheDocs.

-

There is one bugfix in this release. A floating point maxsize had been treated as infinite. So if you did this:

-
q = toro.Queue(maxsize=1.3)
-
- - -

...then the queue would never be full. In the newest version of Toro, a maxsize of 1.3 now acts like a maxsize of 2.
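
A quick way to see the new behavior, assuming the standard Queue-style methods that Toro mirrors (put_nowait and full):

```python
import toro

q = toro.Queue(maxsize=1.3)
q.put_nowait('a')
print q.full()   # False: one item, and 1 < 1.3
q.put_nowait('b')
print q.full()   # True: two items, so maxsize=1.3 behaves like maxsize=2
```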

-

Shouldn't Toro just require that maxsize be an integer? Well, the Python standard Queue allows a floating-point number. So when Vajrasky Kok noticed that asyncio's Queue treats a floating-point maxsize as infinity, he proposed a fix that handles floats the same as the standard Queue does. (That asyncio bug was my fault, too.)

-

Once Guido van Rossum accepted that fix, I updated Toro to comply with the other two Queues.

diff --git a/emptysquare/content/toro-0-7-released.md b/emptysquare/content/toro-0-7-released.md deleted file mode 100644 index d8c9dd25..00000000 --- a/emptysquare/content/toro-0-7-released.md +++ /dev/null @@ -1,33 +0,0 @@ -+++ -type = "post" -title = "Toro 0.7 Released" -date = "2014-10-29T10:09:55" -description = "A major bug fixed in Toro, my package of semaphores, locks, and queues for Tornado coroutines." -category = ["Programming", "Python"] -tag = ["tornado"] -enable_lightbox = false -thumbnail = "toro.png" -draft = false -disqus_identifier = "5450f3bb5393740960d41350" -disqus_url = "https://emptysqua.re/blog/5450f3bb5393740960d41350/" -+++ - -

Toro

-

I've just released version 0.7 of Toro. Toro provides semaphores, locks, events, conditions, and queues for Tornado coroutines. It enables advanced coordination among coroutines, similar to what you do in a multithreaded application. Get the latest version with "pip install --upgrade toro". Toro's documentation, with plenty of examples, is on ReadTheDocs.

-

There is one bugfix in this release. Semaphore.wait() is supposed to wait until the semaphore can be acquired again:

-
@gen.coroutine
-def coro():
-    sem = toro.Semaphore(1)
-    assert not sem.locked()
-
-    # A semaphore with initial value of 1 can be acquired once,
-    # then it's locked.
-    sem.acquire()
-    assert sem.locked()
-
-    # Wait for another coroutine to release the semaphore.
-    yield sem.wait()
-
- - -

... however, there was a bug and the semaphore didn't mark itself "locked" when it was acquired, so "wait" always returned immediately. I'm grateful to "abing" on GitHub for noticing the bug and contributing a fix.
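
For the curious, here is a minimal sketch of the behavior the fix restores. It assumes Toro 0.7 with Tornado's IOLoop; the timings are arbitrary:

```python
from tornado import gen
from tornado.ioloop import IOLoop
import toro


@gen.coroutine
def demo():
    sem = toro.Semaphore(1)
    sem.acquire()          # value goes 1 -> 0; the semaphore is now locked
    assert sem.locked()    # this is the state the 0.6 bug failed to record

    # Release from a timer, standing in for another coroutine.
    loop = IOLoop.current()
    loop.add_timeout(loop.time() + 0.1, sem.release)

    yield sem.wait()       # with 0.7 this genuinely blocks until release()
    print 'semaphore released'

IOLoop.current().run_sync(demo)
```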

diff --git a/emptysquare/content/using-jqtouch-js-with-ibutton-js.md b/emptysquare/content/using-jqtouch-js-with-ibutton-js.md deleted file mode 100644 index bbbc8922..00000000 --- a/emptysquare/content/using-jqtouch-js-with-ibutton-js.md +++ /dev/null @@ -1,84 +0,0 @@ -+++ -type = "post" -title = "Using jQTouch.js with iButton.js" -date = "2011-10-26T10:27:52" -description = "Fixing an incompatibility between two Javascript libraries for making iOS-like web apps." -category = ["Programming"] -tag = ["javascript"] -enable_lightbox = false -thumbnail = "how_wide.png" -draft = false -disqus_identifier = "41 http://emptysquare.net/blog/?p=41" -disqus_url = "https://emptysqua.re/blog/41 http://emptysquare.net/blog/?p=41/" -+++ - -

jQTouch is a jQuery-based Javascript library that simulates an iPhone-like interface using only Javascript and HTML5. It's designed for WebKit browsers (Safari Desktop, Safari Mobile, Android, Chrome) but is adaptable to Firefox with little work. (Don't ask about IE.) By default, it renders HTML like this:

-
<span class="toggle"><input type="checkbox"></span>
-
- - -

... as toggle switches, like this:

-

-

Another library, iButton.js, provides similar functionality but has some advantages: it works on all browsers, you can easily togglify your checkboxes at runtime, dragging laterally across the control with your mouse or fingertip works as expected, and frankly it makes prettier toggles:

-

-

So you might be motivated to combine jQTouch with iButton.js. It should be simple — just remove all the <span class="toggle"> tags and run iButton's initialization method — but you'll run into some troubles. (If you don't believe me when I say "troubles", skim this discussion.)

-

So, here's the precise problem with combining these two libraries.

-

When jQTouch initializes, it styles every top-level div with display=none, except for the currently showing div. Here are the CSS rules it uses:

-
#jqt > * {
-  display: none;
-}
-
-#jqt > .current {
-  display: block !important;
-  z-index: 10;
-}
-
- - -

This way jQTouch can treat top-level divs like screens (for you iOS devs, that's a UIViewController) in an iOS app, hiding and showing them according to where the user is in the navigation stack.

-

When iButton.js initializes, it wraps every checkbox with its fancy toggle-control HTML, and then it measures the width of the HTML it created so it knows how far to slide the toggle control when a user clicks on it.

-

-

Alas, it's impossible to measure the width of a hidden element. First jQTouch hides all but the current div, then iButton tries to initialize all the toggles, and it thinks they're all zero pixels wide.

-

My solution is to wait for jQTouch to display a page before I run iButton on the checkboxes in that page, like so:

-
var pagesWithCheckboxes = _.uniq($('input[type="checkbox"]').closest('div.page'));
-_.each(pagesWithCheckboxes, function(page) {
-    var $page = $(page);
-    $page.bind('pageAnimationEnd', function(e, info) {
-        if(info.direction === 'in') {
-            $page.find('input[type="checkbox"]').iButton();
-        }
-    });
-});
-
- - -

_.uniq() and _.each() are from underscore.js. I use _.uniq() to ensure I don't bind the event handler multiple times to pages with multiple checkboxes.

-

A final note: if you create checkboxes dynamically after the page has loaded, you must call $(my_new_checkbox_element).iButton() on them, once they're visible, to ensure they get the proper toggle-switch behavior.

diff --git a/emptysquare/content/wasps-nest-read-copy-update-python.md b/emptysquare/content/wasps-nest-read-copy-update-python.md deleted file mode 100644 index f01e83a4..00000000 --- a/emptysquare/content/wasps-nest-read-copy-update-python.md +++ /dev/null @@ -1,215 +0,0 @@ -+++ -type = "post" -title = "Wasp's Nest: The Read-Copy-Update Pattern In Python" -date = "2013-05-08T22:47:45" -description = "A concurrency-control pattern that solves some reader-writer problems without mutexes." -category = ["MongoDB", "Programming", "Python"] -tag = [] -enable_lightbox = false -thumbnail = "paper-wasp-closeup.jpg" -draft = false -disqus_identifier = "518b091c53937474bbee4005" -disqus_url = "https://emptysqua.re/blog/518b091c53937474bbee4005/" -+++ - -

Paper Wasp © MzePhotos.com, Some Rights Reserved

-

In recent work on PyMongo, I used a concurrency-control pattern that solves a variety of the reader-writer problem without mutexes. It's similar to the read-copy-update technique used extensively in the Linux kernel. I'm dubbing it the Wasp's Nest. Stick with me—by the end of this post you'll know a neat concurrency pattern, and have a good understanding of how PyMongo handles replica set failovers.

-

Update: In this post's first version I didn't know how close my code is to "read-copy-update". Robert Moore schooled me in the comments. I also named it "a lock-free concurrency pattern" and Steve Baptiste pointed out that I was using the term wrong. My algorithm merely solves a race condition without adding a mutex; it's not lock-free. I love this about blogging: in exchange for a little humility I get a serious education.

-
- -

The Mission

- -

MongoDB is deployed in "replica sets" of identical database servers. A replica set has one primary server and several read-only secondary servers. Over time a replica set's state can change. For example, if the primary's cooling fans fail and it bursts into flames, a secondary takes over as primary a few seconds later. Or a sysadmin can add another server to the set, and once it's synced up it becomes a new secondary.

-

I help maintain PyMongo, the Python driver for MongoDB. Its MongoReplicaSetClient is charged with connecting to the members of a set and knowing when the set changes state. Replica sets and PyMongo must avoid any single points of failure in the face of unreliable servers and networks—we must never assume any particular members of the set are available.

-

Consider this very simplified sketch of a MongoReplicaSetClient:

-
class Member(object):
-    """Represents one server in the set."""
-    def __init__(self, pool):
-        # The connection pool.
-        self.pool = pool
-
-class MongoReplicaSetClient(object):
-    def __init__(self, seeds):
-        self.seeds = seeds
-        self.primary = None
-        self.members = {}
-        self.refresh()
-
-        # The monitor calls refresh() every 30 sec.
-        self.monitor = MonitorThread(self)
-
-    def refresh(self):
-        # If we're already connected, use our list of known
-        # members. Otherwise use the passed-in list of
-        # possible members, the 'seeds'.
-        seeds = self.members.keys() or self.seeds
-
-        # Try seeds until first success.
-        ismaster_response = None
-        for seed in seeds:
-            try:
-                # The 'ismaster' command gets info
-                # about the whole set.
-                ismaster_response = call_ismaster(seed)
-                break
-            except socket.error:
-                # Host down / unresolvable, try the next.
-                pass
-
-        if not ismaster_response:
-            raise ConnectionFailure()
-
-        # Now we can discover the whole replica set.
-        for host in ismaster_response['hosts']:
-            pool = ConnectionPool(host)
-            member = Member(pool)
-            self.members[host] = member
-
-        # Remove down members from dict.
-        for host in self.members.keys():
-            if host not in ismaster_response['hosts']:
-                self.members.pop(host)
-
-        self.primary = ismaster_response.get('primary')
-
-    def send_message(self, message):
-        # Send an 'insert', 'update', or 'delete'
-        # message to the primary.
-        if not self.primary:
-            self.refresh()
-
-        member = self.members[self.primary]
-        pool = member.pool
-        try:
-            send_message_with_pool(message, pool)
-        except socket.error:
-            self.primary = None
-            raise AutoReconnect()
-
- - -

We don't know which members will be available when our application starts, so we pass a "seed list" of hostnames to the MongoReplicaSetClient. In refresh, the client tries them all until it can connect to one and run the isMaster command, which returns information about all the members in the replica set. The client then makes a connection-pool for each member and records which one is the primary.
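
For reference, a trimmed-down ismaster response looks something like this. The hostnames and values here are invented for illustration; the sketch above only relies on the 'hosts' and 'primary' fields:

```python
# Roughly what call_ismaster(seed) hands back for a secondary member.
ismaster_response = {
    'setName': 'foo',
    'ismaster': False,
    'secondary': True,
    'hosts': [
        'host0.example.com:27017',
        'host1.example.com:27017',
        'host2.example.com:27017',
    ],
    'primary': 'host0.example.com:27017',
}
```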

-

Once refresh finishes, the client starts a MonitorThread which calls refresh again every 30 seconds. This ensures that if we add a secondary to the set it will be discovered soon and participate in load-balancing. If a secondary goes down, refresh removes it from self.members. In send_message, if we discover the primary's down, we raise an error and clear self.primary so we'll call refresh the next time send_message runs.

-

The Bugs

- -

PyMongo 2.1 through 2.5 had two classes of concurrency bugs: race conditions and thundering herds.

-

The race condition is easy to see. Look at the expression self.members[self.primary] in send_message. If the monitor thread runs refresh and pops a member from self.members while an application thread is executing the dictionary lookup, the latter could get a KeyError. Indeed, that is exactly the bug report we received that prompted my whole investigation and this blog post.

-

The other bug causes a big waste of effort. Let's say the primary server bursts into flames. The client gets a socket error and clears self.primary. Then a bunch of application threads all call send_message at once. They all find that self.primary is None, and all call refresh. This is a duplication of work that only one thread need do. Depending on how many processes and threads we have, it has the potential to create a connection storm in our replica set as a bunch of heavily-loaded applications lurch to the new primary. It also compounds the race condition because many threads are all modifying the shared state. I'm calling this duplicated work a thundering herd problem, although the official definition of thundering herd is a bit different.

-

Fixing With A Mutex

- -

We know how to fix race conditions: let's add a mutex! We could lock around the whole body of refresh, and lock around the expression self.members[self.primary] in send_message. No thread sees members and primary in a half-updated state.

-

...and why it's not ideal

- -

This solution has two problems. The first is minor: the slight cost of acquiring and releasing a lock for every message sent to MongoDB, especially since it means only one thread can run that section of send_message at a time. A reader-writer lock alleviates the contention by allowing many threads to run send_message as long as no thread is running refresh, in exchange for greater complexity and cost for the single-threaded case.

-

The worse problem is the behavior such a mutex would cause in a very heavily multithreaded application. While one thread is running refresh, all threads running send_message will queue on the mutex. If the load is heavy enough our application could fail while waiting for refresh, or could overwhelm MongoDB once they're all simultaneously unblocked. Better under most circumstances for send_message to fail fast, saying "I don't know who the primary is, and I'm not going to wait for refresh to tell me." Failing fast raises more errors but keeps the queues small.

-

The Wasp's Nest Pattern

- -

There's a better way, one that requires no locks, is less error-prone, and fixes the thundering-herd problem too. Here's what I did for PyMongo 2.5.1, which we'll release next week.

-

First, all information about the replica set's state is pulled out of MongoReplicaSetClient and put into an RSState object:

-
class RSState(object):
-    def __init__(self, members, primary):
-        self.members = members
-        self.primary = primary
-
- - -

MongoReplicaSetClient gets one RSState instance that it puts in self.rsstate. This instance is immutable: no thread is allowed to change the contents, only to make a modified copy. So if the primary goes down, refresh doesn't just set primary to None and pop its hostname from the members dict. Instead, it makes a deep copy of the RSState, and updates the copy. Finally, it replaces the old self.rsstate with the new one.
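
In terms of the sketch above, a refresh that notices the primary is gone might look roughly like this. It's my own illustration of the copy-and-swap step, not PyMongo's actual code:

```python
def refresh(self):
    old_state = self.rsstate          # one local reference to the current state

    # Build a new members dict rather than mutating the old one.
    members = dict(old_state.members)
    members.pop(old_state.primary, None)

    # Swap the whole state in with a single attribute assignment; readers
    # see either the old RSState or the new one, never a half-updated mix.
    self.rsstate = RSState(members, None)
```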

-

Each of the RSState's attributes must be immutable and cloneable, too, which requires a very different mindset. For example, I'd been tracking each member's ping time using a 5-sample moving average and updating it with a new sample like so:

-
class Member(object):
-    def add_sample(self, ping_time):
-        self.samples = self.samples[-4:]
-        self.samples.append(ping_time)
-        self.avg_ping = sum(self.samples) / len(self.samples)
-
- - -

But if Member is immutable, then adding a sample means cloning the whole Member and updating it. Like this:

-
class Member(object):
-    def clone_with_sample(self, ping_time):
-        # Make a new copy of 'samples'
-        samples = self.samples[-4:] + [ping_time]
-        return Member(samples)
-
- - -

Any method that needs to access self.rsstate more than once must protect itself against the state being replaced concurrently. It has to make a local copy of the reference. So the racy expression in send_message becomes:

-
rsstate = self.rsstate  # Copy reference.
-member = rsstate.members[rsstate.primary]
-
- - -

Since the rsstate cannot be modified by another thread, send_message knows its local reference to the state is safe to read.

-

A few summers ago I was on a Zen retreat in a rural house. We had paper wasps building nests under the eaves. The wasps make their paper from a combination of chewed-up plant fiber and saliva. The nest hangs from a single skinny petiole. It's precarious, but it seems to protect the nest from ants who want to crawl in and eat the larvae. The queen periodically spreads an ant-repellant secretion around the petiole; its slenderness conserves her ant-repellant, and concentrates it in a small area.

-

Wasp's Nest in Situ [Source]

-

I think of the RSState like a wasp's nest: it's an intricate structure hanging off the MongoReplicaSetClient by a single attribute, self.rsstate. The slenderness of the connection protects send_message from race conditions, just as the thin petiole protects the nest from ants.

-

Since I was fixing the race condition I fixed the thundering herd as well. Only one thread should run refresh after a primary goes down, and all other threads should benefit from its labor. I nominated the monitor to be that one thread:

-
class MonitorThread(threading.Thread):
-    def __init__(self, client):
-        threading.Thread.__init__(self)
-        self.client = weakref.proxy(client)
-        self.event = threading.Event()
-        self.refreshed = threading.Event()
-
-    def schedule_refresh(self):
-        """Refresh immediately."""
-        self.refreshed.clear()
-        self.event.set()
-
-    def wait_for_refresh(self, timeout_seconds):
-        """Block until refresh completes."""
-        self.refreshed.wait(timeout_seconds)
-
-    def run(self):
-        while True:
-            self.event.wait(timeout=30)
-            self.event.clear()
-
-            try:
-                try:
-                    self.client.refresh()
-                finally:
-                    self.refreshed.set()
-            except AutoReconnect:
-                pass
-            except:
-                # Client was garbage-collected.
-                break
-
- - -

(The weakref proxy prevents a reference cycle and lets the thread die when the client is deleted. The weird try-finally syntax is necessary in Python 2.4.)

-

The monitor normally wakes every 30 seconds to notice changes in the set, like a new secondary being added. If send_message discovers that the primary is gone, it wakes the monitor early by signaling the event it's waiting on:

-
rsstate = self.rsstate
-if not rsstate.primary:
-    self.monitor.schedule_refresh()
-    raise AutoReconnect()
-
- - -

No matter how many threads call schedule_refresh, the work is only done once.

-

Any MongoReplicaSetClient method that needs to block on refresh can wait for the "refreshed" event:

-
rsstate = self.rsstate
-if not rsstate.primary:
-    self.monitor.schedule_refresh()
-    self.monitor.wait_for_refresh(timeout_seconds=5)
-
-# Get the new state.
-rsstate = self.rsstate
-if not rsstate.primary:
-    raise AutoReconnect()
-
-# Proceed normally....
-
- - -

This pattern mitigates the connection storm from a heavily-loaded application discovering that the primary has changed: only the monitor thread goes looking for the new primary. The others can abort or wait.

-

The wasp's nest pattern is a simple and high-performance solution to some varieties of reader-writer problem. Compared to mutexes it's easy to understand, and most importantly it's easy to program correctly. For further reading see my notes in the source code.

-

Paper wasp and nest [Source]

diff --git a/emptysquare/content/yieldpoints-simple-extensions-to-tornado-gen.md b/emptysquare/content/yieldpoints-simple-extensions-to-tornado-gen.md deleted file mode 100644 index ccffdab3..00000000 --- a/emptysquare/content/yieldpoints-simple-extensions-to-tornado-gen.md +++ /dev/null @@ -1,42 +0,0 @@ -+++ -type = "post" -title = "YieldPoints: simple extensions to tornado.gen" -date = "2012-12-07T18:42:19" -description = "I affectionately introduce YieldPoints, my littlest project yet. It's just some simple extensions to Tornado's gen module. The cutest example of what you can do with YieldPoints is the WaitAny class, which lets you begin multiple [ ... ]" -category = ["Programming", "Python"] -tag = ["tornado"] -enable_lightbox = false -thumbnail = "yield.png" -draft = false -disqus_identifier = "50c27e815393745f98527db0" -disqus_url = "https://emptysqua.re/blog/50c27e815393745f98527db0/" -+++ - -

YieldPoints

-

I affectionately introduce YieldPoints, my littlest project yet. It's just some simple extensions to Tornado's gen module.

-

The cutest example of what you can do with YieldPoints is the WaitAny class, which lets you begin multiple asynchronous tasks and handle their results in the order they complete:

-
@gen.engine
-def f():
-    callback0 = yield gen.Callback(0)
-    callback1 = yield gen.Callback(1)
-
-    # Fire callback1 soon, callback0 later
-    IOLoop.instance().add_timeout(
-        timedelta(seconds=0.1), partial(callback1, 'foo'))
-
-    IOLoop.instance().add_timeout(
-        timedelta(seconds=0.2), partial(callback0, 'bar'))
-
-    keys = set([0, 1])
-    while keys:
-        key, result = yield yieldpoints.WaitAny(keys)
-        print 'key:', key, ', result:', result
-        keys.remove(key)
-
- - -

More examples are in the docs: you can use WithTimeout to wrap any callback in a timeout, and use Cancel or CancelAll to decline to wait for a callback you registered earlier. There's an adorable extended example that uses my library to start downloading multiple URLs at once, and process the results in the order received.

-

Further reading:

-

YieldPoints on Read the Docs

-

YieldPoints on Github

-

YieldPoints on PyPI