Thursday, May 28, 2009

Using hooks in Google App Engine

Introduction
As code grows in the wild, it sometimes becomes hard to maintain and modify. Implementing features like profiling datastore usage or adding "audit data" (who changed that row?) throughout the codebase can be tough. How can we be sure that we did not miss any place in our sources? How can we prevent another developer from making errors when later extending the code?

Background
Often, we would like to implement a change in the global behavior of an application that has nothing to do with its business logic. The goal is to make a big change with very little risk and code. We would like to apply a crosscutting concern, such as logging of performance-related data, to an entire application. We would like to achieve this goal globally, no matter what kind of higher-level api (Models versus datastore, Django versus webob) a developer uses.

The natural module to use is api_proxy_map, a low-level registry of RPC services that delegate higher level API calls to the appropriate service implementation. Traditionally, the way of modifying api_proxy_map is to monkeypatch this module, and replace its main entry point apiproxy_stub_map.MakeSyncCall. While this approach will work in some cases, it is not without its dangers. To make unit testing easier, other modules sometimes choose to inject the MakeSyncCall method and store a local reference to it -- this way, a test can easily substitute a mock instead.

The memcache module is a good example how this technique is used. Unfortunately, this also means that whatever memcache-Client is created before we apply our patch will bypass our modifications altogether. That could be for example the case if the memcache module was imported by a script before our patch was applied. Tracking down and fixing all these references is complex, and there is no guarantee that a new version of an SDK (or any other externally developed tool library that we might use for some other purpose) will not introduce new static dependencies later on.

Luckily, there is an alternative to monkeypatching. Version 1.1.8 of the SDK introduces a new mechanism into api_proxy_map: the concept of hooks. Any developer can now define a method with the same signature as MakeSyncCall:

def hook(service, call, request, response): ...and register it with the runtime using one of the following methods:

apiproxy_stub_map.apiproxy.GetPreCallHooks().Append('unique_name', hook, 'optional_api_identifier')apiproxy_stub_map.apiproxy.GetPostCallHooks().Append('unique_name', hook, 'optional_api_identifier')apiproxy_stub_map.apiproxy.GetPreCallHooks().Push('unique_name', hook, 'optional_api_identifier')apiproxy_stub_map.apiproxy.GetPostCallHooks().Push('unique_name', hook, 'optional_api_identifier')There are two different types of hooks. A PreCallHook is executed before an RPC call is made, a PostCallHook is executed after the RPC call. It is also possible to specify an optional api identifier that will make sure that a hook only gets invoked for a particular type of rpc (such as datastore or memcache). It is possible to register more than one hook to the apiproxy: Append will append a new hook to the end of the list of existing hooks, Push will append the hook at the beginning.

Example
Let's take a look at a concrete example and implement some datastore profiling. The idea is to automatically collect statistics for each model type in our database. We'd like separate counters for put/get/query/delete -- but without having to modify the model classes or handlers that access our data. For performance reasons, we decide not to write any of our results to the datastore. Instead, we collect some quick and dirty (not accurate or long-lived) stats in memcache, and put more detailed information into logging statements: each datastore operation creates a log entry which can be retrieved and analyzed offline.

The first step is to create a helper that can collect data in memcache and log more verbose information. It is worth noting that the counts in memcache are only approximations, since we have no way of locking that data and performing increments in a transactional way. That's ok though, since it is only meant as a rough overview, compared to the more detailed data in the logs. The following db_log achieves that objective.

def db_log(model, call, details=''): """Call this method whenever the database is invoked. Args: model: the model name (aka kind) that the operation is on call: the kind of operation (Get/Put/...) details: any text that should be added to the detailed log entry. """ # First, let's update memcache if model: stats = memcache.get('DB_TMP_STATS') if stats is None: stats = {} key = '%s_%s' % (call, model) stats[key] = stats.get(key, 0) + 1 memcache.set('DB_TMP_STATS', stats) # Next, let's log for some more detailed analysis logging.debug('DB_LOG: %s @ %s (%s)', call, model, details)Note: memcache does have transactional capabilities for increasing counts, such as incr() and decr(), but these work on single values and not a dictionary of counters).

Now, let's put that helper to good use and actually write our hook. We use a PreCallHook that evaluates our data before the rpc call is made (a PostCallHook would work just as fine in this particular case, though):

def patch_appengine(): """Apply a hook to app engine that logs db statistics.""" def model_name_from_key(key): return key.path().element_list()[0].type() def hook(service, call, request, response): assert service == 'datastore_v3' if call == 'Put': for entity in request.entity_list(): db_log(model_name_from_key(entity.key()), call) elif call in ('Get', 'Delete'): for key in request.key_list(): db_log(model_name_from_key(key), call) elif call == 'RunQuery': kind = datastore_index.CompositeIndexForQuery(request)[1] db_log(kind, call) else: db_log(None, call) apiproxy_stub_map.apiproxy.GetPreCallHooks().Append( 'db_log', hook, 'datastore_v3')We can also create a very simple script that displays our statistics in a browser:

def main(): """A very simple handler that will print the temporary statistics.""" print 'Content-Type: text/plain' print '' print 'Mini stats' print '----------' stats = memcache.get('DB_TMP_STATS') if stats is None: stats = {} for name, count in sorted(stats.items(), key=operator.itemgetter(0)): print '%s : %s' % (name, count) print '----------' if __name__ == "__main__": main()Most of the effort in the code is used to determine the model name on which we are working. If you are interested in how to get to this data, take a look into the modules entity_pb.py and datastore_pb.py of the SDK. These contain the protocol buffers used for datastore RPCs.

So far, we have created

a helper that compiles datastore information in memcache (and logging)
a hook that executes the helper and
a simple handler to display the results.

All three components can be stored in a single module (let's call it db_log.py). We can activate the hack in our application by adding the following two lines of code to an application:

import db_logdb_log.patch_appengine()This snippet should be placed where it is guaranteed to be executed, no matter what handler gets invoked. For example, if there is a common main method (in case of a django application), adding this before that main method gets defined would be a good location. Also, do not forget to add db_log.py to your app.yaml to get access to the quick-and-dirty memcache stats. You can find the entire source code for this article in the Google App Engine cookbook.

No comments:

Post a Comment