For years now, I have been meaning to write a series of articles on the automated crash reporting system in the IMVU client. This first article will give a bit of background on the structure of the client and show how we handle Python exceptions.
At IMVU, we generally subscribe to the Fail Fast philosophy of handling errors: when the client encounters an unexpected error, we immediately crash the program and ask the user to submit a crash report. As part of the crash report, we send log files, stack traces, system information, and anything else that might help us debug the failure.
You might wonder why we crash the program whenever anything goes wrong rather than trying to catch the error and continue running. Counterintuitively, crashing the program forces us to act on crashes and immediately exposes bugs that might trigger unwanted behavior or lost data down the road.
Now let's talk a little bit about how the client is structured. The IMVU client is written primarily in Python, with time-critical components such as the 3D renderer written in C++. Since the client is a cross between a normal interactive Windows program and a real-time game, the main loop looks something like this:
def main(): while running: pumpWindowsMessages() # for 1/30th of a second updateAnimations() redrawWindows()
This structure assumes that no exceptions bubble into or out of the main loop. Let's imagine that updateAnimations() has a bug and occasionally raises an uncaught exception. If running the client with a standard command-line python invocation, the program would print the exception and stack trace to the console window and exit. That's all great, but our users don't launch our client by invoking python from the command line: we use py2exe to build a standalone executable that users ultimately run. With an unmodified py2exe application, uncaught exceptions are printed to sys.stderr (as above), except there is no console window to display the error. Thus, the py2exe bootstrap code registers a handler so that errors are logged to a file, and when the program shuts down, a dialog box shows something like "An error has been logged. Please see IMVUClient.exe.log."
From a crash reporting standpoint, this is not good enough. We can't be asking our users to manually hunt down some log files on their hard drives and mail them to us. It's just too much work - they will simply stop using our product. (Unfortunately, most of the software out there asks users to do exactly this!) We need a way for the client to automatically handle errors and prompt the users to submit the reports back to us. So let's rejigger main() a bit:
def mainLoop(): while running: pumpWindowsMessages() updateAnimations() redrawWindows() def main(): try: mainLoop() except: error_information = sys.exc_info() if OK == askUserForPermission(): submitError(error_information)
This time, if a bug in updateAnimations() raises an exception, the top-level try: except: clause catches the error and handles it intelligently. In our current implementation, we post the error report to a Bugzilla instance, where we have built custom tools to analyze and prioritize the failures in the field.
This is the main gist of how the IMVU client automatically reports failures. The next post in this series will cover automatic detection of errors in our C++ libraries.
The IMVU client has a platform for developing asynchronous or long-running operations that we call the task system. The following describes why we built it. (Chris finally convinced me to post this information to the web.)
Most programs that we write can be split into two categories. There are short-lived programs like gzip that take command line arguments and files and inputs and output to other files or stdout.
There are also longer-lived programs that run indefinitely, handling events that happen to it. Think about MySQL or memcached. It takes inputs from the network, time, and files; doing work and spitting output back onto the network or the file system.
The operating system kernel is one of those event-driven, long-lived programs. It schedules the execution of subsidiary short- and long-lived programs.
We can certainly map from one type another. Take PHP for example. It is running within the context of a long-running server process, but the programming model for a single page is a lot closer to gzip. The program takes inputs (session data, request variables) and outputs HTML.
You can also think of a web service as a set of computers all running a bunch of short-lived processes that combine into one large program. PHP is the CPU, memcached is memory, the database is the file system, etc.
Desktop applications, including the IMVU client and other "rich internet clients" are structured a lot like servers: they are long-lived programs that respond to events from the keyboard, mouse, time, the windowing system, network requests... and interact with its users via graphics, sound, files, and the network.
The key notion here is that these are abstractions on top of an enormous state machine known as your processors and memory.
Now let's talk a little bit about concepts and modeling. When you come into a code base, say, our client, you expect to see structures that look and behave like concepts you know: avatars, chat windows, buddy list, inventory, products from the catalog, etc. This is one reason OOP is so powerful - it's an effective method for describing concepts and the way they interact.
Now let's talk about a simpler concept. One so ingrained you probably don't even think about it. Functions (in the imperative sense). Functions are an abstraction on top of the state machine that is your processor. We're all very familiar with functions: "This operation takes X and Y as inputs, does steps A, B, and C, and returns Z." Even if you throw in a few branches and loops and call some other functions, we're good at following the logic. For example:
def _readNumber(self):
isfloat = False
result = self._next()
peek = self._peek()
while peek is not None and (peek.isdigit() or peek == "."):
isfloat = isfloat or peek == "."
result = result + self._next()
peek = self._peek()
try:
if isfloat:
return float(result)
else:
return int(result)
except ValueError:
raise ReadException, "Not a valid JSON number: '%s'" % result
This a fairly elaborate function, but we have no problem following it. Still, it's an abstraction on top of the processor and memory: it's compiled by Python into bytecode, executed by the interpreter loop, which is written in C, which is compiled to x86 assembly, which is being converted by the processor into micro ops... The point here is that we can build our own abstractions if they make us more effective at doing our jobs.
There is a concept in the IMVU client that may be unfamiliar to some of you: long-running processes. These are like functions, except that many can be running at the same time, and they can be blocked, waiting for results from the network or disk. Or perhaps just waiting for some time to elapse. Pseudocode for one of these processes might look like this:
def loadAvatar(chatId, userId):
avatarInfo = getAvatarInfo(userId)
if failure, show dialog box and stop loading
products = getCurrentOutfit(chatId, userId)
for p in products:
loadProduct(p)
loadAvatarIntoScene(products) # this might take a while
In fact, these processes can be fairly easily implemented on top of threads, and many people do it that way, to varying degrees of success.
However, threads have several disadvantages:
Another way to manage these types of long-running operations is by building an explicit state machine and periodically pumping it, when it will check and update its state if possible. This is generally how interactive games work, because they can get away with using up all of the CPU and having a fixed frame rate. Much of the code in the IMVU client uses this model, but we've found it 1) hard to understand the interactions between states and 2) inefficient. Most of the time you're _not_ updating the state of a particular object, so you don't need to check every single frame. For example, some of our most avid users have thousands of items in their inventory, and our old inventory UI walks that entire list 30 times per second, asking “does this one have any events to process? does this one have any events to process? does this one have any events to process?” This takes up to 10% of the CPU on the user's computer without a 3D window open! Finally, this approach generally makes the application feel sluggish, because the latency between an event (say, a mouse click) being sent to the client and it being processed depends on the rate at which the application is pumped, rather than having the client immediately handle messages as they come in.
A third technique you might call continuation-passing style or "onComplete callbacks". Basically, any function that can be asynchronous takes an onComplete callback function that it runs (with the result or error) when the operation is completed. The problem with this style is that hearkens back to the days of goto with spaghetti callback loops and non-linear control flow. Moreover, if you accidentally forget to call an onComplete, or you call it twice, you've just introduced a very hard-to-find bug. We have had both of these types of bugs multiple times in our client.
We did not find any of the above approaches particularly compelling, so we built the task system on top of Python's bidirectional generators. First, I'll enumerate the advantages:
Here's what the hypothetical example from above would look like as a task:
@task
def loadAvatar(chatId, userId):
try:
avatarInfo = yield getAvatarInfo(userId)
except Exception:
showErrorDialog()
return
products = yield getCurrentOutfit(chatId, userId)
for p in products:
try:
yield loadProduct(p)
except Exception:
pass # ignore products that fail to load?
yield loadAvatarIntoScene(products)
At every 'yield', other tasks get a chance to run, and if that task does not return immediately, then the task is paused until the asynchronous call is done. getAvatarInfo, for example, will contact the server in order to fetch the avatar name. Tasks can yield on other task calls, suspending their execution until the subsidiary call returns a value or raises an exception. (Sound familiar?)
Tasks can start other tasks, with the admittedly strange “yield Start(task)” syntax which gives back a Future object. Think of futures as values that may not have materialized yet. You can later wait for the result of a future by yielding on it.
@task
def printFuture(future):
# this will throw an exception if the future
# contains an error
result = yield future
print result
@task
def printAvatarInfo():
avatarInfo = yield Start(getAvatarInfo(userId))
print 'started fetching avatarInfo'
yield printFuture(avatarInfo)
You can use futures to start several tasks in parallel and then wait on all of them, making efficient use of parallel resources. Some examples with real code might help. Unfortunately you can't run them, as we have not (yet) open sourced the task system.
Today, much of the client's code uses threads, polling, and onCompletes, but we're slowly converting things to tasks, usually to fix bugs. I hope this means that the quality of the client from a user's perspective will continue to improve and engineers will be able to work in the code more easily.