Start now
Apr 18, 2012 hacks.bluesmoon.info
boomerang always comes back, except when it hits something.
boomerang is a piece of javascript that you add to your web pages, where it measures the performance of your website from your end user's point of view. It has the ability to send this data back to your server for further analysis. With boomerang, you find out exactly how fast your users think your site is.
boomerang is opensource and released under the BSD license, and we have a whole bunch of documentation about it.
boomerang comes to you from the Exceptional Performance team at Yahoo!, aided by the Yahoo! Developer Network.
Get the code from github.com/yahoo/boomerang.
Apr 18, 2012 codeascraft.etsy.com
Posted by Ian Malpass | Filed under data, engineering, infrastructure
If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it. In general, we tend to measure at three levels: network, machine, and application. (You can read more about our graphs in Mike’s Tracking Every Release post.)
Application metrics are usually the hardest, yet most important, of the three. They’re very specific to your business, and they change as your applications change (and Etsy changes a lot). Instead of trying to plan out everything we wanted to measure and putting it in a classical configuration management system, we decided to make it ridiculously simple for any engineer to get anything they can count or time into a graph with almost no effort. (And, because we can push code anytime, anywhere, it’s easy to deploy the code too, so we can go from “how often does X happen?” to a graph of X happening in about half an hour, if we want to.)
StatsD is a simple NodeJS daemon (and by “simple” I really mean simple — NodeJS makes event-based systems like this ridiculously easy to write) that listens for messages on a UDP port. (See Flickr’s “Counting & Timing” for a previous description and implementation of this idea, and check out the open-sourced code on github to see our version.) It parses the messages, extracts metrics data, and periodically flushes the data to graphite.
We like graphite for a number of reasons: it’s very easy to use, and has very powerful graphing and data manipulation capabilities. We can combine data from StatsD with data from our other metrics-gathering systems. Most importantly for StatsD, you can create new metrics in graphite just by sending it data for that metric. That means there’s no management overhead for engineers to start tracking something new: simply tell StatsD you want to track “grue.dinners” and it’ll automagically appear in graphite. (By the way, because we flush data to graphite every 10 seconds, our StatsD metrics are near-realtime.)
Not only is it super easy to start capturing the rate or speed of something, but it’s very easy to view, share, and brag about them.
So, why do we use UDP to send data to StatsD? Well, it’s fast — you don’t want to slow your application down in order to track its performance — but also sending a UDP packet is fire-and-forget. Either StatsD gets the data, or it doesn’t. The application doesn’t care if StatsD is up, down, or on fire; it simply trusts that things will work. If they don’t, our stats go a bit wonky, but the site stays up. Because we also worship at the Church of Uptime, this is quite alright. (The Church of Graphs makes sure we graph UDP packet receipt failures though, which the kernel usefully provides.)
Here’s how we do it using our PHP StatsD library:
StatsD::increment("grue.dinners");
That’s it. That line of code will create a new counter on the fly and increment it every time it’s executed. You can then go look at your graph and bask in the awesomeness, or for that matter, spot someone up to no good in the middle of the night:

We can use graphite’s data-processing tools to take the the data above and make a graph that highlights deviations from the norm:

(We sometimes use the “rawData=true” option in graphite to get a stream of numbers that can feed into automatic monitoring systems. Graphs like this are very “monitorable.”)
We don’t just track trivial things like how many people are signing into the site — we also track really important stuff, like how much coffee is left in the kitchen:

In addition to plain counters, we can track times too:
$start = microtime(true);
eat_adventurer();
StatsD::timing("grue.dinners", (microtime(true) - $start) * 1000);
StatsD automatically tracks the count, mean, maximum, minimum, and 90th percentile times (which is a good measure of “normal” maximum values, ignoring outliers). Here, we’re measuring the execution times of part of our search infrastructure:

One thing we found early on is that if we want to track something that happens really, really frequently, we can start to overwhelm StatsD with UDP packets. To cope with that, we added the option to sample data, i.e. to only send packets a certain percentage of the time. For very frequent events, this still gives you a statistically accurate view of activity.
To record only one in ten events:
StatsD::increment(“adventurer.heartbeat”, 0.1);
What’s important here is that the packet sent to StatsD includes the sample rate, and so StatsD then multiplies the numbers to give an estimate of a 100% sample rate before it sends the data on to graphite. This means we can adjust the sample rate at will without having to deal with rescaling the y-axis of the resulting graph.
We’ve found that tracking everything is key to moving fast, but the only way to do it is to make tracking anything easy. Using StatsD, we enable engineers to track what they need to track, at the drop of a hat, without requiring time-sucking configuration changes or complicated processes.
Try StatsD for yourself: grab the open-sourced code from github and start measuring. We’d love to hear what you think of it.
At Memonic we rely on MySQL for most of our data storage. In any relational database system the correct creation of indices is important, otherwise queries will be inefficient and slow. The problem with that is, that the indices often are forgotten, especially when updating an existing query. As a tool to detect queries without proper indices MySQL offers the slow query log. All queries that take more than a certain time are logged there.
We host our platform in Amazon’s cloud. For database we rely on their on their Relational Database Service (RDS) service. As we don’t have root access to those machines we can’t just tail the slow log to see what’s up. Instead RDS optionally writes the slow log into a special system table. From there a query can be used to retrieve the data. See the Amazon RDS FAQ about how to configure the slow log on RDS.
For automated analysis of the slow logs we like to use mk-query-digest. This excellent utility groups all logged queries together by their general type and thus allows a big-picture overview. As an example take these three queries that may have been logged:
SELECT * FROM events WHERE user_id = 'foo';
SELECT * FROM data WHERE created_at >= '2011-12-13 10:12:00';
SELECT * FROM events WHERE user_id = 'bar';
These will be grouped together by mk-query-digest as just two queries:
SELECT * FROM events WHERE user_id = '?';
SELECT * FROM data WHERE created_at >= '?';
This is accompanied with how often each query type was executed, how long it took in total, etc. This is a great way to focus any optimization effort first on the queries that are actually used a lot.
Unfortunately mk-query-digest only works with the normal MySQL slow query log format and can’t access the proprietary table that Amazon RDS keeps. To work around this, we wrote the db2log.py script which we hereby release into the public domain.
#!/usr/bin/env python"""Queries the slowlog database table maintained by Amazon RDS and outputs it inthe normal MySQL slow log text format."""
import _mysql
db = _mysql.connect(db="mysql", read_default_file="/root/.my.cnf")db.query("""SELECT * FROM slow_log ORDER BY start_time""")r = db.use_result()
print """/usr/sbin/mysqld, Version: 5.1.49-3-log ((Debian)). started with:Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sockTime Id Command Argument"""
while True: results = r.fetch_row(maxrows=100, how=1) if not results: break
for row in results: row['year'] = row['start_time'][2:4] row['month'] = row['start_time'][5:7] row['day'] = row['start_time'][8:10] row['time'] = row['start_time'][11:]
hours = int(row['query_time'][0:2]) minutes = int(row['query_time'][3:5]) seconds = int(row['query_time'][6:8]) row['query_time_f'] = hours * 3600 + minutes * 60 + seconds
hours = int(row['lock_time'][0:2]) minutes = int(row['lock_time'][3:5]) seconds = int(row['lock_time'][6:8]) row['lock_time_f'] = hours * 3600 + minutes * 60 + seconds
if not row['sql_text'].endswith(';'): row['sql_text'] += ';'
print '# Time: {year}{month}{day} {time}'.format(**row) print '# User@Host: {user_host}'.format(**row) print '# Query_time: {query_time_f} Lock_time: {lock_time_f} Rows_sent: {rows_sent} Rows_examined: {rows_examined}'.format(**row) print 'use {db};'.format(**row) print row['sql_text']
It simply dumps the current contents of the RDS slow_log table in the same format that MySQL usually uses for their slow log file. This output can then be piped into mk-query-digest to generate a report.
We ended up doing just that in a daily cron job which sends a mail to our developers.
#!/bin/bash(/usr/local/bin/db2log | \ mk-query-digest --fingerprints \ --filter '$event->{user} !~ m/^(bi|memonic)$/') 2>&1 | \ mail -s "MySQL slow logs" root
# Rotate slow logs. Will move them into the backup table slow_log_backup. If# that table exists it's overwritten with the primary slow log.# So with this strategy we can still access yesterday's slow log by querying# slow_log_backup.mysql mysql -e 'CALL rds_rotate_slow_log'There is one line which needs further explanation: the filter. We filter out any slow log events that were triggered by either the bi or the memonic users. The former is used for asynchronous generation of some statistics and performance isn’t required for that. The latter we use for ad-hoc queries which we don’t need to optimize.
So there you have it: an automated mechanism to analyze slow MySQL queries. From time to time when deploying a new release a new slow query may pop up. But the next day we are informed about it and can fix the issue.
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
The Rationale page on the wiki explains what Storm is and why it was built. The slides from Storm's launch presentation also provide a good introduction to the project.
Documentation and tutorials can be found on the Storm wiki.
Feel free to ask questions on Storm's mailing list: http://groups.google.com/group/storm-user
You can also come to the #storm-user room on freenode. You can usually find a Storm developer there to help you out.
The use and distribution terms for this software are covered by the Eclipse Public License 1.0 (http://opensource.org/licenses/eclipse-1.0.php) which can be found in the file LICENSE.html at the root of this distribution. By using this software in any fashion, you are agreeing to be bound by the terms of this license. You must not remove this notice, or any other, from this software.
GOOGLE-WIDE PROFILING (GWP), A CONTINUOUS PROFILING INFRASTRUCTURE FOR
DATA CENTERS, PROVIDES PERFORMANCE INSIGHTS FOR CLOUD APPLICATIONS. WITH
NEGLIGIBLE OVERHEAD, GWP PROVIDES STABLE, ACCURATE PROFILES AND A
DATACENTER-SCALE TOOL FOR TRADITIONAL PERFORMANCE ANALYSES. FURTHERMORE,
GWP INTRODUCES NOVEL APPLICATIONS OF ITS PROFILES, SUCH AS APPLICATIONPLATFORM
AFFINITY MEASUREMENTS AND IDENTIFICATION OF PLATFORM-SPECIFIC,
MICROARCHITECTURAL PECULIARITIES.
| Google-Wide Profiling.pdf | 3.21 MB |