Tools vs. Automation



Sysadmins talk a lot about "automation" but I think a more specific definition is needed.

"Tool writing" is when we create a program (script, whatever) that takes a task that we do and does it better/faster/more accurately. For example, creating a new account used to take 10 or more manual steps (creating the homedir, setting permissions, adding a line to /etc/passwd, /etc/group, etc.). Good examples include FreeBSD's "pw adduser" and Linux's "useradd". In short, a tool improves our ability to do a task.

"Automation" is when we create a system that eliminates a task. Continuing with our example, if we "automate" account management we might build a system that polls our HR database and creates an account for any new employee and suspends accounts for anyone terminated. This eliminates our need to create/delete accounts completely.
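
The polling approach described above can be sketched in a few lines of Python. This is a hypothetical illustration (the record fields, function name, and account store are all invented, not taken from any real system):

```python
# Hypothetical sketch of HR-driven account automation. Record shapes
# and names are invented for illustration only.

def reconcile(hr_records, existing_accounts):
    """Compare the HR database against existing accounts and decide
    which accounts to create and which to suspend."""
    active = {r["username"] for r in hr_records if not r["terminated"]}
    to_create = sorted(active - set(existing_accounts))
    to_suspend = sorted(set(existing_accounts) - active)
    return to_create, to_suspend

hr = [
    {"username": "alice", "terminated": False},
    {"username": "bob", "terminated": True},
]
create, suspend = reconcile(hr, ["bob", "carol"])
# create == ["alice"], suspend == ["bob", "carol"]
```

A real system would still call an account-creation tool for each name in `create`; the point is that no human runs it.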

A conflict arises when sysadmins on a team are used to using tools and someone creates automation. The people used to tools create accounts the old way, which confuses the automation: it might delete an account because it doesn't see it in the HR database. Or new features might be added to the automation without being communicated to the system administrators. For example, the automation might be extended to create a default WWW homepage for new users; the sysadmins who work around the automation may not be aware of this, and the new users they create "on the side" find themselves without the internal home page that other new users receive.

While I encourage the creation of tools to make sysadmins' tasks easier, the creation of systems that eliminate tasks is much more important. While automating our tasks often involves creating tools, writing tools does not automate our work.

There's a difference.

Posted by Tom Limoncelli at November 10, 2011 10:30 AM

What's coming in Graylog2 v0.9.6 | Lennart Koopmann



I'm in the final steps of preparing the Graylog2 v0.9.6 beta release these days: only a few tickets are left for the server and web interface.

I’d like to take some time and give you an overview of what is changing and coming in this next version of Graylog2.

ElasticSearch is the new message storage

MongoDB has been dropped as the message storage. It will stay for message counts (see faster graphs), settings, and health values, but no longer for storing the actual messages. The reason for this is performance problems when storing a lot of messages: to get good speed, MongoDB would have to keep all the messages in memory. ElasticSearch offers fast reads and real full-text search features. Future versions of Graylog2 will make use of the full-text search features of ElasticSearch; in 0.9.6 you will only get a huge performance increase. MongoDB is still great for storing the other stuff, but using it for the log messages seems to have been a mistake.

Faster graphs

In prior Graylog2 releases the graphs (like the analytics graph and the small stream graphs) were generated (and cached) from actual counts against the message collection. This got really slow if you had a high message throughput. From 0.9.6 on, the server will count and store per-minute message counts in MongoDB. This is not only much faster and less I/O intensive but also more user friendly: because it is independent from the message storage, you can draw graphs over time periods that are no longer in ElasticSearch. You will be able to keep only the messages of the last two months, but draw graphs over years. The UI for this will also change to allow easy drawing of long-term graphs.
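
The bucketing idea behind the per-minute counts can be sketched like this in Python (the function names are mine, and the MongoDB write itself is omitted; this is an illustration, not Graylog2's actual code):

```python
# Sketch of per-minute message counting: truncate each message's
# timestamp to its minute, then count messages per bucket.
from collections import Counter

def minute_bucket(unix_ts):
    """Truncate a Unix timestamp to the start of its minute."""
    return unix_ts - (unix_ts % 60)

def count_per_minute(message_timestamps):
    return dict(Counter(minute_bucket(t) for t in message_timestamps))

# Three messages in one minute, one in the next:
counts = count_per_minute([120, 130, 179, 185])
# counts == {120: 3, 180: 1}
```

Drawing a graph then only needs these small per-minute documents, never the raw messages.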

The Analytics Shell

See this blog post for an explanation and the screencast of the new Analytics Shell:

Also check out the wiki page explaining the shell:

Note that the screencast shows an early version of the shell. Count, distinct, and distribution queries, for example, are now displayed in the shell itself, not below it. You can also use stream names instead of their IDs for stream selectors.

New stream filter rules

There are some new stream filter rules like filename/line, regex host, full message and an “or higher” option for severities.

Hostgroups are dead

The hostgroups functionality has been removed. Read this explanation blog post for more information.

Bugfixes and improvements

This release also brings a lot of bugfixes and improvements. There were some bugs in previous versions that could have been avoided - sorry about that. To avoid this in the future, there will be a beta release and an extended testing phase before each release. Expect UX improvements like the one for empty streams: previously, a newly created stream had no rules and matched all incoming messages. From now on, streams with no rules catch no messages. Streams are also disabled until you enable them - for example, after fully configuring the stream rules and alarms.
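
The new empty-stream semantics might look roughly like this; the stream and rule shapes are invented for illustration and this is not Graylog2's actual code:

```python
# Sketch of the new stream semantics: a disabled stream matches
# nothing, and a stream with no rules matches nothing either.

def stream_matches(stream, message):
    if not stream["enabled"]:
        return False
    if not stream["rules"]:
        return False  # previously an empty rule set matched *every* message
    return all(rule(message) for rule in stream["rules"])

errors = {"enabled": True, "rules": [lambda m: m["severity"] <= 3]}
empty = {"enabled": True, "rules": []}
```

With this, a half-configured stream can no longer flood itself with every incoming message.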

All in all you can expect a double-awesome version 0.9.6 of Graylog2 that focuses on performance for huge amounts of log messages and long term archiving.

There will be a beta version very soon! A preview version with working ElasticSearch integration is already available for download.

Subscribe to this blog, the @graylog2 Twitter stream or the mailing list to stay up to date!


Nagios Exchange - check_jungledisk

Tested with Nagios Core 3.x using the embedded Perl interpreter, but it should work on all versions. BYO XML::Simple (the plugin requires the XML::Simple Perl module).

usage: check_jungledisk -H hostname -f atom_feed_url [-s stale_hours]

Checks the health of your most recent Jungle Disk backup for the specified hostname by parsing the atom feed. You can specify either just a hostname or the FQDN that appears in Jungle Disk.

The URL should be something like:
NOTE: Be sure to use quotes around the URL, otherwise the last '&' character will probably instruct your shell to run this process in the background.

By default, this plugin will warn if the most recent backup is more than 720 hours (30 days) old.
Change this default using the -s option.

In this example, jungledisk warns if backups are more than 24 hours old:
check_jungledisk -H myhost -f feed-URL -s 24
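
The staleness logic the plugin implements (in Perl) can be sketched in Python; the function name and structure here are illustrative, not the plugin's actual code:

```python
# Sketch of the staleness check: given the timestamp of the most
# recent backup, return a Nagios-style status code. The default 720
# hours (30 days) matches the plugin's documented behavior.
OK, WARNING = 0, 1

def backup_status(last_backup_ts, now_ts, stale_hours=720):
    age_hours = (now_ts - last_backup_ts) / 3600.0
    return WARNING if age_hours > stale_hours else OK

# A backup 25 hours old with -s 24 warns:
status = backup_status(last_backup_ts=0, now_ts=25 * 3600, stale_hours=24)
# status == WARNING
```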

initial version: 1 Dec 2010 by Arya Abdian
Thanks to Intunity Pty Ltd for permitting release.

2011-02-01 v1.1 Looks like JungleDisk made a small change to their atom feed overnight. Previous version will no longer work, please upgrade to this release.
2011-02-03 v1.2 More unannounced changes by JungleDisk.
1) feed->entry->id now matches
2) feed->updated now matches format YYYY-mm-ddTHH:MM:SS.0000000Z; this is not natively supported by GNU date, so the bullet has been bitten and all date/time processing is now handled within Perl, improving portability.
3) feed->entry->link->"URL" is now feed->entry->link->href->"URL"
4) feed->updated is reporting a timestamp in the future. Bug raised [and fixed within 24 hours! Go JungleDisk!]
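
For comparison, the same seven-fractional-digit timestamp format can be handled in Python with a small workaround, since `strptime`'s `%f` accepts at most six digits (the plugin itself does this in Perl; this sketch is just an illustration):

```python
# Parse "YYYY-mm-ddTHH:MM:SS.0000000Z" by trimming the fractional
# part to microseconds before handing it to strptime.
from datetime import datetime, timezone

def parse_feed_ts(ts):
    base, frac = ts.rstrip("Z").split(".")
    dt = datetime.strptime(f"{base}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")
    return dt.replace(tzinfo=timezone.utc)

dt = parse_feed_ts("2011-02-03T12:34:56.0000000Z")
# dt is 2011-02-03 12:34:56 UTC
```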

The Virtues of Monitoring



Posted on 05 Jan 2011 by Mathias Meyer

Over the last year I haven't only grown very fond of coffee, but also of infrastructure. Working on Scalarium has been a fun ride so far, for all kinds of reasons; one of them is dealing so much with infrastructure. Being an infrastructure platform provider, what can you do, right?

Being responsible for deployment, performance tuning, and monitoring, infrastructure has always been a part of my job, so I thought it was about time to spread some of my daily ops thoughts across a couple of articles. The simple reason: no matter how much you try, no matter how far away from dealing with servers you get (think Heroku), there will always be infrastructure, and it will always affect you and your application in some way.

On today's menu: monitoring. People have all kinds of different meanings for monitoring, and they're all right, because there is no one way to monitor your applications and infrastructure. I just did a recount, and there are no fewer than six levels of detail you can and probably should get. Note that these are my definitions; they don't necessarily match official names, they're solely based on my experience. Let's start from the top, the outside view of your application.

Availability Level

Availability is a simple measure to the user, either your site is available or it's not. There is nothing in between. When it's slow, it's not available. It's a beautifully binary measure really. From your point of view, any component or layer in your infrastructure could be the problem. The art is to quickly find out which one it is.

So how do you notice when your site is not available? Waiting for your users to tell you is an option, but generally a pretty embarrassing one. Instead you generally start polling some part of your site that's representative of it as a whole. When that particular page is not available, your whole application may as well not be.

What that page should do is get a quick measure of the most important components of your site, check if they're available (maybe even with a timeout involved so you get an idea if a specific component is broken), and return the result. An external process can then monitor that page and notify you when it doesn't return the expected result. Make sure the page does a bit more than just return "OK": if it doesn't hit any of the major components in your stack, there's a chance you won't notice that e.g. your database is becoming unavailable.
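
A minimal sketch of such a health-check aggregator, with invented component names and stand-in checks (a real page would ping the database, cache, and so on, each wrapped in a timeout):

```python
# Probe each major component and aggregate the results; a failing or
# raising check marks the component as down without taking the whole
# page with it.

def health(checks, run_check):
    """checks: {name: callable}; run_check wraps each call (e.g. with
    a timeout) so one hung component can't hang the whole page."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(run_check(check))
        except Exception:
            results[name] = False
    ok = all(results.values())
    return ("OK" if ok else "FAIL"), results

status, details = health(
    {"database": lambda: True, "cache": lambda: False},
    run_check=lambda c: c(),
)
# status == "FAIL", details == {"database": True, "cache": False}
```

An external poller then only needs to compare the page's body against the expected "OK".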

You should run this process from a different host, but what do you do if that host is not available? Even as an infrastructure provider I like outsourcing parts of my own infrastructure. Here's where Pingdom comes into play. They can monitor a specific URL, TCP ports and whatnot from some two dozen locations across the planet and they randomly go through all of them, notifying you when your site is unavailable or the result doesn't match the expectations.


Business Level

These aren't necessarily metrics related to your application's or infrastructure's availability, they're more along the lines of what your users are doing right now, or have done over the last month. Think number of new users per day, number of sales in the last hour, or, in our case, number of EC2 instances running at any minute. Stuff like Google Analytics or click paths (using tools like Hummingbird, for example) in general also fall into this category.

These kinds of metrics may be more important to your business than to your infrastructure, but they're important nonetheless, and they could e.g. be integrated with another metrics collection tool, some of which we'll get to in a minute. Depending on what kind of data you're gathering, they're also useful for analyzing spikes in your application's performance.

This kind of data can be hard to track in a generic way. Usually it's up to your application to gather it and turn it into a format that another tool can collect. It's also usually very specific to your application and its business model.

Application Level

Digging deeper from the outsider's view, you want to be able to track what's going on inside of your application right now. What are the main entry points, what are the database queries involved, where are the hot spots, which queries are slow, what kinds of errors are being caused by your application, to name a few.

This will give you an overview of the innards of your code, and it's simply invaluable to have that kind of insight. You usually don't need much historical data in this area, just a couple of days worth will usually be enough to analyze problems in retrospect. It can't hurt to keep them around though, because growth also shows trends in potential application code hot spots or database queries getting slower over time.

To get an inside view of your application, services like New Relic exist. While their services aren't exactly cheap (most monitoring services aren't, no surprise here), they're invaluable. You can dig down from the Rails controller level to find the method calls and database queries that are slowest at a given moment in time (most likely you'll be wanting to check the data for the last hours to analyze an incident), digging deeper into other metrics from there. Here's an example of what it looks like.

New Relic

You can also use the Rails log file and tools like Request-log-analyzer. They can help you get started for free, but don't expect the same fine-grained level of detail you get with New Relic. However, with Rails 3 it's become a lot easier to instrument code that's interesting to you and gather data on the runtimes of specific methods yourself.

Other means are e.g. JMX, one of the neat features you get when using a JVM-based language like JRuby. Your application can continuously collect and expose metrics through a defined interface, to be inspected or gathered by other means. JMX can even be used to call into your application from the outside, without having to go through a web interface.

Application level monitoring also includes exception reporting. Services like Exceptional or Hoptoad are probably the most well known in that area, though in higher price regions New Relic also includes exception reporting.

Process Level

Going deeper (closer to inception than you think) from the application level we reach the processes that serve your application. Application servers, databases, web servers, background processing, they all need a process to be available.

But processes crash. It's a bitter and harsh truth, but they do, for whatever reason, maybe they consumed too many resources, causing the machine to swap or the process to simply crash because the machine doesn't have any memory left to allocate. Think of a memory leaking Rails application server process or the last time you used RMagick.

Someone must ensure that the processes keep running and that they don't consume more resources than they're allowed to, to ensure availability on that level. These tools are called supervisors. Give them a pid file and a process, and they'll make sure it keeps running. Whether a process is considered healthy can depend on multiple metrics: availability over the network, a file size (think log files), or simply the existence of the process. They also allow you to set some sort of grace period, so they'll retry a number of times with a timeout before actually restarting the process or giving up monitoring it altogether.
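
The grace-period behavior can be sketched as a small state machine; this is an illustration of the general idea, not how Monit, God, or Bluepill actually implement it:

```python
# Only restart after a process has failed its health check several
# times in a row; a single passing check resets the counter.

def supervise_step(consecutive_failures, check_passed, max_retries=3):
    """Returns (new_failure_count, should_restart)."""
    if check_passed:
        return 0, False
    failures = consecutive_failures + 1
    return (0, True) if failures >= max_retries else (failures, False)

state = 0
for passed in [False, False, False]:
    state, restart = supervise_step(state, passed)
# after three straight failures, restart == True
```

The retry window is what keeps a briefly slow process from being kill-cycled for no reason.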

A good supervisor will also let you alert someone when the expected conditions move outside of their acceptable perimeter and a process had to be restarted. A classic in this area is Monit, but people also like God and Bluepill. On a lower level you have tools like runit or upstart, but their capabilities are usually built around a pid file and a process, without going up to the level of checking system resources.

While I don't find the syntax of Monit's configuration very aesthetically pleasing, it's proven to be reliable and has a very small footprint on the system, so it's the default on our own infrastructure, and we add it to most of our cookbooks for Scalarium, as it's installed on all managed instances anyway. It's a matter of preference.

Infrastructure/Server Level

Another step down from processes we reach the system itself. CPU and memory usage, load average, disk I/O, network traffic, are all traditional metrics collected on this level. The tools (both commercial and open source) in this area can't be counted. In the open source world, the main means to visualize these kinds of metrics is rrdtool. Many tools use it to graph data and to keep an aggregated data history around, using averages for hours, days or weeks to store the data efficiently.

This data is very important in several ways. For one, it will show you what your servers are doing right now, or in the last couple of minutes, which is usually enough to notice a problem. Second, the data collected is very useful to discover trends, e.g. memory usage increasing over time, swap usage increasing, or a partition running out of disk space. Any value constantly increasing over time is a good sign that you'll hit a wall at some point. Noticing trends will usually give you a good indication that something needs to be changed in your infrastructure or your application.


There's a countless number of tools in this area: Munin (see screenshot), Nagios, Ganglia, and collectd on the open source end; CloudKick, Circonus, Server Density, and Scout on the paid service level; and an abundance of commercial tools on the very expensive end of server monitoring. I never really bother with the commercial ones, because I either resort to the open source tools or pay someone to take care of the monitoring and alerting for me on a service basis. Most of these tools will run some sort of agent on every system, collecting data in a predefined cycle and delivering it to a master process, or with the master process picking up the data from the agents.

Again, it's a matter of taste. Most of the open source tools available tend to look pretty ugly on the UI side, but if the data and the graphs are all that matters to you, they'll do just fine. We do our own server monitoring using Server Density, but on Scalarium we resort to using Ganglia as an integrated default, because it's much more cost effective for our users, and given the elastic nature of EC2 it's much easier for us to add and remove instances as they come and go. In general I'm also a fan of Munin.

Most of them come with some sort of alerting that allows you to define thresholds which trigger the alerts. You'll never get the thresholds right the first time you configure them, so constantly keep an eye on them to get a picture of which thresholds are normal and which indeed indicate problem areas that require an alert to be triggered.
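
Threshold evaluation itself is simple; the art is picking the numbers. A sketch with made-up values:

```python
# Classic two-level threshold check, as most alerting tools do it.

def evaluate(value, warn, crit):
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"

# Disk usage at 92% with warn=80, crit=95:
level = evaluate(92, warn=80, crit=95)
# level == "warning"
```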

The beauty of these tools is that you can throw any metric you can think of at them. They can even be used to collect business-level data, utilizing the existing graphing and even alerting capabilities.

Log Files

The much dreaded log file won't go out of style for a long time, that's for sure. Your web server, your database, your Rails application, your application server, your mail server: all of them dump more or less useful information into log files. They're usually the most immediate and up-to-date view of what's going on in your application, if you choose to actually log something. Rails applications traditionally seem to be less of a candidate here, but your background services sure are, as is any other service running on your servers. The log is the first to know when there are problems delivering email or your web server is returning an unexpected number of 500 errors.

The biggest problem, however, is aggregating the log data: centralized logging, if you will. syslog and the alternative syslog tools are traditionally sufficient, while on the larger-scale end you have custom tools like Cloudera's Flume or Facebook's Scribe. There's also a bunch of paid services specializing in logging; the most noteworthy are Splunk and Loggly. Loggly relies on syslog to collect and transmit data from your servers, but they also have a custom API to transmit data. The data is indexed and can easily be searched, which is usually exactly what you want to do with logs. Think about the last time you grepped for something in multiple log files, trying to narrow down the results to a specific time frame.

There are a couple of open source tools available too. Graylog2 is a syslog server with a MongoDB backend, a Java server acting as a syslog endpoint, and a web UI allowing nicer access to the log data. A bit more kick-ass is logstash, which uses RabbitMQ and ElasticSearch for indexing and searching log data. Almost like a self-hosted Loggly.

When properly aggregated, log files can show trends too, but aggregating them gets much harder the more log data your infrastructure accumulates.

ZOMG! So much monitoring, really?

Infrastructure purists would start by saying that there's a difference between monitoring, metrics gathering, and log files. To me, they're a similar means to a similar end. It doesn't exactly matter what you call it; the important thing is to collect and evaluate the data.

I'm not suggesting you need every single kind of logging, monitoring, and metrics gathering mentioned here. There is however one reason why eventually you'll want to have most if not all of them: during any incident in your application or infrastructure, you can correlate all the available data to find the real reason for a downtime, a spike, slow queries, or problems introduced by recent deployments.

For example: your site's performance is becoming sluggish in certain areas, and users start complaining. Application-level monitoring indicates specific actions taking longer than usual, pointing to a specific query. Server monitoring for your database master indicates an increased number of I/O waits, usually a sign that too much data is read from or written to disk. The simplest reason could be a missing index, or that your data doesn't fit into memory anymore and too much of it is swapped out to disk. You'll finally be looking at MySQL's slow query log (or something similar for your favorite database) to find out what query is causing the trouble, eventually (and hopefully) fixing it.

That's the power of monitoring, and you just can't put a price on a good setup that gives you all the data and information you need to assess incidents or predict trends. And while you can set up a lot of this yourself, it doesn't hurt to look into paid options. Managing monitoring yourself means managing more infrastructure. If you can afford to pay someone else to do it for you, look at some of the mentioned services, which I have no affiliation with; I just think they're incredibly useful.

Even being an infrastructure enthusiast myself, I'm not shy of outsourcing where it makes sense. Added features like SMS alerts and iPhone push notifications should also be taken into account; remember that it'd otherwise be up to you to implement all this. It's not without irony that I mention PagerDuty. They sit on top of all the other monitoring solutions you have implemented and just take care of the alerting, with the added benefit of on-call schedules, alert escalation, and more.

Tracking Every Release



Posted on 08 December 2010

We spend a lot of time gathering metrics for our network, servers, and many things going on within the code that drives Etsy. It’s no secret that this is one of our keys to moving fast. We use a variety of monitoring tools to help us correlate issues across our architecture. But what most monitoring tools achieve is correlating the effects of change, rather than the causes.

Changes to application code (deploys) are opportunities for failure. Tweaking pages and features on your web site causes ripples throughout the metrics you monitor, including database load, cache requests, web server requests, and outgoing bandwidth. When you break something on your site, those metrics will typically start to skew up or down.

Something obviously happened here… but what was it? We might correlate this sudden spike in PHP warnings with a drop in member logins or a drop in traffic on our web servers, but these point to effects and not to a root cause.

We need to track changes that we make to the system.

Different companies track change in ways that reflect their release cycle. A company that only releases new software or services once or twice a year might literally do this by distributing a press release. Companies that move more quickly and release new products every few weeks might rely on company-wide emails to track changes. The faster the iteration schedule, the smaller and less formal the announcement becomes.

When you reach the point of releasing changes a couple of times a day, this needs to be automated and needs to be distributed to places where it is quickly accessible, such as your monitoring tools and IRC channels. At Etsy, we are releasing changes to code and application configs over 25 times a day. When the system metrics we monitor start to skew we need to be able to immediately identify whether this is a human-induced change (application code) or not (hardware failure, third-party APIs, etc.). We do this by tracking the time of every single change we ship to our production servers.

We’ve been using Graphite for monitoring application-level metrics for nearly a year now. These include things like the number of new registrations, shopping carts, items sold, images uploaded, forum posts, and application errors. Getting metrics into Graphite is simple: you send a metric name, a value, and the current Unix timestamp. To track time-based events, the value sent for the metric can simply be "1", as in "1 1287106599". Erik Kastner added this right into our code deployment tool so that every single deploy is automatically tracked. You didn’t think we did this by hand, did you?
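
The plaintext protocol line is easy to produce from any language; here is a Python sketch (the metric name and host below are placeholders, not Etsy's actual names):

```python
# Build a Graphite plaintext-protocol line ("name value timestamp")
# and send it over TCP to the default plaintext port 2003.
import socket

def graphite_line(name, value, ts):
    return f"{name} {value} {int(ts)}\n"

line = graphite_line("deploys.web", 1, 1287106599)
# line == "deploys.web 1 1287106599\n"

def send(line, host="graphite.example.com", port=2003):
    with socket.create_connection((host, port), timeout=2) as s:
        s.sendall(line.encode())
```

A deploy tool only needs to emit one such line per deploy for the event to show up in every graph.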

The trick to displaying events in Graphite is to apply the drawAsInfinite() function. This displays events as a vertical line at the time of the event. (Hat tip: Mark Lin, since this is not well documented.) The result looks like this:

Graphite has a wonderfully flexible URL API that allows for mixing together multiple data sets in a single graph. We can mix our code deployments right into the graph of PHP warnings we saw above.
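
Building such a mixed-targets render URL might look like this in Python; the base URL and metric names are placeholders, not Etsy's real ones:

```python
# Combine a regular metric and a drawAsInfinite()-wrapped event series
# in one Graphite render URL.
from urllib.parse import urlencode

def render_url(base, targets, **params):
    query = [("target", t) for t in targets] + sorted(params.items())
    return f"{base}/render?{urlencode(query)}"

url = render_url(
    "http://graphite.example.com",
    ["stats.php.warnings", "drawAsInfinite(events.deploys)"],
    **{"from": "-4h"},
)
```

Each deploy then shows up as a vertical line laid over the warnings graph.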

Ah-ha! A code change occurred right after 4 PM that set off the warnings. And you can see that a second deploy was made about 10 minutes later that fixed most of the warnings, and a third deploy that squashed anything remaining.

We maintain a battery of thousands of tests that run against our application code before every single deploy, and we’re adding more every day. Combined with engineers pairing up for code reviews, we catch most issues before they get deployed. Tracking every deploy allows us to quickly detect any bugs that we missed.

Equally useful is the reassurance we have that we can deploy many times a day without disrupting core functionality on the site. Across the 16 code deploys shown below, not a single one caused an unexpected blip in our member logins.

These tools highlight the good events along with the bad. Ian Malpass, who works on our customer support tools, uses Graphite to monitor the number of new posts written in our forums, where Etsy members discuss selling, share tips, report bugs, and ask for help. When we correlate these with deploys, you can see the flurry of excitement in our forums after one of our recent product launches.

Automated tracking of code deploys is essential for teams who practice Continuous Deployment. Monitoring every aspect of your server and network architecture helps detect when something has gone awry. Correlating the times of each and every code deploy helps to quickly identify human-triggered problems and greatly cut down on your time to resolve them.


Shinken monitoring

Shinken is a new, powerful monitoring tool, compatible with Nagios® and written in Python, that enables organizations to identify and resolve IT problems before they impact the business.

Shinken can monitor all IT elements: systems, services, and applications. In case of a failure, Shinken can alert the operations engineers so they can promptly repair it. It has the same capabilities as Nagios, and even more, with advanced built-in facilities for load-balanced and highly available monitoring.

Very large monitoring capabilities

Shinken is able to use Nagios®'s plugins, so it can monitor a wide range of system types:

  • Monitoring of network services (SMTP, IMAP, HTTP, NNTP, ICMP, SNMP, FTP, SSH and custom protocols via 3rd party plugins)
  • Monitoring of system resources (processor load, disk usage, system logs) on a majority of operating systems, including Unixes and Microsoft Windows.
  • Monitoring of anything else, like probes (temperature, alarms…), that can be reached over the network with 3rd party plugins
  • Simple plugin design that allows users to easily develop their own service checks depending on requirements by using the language of choice (shell scripts, C++, Perl, Ruby, Python, PHP, C#, etc.)
  • Add-ons are available for graphing data (Nagiosgraph, Nagiosgrapher, PNP4Nagios, and others available)

Efficient alerts about root problems

You can define object dependencies in Shinken, like a server being connected to a switch or a web page depending on a database. This allows Shinken to alert only about real problems and give the list of impacted elements:

  • Ability to define network host hierarchy, allowing detection of and distinction between hosts that are down and those that are unreachable
  • Contact notifications for service or host problems and resolution (via e-mail, pager, SMS, or any user-defined method through plugin system)
  • Ability to define event handlers to be run during service or host events for proactive problem resolution

And a cloud-like architecture!

Shinken's architecture is its major strength: it's a real monitoring-in-the-cloud system!

Shinken allows users to have a huge configuration (more than 10,000 host checks with one server) and runs on a lot of systems, Windows included. The migration from Nagios® to Shinken is also very easy! To see a comparison between Nagios® and Shinken features, you can look at this page.

Shinken can export data via broker modules to various database backends like MySQL, Oracle, CouchDB, and SQLite. So you can use it with various interfaces like Centreon, Ninja, NagVis, Thruk, or even the Nagios CGI interface.

15th May 2009 JMXetric 0.0.3 released

JVM instrumentation to Ganglia

JMXetric is a 100% java, configurable JVM agent that periodically polls MBean attributes and reports their values to Ganglia.


The project's goals are that JMXetric should be:

  • configurable (xml)
  • lightweight (small memory and cpu footprint)
  • standalone (not depend on many third party libraries)

The gmetric protocol implementation uses classes generated by the remotetea project.

Screenshots from the ganglia web interface:

Screenshots from using rrdtool directly:

If you use this then please let me know!

How To Monitor Cassandra with Nagios



  • About Cassandra & Nagios

    The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. Cassandra was open sourced by Facebook in 2008, and is now developed by Apache committers and contributors from many companies.

    Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. Nagios monitors your entire IT infrastructure to ensure systems, applications, services, and business processes are functioning properly. In the event of a failure, Nagios can alert technical staff of the problem, allowing them to begin remediation processes before outages affect business processes, end-users, or customers.

  • How it Works

    This tutorial will show you how to start monitoring 'HeapMemoryUsage' in Nagios using a plugin called check_jmx (aka jmxquery). Basically, Java Management Extensions (JMX) allow us to monitor the JVMs running on a Cassandra cluster via port 8080 (in this case).
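
What a check like check_jmx ultimately evaluates can be sketched as follows; the heap numbers, thresholds, and function name below are stand-ins for illustration, not values from a real Cassandra JVM:

```python
# HeapMemoryUsage is a composite JMX value of used/max bytes; a check
# compares used-as-percentage-of-max against warning and critical
# thresholds and returns a Nagios exit code.

def check_heap(heap, warn_pct=80, crit_pct=95):
    pct = 100.0 * heap["used"] / heap["max"]
    if pct >= crit_pct:
        return 2, pct  # CRITICAL
    if pct >= warn_pct:
        return 1, pct  # WARNING
    return 0, pct      # OK

code, pct = check_heap({"used": 850, "max": 1000})
# code == 1 (WARNING), pct == 85.0
```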