memonic

Fully Automated MySQL slow log analysis on Amazon RDS


At Memonic we rely on MySQL for most of our data storage. In any relational database system the correct creation of indices is important; otherwise queries become inefficient and slow. The problem is that indices are often forgotten, especially when an existing query is updated. As a tool to detect queries without proper indices, MySQL offers the slow query log: all queries that take longer than a configured threshold are logged there.

We host our platform in Amazon’s cloud. For the database we rely on their Relational Database Service (RDS). As we don’t have root access to those machines, we can’t simply tail the slow log to see what’s going on. Instead, RDS can optionally write the slow log into a special system table, from which the data can be retrieved with a query. See the Amazon RDS FAQ for how to configure the slow log on RDS.

For automated analysis of the slow logs we like to use mk-query-digest. This excellent utility groups all logged queries together by their general type and thus allows a big-picture overview. As an example take these three queries that may have been logged:

SELECT * FROM events WHERE user_id = 'foo';
SELECT * FROM data WHERE created_at >= '2011-12-13 10:12:00';
SELECT * FROM events WHERE user_id = 'bar';

These will be grouped together by mk-query-digest as just two queries:

SELECT * FROM events WHERE user_id = '?';
SELECT * FROM data WHERE created_at >= '?';

This is accompanied by statistics on how often each query type was executed, how long it took in total, and so on. This is a great way to focus any optimization effort on the queries that are actually executed a lot.
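
To illustrate the kind of grouping mk-query-digest performs (a toy sketch only, not its actual implementation), a couple of regular expressions are enough to collapse literal values into placeholders and count the resulting fingerprints:

import re
from collections import Counter

def fingerprint(query):
    # Collapse literal values so structurally identical queries group together.
    q = re.sub(r"'[^']*'", "'?'", query)  # quoted strings -> '?'
    q = re.sub(r"\b\d+\b", "?", q)        # bare numbers -> ?
    return q

queries = [
    "SELECT * FROM events WHERE user_id = 'foo';",
    "SELECT * FROM data WHERE created_at >= '2011-12-13 10:12:00';",
    "SELECT * FROM events WHERE user_id = 'bar';",
]

print(Counter(fingerprint(q) for q in queries))
# The two events queries collapse into a single fingerprint with a count of 2.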

Unfortunately mk-query-digest only works with the normal MySQL slow query log format and can’t access the proprietary table that Amazon RDS keeps. To work around this, we wrote the db2log.py script which we hereby release into the public domain.

#!/usr/bin/env python
"""
Queries the slowlog database table maintained by Amazon RDS and outputs it in
the normal MySQL slow log text format.
"""

import _mysql

db = _mysql.connect(db="mysql", read_default_file="/root/.my.cnf")
db.query("""SELECT * FROM slow_log ORDER BY start_time""")
r = db.use_result()

print """/usr/sbin/mysqld, Version: 5.1.49-3-log ((Debian)). started with:
Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sock
Time Id Command Argument
"""

while True:
    results = r.fetch_row(maxrows=100, how=1)
    if not results:
        break

    for row in results:
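        # start_time comes back as 'YYYY-MM-DD HH:MM:SS'; the text slow log wants 'YYMMDD HH:MM:SS',
        # and query_time / lock_time are 'HH:MM:SS' strings that are converted to plain seconds below.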
        row['year'] = row['start_time'][2:4]
        row['month'] = row['start_time'][5:7]
        row['day'] = row['start_time'][8:10]
        row['time'] = row['start_time'][11:]

        hours = int(row['query_time'][0:2])
        minutes = int(row['query_time'][3:5])
        seconds = int(row['query_time'][6:8])
        row['query_time_f'] = hours * 3600 + minutes * 60 + seconds

        hours = int(row['lock_time'][0:2])
        minutes = int(row['lock_time'][3:5])
        seconds = int(row['lock_time'][6:8])
        row['lock_time_f'] = hours * 3600 + minutes * 60 + seconds

        if not row['sql_text'].endswith(';'):
            row['sql_text'] += ';'

        print '# Time: {year}{month}{day} {time}'.format(**row)
        print '# User@Host: {user_host}'.format(**row)
        print '# Query_time: {query_time_f} Lock_time: {lock_time_f} Rows_sent: {rows_sent} Rows_examined: {rows_examined}'.format(**row)
        print 'use {db};'.format(**row)
        print row['sql_text']


It simply dumps the current contents of the RDS slow_log table in the same format that MySQL usually uses for its slow log file. This output can then be piped into mk-query-digest to generate a report.

We ended up doing just that in a daily cron job which sends a mail to our developers.

#!/bin/bash
(/usr/local/bin/db2log | \
    mk-query-digest --fingerprints \
        --filter '$event->{user} !~ m/^(bi|memonic)$/') 2>&1 | \
        mail -s "MySQL slow logs" root

# Rotate slow logs. Will move them into the backup table slow_log_backup. If
# that table exists it's overwritten with the primary slow log.
# So with this strategy we can still access yesterday's slow log by querying
# slow_log_backup.
mysql mysql -e 'CALL rds_rotate_slow_log'

There is one line which needs further explanation: the filter. We filter out any slow log events that were triggered by either the bi or the memonic users. The former is used for asynchronous generation of some statistics and performance isn’t required for that. The latter we use for ad-hoc queries which we don’t need to optimize.

So there you have it: an automated mechanism to analyze slow MySQL queries. From time to time a new slow query may pop up after deploying a new release, but the next day we are informed about it and can fix the issue.

Instagram Engineering • Sharding & IDs at Instagram


With more than 25 photos & 90 likes every second, we store a lot of data here at Instagram. To make sure all of our important data fits into memory and is available quickly for our users, we’ve begun to shard our data—in other words, place the data in many smaller buckets, each holding a part of the data.

Our application servers run Django with PostgreSQL as our back-end database. Our first question after deciding to shard out our data was whether PostgreSQL should remain our primary data-store, or whether we should switch to something else. We evaluated a few different NoSQL solutions, but ultimately decided that the solution that best suited our needs would be to shard our data across a set of PostgreSQL servers.

Before writing data into this set of servers, however, we had to solve the issue of how to assign unique identifiers to each piece of data in the database (for example, each photo posted in our system). The typical solution that works for a single database—just using a database’s natural auto-incrementing primary key feature—no longer works when data is being inserted into many databases at the same time. The rest of this blog post addresses how we tackled this issue.

Before starting out, we listed out what features were essential in our system:

  1. Generated IDs should be sortable by time (so a list of photo IDs, for example, could be sorted without fetching more information about the photos)
  2. IDs should ideally be 64 bits (for smaller indexes, and better storage in systems like Redis)
  3. The system should introduce as few new ‘moving parts’ as possible—a large part of how we’ve been able to scale Instagram with very few engineers is by choosing simple, easy-to-understand solutions that we trust.

Existing solutions

Many existing solutions to the ID generation problem exist; here are a few we considered:

Generate IDs in web application

This approach leaves ID generation entirely up to your application, and not up to the database at all. For example, MongoDB uses ObjectIds, which are 12 bytes long and encode the timestamp as the first component. Another popular approach is to use UUIDs.

Pros:

  1. Each application thread generates IDs independently, minimizing points of failure and contention for ID generation
  2. If you use a timestamp as the first component of the ID, the IDs remain time-sortable

Cons:

  1. Generally requires more storage space (96 bits or higher) to make reasonable uniqueness guarantees
  2. Some UUID types are completely random and have no natural sort

Generate IDs through dedicated service

Ex: Twitter’s Snowflake, a Thrift service that uses Apache ZooKeeper to coordinate nodes and then generates 64-bit unique IDs

Pros:

  1. Snowflake IDs are 64-bits, half the size of a UUID
  2. Can use time as first component and remain sortable
  3. Distributed system that can survive nodes dying

Cons:

  1. Would introduce additional complexity and more ‘moving parts’ (ZooKeeper, Snowflake servers) into our architecture

DB Ticket Servers

Uses the database’s auto-incrementing abilities to enforce uniqueness. Flickr uses this approach, but with two ticket DBs (one on odd numbers, the other on even) to avoid a single point of failure.

Pros:

  1. DBs are well understood and have pretty predictable scaling factors

Cons:

  1. Can eventually become a write bottleneck (though Flickr reports that, even at huge scale, it’s not an issue).
  2. An additional couple of machines (or EC2 instances) to admin
  3. If using a single DB, becomes single point of failure. If using multiple DBs, can no longer guarantee that they are sortable over time.

Of all the approaches above, Twitter’s Snowflake came the closest, but the additional complexity required to run an ID service was a point against it. Instead, we took a conceptually similar approach, but brought it inside PostgreSQL.

Our solution

Our sharded system consists of several thousand ‘logical’ shards that are mapped in code to far fewer physical shards. Using this approach, we can start with just a few database servers, and eventually move to many more, simply by moving a set of logical shards from one database to another, without having to re-bucket any of our data. We used Postgres’ schemas feature to make this easy to script and administrate.

Schemas (not to be confused with the SQL schema of an individual table) are a logical grouping feature in Postgres. Each Postgres DB can have several schemas, each of which can contain one or more tables. Table names must only be unique per-schema, not per-DB, and by default Postgres places everything in a schema named ‘public’.

Each ‘logical’ shard is a Postgres schema in our system, and each sharded table (for example, likes on our photos) exists inside each schema.

We’ve delegated ID creation to each table inside each shard, by using PL/PGSQL, Postgres’ internal programming language, and Postgres’ existing auto-increment functionality.

Each of our IDs consists of:

  • 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
  • 13 bits that represent the logical shard ID
  • 10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond

Let’s walk through an example: let’s say it’s September 9th, 2011, at 5:00pm and our ‘epoch’ begins on January 1st, 2011. There have been 1387263000 milliseconds since the beginning of our epoch, so to start our ID, we fill the left-most 41 bits with this value with a left-shift:

id = 1387263000 << (64-41)

Next, we take the shard ID for this particular piece of data we’re trying to insert. Let’s say we’re sharding by user ID, and there are 2000 logical shards; if our user ID is 31341, then the shard ID is 31341 % 2000 -> 1341. We fill the next 13 bits with this value:

id |= 1341 << (64-41-13)

Finally, we take whatever the next value of our auto-increment sequence is (this sequence is unique to each table in each schema) and use it to fill out the remaining bits. Let’s say we’d generated 5,000 IDs for this table already; our next value is 5,001, which we take and mod by 1024 (so it fits in 10 bits) and include it too:

id |= (5001 % 1024)

We now have our ID, which we can return to the application server using the RETURNING keyword as part of the INSERT.
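
For illustration, here is the same bit arithmetic in Python (a sketch mirroring the walkthrough above, not code from the post; the epoch value is the one used in the PL/PGSQL function below):

EPOCH_MS = 1314220021721            # custom epoch, as in the PL/PGSQL below
SHARD_BITS, SEQ_BITS = 13, 10
LOGICAL_SHARDS = 2000

def make_id(now_ms, user_id, next_seq_val):
    shard_id = user_id % LOGICAL_SHARDS
    result = (now_ms - EPOCH_MS) << (SHARD_BITS + SEQ_BITS)  # 41 bits of time
    result |= shard_id << SEQ_BITS                           # 13 bits of shard id
    result |= next_seq_val % 1024                            # 10 bits of sequence
    return result

def shard_of(generated_id):
    # The shard id can be read straight back out of the id, which helps with mapping later.
    return (generated_id >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)

some_id = make_id(now_ms=EPOCH_MS + 1387263000, user_id=31341, next_seq_val=5001)
assert shard_of(some_id) == 1341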

Here’s the PL/PGSQL that accomplishes all this (for an example schema insta5):

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;

    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
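    -- 64-bit layout: 41 bits of milliseconds since our_epoch, 13 bits of shard id,
    -- 10 bits of sequence - hence the shifts by 23 (13 + 10) and by 10 below.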
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;

And when creating the table, we do:

CREATE TABLE insta5.our_table (
    "id" bigint NOT NULL DEFAULT insta5.next_id(),
    ...rest of table schema...
)

And that’s it! Primary keys that are unique across our application (and as a bonus, contain the shard ID in them for easier mapping). We’ve been rolling this approach into production and are happy with the results so far. Interested in helping us figure out these problems at scale? We’re hiring!

Mike Krieger, co-founder

Discuss this post on Hacker News.


Brewer's CAP Theorem


The kool aid Amazon and Ebay have been drinking

On Friday 4th June 1976, in a small upstairs room away from the main concert auditorium, the Sex Pistols kicked off their first gig at Manchester's Lesser Free Trade Hall. There's some confusion as to who exactly was there in the audience that night, partly because there was another concert just six weeks later, but mostly because it's considered to be a gig that changed western music culture forever. So iconic and important has that appearance become that David Nolan wrote a book, I Swear I Was There: The Gig That Changed the World, investigating just whose claim to have been present was justified. Because the 4th of June is generally considered to be the genesis of punk rock [6].

Prior to this (in fact since around 1971) there had been a number of protopunk bands, such as the New York Dolls and the Velvet Underground, but it was this one set by the Sex Pistols that in music folklore started the revolution that set in motion the driving guitars of the Buzzcocks, the plaintive wailing of The Smiths, the eclectic syncopations of The Fall, the rising majesty of Joy Division and Simply Red (I guess you can't have everything).

Wednesday 19th July 2000 may not go down in popular culture with quite the same magnitude, but it has had a similar impact on internet-scale business as the Sex Pistols did on music a quarter of a century earlier, for that was the day of Eric Brewer's keynote speech at the ACM Symposium on the Principles of Distributed Computing (PODC).

The Sex Pistols had shown that barely-constrained fury was more important to their contemporaries than art-school structuralism, giving anyone with three chords and something to say permission to start a band. Eric Brewer, in what became known as Brewer's Conjecture, said that as applications become more web-based we should stop worrying about data consistency, because if we want high availability in these new distributed applications, then guaranteed consistency of data is something we cannot have, thus giving anyone with three servers and a keen eye for customer experience permission to start an internet scale business. Disciples of Brewer (present that day or later converts) include the likes of Amazon, EBay, and Twitter.

Two years later, in 2002, Seth Gilbert and Nancy Lynch of MIT formally proved Brewer to be correct and thus Brewer's Theorem was born.

Brewer's (CAP) Theorem

So what exactly is Brewer's Theorem, and why does it warrant comparison with a 1976 punk gig in Manchester?

Brewer's 2000 talk was based on his theoretical work at UC Berkeley and observations from running Inktomi, though Brewer and others were talking about trade-off decisions that need to be made in highly scalable systems years before that (e.g. "Cluster-Based Scalable Network Services" from SOSP in 1997 and "Harvest, yield, and scalable tolerant systems" in 1999) so the contents of the presentation weren't new and, like many of these ideas, they were the work of many smart people (as I am sure Brewer himself would be quick to point out).

What he said was there are three core systemic requirements that exist in a special relationship when it comes to designing and deploying applications in a distributed environment (he was talking specifically about the web but so many corporate businesses are multi-site/multi-country these days that the effects could equally apply to your data-centre/LAN/WAN arrangement).

The three requirements are: Consistency, Availability and Partition Tolerance, giving Brewer's Theorem its other name - CAP.

To give these some real-world meaning let's use a simple example: you want to buy a copy of Tolstoy's War & Peace to read on a particularly long vacation you're starting tomorrow. Your favourite web bookstore has one copy left in stock. You do your search, check that it can be delivered before you leave and add it to your basket. You remember that you need a few other things so you browse the site for a bit (have you ever bought just one thing online? Gotta maximise the parcel dollar). While you're reading the customer reviews of a suntan lotion product, someone, somewhere else in the country, arrives at the site, adds a copy to their basket and goes right to the checkout process (they need an urgent fix for a wobbly table with one leg much shorter than the others).

  • Consistency

    A service that is consistent operates fully or not at all. Gilbert and Lynch use the word "atomic" instead of consistent in their proof, which makes more sense technically because, strictly speaking, consistent is the C in ACID as applied to the ideal properties of database transactions and means that data will never be persisted that breaks certain pre-set constraints. But if you consider it a preset constraint of distributed systems that multiple values for the same piece of data are not allowed then I think the leak in the abstraction is plugged (plus, if Brewer had used the word atomic, it would be called the AAP theorem and we'd all be in hospital every time we tried to pronounce it).

    In the book buying example you can add the book to your basket, or fail. Purchase it, or not. You can't half-add or half-purchase a book. There's one copy in stock and only one person will get it the next day. If both customers can continue through the order process to the end (i.e. make payment) the lack of consistency between what's in stock and what's in the system will cause an issue. Maybe not a huge issue in this case - someone's either going to be bored on vacation or spilling soup - but scale this up to thousands of inconsistencies and give them a monetary value (e.g. trades on a financial exchange where there's an inconsistency between what you think you've bought or sold and what the exchange record states) and it's a huge issue.

    We might solve consistency by utilising a database. At the correct moment in the book order process the number of War and Peace books-in-stock is decremented by one. When the other customer reaches this point, the cupboard is bare and the order process will alert them to this without continuing to payment. The first operates fully, the second not at all.

    Databases are great at this because they focus on ACID properties and give us Consistency by also giving us Isolation, so that when Customer One is reducing books-in-stock by one, and simultaneously increasing books-in-basket by one, any intermediate states are isolated from Customer Two, who has to wait a few milliseconds while the data store is made consistent.

  • Availability

    Availability means just that - the service is available (to operate fully or not as above). When you buy the book you want to get a response, not some browser message about the web site being uncommunicative. Gilbert & Lynch in their proof of CAP Theorem make the good point that availability most often deserts you when you need it most - sites tend to go down at busy periods precisely because they are busy. A service that's available but not being accessed is of no benefit to anyone.

  • Partition Tolerance

    If your application and database runs on one box then (ignoring scale issues and assuming all your code is perfect) your server acts as a kind of atomic processor in that it either works or doesn't (i.e. if it has crashed it's not available, but it won't cause data inconsistency either).

    Once you start to spread data and logic around different nodes then there's a risk of partitions forming. A partition happens when, say, a network cable gets chopped, and Node A can no longer communicate with Node B. With the kind of distribution capabilities the web provides, temporary partitions are a relatively common occurrence and, as I said earlier, they're also not that rare inside global corporations with multiple data centres.

    Gilbert & Lynch defined partition tolerance as:

    No set of failures less than total network failure is allowed to cause the system to respond incorrectly

    and noted Brewer's comment that a one-node partition is equivalent to a server crash, because if nothing can connect to it, it may as well not be there.

The Significance of the Theorem

CAP Theorem comes to life as an application scales. At low transactional volumes, small latencies to allow databases to get consistent have no noticeable effect on either overall performance or the user experience. Any load distribution you do undertake, therefore, is likely to be for systems management reasons.

But as activity increases, these pinch-points in throughput will begin to limit growth and create errors. It's one thing having to wait for a web page to come back with a response and another experience altogether to enter your credit card details to be met with "HTTP 500 java.lang.schrodinger.purchasingerror" and wonder whether you've just paid for something you won't get, not paid at all, or maybe the error is immaterial to this transaction. Who knows? You are unlikely to continue, more likely to shop elsewhere, and very likely to phone your bank.

Either way this is not good for business. Amazon claim that just an extra one tenth of a second on their response times will cost them 1% in sales. Google said they noticed that just a half a second increase in latency caused traffic to drop by a fifth.

I've written a little about scalability before, so won't repeat all that here except to make two points: the first is that whilst addressing the problems of scale might be an architectural concern, the initial discussions are not. They are business decisions. I get very tired of hearing, from techies, that such-and-such an approach is not warranted because current activity volumes don't justify it. It's not that they're wrong; more often than not they're quite correct, it's that to limit scale from the outset is to implicitly make revenue decisions - a factor that should be made explicit during business analysis.

The second point is that once you embark on discussions around how to best scale your application the world falls broadly into two ideological camps: the database crowd and the non-database crowd.

The database crowd, unsurprisingly, like database technology and will tend to address scale by talking of things like optimistic locking and sharding, keeping the database at the heart of things.

The non-database crowd will tend to address scale by managing data outside of the database environment (avoiding the relational world) for as long as possible.

I think it's fair to say that the former group haven't taken to CAP Theorem with quite the same gusto as the latter (though they are talking about it). This is because if you have to drop one of consistency, availability, or partition tolerance, many opt to drop consistency, which is the raison d'être of the database. The logic, no doubt, is that availability and partition-tolerance keep your money-making application alive, whereas inconsistency just feels like one of those things you can work around with clever design.

Like so much else in IT, it's not as black and white as this. Eric Brewer, on slide 13 of his PODC talk, when comparing ACID and its informal counterpart BASE, even says "I think it's a spectrum". And if you're interested in this as a topic (it's slightly outside of what I want to talk about here) you could do worse than start with a paper called "Design and Evaluation of a Continuous Consistency Model for Replicated Services" by Haifeng Yu and Amin Vahdat. Nobody should interpret CAP as implying the database is dead.

Where both sides agree though is that the answer to scale is distributed parallelisation not, as was once thought, supercomputer grunt. Eric Brewer's influence on the Network of Workstations projects of the mid-nineties led to the architectures that exposed CAP theorem, because as he says in another presentation on Inktomi and the Internet Bubble the answer has always been processors working in parallel:

If they're not working in parallel you have no chance to get the problem done in a reasonable amount of time. This is a lot like anything else. If you have a really big job to do you get lots of people to do it. So if you are building a bridge you have lots of construction workers. That's parallel processing also. So a lot of this will end up being "how do we mix parallel processing and the internet?"

The Proof in Pictures

Here's a simplified proof, in pictures because I find it much easier to understand that way. I've mostly used the same terms as Gilbert and Lynch so that this ties up with their paper.

Picture two nodes in a network, N1 and N2. They both share a piece of data V (how many physical copies of War and Peace are in stock), which has a value V0. Running on N1 is an algorithm called A which we can consider to be safe, bug free, predictable and reliable. Running on N2 is a similar algorithm called B. In this experiment, A writes new values of V and B reads values of V.

In a sunny-day scenario this is what happens: (1) First A writes a new value of V, which we'll call V1. (2) Then a message (M) is passed from N1 to N2 which updates the copy of V there. (3) Now any read by B of V will return V1.

If the network partitions (that is messages from N1 to N2 are not delivered) then N2 contains an inconsistent value of V when step (3) occurs.

Hopefully that seems fairly obvious. Scale this up to even a few hundred transactions and it becomes a major issue. If M is an asynchronous message then N1 has no way of knowing whether N2 gets the message. Even with guaranteed delivery of M, N1 has no way of knowing if a message is delayed by a partition event or something failing in N2. Making M synchronous doesn't help because that treats the write by A on N1 and the update event from N1 to N2 as an atomic operation, which gives us the same latency issues we have already talked about (or worse). Gilbert and Lynch also prove, using a slight variation on this, that even in a partially-synchronous model (with ordered clocks on each node) atomicity cannot be guaranteed.
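
To make the write-then-read race concrete, here is a toy simulation of the scenario in Python (purely illustrative; the node and value names follow the description above):

class Node(object):
    def __init__(self, name, value):
        self.name, self.value = name, value

def replicate(src, dst, partitioned):
    # Deliver the update message M from src to dst unless the network is partitioned.
    if not partitioned:
        dst.value = src.value

n1, n2 = Node("N1", "V0"), Node("N2", "V0")

n1.value = "V1"                      # (1) A writes V1 on N1
replicate(n1, n2, partitioned=True)  # (2) message M is lost because of the partition
print(n2.value)                      # (3) B reads V0 from N2: a stale, inconsistent value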

So what CAP tells us is that if we want A and B to be highly available (i.e. working with minimal latency) and we want our nodes N1 to Nn (where n could be hundreds or even thousands) to remain tolerant of network partitions (lost messages, undeliverable messages, hardware outages, process failures) then sometimes we are going to get cases where some nodes think that V is V0 (one copy of War and Peace in stock) and other nodes will think that V is V1 (no copies of War and Peace in stock).

We'd really like everything to be structured, consistent and harmonious, like the music of a prog rock band from the early seventies, but what we are faced with is a little bit of punk-style anarchy. And actually, whilst it might scare our grandmothers, it's OK once you know this, because both can work together quite happily.

Let's quickly analyse this from a transactional perspective.

If we have a transaction (i.e. a unit of work based around the persistent data item V) called α, then α1 could be the write operation from before and α2 could be the read. On a local system this would easily be handled by a database with some simple locking, isolating any attempt to read in α2 until α1 completes safely. In the distributed model though, with nodes N1 and N2 to worry about, the intermediate synchronising message has also to complete. Unless we can control when α2 happens, we can never guarantee it will see the same data values α1 writes. All methods to add control (blocking, isolation, centralised management, etc) will impact either partition tolerance or the availability of α1 (A) and/or α2 (B).

Dealing with CAP

You've got a few choices when addressing the issues thrown up by CAP. The obvious ones are:

  1. Drop Partition Tolerance

    If you want to run without partitions you have to stop them happening. One way to do this is to put everything (related to that transaction) on one machine, or in one atomically-failing unit like a rack. It's not 100% guaranteed because you can still have partial failures, but you're less likely to get partition-like side-effects. There are, of course, significant scaling limits to this.

  2. Drop Availability

    This is the flip side of the drop-partition-tolerance coin. On encountering a partition event, affected services simply wait until data is consistent and therefore remain unavailable during that time. Controlling this could get fairly complex over many nodes, with re-available nodes needing logic to handle coming back online gracefully.

  3. Drop Consistency

    Or, as Werner Vogels puts it, accept that things will become "Eventually Consistent" (updated Dec 2008). Vogels' article is well worth a read. He goes into a lot more detail on operational specifics than I do here.

    Lots of inconsistencies don't actually require as much work as you'd think (meaning continuous consistency is probably not something we need anyway). In my book order example if two orders are received for the one book that's in stock, the second just becomes a back-order. As long as the customer is told of this (and remember this is a rare case) everybody's probably happy.

  4. The BASE Jump

    The notion of accepting eventual consistency is supported via an architectural approach known as BASE (Basically Available, Soft-state, Eventually consistent). BASE, as its name indicates, is the logical opposite of ACID, though it would be quite wrong to imply that any architecture should (or could) be based wholly on one or the other. This is an important point to remember, given our industry's habit of "oooh shiny" strategy adoption.

    And here I defer to Professor Brewer himself who emailed me some comments on this article, saying:

    the term "BASE" was first presented in the 1997 SOSP article that you cite. I came up with acronym with my students in their office earlier that year. I agree it is contrived a bit, but so is "ACID" -- much more than people realize, so we figured it was good enough. Jim Gray and I discussed these acronyms and he readily admitted that ACID was a stretch too -- the A and D have high overlap and the C is ill-defined at best. But the pair connotes the idea of a spectrum, which is one of the points of the PODC lecture as you correctly point out.

    Dan Pritchett of EBay has a nice presentation on BASE.

  5. Design around it

    Guy Pardon, CTO of atomikos wrote an interesting post which he called "A CAP Solution (Proving Brewer Wrong)", suggesting an architectural approach that would deliver Consistency, Availability and Partition-tolerance, though with some caveats (notably that you don't get all three guaranteed in the same instant).

    It's worth a read as Guy eloquently represents an opposing view in this area.

Summary

That you can only guarantee two of Consistency, Availability and Partition Tolerance is real and evidenced by the most successful websites on the planet. If it works for them I see no reason why the same trade-offs shouldn't be considered in everyday design in corporate environments. If the business explicitly doesn't want to scale then fine, simpler solutions are available, but it's a conversation worth having. In any case these discussions will be about appropriate designs for specific operations, not the whole shebang. As Brewer said in his email "the only other thing I would add is that different parts of the same service can choose different points in the spectrum". Sometimes you absolutely need consistency whatever the scaling cost, because the risk of not having it is too great.

These days I'd go so far as to say that Amazon and EBay don't have a scalability problem. I think they had one and now they have the tools to address it. That's why they can freely talk about it. Any scaling they do now (given the size they already are) is really more of the same. Once you've scaled, your problems shift to those of operational maintenance, monitoring, rolling out software updates etc. - tough to solve, certainly, but nice to have when you've got those revenue streams coming in.

References

  1. HP's take on CAP Theorem, a white paper entitled "There is no free lunch with distributed data"

  2. Computer Science Notes on Distributed Transactions and Network Partitions from University of Sussex

  3. Nice post by Jens Alfke on databases, scaling and Twitter.

  4. Pat Helland's Microsoft paper on distributed transactions and SOA called Data on the Outside versus Data on the Inside, which he later related to CAP Theorem here

  5. Another set of Computer Science course slides, this time from George Mason University in Virginia, on Distributed Software Systems and particularly CAP Theorem and the clash between ACID and BASE ideologies.

  6. The 4th June 1976 is considered to be the birth of Punk Rock in the UK. Thanks to Charlie Dellacona for pointing out that the Ramones take credit for starting the movement in early 1974 in the US, though their official punk recordings are more or less contemporaneous.

  7. Thanks to Hiroshi Yuki, a Japanese translation of this article is available.

  8. Thanks to Daniel Cohen, a two part Hebrew translation of this article is available.

  9. Thanks to Wang Qi, a Chinese translation of this article is available.

Comments (2)

Aengus Walton Dec 8, 2011

Enjoyed that. Good article, despite the rather far-fetched tie-in with the dawn of punk music.

Patrice Neff Dec 8, 2011

Hey at least I learned about Sex Pistols. ;-)

mysqldump: Error 2020: Got packet bigger than ‘max_allowed_packet’ bytes when dumping table


I got this error while dumping a huge table that is more than 26GB in size.

mysqldump: Error 2020: Got packet bigger than ‘max_allowed_packet’ bytes when dumping table `QPR` at row: 5659

The server’s default max_allowed_packet value is 1MB. You can increase this if the server needs to handle big queries (for example, if you are working with big BLOB columns). The largest possible packet that can be transmitted to or from a MySQL 5.1 server or client is 1GB.

Edit your my.cnf file, add max_allowed_packet=1024M, then restart:

[root@db01 ~]# service mysqld restart
Stopping MySQL:                                            [  OK  ]
Starting MySQL:                                             [  OK  ]
mysql> show GLOBAL variables like 'max_allowed_packet%';
+--------------------+------------+
| Variable_name      | Value      |
+--------------------+------------+
| max_allowed_packet | 1073740800 |
+--------------------+------------+
1 row in set (0.00 sec)
mysql>

NB: I needed huge BLOB-type data to be fetched from the db. We can also avoid this restart by changing the variable at runtime; make sure the new value has actually been picked up by running, for example, SHOW VARIABLES LIKE 'max_allowed_packet' from the MySQL client.

If you are dumping huge BLOB data from another host, you also need to change this MySQL variable there to get it to work. Here I’m setting it to 1GB:

mysql> set global max_allowed_packet=100000000000;
Query OK, 0 rows affected (0.00 sec)
mysql> show variables like 'max_allowed_packet%';
+--------------------+---------+
| Variable_name      | Value   |
+--------------------+---------+
| max_allowed_packet | 1048576 |
+--------------------+---------+
1 row in set (0.10 sec)
mysql> set max_allowed_packet=100000000000;
Query OK, 0 rows affected (0.00 sec)
mysql> show variables like 'max_allowed_packet%';
+--------------------+------------+
| Variable_name      | Value      |
+--------------------+------------+
| max_allowed_packet | 1073741824 |
+--------------------+------------+
1 row in set (0.00 sec)
mysql>

Finally, the real solution is to pass the max_allowed_packet parameter to mysqldump itself. mysqldump is notorious for ignoring this value in my.cnf, but setting it as a command-line parameter always works.

#mysqldump -u root -p --max_allowed_packet=512M -B database --tables tblblogdb > dump.sql

I think it’s always better to run a large mysqldump from another server on the same network, which helps reduce disk and CPU usage on the database host during peak hours. Here is one sample:
#$MYSQLDUMP --max_allowed_packet=1G -u $MyUSER -h $MyHOST -p$MyPASS -B $db | bzip2 > $FILE

Don't use MongoDB
=================

I've kept quiet for awhile for various political reasons, but I now
feel a kind of social responsibility to deter people from banking
their business on MongoDB.

Our team did serious load on MongoDB on a large (10s of millions
of users, high profile company) userbase, expecting, from early good
experiences, that the long-term scalability benefits touted by 10gen
would pan out.  We were wrong, and this rant serves to deter you
from believing those benefits and making the same mistake
we did.  If one person avoids the trap, it will have been
worth writing.  Hopefully, many more do.

Note that, in our experiences with 10gen, they were nearly always
helpful and cordial, and often extremely so.  But at the same
time, that cannot be reason alone to suppress information about
the failings of their product.

Why this matters
----------------

Databases must be right, or as-right-as-possible, b/c database
mistakes are so much more severe than almost every other variation
of mistake.  Not only do they have the largest impact on uptime,
performance, expense, and value (the inherent value of the data),
but data has *inertia*.  Migrating TBs of data on-the-fly is
a massive undertaking compared to changing drcses or fixing the
average logic error in your code.  Recovering TBs of data while
down, limited by what spindles can do for you, is a helpless
feeling.

Databases are also complex systems that are effectively black
boxes to the end developer.  By adopting a database system,
you place absolute trust in their ability to do the right thing
with your data to keep it consistent and available.

Why is MongoDB popular?
-----------------------

To be fair, it must be acknowledged that MongoDB is popular,
and that there are valid reasons for its popularity.

 * It is remarkably easy to get running
 * Schema-free models that map to JSON-like structures
   have great appeal to developers (they fit our brains),
   and a developer is almost always the individual who
   makes the platform decisions when a project is in
   its infancy
 * Maturity and robustness, track record, tested real-world
   use cases, etc, are typically more important to sysadmin
   types or operations specialists, who often inherit the
   platform long after the initial decisions are made
 * Its single-system, low concurrency read performance benchmarks
   are impressive, and for the inexperienced evaluator, this
   is often The Most Important Thing

Now, if you're writing a toy site, or a prototype, something
where developer productivity trumps all other considerations,
it basically doesn't matter *what* you use.  Use whatever
gets the job done.

But if you're intending to really run a large scale system
on Mongo, one that a business might depend on, simply put:

Don't.

Why not?
--------

**1. MongoDB issues writes in unsafe ways *by default* in order to
win benchmarks**

If you don't issue getLastError(), MongoDB doesn't wait for any
confirmation from the database that the command was processed.
This introduces at least two classes of problems:

 * In a concurrent environment (connection pools, etc), you may
   have a subsequent read fail after a write has "finished";
   there is no barrier condition to know at what point the
   database will recognize a write commitment
 * Any unknown number of save operations can be dropped on the floor
   due to queueing in various places, things outstanding in the TCP
   buffer, etc, when your connection drops, if the db were to be KILL'd
   or segfault, hardware crash, you name it
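
As a hedged illustration of the getLastError point (a sketch assuming the pymongo driver of that era; current drivers express this through write concerns instead): passing safe=True makes the driver issue getLastError and wait for the server's acknowledgement rather than firing and forgetting.

from pymongo import Connection
from pymongo.errors import OperationFailure

coll = Connection("localhost", 27017).testdb.things

coll.insert({"x": 1})                 # default: no getLastError, no confirmation
try:
    coll.insert({"x": 2}, safe=True)  # driver calls getLastError and raises on failure
except OperationFailure as exc:
    print("write was not acknowledged: %s" % exc)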

**2. MongoDB can lose data in many startling ways**

Here is a list of ways we personally experienced records go missing:

 1. They just disappeared sometimes.  Cause unknown.
 2. Recovery on corrupt database was not successful,
    pre transaction log.
 3. Replication between master and slave had *gaps* in the oplogs,
    causing slaves to be missing records the master had.  Yes,
    there is no checksum, and yes, the replication status showed
    the slaves as current
 4. Replication just stops sometimes, without error.  Monitor
    your replication status!

**3. MongoDB requires a global write lock to issue any write**

Under a write-heavy load, this will kill you.  If you run a blog,
you maybe don't care b/c your R:W ratio is so high.

**4. MongoDB's sharding doesn't work that well under load**

Adding a shard under heavy load is a nightmare.
Mongo either moves chunks between shards so quickly it DOSes
the production traffic, or refuses to move chunks altogether.

This pretty much makes it a non-starter for high-traffic
sites with heavy write volume.

**5. mongos is unreliable**

The mongod/config server/mongos architecture is actually pretty
reasonable and clever.  Unfortunately, mongos is complete
garbage.  Under load, it crashed anywhere from every few hours
to every few days.  Restart supervision didn't always help b/c
sometimes it would throw some assertion that would bail out a
critical thread, but the process would stay running.  Double
fail.

It got so bad the only usable way we found to run mongos was
to run haproxy in front of dozens of mongos instances, and
to have a job that slowly rotated through them and killed them
to keep fresh/live ones in the pool.  No joke.

**6. MongoDB actually once deleted the entire dataset**

MongoDB, 1.6, in replica set configuration, would sometimes
determine the wrong node (often an empty node) was the freshest
copy of the data available.  It would then DELETE ALL THE DATA
ON THE REPLICA (which may have been the 700GB of good data)
AND REPLICATE THE EMPTY SET.  The database should never never
never do this.  Faced with a situation like that, the database
should throw an error and make the admin disambiguate by
wiping/resetting data, or forcing the correct configuration.
NEVER DELETE ALL THE DATA.  (This was a bad day.)

They fixed this in 1.8, thank god.

**7. Things were shipped that should have never been shipped**

Things with known, embarrassing bugs that could cause data
problems were in "stable" releases--and often we weren't told
about these issues until after they bit us, and then only b/c
we had a super duper crazy platinum support contract with 10gen.

The response was to send us a hot patch that they were
calling an RC internally, and have us run that on our data.

**8. Replication was lackluster on busy servers**

Replication would often, again, either DOS the master, or
replicate so slowly that it would take far too long and
the oplog would be exhausted (even with a 50G oplog).

We had a busy, large dataset that we simply could
not replicate b/c of this dynamic.  It was a harrowing month
or two of finger crossing before we got it onto a different
database system.

**But, the real problem:**

You might object, my information is out of date; they've
fixed these problems or intend to fix them in the next version;
problem X can be mitigated by optional practice Y.

Unfortunately, it doesn't matter.

The real problem is that so many of these problems existed
in the first place.

Database developers must be held to a higher standard than
your average developer.  Namely, your priority list should
typically be something like:

 1. Don't lose data, be very deterministic with data
 2. Employ practices to stay available
 3. Multi-node scalability
 4. Minimize latency at 99% and 95%
 5. Raw req/s per resource

10gen's order seems to be, #5, then everything else in some
order.  #1 ain't in the top 3.

These failings, and the implied priorities of the company,
indicate a basic cultural problem, irrespective of whatever
problems exist in any single release:  a lack of the requisite
discipline to design database systems businesses should bet on.

Please take this warning seriously.

Comments (1)

Patrice Neff Nov 8, 2011

Also see the discussion on Hacker News: http://news.ycombinator.com/item?id=3202081

About Fluidinfo


Fluidinfo is an online information storage and search platform. Its design supports shared openly-writable metadata of any type and about anything, provides modern writable APIs, and allows data to be branded with domain names.

Read-only information storage is a major disadvantage for business in a world increasingly driven by lightweight social applications. In the natural world we routinely put metadata where it is most useful, e.g., bookmarks in books, post-it notes in specific locations, and name tags around necks at conferences. But in the digital world businesses and consumers usually cannot store metadata in its most useful location—in context—because traditional storage is not openly writable. Potential value is reduced when data is in read-only silos as related information cannot be combined or searched across. Flexibility and spontaneity are restricted when future needs must be anticipated or write permission must first be obtained.

The unique properties of Fluidinfo change this situation completely

Fluidinfo provides a universal metadata engine because it has an object for everything imaginable, just like Wikipedia has a web page for everything. Fluidinfo objects can always have data added to them by any user or application, so related metadata can be stored in the same place. This allows it to be combined in searches, and remixed to great effect, increasing its value over metadata held in isolated databases. Fluidinfo allows information owners to put their internet domain names onto data and has a simple, flexible, and powerful permissions system.

Humans are diverse and unpredictable. We create, share, and organize information in an infinite variety of ways. We've even built machines to process it. Yet for all their capacity and speed, using computers to work with information is often awkward and frustrating. We are allowed very little of the spontaneity that characterizes normal human behavior. Our needs must be anticipated in advance by programmers. Far too often we can look, but not touch.

Why isn't it easier to work with information using a computer?

At Fluidinfo we believe the answer lies in information architecture. A rigid underlying platform inhibits or prevents spontaneity. A new information architecture can be the basis for a new class of applications. It can provide freedom and flexibility to all applications, and these advantages could be passed on to users.

We've spent the last several years designing and building Fluidinfo to be just such an architecture.

Fluidinfo makes it possible for data to be social. It allows almost unlimited information personalization by individual users and applications, and also between them. This makes it simple to build a wide variety of applications that benefit from cooperation, and which are open to unanticipated future enhancements. Even more importantly, Fluidinfo facilitates and encourages the growth of applications that leave users in control of their own data.

SQLite 3 Date and Time Datatype


1.2 Date and Time Datatype

SQLite does not have a storage class set aside for storing dates and/or times. Instead, the built-in Date And Time Functions of SQLite are capable of storing dates and times as TEXT, REAL, or INTEGER values:

  • TEXT as ISO8601 strings ("YYYY-MM-DD HH:MM:SS.SSS").
  • REAL as Julian day numbers, the number of days since noon in Greenwich on November 24, 4714 B.C. according to the proleptic Gregorian calendar.
  • INTEGER as Unix Time, the number of seconds since 1970-01-01 00:00:00 UTC.

Applications can choose to store dates and times in any of these formats and freely convert between formats using the built-in date and time functions.
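
A quick sketch of the three storage choices using Python's sqlite3 module and SQLite's built-in date functions (an illustration, not taken from the SQLite documentation):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (t_text TEXT, t_real REAL, t_int INTEGER)")

# Store the same instant as ISO8601 TEXT, a Julian day REAL and Unix time INTEGER.
conn.execute("""INSERT INTO events
                VALUES (datetime('now'), julianday('now'), strftime('%s', 'now'))""")

# The built-in functions convert freely between the formats on the way back out.
row = conn.execute("""SELECT t_text,
                             datetime(t_real),
                             datetime(t_int, 'unixepoch')
                      FROM events""").fetchone()
print(row)  # all three columns come back as ISO8601 text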

Cassandra: Backing up data


Backing up data

Cassandra can snapshot data while online using nodetool snapshot. You can then back up those snapshots using any desired system, although leaving them where they are is probably the option that makes the most sense on large clusters.

With some combinations of operating system/jvm you may receive an error related to the inability to create a process during the snapshotting, such as this on Linux

Exception in thread "main" java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory

Remain calm. The operating system is trying to allocate for the "ln" process a memory space as large as the parent process (the cassandra server), even if it's not going to use it. So if you have a machine with 8GB of RAM and no swap, and you gave 6GB to the cassandra server, it will fail during this because the operating system will want 12GB of memory before allowing you to create the process.

This can be worked around, depending on the operating system, either by creating a swap file, taking the snapshot and then turning the swap off again, or by turning on "memory overcommit". Since the child process's memory is the same as the parent's until it performs the exec("ln") call, the operating system will not actually use the new memory and will just refer to the old pages, and everything will work.

Currently, only flushed data is snapshotted (not data that only exists in the commitlog). Run nodetool flush first and wait for that to complete, to make sure you get all data in the snapshot.
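
A minimal sketch of that flush-then-snapshot sequence driven from Python via subprocess (the nodetool commands are the ones named above; the host argument is an assumption):

import subprocess

def backup_node(host="localhost"):
    # Flush first so data that only exists in the commitlog makes it into sstables,
    # then take the snapshot itself.
    subprocess.check_call(["nodetool", "-h", host, "flush"])
    subprocess.check_call(["nodetool", "-h", host, "snapshot"])

backup_node()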

To revert to a snapshot, shut down the node, clear out the old commitlog and sstables, and move the sstables from the snapshot location to the live data directory.

Consistent backups

You can get an eventually consistent backup by flushing all nodes and snapshotting; no individual node's backup is guaranteed to be consistent but if you restore from that snapshot then clients will get eventually consistent behavior as usual.

There is no such thing as a consistent view of the data in the strict sense, except in the trivial case of writes with consistency level = ALL.

The Neo4j REST Server - Part1: Get it going!

Introduction
As requested and wished for by many, Neo4j has finally got its own standalone server mode, based on interaction via REST. The code is still very fresh and not thoroughly tested, but I thought I might write up some first documentation on it, based on the Getting Started with REST wiki page

Installation

The first version of the distribution can be downloaded from here: zip, tar.gz. After unpacking, you just go to the unpacked directory and run (on OSX/Linux - see the wiki entry for details on Windows)
$ ./bin/neo4j-rest start
which will start the Neo4j REST server at port 9999 and put the database files under a directory neo4j-rest-db/ (lazily with the first request). Now, let's point our browser (not Internet Explorer since it doesn't send any useful Accept-headers and will get JSON back, this will be fixed later) to http://localhost:9999 and we will see the following:



Things seem to be running! The reason for the HTML interface is the Browser sending Accept: text/html. Now, setting the Accept to application/json will produce
peterneubauer$ curl -H Accept:application/json -H Content-Type:application/json -v http://localhost:9999
* About to connect() to localhost port 9999 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 9999 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.19.7 (i386-apple-darwin10.2.0) libcurl/7.19.7 zlib/1.2.3
> Host: localhost:9999
> Accept:application/json
> Content-Type:application/json
* Connection #0 to host localhost left intact
* Closing connection #0
{
  "reference node":"http://localhost:9999/node/0"
}

Now, with "200 OK" this is a good starting point. We can see full references to the interesting starting points - the reference node and the index subsystem. Let's check out the reference node:
peterneubauer$ curl -H Accept:application/json -H Content-Type:application/json -v http://localhost:9999/node/0
* About to connect() to localhost port 9999 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 9999 (#0)
> GET /node/0 HTTP/1.1
> User-Agent: curl/7.19.7 (i386-apple-darwin10.2.0) libcurl/7.19.7 zlib/1.2.3
> Host: localhost:9999
> Accept:application/json
> Content-Type:application/json
>
{
"incoming typed relationships":"http://localhost:9999/node/0/relationships/in/{-list|&|types}",
"incoming relationships":"http://localhost:9999/node/0/relationships/in",
"data":{},
"traverse":"http://localhost:9999/node/0/traverse/{returnType}",
"all typed relationships":"http://localhost:9999/node/0/relationships/all/{-list|&|types}",
"outgoing typed relationships":"http://localhost:9999/node/0/relationships/out/{-list|&|types}",
}
Which gives us some info about what the Node 0 can do, how to get its relationships and properties and the syntax of how to construct queries for getting properties, creating relationships etc.

Insert some data

According to RESTful thinking, data creation is handled by POST, updates by PUT. Let's insert a node:
peterneubauer$ curl -X POST -H Accept:application/json -v localhost:9999/node
* About to connect() to localhost port 9999 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 9999 (#0)
> POST /node HTTP/1.1
> User-Agent: curl/7.19.7 (i386-apple-darwin10.2.0) libcurl/7.19.7 zlib/1.2.3
> Host: localhost:9999
> Accept:application/json
>
{
...
"data":{},
...
}
Resulting in a new node with the URL localhost:9999/node/1 (described by the "self" property in the JSON representation) and no properties set ("data":{}). The Neo4j REST API is really trying to be explicit about possible further destinations, making it self-describing even for new users, and of course abstracting away the server instance, which will make dealing with multiple Neo4j servers easier in the future. We can see the URIs for traversing, listing properties and relationships. The PUT semantics on properties work like for nodes.
We delete the node again with
curl -X DELETE  -v localhost:9999/node/1

and get 204 - No Content back. The Node is gone and will give a 404 - Not Found if we try to GET it again.

The Matrix

Now with properties encoded in JSON we can easily start to create our little Matrix example:



In order to create relationships, we do a POST on the originating Node and post the relationship data along with the request (escaping the whitespaces and other special characters):
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Mr. Andersson"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Morpheus"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Trinity"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Cypher"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"Agent Smith"}' -v localhost:9999/node
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"name":"The Architect"}' -v localhost:9999/node

Getting http://localhost:9999/node/1, http://localhost:9999/node/2, http://localhost:9999/node/3 as the new URIs back. Now, we can connect the persons (the escaping ruins readability a bit ...):
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/1","type":"ROOT"}' -v http://localhost:9999/node/0/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/2","type":"KNOWS"}' -v http://localhost:9999/node/1/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/3","type":"KNOWS"}' -v http://localhost:9999/node/2/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/4","type":"KNOWS"}' -v http://localhost:9999/node/2/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/5","type":"KNOWS"}' -v http://localhost:9999/node/4/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/6","type":"CODED BY"}' -v http://localhost:9999/node/5/relationships
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"to":"http://localhost:9999/node/1","type":"LOVES"}' -v http://localhost:9999/node/3/relationships
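
The same node and relationship creation can also be driven from a script; here is a minimal Python sketch using the requests library (the library choice and helper names are mine, only the URIs and payloads come from the examples above):

import requests

BASE = "http://localhost:9999"
HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}

def create_node(properties):
    r = requests.post(BASE + "/node", json=properties, headers=HEADERS)
    r.raise_for_status()
    return r.json()["self"]  # URI of the new node, e.g. http://localhost:9999/node/1

def relate(from_uri, to_uri, rel_type):
    r = requests.post(from_uri + "/relationships",
                      json={"to": to_uri, "type": rel_type}, headers=HEADERS)
    r.raise_for_status()

andersson = create_node({"name": "Mr. Andersson"})
morpheus = create_node({"name": "Morpheus"})
relate(andersson, morpheus, "KNOWS")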

Now, pointing our browser at http://localhost:9999/node/3/relationships/all will list all relationships of Trinity:



Our first traversal

To start with, the Neo4j default Traverser framework (updated to be more powerful than the current one) is supported in REST, and other implementations like Gremlin and Pipes are to follow. The documentation on the traversals is in the making here. There are a number of different parameters:
http://localhost:9999/node/3/traverse/node specifies a return type of "node", returning node references. There are other return types such as relationship, position and path, returning other interesting info respectively. The Traverser description is pluggable and has default values - a full description looks like
{
"order": "depth first",
"uniqueness": "node path",
"relationships": [
{ "type": "KNOWS", "direction": "out" },
{ "type": "LOVES" }
],
"prune evaluator": {
"language", "javascript",
"body", "position.node().getProperty('date')>1234567;"
},
"return filter": {
"language": "builtin",
"name", "all"
},
"max depth": 2
}

To note here is the pluggable description of the "return filter" (what to include in the return) and "prune evaluator" (where to stop traversing). Right now only JavaScript is supported for writing these more complicated constructs up, but other languages are coming. Very cool. To finish, let's get all the nodes at depth 1 from Trinity via trivial traversal:
curl -X POST -H Accept:application/json -H Content-Type:application/json -d '{"order":"breadth first"}' -v http://localhost:9999/node/3/traverse/node

Which just returns all nodes of all relationship types at depth one (default) as a JSON Array of node descriptions as above, in this case http://localhost:9999/node/1 and http://localhost:9999/node/2.

Summary

Having the Neo4j REST API and with it the Neo4j REST Server coming along is great news for everyone who wants to use a graph database over the network, especially PHP or .NET clients that have no good Java bindings. A first client wrapper for .NET by Magnus Mårtensson from Jayway is already underway, and a first PHP client is on Al James' GitHub.
This will even pave the way for higher-level sharding and distribution scenarios and can be used in many other ways. Stay tuned for a deeper explanation of the different traversal possibilities with Neo4j and REST in a next post!