Probably my favourite project at local.ch was news.local.ch. That’s a site which crawls Swiss newspapers for news and classifies the stories by town. It only shows news which can be associated with a Swiss town. This leads to a good news collection even for very small villages.
In this post I’m going to outline the architecture and how we used Amazon’s web services to make building this a lot easier than it could have been. There will be a follow-up detailing how we used Python in this project.
I initially started the project as a proof of concept. We quickly found that it had a lot of potential. We also saw that there were a few components where we had to try different approaches to find out which one works best, mostly in the areas of crawling and geo classification. So Harry Fuecks and I decided to write a pipeline instead of one big application.
The pipeline in this case is implemented as eight different processes which communicate using a queue. Each component can be replaced without having to notify the other components.
This approach buys us four main advantages: easy load balancing, good performance, modularity and extensibility.
- Load balancing: Individual processes can run on different machines without having to touch a single line of code.
- Performance: To speed things up each pipeline component can be loaded more than once. This way we can linearly scale the throughput of the whole pipeline.
- Modularity: Every component is very small. Each of them can be read top to bottom in less than half an hour, so it’s very easy to debug problems.
- Extensibility: To add new functionality a new component can be inserted without having to touch other parts of the pipeline.
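The pipeline idea described above can be sketched in a few lines of Python. This is only an illustration and not our production code: it uses threads and the standard library's `queue.Queue` as stand-ins for separate processes and SQS, and the names `run_stage`, `run_pipeline`, `parse` and `classify` are made up for this example.

```python
import queue
import threading

STOP = object()  # sentinel telling a stage to shut down


def run_stage(work, inbox, outbox):
    """Read messages from inbox, apply work(), pass results to outbox."""
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            break
        result = work(message)
        if result is not None:  # a stage drops a message by returning None
            outbox.put(result)


def run_pipeline(stages, messages):
    """Wire the stages together with queues and push messages through."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for message in messages:
        queues[0].put(message)
    queues[0].put(STOP)
    for thread in threads:
        thread.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    return results


# Hypothetical stand-ins for two of the real components.
def parse(url):
    return {"url": url, "body": "Story from " + url}


def classify(article):
    article["towns"] = ["Bern"] if "bern" in article["url"] else []
    return article if article["towns"] else None
```

Because every stage only knows its inbox and its outbox, swapping `classify` for a different implementation or inserting a new stage between the two only changes the list passed to `run_pipeline`.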
All these small processes cause quite some overhead. The queue processing in our case takes up big chunks of the processing time because the work units are very small. So raw performance would be better with a monolithic process. But we decided that in our case the advantages of the pipeline system easily outweigh the disadvantages.
The currently implemented pipeline components are:
- Crawler: Parses newspaper article lists and extracts all URLs which lead to a story.
- Parser: Parses each story and extracts a structured article with title, lead, body, date, etc.
- Deduplicator: Finds duplicates and near-duplicates of past stories and handles them.
- Geo classifier: Attaches a list of Swiss towns to the article based on parsing the body.
- Region classifier: Attaches a list of regions to the article (usually a city with some towns around it).
- Image extractor: Gets the images which are used by the article and stores them under local.ch control.
- Indexer: Puts the article into an index from where it’s then served to the live site.
- Purger: This process is responsible for cleaning up some garbage when an article is to be deleted.
With the exception of the parser and the geo classifier, these are all very small programs, well under a hundred lines of code each.
Amazon Web Services
We do queueing with the SQS queue service. I like the service a lot with one small caveat. You have to poll the service to check if there are new messages. Ideally I’d be able to keep a connection open and be notified when a new message is available.
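To illustrate what polling means for a component, here is a minimal sketch of a polling loop that backs off while the queue is empty. The backoff is an assumption for this example, not necessarily what our components do. `receive` is a hypothetical callable standing in for the actual SQS call, and `max_idle_polls` and `sleep` only exist so the loop can be stopped and tested.

```python
import time


def poll(receive, handle, idle_sleep=1.0, max_sleep=30.0,
         max_idle_polls=None, sleep=time.sleep):
    """Call receive() repeatedly and back off while the queue stays empty."""
    delay = idle_sleep
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        messages = receive()  # returns a (possibly empty) list of messages
        if messages:
            for message in messages:
                handle(message)
            delay = idle_sleep  # traffic resumed, poll quickly again
            idle_polls = 0
        else:
            idle_polls += 1
            sleep(delay)
            delay = min(delay * 2, max_sleep)  # exponential backoff when idle
```

The trade-off is latency against the number of requests: a short sleep picks up new messages quickly but hits the service more often, a long one does the opposite.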
To store the structured news articles we use S3. Each article is one XML document. Additionally the images are stored on S3 as well. The local.ch binarypool handles that part.
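As a rough idea of what "one XML document per article" can look like, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element names simply mirror the fields the parser extracts (title, lead, body, date); the real schema differs and the function names are made up for this example. Uploading the resulting bytes to S3 is left out.

```python
import xml.etree.ElementTree as ET

FIELDS = ("title", "lead", "body", "date")  # hypothetical schema


def article_to_xml(article):
    """Serialize an article dict into a UTF-8 encoded XML document."""
    root = ET.Element("article")
    for field in FIELDS:
        ET.SubElement(root, field).text = article[field]
    return ET.tostring(root, encoding="utf-8")


def xml_to_article(data):
    """Parse the XML document back into a plain dict."""
    root = ET.fromstring(data)
    return {child.tag: child.text for child in root}
```

Keeping each article as a self-contained document means any pipeline component can fetch, change and store it again without knowing anything about the other components.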
In SimpleDB we store small lookup tables for duplicate detection. We wanted to use SimpleDB also for the article storage, but unfortunately it has a maximum value size of just 1 KB.