Start now
Many improvements have been made in this new version -- for one thing, it's compatible with both Python 2 and Python 3. One of the biggest changes is that Beautiful Soup no longer uses its own parser; instead it chooses from what's available on your system, preferring the blazingly-fast lxml but falling back to other parsers and using Python's batteries-included parser if nothing else is available.
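The fallback behaviour described above can be sketched with the standard library alone. This is only an illustration of the preference order, not Beautiful Soup's own code: `pick_parser` and `PREFERENCE` are hypothetical names, and treating html5lib as the middle option is an assumption about what "other parsers" means.

```python
from importlib.util import find_spec

# (parser name, module that must be importable); None means "always available".
# lxml first and the built-in parser last, as the text describes; html5lib in
# the middle is an assumption.
PREFERENCE = [("lxml", "lxml"), ("html5lib", "html5lib"), ("html.parser", None)]


def pick_parser(is_installed=lambda module: find_spec(module) is not None):
    """Return the name of the most preferred parser that is installed."""
    for parser_name, module in PREFERENCE:
        if module is None or is_installed(module):
            return parser_name
    return "html.parser"
```

With Beautiful Soup 4 installed, the result can be passed as the second argument when building the soup, as in `BeautifulSoup(markup, pick_parser())`.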
A useful tip if you're on Windows: you can find a pre-compiled Windows version of lxml here. That site has lots of pre-compiled Python extensions which is extremely helpful, as some of these packages (like lxml) otherwise require some serious gyrations in order to install them on your Windows box. (I work regularly on Mac, Windows 7 and Ubuntu Linux, in order to ensure that whatever I'm working on is cross-platform.)
Beautiful Soup has been refactored in many places. Sometimes these changes constitute a significant improvement to the programming model; other times they simply conform to Python naming conventions or ensure Python 2/3 compatibility.
The author Leonard Richardson is open to suggestions for improvements, so if you've had a feature request sitting on your back burner, now's the time.
Here's the introductory link to the Beautiful Soup 4 Beta.
I've been using Beautiful Soup to process a book that I'm coauthoring via Google Docs. We can work on the book remotely at the same time, which is something I've tried to do with other technologies via screen sharing. It works best with Google Docs because there's no setup necessary if we want to have a phone conversation about the document while working on it. Then I download the book in HTML format and apply the Beautiful Soup tools I've written to process the HTML. Although I've spent a fair amount of time on these, the investment is worth it because HTML isn't going away anytime soon so my Beautiful Soup skills should come in handy again and again.
Friday, 25 November 2011
To be a full-fledged format on the Web, you need to support links -- something sorely missing in JSON, which many have noticed lately.
In fact, too many; everybody seems to be piling on with their own take on how a link should look in JSON. Rather than adding to the pile (just yet), I thought I'd look around a bit first.
What am I looking for? Primarily, a way to serialise typed links (as defined by RFC5988, "Web Linking") into JSON, just like they can be into Atom, HTTP headers and (depending on the HTML5 WG's mood today), HTML.
5988 isn't perfect by any stretch (primarily because it was an after-the-fact compromise), but it does sketch out a path for typed links to become a first-class, format-independent part of the Web -- as they well should be, since URIs are the most important leg holding up the Web.
My immediate use case is being able to generically pull links out of JSON documents so that I can "walk" an HTTP API, as alluded to previously.
I'm also going in with a bit of caution, because we have at least one proof that getting a generic linking convention to catch on is hard; see XLink.
Now, to the contenders.
JSON-LD is a JSON format with a Linked Data (née Semantic Web) twist.
{
"@context": "http://purl.org/jsonld/Person",
"@subject": "http://dbpedia.org/resource/John_Lennon",
"name": "John Lennon",
"birthday": "10-09",
"member": "http://dbpedia.org/resource/The_Beatles"
}
Obviously, if you want to fit RDF into JSON, this is what you'd be looking at. Which is great, but most of the developers in the world aren't (yet) interested in this (no matter how hard the proponents push for it!). It also fails to provide a mapping from 5988; where do I put the link relation type?
I've seen a fair bit of advocacy for JSON-LD, especially on Twitter, but in almost every instance, I've seen the non-believers push back.
If JSON-LD is too complex / high-level, its opposite would be JSON Reference, a new-ish Internet-Draft by Kris Zyp and Paul Bryan (who are also working on JSON Schema, JSON Pointer and JSON PATCH, currently being discussed in the APPSAWG).
It's effectively a one-page spec, where a link looks like:
{ "$ref": "http://example.com/example.json#/foo/bar" }
This is effectively a static serialisation of a type they've defined in JSON Schema, a sort of "meta-schema for links." I'd previously pushed back on that, because it effectively requires schema support / understanding to get the links out of the document -- a real non-starter in many scenarios.
So, I like the concreteness. However, it still lacks any way to talk about relation types, or pop on other metadata; while this could be grafted on separately, the whole point is to have one way to do it.
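To make the "no schema needed" point concrete, here is a minimal sketch of pulling every JSON Reference out of a document by looking for the "$ref" marker alone. The helper name `find_refs`, the sample document and the slash-separated location strings are my own illustration, not part of the draft.

```python
import json


def find_refs(node, path=""):
    """Yield (location, target) for every object carrying a string "$ref" member."""
    if isinstance(node, dict):
        if isinstance(node.get("$ref"), str):
            yield path or "/", node["$ref"]
        for key, value in node.items():
            yield from find_refs(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_refs(value, f"{path}/{index}")


# Hypothetical document; only the "$ref" shape comes from the draft.
doc = json.loads("""
{
  "title": "example",
  "author": { "$ref": "http://example.com/example.json#/foo/bar" },
  "related": [ { "$ref": "http://example.com/other.json" } ]
}
""")

refs = list(find_refs(doc))
```

Note that all the walker can report is where a link sits and where it points; there is nowhere to read a relation type from.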
Another attempt is HAL, by Mike Kelly. The JSON portion of his serialisation looks like this:
{
"_links": {
"self": { "href": "/orders" },
"next": { "href": "/orders?page=2" },
"search": { "href": "/orders?id={order_id}" }
}
}
Here, links are an object whose members are link relations (yay!).
I'm a bit concerned, however, that this object might be a bit awkward to drop into some formats; it relies on _links to identify the structure, so in some places, you'd need a two-level deep object to convey just a simple link.
Also, there doesn't appear to be any way to link to more than one URI of a given relation type.
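As a rough sketch of what consuming this looks like, the links can be read straight off the "_links" member. The function name `hal_links` is hypothetical, and the code assumes every link object has an "href" member, as in the example above.

```python
import json


def hal_links(doc):
    """Return {relation: href} from the "_links" member of a HAL-style object."""
    return {rel: link["href"] for rel, link in doc.get("_links", {}).items()}


doc = json.loads("""
{
  "_links": {
    "self": { "href": "/orders" },
    "next": { "href": "/orders?page=2" },
    "search": { "href": "/orders?id={order_id}" }
  }
}
""")

links = hal_links(doc)
next_page = links["next"]
```

Because the relation type is the object key, each relation maps to exactly one link here, which is the limitation noted above.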
I don't have a specific thing in mind yet, and it's entirely possible that any of these proposals could be adapted to my needs (or others, I'm sure that they're out there).
I do have some requirements for consideration, though, along with a few sketches of ideas.
It should be really easy to find links in a document; as discussed above, requiring use of a schema is a non-starter.
This means that there needs to be some sort of marker in the link data structure to trigger the link semantics (e.g., how JSON Reference uses "$ref"), and ideally some way to indicate on the document itself that it's using that convention (to avoid collisions, and help processors find these documents).
JSON Reference does this with a media type parameter:
Content-Type: application/json; profile=http://json-schema.org/json-ref
While at a glance that seems reasonable, I have two concerns; first, that JSON itself doesn't define a profile parameter on the media type (this needs to be done properly), and more seriously, that you can't declare conformance to multiple such conventions using this mechanism.
For example, if I want to say that my JSON conforms to this convention on links, and another convention about (say) namespaces, I'm out of luck.
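For illustration, this is roughly how a processor might read that parameter, using Python's standard `email.message` module to parse the header value. The helper name `media_type_param` is hypothetical.

```python
from email.message import Message


def media_type_param(content_type, name):
    """Return one parameter of a Content-Type header value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param(name)


header = "application/json; profile=http://json-schema.org/json-ref"
profile = media_type_param(header, "profile")
```

The parameter carries a single value, so a second convention has nowhere to go without inventing a list syntax for it.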
I was in Boston recently (for the OpenStack design summit), and during a lull went up to the W3C offices to have lunch. Having this very much on the mind, I asked TimBL for his take, and we sketched out a sort of JSON metadata container, something like:
{
"_meta": {
"json-linking": null,
"json-namespaces": ["foo", "bar", "baz"]
},
// … rest of doc here
}
The exact format and content isn't important here; the idea is having a controlled way (i.e., the keys would probably be in a registry somewhere, and/or URIs) of adding annotations to a JSON document about its contents.
This is not an envelope; it's just a document metadata container, much like HEAD in HTML. You could put links up there too, I suppose, if it'd help. The important part is that you'd know to look for one of *those* because the media type of the document (or one of its parameters, if we go that way) indicates it's in use.
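A processor's side of this could be as small as the sketch below. The key names are taken from the example above; the function `declared_conventions` is hypothetical, and since JSON has no comments, the "rest of doc" placeholder is replaced with an ordinary member.

```python
import json


def declared_conventions(doc):
    """Return the annotations in a document's "_meta" member (empty if absent)."""
    meta = doc.get("_meta") if isinstance(doc, dict) else None
    return meta if isinstance(meta, dict) else {}


doc = json.loads("""
{
  "_meta": {
    "json-linking": null,
    "json-namespaces": ["foo", "bar", "baz"]
  },
  "name": "example"
}
""")

conventions = declared_conventions(doc)
uses_linking = "json-linking" in conventions
namespaces = conventions.get("json-namespaces", [])
```

Unlike a single media type parameter, several conventions can be declared side by side, each with its own configuration value.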
What do people think? Is this JSONic enough (whatever that means)?
As mentioned in the discussion of HAL above, the link convention needs to be easy to insert in a format, and not be too convoluted. This probably means an object that's "marked" with a particular, reserved key, much as JSON Reference does it. A list of links then becomes an array of objects, which seems pretty natural.
Again, as discussed, a mapping to RFC5988 is important to me -- both so that links can be serialised in various formats, with reasonable fidelity, and so that we can pivot from talking about "RESTful" APIs in terms of URIs to talking about them in terms of formats and link relations, as Roy advocates:
A REST API should spend almost all of its descriptive effort in defining the media type(s) used for representing resources and driving application state, or in defining extended relation names and/or hypertext-enabled mark-up for existing standard media types. Any effort spent describing what methods to use on what URIs of interest should be entirely defined within the scope of the processing rules for a media type (and, in most cases, already defined by existing media types).
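To show what a mapping to RFC5988 could mean in practice, here is a sketch that takes one link held as a JSON-style object and serialises it as a Link header value. The object shape (an "href" member plus one member per link parameter) and the name `to_link_header` are assumptions of mine, not a proposal from any of the specs above, and the code does no escaping of quotes inside values.

```python
import json


def to_link_header(link):
    """Serialise a link object as an RFC 5988 Link header value."""
    params = "".join(
        f'; {name}="{value}"' for name, value in link.items() if name != "href"
    )
    return f'<{link["href"]}>{params}'


# Hypothetical link object: the target plus RFC 5988 parameters.
link = {"href": "/orders?page=2", "rel": "next", "title": "Next page"}

header_value = to_link_header(link)
json_value = json.dumps(link)
```

The same relation type and target survive in both serialisations, which is the fidelity requirement described above.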
The object defined needs to be explicitly extensible by the format it's in, so that link-specific metadata can be added. See the discussion of namespaces in JSON.
A complement to linking in JSON is linking to JSON. JSON Pointer looks really promising in this regard, and is getting a fair amount of buzz, but I wonder if another mechanism, analogous to xml:id, is necessary.
The use case here is when you want to link to a specific object in a document that may be changing over time; a JSON pointer can be brittle in the face of such change, while a document-unique identifier is much more stable.
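The brittleness is easy to demonstrate. Below is a simplified pointer resolver (slash-separated tokens, with the ~0 and ~1 escapes that later appeared in RFC 6901) next to a lookup by a document-unique identifier. The "id" member and both function names are hypothetical; nothing here is defined by the JSON Pointer draft.

```python
def resolve_pointer(doc, pointer):
    """Resolve a slash-separated JSON Pointer against a parsed document."""
    node = doc
    for token in pointer.split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def find_by_id(node, wanted):
    """Depth-first search for the object whose (hypothetical) "id" member matches."""
    if isinstance(node, dict):
        if node.get("id") == wanted:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_by_id(child, wanted)
        if found is not None:
            return found
    return None


before = {"items": [{"id": "a", "name": "first"}, {"id": "b", "name": "second"}]}
# The same document after someone inserts a new entry at the front.
after = {"items": [{"id": "new", "name": "inserted"}] + before["items"]}

old_target = resolve_pointer(before, "/items/1")["name"]
new_target = resolve_pointer(after, "/items/1")["name"]
stable_target = find_by_id(after, "b")["name"]
```

After the insertion, the pointer `/items/1` silently resolves to a different object, while the identifier lookup still finds the original one.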
This makes a lot of sense in XML, which is primarily a document format. I'm not sure about whether the justification is as strong in JSON, which is primarily a data representation format, but it's worth talking about.
Again, there's not much value in having fifteen ways to serialise a link in JSON; it will end up a pretty ugly mess.

Casein provides scaffolding generators and helper functions to quickly create a lightweight CRUD interface for your data, along with a pre-rolled user authentication system.
Casein is designed to be completely decoupled from the front-end. Therefore it may be added to new or existing Rails projects, or even used as a standalone CMS to drive platforms built on other technologies.
→ Casein on GitHub
(Download or fork the source, and browse documentation)
Generate a modern and minimal CRUD interface for your data using the supplied Rails generators.
The generated views and controller logic can be infinitely customised as required. Casein comes with a range of helper functions to get you started.
User authentication and basic management to support developers and clients is pre-rolled and ready to go.
While the central mission of Casein is to provide developers with a basic CMS platform and the freedom to customise as required, we do have ideas of future features that may find their way into the project core.
If you're using Casein and are working to add additional functionality then let us know.
In the meantime, this is our current roadmap:
Here are some example websites and applications running on Casein. Do you have a project built with Casein? Let us know and we will add it to the list.

Multi-lingual website with many different functionalities implemented. Developed by Spoiled Milk.
→ www.popscan.com

Web application in Flash for an art-promoting organization. Developed by Spoiled Milk.
→ www.sound-development.com

Website for Swiss music sensation with various media management. Developed by Spoiled Milk.
→ www.stressmusic.com

Portfolio website for sound designer Martin Straka. Developed by Pixelate.
→ www.martinstraka.de

A subscription-based app delivering stories, films, interviews and comics is driven by a Casein server platform. Developed by Russell Quinn.
→ iphone.mcsweeneys.net

Online hall of fame backend for Mr. Bounce iPhone game managed with Casein. Developed by Pixelate.
→ www.pixelate.de/games/mr-bounce-iphone/

As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.
We strive to be transparent in all of our operations and we support nofollow and robots.txt. For more information about the ccBot, please see FAQ. For more information on Common Crawl data and how to access it, please see Data.
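As an illustration of what robots.txt support means for a site owner, Python's standard `urllib.robotparser` can check what a given set of rules allows for a crawler identifying itself as ccBot. The rules and URLs below are made up for the example, and this says nothing about how the crawler itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for a site that hides one directory from ccBot.
robots_txt = """\
User-agent: ccBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("ccBot", "http://example.com/index.html")
blocked = parser.can_fetch("ccBot", "http://example.com/private/report.html")
```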
A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2, was announced today by the Common Crawl Foundation. "It is crucial [in] our information-based society that Web crawl data be open and accessible to anyone who desires to utilize it," writes Foundation director Lisa Green on the organization's blog.
The Foundation is an organization dedicated to leveraging the falling costs of crawling and storage for the benefit of "individuals, academic groups, small start-ups, big companies, governments and nonprofits." It's led by Gilad Elbaz, the forefather of Google AdSense and the CEO of data platform startup Factual. Joining Elbaz on the Foundation board are internet public domain champion Carl Malamud and semantic web serial entrepreneur Nova Spivack. Director Lisa Green came to the Foundation by way of Creative Commons.
The Foundation explains the scope of the project thusly.
"Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster.
"Luckily for us, Amazon's EC2/S3 cloud computing infrastructure provides us with both a theoretically unlimited storage capacity coupled with localized access to an elastic compute cloud."
The organization was formed three years ago but has only now started talking about itself publicly. It believes that free access to all this information could lead to "a new wave of innovation, education and research."
Open Web Advocate James Walker agrees: "An openly accessible archive of the web - that's not owned and controlled by Google - levels the playing field pretty significantly for research and innovation."
A free tool for designers and web developers. It parses your CSS and returns a copy with all external media "baked" right into it as Base64-encoded datasets. The number of time-consuming HTTP requests on your website is decreased significantly, resulting in a massive speed boost (server-side gzip compression must be enabled).
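The general technique can be sketched as follows: find each `url(...)` in the stylesheet and replace it with a Base64 data URI. This is not the tool's own code; `bake_css`, `to_data_uri` and the `load_asset` callback (which returns the file's bytes and MIME type) are hypothetical names, and the regular expression only handles simple `url()` values.

```python
import base64
import re

# Matches url(path), url('path') and url("path").
URL_PATTERN = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)")


def to_data_uri(data, mime_type):
    """Encode raw bytes as a Base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def bake_css(css, load_asset):
    """Replace every external url(...) in css with an inline data URI.

    load_asset(path) is supplied by the caller and returns (bytes, mime_type).
    """
    def replace(match):
        path = match.group(1)
        if path.startswith("data:"):
            return match.group(0)  # already inline, leave untouched
        data, mime_type = load_asset(path)
        return f"url({to_data_uri(data, mime_type)})"

    return URL_PATTERN.sub(replace, css)
```

Base64 inflates each asset by roughly a third, which is why compressing the resulting stylesheet on the server matters.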

Jul 19, 2011 msdn.microsoft.com
Jump List tasks are application-specific actions that are tailored to a website. By using Jump List tasks, your website can surface the most frequently used commands to users. You should define the Jump List tasks based on both the website's features and the key actions a user is expected to undertake with them. The tasks provide a set of static URIs that users can access at any time, even if the browser instance is not running. Furthermore, these tasks provide a mechanism for your website to promote their most common destinations to users even when the user is not visiting your site. For instance, a web-based communication application could surface commands enabling users to quickly access their contacts, inbox, and profile information.

Figure 7: Jump List tasks associated with a communication site
All Jump List tasks are directly accessed by using a static URL path that is stored inside the .website file. Tasks are not expected to change frequently; however, they can be updated by modifying the meta elements on the webpage. Changes take effect the next time the user launches the pinned site, rather than when they are initially loaded by the browser.
You define Jump List tasks by using HTML meta tags. When accessing a pinned website, Windows caches and applies these tags during installation. URLs defined in tasks are not restricted to a domain. The following code example defines two Jump List tasks on a webpage: Task 1 and Task 2. When the user clicks Task 1, the pinned site window launches Page1.html. Similarly, when the user clicks Task 2, the pinned site window launches Page2.html on the microsoft.com domain.
<META name="msapplication-task" content="name=Task 1;action-uri=http://host/Page1.html;icon-uri=http://host/icon1.ico"/>
<META name="msapplication-task" content="name=Task 2;action-uri=http://microsoft.com/Page2.html;icon-uri=http://host/icon2.ico"/>
The pinned site window opens all tasks inside their own tab in the current pinned site window. If no browser instance exists, a new one is created. A website can define a maximum of five tasks. Relative URLs in the action-uri field are resolved during install by using the URI of the page that contains the meta information.
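To make the format of the content attribute concrete, here is a small sketch that reads msapplication-task meta elements out of a page and resolves relative action URIs against the page's own URI, as described above. It uses Python's standard HTML parser and is only an illustration of the data format, not of how the browser processes it; the class name `TaskParser` and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TaskParser(HTMLParser):
    """Collect msapplication-task meta elements from a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.tasks = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag != "meta" or attributes.get("name") != "msapplication-task":
            return
        # content is a semicolon-separated list of key=value fields.
        fields = {}
        for part in (attributes.get("content") or "").split(";"):
            key, separator, value = part.partition("=")
            if separator:
                fields[key.strip()] = value.strip()
        # Relative action URIs are resolved against the containing page.
        if "action-uri" in fields:
            fields["action-uri"] = urljoin(self.page_url, fields["action-uri"])
        self.tasks.append(fields)


page = """
<META name="msapplication-task" content="name=Task 1;action-uri=Page1.html;icon-uri=http://host/icon1.ico"/>
<META name="msapplication-task" content="name=Task 2;action-uri=http://microsoft.com/Page2.html;icon-uri=http://host/icon2.ico"/>
"""

parser = TaskParser("http://host/site/index.html")
parser.feed(page)
tasks = parser.tasks
```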
Meta elements representing tasks can be updated by sites at any time. Changes to the Jump List tasks will be reflected the next time the site is launched.