memonic

TinyPNG - Shrink your PNG files


How does it work? Excellent question! When you upload a PNG (Portable Network Graphics) file, similar colours in your image are combined. This technique is called "quantisation". Because the number of colours is reduced, 24-bit PNG files can be converted to much smaller 8-bit indexed colour images. All unnecessary metadata is stripped too.
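
The same basic idea can be reproduced with the Pillow imaging library in Python; this is only an illustrative sketch of palette quantisation, not TinyPNG's actual algorithm:

from PIL import Image

# Load a 24-bit RGB PNG and quantise it down to an 8-bit indexed palette.
# Similar colours are mapped onto a reduced palette of at most 256 entries,
# which is the technique described above.
img = Image.open("input.png").convert("RGB")
indexed = img.quantize(colors=256)
indexed.save("output.png", optimize=True)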

Beautiful Soup 4 Now in Beta

Computing Thoughts
Beautiful Soup 4 Now in Beta
by Bruce Eckel
February 23, 2012

Many improvements have been made in this new version -- for one thing, it's compatible with both Python 2 and Python 3. One of the biggest changes is that Beautiful Soup no longer uses its own parser; instead it chooses from what's available on your system, preferring the blazingly-fast lxml but falling back to other parsers and using Python's batteries-included parser if nothing else is available.
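
As a quick illustration (a minimal sketch, not taken from the article), you can either let Beautiful Soup 4 pick the best parser it finds installed, or name one explicitly:

from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>Hello</p></body></html>"

# With no parser named, BS4 uses the best one available:
# lxml if installed, falling back to Python's built-in parser.
soup = BeautifulSoup(html)

# Or request a specific parser explicitly:
soup = BeautifulSoup(html, "lxml")
print(soup.find("p", class_="intro").get_text())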

A useful tip if you're on Windows: you can find a pre-compiled Windows version of lxml here. That site has lots of pre-compiled Python extensions, which is extremely helpful, as some of these packages (like lxml) otherwise require some serious gyrations in order to install them on your Windows box. (I work regularly on Mac, Windows 7 and Ubuntu Linux, in order to ensure that whatever I'm working on is cross-platform.)

Beautiful Soup has been refactored in many places; sometimes these changes constitute a significant improvement to the programming model, other times the changes are just to conform to the Python naming syntax or to ensure Python 2/3 compatibility.

The author Leonard Richardson is open to suggestions for improvements, so if you've had a feature request sitting on your back burner, now's the time.

Here's the introductory link to the Beautiful Soup 4 Beta.

I've been using Beautiful Soup to process a book that I'm coauthoring via Google Docs. We can work on the book remotely at the same time, which is something I've tried to do with other technologies via screen sharing. It works best with Google Docs because there's no setup necessary if we want to have a phone conversation about the document while working on it. Then I download the book in HTML format and apply the Beautiful Soup tools I've written to process the HTML. Although I've spent a fair amount of time on these, the investment is worth it because HTML isn't going away anytime soon so my Beautiful Soup skills should come in handy again and again.

mnot’s blog: Linking in JSON


Friday, 25 November 2011

Linking in JSON

To be a full-fledged format on the Web, you need to support links -- something sorely missing in JSON, which many have noticed lately.

In fact, too many; everybody seems to be piling on with their own take on how a link should look in JSON. Rather than adding to the pile (just yet), I thought I'd look around a bit first.

What am I looking for? Primarily, a way to serialise typed links (as defined by RFC5988, "Web Linking") into JSON, just like they can be into Atom, HTTP headers and (depending on the HTML5 WG's mood today), HTML.

5988 isn't perfect by any stretch (primarily because it was an after-the-fact compromise), but it does sketch out a path for typed links to become a first-class, format-independent part of the Web -- as they well should be, since URIs are the most important leg holding up the Web.
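
For reference, a 5988 typed link is essentially just a target URI, a relation type, and some optional attributes; serialised as an HTTP Link header and as an HTML link element it looks like this (example values only, not from the post):

Link: <http://example.com/chapter/2>; rel="next"; title="Chapter 2"

<link rel="next" href="http://example.com/chapter/2" title="Chapter 2">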

My immediate use case is being able to generically pull links out of JSON documents so that I can "walk" an HTTP API, as alluded to previously.

I'm also going in with a bit of caution, because we have at least one proof that getting a generic linking convention to catch on is hard; see XLink.

Now, to the contenders.

JSON-LD: JSON for Linking Data

JSON-LD is a JSON format with a Linked Data (nee: Semantic Web) twist.

{
  "@context": "http://purl.org/jsonld/Person",
  "@subject": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "birthday": "10-09",
  "member": "http://dbpedia.org/resource/The_Beatles"
}

Obviously, if you want to fit RDF into JSON, this is what you'd be looking at. Which is great, but most of the developers in the world aren't (yet) interested in this (no matter how hard the proponents push for it!). It also fails to provide a mapping from 5988; where do I put the link relation type?

I've seen a fair bit of advocacy for JSON-LD, especially on Twitter, but in almost every instance, I've seen the non-believers push back.

JSON Reference

If JSON-LD is too complex / high-level, its opposite would be JSON Reference, a new-ish Internet-Draft by Kris Zyp and Paul Bryan (who are also working on JSON Schema, JSON Pointer and JSON Patch, currently being discussed in the APPSAWG).

It's effectively a one-page spec, where a link looks like:

{ "$ref": "http://example.com/example.json#/foo/bar" }

This is effectively static serialisation of a type they've defined in JSON Schema, a sort of "meta-schema for links." I'd previously pushed back on that, because it effectively requires schema support / understanding to get the links out of the document -- a real non-starter in many scenarios.

So, I like the concreteness. However, it still lacks any way to talk about relation types, or pop on other metadata; while this could be grafted on separately, the whole point is to have one way to do it.

HAL - Hypertext Application Language

Another attempt is HAL, by Mike Kelly. The JSON portion of his serialisation looks like this:

{
  "_links": {
    "self": { "href": "/orders" },
    "next": { "href": "/orders?page=2" },
    "search": { "href": "/orders?id={order_id}" }
  }
}

Here, links are an object whose members are link relations (yay!).

I'm a bit concerned, however, that this object might be a bit awkward to drop into some formats; it relies on _links to identify the structure, so in some places, you'd need a two-level deep object to convey just a simple link.

Also, there doesn't appear to be any way to link to more than one URI of a given relation type.

What I'd Really Like

I don't have a specific thing in mind yet, and it's entirely possible that any of these proposals could be adapted to my needs (or others, I'm sure that they're out there).

I do have some requirements for consideration, though, along with a few sketches of ideas.

Discoverable

It should be really easy to find links in a document; as discussed above, requiring use of a schema is a non-starter.

This means that there needs to be some sort of marker in the link data structure to trigger the link semantics (e.g., how JSON Reference uses "$ref"), and ideally some way to indicate on the document itself that it's using that convention (to avoid collisions, and to help processors find these documents).

JSON Reference does this with a media type parameter:

Content-Type: application/json; profile=http://json-schema.org/json-ref

While at a glance that seems reasonable, I have two concerns; first, that JSON itself doesn't define a profile parameter on the media type (this needs to be done properly), and more seriously, that you can't declare conformance to multiple such conventions using this mechanism.

For example, if I want to say that my JSON conforms to this convention on links, and another convention about (say) namespaces, I'm out of luck.

I was in Boston recently (for the OpenStack design summit), and during a lull went up to the W3C offices to have lunch. Having this very much on the mind, I asked TimBL for his take, and we sketched out a sort of JSON metadata container, something like:

{
  "_meta": {
    "json-linking": null,
    "json-namespaces": ["foo", "bar", "baz"]
  },
  // … rest of doc here
}

The exact format and content isn't important here; the idea is having a controlled way (i.e., the keys would probably be in a registry somewhere, and/or URIs) of adding annotations to a JSON document about its contents.

This is not an envelope; it's just a document metadata container, much like HEAD in HTML. You could put links up there too, I suppose, if it'd help. The important part is that you'd know to look for one of *those* because the media type of the document (or one of its parameters, if we go that way) indicates it's in use.

What do people think? Is this JSONic enough (whatever that means)?

Self-Contained

As mentioned in the discussion of HAL above, the link convention needs to be easy to insert in a format, and not be too convoluted. This probably means an object that's "marked" with a particular, reserved key, much as JSON Reference does it. A list of links then becomes an array of objects, which seems pretty natural.
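
To make that concrete, here's a sketch of what such a convention might look like; the "_link" key and the attribute names are purely illustrative, not a proposal from this post:

{
  "author": { "_link": "http://example.com/people/1", "rel": "author" },
  "chapters": [
    { "_link": "http://example.com/chapters/1", "rel": "item", "title": "One" },
    { "_link": "http://example.com/chapters/2", "rel": "item", "title": "Two" }
  ]
}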

Mappable to RFC5988

Again, as discussed, a mapping to RFC5988 is important to me -- both so that links can be serialised in various formats, with reasonable fidelity, and so that we can pivot from talking about "RESTful" APIs in terms of URIs to talking about them in terms of formats and link relations, as Roy advocates:

A REST API should spend almost all of its descriptive effort in defining the media type(s) used for representing resources and driving application state, or in defining extended relation names and/or hypertext-enabled mark-up for existing standard media types. Any effort spent describing what methods to use on what URIs of interest should be entirely defined within the scope of the processing rules for a media type (and, in most cases, already defined by existing media types).

Extensible

The object defined needs to be explicitly extensible by the format it's in, so that link-specific metadata can be added. See the discussion of namespaces in JSON.

Anchorable

A complement to linking in JSON is linking to JSON. JSON Pointer looks really promising in this regard, and is getting a fair amount of buzz, but I wonder if another mechanism, analogous to xml:id, is necessary.

The use case here is when you want to link to a specific object in a document that may be changing over time; a JSON pointer can be brittle in the face of such change, while a document-unique identifier is much more stable.
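
To illustrate (syntax is illustrative only): a JSON Pointer fragment addresses an object by position,

http://example.com/book.json#/chapters/2/title

which breaks if a chapter is inserted or the array is reordered, whereas an xml:id-style anchor would address it by a document-unique name that survives such edits:

{ "id": "chapter-intro", "title": "Introduction" }

http://example.com/book.json#chapter-intro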

This makes a lot of sense in XML, which is primarily a document format. I'm not sure about whether the justification is as strong in JSON, which is primarily a data representation format, but it's worth talking about.

Just One, Please

Again, there's not much value in having fifteen ways to serialise a link in JSON; it will end up a pretty ugly mess.


Casein - A lightweight Ruby on Rails CMS


Casein is an open source CMS for Ruby on Rails, originally developed by Spoiled Milk.

It provides scaffolding generators and helper functions to quickly create a lightweight CRUD interface for your data, along with a pre-rolled user authentication system.

Casein is designed to be completely decoupled from the front-end. Therefore it may be added to new or existing Rails projects, or even used as a standalone CMS to drive platforms built on other technologies.

→ Casein on GitHub

(Download or fork the source, and browse documentation)

Features

  • Quick start

    Generate a modern and minimal CRUD interface for your data using the supplied Rails generators.

  • Powerful customisation

    The generated views and controller logic can be infinitely customised as required. Casein comes with a range of helper functions to get you started.

  • User management

    User authentication and basic management to support developers and clients is pre-rolled and ready to go.

Get involved


While the central mission of Casein is to provide developers with a basic CMS platform and the freedom to customise as required, we do have ideas of future features that may find their way into the project core.

If you're using Casein and are working to add additional functionality then let us know.

In the meantime, this is our current roadmap:


  • Media uploader/picker widget
  • Active scaffolding version
  • Content versioning
  • Native support for has_many relationships
  • Full namespacing of the controllers
  • Built-in support for list sorting
  • Slugs—human-readable URLs

Built with Casein

Here are some example websites and applications running on Casein. Do you have a project built with Casein? Let us know and we will add it to the list.

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the


Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life.  It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses.  It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.

We strive to be transparent in all of our operations and we support nofollow and robots.txt. For more information about the ccBot, please see FAQ. For more information on Common Crawl data and how to access it, please see Data.

New 5 Billion Page Web Index with Page Rank Now Available for Free from Common Crawl Foundation


A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2, was announced today by the Common Crawl Foundation. "It is crucial [in] our information-based society that Web crawl data be open and accessible to anyone who desires to utilize it," writes Foundation director Lisa Green on the organization's blog.

The Foundation is an organization dedicated to leveraging the falling costs of crawling and storage for the benefit of "individuals, academic groups, small start-ups, big companies, governments and nonprofits." It's led by Gilad Elbaz, the father of Google AdSense and the CEO of data platform startup Factual. Joining Elbaz on the Foundation board are internet public domain champion Carl Malamud and semantic web serial entrepreneur Nova Spivack. Director Lisa Green came to the Foundation by way of Creative Commons.

The Foundation explains the scope of the project thusly.

"Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster.

"Luckily for us, Amazon's EC2/S3 cloud computing infrastructure provides us with both a theoretically unlimited storage capacity coupled with localized access to an elastic compute cloud."

The organization was formed three years ago, has only just now started talking about itself publicly, and believes that free access to all this information could lead to "a new wave of innovation, education and research."

Open Web Advocate James Walker agrees: "An openly accessible archive of the web - that's not owned and controlled by Google - levels the playing field pretty significantly for research and innovation."


ifttt / About ifttt


Put the internet to work for you by creating tasks that fit this simple structure: if this then that. Think of all the things you could do if you were able to define any task as: when something happens (this) then do something else (that). The this part of a task is the Trigger.
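
In code terms, a task is roughly a (trigger, action) pair. Here's a minimal sketch of that structure in Python; it has nothing to do with ifttt's actual implementation:

# A task pairs a trigger ("this") with an action ("that").
def make_task(trigger, action):
    def run(event):
        if trigger(event):    # when something happens (this)...
            action(event)     # ...then do something else (that)
    return run

# Hypothetical example: when a photo is tagged "holiday", post it somewhere.
task = make_task(
    trigger=lambda e: e.get("type") == "photo" and "holiday" in e.get("tags", []),
    action=lambda e: print("posting", e["url"]),
)
task({"type": "photo", "tags": ["holiday"], "url": "http://example.com/1.jpg"})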

Comments (1)


Patrice Neff Sep 10, 2011

A bit like email filters but for the whole Internet.

Shuush


Shuush is a prototype by BERG. It's a web-based Twitter reader that displays the updates of the people you follow in relation to the frequency of their tweets. It aims to amplify the people that don't usually get heard, and scale back those with frequent updates.
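
One plausible reading of "in relation to the frequency of their tweets" is an inverse-frequency weighting, sketched below; this is a guess at the approach, not BERG's actual code:

# Give more visual prominence to people who tweet rarely.
def prominence(tweets_per_day, max_scale=3.0):
    # Inverse relationship: frequent tweeters shrink, quiet voices grow.
    return min(max_scale, 1.0 / max(tweets_per_day, 1.0 / max_scale))

for name, rate in [("quiet_friend", 0.2), ("average", 1.0), ("firehose", 40.0)]:
    print(name, round(prominence(rate), 2))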

Spritebaker - Ridiculous easy Base64 encoding for Designers


A free tool for designers and web developers. It parses your CSS and returns a copy with all external media "baked" right into it as Base64-encoded data URIs. The number of time-consuming HTTP requests on your website is decreased significantly, resulting in a massive speed boost (server-side gzip compression must be enabled).
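
The underlying transformation is straightforward; here is a hedged sketch of the idea in Python (a simplification, not Spritebaker's implementation):

import base64
import mimetypes
import re
from urllib.parse import urljoin
from urllib.request import urlopen

def inline_css_urls(css, base_url):
    # Replace each url(...) reference with a Base64 data URI so the
    # browser needs no extra HTTP request to fetch that asset.
    def bake(match):
        url = urljoin(base_url, match.group(1).strip("'\" "))
        data = urlopen(url).read()
        mime = mimetypes.guess_type(url)[0] or "application/octet-stream"
        encoded = base64.b64encode(data).decode("ascii")
        return "url(data:%s;base64,%s)" % (mime, encoded)
    return re.sub(r"url\(([^)]+)\)", bake, css)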
