As a member of the IKS European project, Nuxeo contributes to the development of an Open Source software project named fise. Its goal is to help bring new semantic features to CMS by giving developers a stack of reusable HTTP semantic services to build upon.
As such concepts might be new to some readers, the first part of this blog post is presented as a Q&A.
A semantic engine is a software component that extracts the meaning of an electronic document to organize it as partially structured knowledge and not just as a piece of unstructured text content.
Current semantic engines can typically:
- categorize documents (is this document written in English, Spanish, Chinese? is this an article that should be filed under the Business, Lifestyle, Technology categories? ...);
- suggest meaningful tags from a controlled taxonomy and assess their relative importance with respect to the text content of the document;
- find related documents in the local database or on the web;
- extract and recognize mentions of known entities such as famous people, organizations, places, books, movies, genes, ... and link the document to their knowledge base entries (like a biography for a famous person);
- detect yet unknown entities of the same aforementioned types to enrich the knowledge base;
- extract knowledge assertions that are present in the text to fill up a knowledge base along with a reference to trace the origin of the assertion. Examples of such assertions could be the fact that a company is buying another along with the amount of the transaction, the release date of a movie, the new club of a football player...
During the last couple of years, many such engines have been made available through web-based APIs such as Open Calais, Zemanta and Evri, just to name a few. However, to our knowledge, few such engines are distributed under an Open Source license so that they can be used offline, on your private IT infrastructure, with your sensitive data.
Linking content items to semantic entities and topics that are defined in open universal databases (such as DBpedia, Freebase or the NY Times database) allows many content-driven applications, such as public websites or private intranets, to share a common conceptual frame and improve findability and interoperability.
Publishers can leverage such technologies to build automatically updated entity hubs that aggregate resources of different types (documents, calendar events, persons, organizations, ...) that are related to a given semantic entity identified by a disambiguated universal identifier that spans all applications.
You can test fise using the online demo, download a snapshot of the all-in-one executable jar launcher (67MB), or build your own instance from source. If you want to run a local instance, just launch it with a Java 6 virtual machine as follows:
java -Xmx512M -jar eu.iksproject.fise.launchers.sling-0.9-20100802.jar
Once the server is up and running, fise offers three HTTP endpoints: the engines, the store and the sparql endpoint:
- the /engines endpoint allows the user to analyse English text content and sends back the results of the analysis without storing anything on the server: this is a stateless HTTP service
- the /store endpoint does the same analysis but also stores the results on the fise server: this is a stateful HTTP service. Analysis results are then available for later browsing.
- the /sparql endpoint provides machine-level access to perform complex graph queries on the enhancements extracted from content items sent to the /store endpoint.
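To make the /sparql endpoint a bit more concrete, here is a minimal Python sketch that builds a query URL. It assumes the endpoint accepts the query as a `query` parameter of a GET request, as the standard SPARQL protocol does; this post does not confirm that detail, so check the inline REST documentation of your instance. The base URL `http://localhost:8080` and the catch-all query are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Hypothetical catch-all query: list ten arbitrary triples from the store.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_url(base_url, query):
    """Return a GET URL for the /sparql endpoint of a fise instance.

    Assumption: the endpoint reads the query from a `query` parameter,
    as in the standard SPARQL protocol.
    """
    return base_url.rstrip("/") + "/sparql?" + urlencode({"query": query})


# Hypothetical local instance; adjust host and port to your own setup.
sparql_url = build_sparql_url("http://localhost:8080", SAMPLE_QUERY)
```

The resulting URL can be opened with any HTTP client once some content has been sent to the /store endpoint.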
Let us focus on the /engines endpoint. The view first lists the active registered analysis components and then asks for user input. Type an English sentence that mentions famous or non-famous people, organizations and places such as countries and cities. If you are lazy, just copy and paste an article from a public news feed such as Wikinews and submit your content with "Run engines". Depending on the registered engines and the length of your content, the processing time will typically vary from less than one second to around a minute.
Submitting text content to the /engines endpoint using the web interface
By default, fise launches three engines in turn:
- the first engine performs named entity detection using the OpenNLP library: it tries to find occurrences of names of people, places and organizations based on a statistical model of the structure of English sentences
- the second engine tries to recognize the previously detected entities using a local Apache Lucene index of the top 10,000 most famous entities from DBpedia. This index is configurable and will be improved in future versions of fise.
- the last engine then asynchronously fetches additional data from DBpedia such as the GPS coordinates of places, thumbnails and Wikipedia descriptions of the recognized entities. Fetched entities are cached in the fise store for faster lookup the next time the entity is recognized. A summary of this information is then displayed in the fise UI as columns of entities, and a world map displays the locations:
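The chaining of the three engines described above can be sketched in a few lines of Python. This is only a toy illustration of the detect, recognize, enrich sequence, not the fise implementation: a regular expression stands in for the OpenNLP models, a small dictionary stands in for the Lucene index, the fetch step is synchronous rather than asynchronous, and every name in the code (`LOCAL_INDEX`, `ENTITY_CACHE`, `run_engines`, ...) is hypothetical.

```python
import re

# Toy stand-in for the local index of well-known entities (hypothetical data;
# the real engine queries a Lucene index built from DBpedia).
LOCAL_INDEX = {
    "Paris": "http://dbpedia.org/resource/Paris",
    "United Kingdom": "http://dbpedia.org/resource/United_Kingdom",
}

# Toy stand-in for the store that caches fetched entity data.
ENTITY_CACHE = {}


def detect_mentions(text):
    """Engine 1 (toy): treat runs of capitalized words as candidate names.

    The real engine relies on statistical models, not a regular expression.
    """
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


def recognize(mentions, index=LOCAL_INDEX):
    """Engine 2 (toy): link each mention to a known entity URI, or None."""
    return {mention: index.get(mention) for mention in mentions}


def enrich(links, fetch, cache=ENTITY_CACHE):
    """Engine 3 (toy): fetch extra data for recognized entities, with caching."""
    details = {}
    for mention, uri in links.items():
        if uri is None:
            continue  # detected but not recognized: nothing to fetch
        if uri not in cache:
            cache[uri] = fetch(uri)  # only hit the remote source once per URI
        details[mention] = cache[uri]
    return details


def run_engines(text, fetch):
    """Run the three toy engines one after the other."""
    links = recognize(detect_mentions(text))
    return links, enrich(links, fetch)
```

Here `fetch` is any callable that takes an entity URI and returns its description. Passing it in as a parameter keeps the sketch free of network access and makes the caching behaviour easy to observe.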
Overview of the extracted entities in the submitted text.
Up until now we have used the web user interface, which is meant for human beings who want to test the capabilities of the engines manually and navigate through the results using their browser. This is primarily a demo mode.
The second way to use fise is the RESTful API for machines (e.g. third party ECM applications such as Nuxeo DM and Nuxeo DAM) that will use fise as an HTTP service to enhance the content of their documents. The detailed documentation of the REST API is available on a per-endpoint basis in the Web UI by clicking on the "REST API" link in the top right corner of the page:
Accessing the inline documentation for the REST API
curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
     --data "Fise can detect famous cities such as Paris." \
     http://fise.demo.nuxeo.com/engines/
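For clients written in a programming language rather than shell, the same request as the curl command above can be sketched with the Python standard library. The function names `build_engines_request` and `analyze` are hypothetical helpers introduced for illustration; the URL and headers are the ones from the curl command.

```python
import urllib.request

# The public demo URL from the curl example; a local instance would use
# its own host and port instead.
ENGINES_URL = "http://fise.demo.nuxeo.com/engines/"


def build_engines_request(text, url=ENGINES_URL, accept="text/turtle"):
    """Build (but do not send) a POST request carrying plain text to /engines."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Accept": accept,              # requested RDF serialization
            "Content-Type": "text/plain",  # the body is raw text
        },
        method="POST",
    )


def analyze(text, url=ENGINES_URL):
    """Send the text to the server and return the response body as a string."""
    request = build_engines_request(text, url=url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `analyze("Fise can detect famous cities such as Paris.")` sends the request to the server, so it needs a reachable fise instance.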
Right now the packaged engines can only deal with English text content. We plan to progressively add statistical models for other languages as well.
Right now, if you submit a sentence that starts with "United Kingdom prime minister David Cameron declared to the press..." you will get an output such as:
"David Cameron" is detected as a person but not recognized, since the fise index was built on a DBpedia dump extracted before his election. Furthermore, fise is currently not able to extract the relation between the entity "David Cameron" and the entity "United Kingdom". In future versions of fise we plan to extract the role "prime minister" that links the person to the country. This should be achievable by combining syntactic parsing with semantic alignment of English words with an ontology such as DBpedia.
Extracting relations between entities will help knowledge workers incrementally build large knowledge bases at a low cost. This can be very interesting for economic intelligence or data-driven journalism: imagine automatically building, from news feeds, the social networks of public figures and their relationships with business entities such as companies and financial institutions.
Right now, fise is a standalone HTTP service with a basic web interface mainly used for demo purposes. To make it really useful, some work is needed to integrate it with the Nuxeo platform so that Nuxeo DM, Nuxeo DAM and Nuxeo CMF users will benefit from a seamless semantic experience.