Semantics

How To Get Semantified

First, let me say I’m pretty sure semantified isn’t even a ‘real word’ (yet), but I’ve seen it popping up lately, so I thought I’d help make it into a real word if it isn’t already.

Like anyone into semantics, I love defining categories and then classifying things into those categories. I guess it goes with the territory! So I’m going to share with you my category scheme for approaches to semantic technologies. I define four categories: constructive, inductive, blended or hybrid, and generative. In practice, an approach in any one category may draw upon elements of the others. In the case of the blended or hybrid approach, I’ll claim the coupling of two of the approaches is tight enough, and distinct enough, to warrant its own category. Descriptions of each of the four categories follow.

Constructive

Constructive approaches essentially hand-craft their semantic models. As a knowledge engineer, I’ve been involved in several projects using this approach and I can tell you it can be really hard work with often slow progress. Some projects using the constructive approach are done by relatively small, dedicated teams of knowledge engineers and some are more community-based or crowd-sourced type efforts. Some produce proprietary or private models and some open or public models. A few are general purpose, like Cyc/OpenCyc, but most are targeted at specific vertical domains such as finance, travel or healthcare. I view the Semantic Web’s Linked Open Data (LOD) models as being constructive models. Some constructive models are developed for internal use and some for use by and/or sale to others. Some are explicitly exposed as models – conceptual schemas or ontologies, or at least taxonomies. Some are embedded behind applications and are never made visible to their consumers.
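To make this concrete, here’s a minimal sketch of what hand-crafting looks like at the smallest scale, using Python and the rdflib library. The travel-domain class names are invented for illustration and aren’t drawn from any real model.

```python
# A tiny, hand-crafted ontology fragment built triple by triple with
# rdflib (pip install rdflib). The class names are illustrative only.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/travel#")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# Declare a small class hierarchy by hand.
g.add((EX.Accommodation, RDF.type, RDFS.Class))
g.add((EX.Hotel, RDF.type, RDFS.Class))
g.add((EX.Hotel, RDFS.subClassOf, EX.Accommodation))
g.add((EX.Hostel, RDF.type, RDFS.Class))
g.add((EX.Hostel, RDFS.subClassOf, EX.Accommodation))
g.add((EX.Hotel, RDFS.label, Literal("Hotel", lang="en")))

# Serialize the model so it can be shared, reviewed, and maintained.
print(g.serialize(format="turtle"))
```

Multiply that kind of statement-by-statement work by tens or hundreds of thousands of concepts and you get a feel for why progress can be slow.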

The constructive approach is a good fit if you want to produce a relatively static semantic model for a well-bounded and relatively static target domain. This approach has often been used when the resulting semantic model is shared and is intended to be consistent across the set of shared users. And although hand-crafting a large, complex model is not for the faint of heart or for those short on time, a constructive approach may be quite tractable if there’s a large, enthusiastic community contributing to the development (and maintenance!) of the model and if the problem space lends itself to ‘divide and conquer’ tactics.

Inductive

As the name implies, this approach involves inducing semantics through techniques such as topic clustering and other statistical [text] analysis applied to large corpora – think millions or hundreds of millions of documents. In other words, this approach can be described as machine learning or analytics performed over big data sets (or Big Data sets, to use buzz-worthy terminology).

Google is the star example here. Think about how Google’s PageRank works: it computes statistics over the links running between a given Web resource and other resources, down to the keyword level in many cases. With enough data, you can create indices and associated statistical models based on the relationships among those resources and then use them to retrieve search results, suggest related topics, and so on. Simple text indexing works much the same way: you extract keywords, analyze statistics about the frequency of their occurrence within a document (using, for example, term frequency-inverse document frequency, or TF-IDF), their co-occurrence with other keywords, etc. There are of course more complex algorithms for text analysis, as well as algorithms for images, voice and other multimedia types. In any case, it’s all about statistics and statistical patterns and relationships. This approach therefore works best when there are big data sets available to feed the analysis. Put another way, it makes sense if you have a really large amount of data and want to relate it (i.e., to index it) to other data in a relatively ad hoc, dynamic fashion. I would further describe this approach as more actionable than reflective: use it if you care primarily about operationally using the indices to provide results or answers, and not as much about creating and persisting specific, explicit semantic models behind those results or answers.
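As a toy illustration of the kind of statistics involved, here is TF-IDF computed from scratch in Python over a stand-in three-document ‘corpus’; real inductive systems run this sort of analysis over millions of documents.

```python
# TF-IDF from scratch over a toy corpus. Terms concentrated in one
# document score high; terms spread across every document score ~0.
import math
from collections import Counter

corpus = [
    "the hotel room was clean and quiet",
    "the flight was delayed but the crew was helpful",
    "the hotel pool and the hotel gym were excellent",
]

docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: how many documents contain each term.
df = Counter()
for doc in docs:
    for term in set(doc):
        df[term] += 1

def tf_idf(term, doc):
    """Term frequency times inverse document frequency."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / df[term])
    return tf * idf

for term in ("hotel", "flight", "the"):
    print(term, [round(tf_idf(term, doc), 3) for doc in docs])
```

Running this, “flight” scores highest in the one document where it appears, “hotel” scores moderately in two, and the ubiquitous “the” scores zero everywhere; that, writ very large, is the statistical signal the inductive approach runs on.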

Blended

This approach is a blended or hybrid approach involving both constructive and inductive approaches. Typically it starts with a smallish core or ‘upper’ ontology made up of quite broad classes or categories, which is then used to help classify the topics or concepts induced via algorithms like cluster analysis. Where the topics or concepts aren’t already in the starter model, the output of the [deeper] induction process can be used to extend the model with these new, more specialized concepts. Unlike the pure inductive method, here the model itself and the richly-indexed content are both targeted outputs of the process. This goes on in the standard “lather, rinse, repeat” fashion until you run out of compute power or money, or simply cannot statistically induce any more semantics.
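A hedged sketch of one iteration of that loop, in Python, might look like the following; the categories, seed terms, and ‘induced’ clusters are all invented for illustration.

```python
# One pass of the blended loop: a tiny hand-built "upper" model seeds
# the classification of induced topic clusters; clusters that match
# nothing become candidate extensions to the model.

# Hand-crafted starter model: broad categories with a few seed terms.
upper_model = {
    "Finance": {"loan", "interest", "account"},
    "Travel": {"flight", "hotel", "booking"},
}

# Pretend these keyword sets were induced upstream by cluster analysis.
induced_clusters = [
    {"mortgage", "loan", "rate", "interest"},
    {"hotel", "resort", "booking", "beach"},
    {"genome", "sequencing", "protein"},   # matches no existing category
]

def classify(cluster, model, threshold=1):
    """Attach a cluster to the best-overlapping category, if any."""
    best, best_overlap = None, 0
    for category, seeds in model.items():
        overlap = len(cluster & seeds)
        if overlap > best_overlap:
            best, best_overlap = category, overlap
    return best if best_overlap >= threshold else None

new_concepts = []
for cluster in induced_clusters:
    category = classify(cluster, upper_model)
    if category:
        upper_model[category] |= cluster   # deepen an existing category
        print(sorted(cluster), "->", category)
    else:
        new_concepts.append(cluster)       # candidate model extension
        print(sorted(cluster), "-> new candidate concept")
```

The “lather, rinse, repeat” part is simply running this again with the enriched model and the next round of induced clusters.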

This approach is most appropriate if you want to create extensive, multi-dimensional, relatively persistent semantic (i.e., concept) indices for large amounts of data and then use those indices to intelligently retrieve small numbers of highly relevant results. Examples of such applications include information discovery within enterprise content management systems and question answering assistants, such as for customer care systems. This approach may not be feasible for massive amounts of Web data that changes constantly. But for large sets of data that are relatively more persistent – like enterprise information – this approach can produce higher quality results over time. Of course, given the additional processing, this approach can be slower than a pure inductive approach, so it may require the introduction of optimization techniques, particularly for real-time applications.

An example of a technology using this approach is a start-up under the umbrella of Frost Data Capital (formerly Frost Venture Partners) called MAANA, Inc. I had the opportunity to help them during the early, incubation stage of their life-cycle. Without getting into details, I can say they are doing leading-edge work in multi-dimensional semantic indexing, specifically over Hadoop/HDFS-based data stores, and that work includes innovative optimization techniques for large, enterprise-scale data sets.

Generative

The generative approach is a probabilistic approach that is essentially the opposite of the inductive approach. With this approach you start with a relatively small set of building-block concepts. These are constructive primitives or atomic concepts, rather than the broad or general concepts associated with the blended/hybrid approach. They are combined with a set of generative rules to generate or synthesize candidate concepts, which are then validated against a smaller set of reference corpora for evidential purposes.
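The pattern might be sketched roughly like this in Python; to be clear, this is a speculative toy, not Primal’s or anyone else’s actual algorithm, and the primitives, rule, and corpus are all invented.

```python
# Generate candidate concepts from atomic primitives via a simple rule,
# then keep only candidates with evidence in a small reference corpus.
from itertools import product

# Atomic building-block concepts (invented for illustration).
modifiers = ["organic", "local"]
heads = ["food", "farming", "market"]

# One toy generative rule: pair a modifier with a head noun.
candidates = [f"{m} {h}" for m, h in product(modifiers, heads)]

# A small reference corpus used purely for evidential validation.
reference_corpus = [
    "the rise of organic farming in small communities",
    "our local market sells organic food every weekend",
]

# A candidate survives only if it actually occurs in the evidence.
validated = [c for c in candidates
             if any(c in doc for doc in reference_corpus)]

print("generated:", candidates)
print("validated:", validated)
```

Note the inversion relative to induction: the corpus isn’t mined for concepts, it’s consulted to confirm or reject concepts the rules have already synthesized, which is why a much smaller corpus suffices.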

The generative approach is applicable if you want to dynamically generate and utilize on-the-fly semantic models, particularly for highly specialized or individualized topics that aren’t necessarily possible or feasible to model in advance. This covers two extremes: one where the volume of data is too small to lend itself to induction (for example, new areas of data collection where there isn’t sufficient data yet), and another where the domain is extremely large (where the sheer number of possible combinations, and the cost to model them in advance using any of the other approaches, would be prohibitive). In addition to the value of generating the individualized models themselves on demand, this approach is valuable for content discovery and filtering (e.g., for applications such as personalized research or news readers) and for contextual knowledge building for personal assistants and other forms of intelligent software agents.

So far as I know, there is only one example today of a semantic technology that uses this generative approach and that’s a company I have been associated with for the past few years called Primal (www.primal.com).

Previously, during the last generation of semantic technology in the 1990s, I worked for another company that pursued this approach, albeit somewhat differently. That company was called Ontological Technology (Ontek) Corporation. In that case, the technology – called the Platform for the Automated Construction of Intelligent Systems, or PACIS – depended upon a very precise, formal, foundational ontology from which all the other ontologies were to be – in theory at least – automatically generated. The driver for that was this: after having tried to hand-build a complete ontology for the engineering and manufacturing domains, the visionary behind PACIS decided to define a foundational ontology from which the ontologies for engineering and manufacturing – and potentially many other domains as well – could be automatically generated. It failed for obvious reasons. Or at least they were obvious after it failed. Frankly, the precision expected of that foundational ontology, and of the ontologies to be generated from it, was unrealistic to achieve at that time and likely still today. There were in any case many valuable results produced along the way, and from failure you learn.

That’s why I became associated years later with Primal. I felt Primal’s generative technology – a working prototype at the time I got involved – was much more pragmatic and practical, and scoped towards more achievable use cases. I spent a little over two years up in Canada working with the talented team at Primal to progress from prototype to Minimum Viable Product (MVP), through Alpha release, and now to the first commercial product built on that core technology: an intelligent automated content service. In other words, that technology, which dynamically generates a kind of taxonomy of user interests referred to as an Interest Graph, is now commercially in use.

Picking the right approach depends on the nature of the semantic modeling work you are doing and the resources available to do it. Even with the right approach there will be hard work ahead, but you should be able to achieve your goals. Choose the wrong approach, and as my son would say, it’s destined to end in an epic FAIL.
