Structured Data Search

Searching structured data is hard. At first glance it is reasonable to assume that the additional information provided by structure – called metadata – would make it easier to provide relevant results, but it does not.

It speaks to the strength of Page and Brin’s approach of ranking based on link authority, and of its ongoing evolution, that web search relevance is still hard to beat.

By 2020, there will be 4 billion people online creating 50 trillion gigabytes of data, according to IDC, and most of it is structured (or semi-structured, to be precise). I continue to believe that the next technology wave in search will come from addressing the structured data search problem.

One useful classification of structured data is the way it is generated:

  • Community generated – Wikipedia, the crowd-sourced encyclopedia, is the number one source of topic-specific structured data; another example is IMDb, a movie database,
  • Expert curated – examples are Wolfram Alpha’s dataset or more domain-specific government databases such as NASA’s NSSDC, and
  • Proprietary – for example, commercial product and service catalogs.

One challenge in using and aggregating structured data is the authority of the source. Google, for example, doesn’t list the source in its topic summaries, while Bing does. In both cases the entry is most likely generated from a mixture of homegrown and external data sources.

The more structured data is pulled into the search experience, the more relevant the question of authority will become. As a user, I would simply like to know where the underlying data – for something that is presented as a fact – comes from.

The second classification of structured data is based on how it is used in search:

  • Understanding – extracting entities and relationships with known semantics from a dataset is the basis for query understanding and disambiguation,
  • Aggregation – summarizing data from one or more sources to serve it in the result set, and
  • Navigation – traversing the graph from one entity to another by following semantic relationships.
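
As a rough illustration of these three uses, here is a minimal Python sketch over a toy dataset. Everything in it is an assumption made for the example: the entity ids, the source labels (`source_a`, `source_b`, `source_c`), and the function names `understand`, `aggregate`, and `navigate` are hypothetical and do not correspond to any search engine's actual API.

```python
# Toy entity table and fact list; all ids and source labels are hypothetical.
ENTITIES = {
    "mercury_planet": {"label": "Mercury", "type": "planet"},
    "mercury_element": {"label": "Mercury", "type": "chemical element"},
    "sun": {"label": "Sun", "type": "star"},
}

# Each fact is (subject id, relation, object, source it came from).
FACTS = [
    ("mercury_planet", "orbits", "sun", "source_a"),
    ("mercury_element", "symbol", "Hg", "source_b"),
    ("mercury_element", "atomic_number", "80", "source_b"),
    ("mercury_element", "symbol", "Hg", "source_c"),
]


def understand(query):
    """Understanding: list every known entity whose label appears in the query."""
    tokens = {t.strip("?.,!").lower() for t in query.split()}
    return sorted(eid for eid, e in ENTITIES.items() if e["label"].lower() in tokens)


def aggregate(entity_id):
    """Aggregation: summarize facts about one entity, keeping the sources per value."""
    summary = {}
    for subj, rel, obj, source in FACTS:
        if subj == entity_id:
            summary.setdefault(rel, {}).setdefault(obj, []).append(source)
    return summary


def navigate(entity_id):
    """Navigation: follow relationships that lead to another known entity."""
    return [(rel, obj) for subj, rel, obj, _ in FACTS
            if subj == entity_id and obj in ENTITIES]


print(understand("mercury orbit"))   # two candidate senses to disambiguate
print(aggregate("mercury_element"))  # values with the sources that back them
print(navigate("mercury_planet"))    # edges to other entities
```

The aggregation step deliberately keeps the source labels next to each value, since that is the information a user would need to judge where a presented fact comes from.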

I like to think of structured data less as an enhancement of the SERP (search engine results page) – a very domain-specific approach that works once you know what a user wants (i.e. a vertical) – and more as help in finding additional interesting information (in the tail) that deals more comprehensively with what I am looking for.

What I would like to see is the use of the entities and relationships of the knowledge graph to identify and rank authoritative sources about a topic, rather than to pull topic-related information from a few sources into the SERP, sources whose authority I can’t verify.
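
One naive way to read this idea is sketched below in Python: score each page by how many of the entities related to a topic in the knowledge graph it actually mentions. The graph, the page URLs, and the `rank_sources` function are all hypothetical, and coverage of related entities is only a crude stand-in for authority; a real system would presumably combine it with link-based and other signals.

```python
# Hypothetical knowledge graph: topic -> entities related to it.
GRAPH = {
    "apollo 11": {"neil armstrong", "buzz aldrin", "michael collins", "saturn v"},
}

# Hypothetical pages, each reduced to the set of entities it mentions.
PAGES = {
    "example.org/mission-history": {
        "apollo 11", "neil armstrong", "buzz aldrin", "michael collins", "saturn v",
    },
    "example.org/astronaut-bio": {"apollo 11", "neil armstrong"},
    "example.org/rocket-list": {"saturn v"},
}


def rank_sources(topic):
    """Rank pages that mention the topic by the share of related entities they cover."""
    related = GRAPH[topic]
    scores = []
    for url, mentioned in PAGES.items():
        if topic not in mentioned:
            continue  # a page that never mentions the topic is not a candidate
        coverage = len(related & mentioned) / len(related)
        scores.append((url, coverage))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for url, score in rank_sources("apollo 11"):
    print(f"{score:.2f}  {url}")
```

The output is a ranked list of links rather than extracted facts, which leaves the user to click through to the source instead of trusting a summary of unknown origin.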

Clicking on a link is easy; determining what is relevant and authoritative for a query is hard and time-consuming – that’s where algorithms could shine.