Topic Modeling in Social Streams

Using social media to conduct commerce is challenging for businesses since it requires constant monitoring of the real-time stream for topics of interest. As part of davai we developed a sophisticated module to automate the monitoring process by using a topic-modeling approach:

  1. A user registers interest by creating a topic, labeling it and adding keywords,
  2. The system analyzes filtered messages, detects the topic and adapts the filter so that it stays current (despite topic drift and trending topics),
  3. Social network effects are used to address new topic discovery.


The challenge of topic modeling in social streams is twofold: messages (documents), even if aggregated, are too small to create meaningful models, and new topics that could be of interest to a user are constantly published.

Our approach leverages the network structure of users:

  • Connected users that share interests and influence each other are more similar,
  • Merge the user’s stream with the authored streams of his friends of friends,
  • Use ranking to find authoritative users with similar interest.
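
The stream-merging idea in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the davai implementation: the `graph` and `authored` dictionaries, the function names and the whitespace tokenizer are all hypothetical, and the sketch pools direct friends together with friends of friends.

```python
from collections import Counter


def friends_of_friends(graph, user):
    """Return everyone within two hops of `user`, excluding `user`."""
    first = set(graph.get(user, ()))
    second = set()
    for friend in first:
        second.update(graph.get(friend, ()))
    return (first | second) - {user}


def pooled_term_counts(graph, authored, user):
    """Merge the user's messages with those authored in the neighborhood.

    Pooling short messages into one larger pseudo-document gives a topic
    model more text to work with than a single message would.
    """
    counts = Counter()
    for author in {user} | friends_of_friends(graph, user):
        for message in authored.get(author, ()):
            counts.update(message.lower().split())
    return counts


# Hypothetical toy data: who follows whom, and what each user wrote.
graph = {"ann": ["bob"], "bob": ["ann", "cat"], "cat": ["bob"]}
authored = {
    "ann": ["new camera lens"],
    "bob": ["camera sale today"],
    "cat": ["lens review"],
}
counts = pooled_term_counts(graph, authored, "ann")
```

The pooled counts could then be fed to any topic model; which model and which weighting davai used is not stated here.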

The topic modeling approach was the base for further research at the University of Washington by Professor Ankur and can be found in: “SocialLDA: Scalable Topic Modeling in Social Networks”.

Merchandising on Social and Mobile

Social is a channel every successful commerce strategy has to target – it is where customers are. However, unlocking its potential is challenging because of the conversational nature of the channel and the constant evolution of the underlying social media platforms.

Compared to all other online channels, mobile allows retailers to come closer to customers, and campaigns can be much more targeted and personalized. However, mobile apps suffer from a discovery problem, which is why I believe that mobile apps have to be social.

At davai we developed the concept of merchandising apps that offer a rich media experience with integrated social features and embedded commerce. Apps tell a living story about products using retail and user-generated content.

Users interact with the story through social actions such as like, vote or comment and the resulting activity stream adapts or enhances the story.


If you are interested in more details on how it worked, view the following slide deck: Merchandising Apps

Activity Streams

The web is rapidly changing from being just a source of information consumption to a place where people produce, consume, share and interact. The social web is engaging people away from a web of clicks to a web of online activities and real-world interactions performed in a conversational style.

With services such as Facebook and Twitter, a new world of online activities became observable, promising richer signals than the web of pages/links and queries/clicks. Google has shown that any observable web activity by online users can be turned into economic value through targeted online advertising.

Davai (Давай => Let’s Go!!), a startup, was founded to explore the commercial usage of network effects on social media under the assumption that the following signals are statistically significant:

  • Influence – with the number of a user’s friends performing an action, the likelihood that the user also performs the action is increased,
  • Repetition – a user who has performed an action will have a higher probability to perform the action again than a user who has never performed the action, and
  • Correlation – two users who are befriended have a higher probability to perform an action at the same time than users randomly chosen from the network.
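
As a rough illustration of how the influence signal in the list above could be checked on data, the following Python sketch estimates how often users performed an action, grouped by the number of their friends who performed it. The friendship `graph`, the `actors` set and the function name are hypothetical, and the sketch works on a static snapshot, so it ignores the timing that a real test of influence would need.

```python
from collections import defaultdict


def adoption_rate_by_active_friends(graph, actors):
    """Estimate P(user acted | k friends acted) for each observed k."""
    totals = defaultdict(int)  # users that have exactly k acting friends
    acted = defaultdict(int)   # of those, users that acted themselves
    for user, friends in graph.items():
        k = sum(1 for friend in friends if friend in actors)
        totals[k] += 1
        if user in actors:
            acted[k] += 1
    return {k: acted[k] / totals[k] for k in sorted(totals)}


# Hypothetical toy network and the set of users who performed the action.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
    "e": [],
}
actors = {"a", "b", "c"}
rates = adoption_rate_by_active_friends(graph, actors)
```

If the influence assumption holds, the rate should grow with k; whether the growth is statistically significant needs a proper test on far more data than this toy example.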

analytic flow

The basic model we developed associates users that share connections with objects – people, places, and things – in much the same way as Facebook’s Open Graph does today. Connections through objects allow information to flow from user to user; hence a channel for information flow is created.

With information flow influence spreads through the network (graph) and can be quantified by scoring graph elements based on topological properties. Scores represent sub-graphs as much as their individual nodes and imply their similarity.
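
One well-known way to score nodes from topology alone is PageRank. The text does not say which scores davai actually used, so the following Python sketch should be read as a stand-in example only; the graph, the damping factor and the iteration count are assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Score nodes by power iteration over a directed adjacency dict.

    Every node must appear as a key in `graph`; nodes without outgoing
    edges spread their score evenly over all nodes.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: an edge u -> v means information flows from u to v.
graph = {"u1": ["u2"], "u2": ["u3"], "u3": ["u1", "u2"], "u4": ["u2"]}
scores = pagerank(graph)
top_user = max(scores, key=scores.get)
```

In this toy graph the node with the most incoming flow ends up with the highest score, which is the kind of topological signal the paragraph above refers to.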

Connectivity breeds similarity, and targeting social network neighbors is an efficient strategy of audience identification for advertisers compared, say, to targeting a set of random nodes.

The following presentation details the approach: Marketing to networked customers

Our infrastructure made heavy use of the open source stack and Amazon EC2. Hadoop, HBase, Mahout and Lucene/Solr were integrated into a 24/7 pipeline with a Flash/Browser-based workbench. The depth of the open source stack and the dedication of contributors to components are amazing. Lucene and Weka are my favorite communities.

Search Ecosystems

Google’s search ecosystem has shown impressive strength. The virtuous cycle of publishers optimizing content for Google and users searching on Google to find such optimized content is still going strong.

The weak spot of Google, or in fact of any web search engine, is “dark” content that is not crawlable, is held in proprietary formats or databases, or is purposely hidden from search engines. Finding innovative ways to access and search structured content was my focus as partner development manager at Bing.

One area of innovation was structured content submitted by publishers in feed format or through APIs. For the shopping search vertical we had developed a feed ingestion pipeline, which became the starting point for Bing’s massively scalable structured data ingestion infrastructure.

We started to explore how a structured data ecosystem could be created and operated with a recipe vertical and efforts targeted at enterprises with commercial web sites. As part of the Fast acquisition by Office, a team of search engineers in Oslo joined my group to strengthen the enterprise-focused effort.

The second area of investment was centered on using structured data to compute answers and to allow people to explore search results through sophisticated refinements. Our approach was to describe data by a dynamic model that during retrieval was executed by an engine embedded in the search stack to compute results.

The effort to bring computational power into the search stack was accelerated with a partnership with Wolfram|Alpha (WA) and its technology stack to serve knowledge computed from expertly curated data. The WA results were used to enrich Bing’s results in select areas such as nutrition, health, and advanced mathematics.

The technical challenge of integrating a computational engine with unbounded execution time into a web search stack – with a response time in milliseconds – was tremendous and required a heroic effort by everybody involved.

It is interesting to follow recent efforts of Google and Bing to provide answers and knowledge based on structured data. However, without a functioning publisher ecosystem these efforts will not scale, and without innovation in how to rank and disambiguate structured data, results will suffer from low relevance.

Commerce Queries

In its effort to catch up with Google in technology, Microsoft poured an astonishing amount of talent into its search effort. It was a real privilege to see how the team grew from a small committed group into a well-oiled organization that would crank out new technology at an astonishing pace.

The economy of scale Google had gained for its search engine is a steep hill to climb for Bing. One of the key areas of innovation I saw, and still see today, is commerce queries. Answering commerce queries right in the search experience not only drives customer satisfaction but can also create strong economic value.

When I drove MSN Shopping I started to evangelize the concept of a database of commerce queries for search – a concept that would be described by Danny Sullivan from Search Engine Land much more eloquently as Database of Intent. To better address commerce in search we moved MSN Shopping into the Bing organization (at that time still called MSN/Windows Live Search) and the service became shopping search.

My team drove features such as structured data ingestion and processing, query intent detection, and comparison-shopping features deep into the core of the Bing infrastructure. Query classification became the hallmark of our effort.

Commerce queries typically have a very regular structure built around brand and product names. Amassing query intent understanding became a focus, and the underlying technology quickly spread from our effort to other search products.
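
To make the "regular structure" concrete, here is a deliberately simple Python sketch that tags brand and product tokens in a query using small lexicons. It is not how Bing's query classification worked; the lexicons, the function name and the output fields are hypothetical, and a production classifier would rely on learned models and far larger dictionaries.

```python
# Hypothetical lexicons; a real system would mine these from catalogs and logs.
BRANDS = {"canon", "nikon", "sony"}
PRODUCT_TYPES = {"camera", "lens", "tv", "laptop"}


def parse_commerce_query(query):
    """Split a query into brand, product type and remaining modifiers."""
    tokens = query.lower().split()
    brand = next((t for t in tokens if t in BRANDS), None)
    product = next((t for t in tokens if t in PRODUCT_TYPES), None)
    modifiers = [t for t in tokens if t not in (brand, product)]
    return {
        "is_commerce": brand is not None or product is not None,
        "brand": brand,
        "product": product,
        "modifiers": modifiers,
    }


shopping = parse_commerce_query("Canon camera 50mm")
other = parse_commerce_query("weather in oslo")
```

Even this naive lookup separates the two example queries, which hints at why brand and product names are such a strong signal for commerce intent.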

In an effort to amass the largest product selection in the market, we pushed the technology to automatically extract product records from web pages and reached a catalog of 100 million products. However, a catalog of crawled products is hard to monetize, and the feed-based version with its pay-for-click and pay-for-transaction model was ultimately the one that continued.

In an interesting twist, Google has lately moved to a paid inclusion model that forces merchants to pay and fundamentally changes its previous position of not introducing advertising in algorithmic results. The move also reduces selection significantly, a strange move for Google, and it will be interesting to see how this turns out.

Commerce queries are an area ripe for innovation in search engines and the balance between selection (as seen from a user’s perspective) and the monetization model hasn’t been found. Search engines for commercial queries are competitors to marketplaces such as Amazon and eBay and that is where the battle ultimately will be fought.

One of the notable innovations Bing introduced was “cashback”, a highly visible effort to help advertisers reach searchers with compelling offers, and to provide a shopping experience that would change user behavior and drive new users to Bing.

The effort was highly successful in getting customers to come to Bing to find great deals and in getting merchants to run campaigns with great discounts. Recently, however, it was discontinued, and for me the lesson learned is that bargain hunters are not very loyal users of a service. Nevertheless, the project was a significant cross-team effort and deepened my interest in online ecosystems and monetization models.

Online Marketplaces

Microsoft had to adapt its online service and Internet access business to Google as the new force on the web. MSN, with its content channels and services such as Hotmail and Messenger, had been a popular destination, but its display and paid placement advertising model was under attack.

The company had just started to understand Google and its ad-funded business model and renewed its effort to compete by investing in MSN content channels and incubating a search engine.

I joined MSN Shopping – a shopping comparison engine – and was responsible for the engineering effort. Shopping was one of MSN’s most profitable channels but suffered from declining selection, traffic and user engagement caused by a paid placement business model and explicit volume traffic guarantees.

Microsoft decided to make a major investment in the underlying shopping technology to create a marketplace platform that could serve Windows Marketplace and other product-specific online stores as well as the general MSN Shopping destination.

In a heroic effort, the team revamped the complete system of catalog processing and deployment as well as web front-end generation, addressed all the different requirements from partner teams, and enjoyed the attention of top-level management.

The technical challenge for the shopping channel was to scale the infrastructure to handle a growing selection of 50 million products, up from the previous 5 million. But the true challenge was the change of the business model from a paid placement model that was entrenched in the internal business processes to a pay-per-click model. The new site, once deployed, showed a threefold improvement in customer satisfaction and was ranked 1st in a comparison by Jupiter Research.

Shopping comparison engines were becoming acquisition targets with astronomical valuations (eBay bought one for $620 million and Scripps paid $525 million for another), even though these businesses basically have an arbitrage model in decline: acquire traffic cheap and sell more directed traffic to merchants for a higher price.

In this highly competitive space we started to look for product differentiators and to invest in ratings/reviews and opinion extraction. The goal was to introduce a new ranking signal derived from reviews rather than product popularity. The rating and review service was ultimately deployed by 15 channels in the MSN Network, and the opinion extraction technology found its way into several other search products.

As part of the Windows Live branding effort and the resulting investment rush, a completely new shopping front-end was developed which introduced many features that make up social shopping sites today. Ultimately the branding effort evolved by splitting the MSN Network into content channels named MSN and services named Windows Live services. Our new Windows Live-branded shopping experience, however, became a casualty of this transition.

e-Business Servers

A unique challenge inside Microsoft is the positioning of a stand-alone product versus Windows and Office, the cash cows of the company. To drive upgrade cycles the company is forever forced to integrate more and more functionality into these platforms, thereby cannibalizing products that live at the fringe or, even worse, in-between.

A good example is application servers, which are used to manage e-commerce and service interactions over the Web. Microsoft Transaction Server (MTS), called “Viper”, was at the forefront of this technology when the hotly contested decision was made to ship it as part of Windows. Microsoft has therefore been perceived as not having an entry in this $7 billion and fast-growing market segment.

Microsoft’s focus on the enterprise spawned a pack of individual server efforts including BizTalk, Commerce, Content Management, Host Integration and more. These servers, with their overlapping functionality, missions and market segments, confused customers as well as internal organizations.

An e-business server division was formed to integrate all these products and find a proper market strategy. I joined the team to explore and find new product strategies and form teams around them. The first product we conceived was HWS (Human Workflow Services), a service focused on moving BizTalk up the Microsoft stack by offering a highly innovative ad-hoc workflow environment for Office.

HWS evolved into workflow technology for Microsoft SharePoint and ultimately became a service in Windows.

The second product initiative we conceived was dubbed BAM (Business Activity Monitoring) and targeted the area of enterprise event monitoring and analysis. BAM, in typical Microsoft fashion, made it very simple to roll out deployments in an enterprise.

The third strategy was to provide easy-to-use portal technology that wrapped BizTalk and could be shipped as part of Microsoft SMB Server. Its target was small to medium-sized businesses with simple activity orchestration and monitoring capabilities.

Microsoft’s enterprise and server development at this time was at its prime. Understanding the Microsoft ecosystem and the economic engine has been a privilege and highly stimulating. Living in-between Office and Windows was a true challenge but fun.

Transforming and Visualizing Data

Microsoft today is an acknowledged enterprise software company and, with SQL Server, is considered one of the four major database vendors. At its beginning SQL Server had its work cut out to convince large organizations to change to a new DBMS vendor.

Besides tackling the SMB markets, targeting green field accounts and offering better TCO (Total Cost of Ownership) than competitors, we devised a strategy to use business intelligence to get a foot in the door.

By offering analytics (OLAP), reporting, data transformation and data warehousing services right out of the box Microsoft could get its database into shops that historically had been deeply committed to Oracle or IBM. Each installation penetrated the wall and potentially opened the door to a broader deployment.

Integration of all these services required the exchange of meta-data, and the repository team joined SQL Server to become Meta Data Services – a native store for data warehouse object definitions. Over time the team grew to include teams such as Data Transformation Services and “English Query”, the natural language query interface to SQL Server.

We discontinued English Query because of low usage but it was a clever technology far ahead of its time considering the popularity of the natural language user interface Siri, acquired by Apple for $200 million recently.

Besides the integration of all these services, the next challenge that needed to be tackled was the completion of the data-warehousing stack with reporting and decision portal capabilities. Microsoft Reporting Services and Microsoft Integration Services were spawned as part of these product efforts, as was the Digital Dashboard Resource Kit (DDRK).

Digital Dashboard Resource Kit (DDRK) was a server-based application that allowed creating a portal experience in the web browser made up of distinct units called web parts. Combined with SQL Server it provided single-click access to analytical data and business intelligence.

The forward-looking Digital Dashboard technology, developed together with Office, has become a product in the form of SharePoint Server. DDRK was a highly controversial technology within Microsoft since it showcased what later became a common thread – the abstraction of the Windows desktop experience through content and services rendered in a Web Browser.

Microsoft Repository

Meta-data repositories have always been perceived as the magical solution to integrate tools and apps in a heterogeneous environment. Any exchange of information implies in some form or shape the exchange of data that describes the information structures and their meaning.

While the benefits of meta-data exchange and repositories are relatively easy to see, their design and integration problems are numerous and hard to crack. Repositories suffer from a cold start problem since it only makes sense to integrate with one if it already contains data that can be shared.

When I joined Microsoft I took over responsibility for the Microsoft Repository and Visual SourceSafe (a version control system) and their development teams. The first release of Microsoft Repository with the Open Information Model (OIM) allowed external tool vendors such as Popkin’s System Architect or CA Erwin to integrate with Visual Basic.

OIM is a set of metadata specifications to facilitate sharing and reuse in the application development and data warehousing domains.

Successive versions of the repository engine and OIM shipped with Microsoft Visual Studio and Microsoft SQL Server and offered deeper integration with several of their components. Microsoft Repository’s COM (Component Object Model) based architecture was engineered to offer versioning in COM – it basically persisted COM objects and created them on the fly.

The enthusiasm of small tool and application vendors to gain access to Microsoft’s server and tools metadata always outstripped the pace at which Microsoft internal partner teams shared such data. This seems to be a general rule for repositories and similar exchange technologies.

With the web and the pervasiveness of XML, meta-data management became much more a model-mapping problem than one of access. You can find an interesting starting point to this problem in Phil Bernstein’s work, e.g. Panel: Is Generic Metadata Management Feasible?

Object Management

Computer-aided software engineering (CASE) is a buzzword in software development with its promise of tools and methodologies that automate the entire process.

Managing software projects in large financial institutions or manufacturing companies has its unique challenges. Observing hundreds of software developers implementing transaction screens that run the car manufacturing processes of Fiat SA is a sight to see. Having this happen on an end-to-end development platform was unique to the Softlab Maestro II platform.

Working for Softlab GmbH in Munich I had architecture and management responsibility for the development of the Object Management System (OMS) for Maestro II. OMS (renamed Enabler after it was acquired by Fujitsu) is a network-based repository of design information produced by the software development tools of Maestro or by external tools that integrate with the platform.

A software repository is a storage location inside of an enterprise or in the cloud that aggregates and archives all deliverables and descriptive information about a software product and its design process. As such it includes source code, project management data, resource descriptions, and test data. For large installations this can easily amount to hundreds of gigabytes of data.

Repositories model and store data that describes other data – we call this meta-data. Meta-data is necessary to chain different tools used in a design process together. A meta-data repository (exchange) is necessary in any kind of environment that integrates heterogeneous software systems into a flow.

One of the key challenges in managing meta-data is that it typically evolves in different versions or branches that at some point need to be merged and reconciled. With OMS II we introduced a unique distributed versioning architecture with automatic versioning and optimized merge support.
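
The merge problem described above is commonly handled with a three-way merge against a common ancestor version. The following Python sketch shows that generic technique on flat attribute dictionaries; it is not the OMS II algorithm, which is not described here, and the record layout, function name and conflict policy are assumptions. Attribute values are assumed never to be None, since None marks a missing or deleted attribute.

```python
def three_way_merge(base, ours, theirs):
    """Merge two branches of a metadata record against their common base.

    Returns the merged record and the list of attributes changed
    differently on both branches (conflicts).
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both branches agree
            value = o
        elif o == b:      # only their branch changed the attribute
            value = t
        elif t == b:      # only our branch changed the attribute
            value = o
        else:             # both changed it differently
            conflicts.append(key)
            value = o     # keep ours pending manual reconciliation
        if value is not None:
            merged[key] = value
    return merged, conflicts


# Hypothetical versions of one design object.
base = {"name": "Order", "type": "entity", "owner": "team-a"}
ours = {"name": "Order", "type": "table", "owner": "team-c"}
theirs = {"name": "SalesOrder", "type": "entity", "owner": "team-b"}
merged, conflicts = three_way_merge(base, ours, theirs)
```

Changes made on only one branch merge automatically, while the attribute edited on both branches is reported for reconciliation.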