How to Impose Structure on Unstructured Data

Tuesday, April 7, 2015

By Evan A. Schindman – for Data Informed

For over a decade, the business world has been enthralled with big data. Some of this data arrives structured, but much of it is unstructured, often in the form of text. Over this same period, it has become common for analysts to mine unstructured data for useful correlations.

As interesting as some of the correlations this approach generates can be, much of what is found is presented without the structure of theory. Like blindly shining a flashlight into a dark forest to find the way out, this method guarantees narrow discovery, but not much else. Sure, you might find the right path, but you are just as likely to end up on the wrong path or simply staring at trees. Structuring your analysis of unstructured data allows you to systematically record where you have shined the light, so you know the options in front of you. Adding domain expertise to this simple structure adds wattage to your flashlight, better illuminating the alternatives throughout the forest. Deep domain expertise allows you to utilize structure and shine a veritable floodlight on the world of big data.

Text Analytics

One of the frontiers of unstructured data is the text that invades every facet of modern life. Whether it is analyzing social network comments, full news stories, or sophisticated regulatory information, text is vital to virtually every industry. So how do we go about analyzing that text? The traditional way is simple: read it. But in the modern world, there is just too much information. Even if it were possible to read everything available, readership bias would plague every step of the process. So, text data must be analyzed in a systematic, unbiased way. This type of examination is commonly referred to as sentiment analysis.

Unfortunately, sentiment analysis has a bad reputation, earned by years of rudimentary software that simply uses dictionary-defined terms as buzzwords. These buzzwords are categorized as good or bad, and sentiment is then scored as the count of good buzzwords minus the count of bad buzzwords. Slightly more sophisticated versions of this method incorporate the context of those buzzwords by subtracting bad-buzz-phrase counts from good-buzz-phrase counts to glean sentiment. Although more precise, this phrase-analysis method is almost as biased as the simple word approach because both introduce selection bias by pre-defining the key words or elements of a communication.
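To make the limitation concrete, here is a minimal sketch of the dictionary approach; the word lists and sample text are invented for illustration, and real tools ship far larger lexicons, but the mechanics are the same:

```python
# Minimal sketch of dictionary-based sentiment scoring.
# GOOD_WORDS and BAD_WORDS are hypothetical buzzword lists.
GOOD_WORDS = {"growth", "strong", "improve", "gain"}
BAD_WORDS = {"weak", "decline", "risk", "loss"}

def buzzword_score(text: str) -> int:
    """Score as good-buzzword count minus bad-buzzword count."""
    words = text.lower().split()
    return sum(w in GOOD_WORDS for w in words) - sum(w in BAD_WORDS for w in words)

print(buzzword_score("Growth remains strong despite downside risk"))  # -> 1
```

Note that the bias is baked in before any document is read: whoever hand-picks the word lists has already decided what counts as sentiment.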

While it would be great to have a pure data science solution to this problem, this is simply not possible when analyzing complex text. For any remotely complicated subject matter, it is crucial to introduce expertise into the computing solution. In particular, deep domain expertise can allow a human to train a text analytics algorithm based on comprehensive rules that encompass whole communications rather than simple words or phrases. This method eliminates buzzword bias and ensures that context is crucial to the analysis. Essentially, this means having an expert scale documents in an impartial way to train a sophisticated text analytics system.
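As a rough illustration of what training on whole communications could look like in practice, consider the sketch below. It uses scikit-learn, which is an assumption on my part; the documents and expert labels are likewise hypothetical stand-ins, not the author's actual system:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical expert-scaled training set: whole communications,
# each labeled by a domain expert (1 = hawkish, 0 = dovish).
train_docs = [
    "Mounting inflation pressures warrant a tighter policy stance.",
    "Persistent slack in the economy justifies continued accommodation.",
]
expert_labels = [1, 0]

# Features come from the entire document (word and phrase n-grams),
# not from a pre-defined buzzword list.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LogisticRegression())
model.fit(train_docs, expert_labels)

print(model.predict(["Risks to the inflation outlook are tilted to the upside."]))
```

The design point is that the expert supplies judgments about whole documents, and the algorithm decides which language patterns carry that judgment.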

Take, for example, analysis of central bank policy. Astute readers might remember the “Briefcase Watch” of the 1990s, when reporters followed Federal Reserve Chairman Alan Greenspan on the day of a Fed meeting to get a glimpse of his briefcase. The idea was that if his briefcase was thick, he had been reviewing data and rates were likely to change. If the briefcase was thin, then rates would remain the same. Aside from the fact that a modern briefcase would contain a laptop rather than stacks of paper, the modern method of Fed watching has not changed much. The big difference is that, nowadays, central banks release far more information to the public. The trouble is that, as in the Greenspan era, Fed watchers are still focused on a narrow subject, a few key words, to determine whether policy is going to change.

The rationale behind modern Fed watchers’ focus on words or phrases within larger communications derives from the Fed’s own method of editing press releases. Its use of track changes in those releases led reporters to key in on simple word changes. Unfortunately, this method of analysis has carried over to much larger, uniquely written communications like meeting minutes and speeches, not to mention communications from central banks all over the world that may or may not use track changes. The bottom line is that this narrow method of analysis leaves carefully crafted data (words) unanalyzed.

To properly analyze the complex communications released by central banks, it is vital not only to examine every word of every communication, but also to calibrate the system with as little bias as possible. In the context of central banks, this means using historical market reactions to define “hawkish” or “dovish” communications on an ordinal scale. To mitigate the selection bias of cherry-picking overtly hawkish or dovish communications, the training set can be rounded out with a sufficiently large sample of random communications. The text analysis algorithm can then score an archive of historical communications based on similarities in language to the training data, as in the sketch below. With each added input, the system better learns and parameterizes hawkish/dovish sentiment and adapts to shifting language and language patterns.
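A minimal sketch of this training-and-scoring loop might look like the following; the regression model, the example texts, and the market-reaction scores are all hypothetical stand-ins for the process described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical training set: ordinal targets derived from historical
# market reactions, plus a random "filler" communication to blunt
# the selection bias of cherry-picking overt examples.
train_docs = [
    "The committee sees upside risks to inflation.",           # overtly hawkish
    "Substantial economic slack argues for patience.",         # overtly dovish
    "Conditions remain broadly consistent with the mandate.",  # random sample
]
market_reaction_scores = [1.4, -1.2, 0.1]

scorer = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge())
scorer.fit(train_docs, market_reaction_scores)

# Score an archived communication by its similarity in language to
# the training data; retraining as new communications arrive lets
# the parameters adapt to shifting language.
archive = ["Further firming of policy may be warranted."]
print(scorer.predict(archive))
```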

Essentially, this approach combines natural language processing with machine learning so that changes in language are properly incorporated, and scores reflect new and changing terms in the central bankers’ lexicon.

The end result is a single score for every communication produced by a central bank. These scores are normalized around zero such that negative numbers indicate dovish sentiment and positive numbers indicate hawkish sentiment. Nearly all scores fall between -2 and +2 because those values mark two standard deviations from the mean. These scores represent the central bank’s sentiment toward the state of the economy (or at least toward inflation expectations), so the trend (rising or falling) is essential to understanding where monetary policy and, more broadly, the economy is likely to go in the near future.
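The normalization described here is a standard z-score. For concreteness, a minimal sketch with invented raw scores:

```python
import statistics

# Hypothetical raw model outputs for an archive of communications.
raw_scores = [0.8, -0.3, 1.1, -0.9, 0.2, 0.5, -1.4]

mu = statistics.mean(raw_scores)
sigma = statistics.stdev(raw_scores)

# Z-scoring centers the distribution on zero, so roughly 95% of
# values land between -2 and +2 (two standard deviations).
normalized = [(s - mu) / sigma for s in raw_scores]
print([round(z, 2) for z in normalized])  # negative = dovish, positive = hawkish
```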

Using Central Bank Sentiment Data 

Although economics (and, even more so, finance) is a quantitative discipline, this approach deviates significantly from the established method of Fed watching. Many portfolio managers find value in plugging this data directly into a complex multi-factor model as a key signal. Other portfolio managers see value in using the data to better inform their qualitative strategies by eliminating bias and speeding up their existing macro analysis. Regardless of how the data is used, the most important thing to remember is that any individual score is telling only in how it fits into the larger trend.

Analysis of this trend data has been used to find stark correlations across asset classes. This means that sentiment scores can be used to mitigate downside risk and generate significant alpha across asset classes. Moreover, this method of using domain expertise to impose structure on central bank texts can be seen as a template for how to analyze other complex documents.

We all know that the era of big data is here to stay, but while much of the corporate world has spent years accumulating this data, it has only begun to scratch the surface in analyzing it. Domain or subject-matter expertise should be a sought-after skill for automating vital analysis. Some experts will always ask whether helping to train a system will eventually make them irrelevant, but I contend that subject-matter experts should want to be on the leading edge, training the systems, so they know how they work and how to interpret the new, structured data. Similarly, data scientists should want to work with domain experts to advance data analytics efficiently, without stumbling through the dark forest.
