Seven Questions to ask a Text Analytics Vendor

Posted on December 1, 2008. Filed under: Opinion Pieces

Perhaps you have noticed that there are really no successful text analytics systems in general use on people’s desktops. It is fair to ask why this is the case.

It isn’t that people don’t need to absorb large volumes of text. Rather, my guess is that the basic approach taken by the makers and vendors that have preceded us isn’t suited to what people want to get from text data.

By contrast, Leximancer is built to analyze bodies of unstructured text from just about any source: large, medium or small; English, Greek or Malay; medical, CPG or high-tech; long or short. What we are trying to accomplish is new, and so is our way of making customers and partners successful.

The purpose of this posting is to examine what text offers most people, to compare this with what previous text analysis software has tried, and failed, to do, and to arm you with questions to consider when evaluating options.

Text tells the story.

Text tells us the story. A good story lays out the ideas and characters with their attributes. We read the text to set the scene – to explain the situation that we have dropped in on. It is like the first episode of a TV series. After that, we read on to see how the characters and ideas interact. There are changing relationships.

A survey, a report or a set of online product reviews is no different. We need to see what issues, products or services are front-of-mind for the authors or responders, what attributes they assign to these issues and products, and how they see the relationships. We then move on to start answering questions and fixing problems. This is how we apply the knowledge gained.

In concrete terms:
1.    We discover the concepts of the situation from the text.
2.    We discover the explanations, or insights, from the text.
3.    We can then act on these insights to alter the system.

Step 1 is important and neglected. You cannot understand the situation without understanding the background ideas.

You cannot understand an IT textbook using the concepts from political science. You would struggle to paint a seascape with a palette suitable for a child’s cartoon. Unfortunately, this problem is insidious and leads to mistakes that we fail to notice. Why? Because if we naively analyze some data with a set of ideas that we know well, and that we fondly expect will apply to the data, we may never see that we are missing a quite different perspective.

Most text analyzing systems will not automatically extract a clear set of the concepts and actors that characterize the text. Systems that come with predefined sets of categories, dictionaries and entity lists are a menace. You cannot risk interpreting your data filtered through an understanding created by someone who is not familiar with your data and your situation, even if the answer looks simple and neat. This leads to Question 1:

Question 1: Does the system’s set of categories, entities, and concepts reflect a real understanding of my data and my situation?

Some systems use predefined categories that are manually tuned by the vendor during pre-sales. The vendor’s consultants will sift through your data and construct extensive lists of terms, pattern matchers and possibly rules. The analysis will then look okay at that time, but things change. New issues will arise in your business, and the terms and entities will change over time. This leads to Question 2:

Question 2: How much time and effort did the vendor invest in tuning the category dictionaries, rules, and entity lists before go-live? When your data inevitably changes, can you feasibly afford to repeat this process to maintain the fidelity of your analysis?

If the analytics system does not use predefined categories, it may use document or word clustering. Many such systems do not produce clear or validated concepts. Remember that for easy and regular use, the discovered patterns of meaning need to be stable and clear. Don’t be fooled by people who say that this sort of system works because it looks attractive and even compelling. There are ways to check whether discovered term clusters are real measures of meaning, or whether they are wasting your time. Here are some questions for vendors who offer term or document clustering or other concept map solutions:

Question 3: If the product uses document clustering: how does the system scale with vast numbers of documents? If a document contains several different ideas, can it be in two topics at once? If I cut up the same documents into different chunks, would the pattern of clusters be similar? Text content isn’t always organized in predictable ways, so this is an important set of questions.
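One rough way to put the chunking question to a vendor, or to your own data, is to compare which terms travel together under two different chunk sizes. The Python sketch below is only an illustration under stated assumptions: it uses word pairs that share a chunk as a crude stand-in for clusters, and the function names (`chunk`, `cooccurrence_pairs`, `chunking_stability`) are hypothetical. It is not how Leximancer or any particular product works.

```python
from collections import Counter
from itertools import combinations


def chunk(words, size):
    """Split a list of words into consecutive chunks of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def cooccurrence_pairs(chunks, top_n):
    """Return the `top_n` most frequent unordered word pairs sharing a chunk."""
    counts = Counter()
    for c in chunks:
        # Each distinct pair is counted once per chunk it appears in.
        for pair in combinations(sorted(set(c)), 2):
            counts[pair] += 1
    return {pair for pair, _ in counts.most_common(top_n)}


def jaccard(a, b):
    """Overlap of two sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def chunking_stability(text, size_a, size_b, top_n=20):
    """Compare the strongest word pairs found under two chunk sizes.

    Assumption: whitespace tokenisation and raw pair counts are a toy
    proxy for whatever clustering a real system performs.
    """
    words = text.lower().split()
    pairs_a = cooccurrence_pairs(chunk(words, size_a), top_n)
    pairs_b = cooccurrence_pairs(chunk(words, size_b), top_n)
    return jaccard(pairs_a, pairs_b)
```

Calling `chunking_stability(text, 50, 200)` on your own text gives a score between 0 and 1. A score near 0 suggests that the discovered pattern depends heavily on how the text was cut up, which is the instability this question is probing for.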

Question 4: If I take two different documents either by different authors or in different languages, would the discovered patterns of meaning look similar between the two? Multinationals – think about this if you want a consistent, true view of your customer comments.

Step 2 is almost totally ignored. Text information can tell you a story so you can improve business performance—with customers, with marketing. What else would you really want to do with it?

Quantitative, categorical, and numerical data mining is really good for establishing metrics and testing to see if pre-defined metrics change. Great. Do this.  It is really good for predicting whether a pre-selected situation is matched, such as customer churn probability.

But don’t forget that analyzing text, such as customer comments or reviews of competitor products, excels at telling you what is happening, because text is human communication: that is what it is for. So why waste this extremely valuable and rich source of intelligence?

Think of it this way. If your metrics show your sales are rising, everyone feels great. If your metrics show you your results are falling off a cliff, how do you work out how to fix the system? This is the feedback you need for controlling a system. Your text data will tell you how to turn things around faster and more accurately than almost any other source of management information.

Unfortunately, this is where most text analytics systems fail or don’t even bother. Here are some other questions:

Question 5: Does the system suggest chains of meaning which are well supported by the data, and which I can understand and explain to a manager? In other words, is it an explanatory model?

Question 6: Can I test hypotheses (educated guesses) based on the perspective of the customer?

Question 7: How does a simple list of terms tell me much about the reasons for what is happening, without having to do a whole lot of guessing or having to read large amounts of text after all?

Step 3: Set your bar high and expect an automatic, systematic and scalable system that can enable unstructured textual information to become a real enterprise asset—good for uncovering new customer insights, new product ideas, and business process improvements that were previously unachievable. And now act on what you find!

I hope this helps. People are still doing a whole lot of writing and talking trying to tell you things. I think we need to listen more carefully, understand what they are saying and then act thoughtfully.

By Andrew E. Smith



5 Responses to “Seven Questions to ask a Text Analytics Vendor”



Andrew, these are helpful guidelines, but you set the bar VERY high with “general use on people’s desktops.” There is NO semantic text analysis system of the type you describe in general use anywhere, nor will there be for several years.

Arguably Google comes closest. It’s a text-analytics system — it has significant abilities to recognize named entities and pattern-based entities such as phone numbers — and it’s in general use on people’s desktops (even if it isn’t a desktop system).

Given the very high proportion of textual information that folks want to analyze that’s out on the Net, and the economies of scale in collecting and analyzing it, I doubt that systems that actually run on the desktop will ever be the prevailing general-use approach.

Thanks for your comment Seth. I think if you relax the “general desktop use” to “general business use (regardless of platform)” then it holds true. While Leximancer is not yet the de facto standard in business, we do enable casual users to quickly get value and get to an analysis that is useful to them. This has long held true in the academic world and is now being played out in use of It is this ease-of-use with automation, clarity and control that offers the opportunity for Leximancer to become ubiquitous.

Hi Seth,

Thanks for your input. I would make just a couple of technical observations.

It seems to me that the majority of text analytics approaches at the moment are a little like the person looking for their car keys under the street light, even though they dropped them somewhere else.

I say this because the dominant paradigm, which is based on trying to control grammar, lexicon, and semantics, is expensive and time consuming to customise and maintain as the data changes. It is also risky under conditions of change or inadequate human understanding of the data corpus, or for informal language such as speech transcription. So these systems tend to be unusable for about 80% of the total number of text analysis applications out there.

But more than that: why would you pay large amounts of money for text analysis software, and then have to manually specify and maintain for yourself the categories of words and names, and possibly the rules of grammar? Succinctly, why pay a dog and bark yourself?

Why do we need constructed stemming rules that specify that the words ‘service’ and ‘services’ mean the same thing, when a corpus-based method can find that, in the data in question, ‘service’ in fact refers to the idea of customer service, whereas ‘services’ refers to consulting offerings? Why would we want to specify upfront that ATM means Automatic Teller Machine rather than Asynchronous Transfer Mode, when systems exist that can work this out from the real text?
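To make the ‘service’ versus ‘services’ point concrete, here is a minimal Python sketch of one generic corpus-based idea: compare the words that appear near each term. It is an illustration only, not Leximancer's method. The four sentences are invented toy data, and the names `context_profile` and `cosine` are hypothetical helpers.

```python
import math
from collections import Counter


def context_profile(sentences, target, window=3):
    """Count the words found within `window` positions of `target`."""
    profile = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word == target:
                lo = max(0, i - window)
                neighbours = words[lo:i] + words[i + 1:i + 1 + window]
                profile.update(neighbours)
    return profile


def cosine(p, q):
    """Cosine similarity of two count profiles (0.0 if either is empty)."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Invented toy sentences, for illustration only.
sentences = [
    "the customer service desk was slow to respond",
    "customer service staff were friendly",
    "our consulting services include strategy and training",
    "they sell consulting services to banks",
]

service = context_profile(sentences, "service")
services = context_profile(sentences, "services")
similarity = cosine(service, services)
```

A low `similarity` means the two word forms keep different company in this data, so a stemming rule that merges them would blur two distinct ideas. On a real corpus the profiles would need far more text, and probably stop-word handling, before the number meant much.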

There have been methods of corpus-based text analysis for some time, such as HAL, LSA, the Topic Model, and the SP Model. We have followed on from these experimental university systems to create Leximancer. Leximancer did not just fall out of the sky – there is a strong, validated research tradition behind what we do.

My point here is that semantics is not about making a word mean what we think it should mean, but working out what the authors of the text data thought the word meant.

Just to clarify what I meant by ‘desktop users’ in the original post: I was really referring to people who could use text analysis support in their everyday work. The adoption of the spreadsheet inspired me – the spreadsheet went from a back-office tool for professional financial analysts to a ubiquitous productivity application. Besides our many analyst customers, we have Leximancer customers who routinely use the system to perform literature reviews, check the messaging of a report they are writing, or check that their tender submission is well integrated and actually addresses the buyers’ requirements.

To me it seems that comprehending and recalling a 100-page document can be just as demanding and critical for a busy person as a 50,000-response survey is for an analyst, or as monitoring relevant web content is for a marketing executive.


