Text Analytics with SAP HANA

Introduction

BI companies deal with huge amounts of data that come in many different forms. The data and its surrounding content tell us which product sells best, what people think about product prices and whether they are too high, how many people of a given age or gender use a specific product, and so on. From this data, companies have to find and extract the information they need.

It is important to note that data can be structured or unstructured. Structured data resides in fixed fields within a record, such as data contained in relational databases or spreadsheets. Structured data poses few problems when we analyze it and store it in a database. Unstructured data, on the other hand, is information that doesn't reside in a traditional row-column database. It usually includes text and multimedia content such as e-mail messages, videos, photos, web pages, and many other kinds of business documents. Extracting meaningful information from unstructured data is harder, and this is where text analysis comes in: the process of analyzing unstructured data to extract relevant information and transform it into structured information.

Text Processing and Text Analysis

Text processing capabilities support search, text analysis, and text mining. The search technique covers thirty-two languages and provides full-text search, which means you can search for an item much as you would in a web search engine, regardless of the order of the words or characters. The search engine also supports fuzzy search, which can match incomplete words, words with typos, and the like.
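As a sketch of how a fuzzy full-text search looks in SQL (the table and column names here, REVIEWS and TEXT, are hypothetical; the column is assumed to carry a full-text index):

```sql
-- Fuzzy search on a hypothetical REVIEWS table; FUZZY(0.8) sets the
-- minimum similarity score, so small typos like 'restarant' still match.
SELECT "ID", "TEXT"
FROM "REVIEWS"
WHERE CONTAINS("TEXT", 'restarant', FUZZY(0.8));
```

Lowering the FUZZY threshold makes matching more tolerant at the cost of more false positives.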

The following image presents the pre-processing and analysis steps. The first four steps are executed on the unstructured text in order to create a full-text index, a hidden column attached to the source table. These steps include:

  • File format filtering – converts any binary document format to text or HTML
  • Language detection – determines the language of the document
  • Tokenization – decomposes a word, phrase, or sentence into tokens
  • Stemming – finds the linguistic base, or stem, of a word
[Image: Text processing and text analysis steps]

The last four steps are optional; they may be executed on the unstructured text to expand the full-text index. I will give a brief introduction to a few of these steps in the examples below.

1. Example

a) This is a basic example in which we create a full-text index without specifying possible languages.
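A minimal index of this kind might be created as follows (assuming the same hypothetical REVIEWS table with a TEXT column):

```sql
-- Create a full-text index with default settings; the index is kept
-- in sync with the source table automatically.
CREATE FULLTEXT INDEX "IDX_REVIEWS_TEXT" ON "REVIEWS"("TEXT");
```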

The following query searches for specific restaurant terms in full text context.

We see that English is assumed for all inputs in the query result.

[Image: 1a query result]
SNIPPETS returns each query term match together with the surrounding text, while HIGHLIGHTED returns the full text content with the key terms tagged.
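A query combining both functions could look like this (same hypothetical table and index as above):

```sql
-- SNIPPETS shows fragments around each match; HIGHLIGHTED tags the
-- matched terms inside the full text content.
SELECT SNIPPETS("TEXT"), HIGHLIGHTED("TEXT")
FROM "REVIEWS"
WHERE CONTAINS("TEXT", 'restaurant');
```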

b) We can also use LANGUAGE COLUMN to identify the language of the text in each row, or we can restrict detection to specific languages with the LANGUAGE DETECTION option. With the following query we detect English, French, and German; note that before creating a new index with modified parameters, you have to drop the existing index first.
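A sketch of the recreated index, restricting detection to the three languages (index and table names are hypothetical):

```sql
DROP FULLTEXT INDEX "IDX_REVIEWS_TEXT";

-- Restrict automatic language detection to English, French, and German.
-- Alternatively, LANGUAGE COLUMN "LANG" would read the language per row
-- from a dedicated column instead of detecting it.
CREATE FULLTEXT INDEX "IDX_REVIEWS_TEXT" ON "REVIEWS"("TEXT")
  LANGUAGE DETECTION ('EN', 'FR', 'DE');
```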

This query returns the following result:

[Image: 1b query result]

2. Example

Furthermore, we can extend this query with the text analysis option. Text analysis is a process that goes through the full-text indexing steps and saves the result in a generated $TA table. It can extract entities such as persons, products, or places from documents. Words can be assigned part-of-speech tags and divided into noun groups, entities and facts can be extracted, sentiments can be linked to their topics, and so on. If the source table changes, the $TA table is automatically updated, just like the full-text index. If the source table is dropped, the $TA table is automatically dropped as well.

a) Example with text analysis on

The output will return only records in English.
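Sketched with the hypothetical names from above, enabling text analysis and inspecting the generated table:

```sql
DROP FULLTEXT INDEX "IDX_REVIEWS_TEXT";

-- TEXT ANALYSIS ON populates a generated table named $TA_<index name>
CREATE FULLTEXT INDEX "IDX_REVIEWS_TEXT" ON "REVIEWS"("TEXT")
  TEXT ANALYSIS ON
  LANGUAGE DETECTION ('EN', 'FR', 'DE');

-- Tokens, stems, and extracted entities end up here
SELECT * FROM "$TA_IDX_REVIEWS_TEXT";
```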

[Image: 2a query result]

In the output, the default token separators are used for word breaking. However, if fewer or different separators are desired, they can be defined explicitly. As an example, look at the date in this result: it is broken out into month, day, and year. If we want the whole date rather than its parts, we should execute the query in the next example.

b) This query recreates the index with an empty set of token separators, so words in the text are no longer split at those characters.
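With the same hypothetical names, the index might be recreated like this:

```sql
DROP FULLTEXT INDEX "IDX_REVIEWS_TEXT";

-- An empty TOKEN SEPARATORS list keeps values such as dates in one piece
CREATE FULLTEXT INDEX "IDX_REVIEWS_TEXT" ON "REVIEWS"("TEXT")
  TEXT ANALYSIS ON
  TOKEN SEPARATORS '';
```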

Executing the same select statement a second time results in:

[Image: 2b query result]

In text analysis we can specify configuration settings by adding the configuration parameter, which changes the analysis output. Linguistic analysis output shows tokens, stems, and parts of speech, while entity and fact extraction displays entities, semantic types, and relationships. Users can also create their own configuration files following a specific syntax.

Sentiment analysis, also known as Voice of Customer, is an interesting part of the fact extraction modules and is available in ten languages so far. Words can be extracted and assigned a matching emotion based on a set of rules covering customer sentiments, requests, emoticons, and profanities. An emotion can be classified as strong or weak, and a sentiment as positive, negative, or neutral. Extracted words are linked to their corresponding topics; moreover, the analysis can detect whether any emoticons are used in the text, whether a request is made, or whether there are problems or words that should be censored. We simply include the Voice of Customer parameter, a default configuration, in the query. If the standard configuration doesn't meet their needs, users can easily customize sentiment keywords in dictionaries without dealing with the rules or changing them.

3. Example

In this example we recreate the full-text index using text analysis with the predefined EXTRACTION_CORE_VOICEOFCUSTOMER configuration.
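A sketch of that statement (table and index names are hypothetical):

```sql
DROP FULLTEXT INDEX "IDX_REVIEWS_TEXT";

-- The Voice of Customer configuration enables sentiment extraction
CREATE FULLTEXT INDEX "IDX_REVIEWS_TEXT" ON "REVIEWS"("TEXT")
  TEXT ANALYSIS ON
  CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER';
```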

The following query shows the output of sentiment analysis for the English language, excluding predefined entity types such as PERSON, PRODUCT, and ADDRESS.
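Such a query might filter the generated $TA table roughly as follows (the columns are the standard $TA columns; the exact entity-type list is illustrative):

```sql
-- Keep English rows and drop the standard entity types,
-- leaving the sentiment-related output
SELECT "TA_TOKEN", "TA_TYPE", "TA_NORMALIZED"
FROM "$TA_IDX_REVIEWS_TEXT"
WHERE "TA_LANGUAGE" = 'en'
  AND "TA_TYPE" NOT IN ('PERSON', 'PRODUCT', 'ADDRESS');
```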

[Image: query 3 and its result]

Dictionary

Dictionaries are used when a user wants to find customized information about entities. A dictionary distinguishes standard and variant names of entities such as abbreviations, alternative spellings, and aliases; e.g., United States of America is the standard form of the country name, while its variants include USA and US. If the default dictionary results don't suit the user's needs, the user can create a custom dictionary.

4. Example

In this example we use a customized dictionary file in which the classification of the word "horrible" is changed from strong negative to strong positive sentiment. In the extract from the dictionary, the adjective "horrible" can be found under the category SPS, meaning strong positive sentiment.
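An extract of such a dictionary file might look roughly like this (custom dictionaries are XML files; treat the exact element and attribute names as an assumption to be checked against the text analysis customization documentation):

```xml
<!-- "horrible" reclassified under the strong positive sentiment category (SPS) -->
<entity_category name="SPS">
  <entity_name standard_form="horrible"/>
</entity_category>
```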

Result:

[Image: query 4 result]

Text mining

Text mining is an optional data structure built from the text analysis results that works at the document level. It compares the content of documents and performs semantic analysis. The main functions of text mining are identifying similar documents, identifying key terms and related terms, and categorizing new documents based on a training corpus.

5. Text mining example

This is just one example of what text mining can do. The following statement uses the function TM_GET_SUGGESTED_TERMS, which returns suggested terms for an incomplete input.
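A sketch of such a call (the index behind it would need to be created with TEXT MINING ON; all names are hypothetical, and the exact clause syntax should be checked against the SAP HANA reference):

```sql
-- Suggest completions for the fragment 'res' based on terms
-- found in the indexed TEXT column
SELECT *
FROM TM_GET_SUGGESTED_TERMS (
  TERM => 'res'
  SEARCH "TEXT" FROM "REVIEWS"
);
```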

Result:

[Image: query 5 result]

Conclusion

Text analysis is a very useful, I would even say essential, capability for BI companies nowadays. Thanks to text analysis, companies do not have to struggle with massive volumes of data. SAP HANA offers modeling tools that are easy to use and provides text analysis functionality that simplifies the process of combining unstructured and structured data. SAP HANA's tools can analyze and extract valuable information from large bodies of text, helping companies solve critical problems, improve their operations, and make quick, intelligent decisions.
