Mathematica: Using TextCases to Extract Information from Natural Language Text 

Written on September 13, 2020, in Mathematica, Natural Language Processing, Programming

Extracting meaningful information from unstructured, human-readable text is a hot research topic today and has important applications in many domains. I have written a few earlier blog posts related to this topic.

In today’s article, I would like to show how Mathematica can be a great help when working with natural language text.

The Wolfram Language, the language behind Mathematica, has had a function called TextCases since release 10.2. It finds interesting syntactic and semantic patterns in natural language text. The good news is that its functionality keeps being enhanced from release to release. The current version, 12.1.1, offers some incredible features that I haven’t seen in any other framework or API.

Let us start with a straightforward and common example. Given a paragraph, extract all the sentences contained in it. This functionality is available in many libraries.

[Figure: Extracting Sentences]
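
A minimal sketch of sentence extraction, for readers who want to try it themselves. The sample paragraph is made up for illustration and is not the text used in the figure.

```wolfram
(* Hypothetical sample paragraph, chosen only for illustration *)
text = "Mathematica is a technical computing system. It is built on the Wolfram Language. Many people use it for data analysis.";

(* "Sentence" asks TextCases for every sentence found in the string *)
sentences = TextCases[text, "Sentence"]
```

The result is a list of strings, one per sentence that the function detects.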

No fancy stuff here. Next, let us ask for “adjectives” and “proper nouns” in the text.

[Figure: Adjectives]
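
A request for two part-of-speech types at once might look like the following sketch. The sentence is a made-up example, and the exact words returned depend on the tagger in your version.

```wolfram
(* Hypothetical sample sentence, not taken from the figure *)
text = "Marie Curie was a brilliant physicist who lived in Paris.";

(* A list of content types yields an Association keyed by type *)
pos = TextCases[text, {"Adjective", "ProperNoun"}]

(* Look up the matches for one of the requested types *)
pos["Adjective"]
```

Passing a list of types rather than a single string is convenient when several categories are needed from the same text, because the text is specified only once.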

The result shows that the function has correctly identified the requested part-of-speech (POS) words. This feature is also widely available.

What about word groups, such as phrases? In the following, the function identifies the different “noun groups”:

[Figure: Noun Groups]
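
Here is a small sketch of phrase-level extraction. It assumes that the content type "NounPhrase" is what corresponds to the “noun groups” mentioned above; the sentence is invented for illustration.

```wolfram
(* Hypothetical sample sentence *)
text = "The quick brown fox jumped over the lazy dog near the old stone bridge.";

(* Assumption: "NounPhrase" is the content type for noun groups *)
TextCases[text, "NounPhrase"]
```

Other phrase types are requested the same way, by swapping in a different type name from the documentation.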

The following is an example of a “quantifier phrase”:

[Figure: Quantifier Phrase]

Note that the extracted item contains both a number and the corresponding measurement unit. We can also extract “wh-adjective phrases”:

[Figure: WH-Phrase]

We know that “fortunate” is an adjective. The “WH” words are who, whose, whom, which, what, where, when, why, how. The phrase “how fortunate” is therefore a “wh-adjective phrase”. 

The above examples show how we can extract “syntactic” information from the given text. Let us look at some cases involving “semantic” patterns. 

The following example illustrates how we can identify sentences that mention a dog breed.

[Figure: Dog Breed]
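
One way to express “sentences that mention a dog breed” is with the Containing wrapper. This is only a sketch: it assumes "DogBreed" is accepted as a content type name, which should be checked against the documentation, and the sample text is made up. Entity recognition may also need to download data the first time it runs.

```wolfram
(* Hypothetical sample text: only the first sentence names a breed *)
text = "My neighbour adopted a doberman last year. The weather was pleasant that day.";

(* Containing[outer, inner] matches outer units that contain an inner item.
   Assumption: "DogBreed" is a valid content type name. *)
TextCases[text, Containing["Sentence", "DogBreed"]]
```

The point of Containing is that the match is the whole enclosing sentence, not just the breed name itself.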

Here “doberman” is a dog breed. The function has correctly identified that. How about checking for positive or negative sentiments? We can do that too. Look at the following example.

[Figure: Positive Sentiment]

Here we are asking for text that represents positive sentiment. The function has returned two fragments. I agree with the first one, but the second looks suspicious. In fact, I believe that the sentence “She is a hyperactive doberman” is also a positive statement.

The following is an interesting (and common) case where we extract email addresses and web URLs:

[Figure: Email Address]
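
A minimal sketch of this kind of extraction is shown below. The address and URL are placeholders on the example.com domain, not the ones from the figure.

```wolfram
(* Hypothetical sample text with placeholder contact details *)
text = "Write to support@example.com or visit https://www.example.com for details.";

(* Request both pattern-based types in one call; the result is an Association *)
contacts = TextCases[text, {"EmailAddress", "URL"}]

(* Pull out just the email addresses *)
contacts["EmailAddress"]
```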

Works perfectly.

Next, let us ask the function to identify country names and prominent persons.

[Figure: Countries and Persons]
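
The following sketch shows how such a query might be written. The sample text is invented, and which names are actually recognized depends on the entity classifier in your version.

```wolfram
(* Hypothetical sample text; one country is mentioned twice on purpose *)
text = "Mahatma Gandhi was born in India. He later visited Japan and then returned to India.";

(* Raw matched strings for both entity types, as an Association *)
TextCases[text, {"Country", "Person"}]

(* Ask for interpreted entities instead of plain strings *)
TextCases[text, "Country" -> "Interpretation"]

(* Collapse repeated mentions if only distinct names are needed *)
DeleteDuplicates[TextCases[text, "Country"]]
```

TextCases reports every occurrence it finds, so a name that appears several times in the text appears several times in the result unless duplicates are removed explicitly.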

“India” appears twice because there are two occurrences of the word in the text.

Here is an interesting example from the medical domain. The function is able to identify body parts and disease names present in the text:

[Figure: Medical Domain Example]

Nice, isn’t it?

Our last example is sure to appeal to all. Here, we take the “Wikipedia” text pertaining to “Yoga” and extract words and phrases that are related to the concepts “Mythology” or “Religion”. We then present this data as a word cloud:

[Figure: Mythology and Religion]

There is a lot more that one can do with this function. If you are curious about which content types the TextCases function supports, take a look at its documentation page.

I hope you enjoyed this article. There are two other related functions in the Wolfram Language: “TextPosition” and “TextContents”. I wrote an article on the latter some time ago. Check it out.

Have a nice weekend!

 
