Extracting meaningful information from unstructured, human-readable text is an active area of research with important applications in many domains. I have written a few blog posts on this topic; for example, see this and this.
In today’s article, I would like to show how Mathematica can be a great help when working with natural language text.
Since release 10.2, the Wolfram Language (the language behind Mathematica) has had a function called “TextCases” that finds syntactic and semantic patterns in natural language text. The good news is that its functionality has been enhanced with each release. The current version, 12.1.1, offers some incredible features that I haven’t seen in any other framework or API.
Let us start with a straightforward and common example. Given a paragraph, extract all the sentences contained in it. This functionality is available in many libraries.
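As a rough sketch of what such a call can look like (the sample paragraph below is made up for illustration, and I am assuming that "Sentence" is accepted as a content type by your version of TextCases):

```wolfram
(* Hypothetical sample paragraph, not taken from the original post *)
paragraph = "Mathematica is a technical computing system. It is developed by Wolfram Research. Many people use it for text analysis.";

(* Ask TextCases for every span it identifies as a sentence *)
sentences = TextCases[paragraph, "Sentence"]

(* TextSentences is the dedicated built-in for sentence splitting
   and makes a handy cross-check *)
TextSentences[paragraph]
```

Both calls should return a list of strings, one per sentence.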
No fancy stuff here. Next, let us ask for “adjectives” and “proper nouns” in the text.
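A minimal sketch of such a request, using a made-up sentence; passing a list of content types to TextCases returns an association keyed by type:

```wolfram
(* Hypothetical sample text, not taken from the original post *)
text = "Alice bought a beautiful old house in Paris.";

(* A list of forms yields an association: one key per requested type *)
pos = TextCases[text, {"Adjective", "ProperNoun"}]

(* Individual lists can then be pulled out by key *)
pos["Adjective"]
pos["ProperNoun"]
```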
The result shows that the function has correctly identified the words with the requested parts of speech (POS). This feature is also widely available.
What about word groups, such as phrases? In the following, the function identifies the different “noun groups”:
The following is an example of a “quantifier phrase”:
Note that the extracted item contains both a number and the corresponding measurement unit. We can also extract “wh-adjective phrases”:
We know that “fortunate” is an adjective. The “WH” words are who, whose, whom, which, what, where, when, why, how. The phrase “how fortunate” is therefore a “wh-adjective phrase”.
The above examples show how we can extract “syntactic” information from the given text. Let us look at some cases involving “semantic” patterns.
The following example illustrates how we can identify sentences that mention a dog breed.
Here “doberman” is a dog breed. The function has correctly identified that. How about checking for positive or negative sentiments? We can do that too. Look at the following example.
Here we are asking for text that represents positive sentiment. The function has returned two fragments. I agree with the first one, but the second looks suspicious. In fact, I believe that the sentence “She is a hyperactive doberman” is also a positive statement.
The following is an interesting (and common) case, where we extract email addresses and web URLs:
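A sketch of this kind of extraction, with a made-up address and URL; I am assuming here that "EmailAddress" and "URL" are the content type names your version accepts:

```wolfram
(* Hypothetical sample text; the address and URL are placeholders *)
text = "For details, write to jane.doe@example.com or visit https://www.example.com/contact.";

(* One call, two content types; the result is an association keyed by type *)
contacts = TextCases[text, {"EmailAddress", "URL"}]
```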
Works perfectly.
Next, let us ask the function to identify country names and prominent persons.
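Here is a sketch of such a query on a made-up text. The second call uses the form -> "Interpretation" syntax, which asks for Wolfram Language entities rather than the matched strings:

```wolfram
(* Hypothetical sample text, not taken from the original post *)
text = "Mahatma Gandhi was born in India. Jawaharlal Nehru became the first Prime Minister of India.";

(* Matched strings for both entity types, returned as an association *)
found = TextCases[text, {"Country", "Person"}]

(* Interpreted entities instead of plain strings *)
countries = TextCases[text, "Country" -> "Interpretation"]
```

Each occurrence in the text is reported separately, so a country mentioned twice shows up twice in the result.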
“India” appears twice because there are two occurrences of the word in the text.
Here is an interesting example from the medical domain. The function is able to identify body parts and disease names present in the text:
Nice, isn’t it?
Our last example is sure to appeal to all. Here, we take the “Wikipedia” text pertaining to “Yoga” and extract words and phrases that are related to the concepts “Mythology” or “Religion”. We then present this data as a word cloud:
There is a lot more that one can do with this function. If you are curious to know which content types the “TextCases” function supports, take a look at this page.
I hope you enjoyed this article. There are two other related functions in the Wolfram Language: “TextPosition” and “TextContents”. I wrote an article on the latter some time ago. Check it out.
Have a nice weekend!