Parsing Text with MeaningCloud’s Text Analytics API

Written on December 9, 2018, in LISP, Natural Language Processing, Programming

There is widespread interest in Natural Language Processing (NLP) today, and several API services are available to cater to this demand. See this article for a fairly detailed list of services. All of them support multiple languages, including English.

Today, I am going to share my experience of working with MeaningCloud’s Text Analytics API, specifically the Parsing API. MeaningCloud is among the pioneers in the field of NLP, with offices in Madrid and New York. They offer a wide variety of products and services in the area of text processing, with a very reasonably priced subscription model. The free tier includes up to 20,000 requests per month.

The Text Analytics API consists of the following categories:

  • Topics Extraction
  • Text Classification
  • Deep Categorization
  • Sentiment Analysis
  • Language Identification
  • Text Clustering
  • Summarization
  • Document Structure Analysis
  • Corporate Reputation
  • Lemmatization, POS and Parsing

A good overview of the Text Analytics service is available here. There is also a nice introductory YouTube video.

For my experiment, I decided to focus only on the last category: Lemmatization, POS and Parsing. What I found interesting about this API is that it returns what MeaningCloud calls a morphosyntactic tree, which is similar to a constituency tree. Given that dependency parsing is widely used these days (for example, by Google and TextRazor), I was a bit surprised that the company follows the more traditional approach, although it is still useful.

Once you register for the free tier at their site, you can use the supplied key to make REST calls to their API, or if you merely want to understand the API capabilities, their test console comes in handy. They also have client SDKs for PHP, Java, Python, and Visual Basic.

I decided to use their REST service from my Lisp program, and it was pretty straightforward.

Although this API endpoint is essentially meant for POS and Lemmatization, you can optionally include Sentiment and Topic extraction options. There is no additional charge for this request. Nice, isn’t it?

The different Request parameters are explained here.

In my API call, I have included Sentiment and all categories of Topics in addition to POS and Lemmatization.

Here is the parse function:

Function to Parse
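
For readers who are not working in Lisp, the same kind of call can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not the Lisp function from this article: the endpoint URL and the parameter names (key, of, lang, txt, tt, sm) are assumptions based on MeaningCloud's request-parameter documentation and should be checked against it, and build_parse_request and the Python parse_text are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the MeaningCloud request-parameter docs.
PARSER_URL = "https://api.meaningcloud.com/parser-2.0"


def build_parse_request(text, api_key, lang="en"):
    """Return a POST request asking for POS/lemmas plus sentiment and all topics."""
    params = {
        "key": api_key,      # licence key obtained when registering for the free tier
        "of": "json",        # output format
        "lang": lang,
        "txt": text,
        "tt": "a",           # assumed: 'a' requests all topic types
        "sm": "general_en",  # assumed: naming a sentiment model enables sentiment
    }
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(PARSER_URL, data=body, method="POST")


def parse_text(text, api_key):
    """Send the request and return the decoded JSON response as a dict."""
    with urllib.request.urlopen(build_parse_request(text, api_key)) as resp:
        return json.load(resp)
```

Keeping the request construction separate from the network call makes it possible to inspect exactly what would be sent without spending a credit.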

Let us invoke the parser with the sentence: Lilly and her friends admired their professor.

CL-USER 14 > (parse-text "Lilly and her friends admired their professor.")

T

If the API call succeeds, this function returns T. For convenience, the returned result is saved in a global variable for subsequent use. Here is the partial output:

CL-USER 15 > (pprint *parser-result*)

((:STATUS (:CODE . "0") (:MSG . "OK") (:CREDITS . "1") (:REMAINING--CREDITS . "19750"))
 (:TOKEN--LIST
  ((:TYPE . "sentence")
   (:ID . "10")
   (:INIP . "0")
   (:ENDP . "45")
   (:STYLE (:IS-BOLD . "no") (:IS-ITALICS . "no") (:IS-UNDERLINED . "no") (:IS-TITLE . "no"))
   (:SEPARATION . "A")
   (:QUOTE--LEVEL . "0")
   (:AFFECTED--BY--NEGATION . "no")
   (:SENTIMENT (:SELF--SENTIMENT (:TEXT . "*") (:INIP . "0") (:ENDP . "45") (:CONFIDENCE . "100") (:SCORE--TAG . "P+")))

A quick look at the result shows that the call has consumed 1 credit (out of the 20,000 free credits per month). Notice the last part of the result fragment, where the sentiment polarity of the whole sentence is given as P+, meaning Strong Positive. There is also a separate Global Sentiment section:

(:GLOBAL--SENTIMENT
 (:MODEL . "general_en")
 (:SCORE--TAG . "P+")
 (:AGREEMENT . "AGREEMENT")
 (:SUBJECTIVITY . "SUBJECTIVE")
 (:CONFIDENCE . "100")
 (:IRONY . "NONIRONIC"))

A detailed description of the various fields in the returned response object is available here.
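
The status and sentiment fields shown above can be read with a few lines of code. The Python sketch below assumes the raw JSON uses the keys status, remaining_credits, global_sentiment and score_tag (the underscore forms of the Lisp keywords above). Only the P+ label is confirmed by this article; the other score tags are assumptions based on MeaningCloud's documentation, and summarize_response is a hypothetical helper name.

```python
# Score-tag labels: only "P+" is confirmed by the output shown in this article;
# the rest are assumptions to be checked against the MeaningCloud docs.
SCORE_TAG_LABELS = {
    "P+": "strong positive",
    "P": "positive",
    "NEU": "neutral",
    "N": "negative",
    "N+": "strong negative",
    "NONE": "no sentiment",
}


def summarize_response(response):
    """Check the status block, then describe credits and global sentiment."""
    status = response["status"]
    if status["code"] != "0":  # codes arrive as strings; "0" means OK
        raise RuntimeError(f"API error {status['code']}: {status['msg']}")
    sentiment = response.get("global_sentiment", {})
    return {
        "credits_used": int(status["credits"]),
        "credits_left": int(status["remaining_credits"]),
        "polarity": SCORE_TAG_LABELS.get(sentiment.get("score_tag"), "unknown"),
        "confidence": int(sentiment.get("confidence", 0)),
    }


# Values copied from the response fragments shown above.
sample = {
    "status": {"code": "0", "msg": "OK", "credits": "1", "remaining_credits": "19750"},
    "global_sentiment": {"model": "general_en", "score_tag": "P+", "confidence": "100"},
}
print(summarize_response(sample))
```

Note that every value in the response is a string, including the numeric ones, so they need converting before any arithmetic.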

Instead of using the result object directly, I prefer to use a subset that contains just the key fields of interest to me.

Here is the corresponding function:

Get Words Function
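
The idea of keeping only the fields of interest can be sketched in Python as follows. This is not the author's Lisp get-words-info; it is a hypothetical equivalent that assumes sentences and phrases carry a nested token_list while individual words carry an analysis_list. The phrase token and its id in the sample are made up for illustration.

```python
def get_words_info(response):
    """Collect form, id, analysis and sense data for each word, in document order."""
    wanted = ("form", "id", "analysis_list", "sense_list")
    words = []

    def walk(tokens):
        for token in tokens:
            children = token.get("token_list")
            if children:
                walk(children)  # sentence or phrase: descend into its tokens
            elif "analysis_list" in token:
                # Word-level token: keep only the fields of interest.
                words.append({k: token[k] for k in wanted if k in token})

    walk(response.get("token_list", []))
    return words


# Trimmed-down sample; the phrase token (id "11") is hypothetical.
sample = {
    "token_list": [
        {"type": "sentence", "id": "10", "token_list": [
            {"type": "phrase", "form": "Lilly and her friends", "id": "11", "token_list": [
                {"form": "Lilly", "id": "1", "inip": "0",
                 "analysis_list": [{"tag": "NPFS-N-", "lemma": "Lilly"}]},
                {"form": "and", "id": "2", "inip": "6",
                 "analysis_list": [{"tag": "CCYN9", "lemma": "and"}]},
            ]},
        ]},
    ]
}
for word in get_words_info(sample):
    print(word["form"], word["analysis_list"][0]["tag"])
```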

Here is the partial output from that function call.

CL-USER 16 > (pprint (get-words-info *parser-result*))

(((:FORM . "Lilly")
  (:ID . "1")
  (:ANALYSIS--LIST
   ((:TAG . "NPFS-N-") (:LEMMA . "Lilly") (:ORIGINAL--FORM . "Lilly") (:SENSE--ID--LIST ((:SENSE--ID . "5e3f6f66ec")))))
  (:SENSE--LIST
   ((:ID . "5e3f6f66ec")
    (:FORM . "Lilly")
    (:INFO . "sementity/class=instance@fiction=nonfiction@id=ODENTITY_FIRST_NAME@type=Top>Person>FirstNamesemld_list=sumo:FirstName"))))
 ((:FORM . "and") (:ID . "2") (:ANALYSIS--LIST ((:TAG . "CCYN9") (:LEMMA . "and") (:ORIGINAL--FORM . "and"))))
 ((:FORM . "her")
  (:ID . "3")
  (:ANALYSIS--LIST
   ((:TAG . "SD-PFS3N7") (:LEMMA . "her") (:ORIGINAL--FORM . "her") (:SENSE--ID--LIST ((:SENSE--ID . "PRONHUMAN")))))
  (:SENSE--LIST ((:ID . "PRONHUMAN") (:FORM . "her") (:INFO . "semhum=human"))))
 ((:FORM . "friends")
  (:ID . "4")
  (:ANALYSIS--LIST
   ((:TAG . "NC-P-N6") (:LEMMA . "friend") (:ORIGINAL--FORM . "friends") (:SENSE--ID--LIST ((:SENSE--ID . "2f1f98e4bb")))))
  (:SENSE--LIST
   ((:ID . "2f1f98e4bb")
    (:FORM . "friend")
    (:INFO . "sementity/class=class@fiction=nonfiction@id=ODENTITY_PERSON@type=Top>Person semld_list=http://en.wikipedia.org/wiki/Friend|sumo:Human"))))

As you can observe, for each word, there is a POS tag and Lemma. Where relevant, there is an entry for Sense ID as well. The POS tag is quite fine-grained. It does not stop with the top-level category such as Verb, Noun, Adverb, etc., but includes as much information as can be given for that category.  Where possible, it even contains the relative frequency of occurrence of the word!

Word        Tag             Meaning
Lilly       NPFS-N-         Noun, Proper Noun, Feminine, Singular, Normal Word
and         CCYN9           Conjunction, Coordinated, Copulative, Normal Word, Maximum Frequency
her         SD-PFS3N7       Possessive, Determiner, Plural, Feminine, Singular, 3rd Person, Normal Word, High Frequency
friends     NC-P-N6         Noun, Common Noun, Plural, Normal Word, Medium-high Frequency
admired     VI-P3ASA-N-N4   Verb, Indicative, Plural, 3rd Person, Past, Simple, Active, Non-auxiliary, Normal Word, Medium-low Frequency
their       SD-SMP3N7       Possessive, Determiner, Singular, Masculine, Plural, 3rd Person, Normal Word, High Frequency
professor   NC-S-N4         Noun, Common Noun, Singular, Normal Word, Medium-low Frequency

POS tag details can be found here.
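
As a small illustration of how positional these tags are, the Python sketch below decodes just the trailing frequency character. The mapping uses only the four digits seen in the seven tags above; other digits are reported as unlisted rather than guessed, and frequency_of is a hypothetical helper name. The full tagset is in the POS tag documentation.

```python
# Frequency labels taken only from the tags listed in this article;
# any other digit is left undecoded rather than guessed.
FREQUENCY_LABELS = {
    "9": "maximum",
    "7": "high",
    "6": "medium-high",
    "4": "medium-low",
}


def frequency_of(tag):
    """Return the frequency label encoded in a tag's last character, if any."""
    if not tag or tag[-1] == "-":
        return None  # a trailing hyphen means no frequency information
    return FREQUENCY_LABELS.get(tag[-1], f"unlisted ({tag[-1]})")


for word, tag in [("Lilly", "NPFS-N-"), ("and", "CCYN9"), ("friends", "NC-P-N6")]:
    print(word, tag, frequency_of(tag))
```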

Instead of getting the result in the JSON format as described above, it is also possible to get it in the form of a Tree rendered as a GIF image. This is useful to visualize the structure of the input sentence. Below is the corresponding tree for the sample sentence: Lilly and her friends admired their professor.

Morphosyntactic Tree

I got this GIF image through their Test Console.

Let us try parsing one more sentence: That book was published in 1998 and costs $150 now.

The parse tree for this sentence is given below:

Morphosyntactic Tree – Second Example

Let us parse it using our function:

CL-USER 18 > (parse-text "That book was published in 1998 and costs $150 now.")
T

Take a look at the description of the token 1998:

((:FORM . "1998")
 (:NORMALIZED--FORM . "date@20||||1998||||||")
 (:ID . "6")
 (:INIP . "27")
 (:ENDP . "30")
 (:STYLE (:IS-BOLD . "no") (:IS-ITALICS . "no") (:IS-UNDERLINED . "no") (:IS-TITLE . "no"))
 (:SEPARATION . "1")
 (:QUOTE--LEVEL . "0")
 (:AFFECTED--BY--NEGATION . "no")
 (:ANALYSIS--LIST ((:TAG . "NDUU-n-") (:LEMMA . "1998") (:ORIGINAL--FORM . "1998")))
 (:TOPIC--LIST
  (:TIME--EXPRESSION--LIST
   ((:FORM . "1998")
    (:NORMALIZED--FORM . "20||||1998||||||")
    (:ACTUAL--TIME . "1998-12-09")
    (:PRECISION . "year")
    (:INIP . "27")
    (:ENDP . "30")))))

What is interesting is that the token 1998 has been tagged as a date rather than just a number. The Actual Time element is given as 1998-12-09. Since the input specifies only the year (note that the Precision is year), my guess is that the month and day were filled in from the date on which I ran the API call for this article (December 9). So, the parser engine does much more than assign a superficial label to each token. This can come in quite handy if we are keen to perform a deep semantic analysis of the input text.
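
Because only part of the Actual Time value reflects the input, it is safer to trim it to the stated precision before using it. Here is a minimal Python sketch; only the precision value year appears in the output above, so the month and day entries are assumptions, and meaningful_date is a hypothetical helper name.

```python
# Number of leading characters of a YYYY-MM-DD date that each precision keeps.
# Only "year" appears in this article's output; "month" and "day" are assumed.
PRECISION_LENGTH = {"year": 4, "month": 7, "day": 10}


def meaningful_date(time_expression):
    """Trim actual_time to the part that the stated precision supports."""
    actual = time_expression["actual_time"]
    keep = PRECISION_LENGTH.get(time_expression.get("precision"))
    return actual[:keep] if keep else actual  # unknown precision: leave untouched


expr = {"form": "1998", "actual_time": "1998-12-09", "precision": "year"}
print(meaningful_date(expr))  # 1998
```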

Likewise, look at the (partial) description of the token $:

((:FORM . "$")
 (:ID . "9")
 (:INIP . "42")
 (:ENDP . "42")
 (:STYLE (:IS-BOLD . "no") (:IS-ITALICS . "no") (:IS-UNDERLINED . "no") (:IS-TITLE . "no"))
 (:SEPARATION . "1")
 (:QUOTE--LEVEL . "0")
 (:AFFECTED--BY--NEGATION . "no")
 (:ANALYSIS--LIST
  ((:TAG . "NCUP-s-") (:LEMMA . "$") (:ORIGINAL--FORM . "$") (:SENSE--ID--LIST ((:SENSE--ID . "7b6858c50a")))))
 (:SENSE--LIST
  ((:ID . "7b6858c50a")
   (:FORM . "dollar")
   (:OFFICIAL--FORM . "United States dollar")

The input token $ has been mapped to “United States dollar”!
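
The INIP and ENDP values in both token descriptions are zero-based character offsets into the input, with ENDP inclusive. That can be checked against the sample sentence with a short Python sketch (token_text is a hypothetical helper name, and the offsets are the ones shown above):

```python
def token_text(text, token):
    """Slice the original text using a token's inclusive inip/endp offsets."""
    start, end = int(token["inip"]), int(token["endp"])
    return text[start:end + 1]  # endp is inclusive, so add one for Python slicing


sentence = "That book was published in 1998 and costs $150 now."
print(token_text(sentence, {"inip": "27", "endp": "30"}))  # 1998
print(token_text(sentence, {"inip": "42", "endp": "42"}))  # $
```

This makes it easy to map any token, phrase or sentence in the response back to the exact span of the original text.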

I must admit that I am quite impressed with the API. I would encourage you to consider using it in your next text analysis project if it requires examining syntactic phenomena. I hope to check out their other API offerings in the near future, and will share my experience with you at that time.

The Lisp source code is available here. It has been tested in LispWorks.

Have a nice weekend!
