There is widespread interest in Natural Language Processing (NLP) today, and several API services are available to cater to this demand. See this article for a fairly detailed list of services. All of them support multiple languages, including English.
Today, I am going to share my experience of working with MeaningCloud’s Text Analytics API, specifically the Parsing API. MeaningCloud is among the pioneers in the field of NLP, with offices in Madrid and New York. They offer a wide variety of products and services in the area of text processing, with a very reasonably priced subscription model. The free tier includes up to 20,000 requests per month.
The Text Analytics API consists of the following categories:
- Topics Extraction
- Text Classification
- Deep Categorization
- Sentiment Analysis
- Language Identification
- Text Clustering
- Summarization
- Document Structure Analysis
- Corporate Reputation
- Lemmatization, POS and Parsing
A good overview of the Text Analytics service is available here. There is also a nice introductory YouTube video.
For my experiment, I decided to focus only on the last category: Lemmatization, POS and Parsing. What I found interesting about this API is that it returns what MeaningCloud calls a morphosyntactic tree (similar to a constituency tree). Given that dependency parsing is widely used these days (by Google and TextRazor, for example), I was a bit surprised by the company's traditional approach, although it is still useful.
Once you register for the free tier at their site, you can use the supplied key to make REST calls to their API, or if you merely want to understand the API capabilities, their test console comes in handy. They also have client SDKs for PHP, Java, Python, and Visual Basic.
I decided to use their REST service from my Lisp program, and it was pretty straightforward.
Although this API endpoint is essentially meant for POS and Lemmatization, you can optionally include Sentiment and Topic extraction options. There is no additional charge for this request. Nice, isn’t it?
The different Request parameters are explained here.
In my API call, I have included Sentiment and all categories of Topics in addition to POS and Lemmatization.
Here is the parse function:
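A minimal version, assuming the Drakma HTTP client and CL-JSON (the exact libraries and parameter set may differ from what I actually used; `tt` selects topic types per MeaningCloud's request documentation, and the key is a placeholder), looks roughly like this:

```lisp
(ql:quickload '(:drakma :cl-json))

(defvar *api-key* "your-meaningcloud-key")

(defvar *parser-result* nil
  "Holds the decoded response of the most recent API call.")

;; Drakma treats application/json responses as binary unless told otherwise.
(pushnew '("application" . "json") drakma:*text-content-types*
         :test #'equal)

(defun parse-text (text)
  "Send TEXT to MeaningCloud's parser endpoint and store the decoded
result in *PARSER-RESULT*.  Returns T when the API reports success."
  (let ((body (drakma:http-request
               "https://api.meaningcloud.com/parser-2.0"
               :method :post
               :parameters `(("key"  . ,*api-key*)
                             ("lang" . "en")
                             ("tt"   . "a")  ; all topic types (assumed flag)
                             ("txt"  . ,text)))))
    (setf *parser-result* (json:decode-json-from-string body))
    ;; Status code "0" means OK.
    (string= "0" (cdr (assoc :code
                             (cdr (assoc :status *parser-result*)))))))
```

Note that CL-JSON maps snake_case JSON keys such as `remaining_credits` to keywords with double hyphens, e.g. `:REMAINING--CREDITS`, which is why those keywords appear throughout the output below.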
Let us invoke the parser with the sentence: Lilly and her friends admired their professor.
CL-USER 14 > (parse-text "Lilly and her friends admired their professor.")
T
If the API call succeeds, this function returns T. For convenience, the returned result is saved in a global variable for subsequent use. Here is the partial output:
CL-USER 15 > (pprint *parser-result*)
((:STATUS (:CODE . "0") (:MSG . "OK") (:CREDITS . "1") (:REMAINING--CREDITS . "19750"))
 (:TOKEN--LIST
  ((:TYPE . "sentence")
   (:ID . "10")
   (:INIP . "0")
   (:ENDP . "45")
   (:STYLE (:IS-BOLD . "no") (:IS-ITALICS . "no") (:IS-UNDERLINED . "no") (:IS-TITLE . "no"))
   (:SEPARATION . "A")
   (:QUOTE--LEVEL . "0")
   (:AFFECTED--BY--NEGATION . "no")
   (:SENTIMENT (:SELF--SENTIMENT (:TEXT . "*") (:INIP . "0") (:ENDP . "45") (:CONFIDENCE . "100") (:SCORE--TAG . "P+")))
A quick look at the result shows that the call has consumed 1 credit (out of the 20,000 free credits per month). Notice the last part of the result fragment, where the sentiment polarity of the whole sentence is given as P+, meaning Strong Positive. There is also a separate Global Sentiment section:
(:GLOBAL--SENTIMENT
 (:MODEL . "general_en")
 (:SCORE--TAG . "P+")
 (:AGREEMENT . "AGREEMENT")
 (:SUBJECTIVITY . "SUBJECTIVE")
 (:CONFIDENCE . "100")
 (:IRONY . "NONIRONIC"))
A detailed description of the various fields in the returned response object is available here.
Instead of using the result object directly, I prefer to use a subset that contains just the key fields of interest to me.
Here is the corresponding function:
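What follows is a reconstruction rather than the exact original: it walks the nested token_list recursively and, for each word-level token, keeps only the fields I care about (again assuming CL-JSON's key naming):

```lisp
(defun get-words-info (result)
  "Collect the FORM, ID, ANALYSIS--LIST and SENSE--LIST entries for
every leaf token reachable from RESULT's token_list, depth first."
  (let ((words '()))
    (labels ((walk (token)
               (let ((children (cdr (assoc :token--list token))))
                 (if children
                     ;; Sentence/phrase node: descend into its children.
                     (mapc #'walk children)
                     ;; Leaf (word) node: keep the fields of interest.
                     (push (remove-if-not
                            (lambda (key)
                              (member key '(:form :id
                                            :analysis--list :sense--list)))
                            token :key #'car)
                           words)))))
      (mapc #'walk (cdr (assoc :token--list result))))
    (nreverse words)))
```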
Here is the partial output from that function call.
CL-USER 16 > (pprint (get-words-info *parser-result*))
(((:FORM . "Lilly")
  (:ID . "1")
  (:ANALYSIS--LIST
   ((:TAG . "NPFS-N-") (:LEMMA . "Lilly") (:ORIGINAL--FORM . "Lilly") (:SENSE--ID--LIST ((:SENSE--ID . "5e3f6f66ec")))))
  (:SENSE--LIST
   ((:ID . "5e3f6f66ec")
    (:FORM . "Lilly")
    (:INFO
     . "sementity/class=instance@fiction=nonfiction@id=ODENTITY_FIRST_NAME@type=Top>Person>FirstNamesemld_list=sumo:FirstName"))))
 ((:FORM . "and") (:ID . "2") (:ANALYSIS--LIST ((:TAG . "CCYN9") (:LEMMA . "and") (:ORIGINAL--FORM . "and"))))
 ((:FORM . "her")
  (:ID . "3")
  (:ANALYSIS--LIST
   ((:TAG . "SD-PFS3N7") (:LEMMA . "her") (:ORIGINAL--FORM . "her") (:SENSE--ID--LIST ((:SENSE--ID . "PRONHUMAN")))))
  (:SENSE--LIST ((:ID . "PRONHUMAN") (:FORM . "her") (:INFO . "semhum=human"))))
 ((:FORM . "friends")
  (:ID . "4")
  (:ANALYSIS--LIST
   ((:TAG . "NC-P-N6") (:LEMMA . "friend") (:ORIGINAL--FORM . "friends") (:SENSE--ID--LIST ((:SENSE--ID . "2f1f98e4bb")))))
  (:SENSE--LIST
   ((:ID . "2f1f98e4bb")
    (:FORM . "friend")
    (:INFO
     . "sementity/class=class@fiction=nonfiction@id=ODENTITY_PERSON@type=Top>Personsemld_list=http://en.wikipedia.org/wiki/Friend|sumo:Human"))))
As you can observe, for each word there is a POS tag and a lemma. Where relevant, there is an entry for the Sense ID as well. The POS tag is quite fine-grained: it does not stop at the top-level category such as Verb, Noun, or Adverb, but includes as much information as can be given for that category. Where possible, it even encodes the relative frequency of occurrence of the word!
| Word | POS Tag | Description |
|------|---------|-------------|
| Lilly | NPFS-N- | Noun, Proper Noun, Feminine, Singular, Normal Word |
| and | CCYN9 | Conjunction, Coordinated, Copulative, Normal Word, Maximum Frequency |
| her | SD-PFS3N7 | Possessive, Determiner, Plural, Feminine, Singular, 3rd Person, Normal Word, High Frequency |
| friends | NC-P-N6 | Noun, Common Noun, Plural, Normal Word, Medium-high Frequency |
| admired | VI-P3ASA-N-N4 | Verb, Indicative, Plural, 3rd Person, Past, Simple, Active, Non-auxiliary, Normal Word, Medium-low Frequency |
| their | SD-SMP3N7 | Possessive, Determiner, Singular, Masculine, Plural, 3rd Person, Normal Word, High Frequency |
| professor | NC-S-N4 | Noun, Common Noun, Singular, Normal Word, Medium-low Frequency |
POS tag details can be found here.
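Since the first character of each tag encodes the top-level category, a small decoder is easy to write. The mapping below is derived only from the tags seen in this example sentence; the full tagset is in MeaningCloud's documentation:

```lisp
(defun tag-category (tag)
  "Return the top-level category encoded by TAG's first character.
Only the categories appearing in the example sentence are covered."
  (case (char tag 0)
    (#\N "Noun")
    (#\V "Verb")
    (#\C "Conjunction")
    (#\S "Possessive")
    (t   "Unknown")))
```

For instance, `(tag-category "NPFS-N-")` returns `"Noun"` and `(tag-category "VI-P3ASA-N-N4")` returns `"Verb"`.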
Instead of getting the result in JSON format as described above, it is also possible to get it as a tree rendered as a GIF image. This is useful for visualizing the structure of the input sentence. Below is the corresponding tree for the sample sentence: Lilly and her friends admired their professor.
I got this GIF image through their Test Console.
Let us try parsing one more sentence: That book was published in 1998 and costs $150 now.
The parse tree for this sentence is given below:
Let us parse it using our function:
CL-USER 18 > (parse-text "That book was published in 1998 and costs $150 now.")
T
Take a look at the description of the token 1998:
((:FORM . "1998")
 (:NORMALIZED--FORM . "date@20||||1998||||||")
 (:ID . "6")
 (:INIP . "27")
 (:ENDP . "30")
 (:STYLE (:IS-BOLD . "no") (:IS-ITALICS . "no") (:IS-UNDERLINED . "no") (:IS-TITLE . "no"))
 (:SEPARATION . "1")
 (:QUOTE--LEVEL . "0")
 (:AFFECTED--BY--NEGATION . "no")
 (:ANALYSIS--LIST ((:TAG . "NDUU-n-") (:LEMMA . "1998") (:ORIGINAL--FORM . "1998")))
 (:TOPIC--LIST
  (:TIME--EXPRESSION--LIST
   ((:FORM . "1998")
    (:NORMALIZED--FORM . "20||||1998||||||")
    (:ACTUAL--TIME . "1998-12-09")
    (:PRECISION . "year")
    (:INIP . "27")
    (:ENDP . "30")))))
What is interesting is that the token 1998 has been tagged as a date rather than just a number. I guess the Actual Time element, given as 1998-12-09, reflects the date on which I ran the API call for this article. So the parser engine does much more than assign a superficial label to each token. This can come in quite handy if we are keen to perform a deep semantic analysis of the input text.
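To exploit this, we can walk the token tree and collect every normalized time expression. A sketch, assuming the same nested token_list layout and CL-JSON key naming shown above:

```lisp
(defun time-expressions (result)
  "Collect (FORM . ACTUAL-TIME) pairs for every time expression
found anywhere in RESULT's token tree."
  (let ((times '()))
    (labels ((walk (token)
               ;; Record any time expressions attached to this token.
               (let* ((topics (cdr (assoc :topic--list token)))
                      (exprs  (cdr (assoc :time--expression--list topics))))
                 (dolist (e exprs)
                   (push (cons (cdr (assoc :form e))
                               (cdr (assoc :actual--time e)))
                         times)))
               ;; Then descend into any child tokens.
               (mapc #'walk (cdr (assoc :token--list token)))))
      (mapc #'walk (cdr (assoc :token--list result))))
    (nreverse times)))
```

For the sentence above, this would yield a single pair associating "1998" with its resolved date.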
Likewise, look at the (partial) description of the token $:
((:FORM . "$")
 (:ID . "9")
 (:INIP . "42")
 (:ENDP . "42")
 (:STYLE (:IS-BOLD . "no") (:IS-ITALICS . "no") (:IS-UNDERLINED . "no") (:IS-TITLE . "no"))
 (:SEPARATION . "1")
 (:QUOTE--LEVEL . "0")
 (:AFFECTED--BY--NEGATION . "no")
 (:ANALYSIS--LIST
  ((:TAG . "NCUP-s-") (:LEMMA . "$") (:ORIGINAL--FORM . "$") (:SENSE--ID--LIST ((:SENSE--ID . "7b6858c50a")))))
 (:SENSE--LIST
  ((:ID . "7b6858c50a")
   (:FORM . "dollar")
   (:OFFICIAL--FORM . "United States dollar")
The input token $ has been mapped to “United States dollar”!
I must admit that I am quite impressed with the API. I would encourage you to consider using it in your next text analysis project if it requires examining syntactic phenomena. I hope to check out their other API offerings in the near future, and will share my experience with you at that time.
The Lisp source code is available here. It has been tested in LispWorks.
Have a nice weekend!