In my previous post, I showed how to parse sentences using OpenNLP. Another useful feature supported by OpenNLP is “chunking”. That is the subject of today’s article.
Chunking sits between part-of-speech (POS) tagging and full parsing in terms of the information it captures. POS tagging assigns a part of speech to each token in a sentence. So, in the sentence “Peter likes sweets”, the POS tags are:
Peter => NNP
likes => VBZ
sweets => NNS
The tags come from the Penn Treebank tag set.
The constituency parser operates at the other extreme. It tries to assign a structure to the complete sentence, by assigning a structure (recursively) to constituent parts. We saw this in the last article.
A full parse is significantly more expensive than POS tagging alone, because the parser must build a complete tree rather than label tokens one by one. Sometimes we are interested only in the smaller structures contained in the larger parse tree, for example Verb Phrases, Adjective Phrases, Noun Phrases, and so on. The classic example is NER (Named Entity Recognition), where we are interested in specific Noun Phrases. Identifying such phrases, which usually (but not always) span more than one token, is called “chunking”.
OK. Let us see how to use the chunker in OpenNLP. I have written a simple class called “OpenNLPChunkerExample” to illustrate the essential features (you can download the source from here).
The code fragment below gets the chunked tags and prints them along with the corresponding word.
The output from the program is:
The tags produced by the chunker follow the “IOB” tagging scheme, in which a prefix is combined with the chunk type (for example, B-NP or I-NP). Here,
B = Beginning of a chunk
I = Inside a chunk
O = Outside any chunk
From the above scheme, we can easily see that the words “The pretty cat” form a single NP chunk, the word “chased” forms a VP chunk all by itself, and the words “the ugly rat” constitute an NP chunk again. The final “.” is not part of any chunk.
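To make the IOB reading above concrete, here is a minimal, library-free sketch of how per-token IOB tags can be folded into chunks. It is not the code from my downloadable program: the class name `IobChunkGrouper`, its nested `Span` class (a stand-in for OpenNLP's own span type), and the tag array (reconstructed from the description above) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: Span is a local stand-in, not opennlp.tools.util.Span.
class IobChunkGrouper {

    /** Half-open token range [start, end) labelled with a chunk type such as "NP". */
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Groups per-token IOB tags (e.g. "B-NP", "I-NP", "O") into chunk spans. */
    static List<Span> group(String[] tags) {
        List<Span> spans = new ArrayList<>();
        int start = -1;      // index where the open chunk began, or -1 if none is open
        String type = null;  // type of the open chunk
        for (int i = 0; i < tags.length; i++) {
            String tag = tags[i];
            boolean continues = start >= 0 && tag.equals("I-" + type);
            if (continues) {
                continue;    // still inside the open chunk
            }
            if (start >= 0) {
                spans.add(new Span(start, i, type));  // close the open chunk
                start = -1;
                type = null;
            }
            // "B-X" opens a chunk; a stray "I-X" is leniently treated the same way.
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                start = i;
                type = tag.substring(2);
            }
        }
        if (start >= 0) {
            spans.add(new Span(start, tags.length, type));  // chunk running to the end
        }
        return spans;
    }

    /** Renders one chunk as "[TYPE tok tok ...]". */
    static String render(String[] tokens, Span span) {
        String words = String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
        return "[" + span.type + " " + words + "]";
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "pretty", "cat", "chased", "the", "ugly", "rat", "."};
        // Assumed tags, reconstructed from the chunks described in the text.
        String[] tags = {"B-NP", "I-NP", "I-NP", "B-VP", "B-NP", "I-NP", "I-NP", "O"};
        for (Span span : group(tags)) {
            System.out.println(render(tokens, span));
        }
    }
}
```

Running `main` prints `[NP The pretty cat]`, `[VP chased]` and `[NP the ugly rat]`, one per line; the final “.” is tagged O and so appears in no chunk.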
To facilitate readability, we can write a convenience function to group the related chunks. Here is the code:
The function returns a Span[]. The updated “main” that uses this function and prints the chunks is:
The corresponding output is:
We can even get the probability associated with each chunk tag. Here is the final version that prints this information:
Here is the corresponding output:
Before concluding, let us print the chunks for another sentence: “It is very beautiful.”
You can see that we now have an Adjective Phrase (ADJP): “very beautiful”.
Python’s NLTK, another popular NLP toolkit, also supports chunking. What I like about NLTK is that it allows us to define a “chunking grammar” to customize our chunking logic. This can prove useful in some cases. Take a look at NLTK when you get time.
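To give a feel for what a chunking grammar does, here is a small Java sketch of a single hand-written rule: an optional determiner, any number of adjectives, then one or more nouns, roughly what one might write in NLTK as `NP: {<DT>?<JJ>*<NN.*>+}`. This is neither NLTK nor OpenNLP code; the class `NounPhraseRule` and its `match` method are hypothetical names, and the rule is deliberately simplistic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP or NLTK.
class NounPhraseRule {

    /**
     * Scans Penn Treebank POS tags left to right and returns half-open
     * [start, end) token ranges matching: DT? JJ* NN.*+
     */
    static List<int[]> match(String[] posTags) {
        List<int[]> chunks = new ArrayList<>();
        int n = posTags.length;
        int i = 0;
        while (i < n) {
            int j = i;
            if (j < n && posTags[j].equals("DT")) {
                j++;                                   // optional determiner
            }
            while (j < n && posTags[j].equals("JJ")) {
                j++;                                   // zero or more adjectives
            }
            int nounStart = j;
            while (j < n && posTags[j].startsWith("NN")) {
                j++;                                   // one or more nouns (NN, NNS, NNP, NNPS)
            }
            if (j > nounStart) {
                chunks.add(new int[] {i, j});          // rule matched: record the chunk
                i = j;
            } else {
                i++;                                   // no noun found: try the next token
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Tokens and tags from the "Peter likes sweets" example earlier in the post.
        String[] tokens = {"Peter", "likes", "sweets"};
        String[] posTags = {"NNP", "VBZ", "NNS"};
        for (int[] range : match(posTags)) {
            System.out.println(String.join(" ", Arrays.copyOfRange(tokens, range[0], range[1])));
        }
    }
}
```

For “Peter likes sweets” this reports two one-word noun phrases, “Peter” and “sweets”. A statistical chunker such as the one in OpenNLP learns its decisions from annotated data instead of relying on a fixed pattern like this one.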
You can download my Java program from here.
Have a nice weekend and a great week ahead!