This is the third part in the series on information extraction from unstructured text. In the first part, we saw how MeaningCloud allows us to specify complex rules to identify custom categories through their Deep Categorization API. The second part covered spaCy’s pattern matcher.
Today, I would like to discuss how we can use TextRazor’s “Prolog Engine” to perform customized text analysis.
TextRazor provides a powerful API for dealing with natural language text. It supports Classification, Topic Tagging, Entity Recognition, Dependency Parsing, and more. What makes TextRazor somewhat unique is the ability to seamlessly attach custom rules written in Prolog in order to discover special patterns in the text. These rules can make use of the rich set of built-in predicates that expose the core logic of the NLP engine.
The Prolog engine in TextRazor is based on the popular SWI-Prolog implementation (TextRazor’s documentation incorrectly refers to it as YAP Prolog).
Since Prolog is a full-fledged programming language (albeit quite different from the “conventional” languages), the ability to write custom logic in Prolog implies that we can encode arbitrarily complex logic in order to discover interesting patterns in text.
I have written a REST client in Lisp that shows how to use Prolog rules to extract key pieces of information about a patient coming to the doctor for consultation. For practical reasons, the program and the case description are deliberately kept simple in order to focus on the idea.
Let us consider the following case text:
Mr.John Haggard, aged 30 years, has cold and cough. He sneezes when exposed to
cold weather. There is headache on waking up. He has runny nose as well.
Given this text, we want to extract the Name, Gender and Age of the patient. In addition (as in the previous two articles), we want to know whether the patient is suffering from “common-cold” or “diarrhea”. Here is the set of Prolog rules that will do the job:
common_cold :- sequence('runny', 'nose').
common_cold :- lemma(S), member(S, ['headache', 'cold', 'cough', 'sneeze']).
diarrhea :- lemma('loose'), or( lemma('motion'), lemma('stools')).
diarrhea :- lemma('diarrhoea'); lemma('diarrhea').
name(N) :- sequence(lemma(X), token(TID1, N1), token(TID2, N2)), title(X),
    part_of_speech(TID1, 'NNP'), part_of_speech(TID2, 'NNP'), N = [N1, N2],!.
name(N) :- sequence(lemma(X), token(TID, N)), title(X), part_of_speech(TID, 'NNP').
title(X) :- member(X, ['mr.', 'mrs.', 'miss.', 'ms.']).
gender(G) :- lemma(X), member(X, ['he', 'his', 'him']), G = 'male'.
gender(G) :- lemma(X), member(X, ['she', 'her']), G = 'female'.
is_number(Str) :- atom_string(Atom, Str), atom_number(Atom, _).
age(X) :- sequence(token(X), 'years', 'old'), is_number(X).
age(X) :- sequence('aged', token(X), 'years'), is_number(X).
age(X) :- sequence(token(X), 'years', 'of', 'age'), is_number(X).
The predicate common_cold holds if the text contains the words “runny” and “nose” in sequence (with no intervening word), or if the text contains the root form (“lemma”) of any of the words “headache”, “cold”, “cough”, and “sneeze”.
Likewise, we infer that the patient has diarrhea if the word “diarrhea” (or “diarrhoea”) occurs in the text, or if “loose” occurs together with “motion” or “stools”. Note that this rule checks only that the words are present, not that they are adjacent.
To determine the Name of the patient, we first look for the Title (i.e., how we address the person), one of “Mr.”, “Mrs.”, “Ms.”, or “Miss.”, followed by one or two Proper Nouns. Although the title itself can give a clue to the gender of the person, I am defining a separate predicate for gender. The gender is male if any of the words “he”, “him”, or “his” occurs in the text. Similarly, the gender is female if either “she” or “her” is present in the text.
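To get a feel for what the name and gender rules should return on a given sentence, they can be roughly approximated offline. The following Python sketch is only an approximation and not how TextRazor evaluates the rules: capitalization stands in for the 'NNP' part-of-speech check, plain words stand in for lemmas, and the function names extract_name and extract_gender are made up for illustration.

```python
import re

# Title followed by one or two capitalized words. Capitalization is a crude
# stand-in for the 'NNP' part-of-speech test, which needs a real tagger.
NAME_PATTERN = re.compile(
    r"\b(?:Mr|Mrs|Miss|Ms)\.\s*([A-Z][a-z]+)(?:\s+([A-Z][a-z]+))?"
)
MALE_WORDS = {"he", "his", "him"}
FEMALE_WORDS = {"she", "her"}


def extract_name(text):
    """Return the name parts following a title as a list, or None."""
    match = NAME_PATTERN.search(text)
    if match is None:
        return None
    # Drop the second group when only one proper noun follows the title.
    return [part for part in match.groups() if part]


def extract_gender(text):
    """Return 'male' or 'female' based on pronouns found in the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & MALE_WORDS:
        return "male"
    if words & FEMALE_WORDS:
        return "female"
    return None
```

One difference worth keeping in mind: the Prolog gender predicate can yield both solutions if a text mentions both “he” and “she”, whereas this sketch simply reports male first.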
The age of the patient is also determined from the context. It is a number conforming to any of the following patterns:
- <Age> “years” “old” ( => “40 years old”)
- “aged” <Age> “years” ( => “aged 40 years”)
- <Age> “years” “of” “age” ( => “40 years of age”)
Obviously, I have simplified the rules to make it easier to understand the logic and the approach.
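The three age patterns listed above can also be checked offline with regular expressions before the rules are sent to the API. This Python sketch assumes whole-number ages only (the Prolog is_number check accepts any number), and the helper name extract_age is hypothetical.

```python
import re

# Each regex mirrors one age/1 clause; the capture group plays the role of X.
AGE_PATTERNS = [
    re.compile(r"\b(\d+)\s+years\s+old\b", re.IGNORECASE),
    re.compile(r"\baged\s+(\d+)\s+years\b", re.IGNORECASE),
    re.compile(r"\b(\d+)\s+years\s+of\s+age\b", re.IGNORECASE),
]


def extract_age(text):
    """Return the first age matched by any of the patterns, or None."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return None
```

As in the Prolog version, the patterns are tried in order and the first one that matches wins.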
The TextRazor API requires that, when we pass the Prolog rules to the NLP engine, we also name the specific “extractors” (in our case, the predicates) that it should try to satisfy. For this example, the extractors list contains “common_cold”, “diarrhea”, “name”, “gender”, and “age” (this will become clearer when you see the Lisp code). Notice that I am not passing “title”, although it is a predicate in my rule set. The reason is that I am not looking for “title” as an expected output; it is just an auxiliary predicate used by the main predicate “name”.
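To make the pairing of rules and extractors concrete, here is a Python sketch (not the Lisp client) that builds such a request without sending it. The endpoint URL, the X-TextRazor-Key header, and the form field names text, rules, and extractors reflect my understanding of the REST API and should be verified against the TextRazor documentation; build_request and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint; verify against the TextRazor REST documentation.
TEXTRAZOR_URL = "https://api.textrazor.com/"

# A small subset of the rules, enough to illustrate the request shape.
RULES = """
gender(G) :- lemma(X), member(X, ['he', 'his', 'him']), G = 'male'.
gender(G) :- lemma(X), member(X, ['she', 'her']), G = 'female'.
"""


def build_request(text, rules, extractors, api_key):
    """Build (but do not send) a form-encoded POST carrying rules and extractors."""
    body = urlencode({
        "text": text,
        "rules": rules,
        # Only the predicates we want reported back, joined with commas.
        "extractors": ",".join(extractors),
    }).encode("utf-8")
    headers = {"X-TextRazor-Key": api_key}
    return Request(TEXTRAZOR_URL, data=body, headers=headers, method="POST")


request = build_request("He has runny nose.", RULES, ["gender"], "YOUR_API_KEY")
```

Passing the resulting object to urllib.request.urlopen would actually send it; that step is left out here so the sketch runs without a key or network access.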
If you have followed the Prolog code carefully, you might notice a limitation in my approach. For each ailment I am looking for, I will have to define a predicate such as “fever”, “asthma”, “dementia”, etc., and additionally, include that predicate name in the list of extractors. Not impossible, but not quite elegant!
What is the alternative? Instead of defining a predicate matching the name of the ailment (or in addition to that), we can follow this strategy:
ailment(NameOfAilment) :- <Bind argument to the actual ailment>
For example,
ailment(Name) :- common_cold, Name = 'common-cold'.
ailment(Name) :- diarrhea, Name = 'diarrhea'.
And so on. In this case, if the common_cold predicate succeeds, the Name argument of the ailment predicate is bound to “common-cold”, and likewise for diarrhea.
In fact, we can do better:
ailment(common_cold) :- common_cold.
ailment(diarrhea) :- diarrhea.
Now we don’t have to include the individual ailment predicates in our extractors list; we only have to include “ailment”. Here are the updated rules:
common_cold :- sequence('runny', 'nose').
common_cold :- lemma(S), member(S, ['headache', 'cold', 'cough', 'sneeze']).
diarrhea :- lemma('loose'), or( lemma('motion'), lemma('stools')).
diarrhea :- lemma('diarrhoea'); lemma('diarrhea').
ailment(common_cold) :- common_cold.
ailment(diarrhea) :- diarrhea.
name(N) :- sequence(lemma(X), token(TID1, N1), token(TID2, N2)), title(X),
    part_of_speech(TID1, 'NNP'), part_of_speech(TID2, 'NNP'), N = [N1, N2],!.
name(N) :- sequence(lemma(X), token(TID, N)), title(X), part_of_speech(TID, 'NNP').
title(X) :- member(X, ['mr.', 'mrs.', 'miss.', 'ms.']).
gender(G) :- lemma(X), member(X, ['he', 'his', 'him']), G = 'male'.
gender(G) :- lemma(X), member(X, ['she', 'her']), G = 'female'.
is_number(Str) :- atom_string(Atom, Str), atom_number(Atom, _).
age(X) :- sequence(token(X), 'years', 'old'), is_number(X).
age(X) :- sequence('aged', token(X), 'years'), is_number(X).
age(X) :- sequence(token(X), 'years', 'of', 'age'), is_number(X).
Note the two ailment clauses, ailment(common_cold) and ailment(diarrhea). Choosing between the two approaches is a matter of design, and I prefer the latter.
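The design idea behind the ailment predicate, a single entry point that reports every condition that holds instead of one extractor per condition, can be mirrored outside Prolog as well. The Python sketch below is an analogy only: it assumes the text has already been reduced to a list of lowercase lemmas (which TextRazor does for us), and the names has_sequence, AILMENTS, and ailments are invented for illustration.

```python
def has_sequence(lemmas, *words):
    """True if the given words occur consecutively in the lemma list."""
    n = len(words)
    return any(tuple(lemmas[i:i + n]) == words for i in range(len(lemmas) - n + 1))


def common_cold(lemmas):
    # Mirrors the two common_cold clauses.
    symptoms = ("headache", "cold", "cough", "sneeze")
    return has_sequence(lemmas, "runny", "nose") or any(s in lemmas for s in symptoms)


def diarrhea(lemmas):
    # Mirrors the two diarrhea clauses; presence only, adjacency is not required.
    loose = "loose" in lemmas and ("motion" in lemmas or "stools" in lemmas)
    return loose or "diarrhoea" in lemmas or "diarrhea" in lemmas


# One table entry per ailment, like one ailment/1 clause per ailment.
AILMENTS = {"common_cold": common_cold, "diarrhea": diarrhea}


def ailments(lemmas):
    """Return the names of all ailments whose rule holds, in table order."""
    return [name for name, holds in AILMENTS.items() if holds(lemmas)]
```

Adding a new ailment then means adding one rule and one table entry, while the caller keeps asking the same single question.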
The program is tested on three inputs:
(setf *text1*
"Mr.John Haggard, aged 30 years, has cold and cough. He sneezes when exposed to
cold weather. There is headache on waking up. He has runny nose as well.
")

(setf *text2*
"Ms.Mary, who is 40 years old, came to the clinic with complaint of loose motion.
She said the problem started two days ago after she ate some fruits.
")

(setf *text3*
"Mrs.Anne Lovelord, aged 60 years, complains of cold with incessant sneezing.
She passes loose motion twice a day.
")
Here is the actual output from the Lisp program:
I hope you can see the power of defining custom logic in such an expressive language as Prolog. It allows you to do almost anything you want. Of course, if you are not careful, you can end up adversely affecting the performance of the system. (I hope that TextRazor has built in adequate safeguards against malware masquerading as custom rules!)
At this juncture, I want to point out a practical issue. Although I am comfortable programming in Prolog, I found developing and testing the rules somewhat difficult and tedious. The only way to test the logic is to experiment by writing a client program, as I did. I wish there were some kind of “Developer Console” or “Test Console” where I could quickly enter my Prolog rules, give a sample piece of text, and check the result. In the case of MeaningCloud, the “Test Console” was quite helpful and we didn’t have to write any client code to test the logic. Although TextRazor’s site has a “Demo” mode, it does not support defining Prolog rules and custom extractors. I hope TextRazor addresses this limitation in the near future. Despite this minor limitation, the overall approach of supporting a Prolog logic engine is highly commendable!
You can download my Lisp code here. The program has been tested in LispWorks.
Have a great weekend!