MeaningCloud recently announced its premium “Deep Categorization” service. You can read about it in this nice blog post by Blanca Galego.
Compared to the canonical classification models that rely on machine learning using large data sets, MeaningCloud’s deep categorization models use hand-crafted rules that take advantage of “morphosyntactic, semantic and contextual information” to classify a piece of text.
The company has several pre-defined deep categorization models, such as Intention Analysis, Voice-of-Customer Banking, Voice-of-Employee Exit Interview, and so on. Take a look at this to learn more.
This feature, as well as how to build custom models, is explained in detail in the documentation section.
After reading the extensive documentation, I was keen to get my hands dirty by trying it on an example. I am working on a homeopathy project that involves text analysis, so I decided to apply it in that context.
Here is the problem: Given a homeopathy case record (corresponding to a patient), can we automatically identify the nature of the ailment? Prescribing the correct remedy for an ailment is a very complex process in homeopathy, so if some kind of automation can facilitate this process, I would consider it to be a significant achievement.
For this toy experiment, of course, I have taken a simple example with just two categories:
acute-disease>common-cold
acute-disease>diarrhoea
Given a piece of text, we want the system to label it with one or both of the above categories.
First, consider the labels I have chosen for the categories. In homeopathy, common cold and diarrhoea are examples of acute diseases. Diabetes is an example of a chronic disease. When developing deep categorization models, it is useful to take advantage of hierarchies like this, wherever appropriate.
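To make the hierarchy idea concrete, here is a minimal Python sketch that groups such labels by their parent category. It assumes that ">" separates the levels, as the two labels above suggest; the function names are hypothetical and this is not part of MeaningCloud's tooling.

```python
def split_label(label: str) -> list[str]:
    """Split a hierarchical label such as 'acute-disease>common-cold' into its levels."""
    return [part.strip() for part in label.split(">")]


def group_by_parent(labels: list[str]) -> dict[str, list[str]]:
    """Group labels under their top-level category (assumes '>' is the separator)."""
    tree: dict[str, list[str]] = {}
    for label in labels:
        parent, *rest = split_label(label)
        tree.setdefault(parent, []).append(">".join(rest))
    return tree


labels = ["acute-disease>common-cold", "acute-disease>diarrhoea"]
tree = group_by_parent(labels)
print(tree)
```

Adding a chronic ailment later would simply create a second top-level key next to "acute-disease".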
Now, take the case of common cold. A case record that documents this acute ailment of a patient will most likely have expressions like “cold”, “runny nose”, “sneezing”, “headache”, “fever”, etc. We can capture this knowledge in terms of “rules” involving pattern expressions for categorizing a given text into the “acute-disease>common-cold” category. Here are my two rules:
The first rule “symptom1” uses a “macro” called “cold-symptoms”. Macros are a nice way to capture patterns that are likely to be used across multiple rules. Instead of repeating the same pattern expression again and again, defining it as a macro makes it concise, more readable and maintainable.
Here is the macro definition:
The two rules “symptom1” and “symptom2” together define the category “acute-disease>common-cold”.
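The benefit of a macro can be illustrated outside MeaningCloud with a small Python sketch. This is only an analogy using regular expressions, not MeaningCloud's rule language, and the term list is borrowed from the symptoms mentioned earlier rather than from my actual macro definition. The names MACROS, expand and find_terms are hypothetical.

```python
import re

# Hypothetical macro table: one named, reusable group of terms.
# The terms are the example symptoms from the text, not the real macro body.
MACROS = {
    "cold-symptoms": ["cold", "runny nose", "sneezing", "headache", "fever"],
}


def expand(macro_name: str) -> re.Pattern:
    """Expand a macro into a single case-insensitive alternation pattern."""
    # Longer terms first, so multi-word terms are preferred over shorter ones.
    terms = sorted(MACROS[macro_name], key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def find_terms(macro_name: str, text: str) -> list[str]:
    """Return every macro term found in the text, in order of appearance."""
    return [m.group(0).lower() for m in expand(macro_name).finditer(text)]


hits = find_terms("cold-symptoms", "Patient has a runny nose and mild fever since Monday.")
print(hits)
```

Any number of rules can call expand("cold-symptoms"), so a new symptom has to be added in one place only.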
Let us now look at the other category “acute-disease>diarrhoea”. See this figure:
This pattern expression is slightly more complex than the ones used for common cold. To whet your curiosity, the last part of the pattern [putrid|watery|undigested stools]~4 matches the following cases:
“putrid stools”
“watery stools”
“stools containing undigested food”
That is, the two sets of words may appear in any order within a span of 4 words. For a detailed description of the pattern syntax, see the documentation.
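The proximity behaviour described above can be approximated with a short Python sketch. It assumes that "within a span of 4 words" means the two matched words, counted inclusively, cover at most 4 tokens; MeaningCloud's exact semantics may differ, and within_span is a hypothetical helper, not part of any MeaningCloud API.

```python
import re


def within_span(text: str, first: set[str], second: set[str], span: int) -> bool:
    """True if a word from `first` and a word from `second` occur, in any order,
    inside a window of at most `span` tokens (both words included)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos_first = [i for i, tok in enumerate(tokens) if tok in first]
    pos_second = [i for i, tok in enumerate(tokens) if tok in second]
    return any(abs(i - j) + 1 <= span for i in pos_first for j in pos_second)


qualifiers = {"putrid", "watery", "undigested"}
target = {"stools"}

for phrase in ("putrid stools", "watery stools", "stools containing undigested food"):
    print(phrase, "->", within_span(phrase, qualifiers, target, 4))
```

Under this assumption, a sentence such as "stools were normal but the smell was rather putrid" would not match, because the two words are too far apart.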
The following image shows our two categories:
Once the model has been defined, we need to “Build” the model by clicking the “Build” button in the “Actions” group on the left side (see above). If there are no errors, the model is compiled and we get a confirmation message:
Once the model is ready, we can test it with our sample data. The nice thing is that we can do this through the Test Console; there is no need to write any program to make API calls for this purpose. Clicking on the “Test” button under “Actions” brings up the Test Console with our model already selected!
Clicking the “Raw” button at the bottom left shows the result:
You can see that the system has identified the key terms in the text and also computed an overall relevance score of 40 (by the way, we can configure weights for the various rules).
Let us now look at an example of diarrhoea.
And here is the analysis result:
One thing worth pointing out is that the rules operate at the “sentence” level (the model uses the “Split Sentences” option). What this means is that the patterns are checked sentence by sentence. I feel this is an important option to consider while designing a model.
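To see why sentence-level checking matters, consider this rough Python sketch. It uses a naive punctuation-based splitter and a simple word-proximity check with an assumed window of 4 tokens; both are stand-ins chosen for illustration and do not reproduce how MeaningCloud splits sentences or evaluates patterns.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def near(text: str, first: set[str], second: set[str], span: int = 4) -> bool:
    """True if a word from each set occurs within `span` tokens, in any order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    a = [i for i, tok in enumerate(tokens) if tok in first]
    b = [i for i, tok in enumerate(tokens) if tok in second]
    return any(abs(i - j) + 1 <= span for i in a for j in b)


def match_per_sentence(text: str, first: set[str], second: set[str]) -> list[str]:
    """Return only the sentences in which the proximity check succeeds."""
    return [s for s in split_sentences(text) if near(s, first, second)]


text = "Nose is watery. Stools are normal."
whole = near(text, {"watery"}, {"stools"})          # checked across the whole text
per_sentence = match_per_sentence(text, {"watery"}, {"stools"})
print(whole, per_sentence)
```

Checked across the whole text, "watery" and "Stools" are adjacent and produce a false hit; checked sentence by sentence, they never meet, which is the behaviour we want for a case record.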
Another useful capability is the support for exporting and importing our models. This is needed if we are keeping track of different versions of our model.
Needless to say, I am quite impressed with this new “Deep Categorization” feature since it opens up new techniques for identifying “hidden” categories in the given text.
One gentle warning may be in order here. Writing “good” (expressive and efficient) pattern expressions for complex scenarios is really hard, and is no different from writing a “good program” in any language. If we aren’t careful enough, the rules can become “fragile” and fail to match valid scenarios. Of course, in these situations, it would be nice to have a tool that can take some sample sentences and generate the rules and pattern expression(s) for us! (Hope the tech team at MeaningCloud is listening!)
I trust you found this article interesting. I intend to spend more time exploring the deep categorization pattern language so that I can apply it to tougher problems.
You might also be interested in looking at MeaningCloud’s Text Parsing engine, described in my earlier blog.
Have a nice day and great week ahead!