I briefly mentioned the iLexicon project earlier; I have been working on it for some time now. The goal is to build an intelligent dictionary that will come in handy when implementing NLP applications such as recognizers and generators. In today’s post, I want to demonstrate some cool features available in iLexicon. At present iLexicon supports only English.
Let us start with a simple query: List words having 2 syllables, 8 letters and containing the substring zz.
cg-user(1): (get-matching-words :word-pat "[a-z]+zz[a-z]+$" :num-syllables 2 :num-letters 8)
("blizzard" "grizzled" "frizzing" "frizzler" "grizzler" "quizzing" "whizzing")
get-matching-words is the main function to search for words in the lexicon. It has numerous options and supports rich queries, as you will discover soon. Here, it takes a word pattern specification (a regular expression), the number of syllables and the number of letters the word should be made of.
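To make the shape of such a query concrete, here is a minimal sketch in portable Common Lisp of how keyword filters might be combined over a toy word list. This is not how iLexicon is implemented: the names `*toy-lexicon*` and `toy-matching-words` are hypothetical, the syllable counts are entered by hand, and a plain substring test stands in for the regular expression.

```lisp
;; Hypothetical toy lexicon: each entry is (word syllable-count).
;; The syllable counts are entered by hand for illustration only.
(defparameter *toy-lexicon*
  '(("blizzard" 2) ("quizzing" 2) ("pizza" 2) ("puzzlement" 3) ("whizzing" 2)))

(defun toy-matching-words (&key substring num-syllables num-letters)
  "Return the words in *TOY-LEXICON* that satisfy every supplied filter.
A filter left as NIL is ignored."
  (loop for (word syllables) in *toy-lexicon*
        when (and (or (null substring) (search substring word))
                  (or (null num-syllables) (= syllables num-syllables))
                  (or (null num-letters) (= (length word) num-letters)))
          collect word))

(toy-matching-words :substring "zz" :num-syllables 2 :num-letters 8)
;; => ("blizzard" "quizzing" "whizzing")
```

Each filter is optional, so the same function can answer narrow and broad queries alike; a real lexicon would presumably use an index rather than a linear scan.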
Let us find out how many words are in the dictionary with at least 8 letters:
cg-user(2): (length (get-matching-words :num-letters '(8 nil)))
262647
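The call above passes `'(8 nil)` where the first query passed a plain number, which suggests a (minimum maximum) pair in which nil leaves a bound open. The helper below sketches that reading. The name `in-range-spec-p` is hypothetical, and the convention it encodes is an assumption drawn from the query, not a description of iLexicon's internals.

```lisp
(defun in-range-spec-p (n spec)
  "True if N satisfies SPEC: an exact integer, or a (LO HI) list
in which NIL means the bound is open. This convention is an assumption."
  (if (integerp spec)
      (= n spec)
      (destructuring-bind (lo hi) spec
        (and (or (null lo) (>= n lo))
             (or (null hi) (<= n hi))))))

(in-range-spec-p (length "blizzard") '(8 nil))  ; true: 8 letters, at least 8
(in-range-spec-p (length "gnome") '(8 nil))     ; false: 5 letters is too short
(in-range-spec-p (length "gnome") 5)            ; true: exactly 5 letters
```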
A slightly more interesting query this time: Find all words that start with g and end with either e or m, and that also rhyme with the word home.
cg-user(3): (get-matching-words :word-pat "^g[a-z]+[em]$" :rhyming-with "home")
("gloam" "gnome")
The lexicon knows about parts of speech, so we can ask a query like this: List all Verbs that have 15 letters in them.
cg-user(4): (get-matching-words :pos "[V]" :num-letters 15)
("circumstantiate" "consubstantiate" "conventionalise" "conventionalize" "cross-fertilize"
"cross-pollinate" "dedifferentiate" "haemagglutinate" "professionalise" "professionalize" …)
Because this is a large list, I have shown only a part of it.
Another nice feature of the lexicon is its information about the accent structure of words. This is quite useful when you are writing poetry and want words to conform to a certain foot and meter. In the query below, I am asking for all Adjectives that contain the substring mon and have the syllable structure unstressed-unstressed-stressed-unstressed-unstressed:
cg-user(5): (get-matching-words :accent-structure "uusuu" :pos "[J]" :word-pat "mon")
("acrimonious" "ammoniacal" "antimonious" "ceremonial" "ceremonious" "commonsensical" "demoniacal" "disharmonious" "inharmonious" "matrimonial" …)
We can ask for words that have a specific stem. The lexicon supports both Porter and Snowball stems.
cg-user(6): (get-matching-words :snowball-stem "terrac")
("terrace" "terraced")
Let us now focus on interesting word patterns similar to the ones listed in the title of this blog. Let us start with Palindromes. The lexicon marks palindrome words as such, so you can query directly:
cg-user(7): (get-matching-words :word-pat "t$" :palindrome 1 :num-letters 5)
("tebet" "tenet" "tevet" "tibit")
Here I am asking for 5-letter words ending in t that are also palindromes.
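A palindrome flag like this can be computed once, when the lexicon is built. Here is a minimal sketch of such a check; `palindrome-p` is a hypothetical name and not part of iLexicon's interface.

```lisp
(defun palindrome-p (word)
  "True if WORD reads the same forwards and backwards, ignoring case."
  (string-equal word (reverse word)))

(remove-if-not #'palindrome-p '("tenet" "tibit" "tablet" "level"))
;; => ("tenet" "tibit" "level")
```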
An Isogram is a word that has no repeating letters. Let us look for 5-letter isograms starting with the letter c:
cg-user(8): (get-matching-words :word-pat "^c" :num-letters 5 :isogram 1)
("cabin" "cabot" "cadge" "cadgy" "cagey" "caine" "caird" "cairn" "cager" "caius" …)
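The isogram property is equally easy to test for a single word. The sketch below compares the number of letters with the number of distinct letters; `isogram-p` is a hypothetical name, and ignoring case and non-letters (such as hyphens) is my assumption about how such words should be treated.

```lisp
(defun isogram-p (word)
  "True if no letter occurs more than once in WORD.
Case is ignored, and non-letter characters are skipped (an assumption)."
  (let ((letters (remove-if-not #'alpha-char-p (string-downcase word))))
    (= (length letters)
       (length (remove-duplicates letters)))))

(remove-if-not #'isogram-p '("cabin" "cabal" "cairn" "cocoa"))
;; => ("cabin" "cairn")
```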
Onomatopoeia is the use of words that imitate the sounds associated with the things they denote. Two examples are buzz and hiss.
Here are some Onomatopoeic words that end in g:
cg-user(9): (get-matching-words :word-pat "g$" :onomatopoeia 1)
("bang" "bing" "bong" "clang" "ding-dong" "ding" "gong" "ping" "pong" "ring" …)
Semordnilap refers to a word which, when read in reverse, means another word. Notice that Semordnilap is Palindromes spelt in reverse!
cg-user(10): (get-matching-words :word-pat "^p" :semordnilap 1 :num-letters '(4 6))
("part" "peek" "plew" "plug" "pool" "poort" "pope" "port" "pose" "prat" "proc" "prod" …)
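Unlike the palindrome and isogram checks, a semordnilap cannot be recognized from the word alone, because the reversed spelling must itself be in the dictionary. A minimal sketch using a hash table for the lookup follows; `semordnilaps` is a hypothetical name, and excluding palindromes (whose reversal is the same word) is my assumption, not something I can confirm from the output above.

```lisp
(defun semordnilaps (words)
  "Return the words in WORDS whose reversal is a different word in WORDS."
  (let ((seen (make-hash-table :test #'equal)))
    ;; Index every word so that reversed spellings can be looked up quickly.
    (dolist (w words)
      (setf (gethash w seen) t))
    (remove-if-not (lambda (w)
                     (let ((r (reverse w)))
                       (and (string/= w r)      ; skip palindromes
                            (gethash r seen)))) ; reversal must be a word
                   words)))

(semordnilaps '("part" "trap" "pool" "loop" "tenet" "gnome"))
;; => ("part" "trap" "pool" "loop")
```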
In addition to all these interesting word patterns, iLexicon contains another useful feature. Given a word, it can give you other grammatical forms of the word. For instance, if you want the comparative degree of the adjective good, you can do this:
cg-user(11): (get-word-forms "good" '(D2))
("better")
Here D stands for degree and 2 means comparative. So what is the positive form of best? Let us find out:
cg-user(12): (get-word-forms "best" '(D1))
("good" "well")
We get both good and well. Useful, isn’t it?
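One way to picture why "best" maps back to two words is a table of degree rows that can be searched in either direction. The sketch below is purely illustrative: `*toy-degree-forms*` and `toy-degree-forms` are hypothetical, the three rows are hand-written, and treating 3 as the superlative is my assumption by analogy with D1 and D2.

```lisp
;; Hypothetical table: each row is (positive comparative superlative).
(defparameter *toy-degree-forms*
  '(("good" "better" "best")
    ("well" "better" "best")
    ("bad"  "worse"  "worst")))

(defun toy-degree-forms (word degree)
  "Return the forms of degree DEGREE (1 = positive, 2 = comparative,
3 = superlative) from every row that contains WORD in any degree."
  (remove-duplicates
   (loop for row in *toy-degree-forms*
         when (member word row :test #'string=)
           collect (nth (1- degree) row))
   :test #'string= :from-end t))

(toy-degree-forms "good" 2)  ; => ("better")
(toy-degree-forms "best" 1)  ; => ("good" "well")
```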
There are many more exotic features built into iLexicon, and of course, it is still a work in progress. I will share more details about this project in due course.
The core engine of iLexicon is written in Allegro CL on Windows.