Dependency parsing is widely used these days, and many NLP tools give a dependency graph as the parsed representation of the input text; see, for example, spaCy and TextRazor. The following is the dependency tree corresponding to the sentence "Mary is drinking cold water":
The above tree was generated using spaCy. You can see that the arrow points from the dependent word to its head word, and each arrow carries a label denoting the relationship between the head word and its dependent (some tools use the opposite convention, with the arrow pointing from the head word to the dependent). The above graph (actually, a tree) can be represented as a collection of triples as follows (not formal RDF):
(drinking aux is)
(drinking nsubj Mary)
(water dobj drinking)
(water amod cold)
This is not the only way, but you get the idea. Expressed this way, we can immediately see the similarity between the dependency graph and the RDF used in the Semantic Web. If we can convert the dependency graph to RDF, we can then take advantage of the large number of tools available for operating on RDF graphs. Tools such as Jena and AllegroGraph are known for handling very large data sets, and come bundled with support for the SPARQL query language and efficient reasoners.
Before attempting the conversion, we have to be clear about what information needs to be carried over from the dependency representation. Here is my list:
1) We should have some unique IDs for the different sentences. So if someone wants to know which sentences contain, for example, the word “likes”, this information should be available. To take an even simpler example, we should be able to answer the question “How many sentences are there in the given text?”.
2) Each occurrence of a word must have a unique ID. To elaborate: if the same word occurs multiple times in the same sentence, each occurrence must get a different ID.
3) Obviously, we need to capture the <head-word, dependency, dependent-word> relationship.
4) I feel it is useful to keep track of the part-of-speech (POS) of each occurrence of a word. Remember that a word such as “sleep” could act as a Noun in one place and Verb in another.
5) The lemma (root form) of each word could also prove useful. This way, we can search for sentences that have the lemma “be”, without worrying about whether it is “is” or “was”.
When converting to RDF, we have to remember that there are many serialization formats, not just one. Common formats are Turtle, N-triples, N-Quads, N3, RDF/XML, and RDF/JSON. I chose Turtle as my output format.
Another design decision is to use an appropriate Namespace for qualifying the different URIs. Again, to simplify my work and to get started quickly, I have used my own namespace for the URIs. Although not critical at this point, I feel this aspect has to be addressed eventually.
Let us consider the simple sentence: John loves Mary.
Here is the Turtle representation generated by my converter:
# Dependency Graph Representation in Turtle Format.
@prefix m: <http://mmsindia/depgraph/example/> .
m:sent-1 m:word m:word-250 .
m:word-250 m:pos "NNP" .
m:word-250 m:lemma "john" .
m:word-251 m:nsubj m:word-250 .
m:sent-1 m:word m:word-251 .
m:word-251 m:pos "VBZ" .
m:word-251 m:lemma "love" .
m:word-251 m:ROOT m:word-251 .
m:sent-1 m:word m:word-252 .
m:word-252 m:pos "NNP" .
m:word-252 m:lemma "mary" .
m:word-251 m:dobj m:word-252 .
m:sent-1 m:word m:word-253 .
m:word-253 m:pos "." .
m:word-253 m:lemma "." .
m:word-251 m:punct m:word-253 .
m:word-250 m:label "John" .
m:word-251 m:label "loves" .
m:word-252 m:label "Mary" .
m:word-253 m:label "." .
There is a triple of the form {m:sent-<id> m:word m:word-<id>} for each word in the text. Each word-<id> is unique to a single occurrence of a word, and for each word-<id> we also record the corresponding lemma and POS. You will also notice that the last section enumerates all the word instances as triples of the form {m:word-<id> m:label <literal>}.
I think the conversion logic is easy to understand. As I mentioned earlier, once we obtain the RDF representation, we can do many interesting things with it.
In the next post, I will show how we can import this data into a graph database (AllegroGraph) and query the graph.
I implemented the conversion logic in Python. The program uses spaCy to parse the input text and convert the resulting dependency graph to Turtle format. The input and output files are passed as command-line parameters to the program. You can download the Python program here. A sample input file containing multiple sentences, and the corresponding TTL file, are also available for download.
Have a nice weekend!