Wednesday, May 21, 2008

Response to comment...

Roger asked "How do you intend to verify whether you have understood a text?"

Good question!

The short answer is to look at the logical representation of the text in memory. It would be even better if I had an interface to be able to ASK questions of this logical representation. For instance, if the following sentence is read: "The man slowly climbed the stairs." Then we should be able to ask if the man climbed and the answer should be "yes". If we ask what the man climbed then the answer should be "the stairs". Additionally, I would like to see the system learn that men can climb stairs and that this can be done at varying speeds (or at least slowly).
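As a rough illustration of the kind of question-answering described above, here is a minimal sketch. It assumes the sentence has already been interpreted by hand into a simple event record; the `Event` class, its field names, and the `did` and `what` helpers are all hypothetical names invented for this example, not part of any existing system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    """One hand-built fact: who did what, to what, and how (hypothetical schema)."""
    agent: str
    action: str
    patient: Optional[str] = None
    manner: Optional[str] = None


# Hand-coded representation of "The man slowly climbed the stairs."
# No parsing happens here; the interpretation step is assumed to be done already.
memory = [Event(agent="man", action="climb", patient="stairs", manner="slowly")]


def did(facts: List[Event], agent: str, action: str) -> bool:
    """Answer a yes/no question such as 'Did the man climb?'."""
    return any(e.agent == agent and e.action == action for e in facts)


def what(facts: List[Event], agent: str, action: str) -> List[str]:
    """Answer 'What did <agent> <action>?' with every matching object."""
    return [
        e.patient
        for e in facts
        if e.agent == agent and e.action == action and e.patient is not None
    ]


print("yes" if did(memory, "man", "climb") else "no")
print(what(memory, "man", "climb"))
```

The hard parts (parsing the sentence and filling in the record automatically) are exactly the open questions; this only shows what "asking" the representation could look like once it exists.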

Of course I have to learn to walk before I can fly, so there are more unanswered questions than answered ones right now. How will I query those facts? How will I represent the facts internally? How will I interpret the text into this internal representation? How will I parse the sentences before interpreting them? How will I recognize parts of speech? How will I learn those rules for recognition of parts of speech?

This is where I think I am now. Using the Porter Stemmer, I found a way to recognize some words as verbs. Next I have to figure out how to recognize other parts of speech and what to do with the words I DON'T recognize. Maybe if the application can get the list of words it doesn't understand down to a manageable size, it could ask a user for some information. Hopefully, by asking a user a few careful questions it will be able to learn rules that will allow it to categorize large quantities of unknown words. One thing I DON'T want to do is use some sort of preexisting knowledge of what parts of speech words are. I also don't want to train my application on test data and then have it stuck at that level of understanding. I would prefer to make an application that, programmed with a core set of rules, would be able to build up its own dictionary and continuously refine its understanding with every bit of text it encounters.

Where I am so far...

I became interested in Natural Language Processing when I was reading Artificial Intelligence: A Modern Approach (2e) by Stuart Russell and Peter Norvig. I read the first 13 chapters in depth but only did a single read-through of the remaining chapters. I intend to go back and finish the rest of it later. After that I started reading Foundations of Statistical Natural Language Processing by Manning and Schütze. Though I read the whole thing, I became very lost in the final chapters. One of the things that made me lose interest is that it was more about how to statistically recognize languages, while my interest is more in getting a computer to understand and create an internal representation of what was read. I also felt that some of the concepts would be easier to understand after I read more introductory material.
This led me to start reading Speech and Language Processing by Jurafsky and Martin. I think the book is great and I just finished chapter five today. Chapter 3 introduced me to stemming. I found a text version of a novel online (a cheap sci-fi) and I've written an application to parse out all the words. I implemented the Porter Stemming algorithm and was pleased to find that, of the 8000 distinct words in my file, I could determine that about 2000 of them were stems of other words. I have the system make a "guess" that a word is a verb if it finds the stem, -ing, -ed and -s versions of a word.
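The verb "guess" can be sketched in a few lines. Note the assumptions: this version does not use the Porter stemmer at all, it simply appends the suffixes to each word and checks whether the results appear in the vocabulary, so it misses cases like "bake"/"baking" that a real stemmer would handle. The function name `guess_verbs` and the sample word list are made up for illustration.

```python
def guess_verbs(vocabulary):
    """Guess that a word is a verb if its -ing, -ed and -s forms all occur too.

    Simplification (an assumption of this sketch): forms are built by plain
    suffix concatenation rather than by running a Porter stemmer.
    """
    words = {w.lower() for w in vocabulary}
    verbs = set()
    for word in words:
        if all(word + suffix in words for suffix in ("ing", "ed", "s")):
            verbs.add(word)
    return verbs


# Tiny made-up vocabulary; a real run would use every distinct word in the novel.
sample = [
    "climb", "climbing", "climbed", "climbs",
    "stair", "stairs", "slowly",
    "walk", "walked",
]
print(sorted(guess_verbs(sample)))
```

Here only "climb" passes, because "walk" is missing its -ing and -s forms and "stair" has no -ing or -ed form. Requiring all three forms keeps the guess conservative, at the cost of missing verbs that are rare in the text.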
Chapter 5 is where the book became really interesting as far as I am concerned. I've implemented the Levenshtein minimum edit distance algorithm and am working on the forward algorithm. I plan to implement the Viterbi algorithm next week sometime.
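
For reference, a minimal sketch of the minimum edit distance computation using the standard dynamic programming table. The function name and the `sub_cost` parameter are choices made for this example: a substitution cost of 1 gives the usual Levenshtein distance, while some textbook presentations charge 2 for a substitution (one delete plus one insert).

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum edit distance between two strings via dynamic programming.

    Insertions and deletions cost 1 each; substitutions cost `sub_cost`
    (a parameter added for this sketch).
    """
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every character of target[:j]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitution
            )
    return dist[n][m]


print(min_edit_distance("kitten", "sitting"))                   # 3
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```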

Let the blogging begin...

Hi, I started this blog because I'm very interested in Natural Language Processing. My primary interest is in representing the semantics of text to a computer.
I am just beginning my study in this area and would appreciate any input anyone has in the form of comments!