Recently I’ve been working on an extraction tool that analyzes a large amount of text, picks out relevant, contextual noun phrases and proper nouns, and returns the sentiment of the text surrounding those words. I’ve been using Python and a few libraries and packages to help with this, specifically NLTK, NumPy, PyYAML, and nameparser. The Natural Language Toolkit (NLTK) has an O’Reilly book available online that has been awesome in showing how to use the toolkit.
At first I started out extracting contextual nouns with a regular expression that matches a word starting with a capital letter [A-Z] followed by lowercase letters [a-z], then a space [\s], then another word starting with a capital letter [A-Z] followed by lowercase letters [a-z]. I used books from Project Gutenberg to test it out (Oliver Twist by Charles Dickens in this example), and after removing duplicates, 268 unique phrases matched. These included street names, place names, character names, and a few common nouns. It was quick, but not really that good at extracting all of the contextual nouns from the text.
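A minimal sketch of this kind of pattern might look like the following. The function name and the sample sentence are made up for illustration, and this is not the original script, only one way to write the two-capitalized-words match described above.

```python
import re

# Two adjacent capitalized words, e.g. "Oliver Twist" or "London Bridge".
CAPITALIZED_PAIR = re.compile(r"\b[A-Z][a-z]+\s[A-Z][a-z]+\b")

def capitalized_pairs(text):
    """Return the unique capitalized two-word phrases, in order of first appearance."""
    seen = set()
    phrases = []
    for match in CAPITALIZED_PAIR.finditer(text):
        phrase = match.group(0)
        if phrase not in seen:  # drop duplicates
            seen.add(phrase)
            phrases.append(phrase)
    return phrases

# Hypothetical sample text, not a quotation from the book.
sample = "Oliver Twist walked to London Bridge. Later Oliver Twist met Charley Bates."
print(capitalized_pairs(sample))
```

On this sample the pattern also picks up "Later Oliver", because a capitalized word at the start of a sentence looks the same as the first half of a name. That is the kind of noise that makes this approach quick but rough.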
After working through the problem a bit more, I came upon an inspiring example by Alex Rowe that really went above and beyond what I had done. With this I was able to use the power of the NLTK to remove stopwords (very common words such as “the” and “of” that contribute little to the meaning of a phrase), group the returned words by the context in which they appear in the text, and make myself a coffee while waiting for the script to finish. This returned 37,879 unique noun phrases!
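To make the stopword idea concrete, here is a small sketch that uses only the standard library. The stopword set below is a tiny hand-written stand-in (an assumption for illustration); NLTK ships a much longer English list in its stopwords corpus, and the example referred to above does far more than this, including the noun phrase grouping.

```python
import re

# Tiny stand-in stopword set (an assumption for illustration);
# NLTK's stopwords corpus provides a much longer English list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "was", "he", "it"}

def content_words(text):
    """Lowercase the text, split it into words, and drop the stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOPWORDS]

# Hypothetical sample sentence.
print(content_words("The boy was taken to the workhouse in the town"))
```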
The unfortunate thing about this example, though, is that it couldn’t pick out proper nouns from the text, so I still needed my original script. The NLTK corpus collection includes an excellent corpus called names, which contains two text files of female and male names. I compared my unique noun phrases against this corpus to create a separate list of potential character names from the text. This was very good, but due to parental creativity, it still returned words like “bliss”, “town”, and “pleasure” that apparently are baby names somewhere in the world, but that I knew were unlikely to be character names from the book. So after examining the name list, I used a counter to measure how many times each name appeared in the text, removed the names that appeared infrequently (indicating a non-character), and then filtered out any duplicates.
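The counting step could be sketched roughly as below. The threshold of 3, the function name, and the sample text are all assumptions made for illustration, and the candidate list stands in for the names that matched the NLTK names corpus.

```python
import re
from collections import Counter

def frequent_names(text, candidate_names, min_count=3):
    """Keep the candidate names that occur at least min_count times in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # The set comprehension removes duplicate candidates; sorted() gives a stable order.
    return sorted({name for name in candidate_names if counts[name] >= min_count})

# Hypothetical sample text and candidate list (lowercase, as if already matched
# against a list of first names).
text = ("Oliver ran. Oliver hid. Oliver slept in the town. "
        "Rose saw Rose and Rose smiled with bliss.")
candidates = ["oliver", "rose", "town", "bliss", "rose"]
print(frequent_names(text, candidates))
```

On a full novel the cutoff would need tuning, since a minor character may be mentioned only a handful of times while a word like “town” appears often.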
I then took my original, tiny proper noun list and compared it against the character names extracted from the larger list. For each proper noun (“Oliver Twist”, “Covent Garden”, “London Bridge”), a regular expression search checked whether any of my extracted names (“oliver”, “rose”, “noah”) appeared in it as a whole word, either the first word or the last. The pattern used \b as a word boundary, (?=\w), a lookahead that requires a word character to come next, [%s] to insert the name from the list I was cycling through, another word boundary \b, and (?!\w), a negative lookahead that only matches where the next character is not a word character. One caveat: square brackets in a regular expression define a character class, so [%s] matches a single character from the name; inserting %s without the brackets is what matches the name as a whole. See this resource for more helpful info on regular expressions. This filtered list returned much better results, though it still contained phrases that were not exact duplicates but, on inspection, referred to the same character. For example, “Master Charley”, “But Charley”, “Charley Bates” and “Charles Bates”.
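A simplified sketch of the whole-word match follows. It uses plain \b boundaries around the escaped name rather than the lookaheads described above, which give the same result when the name starts and ends with a letter. The function name is hypothetical, and “Rosemary Lane” is a made-up entry added to show a near miss.

```python
import re

def nouns_with_names(proper_nouns, names):
    """Return the proper nouns that contain any extracted name as a whole word."""
    matched = []
    for noun in proper_nouns:
        for name in names:
            # re.escape guards against names containing regex metacharacters.
            pattern = r"\b%s\b" % re.escape(name)
            if re.search(pattern, noun, flags=re.IGNORECASE):
                matched.append(noun)
                break  # one matching name is enough for this noun
    return matched

proper_nouns = ["Oliver Twist", "Covent Garden", "London Bridge", "Rosemary Lane"]
names = ["oliver", "rose", "noah"]
print(nouns_with_names(proper_nouns, names))
```

The word boundaries are what keep “rose” from matching inside “Rosemary”, so only “Oliver Twist” survives in this sample.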
This project is ongoing, and I will post more when I’ve figured out a more advanced solution that doesn’t involve me looking through the returned list for fluff words that shouldn’t be there. Eventually I’ll be able to tokenize the names to give a larger analysis of the context surrounding them. I’ll leave you with a look at what nameparser, a Google Code project that parses human names into individual components, can do, which is return a much better list of proper nouns than my original attempt did. From there you can also get a person’s title as well as their first and last name, which is immensely helpful for further honing the character list.