Emily Daniels

Further Development of the Character Extraction Tool


Work has continued on analyzing the mood and sentiment of a text. I split the text into sentences using a regular expression that accounts for sentence endings (., !, ?, and the closing quotation marks ’ and ”) but doesn’t split on common abbreviations such as Mr., Mrs., and other titles. Once I had a list of sentences I pulled in my proper nouns list, which turned out to be the most accurate list of potential character name matches, and compared the two lists with a regular expression search: whenever a character name was found in a sentence, the sentence was appended to a dictionary under that name as the key.
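A rough sketch of that split-and-group step, using only the standard library. The title list, the sample text, and the function names are my own stand-ins for illustration, not the script's actual code, and the pattern is a simplification of the one described above.

```python
import re
from collections import defaultdict

# Hypothetical list of titles whose trailing period should not end a sentence.
TITLES = ("Mr", "Mrs", "Ms", "Dr")

# One negative lookbehind per title, then a split point on whitespace that
# follows . ! or ? (optionally followed by one closing quotation mark).
not_after_title = "".join(r"(?<!%s\.)" % title for title in TITLES)
SENTENCE_END = re.compile(
    not_after_title + r"(?:(?<=[.!?])|(?<=[.!?][’”'\"]))\s+"
)


def split_sentences(text):
    """Split text into sentences without breaking on titles like 'Mr.'."""
    return SENTENCE_END.split(text.strip())


def group_by_character(sentences, names):
    """Map each name to the sentences in which it appears as whole words."""
    grouped = defaultdict(list)
    for name in names:
        pattern = re.compile(r"\b%s\b" % re.escape(name))
        for sentence in sentences:
            if pattern.search(sentence):
                grouped[name].append(sentence)
    return dict(grouped)


text = "Mr. Bumble and Noah Claypole went out. Poor Noah! Oliver stayed."
sentences = split_sentences(text)
grouped = group_by_character(sentences, ["Noah Claypole", "Poor Noah"])
print(grouped)
```

Building one lookbehind per title keeps each of them fixed-width, which Python's re module requires.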

From there I used pattern to parse each sentence and return its mood, categorized into four types of grammatical mood: indicative, imperative, conditional, or subjunctive. From the pattern API documentation, an example of an indicative (fact or belief) sentence would be “It rains.” An example of an imperative (command or warning) sentence would be “Don’t rain!” An example of a conditional (conjecture) sentence would be “It might rain.” An example of a subjunctive (wish or opinion) sentence would be “I hope it rains.” I created a separate dictionary to hold the mood data, which is also keyed by character name.

For sentiment analysis I used the pattern.vector module. The module contains machine learning tools that use the previous classification of a body of text to predict the future classification of a body of text. From the documentation, classification is a supervised machine learning method that uses labeled documents (i.e., Document objects with a type) as training examples to statistically predict the label (class, type) of new documents, based on their similarity to the training examples using a distance metric (e.g., cosine similarity). A Document is a bag-of-words representation of a text, i.e., unordered words + word count.
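To make the bag-of-words and cosine similarity ideas concrete, here is a small standard-library sketch. It does not use pattern.vector's own Document class; the tokenizer and function names are assumptions of mine.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Unordered words plus their counts; a very simple tokenizer is assumed."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


rain = bag_of_words("It rains. It pours.")
also_rain = bag_of_words("It pours and it rains.")
unrelated = bag_of_words("Noah Claypole ran along the streets")
print(cosine_similarity(rain, also_rain))
print(cosine_similarity(rain, unrelated))
```

Documents that share many words score close to 1, and documents with no words in common score 0.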

The module provides a Naive Bayes classifier, which I used to extract information from a labeled text (in this case, sentences labeled as coming from a good or bad movie review) and to inform the prediction for unknown text (in this case, whether a sentence has a good or bad sentiment tone). A Naive Bayes classifier assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3″ in diameter. The classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of the presence or absence of the other features.
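The independence assumption is easier to see in code. This is a toy Naive Bayes written from scratch with add-one smoothing. It is not pattern's NB class, it uses two made-up labels ("good" and "bad") instead of a 0-5 score, and the training sentences are invented.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy multinomial Naive Bayes; a stand-in for illustration only."""

    def __init__(self):
        self.class_counts = Counter()            # training documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def train(self, words, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # Start from the log prior for the label.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Each word adds its own term, independent of the other words.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = TinyNaiveBayes()
classifier.train("a wonderful charming film".split(), "good")
classifier.train("a dull tedious film".split(), "bad")
print(classifier.classify(["charming"]))
print(classifier.classify(["tedious", "dull"]))
```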

To use the classifier I placed the training file reviews.csv in the same folder as my script and imported the corresponding modules, then called the train function on an NB object with the list of classified text pulled in from the CSV file. I then iterated over the keys and values in my dictionary of characters and sentences and called the classify function (from the NB object) on each sentence, appending the result to a list stored under the character name as the key. This gives each sentence a score from 0 to 5, with 0 being a bad or negative tone and 5 being a good or positive tone.

While I was debugging the scripts I found a different way to store my dictionaries so that I could write and read them quickly while preserving the nested structure of keys and values. In Python this is called pickling: the pickle module implements a fundamental but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. I had previously been writing the dictionaries to text files and reading them back, but when reading in the data I had to recreate the structure by splitting and stripping on different punctuation points in the text.
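A minimal round trip with the standard pickle module looks something like this. The file name and the sample dictionary are placeholders, and the temporary folder is only there so the sketch cleans up after itself.

```python
import os
import pickle
import tempfile

# Nested structure in the same shape as the raw data: sentences, tones, moods.
character_data = {
    "Poor Noah": [["Poor Noah!"], [1], ["indicative"]],
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "characters.pickle")  # hypothetical file name

    # Pickle files are binary, so open with "wb" and "rb".
    with open(path, "wb") as handle:
        pickle.dump(character_data, handle)

    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == character_data)
```

The nested lists come back exactly as they went in, with no splitting or stripping needed. One caution that applies to pickle in general: only unpickle files you created yourself, since loading a pickle can execute arbitrary code.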

Here is a character analysis of Noah Claypole from Oliver Twist, and the literary critique of the character from SparkNotes for comparison:

Character Synopsis of Noah Claypole:  A charity boy and Mr. Sowerberry’s apprentice. Noah is an overgrown, cowardly bully who mistreats Oliver and eventually joins Fagin’s gang.

Raw Data:

Whether Noah:[['Whether Noah Claypole, whose rapacity was none of the least comprehensive, would have acceded even to these glowing terms, had he been a perfectly free agent, is very doubtful; but as he recollected that, in the event of his refusal, it was in the power of his new acquaintance to give him up to justice immediately (and more unlikely things had come to pass), he gradually relented, and said he thought that would suit him.'], [0], ['conditional']]

Poor Noah:[['Poor Noah!'], [1], ['indicative']]

Noah Claypole:[["'I'm Mister Noah Claypole,' said the charity-boy, 'and you're under me.", 'said Noah Claypole.', 'That Oliver Twist was moved to resignation by the example of these good people, I cannot, although I am his biographer, undertake to affirm with any degree of confidence; but I can most distinctly say, that for many months he continued meekly to submit to the domination and ill-treatment of Noah Claypole: who used him far worse than before, now that his jealousy was roused by seeing the new boy promoted to the black stick and hatband, while he, the old one, remained stationary in the muffin-cap and leathers.', 'One day, Oliver and Noah had descended into the kitchen at the usual dinner-hour, to banquet upon a small joint of mutton--a pound and a half of the worst end of the neck--when Charlotte being called out of the way, there ensued a brief interval of time, which Noah Claypole, being hungry and vicious, considered he could not possibly devote to a worthier purpose than aggravating and tantalising young Oliver Twist.', 'CHAPTER VII\nOLIVER CONTINUES REFRACTORY\nNoah Claypole ran along the streets at his swiftest pace, and paused not once for breath, until he reached the workhouse-gate.', "And the cocked hat and cane having been, by this time, adjusted to their owner's satisfaction, Mr. Bumble and Noah Claypole betook themselves with all speed to the undertaker's shop.", 'Now, Mr. and Mrs. Sowerberry having gone out to tea and supper: and Noah Claypole not being at any time disposed to take upon himself a greater amount of physical exertion than is necessary to a convenient performance of the two functions of eating and drinking, the shop was not closed, although it was past the usual hour of shutting-up.', 'At the upper end of the table, Mr. 
Noah Claypole lolled negligently in an easy-chair, with his legs thrown over one of the arms: an open clasp-knife in one hand, and a mass of buttered bread in the other.', "'Never mind whether they're two mile off, or twenty,' said Noah Claypole; for he it was; 'but get up and come on, or I'll kick yer, and so I give yer notice.'", 'Through these streets, Noah Claypole walked, dragging Charlotte after him; now stepping into the kennel to embrace at a glance the whole external character of some small public-house; now jogging on again, as some fancied appearance induced him to believe it too public for his purpose.', 'asked Noah Claypole.', "Noah Claypole's mind might have been at ease after this assurance, but his body certainly was not; for he shuffled and writhed about, into various uncouth positions:  eyeing his new friend meanwhile with mingled fear and suspicion.", 'Whether Noah Claypole, whose rapacity was none of the least comprehensive, would have acceded even to these glowing terms, had he been a perfectly free agent, is very doubtful; but as he recollected that, in the event of his refusal, it was in the power of his new acquaintance to give him up to justice immediately (and more unlikely things had come to pass), he gradually relented, and said he thought that would suit him.', "Noah Claypole, bespeaking his good lady's attention, proceeded to enlighten her relative to the arrangement he had made, with all that haughtiness and air of superiority, becoming, not only a member of the sterner sex, but a gentleman who appreciated the dignity of a special appointment on the kinchin lay, in London and its vicinity.", 'Noah Claypole, or Morris Bolter as the reader pleases, punctually followed the directions he had received, which--Master Bates being pretty well acquainted with the locality--were so exact that he was enabled to gain the magisterial presence without asking any question, or meeting with any interruption by the way.', 'CHAPTER XLV\nNOAH CLAYPOLE IS 
EMPLOYED BY FAGIN ON A SECRET MISSION\nThe old man was up, betimes, next morning, and waited impatiently for the appearance of his new associate, who after a delay that seemed interminable, at length presented himself, and commenced a voracious assault on the breakfast.', "Peeping out, more than once, when he reached the top, to make sure that he was unobserved, Noah Claypole darted away at his utmost speed, and made for the Jew's house as fast as his legs would carry him.", 'Stretched upon a mattress on the floor, lay Noah Claypole, fast asleep.', 'Mr. Noah Claypole:  receiving a free pardon from the Crown in consequence of being admitted approver against Fagin:  and considering his profession not altogether as safe a one as he could wish:  was, for some little time, at a loss for the means of a livelihood, not burdened with too much work.'], [0, 0, 0, 4, 0, 2, 0, 2, 0, 4, 0, 4, 0, 1, 4, 1, 2, 2, 4], ['indicative', 'indicative', 'indicative', 'conditional', 'indicative', 'indicative', 'indicative', 'indicative', 'conditional', 'indicative', 'indicative', 'conditional', 'conditional', 'indicative', 'indicative', 'indicative', 'conditional', 'indicative', 'conditional']]

As Noah:[["As Noah's red nose grew redder with anger, and as he crossed the road while speaking, as if fully prepared to put his threat into execution, the woman rose without any further remark, and trudged onward by his side."], [3], ['indicative']]

If Noah:[["If Noah had been attired in his charity-boy's dress, there might have been some reason for the Jew opening his eyes so wide; but as he had discarded the coat and badge, and wore a short smock-frock over his leathers, there seemed no particular reason for his appearance exciting so much attention in a public-house."], [0], ['conditional']]

Mister Noah:[["'I'm Mister Noah Claypole,' said the charity-boy, 'and you're under me.", "Oliver, shut that door at Mister Noah's back, and take them bits that I've put out on the cover of the bread-pan."], [0, 0], ['indicative', 'indicative']]


Tone in sentences containing character (a negative-positive scale of 0-5):

0, 0, 0, 4, 0, 2, 0, 2, 0, 4, 0, 4, 0, 1, 4, 1, 2, 2, 4, 3, 0, 0, 0 Average: 1.4 (negative)

Mood combined with tone (indicative: fact or belief, imperative: command or warning, conditional:  conjecture, subjunctive: wish or opinion):

0 (negative) indicative, 0 (negative) indicative, 0 (negative) indicative, 4 (positive) conditional, 0 (negative) indicative, 2 (negative) indicative, 0 (negative) indicative, 2 (negative) indicative, 0 (negative) conditional, 4 (positive) indicative, 0 (negative) indicative, 4 (positive) conditional, 0 (negative) conditional, 1 (negative) indicative, 4 (positive) indicative, 1 (negative) indicative, 2 (negative) conditional, 2 (negative) indicative, 4 (positive) conditional, 3 (neutral) indicative, 0 (negative) conditional, 0 (negative) indicative, 0 (negative) indicative

Occurrences in text: indicative: 16, imperative: 0, conditional: 7, subjunctive: 0
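The average and the mood counts above can be reproduced from the two lists with a few lines of Python. The tone labels are an assumption that follows the listing (0-2 negative, 3 neutral, 4-5 positive).

```python
from collections import Counter

# Tone scores and moods copied from the listings above, in the same order.
tones = [0, 0, 0, 4, 0, 2, 0, 2, 0, 4, 0, 4, 0, 1, 4, 1, 2, 2, 4, 3, 0, 0, 0]
moods = [
    "indicative", "indicative", "indicative", "conditional", "indicative",
    "indicative", "indicative", "indicative", "conditional", "indicative",
    "indicative", "conditional", "conditional", "indicative", "indicative",
    "indicative", "conditional", "indicative", "conditional", "indicative",
    "conditional", "indicative", "indicative",
]


def tone_label(score):
    """Assumed thresholds: 0-2 negative, 3 neutral, 4-5 positive."""
    if score < 3:
        return "negative"
    if score == 3:
        return "neutral"
    return "positive"


average = sum(tones) / len(tones)
mood_counts = Counter(moods)
combined = ["%d (%s) %s" % (t, tone_label(t), m) for t, m in zip(tones, moods)]

print(round(average, 1), tone_label(average))
print(mood_counts)
```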

Postulation: The author uses overall negative factual and belief-based sentences combined with overall negative conjecture-based sentences to construct this antagonistic character in the story.

From here I’ll continue to hone the tool, expand the data on each character, and narrow the characters extracted from the text so that I can ensure the entities that are returned are more accurate.


Character Extraction with Python and NLTK


Recently I’ve been working on an extraction tool that analyzes a large amount of text to find relevant, contextual noun phrases and proper nouns and return the sentiment of the text surrounding them. I’ve been using Python and a few natural language processing libraries and packages to help with this, specifically NLTK, NumPy, PyYAML, and nameparser. The Natural Language Toolkit (NLTK) has an O’Reilly book available online that has been awesome in showing how to use the toolkit.

At first I started out by extracting contextual nouns with a regular expression that matches a word starting with a capital letter [A-Z] followed by lowercase letters [a-z], then a space [\s], then another word starting with a capital letter [A-Z] followed by lowercase letters [a-z]. I used books from Project Gutenberg (Oliver Twist by Charles Dickens in this example) to test it out, and after removing the duplicates 268 unique phrases matched. This returned street names, place names, character names, and a few other common nouns. It was quick but not really that good for extracting all of the contextual nouns from the text.
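That first pass can be sketched in a few lines. The sample text here is made up; the pattern is a direct reading of the description above.

```python
import re

# Capitalized word, one whitespace character, capitalized word.
PROPER_NOUN_PAIR = re.compile(r"[A-Z][a-z]+\s[A-Z][a-z]+")

text = "Oliver Twist walked past Covent Garden. Oliver Twist was tired."

# A set removes the duplicates; sorting just makes the output stable.
phrases = sorted(set(PROPER_NOUN_PAIR.findall(text)))
print(phrases)
```

Because the pattern only looks at capitalization, any capitalized word at the start of a sentence that is followed by a name gets picked up too, which is where phrases like “But Charley” come from.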

After working through the problem a bit more, I came upon an inspiring example by Alex Rowe that really went above and beyond what I had done. With this I was able to use the power of the NLTK to remove words commonly known as stopwords (insignificant parts of speech that don’t contribute to the meaning of a phrase), group the returned words in context of where they appear in the text, and make myself a coffee while waiting for the script to finish. This returned 37,879 unique noun phrases!

The unfortunate thing about this example, though, is that it couldn’t pick out proper nouns from the text, so I was still in need of my original script. NLTK includes an excellent corpus called names, which contains two text files of female and male names. I compared my unique noun phrases against this corpus to create a separate list of potential character names from the text. This was very good, but due to parental creativity it was still returning words like “bliss”, “town”, and “pleasure” that apparently are baby names somewhere in the world, but which I knew were unlikely to be character names from the book. So after examining the name list, I used a counter to measure the number of times each name appeared in the text, removed the names that appeared infrequently (indicating a non-character), and then filtered out any duplicates.
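The counting step might look roughly like this. The small name set stands in for the NLTK names corpus, and both the threshold and the sample text are assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical stand-in for the NLTK names corpus (lowercased).
KNOWN_NAMES = {"oliver", "noah", "rose", "bliss", "town"}


def likely_characters(text, known_names, min_count=2):
    """Keep only known names that appear at least min_count times in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word in known_names)
    return sorted(name for name, n in counts.items() if n >= min_count)


text = (
    "Oliver met Noah in town. Noah teased Oliver. "
    "What bliss, thought Rose, as Rose smiled."
)
print(likely_characters(text, KNOWN_NAMES))
```

A frequency cutoff like this drops one-off words such as “bliss” and “town”, though an ordinary word that doubles as a name (the verb “rose”, for example) can still slip through if it appears often enough.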

I then compared my original, much smaller proper noun list against the character names extracted from the larger list. A regular expression search took each proper noun (“Oliver Twist”, “Covent Garden”, “London Bridge”) and checked whether any of my extracted names (“oliver”, “rose”, “noah”) appeared in it as a whole word, either the first word or the last. The pattern used \b as a word boundary, (?=\w) as a positive lookahead requiring a word character to follow, [%s] to insert the name from the list I was cycling through, another word boundary \b, and (?!\w), a negative lookahead that succeeds only where the next character is not a word character. See this resource for more helpful info on regular expressions. This filtered list returned much better results, though it still contained distinct phrases that were not duplicates but on inspection referred to one character, for example “Master Charley”, “But Charley”, “Charley Bates” and “Charles Bates”.
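Here is a simplified sketch of that comparison. It is not the exact pattern described above: it inserts the escaped name directly between two \b word boundaries rather than inside square brackets, since brackets would turn the name into a character class. The sample lists and the function name are made up.

```python
import re


def nouns_with_names(proper_nouns, names):
    """Return the proper nouns that contain one of the names as a whole word."""
    matched = []
    for noun in proper_nouns:
        for name in names:
            # \b on both sides means "rose" will not match inside "Rosebud".
            if re.search(r"\b%s\b" % re.escape(name), noun, re.IGNORECASE):
                matched.append(noun)
                break
    return matched


proper_nouns = [
    "Oliver Twist", "Covent Garden", "London Bridge",
    "Master Charley", "Rosebud Lane",
]
names = ["oliver", "rose", "charley"]
print(nouns_with_names(proper_nouns, names))
```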

This project is ongoing and I will post more when I’ve figured out a more advanced solution that does not involve me manually checking the returned list for fluff words that shouldn’t be there. Eventually I’ll be able to tokenize the names to give a larger analysis of the context surrounding them. In the meantime, take a look at what nameparser, a Google Code project that parses human names into their individual components, can do: it returns a much better list of proper nouns than my original script did. From its output you can also get a person’s title as well as first and last name, which is immensely helpful for further honing the character list.




A drawing I did of a photo of a salamander under water a few years ago:


Indiana Jones Distilled

Jones jones = new Jones(gun, whip, hat);

BadGuy badGuy = new BadGuy(gun, funds, twistedAgenda);


if (jones.getsMap){


if (badGuy.getsTreasure){




else if (jones.getsTreasure){




if (treasureDestroyed){







Regex C# Phone Validation

I can’t add to the commenting on this question, but I wanted to share a regular expression I wrote for phone number validation in C#. Another solution posted in the question accepts 10 digits, () around area code, and doesn’t allow preceding 1 as country code:


But using this regex in C# returns an [x-y] range in reverse order error similar to this posting. Here is a regular expression solution that allows for “()”, “.”, and “-” between number fields:

Regex validPhone = new Regex(@"^(?:(?:[\(.-]\s*)?(?:(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[\).-]\s*)?)([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[-.]\s*)?([0-9]{4})$");

This allows phone numbers in these formats:

  •  (333)456-7890
  •  (333) 456-7890
  •  333-456-7890
  •  3334567890
  •  333 - 456 - 7890
  •  333.456.7890

And not in these formats:

  • (123)456-7890
  • (123) 456-7890
  • 123-456-7890
  • (411)456-7890
  • (911)456-7890
  • (000)456-7890
  • (111)456-7890
  • 9999999999
  • 999-999-9999
  • (999)999-9999
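The two lists can be spot-checked outside of C# as well. This is a quick sketch in Python, on the assumption that Python's re module treats this particular pattern the same way .NET does (it only uses character classes, groups, and quantifiers common to both). The spaced example is typed with plain hyphens.

```python
import re

# Same pattern as the C# Regex above, written as a Python raw string.
VALID_PHONE = re.compile(
    r"^(?:(?:[\(.-]\s*)?(?:(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*)"
    r"|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[\).-]\s*)?)"
    r"([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[-.]\s*)?([0-9]{4})$"
)

accepted = [
    "(333)456-7890", "(333) 456-7890", "333-456-7890",
    "3334567890", "333 - 456 - 7890", "333.456.7890",
]
rejected = [
    "(123)456-7890", "(123) 456-7890", "123-456-7890", "(411)456-7890",
    "(911)456-7890", "(000)456-7890", "(111)456-7890", "9999999999",
    "999-999-9999", "(999)999-9999",
]

for number in accepted + rejected:
    print(number, bool(VALID_PHONE.match(number)))
```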

Hope this helps!


Copyright © 2014 Emily Daniels
