Using Data to Tell Better Stories

Lately I’ve been interested in sentiment analysis, aka opinion mining. It combines natural language processing with text analysis to identify (and quantify!) the feelings that words create. Julia Silge has a fantastic example on her blog, where she mined her favorite book, Pride and Prejudice.

A couple of years ago, I attended StoryCon at UC Berkeley, where the topic was how communicators working in public health could use stories to improve messaging. (Of course I used that information as a basis for a pretty nerdy post over at my blog, thegeekticket.com.) Julia’s blog post got me rethinking how data could inform narrative.

Can you mine stories for insight into how to tell a better story?

The Data

Without a specific application in mind for my analysis, I decided to look at archetypes – what we think of as classic tales.

My first thought was to look at stories in the public domain, particularly really short stories and a whole lot of them. I found an old site called All Family Resources with 209 plain text versions of Grimm’s Fairy Tales. Best of all, the naming convention for the URLs made it easy to scrape the pages using R’s RCurl and XML packages.

I created a numerical vector of 1 to 209 and then randomly pulled 100 numbers from it.

# vector of story numbers
num <- 1:209

# random sample of 100 story numbers, seeded so the draw is repeatable
set.seed(1234)
rand <- sample(num,100,replace=FALSE)

I then wrote a function that would cycle through the numbers and create a data frame with information from the scraped pages.

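# packages used for scraping and sentiment scoring
library(RCurl)
library(XML)
library(syuzhet)

get_data <- function(x) {
  
  # start with an empty data frame so no placeholder row ends up in the results
  temp <- data.frame(title=character(),narrative.time=integer(),sentiment=numeric())
  
  for(item in x) {
    # scraping: single-digit story numbers are zero-padded in the URL
    if(nchar(item)<2) {
      url <- paste("http://www.familymanagement.com/literacy/grimms/grimms0",item,".html",sep="")
    } else {
      url <- paste("http://www.familymanagement.com/literacy/grimms/grimms",item,".html",sep="")
    }
    data <- getURL(url)
    doc <- htmlParse(data,asText = TRUE)
    text <- xpathSApply(doc,"//p",xmlValue)
    
    # remove everything before the story and after
    # before: the first three paragraphs
    text <- text[-c(1:3)]
    # after: the last seven paragraphs
    text <- text[-c((length(text)-6):length(text))]
    
    # sentiments: split into sentences, then score each sentence
    s_v <- get_sentences(text)
    sentiment <- get_sentiment(s_v)
    # skip stories with 20 sentences or fewer
    if(length(s_v)<=20) { next } else {
      # average the sentence scores into 10 equal bins
      perc <- get_percentage_values(sentiment,bins=10)
      
      # title: the first sentence, cut at the first line break
      temp <- rbind(data.frame(title=rep(sub(" \\n.*$","",s_v[1]),length(perc)),
                                   narrative.time=c(1:length(perc)),
                                   sentiment=perc
      ),temp)
    }
  }
  temp
}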
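
The if/else in the function exists only to zero-pad single-digit story numbers. As a minimal sketch of that step on its own, assuming the same URL pattern, sprintf can do the padding in one call. The helper name build_url is made up for this example and is not part of the original script.

```r
# Base of the story URLs, taken from the scraping function above
base_url <- "http://www.familymanagement.com/literacy/grimms/grimms"

# build_url is a hypothetical helper: it pads the story number
# to at least two digits and appends the file extension
build_url <- function(item) {
  paste0(base_url, sprintf("%02d", as.integer(item)), ".html")
}

urls <- build_url(c(7, 42, 209))
```

Numbers with two or three digits pass through unchanged, so the same call covers every story number from 1 to 209.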

That Sentimental Feeling

Of course I needed something to extract the sentiment from each narrative.

Using R’s syuzhet package, not only could I break each story down sentence by sentence, but I could use the package’s get_sentiment function to quantify positive and negative sentiments found in the text. I went with the default method for the function, which according to the documentation is a custom sentiment dictionary developed in the Nebraska Literary Lab that “should be better tuned to fiction” as it was based on sentences from a small corpus of contemporary novels.
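
To make the dictionary idea concrete, here is a minimal base R sketch of dictionary-based scoring. It is not syuzhet's code or its dictionary: toy_lexicon, its words and scores, and score_sentence are all invented for illustration.

```r
# Toy lexicon: the words and scores are made up for this example only
toy_lexicon <- c(happy = 1, love = 1, kind = 0.5,
                 wicked = -1, died = -1, afraid = -0.5)

# score_sentence is a hypothetical helper: it lowercases a sentence,
# splits it into words, and sums the scores of the words found in the lexicon
score_sentence <- function(sentence, lexicon = toy_lexicon) {
  words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  sum(lexicon[words], na.rm = TRUE)  # words not in the lexicon count as 0
}

sentences <- c("The wicked queen died.",
               "They were happy and kind.",
               "The miller had three sons.")
scores <- vapply(sentences, score_sentence, numeric(1), USE.NAMES = FALSE)
```

Each sentence ends up with a single number, which is the same shape of result the per-sentence scoring step in the function hands to the binning step.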

The final step was breaking the sentiment values into equal bins, since I would be comparing 100 stories of unequal length. I used the package’s get_percentage_values function with the option of 10 bins.
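
As a rough sketch of what that binning does (an illustration in base R, not the package's own implementation), each sentence is assigned to one of 10 equal slices of the story and the scores in each slice are averaged. The name bin_means is hypothetical and the scores are made-up numbers.

```r
# bin_means is a hypothetical helper: it averages a vector of
# sentence scores into a fixed number of equal-sized bins
bin_means <- function(values, bins = 10) {
  # map sentence position i (1..n) to a bin number (1..bins)
  bin <- ceiling(seq_along(values) * bins / length(values))
  as.numeric(tapply(values, bin, mean))
}

# made-up sentence scores for a 20-sentence story
scores <- c(1, 1, 0.5, 0.5, 0, 0, -0.5, -0.5, -1, -1,
            -1, -1, -0.5, -0.5, 0, 0, 0.5, 0.5, 1, 1)
binned <- bin_means(scores)
```

A 20-sentence story and a 200-sentence story both come out as 10 numbers, which is what makes stories of unequal length comparable.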

The Results

In Figure 1, you see all 100 stories on a single plot, with Narrative Time (broken into 10 bins) on the x-axis and “emotional valence” on the y-axis – “emotional valence” being a positive or negative value related to the feeling created by the text. Essentially, a positive number means a positive feeling and a negative number means a negative feeling.

It’s messy and beautiful in a crazy way, but it doesn’t tell us anything.

Fig. 1

Using R’s dplyr package, I grouped the stories by Narrative Time and found the average sentiment for each bin. Figure 2 is basically Figure 1 with the average plotted.
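
The grouping step is small enough to sketch. The version below uses base R's aggregate as a stand-in for the dplyr call, on a made-up two-story data frame with the same three columns that get_data returns. The titles and numbers are invented and are not results from the fairy tales.

```r
# Made-up example data: two stories, three bins each
stories <- data.frame(
  title = rep(c("Story A", "Story B"), each = 3),
  narrative.time = rep(1:3, times = 2),
  sentiment = c(0.8, 0.2, 0.6, 0.4, -0.2, 1.0)
)

# average sentiment across stories for each Narrative Time bin
avg <- aggregate(sentiment ~ narrative.time, data = stories, FUN = mean)
```

The result has one row per bin, which is the single averaged line that gets drawn over the individual stories.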

Fig. 2

From here, it looks like the average emotional value for each bin creates almost a flat line, but it’s important to remember scale. With Figure 3, which removes the individual stories and keeps only the average, a familiar shape emerges – a narrative arc. It might not look entirely right, but if you stand on your head, you’ll see it.

Fig. 3

The classic narrative structure is broken into three acts: Setup, Conflict and Resolution. In Act I, we get a sense of setting and our characters. Our antagonist is also introduced, leading to Act II: conflict. In the plot above you see the sentiment dip closer to 0, meaning the positive feeling decreases. But with Act III, we reach the story’s resolution and a return to a higher positive value.

It’s not perfect. You could shift the lines marking the acts around. In a story as short as a fairy tale, the first act can be one or two sentences, such as “Once upon a time, in a far-off kingdom, there was a peasant who fell in love with a princess.”

What Can We Learn From This?

People have different reactions to different types of stories. In fairy tales, we expect a happy ending. In horror stories, not so much. The story you model your messaging on depends on the kind of response you want from the readers. In a message where you are trying to depict overcoming a struggle (usually fundraising), you would want to analyze stories about struggles. Is there an emotional arc?

Another thing to note is that stories can be all over the place. Look back at Fig. 1 and you can see the large variation in emotional valence. That’s because I picked a bunch of random stories (and in case you didn’t know, some of Grimm’s Fairy Tales are messed up). Look at stories you consider to be successful.

You don’t even need to know how to write an R script to do this.