Was there Gender Bias in Ghostbuster Reviews?

Slate had an interesting article on whether the gender of a critic was related to his/her review of the new Ghostbusters film, saying the current data “seem to suggest that, aside from a few outliers, female critics have been more inclined to be generous toward the new Ghostbusters than male critics.” What drove me crazy, however, was that A) the article consisted mostly of quotes from a handful of critics to prove its point, and B) it included no methodology or data, just a summary stat. I decided to do my own digging.

Methodology

On the last day of the film’s opening weekend, I scraped all 12 pages of critics’ reviews from Rotten Tomatoes using Python’s wonderful BeautifulSoup package. I collected the critic’s name, his/her “fresh” or “rotten” rating, and the critic’s publication.

In [1]:
from bs4 import BeautifulSoup
import requests
In [2]:
all_reviews = []
all_reviewers = []
In [3]:
for i in range(14):      # iterate over each page of reviews (the page query parameter)
    url = "https://www.rottentomatoes.com/m/ghostbusters_2016/reviews/?page={}&sort=".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content,"html.parser")
    page_reviews = soup.find_all("div", class_="review_icon")      # fresh/rotten icon for each review
    page_reviewers = soup.find_all("div", class_="critic_name")    # critic name and publication
    all_reviews = all_reviews + page_reviews
    all_reviewers = all_reviewers + page_reviewers

Converting ratings to ‘fresh’ or ‘rotten’

In [4]:
ratings = [str(x) for x in all_reviews]             # raw HTML for each review icon div
ratings = [x[35:41] for x in ratings]               # slice out the 'fresh'/'rotten' class name
ratings = [x.replace('"','') for x in ratings]      # drop the stray quote picked up on 'fresh'

Cleaning up critics’ names and publications

In [5]:
reviewer_names = [y.a.contents for y in all_reviewers]    # critic name lives inside an <a> tag
In [6]:
reviewers = [item for sublist in reviewer_names for item in sublist]
In [7]:
pub_names = [z.em.contents for z in all_reviewers]        # publication lives inside an <em> tag
publications = [item for sublist in pub_names for item in sublist]

Creating a tidy dataframe

In [8]:
import pandas as pd
In [9]:
data = { 'reviewer': reviewers,
        'publication': publications,
        'rating': ratings
       }
df = pd.DataFrame(data)    # tidy dataframe: one row per review

Then, using Python’s Genderize package, I ran through the critics’ first names and identified each critic’s gender.

Here’s the thing with Genderize: It utilizes the widely used Genderize.io web service, a database that contains 216,286 distinct names across 79 countries and 89 languages. It assigns a gender to a name, along with a probability that the name actually belongs to that gender. For example, the probability of Matthew being “male” is 1. Cameron, however, is only 0.94. I decided to only keep names with at least a 95% probability. That means there is still up to a 5% chance that a given name is assigned the wrong gender. But with 230 names, this was more efficient than trying to identify each critic one by one. (Note: I also played around with using 98% and 100% probability, and reached the same results with a much smaller sample.)[1]
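For illustration, here is roughly what a single Genderize lookup returns (a minimal sketch; the exact probability and count values depend on the live Genderize.io database at the time of the query):

from genderize import Genderize

result = Genderize().get(['Cameron'])    # returns a list with one result dict per name
print(result[0])
# e.g. {'name': 'Cameron', 'gender': 'male', 'probability': 0.94, ...}

The 'probability' field in that dict is what the 95% cutoff below is applied to.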

Determining gender

In [10]:
first_names = [a.split() for a in df.reviewer]    # split each critic's full name on whitespace
first_names = [a[0] for a in first_names]         # keep only the first name
In [11]:
from genderize import Genderize
import time
In [12]:
genders = []
for x in first_names:
    gender = Genderize().get([x])    # one-item list of result dicts
    print(gender)
    time.sleep(5)                    # pause between requests to respect the genderize.io rate limit
    genders = genders + gender
In [13]:
df['gender'] = [g.get('gender') for g in genders]                 # None if the name wasn't matched
In [14]:
df['probability'] = [g.get('probability', 0) for g in genders]    # 0 if the name wasn't matched

Remove probability under 95%

In [15]:
df = df.loc[df.probability >= 0.95]

Results

The table and figure below break down the reviews by rating and gender.

          Female   Male   All
Fresh         38    107   145
Rotten         9     47    56
All           47    154   201

[Figure 1: Ghostbusters reviews by rating and gender]

As the table and figure show, very few female critics in my sample gave the film a rotten rating. The film gets a fresh rating of roughly 81% from female critics and 69% from male critics. That certainly sounds damning, right? That’s where Slate ended its story. But with a two-way table like the one above, you can actually test whether the difference is statistically significant, using the Chi-Square Test.
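Those percentages fall straight out of the dataframe with a normalized crosstab (a quick sketch, assuming the df assembled above):

# Share of fresh vs. rotten ratings within each gender (columns sum to 1)
print(pd.crosstab(df.rating, df.gender, normalize='columns'))
# fresh row: ~0.81 for female critics, ~0.69 for male critics, per the table above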

I’ve previously gone in-depth on how the Chi-Square Test works here (under Hypothesis Testing), so I’m not going to get into the details. Basically, we want to show that there is a gender bias in the results above (our \(H_1\)) at the 95% confidence level. But in order to do that, we have to reject the null hypothesis (or \(H_0\)): There is no difference.

\(H_0\): There is no difference between the proportions.

\(H_1\): There is a difference.

We compute the test statistic using the following formula:

\[\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}\]
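As a sanity check, here is that formula worked out by hand on the counts from the table above, with Yates’ continuity correction applied since that is what scipy uses on a 2x2 table (a minimal sketch using numpy):

import numpy as np

observed = np.array([[38, 107],    # fresh:  female, male
                     [ 9,  47]])   # rotten: female, male

# Expected counts come from the row and column margins
expected = observed.sum(axis=1, keepdims=True) * observed.sum(axis=0) / observed.sum()

# Yates' correction subtracts 0.5 from each |observed - expected| before squaring
chi_square = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()
print(chi_square)    # ~1.785, matching the scipy result below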

Or I can cheat and let Python’s scipy package do it for me (with Yates correction).

In [16]:
import scipy.stats as stat
ratings_gender = pd.crosstab(df.rating, df.gender)    # the 2x2 table of ratings by gender shown above
ratings_gender_chi = stat.chi2_contingency(ratings_gender)
print("Chi value: " + str(ratings_gender_chi[0]) + "\tp-value: " + str(ratings_gender_chi[1]))
Chi value: 1.78524223294	p-value: 0.181506925773

Basically, with a p-value greater than 0.05, we can’t reject the null hypothesis: there is no significant difference between the observed and expected frequencies. But before the haters claim a moral victory of some sort …

Conclusion

Putting this together revealed a couple of things to me — well, not so much revealed as confirmed. First, the film critic profession is seriously lacking in women. Even Slate pointed this out. That’s just disappointing. Second, Rotten Tomatoes’ method of aggregation is questionable, and the site may not be the best source for reviews. (FiveThirtyEight has covered this.)

“Fresh” and “Rotten” are binary and leave no room for a middle ground. You either love something or you hate it. But who can say that about the majority of films they’ve seen? It’s entirely possible to love one part of a movie and dislike another.[2]

This exercise proves something and nothing at the same time. The gender divide on reviews (as aggregated by Rotten Tomatoes) for Ghostbusters is not statistically significant. But there are problems with the data.

As always, everything above is available in a Jupyter notebook on GitHub.


[1] If you’ve read through the entire article, thanks! If not, the following won’t make much sense. Basically, the Chi-Square Test results for the data with the probability of the gender/name match set at 98% and 100% are p-values of 0.14 and 0.41, respectively. So if you’re being incredibly cautious when using genderize, the results are actually worse.
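Those re-runs amount to repeating the same pipeline with a stricter cutoff; a rough sketch, reusing df and stat from above:

for cutoff in (0.98, 1.0):
    subset = df.loc[df.probability >= cutoff]
    table = pd.crosstab(subset.rating, subset.gender)
    print(cutoff, stat.chi2_contingency(table)[1])    # p-value at each cutoff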

[2] To this day, I will argue with anyone that there is half a good movie in Indiana Jones and the Kingdom of the Crystal Skull or Prometheus. I proudly own both.