This EDA takes a look at data from IMDb to analyze trends in movie and TV genres and subgenres over time. Specifically, I’m using the “title.basics” dataset from IMDb, which was obtained in October of 2020. I have also imported a dataset of the highest grossing films globally, compiled by Aaron O’Neill on Statista. (https://www.statista.com/statistics/1072778/highest-grossing-movie-annually-historical/)
Some core questions I’d like to answer in this EDA:
What have the most popular (“popular”, in the context of this EDA, meaning plentiful) film genres and subgenres been in the last 60 years? How have the most popular genres changed over the decades?
Do the highest grossing films globally give us an idea of genre trends? In other words, did films like “Titanic” lead to a resurgence in romance films?
Similarly, what have the most popular TV genres and subgenres been in the last 60 years? How have these changed over the decades?
To create a nice time window for the data, I’ve limited my analysis to only films and television from 1959 to 2019.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#importing the raw IMDb data (filtered)
imdb_data <- read_rds("data/processed/imdb_data.rds")
#importing the raw highest-grossing film data
highest_grossing <- read_rds("data/processed/highest_grossing.rds")
#importing the filtered IMDb movie data
imdb_movies <- read_rds("data/processed/imdb_movies.rds")
#adjusting the highest-grossing film data
highest_grossing <- highest_grossing %>%
separate(movie, into = c("year", "movie"), sep = " - ")
#now that the year and movie are separate, removing films that aren't within our time range, along with 1968's Funny Girl (this ranking is only based on U.S. data, which isn't useful for our global study)
highest_grossing <- highest_grossing %>%
filter(year != "****2020") %>%
filter(year != "1968") %>%
filter(year >= 1959) %>%
filter(year <= 2019)
#turning the year value into a double (this'll be useful for graphing the top films later)
highest_grossing <- highest_grossing %>%
mutate(year = as.double(year))
#adjusting the IMDb data
#filtering to our time period
imdb_final_data <- imdb_data %>%
filter(startYear >= 1959) %>%
filter(startYear <= 2019)
#splitting genre and subgenre
imdb_final_data <- imdb_final_data %>%
separate(genres, into = c("genre", "subgenre"), sep = ",")
#creating counts of film by genre and subgenre, by year
imdb_movies <- imdb_movies %>%
group_by(startYear) %>%
mutate(genre_year_count = sum(!is.na(genre))) %>%
mutate(subgenre_year_count = sum(!is.na(subgenre)))
#creating a top_film joined dataset - combining the highest-grossing data with the film details, like genre and subgenre
top_film <- inner_join(highest_grossing, imdb_movies,
c("movie" = "primaryTitle", "year" = "startYear")) %>%
select(-movie, year)
#creating rankings of genre and subgenre based on top_film and the imdb_movies data
genres_over_time <-count(imdb_movies, genre, sort = TRUE) %>%
group_by(startYear)
#fct_lump_lowfreq(genre)
subgenres_over_time <- count(imdb_movies, subgenre, sort = TRUE) %>%
filter(!is.na(subgenre)) %>%
group_by(startYear)
top_films_over_time <- full_join(top_film, genres_over_time,
c("year" = "startYear", "genre" = "genre")) %>%
filter(!is.na(originalTitle))
top_subs_over_time <- full_join(top_film, subgenres_over_time,
c("year" = "startYear", "subgenre" = "subgenre")) %>%
filter(!is.na(originalTitle))
#Making a proportional decade-based genre heatmap
heatmap <- imdb_movies %>% mutate(startYear = as.character(startYear)) %>%
mutate(startYear = fct_collapse(startYear,
"2010s" = c("2019", "2018", "2017", "2016", "2015", "2014", "2013", "2012", "2011", "2010"),
"2000s" = c("2009", "2008", "2007", "2006", "2005", "2004", "2003", "2002", "2001", "2000"),
"1990s" = c("1999", "1998", "1997", "1996", "1995", "1994", "1993", "1992", "1991", "1990"),
"1980s" = c("1989", "1988", "1987", "1986", "1985", "1984", "1983", "1982", "1981", "1980"),
"1970s" = c("1979", "1978", "1977", "1976", "1975", "1974", "1973", "1972", "1971", "1970"),
"1960s" = c("1969", "1968", "1967", "1966", "1965", "1964", "1963", "1962", "1961", "1960")
)) %>%
select(genre, startYear) %>%
filter(startYear != "1959")
heatmap <-count(heatmap, genre, sort = TRUE) %>%
group_by(startYear) %>%
mutate(prop = n / sum(n))
#bar graph of genres from 1959-2019
imdb_movies %>%
filter(rank(genre_count) > 1000) %>%
mutate(genre = fct_reorder(genre, genre_count)) %>%
ggplot() +
geom_bar(aes(x = genre)) +
labs(x = "Genres", y = "Number of films", title = "Most plentiful genres of film ") +
theme(axis.text.x = element_text(angle = 90))
#bar graph of subgenres from 1959-2019
imdb_movies %>%
filter(rank(subgenre_count) > 1000) %>%
filter(!is.na(subgenre)) %>%
mutate(subgenre = fct_reorder(subgenre, subgenre_count)) %>%
ggplot() +
geom_bar(aes(x = subgenre)) +
labs(x = "Subgenres", y = "Number of films", title = "Most plentiful SUBgenres of film") +
theme(axis.text.x = element_text(angle = 90))
#Let's switch from aggregate graphs to graphing the popularity per year - again by genre and subgenre.
ggplot() +
geom_line(data = genres_over_time, mapping = aes(x = startYear, y = n, color = genre)) +
geom_point(data = top_films_over_time, mapping = aes(x = year, y= n, color = genre)) +
labs(x = "Year", y = "Number of films globally", title = "Trends in film genres globally from 1959-2019")
#the missing value in this graph is Rain Man, which IMDb doesn't label as having a sub-genre.
ggplot(subgenres_over_time) +
geom_line(aes(x = startYear, y = n, color = subgenre)) +
geom_point(data = top_subs_over_time, mapping = aes(x = year, y= n, color = subgenre)) +
labs(x = "Year", y = "Number of films globally", title = "Trends in film subgenres globally from 1959-2019")
#Genre heatmap, split by decade.
ggplot() +
geom_tile(data = heatmap, mapping = aes(startYear, genre, fill = prop)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Decade", y = "Film genre", fill = "Proportion", title = "What genres are the most popular in every decade?")
A couple of things I notice in the genre data: - Documantaries overtaking other types of films starting in the mid-2000s. There could be a ton of reasons for why - the rise of the internet, more commercially available film equipment, the success of films like Bowling for Columbine…
though there’s a lack of drama films being the highest-grossing once the 2000s hit, films of that genre are now the second most plentiful.
by contrast, though action films have been consistently the highest-grossing, since at least the 80s, these films are not very plentiful. This may be because blockbuster films (which are most often action films, and are usually the highest grossing) are usually high in production costs - making it difficult to produce a high quantity of them.
a dip in comedies, dramas and documentaries around 2012-2014, which I have no explanation for.
a bump in biographies around that same time - which I also can’t really explain.
A couple of things I note in the subgenre data: - The extreme rise and fall of documentaries between the late 2000s and mid 2010s. I have no idea why that is.
What I notice in the heatmap: - The heatmap reaffirms my observations in the genre data. Dramas have been consistently popular over time (being a higher proportion) but are slowly becoming a smaller proportion overall. Documentaries are becoming a larger proportion of films - in the 2000s and 2010s, you can see the shade of blue becoming lighter. Action and adventure movies, by contrast, are becoming a smaller proportion of films overall - the shades of blue are becoming darker. Horror films and thrillers are becoming slightly more plentiful.
Now, let’s look at the data for TV.
#importing the filtered and tidied TV dataset
imdb_tv <- read_rds("data/processed/imdb.tv.rds")
#Using count to rank genres
tv_genres_over_time <- imdb_tv %>% group_by(startYear) %>% count(genre, sort = TRUE)
tv_subgenres_over_time <- imdb_tv %>% group_by(startYear) %>% count(subgenre, sort = TRUE) %>%
filter(!is.na(subgenre))
#creating a genre heatmap for TV
tv_heatmap <- imdb_tv %>% mutate(startYear = as.character(startYear)) %>%
mutate(startYear = fct_collapse(startYear,
"2010s" = c("2019", "2018", "2017", "2016", "2015", "2014", "2013", "2012", "2011", "2010"),
"2000s" = c("2009", "2008", "2007", "2006", "2005", "2004", "2003", "2002", "2001", "2000"),
"1990s" = c("1999", "1998", "1997", "1996", "1995", "1994", "1993", "1992", "1991", "1990"),
"1980s" = c("1989", "1988", "1987", "1986", "1985", "1984", "1983", "1982", "1981", "1980"),
"1970s" = c("1979", "1978", "1977", "1976", "1975", "1974", "1973", "1972", "1971", "1970"),
"1960s" = c("1969", "1968", "1967", "1966", "1965", "1964", "1963", "1962", "1961", "1960")
)) %>%
filter(startYear != "1959")
tv_heatmap <- tv_heatmap %>% group_by(startYear) %>%
count(genre, sort = TRUE) %>%
mutate(prop = n / sum(n))
#Graphing top TV genres
imdb_tv %>%
filter(rank(genre_count) > 1000) %>%
mutate(genre = fct_reorder(genre, genre_count)) %>%
ggplot() +
geom_bar(aes(x = genre)) +
labs(x = "Genres", y = "Number of TV episodes/movies", title = "Most plentiful genres of TV produced every year?") +
theme(axis.text.x = element_text(angle = 90))
imdb_tv %>%
filter(rank(subgenre_count) > 1000) %>%
filter(!is.na(subgenre)) %>%
mutate(genre = fct_reorder(subgenre, subgenre_count)) %>%
ggplot() +
geom_bar(aes(x = subgenre)) +
labs(x = "Subgenres", y = "Number of TV episodes/movies", title = "Most plentiful SUBgenres of TV produced every year?") +
theme(axis.text.x = element_text(angle = 90))
#Graphing genres over time
ggplot(tv_genres_over_time) +
geom_line(aes(x = startYear, y = n, color = genre)) +
labs(x = "Year", y = "Number of episodes/movies globally", title = "Trends in TV genres globally from 1959-2019") +
theme(axis.text.x = element_text(angle = 90))
ggplot(tv_subgenres_over_time) +
geom_line(aes(x = startYear, y = n, color = subgenre)) +
labs(x = "Year", y = "Number of episodes/movies globally", title = "Trends in TV subgenres globally from 1959-2019") +
theme(axis.text.x = element_text(angle = 90))
#Genre heatmap, split by decade.
ggplot() +
geom_tile(data = tv_heatmap, mapping = aes(startYear, genre, fill = prop)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Decade", y = "TV genre", fill = "Proportion", title = "Most popular genres in every decade?")
Things I notice in the genre data:
The giant gap between main-genre drama and comedy TV and literally everything else - wow!
Comedy TV shows have consistently been more popular than films, especially with their peak now. Also, I think it’s interesting that they began to overtake dramas in the 2010s - why this is I do not know.
The sharp decline and only recent uptick in romance TV - interesting.
The death of westerns after the 1960s - this can be seen in the heatmap.
The rise in reality TV in the 2000s (which makes sense, as shows like Survivor and the Bachelor got their start then), along with the rise of news shows in the 90s (again, makes sense as networks like CNN and Fox News started the 24 hour news cycle), and the rise of Talk shows in the 90s thru the present (Oprah and the like hitting their cultural peak during the decade)
The peak of dramas in the 1970s and 80s - makes sense, as shows like Dallas were at their cultural peak.
The most popular film genres are dramas, documentaries, and comedies. Action films have slowed in quantity while documentaries have grown, leading to the current top 3.
Not really. The highest grossing films in the past few decades have been action movies - presumably, blockbusters. However, action is a distant fourth in terms of quantity of films produced. Comedies and documentaries have no dots on the genre plot - but they’re in the top three genres! However, to answer the more specific question: more romance films HAVE continued to be made after “Titanic” (the purple dot in the subgenres plot).
Comedies and dramas have from the jump been so much more plentiful than other TV - and that’s still true. Starting in the mid-2010s, however, talk shows have jumped up and joined as a distant third. That’s just genres, tho - subgenres have had a lot more activity and change. Romance, drama and talk-shows have at various points been the top subgenre, with the top spot changing frequently over the decades. No other subgenres come close!
IMDb. (2020). title.basics.tsv.gz. https://datasets.imdbws.com/title.basics.tsv.gz
O’Neill, A. (2019, November 20). Highest grossing movie worldwide, annually 1915-2020. https://www.statista.com/statistics/1072778/highest-grossing-movie-annually-historical/