Lines of Dialogue in The Office

Lines of Dialogue in The Office

March 19, 2020
Medium: R and ggplot2
Large: JPEG

Do episodes that Mindy Kaling and B.J. Novak feature more lines of dialogue from their characters, Kelly Kapoor and Ryan Howard, than episodes by other writers?

We analyze a dataset of lines of dialogue in The Office from the schrute R package to find out.

library(tidyverse)
library(schrute)
library(ggtext)

data(theoffice)

theoffice
# A tibble: 55,130 x 9
   index season episode episode_name director  writer     character text             text_w_direction          
   <int> <chr>  <chr>   <chr>        <chr>     <chr>      <chr>     <chr>            <chr>                     
 1     1 01     01      Pilot        Ken Kwap… Ricky Ger… Michael   All right Jim. … All right Jim. Your quart…
 2     2 01     01      Pilot        Ken Kwap… Ricky Ger… Jim       Oh, I told you.… Oh, I told you. I couldn'…
 3     3 01     01      Pilot        Ken Kwap… Ricky Ger… Michael   So you've come … So you've come to the mas…
 4     4 01     01      Pilot        Ken Kwap… Ricky Ger… Jim       Actually, you c… Actually, you called me i…
 5     5 01     01      Pilot        Ken Kwap… Ricky Ger… Michael   All right. Well… All right. Well, let me s…
 6     6 01     01      Pilot        Ken Kwap… Ricky Ger… Michael   Yes, I'd like t… [on the phone] Yes, I'd l…
 7     7 01     01      Pilot        Ken Kwap… Ricky Ger… Michael   I've, uh, I've … I've, uh, I've been at Du…
 8     8 01     01      Pilot        Ken Kwap… Ricky Ger… Pam       Well. I don't k… Well. I don't know.       
 9     9 01     01      Pilot        Ken Kwap… Ricky Ger… Michael   If you think sh… If you think she's cute n…
10    10 01     01      Pilot        Ken Kwap… Ricky Ger… Pam       What?            What?                     
# … with 55,120 more rows

Who have written the most episodes of The Office?

episodes_per_writer <- theoffice %>%
  select(season, episode, writer) %>% 
  distinct() %>%
  count(writer, sort = TRUE)

episodes_per_writer
# A tibble: 47 x 2
   writer                            n
   <chr>                         <int>
 1 Mindy Kaling                     20
 2 B.J. Novak                       15
 3 Paul Lieberstein                 13
 4 Brent Forrester                   9
 5 Greg Daniels                      9
 6 Justin Spitzer                    9
 7 Jennifer Celotta                  8
 8 Charlie Grandy                    7
 9 Gene Stupnitsky;Lee Eisenberg     7
10 Michael Schur                     7
# … with 37 more rows

Mindy Kaling has written 20 episodes, the most of any other writer, followed by B.J. Novak at 15 episodes. Both Kaling and Novak play characters on the show, Kelly Kapoor and Ryan Howard respectively.

What are all the episodes?

episodes <- theoffice %>%
  select(season, episode, episode_name, writer) %>%
  mutate(
    writer = case_when(
      str_detect(writer, "Mindy Kaling") ~ "Mindy Kaling",
      str_detect(writer, "B.J. Novak") ~ "B.J. Novak",
      TRUE ~ "Other writers"
    )
  ) %>%
  distinct()

episodes
# A tibble: 186 x 4
   season episode episode_name      writer       
   <chr>  <chr>   <chr>             <chr>        
 1 01     01      Pilot             Other writers
 2 01     02      Diversity Day     B.J. Novak   
 3 01     03      Health Care       Other writers
 4 01     04      The Alliance      Other writers
 5 01     05      Basketball        Other writers
 6 01     06      Hot Girl          Mindy Kaling 
 7 02     01      The Dundies       Mindy Kaling 
 8 02     02      Sexual Harassment B.J. Novak   
 9 02     03      Office Olympics   Other writers
10 02     04      The Fire          B.J. Novak   
# … with 176 more rows

And let’s count lines of dialogue per Ryan and Kelly in each episode.

dialogue <- theoffice %>%
  select(season, episode, character) %>%
  mutate(character = case_when(
    str_detect(character, "Kelly") ~ "Kelly",
    str_detect(character, "Ryan") ~ "Ryan",
    TRUE ~ "Other characters"
  )) %>%
  group_by(season, episode) %>%
  count(character) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  filter(character != "Other characters")

dialogue
# A tibble: 282 x 5
   season episode character     n    prop
   <chr>  <chr>   <chr>     <int>   <dbl>
 1 01     01      Ryan          8 0.0349 
 2 01     02      Kelly         2 0.00985
 3 01     02      Ryan          4 0.0197 
 4 01     03      Ryan          1 0.00410
 5 01     04      Ryan          4 0.0165 
 6 01     05      Ryan          8 0.0348 
 7 01     06      Ryan         12 0.0347 
 8 02     01      Kelly         7 0.0273 
 9 02     01      Ryan          2 0.00781
10 02     02      Ryan          1 0.00353
# … with 272 more rows

Now let’s create a full dataframe we’ll use for plotting.

kelly_ryan <- writers %>%
  select(season, episode) %>%
  distinct() %>%
  mutate(Kelly = NA, Ryan = NA) %>%
  pivot_longer(c("Kelly", "Ryan"), names_to = "character", values_to = "dummy") %>%
  select(-dummy) %>%
  left_join(episodes, by = c("season", "episode")) %>%
  left_join(dialogue, by = c("season", "episode", "character")) %>%
  replace_na(list("n" = 0, "prop" = 0)) %>%
  mutate(
    season = paste("Season", as.numeric(season)),
    episode = as.numeric(episode),
    episode_name = case_when(
      character == "Ryan" ~ "",
      episode_name == "Lecture Circuit (Part 1)" ~ "Lecture Circuit (Parts 1&2)",
      episode_name == "Lecture Circuit (Part 2)" ~ "",
      episode_name == "Sexual Harassment" ~ "Sexual\nHarassment",
      TRUE ~ episode_name
    )
  )

kelly_ryan
# A tibble: 372 x 7
   season   episode character episode_name    writer            n    prop
   <chr>      <dbl> <chr>     <chr>           <chr>         <dbl>   <dbl>
 1 Season 1       1 Kelly     "Pilot"         Other writers     0 0      
 2 Season 1       1 Ryan      ""              Other writers     8 0.0349 
 3 Season 1       2 Kelly     "Diversity Day" B.J. Novak        2 0.00985
 4 Season 1       2 Ryan      ""              B.J. Novak        4 0.0197 
 5 Season 1       3 Kelly     "Health Care"   Other writers     0 0      
 6 Season 1       3 Ryan      ""              Other writers     1 0.00410
 7 Season 1       4 Kelly     "The Alliance"  Other writers     0 0      
 8 Season 1       4 Ryan      ""              Other writers     4 0.0165 
 9 Season 1       5 Kelly     "Basketball"    Other writers     0 0      
10 Season 1       5 Ryan      ""              Other writers     8 0.0348 
# … with 362 more rows

Finally, some notes for the plot, and the plot itself.

notes <- tribble(
  ~season, ~episode, ~text, ~character, ~prop,
  "Season 1", 6, "— 20 lines of dialogue for every 100 lines in the episode", "Ryan", 0.2,
  "Season 1", 6, "— 10 lines of dialogue for every 100 lines in the episode", "Ryan", 0.1,
  "Season 1", 6, "— No lines of dialogue for every 100 lines in the episode", "Ryan", 0.0,
)

mk_color <- "#9D02D7"
bn_color <- "#FA8775"

ggplot() +
  geom_rect(data = kelly_ryan, aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = 0.205, color = writer), alpha = 0, size = 0.3) + 
  geom_col(data = kelly_ryan, aes(x = character, y = prop, fill = character), width = 0.7) +
  geom_text(data = filter(df, writer != "Other writers"), aes(x = 1.5, y = 0.235, label = episode_name), family = "Fira Sans Extra Condensed Light", size = 2.5, lineheight = 0.7) +
  geom_text(data = notes, aes(x = character, y = prop, label = text), family = "Fira Sans Extra Condensed Light", size = 3, lineheight = 0.9, hjust = -0.05) +
  facet_grid(season ~ episode, switch = "both") +
  scale_color_manual(values = c(bn_color, mk_color, "#F0F0F0")) +
  scale_fill_manual(values = c(mk_color, bn_color)) +
  guides(color = FALSE, fill = FALSE) +
  labs(
    title = glue::glue("<i>The Office</i> writers <span style='color:{mk_color}'>Mindy Kaling</span> and <span style='color:{bn_color}'>B.J. Novak</span> also play two characters on the show, <span style='color:{mk_color}'>Kelly Kapoor</span> and <span style='color:{bn_color}'>Ryan Howard</span>.<br>Do episodes they have written feature more lines of dialogue from their characters than episodes by other writers? "),
    subtitle = "No, but there are a few notable exceptions. B.J. Novak wrote <i>The Fire</i> (Season 2, Episode 4) where Ryan accidentally starts a fire in the office toaster oven. He also wrote <i>Initiation</i><br>(Season 3, Episode 5) where Ryan is taken to Dwight's beet farm for a sales department hazing. In the following episode <i>Diwali</i> (Season 3, Episode 6), written by Mindy Kaling,<br>Kelly invites the office to a Diwali celebration, which Michael mistakes for a Halloween party. Kaling also wrote <i>Night Out</i> (Season 4, Episode 15) where Michael and Dwight travel<br>to New York City to party with Ryan and it is revealed that Ryan is on drugs, and <i>Lecture Circuit</i> (Season 5, Episodes 16 & 17) where the office forgets to celebrate Kelly's birthday.",
    caption = "Data from the schrute R package (github.com/bradlindblad/schrute)\nCode to recreate this graphic at nsgrantham.com/the-office",
    x = NULL, y = NULL
  ) +
  theme_minimal(base_family = "Fira Sans Extra Condensed Light", base_size = 14) +
  theme(
    plot.title = element_markdown(family = "Fira Sans Extra Condensed", margin = margin(0, 0, 0.8, 0, unit = "line"), size = 22, lineheight = 1.2),
    plot.subtitle = element_markdown(size = 16, margin = margin(0, 0, 1.2, 0, unit = "line"), lineheight = 1.2),
    plot.background = element_rect(fill = "#F8F8FF", color = "#F8F8FF"),
    plot.title.position = "plot",
    plot.margin = margin(1, 1, 1, 1, unit = "line"),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.major.x = element_blank(),
    strip.text.y.left = element_text(family = "Fira Sans Extra Condensed", angle = 0, margin = margin(0, 1, 0, 0, unit = "line")),
    strip.placement = "bottom",
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.x = element_blank()
  ) +
  coord_cartesian(clip = "off")

ggsave("the-office.png", width = 16, height = 12)

For the most part, Mindy Kaling and B.J. Novak do not write more lines of dialogue for their characters than other writers.