Swimming + Data Science

Fast, Deep, Fast and Deep

Swimming + Data Science has been quiet for a while as I worked on other projects. Today though we’re back at it in an attempt to quantify the speed and depth of swim meets. It’s more than that though, because this is vintage me, back on my hobby horses. We will be using SwimmeR. We will be focusing, perhaps myopically, on swim meets from my local area. We will also be using a discussion I had with some friends as a jumping off point. It’s a perfect Swimming + Data Science trifecta!

So here’s the story. I was officiating at the Section IV Class C sectional finals meet, held recently. During an awards break I got into a conversation with a couple of people about swim meets past, as I am wont to do. These people, and I’m paraphrasing, remarked that:

  • The meet was generally both slower and less deep than previous years, particularly 2006
  • That a specific team from 2006 that these people were associated with would have run rampant all over the current iteration of the meet

Now I enjoy a good “things were better in my day” bit as much as anyone - perhaps even more - but I also enjoy testing assertions. So that’s what we’re going to do. These tests will proceed in two parts, one for each assertion.

To start let’s load some packages. There’s a new version of SwimmeR, v0.13.0, released November 5th 2021. It’s great and has lots of new features, none of which are particularly relevant to the contents of this article. Still though, might as well update.

Setup

install.packages("SwimmeR")

We’ll use a few tidyverse packages plus flextable.

library(SwimmeR)
library(dplyr)
library(stringr)
library(tidyr)
library(flextable)

flextable_style <- function(x) {
  x %>%
    flextable() %>%
    bold(part = "header") %>% # bolds header
    bg(bg = "#D3D3D3", part = "header")  # puts gray background behind the header row
}

Collecting the meet results

Collecting meet results with SwimmeR is easy. Simply find a link to the results and pipe them into read_results followed by swim_parse. Here we’ll do that with both the 2006 and 2021 Section VI Class C meet results.

C_2006 <-
  "http://www.section4swim.com/Results/GirlsHS/2006/Sec4/C/Single.htm" %>%
  read_results() %>%
  swim_parse() %>%
  mutate(Year = 2006) %>%
  mutate(Event = str_remove(Event, "Girls "))

C_2006 <- C_2006 %>%
  mutate(Event = factor(Event, levels = unique(C_2006$Event))) # to maintain event order later on

C_2021 <-
  "http://www.section4swim.com/Results/GirlsHS/2021/Sec4/C/Single.htm" %>%
  read_results() %>%
  swim_parse() %>%
  mutate(Year = 2021) %>%
  mutate(Event = str_remove(Event, "Girls "))

C_2021 <- C_2021 %>%
  mutate(Event = factor(Event, levels = unique(C_2021$Event))) # to maintain event order later on

Assertion #1 - Faster and Deeper

Fast?

First off let me apologize to our diving friends. “Fast” is a term commonly used by swimming people to describe an entire meet, including diving. Diving obviously isn’t timed though, so the better term might be “strength” - a strong meet, a weak meet etc. Still, I’m going to mostly stick with references to speed just because that’s the more common term.

“Fast” is almost always a relative measure. Is a 4:56.00 in the 500 free fast? Depends on the context. That time would have won the 500 free in a recent men’s dual meet at my alma mater - 4:56.00 would be a fast time in the context of that meet. On the other hand in this women’s dual meet at Stanford a 4:56.00 would have been good for 6th - i.e. last place. Katie Ledecky won that 500 going 4:33.94. What we’d really like is a quantitative measure of the speed for each event, which taken together can describe the strength of a meet.

I propose that what’s really meant by a statement like “the meet is fast this year” is that in some kind of weighted fashion the times are faster than years past for the same finish places. Sounds obvious, I know. What’s the weighing though? We need it in order to be quantitative. Luckily it’s already been determined. It’s point value by place. First place in these meets is worth 20 points, 16th place is worth 1 point and it seems reasonable to me that the first place time is 20x more important in describing the speed of a meet than the 16th place time. The 16th place time does still matter a bit though.

Let’s try and compare the “fastness” between the 2006 and 2021 meets. Our goal is to use the weighed.mean function on times and place-points to get a time (or score for diving), describing the overall strength of each event. We’ll stick C_2006 and C_2021 together and then summarise using weighed.mean and then pivot_wider to format a nice table.

C_Combined <- C_2021 %>%
  bind_rows(C_2006) %>%
  filter(Exhibition < 1,
         DQ < 1) %>%
  group_by(Event) %>%
  select(Place, Name, Team, Event, Finals_Time, Points, Year) %>%
  filter(Place <= 16) %>%
  filter(is.na(Points) == FALSE)

C_Combined %>%
  group_by(Event, Year) %>%
  summarise(W_Avg = mmss_format(weighted.mean(sec_format(Finals_Time), Points, na.rm = TRUE))) %>%
  mutate(W_Avg = case_when(
    str_detect(Event, "Diving") ~ as.character(sec_format(W_Avg)),
    # mmss format doesn't make sense for diving results
    TRUE ~ W_Avg
  )) %>%
  pivot_wider(names_from = "Year", values_from = "W_Avg") %>%
  flextable_style()

The 2006 meet was faster than the 2021 meet in every event except diving, and margin was more than 1 second per 100, sometimes as much as 4 s/100 - which is a lot. Diving is interesting too. Let’s look closer.

C_Combined %>% 
  ungroup() %>% 
  filter(str_detect(Event, "Diving")) %>% 
  select(Place, "Score" = Finals_Time, Year) %>% 
  pivot_wider(names_from = "Year", values_from = "Score") %>% 
  select(Place, `2006`, `2021`) %>% 
  flextable_style()

In 2006 there were 9 divers and a winning score of 284.80. In 2021 there were 3 divers and a winning score of 267.10. Yes the weighed average score was higher for 2021 but I ask you - do you think 2021 was a better year for the sport of diving in Section IV Class C than 2006? I sure don’t. As I discuss frequently in this space, models and quantitative analysis are great, but it does help to actually know things too.


Deep?

A quantitative description of meet depth is simpler - it’s the difference in time (or diving score) between 1st place and some other place of interest - usually the last scoring place. If the difference is small the meet is deep. Here we’ll compute the difference, in seconds, between the min and max scoring marks.

C_Combined %>% 
  group_by(Event, Year) %>% 
  mutate(Finals_Time = sec_format(Finals_Time)) %>% 
  summarise(Range = max(Finals_Time, na.rm = TRUE) - min(Finals_Time, na.rm = TRUE)) %>% 
    mutate(Range = case_when(str_detect(Event, "Diving", negate = TRUE) ~ mmss_format(Range),
                           TRUE ~ as.character(Range))) %>% 
  pivot_wider(names_from = "Year", values_from = "Range") %>% 
  flextable_style()

The range is much smaller in 2006 for all swimming events, meaning the meet was much deeper.

Faster and Deeper

Those people i was chatting with were right on both counts. The 2006 meet was indeed faster and deeper, by a wide margin, than the 2021 meet.

Assertion #2 - Lion Redux

The team in question here is the 2006 Dryden Lady Lions, headed by eventual DIII All-American Sheila Rhoades.

To start we’ll need to remove all of the Dryden results from C_2021 - it wouldn’t be fair for Dryden to have two teams. Then we’ll collect only the Dryden results from 2006 and join the two sets of results together. The 2006 Dryden team will then be “swimming” in the 2021 meet.

C_2021_No_Dryden <- C_2021 %>%
  filter(Team != "Dryden") # remove dryden

C_2006_Only_Dryden <- C_2006 %>%
  filter(Team == "Dryden") # keep only dryden

C_Comp <- C_2021_No_Dryden %>%
  bind_rows(C_2006_Only_Dryden) %>%
  mutate( # clean up names
    Team = str_replace(Team, "Susquehanna Vall$", "Susquehanna Valley"),
    Team = str_replace(Team, "Notre Dame-E$", "Notre Dame-Elmira"),
    Team = str_replace(Team, "Watkins Glen-OM", "Watkins Glen-Odessa Montour")
  )

It’s more complicated than that though. We need to reconstruct prelims to sort athletes into A Finals and B Finals before the finals themselves can be scored. Fortunately we have access to everyone’s Prelims_Time, although because of how SwimmeR works some prelims times are in the Finals_Time column. In SwimmeR the Finals_Time represents the final (i.e. last) swim for each swimmer, which may or may not actually be in the finals. It’s confusing I know and I apologize, but it’s important to the underlying functioning of swim_parse so that’s the way it will stay. Anyway, swimmers who have Finals_Times but no Prelims_Time in a prelims-finals meet actually have their prelims performance listed in Finals_Time. Let’s fix our results.

We’ll also need to handle diving separately, because in diving the highest value (score) wins, whereas in swimming the lowest value (time) wins.

C_Comp <- C_Comp %>%
  mutate(Prelims_NA = case_when(is.na(Prelims_Time) ~ 1,
                                TRUE ~ 0)) %>%
  mutate(Prelims_Time = case_when(is.na(Prelims_Time) ~ Finals_Time,
                                  TRUE ~ Prelims_Time)) %>%
  mutate(Finals_Time = case_when(Prelims_NA > 0 ~ "NA",
                                 TRUE ~ Finals_Time)) %>%
  na_if("NA") %>%
  select(-Prelims_NA)

C_Comp_Diving <- C_Comp %>%
  filter(str_detect(Event, "Diving") == TRUE) %>%
  mutate(Prelims_Time = as.numeric(Prelims_Time)) %>%
  mutate(
    Place = rank(desc(Prelims_Time), ties.method = "min"),
    # highest score gets rank 1
    Prelims_Time = as.character(Prelims_Time)
  )

C_Comp_Swimming <- C_Comp %>%
  filter(str_detect(Event, "Diving") == FALSE) %>%
  filter(DQ < 1) %>%
  mutate(Prelims_Time_sec = SwimmeR::sec_format(Prelims_Time)) %>% # time as seconds
  group_by(Event) %>%
  mutate(Place = rank(Prelims_Time_sec, ties.method = "min")) %>% # places, low number wins
  select(-Prelims_Time_sec) %>% 
  arrange(Place)

C_Comp_Prelims <- C_Comp_Diving %>%
  bind_rows(C_Comp_Swimming) %>%
  mutate(
    Finals_Time = case_when(
      is.na(Finals_Time) &
        is.na(Prelims_Time) == FALSE &
        DQ < 1 & Exhibition < 1 ~ Prelims_Time,
      TRUE ~ Finals_Time
    )
  )

Great, now we have our meet, with athletes sorted into A and B finals by place, with 8th place as the cutoff for the A finals. We can use results_score to rescore the meet on the basis of Finals_Time.

C_Finals_Score <- C_Comp_Prelims %>%
  results_score(
    events = unique(C_2021$Event),
    meet_type = "prelims_finals",
    lanes = 8,
    point_values = c(20, 17, 16, 15, 14, 13, 12, 11, 9, 7, 6, 5, 4, 3, 2, 1),
    scoring_heats = 2
  )

C_Finals_Score %>%
  group_by(Team) %>%
  summarise(Score = sum(Points, na.rm = TRUE)) %>%
  mutate(Place = round(rank(desc(Score), ties.method = "min"), 1)) %>%
  arrange(desc(Score)) %>%
  select(Place, everything()) %>%
  flextable_style()

My friends are vindicated again. Dryden 2006 would have indeed won the 2021 meet - although only by 33 points. The real question is how did they know?


In Conclusion

We’ve come up with a quantitative framework for determining the speed and depth of a swim meet, which I’m sure will come in handy somewhere. We’ve also made use of some SwimmeR functions and lots of tidyverse functions and hopefully learned a little bit on the way. Join us next time back here at Swimming + Data Science where we’ll be doing something else fun.