Olympics by Culture and Spending - The Olympic Advantage

Why do we care about the Olympics

For the purposes of this document I am going to encourage you to read the overall report that the team and I created on why we cared about the Olympics and what we sought out to figure out about the influences over the Olympics.

The main four components of the research were focused around:

  1. The country’s economy
  2. Investment and culture of sportiness (the individual report provided below with the research).
  3. Geographic characteristics
  4. Impact of being the host country

Country’s investment into sports culture

To provide the introduction to this, it’s important to know what I was most curious about when preparing and some of the topical questions that I sought to answer. So the areas I was focusing on:

  1. Sportiness of the population/culture, measured through:
    1. Elite Sports ranking
    2. Hobbies involving sports
      1. Watching Sport
      2. Health & Fitness
      3. Playing sport
    3. Sports Participation rate
  2. Financial Metrics of sportiness
    1. Fitness Applications
    2. Fitness Spend
    3. Gym memberships (per 100K)
    4. Overall country spending on sports related equipment

This prelude to the topic was what influences from culture and spending most impact the ability to win an Olympic medal for a country. Leading to what is the correlation and causation of the factors and number of Olympic medals won and what is the likeliness of it being chance.

To start the analysis, if you are curious to know what libraries were used in this investigation expand the code below to see the list, that were either utilized or tested whilst developing the code.

Relevant libraries
# Setting a CRAN mirror to avoid errors
options(repos = c(CRAN = "https://cloud.r-project.org/"))

# Install the required packages
# List of packages to check and install if missing
packages <- c(
  "priceR", "rvest", "dplyr", "googlesheets4", 
  "rnaturalearth", "rnaturalearthdata", "sf", 
  "xml2", "tidyverse"
)

# Loop through each package
for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE) # Load the package after installation
  }
}

library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(rvest)
library(priceR)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(rvest)
library(xml2)
library(tidyverse)
library(googlesheets4)
library(httr)
library(quantmod)
library(knitr)

Geospatial Analysis

So the first source I went to for my data source, was Myprotein, and I know what you are thinking an odd location for a research database, however the Research by Adele Halsall provided a fascinating insight into overall sportiness especially in trying to measure sports culture which is never easy. She was kind enough to store her data on a google sheet, and as such that company with the googlesheets4 library meant I could extract into my table.

Pulling in MyProtein Sporties Country data
sheet_id <- "1JMS4D9Nx-qxokAkStZ-hJ9vCobphCVqWjZHSGsodlbE"
#this is the sheet utilized in googlesheets after the /spreadsheets/d/sheetid

gs4_deauth()
# As it is publically accessible do not need a google account 

# Read the data with read_sheet pertaining to the googlesheets4 library
data <- read_sheet(sheet_id)

head(n = 10,data)
# A tibble: 10 × 12
   Country         `Olympic Medals` Winter Olympic Medal…¹ `Elite Sport Ranking`
   <chr>                      <dbl>                  <dbl>                 <dbl>
 1 Germany                     1346                    408                     1
 2 United States …             2522                    305                     7
 3 Sweden                       494                    158                     4
 4 Norway                       152                    368                     3
 5 Finland                      303                    167                    27
 6 Canada                       302                    199                     2
 7 Switzerland                  192                    153                     5
 8 Austria                       87                    232                    10
 9 Netherlands                  285                    130                    25
10 Australia                    507                     15                    13
# ℹ abbreviated name: ¹​`Winter Olympic Medals`
# ℹ 8 more variables: `Sports Participation Rate` <dbl>,
#   `Gym Memberships per 100k` <dbl>, `Hobbies - Health & Fitness` <dbl>,
#   `Hobbies - Playing Sport` <dbl>, `Hobbies - Watching Sport` <dbl>,
#   `Fitness Apps` <dbl>, `Fitness Spend` <dbl>,
#   `Total Score (out of 100)` <dbl>

Now whilst the above was useful for getting the data for analysis we needed to also enhance the information with Geospatial information for the purposes of mapping the information.

Pulling in the world map
world <- ne_countries(scale = "medium", returnclass = "sf") |>
#This ne_countries is from the rnaturalearth library in R and contains country based information
  select(name) |>
  mutate(name = case_when(
    name == "Czechia" ~ "Czech Republic",
    name == "U.S. Virgin Is." ~ "Virgin Islands",
    name == "Côte d'Ivoire" ~ "Ivory Coast",
    name == "Dominican Rep." ~ "Dominican Republic",
    TRUE ~ name # Keep other values unchanged
    ))
#Tidied up the naming conventions of countries to align with the information provided in the MyProtein Sports review

ggplot(world) +
  geom_sf(aes(geometry = geometry)) +
  labs(title = "Geometries Plot", x = "Longitude", y = "Latitude")
#this is the same visualization as below and uses a native R library

I have provided a visualization of the bare bones map provided in this analysis, the logic is hidden above in the code but want to share the insights generated.

Then after tidying everything up we combine the two pieces of data to present the information in a plotted format, instead of as a table, as a visual learner and observer it will help to make points clearer.

Data integration and mapping
merged_World_data <- inner_join(world,data, by = c("name" = "Country")) |>
      mutate(`Elite Sport Ranking` = 84-`Elite Sport Ranking`)
# the purpose of the mutate in this case is given that the lower the number the better the performance i.e. Country ranked 1st performs better we took the inverse of worst performing country for the Elite Sports ranking as this aligns with other metrics where higher is better.

ggplot(data = merged_World_data) +
  geom_sf(aes(fill = `Olympic Medals`)) + # Use the 'medals' column for fill color
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Medals") + # Color gradient for medals
  theme_minimal() +
  labs(
    title = "Olympic Medals by Country",
    subtitle = "Visualizing the Number of Medals Won",
    caption = "MyProtein"
  )
#This is a duplication of the visualization naturally occuring however wanted to keep both so the code could be seen all together.

Next Compare each of the factors to analyse culture by mapping to each other of the 9 factors being (Olympic Medals, Winter Olympic Medals, Elite Sport Ranking, Sports Participation Rate, Gym Memberships per 100k, Hobbies - Health & Fitness, Hobbies - Watching Sport, Fitness Apps, Fitness Spend). The code below shows how it was cleaned and tidied to analyse a country comparison.

Tidying up Factors and presenting
long_data <- merged_World_data |>
  pivot_longer(
    cols = c(
      `Olympic Medals`, `Winter Olympic Medals`, `Elite Sport Ranking`,
      `Sports Participation Rate`, `Gym Memberships per 100k`,
      `Hobbies - Health & Fitness`, `Hobbies - Watching Sport`,
      `Fitness Apps`, `Fitness Spend`
    ),
    names_to = "Metric",
    values_to = "Value"
  ) |>
  group_by(Metric) %>%
  mutate(Normalized_Value = (Value - min(Value, na.rm = TRUE)) / 
           (max(Value, na.rm = TRUE) - min(Value, na.rm = TRUE))) %>%
  ungroup()

# Plot using normalized values
ggplot(data = long_data) +
  geom_sf(aes(fill = Normalized_Value), color = "black") + # Fill with normalized value
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Relative Value") +
  facet_wrap(~ Metric, ncol = 3) + # Use shared scales for geometry
  theme_minimal() +
  labs(
    title = "Comparison of Fitness and Sports Metrics by Country",
    subtitle = "Normalized Values by Metric"
  ) +
  theme(legend.position = "bottom")

As you can see above the data shows a lot of correlations that can be visually observed before being analysed so some primary comments would be between the Gym Members per 100K and the Sports Participation rate.

As an aside and an Australian I will note that in all of these images we perform quite successfully except for the performance in Winter Olympic Medals, which is understandable especially living so close to the beach and water it is easier for summer sports.

Some notes for the top performer of the Number of Olympic medals of the U.S, it appears that the Elite Sports Ranking, Gym memberships per 100k and other factors quite evidently overlap, but I am interesting in hearing everyone thoughts on what other factors you would have included to account for the performance in the olympics.

The next section is more rigorous analytics around these factors and how they overlap by starting to quantify the regression with a causation and correlation focus of the supporting factors around the numbers of Olympic medals won. This is where the visualization of this data moves from graph to Table and the data enthusiasts start to really pay attention.

Regression

As mentioned below the first part is to ensure that the data quality is high and a successful regression analysis can be performed. This is largely analyzing the table above with some additional data cleansing and pivoting to allow it to flow into the data analytics format provided by the R libraries.

Regression code
Merged_unmapped_World_Data <- merged_World_data
#Creating a separate data table to remove the geometry or location for each country to reduce the data size

Merged_unmapped_World_Data$geometry <- NULL
#Setting the geometry to 0
y <- Merged_unmapped_World_Data$`Olympic Medals` 
# Dependent variable

# Independent variables
X <- Merged_unmapped_World_Data |>
  select(
    `Elite Sport Ranking`,
    `Sports Participation Rate`,
    `Gym Memberships per 100k`,
    `Hobbies - Health & Fitness`,
    `Hobbies - Playing Sport`,
    `Hobbies - Watching Sport`,
    `Fitness Apps`,
    `Fitness Spend`
  )

# COmbining all the data
X <- cbind(X)

For the statistical analysis the Principal Component Analysis (PCA) was utilized across the data.

PCA Analysis
# This is possible because ot the stats library and so easily sharable
pca <- prcomp(X, scale. = TRUE)

# Use the first few principal components for regression
pca_data <- as.data.frame(pca$x[,1:8]) # Use the first 8 components
merged_data_pca <- cbind(y, pca_data)

# Fit the regression model
model_pca <- lm(y ~ ., data = merged_data_pca)

# Summarize the model code is suppressed below but used in a hidden run to show the table 
summary(model_pca)

Call:
lm(formula = y ~ ., data = merged_data_pca)

Residuals:
    Min      1Q  Median      3Q     Max 
-579.26 -103.39   -0.92   46.84 1677.87 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   115.06      20.66   5.568 1.50e-07 ***
PC1            74.32      13.12   5.663 9.68e-08 ***
PC2           -78.80      15.16  -5.200 7.91e-07 ***
PC3           -12.33      20.94  -0.589 0.557082    
PC4           -79.48      24.12  -3.295 0.001281 ** 
PC5           -37.92      24.51  -1.547 0.124435    
PC6           102.52      29.28   3.501 0.000643 ***
PC7            85.95      33.60   2.558 0.011717 *  
PC8            57.80      37.26   1.552 0.123300    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 239.2 on 125 degrees of freedom
Multiple R-squared:  0.429, Adjusted R-squared:  0.3925 
F-statistic: 11.74 on 8 and 125 DF,  p-value: 2.319e-12
kable(pca$rotation[, 1:8])
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Elite Sport Ranking 0.2552993 -0.4660090 0.1369448 -0.0525239 -0.5938219 0.5032198 0.1898671 -0.2328972
Sports Participation Rate 0.4068975 -0.4114166 0.0458514 0.1075156 0.0144703 -0.2951564 -0.7509729 -0.0148991
Gym Memberships per 100k 0.2073760 -0.5255337 0.1684044 -0.3650119 0.3755100 -0.2805787 0.4702557 0.2798986
Hobbies - Health & Fitness -0.3451803 -0.2808000 0.4469759 0.4093140 0.4742064 0.2136865 -0.0141170 -0.4048995
Hobbies - Playing Sport -0.3299214 0.0866252 0.6851963 -0.1185995 -0.4554738 -0.4245194 -0.0453082 0.1017949
Hobbies - Watching Sport -0.4549719 -0.1670587 -0.0248454 -0.5726510 0.1090661 0.4300852 -0.4107180 0.2664349
Fitness Apps -0.4034741 -0.2929198 -0.4573991 -0.2066119 -0.1442053 -0.4106815 0.0539816 -0.5541982
Fitness Spend -0.3584182 -0.3705714 -0.2677146 0.5479293 -0.2028292 -0.0376422 0.0706709 0.5610389
model_refined <- lm(y ~ PC1 + PC2 + PC4 + PC6 + PC7, data = merged_data_pca)
summary(model_refined)

Call:
lm(formula = y ~ PC1 + PC2 + PC4 + PC6 + PC7, data = merged_data_pca)

Residuals:
    Min      1Q  Median      3Q     Max 
-650.21 -101.61   -4.96   48.21 1691.77 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   115.06      20.84   5.522 1.79e-07 ***
PC1            74.32      13.23   5.616 1.16e-07 ***
PC2           -78.80      15.28  -5.156 9.31e-07 ***
PC4           -79.48      24.32  -3.268 0.001392 ** 
PC6           102.52      29.53   3.472 0.000705 ***
PC7            85.95      33.88   2.537 0.012383 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 241.2 on 128 degrees of freedom
Multiple R-squared:  0.4055,    Adjusted R-squared:  0.3823 
F-statistic: 17.46 on 5 and 128 DF,  p-value: 3.725e-13
plot(model_refined$fitted.values, residuals(model_refined), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
qqnorm(residuals(model_refined))
qqline(residuals(model_refined), col = "red")

In defining the Principal components we can see the regression provided below, and so to talk to some of the key numbers that are present, focusing below on the variables where the “loading” listed indicated the correlation or contribution of each variable to each principal component. With the following being true

  • A high positive loading (close to +1) means the variable is strongly positively correlated with the component.

  • A high negative loading (close to -1) means the variable is strongly negatively correlated with the component.

  • A loading close to 0 indicates a weak relationship.

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Elite Sport Ranking 0.26 -0.47 0.14 -0.05 -0.59 0.50 0.19 -0.23
Sports Participation Rate 0.41 -0.41 0.05 0.11 0.01 -0.30 -0.75 -0.01
Gym Memberships per 100k 0.21 -0.53 0.17 -0.37 0.38 -0.28 0.47 0.28
Hobbies - Health & Fitness -0.35 -0.28 0.45 0.41 0.47 0.21 -0.01 -0.40
Hobbies - Playing Sport -0.33 0.09 0.69 -0.12 -0.46 -0.42 -0.05 0.10
Hobbies - Watching Sport -0.45 -0.17 -0.02 -0.57 0.11 0.43 -0.41 0.27
Fitness Apps -0.40 -0.29 -0.46 -0.21 -0.14 -0.41 0.05 -0.55
Fitness Spend -0.36 -0.37 -0.27 0.55 -0.20 -0.04 0.07 0.56

This is where the analysis of probability of explanation comes into play, particularly noticed by the Pr(>|t|) value and the R-squared adding some context to the explanation. With the importance of the Principal components defined by the signif. codes to show how explainable it was.

Again this is primarily focused on the coding for the analysis however part of analyzing any good data source is the statistics behind it and as such I won’t talk in too much depth, however provided is the statistics behind the claims made (enjoy).


Call:
lm(formula = y ~ ., data = merged_data_pca)

Residuals:
    Min      1Q  Median      3Q     Max 
-579.26 -103.39   -0.92   46.84 1677.87 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   115.06      20.66   5.568 1.50e-07 ***
PC1            74.32      13.12   5.663 9.68e-08 ***
PC2           -78.80      15.16  -5.200 7.91e-07 ***
PC3           -12.33      20.94  -0.589 0.557082    
PC4           -79.48      24.12  -3.295 0.001281 ** 
PC5           -37.92      24.51  -1.547 0.124435    
PC6           102.52      29.28   3.501 0.000643 ***
PC7            85.95      33.60   2.558 0.011717 *  
PC8            57.80      37.26   1.552 0.123300    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 239.2 on 125 degrees of freedom
Multiple R-squared:  0.429, Adjusted R-squared:  0.3925 
F-statistic: 11.74 on 8 and 125 DF,  p-value: 2.319e-12

Next stages would be looking at the refined table by removing non critical factors these would be the values where Pr(>|t|) is higher than its counterparts or closer to 1 which indicates that the relationship is higher likelihood of being based on chance than actual relationship. The removed data would be PC3,5 and 8 and result in the below Regression analysis resulting in an Adjusted R-Squared value of 0.38 which is not the greatest explanation however given that this is a macro level view (i.e. country) and focusing on sports culture a social science it is a reasonable rate.


Call:
lm(formula = y ~ PC1 + PC2 + PC4 + PC6 + PC7, data = merged_data_pca)

Residuals:
    Min      1Q  Median      3Q     Max 
-650.21 -101.61   -4.96   48.21 1691.77 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   115.06      20.84   5.522 1.79e-07 ***
PC1            74.32      13.23   5.616 1.16e-07 ***
PC2           -78.80      15.28  -5.156 9.31e-07 ***
PC4           -79.48      24.32  -3.268 0.001392 ** 
PC6           102.52      29.53   3.472 0.000705 ***
PC7            85.95      33.88   2.537 0.012383 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 241.2 on 128 degrees of freedom
Multiple R-squared:  0.4055,    Adjusted R-squared:  0.3823 
F-statistic: 17.46 on 5 and 128 DF,  p-value: 3.725e-13

So some of the key takeaways across the data focusing on three of the variables:

  • Elite Sport Ranking - Countries with strong elite sport rankings tend to perform better

  • Sports Participation rate - Higher Sports participation rate correlate with better olympic results

  • Hobbies - Health & Fitness - Countries where most people focus on personal fitness as a hobby may see indirect benefits in Olympic results.

Population Sport Spending

The next component of investigation was spending in Sports at the Retail Value RSP in local currency for each country. For this data Euromonitor, does a great job of consolidating this information so that it is easier to do an analysis and uses a variety of sources such as the SEC & IRS for the U.S numbers as official sources and many more. The first part of this was to normalize the numbers as they were in local currencies and aggregated in different capacity (some in millions, others in billions).

Euromonitor Sports Spend file read
Sports_spend <- read.csv("Passport_Stats_10-12-2024_0544_GMT.csv", skip = 5, header = TRUE) |>
  mutate(Currency =  substr(Unit, 1, 3)) |>
  mutate(X2023 = str_replace_all(X2023, ",", "")) |>
  mutate(X2023 = as.numeric(X2023)) |>
  filter(!is.na(Category) & Category != "") 
#Given that Euromonitor does not have an API, and the shared license makes it difficult to parse the information, I downloaded and uploaded the information and focused on 2023

head(n = 5, Sports_spend)
         Geography           Category        Data.Type        Unit
1            China Outdoor and Sports Retail Value RSP CNY million
2 Hong Kong, China Outdoor and Sports Retail Value RSP HKD million
3            India Outdoor and Sports Retail Value RSP INR million
4        Indonesia Outdoor and Sports Retail Value RSP IDR billion
5            Japan Outdoor and Sports Retail Value RSP JPY billion
  Current.Constant  X2023 Currency
1   Current Prices 2458.7      CNY
2   Current Prices   70.3      HKD
3   Current Prices 1203.7      INR
4   Current Prices  211.7      IDR
5   Current Prices   13.1      JPY
Extract FX rates
unique_currencies <- Sports_spend |>
  distinct(Currency) |>   # Get unique currencies
  pull(Currency)    

# Combine the unique currencies into a comma-separated string
currency_string <- paste(unique_currencies, collapse = ",")

#Setting up an api key for the extract of Exchange Rates
Exchange_rate_api <- readLines("exchange rates api.txt")

#Downloading the exchange rates from the provided API
Exchange_rate_url <- paste0("https://api.exchangeratesapi.io/v1/2023-12-31?access_key=",Exchange_rate_api,"&symbols=",currency_string,"&format=1")

#Download the information
Exchange_rate <- GET(Exchange_rate_url)

#Parse the information
Usable_Exchange <- content(Exchange_rate, "parsed")

rates_table <- as_tibble(Usable_Exchange$rates)

long_data <- rates_table |>
  pivot_longer(
    cols = everything(),    # Select all columns
    names_to = "Currency",  # New column for the headers
    values_to = "Value"     # New column for the values
  )
#Setting it up as USD related focus of investment

USD_Value <- long_data |>
  filter(Currency == "USD") |>
  pull(Value)

Rates <- long_data |>
  mutate(Value_USD = as.numeric(1/(Value/USD_Value)))
#Converting everything to USD 

Sports_spend <- Sports_spend |>
  left_join(Rates,Sports_spend, by = ("Currency"))
#Joining in the amount spent for Sports with the local currency adjustment to USD.

Sports_spend <- Sports_spend |>
  mutate(X2023_USD = X2023*Value_USD,
         Last7 = str_sub(Unit, -7),  # Extract the last 7 characters i.e. million or billion
         Multiplier = ifelse(Last7 == "million", 1000000, 1000000000), # Set multiplier based on condition
         X2023_USD = X2023_USD*Multiplier)

kable(Sports_spend)
Geography Category Data.Type Unit Current.Constant X2023 Currency Value Value_USD X2023_USD Last7 Multiplier
China Outdoor and Sports Retail Value RSP CNY million Current Prices 2458.7 CNY 7.825079 0.1412709 347342777 million 1e+06
Hong Kong, China Outdoor and Sports Retail Value RSP HKD million Current Prices 70.3 HKD 8.632571 0.1280564 9002365 million 1e+06
India Outdoor and Sports Retail Value RSP INR million Current Prices 1203.7 INR 92.027926 0.0120122 14459061 million 1e+06
Indonesia Outdoor and Sports Retail Value RSP IDR billion Current Prices 211.7 IDR 17013.075723 0.0000650 13755598 billion 1e+09
Japan Outdoor and Sports Retail Value RSP JPY billion Current Prices 13.1 JPY 155.900774 0.0070908 92889042 billion 1e+09
Malaysia Outdoor and Sports Retail Value RSP MYR million Current Prices 77.7 MYR 5.079529 0.2176296 16909822 million 1e+06
Philippines Outdoor and Sports Retail Value RSP PHP million Current Prices 474.9 PHP 61.238406 0.0180517 8572742 million 1e+06
Singapore Outdoor and Sports Retail Value RSP SGD million Current Prices 12.2 SGD 1.458677 0.7578484 9245750 million 1e+06
South Korea Outdoor and Sports Retail Value RSP KRW billion Current Prices 15.1 KRW 1431.045823 0.0007725 11664466 billion 1e+09
Taiwan Outdoor and Sports Retail Value RSP TWD million Current Prices 330.1 TWD 33.919310 0.0325908 10758209 million 1e+06
Thailand Outdoor and Sports Retail Value RSP THB million Current Prices 347.6 THB 38.066335 0.0290403 10094392 million 1e+06
Australia Outdoor and Sports Retail Value RSP AUD million Current Prices 133.7 AUD 1.622777 0.6812125 91078113 million 1e+06
Poland Outdoor and Sports Retail Value RSP PLN million Current Prices 202.3 PLN 4.344490 0.2544501 51475259 million 1e+06
Romania Outdoor and Sports Retail Value RSP RON million Current Prices 51.8 RON 4.978393 0.2220508 11502230 million 1e+06
Russia Outdoor and Sports Retail Value RSP RUB million Current Prices 14083.7 RUB 98.661662 0.0112045 157801018 million 1e+06
Ukraine Outdoor and Sports Retail Value RSP UAH million Current Prices 211.2 UAH 42.111813 0.0262505 5544105 million 1e+06
Argentina Outdoor and Sports Retail Value RSP ARS million Current Prices 7867.7 ARS 896.324454 0.0012333 9703402 million 1e+06
Brazil Outdoor and Sports Retail Value RSP BRL million Current Prices 422.6 BRL 5.364221 0.2060795 87089198 million 1e+06
Mexico Outdoor and Sports Retail Value RSP MXN million Current Prices 1182.3 MXN 18.764278 0.0589128 69652594 million 1e+06
South Africa Outdoor and Sports Retail Value RSP ZAR million Current Prices 536.9 ZAR 20.224535 0.0546592 29346500 million 1e+06
United Arab Emirates Outdoor and Sports Retail Value RSP AED million Current Prices 75.5 AED 4.060068 0.2722752 20556781 million 1e+06
Canada Outdoor and Sports Retail Value RSP CAD million Current Prices 210.8 CAD 1.464723 0.7547202 159095013 million 1e+06
USA Outdoor and Sports Retail Value RSP USD million Current Prices 1985.5 USD 1.105456 1.0000000 1985500000 million 1e+06
France Outdoor and Sports Retail Value RSP EUR million Current Prices 356.6 EUR 1.000000 1.1054560 394205610 million 1e+06
Germany Outdoor and Sports Retail Value RSP EUR million Current Prices 176.6 EUR 1.000000 1.1054560 195223530 million 1e+06
Italy Outdoor and Sports Retail Value RSP EUR million Current Prices 90.8 EUR 1.000000 1.1054560 100375405 million 1e+06
Netherlands Outdoor and Sports Retail Value RSP EUR million Current Prices 57.5 EUR 1.000000 1.1054560 63563720 million 1e+06
Spain Outdoor and Sports Retail Value RSP EUR million Current Prices 84.9 EUR 1.000000 1.1054560 93853214 million 1e+06
Sweden Outdoor and Sports Retail Value RSP SEK million Current Prices 397.2 SEK 11.152225 0.0991243 39372154 million 1e+06
Switzerland Outdoor and Sports Retail Value RSP CHF million Current Prices 58.1 CHF 0.929683 1.1890677 69084832 million 1e+06
Turkey Outdoor and Sports Retail Value RSP TRY million Current Prices 386.3 TRY 32.463169 0.0340526 13154528 million 1e+06
United Kingdom Outdoor and Sports Retail Value RSP GBP million Current Prices 162.4 GBP 0.868386 1.2730007 206735316 million 1e+06

Aggregated USD amount spent on sports equipment in 2023 for 34 countries, to provide some context as to how much each country spends on sports. This will be cross referenced against the number of medals won.

Geography USD Amount
USA 1985500000
France 394205610
China 347342777
United Kingdom 206735316
Germany 195223530
Canada 159095013
Russia 157801018
Italy 100375405
Spain 93853214
Japan 92889042
Australia 91078113
Brazil 87089198
Mexico 69652594
Switzerland 69084832
Netherlands 63563720
Poland 51475259
Sweden 39372154
South Africa 29346500
United Arab Emirates 20556781
Malaysia 16909822
India 14459061
Indonesia 13755598
Turkey 13154528
South Korea 11664466
Romania 11502230
Taiwan 10758209
Thailand 10094392
Argentina 9703402
Singapore 9245750
Hong Kong, China 9002365
Philippines 8572742
Ukraine 5544105

After all of this sports spend we tie it back to the original question which is how does sports spending relate to Olympic medals won, we join the legacy data of Olympic medals to the new sport spend data that was listed above.

It is interesting to compare the R-squared values in the Olympic medal regression analysis where this model of current spend (i.e. 2023 USD in millions) is a better estimator of winning an Olympic medal compared to the previous analysis focused on hobbies and International sports ranking. The confidence P-value is indicative of this, feel free to further dive into the data below.

Regression of Medals to Amount spent
Sports_spend_medals <- Sports_spend |>
  mutate(Geography = case_when(
    Geography == "USA" ~ "United States of America",
    Geography == "Hong Kong, China" ~ "Hong Kong",
    TRUE ~ Geography)) |>
  mutate(X2023_USD = X2023_USD/1000000) |>
  left_join(data |>
              select(`Olympic Medals`,Country), by = c("Geography" = "Country"))

model <- lm(`Olympic Medals` ~ X2023_USD, data = Sports_spend_medals)

# Display the summary of the regression model (code suppressed here and hidden overall)
summary(model)

Call:
lm(formula = `Olympic Medals` ~ X2023_USD, data = Sports_spend_medals)

Residuals:
    Min      1Q  Median      3Q     Max 
-227.21 -189.68 -115.48   58.82 1153.54 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 203.2603    60.9761   3.333  0.00229 ** 
X2023_USD     1.2624     0.1642   7.689 1.41e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 320.4 on 30 degrees of freedom
Multiple R-squared:  0.6634,    Adjusted R-squared:  0.6521 
F-statistic: 59.11 on 1 and 30 DF,  p-value: 1.412e-08

Call:
lm(formula = `Olympic Medals` ~ X2023_USD, data = Sports_spend_medals)

Residuals:
    Min      1Q  Median      3Q     Max 
-227.21 -189.68 -115.48   58.82 1153.54 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 203.2603    60.9761   3.333  0.00229 ** 
X2023_USD     1.2624     0.1642   7.689 1.41e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 320.4 on 30 degrees of freedom
Multiple R-squared:  0.6634,    Adjusted R-squared:  0.6521 
F-statistic: 59.11 on 1 and 30 DF,  p-value: 1.412e-08

To further cement all of this, a graph has been provided below with 95% confidence intervals to highlight the data spent, obviously this is difficult due to the closer grouping of the data to the under $250 million USD spent on sport related items. Further analysis could definitely be performed here to complete a logarithmic analysis or some alternative type.

Plotting Spend vs Olympic medals
# Extract the formula
formula_text <- paste0("y = ", round(coef(model)[1], 2), 
                       " + ", round(coef(model)[2], 2), "x")


ggplot(Sports_spend_medals, aes(x = X2023_USD, y = `Olympic Medals`)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  annotate("text", x = 1000, y = 2000, label = formula_text, color = "black", size = 5, hjust = 0) +
  labs(
  title = "Sports Spending vs Olympic Medals",
  x = "Sports Spending in 2023 (Million USD)",
  y = "Olympic Medals"
  ) +
  theme_minimal()

All of this shows the more that a country spends the more Olympic medals that are won with it only costing roughly a million USD for population per medal won.

Thank you for reading the analysis, I would highly recommend again reviewing my colleagues research into the other 4 components or the over-view non technical analysis. Otherwise feel free to connect via Linkedin to discuss this topic further, or generally to talk about coding, financial services or low-code no code tools as that is my specialty.

References:

  1. Halsall, A. (2021, July 8). Which are the world’s sportiest countries? - MYPROTEINTM. MYPROTEIN. https://us.myprotein.com/thezone/motivation/which-are-the-worlds-sportiest-countries/

  2. Euromonitor International. (2022). Outdoor and sports market size dataset. [Data file]. Retrieved from https://www.euromonitor.com

  3. Exchange Rates API. (n.d.). Documentation. Retrieved from https://exchangeratesapi.io/documentation