#2 Fitting Distributions (Baggage Handling System)

14:540:384: Simulation Models in IE (Spring 2025)

Author

Daniel Moore

Published

January 29, 2025

Questions

Learning Objectives

  • Create a Quarto Project
  • Install and load packages
  • Read data to tibble
  • Do basic plotting of the data
  • Fit a distribution to the data
  • Plot results

1 Problem

fictional

Newark Liberty International Airport is replacing Terminal B as part of their redevelopment plan. Daifuku has been selected to design the baggage handling system around which the terminal will be built. They have used their proprietary software, Sym3 to demonstrate that they meet the Port Authority’s specification documents for 60% Schematic Design.

Does the data indicate that bags will be delivered within 30 min +/- 5 min?

EWR Redevelopment Concept

1.1 Start R Studio

1.2 Load Necessary Packages

Code
install.packages("tidyverse")
install.packages("knitr")
install.packages("fitdistrplus")

library(tidyverse)
library(knitr)
library(fitdistrplus)

1.3 Download Data

Code
# Options: 
# data <- read_csv("path/to/where/you/downloaded/it")
# navigate to the file in your file explorer pane. Click on the file and "import dataset"
data <- read_csv("https://tinyurl.com/sim-bag-data")
head(data)
# A tibble: 6 × 5
  `Flight Number` `Tail Number` `Origin Airport` Arrival            
  <chr>           <chr>         <chr>            <dttm>             
1 0208            N804UA        ATL              2024-01-23 22:47:00
2 0208            N804UA        ATL              2024-01-23 22:47:00
3 0208            N804UA        ATL              2024-01-23 22:47:00
4 0208            N804UA        ATL              2024-01-23 22:47:00
5 0208            N804UA        ATL              2024-01-23 22:47:00
6 0208            N804UA        ATL              2024-01-23 22:47:00
# ℹ 1 more variable: bag_completion_time <dttm>

2 Inspecting Data

Code
data |> ggplot(aes(x = Arrival)) +
  geom_histogram(binwidth=60*60*3, color = "black") +
  labs(
    title = "Histogram of Plane Arrival Times",
    x = "Arrival Time",
    y = "Count of Planes? Bags? What?"
  ) +
  theme_minimal()

Initial Observations

  • Arrivals don’t seem to pick up until noon and go constant until 3 AM…hmmm
  • Could EWR have 20k flights / day? (14 flights/minute)…hmmm

Timezones

  • Everything is done in POSIX (Unix) time
  • “It measures time by the number of non-leap seconds that have elapsed since 00:00:00 UTC on 1 January 1970, the Unix epoch.”
  • UTC is Coordinated Universal Time, not to be confused with Greenwich Mean Time (GMT)
  • UTC is based on International Atomic Time (TAI) which is a weighted average of the time kept by over 450 atomic clocks in over 80 national laboratories worldwide.
Code
# set timezones so the datetime will DISPLAY correctly
data <- data |> mutate(
          Arrival = as.POSIXct(Arrival, tz = "EST")) |>
        mutate(
          bag_completion_time = as.POSIXct(bag_completion_time,
          tz = "EST"))

kable(head(data))
Flight Number Tail Number Origin Airport Arrival bag_completion_time
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:12:48
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:19:04
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:10:56
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:15:42
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:09:44
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:10:13

Fixed Histogram

Code
flight_arrivals <- data |>
  distinct(`Flight Number`, Arrival)

ggplot(flight_arrivals, aes(x = Arrival)) +
  geom_histogram(binwidth=60*60*3, fill = "lightblue", color = "black") + # Binwidth = 1 hour (3600 seconds)
  labs(
    title = "Histogram of Plane Arrival Times",
    x = "Arrival Time",
    y = "Count of Planes"
  ) +
  theme_dark()

3 Solution

Bag Delivery Time Statistics

  • What is a statistic?
Code
x_bar <- mean(data$bag_delivery_time)
s <- sd(data$bag_delivery_time)

Does this meet specifications

Code
summary(data)
 Flight Number      Tail Number        Origin Airport    
 Length:19495       Length:19495       Length:19495      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
    Arrival                       bag_completion_time             
 Min.   :2024-01-23 00:00:00.00   Min.   :2024-01-23 00:16:37.00  
 1st Qu.:2024-01-23 06:53:00.00   1st Qu.:2024-01-23 07:21:11.00  
 Median :2024-01-23 12:14:00.00   Median :2024-01-23 12:52:01.00  
 Mean   :2024-01-23 11:39:42.83   Mean   :2024-01-23 12:09:40.24  
 3rd Qu.:2024-01-23 18:22:00.00   3rd Qu.:2024-01-23 18:51:19.50  
 Max.   :2024-01-23 23:47:00.00   Max.   :2024-01-24 00:31:43.00  

3.1 Visualize Data

Code
data <- data |> mutate(bag_delivery_time = as.numeric(bag_completion_time - Arrival))

data |> ggplot(aes(x = bag_delivery_time)) + 
  geom_histogram()

3.2 Fit Distributions to Data

  • use fitdistrplus to fit various distributions
Code
f_norm <- fitdist(data$bag_delivery_time, "norm")
f_ln <- fitdist(data$bag_delivery_time, "lnorm")
f_weibull <- fitdist(data$bag_delivery_time, "weibull")
f_gamma <- fitdist(data$bag_delivery_time, "gamma")

plot.legend <- c("Normal", "LogNormal", "Weibull", "Gamma")
  • use plotting functionality to plot a comparison of the data.
Code
denscomp(
  list(f_norm, f_ln, f_weibull, f_gamma),
  xlab = "Bag Delivery Time (min)",
  legendtext = plot.legend)

3.3 Compare to Manual Apporach

  • Alternatively, can generate probabilities for each bag arrival time by applying the fitted distributions to the data
Code
data <- data |> mutate(
  Weibull = dweibull(bag_delivery_time,
                      shape = f_weibull$estimate["shape"],
                      scale = f_weibull$estimate["scale"]),
  Normal = dnorm(bag_delivery_time,
                      mean = f_norm$estimate["mean"],
                      sd = f_norm$estimate["sd"]),
  Gamma = dgamma(bag_delivery_time,
                      shape = f_gamma$estimate["shape"],
                      rate = f_gamma$estimate["rate"]),
  LogNormal = dlnorm(bag_delivery_time,
                      meanlog = f_ln$estimate["meanlog"],
                      sdlog = f_ln$estimate["sdlog"]),
)
  • Then can plot these direclty one at a time
Code
data |> ggplot(aes(x = bag_delivery_time)) +
  geom_line(aes(y = Gamma, color="Gamma", line="grey")) +
  geom_line(aes(y = Normal, color="Normal")) +
  geom_line(aes(y = LogNormal, color="LogNormal")) +
  labs(
    color = "Distribution",
    x = "Bag Delivery Time (min)",
    y = "Probability Density"
  )

  • Alternatively, we could transform our tidy data from “wide” to “long” form
Code
long_data <- data |> pivot_longer(
  cols = c(Gamma, LogNormal, Normal, Weibull),
  names_to = "Distribution",
  values_to = "Probability")
  • Using kable to output a formatted table.
Code
kable(head(long_data))
Flight Number Tail Number Origin Airport Arrival bag_completion_time bag_delivery_time Distribution Probability
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:12:48 25.80000 Gamma 0.0632560
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:12:48 25.80000 LogNormal 0.0666378
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:12:48 25.80000 Normal 0.0566180
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:12:48 25.80000 Weibull 0.0480593
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:19:04 32.06667 Gamma 0.0687913
0208 N804UA ATL 2024-01-23 17:47:00 2024-01-23 18:19:04 32.06667 LogNormal 0.0661965
  • and we can actually plot this much easier
Code
ggplot() +
  # Histogram or density plot of `bag_delivery_time`
  geom_histogram(data = data, aes(x = bag_delivery_time, y = ..density..), 
                 bins = 30, color="black", fill = "skyblue", alpha = 0.5) +
  # Overlay probability density lines from `long_data`
  geom_line(data = long_data, aes(x = bag_delivery_time, y = Probability, color = Distribution), size=1)

  • For relativley simple scenarios, it may be easier to manually plot the data yourself
  • As data and plots become more sophisticated this can be difficult
  • I recommend getting familiar with “long” and “wide” data and how to go between the two.

3.4 Conclusion

  • Which distribution is the best fit?

  • How do you know?

  • Does the schematic design meet specifications?

Code
#| echo: true

fitted_distributins <- list(f_gamma, f_ln, f_norm, f_weibull)

# Extract summary metrics from the fitted distributions
summary_table <- sapply(fitted_distributins, function(fit) {
  c(
    "Number of Parameters" = as.integer(length(fit$estimate)),  # Number of parameters
    "Log-Likelihood" = fit$loglik,                # Log-likelihood
    "AIC" = fit$aic,                              # Akaike Information Criterion
    "BIC" = fit$bic                               # Bayesian Information Criterion
  )
})

# Convert to a data frame for readability
# summary_table <- as.data.frame(summary_table)

# Assign column names based on your distributions
colnames(summary_table) <- c("Gamma", "Log-Normal", "Normal", "Weibull")  # Adjust based on distribution names

# View the table
kable(summary_table)
Gamma Log-Normal Normal Weibull
Number of Parameters 2.00 2.00 2.00 2.0
Log-Likelihood -58690.38 -58732.54 -58877.11 -59819.8
AIC 117384.75 117469.09 117758.23 119643.6
BIC 117400.51 117484.84 117773.98 119659.3