R course project report

Author

Amanda Sumner, DS24

Published

May 10, 2025

1 Introduction

The following report consists of exam tasks for the R programming course in several parts: use of external data from SCB, collecting car sales data from Blocket.se, regression model trained on this car sales data, and answers to theoretical questions. This report was written in R Studio and exported as html. The .qmd file with code is also available.

2 External data from SCB

Loading libraries for the entire project:

Code

library(tidyverse)
library(readxl)
library(skimr)
library(scales)
library(knitr)
library(patchwork)
library(car)
library(rsample)
library(Metrics)
library(broom)
library(pxweb)
library(jsonlite)
library(janitor)
library(ggrepel)

2.1 Income by region

Income data: median income by region (for ages 16+) and population, year 2023.

Code

scb_api_inkomst <- "https://api.scb.se/OV0104/v1/doris/sv/ssd/START/HE/HE0110/HE0110A/SamForvInk1"

pxweb_query_inkomst <- list(
  "Region" = c("01", "03", "04", "05", "06", "07", "08", "09", "10",
               "12", "13", "14", "17", "18", "19", "20", "21", "22",
               "23", "24", "25"),
  "Alder" = c("tot16+"),             
  "ContentsCode" = c("HE0110J7",     
                     "HE0110J9"),    
  "Tid" = c("2023")                  
)

scb_response_inkomst <- pxweb_get(url = scb_api_inkomst, query = pxweb_query_inkomst)

scb_data_inkomst <- as.data.frame(scb_response_inkomst)

scb_inkomst <- scb_data_inkomst |>
  select(
  region = region,
  medelinkomst = `Medelinkomst, tkr`,
  antal_pers = `Antal personer`
  ) |>
  mutate(
    medelinkomst = as.numeric(medelinkomst),
    antal_pers = as.numeric(antal_pers),
    region = as.factor(region)
  )

scb_inkomst

2.2 Car ownership

Car ownership data: total number of passenger cars per region, year 2023.

Code

scb_api_bilar <- "https://api.scb.se/OV0104/v1/doris/sv/ssd/START/TK/TK1001/TK1001A/PersBilarA"

pxweb_query_bilar <- list(
  "Region" = c("01", "03", "04", "05", "06", "07", "08", "09", "10",
               "12", "13", "14", "17", "18", "19", "20", "21", "22",
               "23", "24", "25"),
  "ContentsCode" = c("TK1001AB"),
  "Tid" = c("2023")
)

scb_response_bilar <- pxweb_get(url = scb_api_bilar, query = pxweb_query_bilar)

scb_data_bilar <- as.data.frame(scb_response_bilar)

scb_bilar <- scb_data_bilar |>
  select(
  region = region,
  antal_bil = `Antal`
  ) |>
  mutate(
    antal_bil = as.numeric(antal_bil),
    region = as.factor(region)
  )

scb_bilar

Code

scb_data <- left_join(scb_inkomst, scb_bilar, by = "region")
scb_data

2.3 Data visualisation

Code

scb_data_viz <- scb_data |>
  mutate(bilar_per_1000 = (antal_bil / antal_pers) * 1000)

scb_data_viz

Code

income_vs_cars <- ggplot(scb_data_viz,                                          aes(x = medelinkomst, y = bilar_per_1000)) +
  geom_point(aes(size = antal_pers), alpha = 0.8, color = "deepskyblue4") +
  geom_smooth(method = "lm", se = FALSE, color = "darkorange") +
  geom_text_repel(aes(label = region), size = 3, max.overlaps = 15) +
  labs(
    title = "Car ownership vs average income by region",
    subtitle = "Point size represents regional population",
    x = "Average income (thousands of SEK)",
    y = "Cars per 1000 population"
  )+
  theme(legend.position = "none")

income_vs_cars

Code

region_income <- ggplot(scb_data_viz, aes(x = medelinkomst, y = fct_reorder(region, medelinkomst))) +
  geom_col(fill = "pink", color = "black") +
  geom_text(aes(label = round(medelinkomst, 0)),
            hjust = -0.2,
            size = 3) +
  scale_x_continuous(expand = expansion(mult = c(0.01, 0.2))) +
  labs(
    title = "Average income",
    subtitle = "by region in 2023",
    x = "Average income (kSEK)",
    y = element_blank()
  )


region_cars_per_1000 <- ggplot(scb_data_viz, aes(x = bilar_per_1000, y = fct_reorder(region, bilar_per_1000))) +
  geom_col(fill = "lightblue", color = "black") +
  geom_text(aes(label = round(bilar_per_1000, 0)),
            hjust = -0.2,
            size = 3) +
  scale_x_continuous(expand = expansion(mult = c(0.01, 0.2))) +
  labs(
    title = "Number of cars",
    subtitle = "per 1000 population",
    x = "Number of cars",
    y = element_blank()
  )
region_income + region_cars_per_1000

Sweden exhibits notable regional disparities in both average income and car ownership. These variations suggest that used car market and specifically pricing might vary across the country. The data show that there is only a moderate relationship between car ownership and average income by region. Although, curiously, the regions with the highest and the lowest income have, respectively, the lowest and the highest car ownership rates.

I would like to later use this data with my collected car sales data from Blocket to see if there is a relationship between car prices and average income or car ownership.

3 Data collection

I collected data from Blocket.se by using the Web Scraper Chrome extension (webscraper.io). It works by following links from a starting url and copying text from specified elements on the page. It opens each webpage in a popup Chrome window and after it is done, the data can be exported as csv. It is not blocked by the server as a robot because the user agent is Chrome and it is actually opening each webpage and copying the data, with a request interval and page load delay.

I scraped data from 200 most recent pages of car listings which were filtered by year between 2010 and 2020, resulting in an Excel spreadsheet with approximately 7000 rows. I then cleaned the data in Power Query in Excel to remove rows with missing values and standardise spelling of some values. Colours, for example, were reduced to include only base colours: light blue and dark blue both converted to blue, etc. Numeric and text columns were converted to their correct data type, respectively. The resulting Excel spreadsheet had 6381 rows and 13 columns.

4 Blocket car data

Code

car_data <- read_excel("blocket_cars.xlsx")

head(car_data)

Code

skim(car_data)

Data summary
Name	car_data
Number of rows	6381
Number of columns	13
_______________________
Column type frequency:
character	8
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
make	1	2	13	48
model	1	1	18	410
fuel	1	2	6	4
transmission	1	7	7	2
body	1	3	11	8
drive	1	3	3	2
colour	1	3	10	12
location	1	3	20	250

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2016.39	2.86	2010	2014	2017	2019	2020	▃▃▆▇▇
mileage	1	13535.98	6274.56	170	9110	12690	17093	65461	▇▆▁▁▁
engine_cc	1	1922.14	648.56	647	1560	1968	1997	6498	▆▇▁▁▁
power_hk	1	173.00	76.15	44	123	161	191	740	▇▅▁▁▁
price	1	199305.29	140082.72	1	119700	174900	249500	4199000	▇▁▁▁▁

4.1 Numeric variables

4.1.1 EDA of price

Code

summary(car_data$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1  119700  174900  199305  249500 4199000

Code

car_data |>
  ggplot(aes(x = price)) +
  geom_histogram(bins = 100, fill = "lightblue", color = "black") +
  scale_x_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k")) +
  ggtitle("Distribution of car prices") +
  xlab("Price (SEK)")

Code

car_data |>
  filter(price < 40000) |>
  ggplot(aes(x = price)) +
  geom_histogram(bins = 100, fill = "pink", color = "black") +
  ggtitle("Distribution of low car prices (< 40k SEK)") +
  xlab("Price (SEK)")

Code

car_data |>
  filter(price > 500000) |>
  ggplot(aes(x = price)) +
  geom_histogram(bins = 100, fill = "lavender", color = "black") +
  scale_x_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k")) +
  ggtitle("Distribution of High Car Prices (> 500k SEK)") +
  xlab("Price (SEK)")

Filtering out cars priced below 30k SEK and above 500k SEK:

Code

low_price_threshold <- 20000
high_price_threshold <- 1000000

car_data |>
  filter(price >= high_price_threshold) |>
  select(make, model, year, mileage, price) |>
  arrange(desc(price)) |>
  tibble()

Code

car_data |>
  filter(price <= low_price_threshold) |>
  select(make, model, year, mileage, price) |>
  arrange(price) |>
  tibble()

Code

car_data <- car_data |>
  filter(between(price, low_price_threshold, high_price_threshold))
summary(car_data$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20000  119800  174900  196060  249000  999900

Code

car_data |>
  ggplot(aes(x = price)) +
  geom_histogram(bins = 100, fill = "lightblue", color = "black") +
  scale_x_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k")) +
  ggtitle("Distribution of car prices after filtering extremes") +
  xlab("Price (SEK)")

4.1.2 EDA of mileage

Code

summary(car_data$mileage)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    526    9141   12695   13547   17088   65461

Code

car_data |>
  ggplot(aes(x = mileage)) +
  geom_histogram(bins = 50, fill="lightblue", color = "black") +
  ggtitle("Distribution of mileage")

Code

car_data |>
  filter(mileage > 40000) |>
  ggplot(aes(x = mileage)) +
  geom_histogram(bins = 50, fill="pink", color = "black") +
  ggtitle("Distribution of mileage over 40000")

Code

car_data <- car_data |>
  filter(mileage <= 40000)
summary(car_data$mileage)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    526    9126   12682   13480   17044   39999

4.1.3 EDA of engine

Code

summary(car_data$engine_cc)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    647    1560    1968    1916    1997    6417

Code

car_data |>
  ggplot(aes(x = engine_cc)) +
  geom_histogram(bins = 50, fill="lightblue", color = "black") +
  ggtitle("Distribution of engine size")

Code

car_data |>
  filter(engine_cc > 6000) |>
  select(make, model, year, engine_cc, price) |>
  arrange(desc(engine_cc)) |>
  tibble()

4.1.4 EDA of power

Code

summary(car_data$power_hk)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   44.0   123.0   157.0   172.1   191.0   728.0

Code

car_data |>
  ggplot(aes(x = power_hk)) +
  geom_histogram(bins = 50, fill="lightblue", color = "black") +
  ggtitle("Distribution of power")

Code

car_data |>
  filter(power_hk > 600) |>
  select(make, model, year, power_hk, price) |>
  arrange(desc(power_hk)) |>
  tibble()

The horsepower distribution is very skewed. Does it need to be normalised? Visualising the relationship between horsepower and price.

Code

ggplot(car_data, aes(x = power_hk, y = price)) +
  geom_point(alpha = 0.5, color="deepskyblue4") +
  geom_smooth(color="darkorange") +
  ggtitle("Price vs Horsepower")

4.1.5 Correlations of numeric predictors

Code

numeric_predictors <- car_data |>
  select(year, mileage, engine_cc, power_hk, price)
cor(numeric_predictors)

                 year     mileage   engine_cc    power_hk      price
year       1.00000000 -0.50834842 -0.01698333  0.16112628  0.5638117
mileage   -0.50834842  1.00000000  0.11577634 -0.02131676 -0.3849814
engine_cc -0.01698333  0.11577634  1.00000000  0.77047426  0.5686938
power_hk   0.16112628 -0.02131676  0.77047426  1.00000000  0.7307023
price      0.56381173 -0.38498144  0.56869380  0.73070228  1.0000000

4.2 Categorical variables

4.2.1 Make and model

Code

n_distinct(car_data$make)

[1] 46

Code

n_distinct(car_data$model)

[1] 405

Code

make_counts <- car_data |>
  count(make, sort = TRUE)
print(make_counts)

# A tibble: 46 × 2
   make              n
   <chr>         <int>
 1 Volkswagen     1067
 2 Volvo          1034
 3 BMW             542
 4 Audi            482
 5 Mercedes-Benz   478
 6 Ford            335
 7 Kia             293
 8 Toyota          278
 9 Renault         205
10 Skoda           202
# ℹ 36 more rows

Code

make_counts |>
  slice_max(n, n = 30) |> # Show only top 30
  ggplot(aes(x = n, y = fct_reorder(make, n))) +
  geom_col(fill = "lightblue", color = "black") +
  geom_text(
    aes(label = n),
    hjust = -0.2,
    size = 3
  ) +
  scale_x_continuous(expand = expansion(mult = c(0.01, 0.1))) +
  labs(
    title = "Number of cars per make",
    x = "Number of cars",
    y = "Make"
  )

There are 46 distinct car makes and 405 distinct models. That is too many models to use them all in prediction. However, certain car models are much more expensive than others from the same make. It is impractical to plot all car models, so I focus on the most common makes.

Code

top_make_names <- car_data |>
  count(make, sort = TRUE) |>
  slice_max(n, n = 6) |>
  pull(make)

top_makes_filtered <- car_data |>
  filter(make %in% top_make_names) |>
  mutate(make = factor(make, levels = top_make_names))

plot_make <- function(make_name_string, input_data) {
  make_data <- input_data |>
    filter(make == make_name_string) |>
    mutate(model_lumped = fct_lump_n(model, n = 10))
  
  p <- ggplot(make_data, aes(y = model_lumped, x = price)) +
    geom_boxplot(fill = "lightblue", color = "black", outlier.color = "darkorange") +
    scale_x_continuous(
        labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)
        ) +
    labs(
      title = make_name_string,
      x = element_blank(),
      y = element_blank()
    ) +
    theme_minimal(base_size = 8)
  
  return(p)
}

plot_list <- map(top_make_names, ~plot_make(make_name_string = .x, input_data = top_makes_filtered))
wrap_plots(plot_list, ncol = 3)

4.2.2 Body type

Looking at the price distribution, it appears that certain models, especially noticeable with BMW, are much higher priced. Is it related to those being sports cars?

Code

car_data_body <- car_data |>
  mutate(body = as.factor(body))

print(car_data_body |> count(body, sort = TRUE))

# A tibble: 8 × 2
  body            n
  <fct>       <int>
1 kombi        2029
2 SUV          1439
3 halvkombi    1317
4 yrkesfordon   798
5 sedan         357
6 familjebuss   215
7 coupé         106
8 cab            85

Code

ggplot(car_data_body, aes(x = fct_reorder(body, price, .fun = median, .desc = FALSE), y = price)) +
  geom_boxplot(fill = "lavender", outlier.color = "darkorange") +
  scale_y_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  coord_flip() +
  labs(
    title = "Price distribution by body style",
    y = "Price (thousands of SEK)",
    x = "Body style"
  )

4.2.3 Power and engine

Prices are not strongly related to a specific body type. Are they related to either engine size or power?

Code

car_data_price <- car_data |>
  mutate(Is_High_Price = price > 500000)

plot_power <- ggplot(car_data_price, aes(x = power_hk, y = price, color = Is_High_Price)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("FALSE" = "deepskyblue4", "TRUE" = "darkorange"), name = paste("Price > 500k")) +
  scale_y_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  theme(legend.position = "bottom") +
  labs(
    title = "Price vs horsepower",
    x = "Horsepower (hk)",
    y = "Price (thousands of SEK)",
    )
 

plot_engine <- ggplot(car_data_price, aes(x = engine_cc, y = price, color = Is_High_Price)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("FALSE" = "deepskyblue4", "TRUE" = "darkorange"), name = paste("Price > 500k")) +
  scale_y_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  theme(legend.position = "bottom") +
  labs(
    title = "Price vs engine size (cc)",
    x = "Engine size (cc)",
    y = "Price (thousands of SEK)",
    
  )
 
plot_power + plot_engine

From the above visualisations, it is clear that the higher engine size and power, the more common are price outliers. This helps us understand that price outliers are not related to just specific car models but it is the car models with higher engine size and power, therefore the car model variable is not essential to include, make is sufficient.

Code

car_data <- car_data |>
  select(-model)

Code

make_counts <- car_data |>
  count(make, sort = TRUE) |>
  mutate(Rank = row_number())

ggplot(make_counts, aes(x = Rank, y = n)) +
  geom_line(color = "deepskyblue4", linewidth = 1) +
  geom_point(color = "deepskyblue4", size = 2) +
  scale_y_continuous(labels = scales::label_number(accuracy = 1)) +
  geom_vline(xintercept = 20, linetype = "dashed", color = "darkorange") +
  geom_vline(xintercept = 25, linetype = "dashed", color = "red") +
  geom_vline(xintercept = 30, linetype = "dashed", color = "purple") +
  annotate("text", x = 20, y = Inf, label = "Top 20", vjust = 2, hjust = -0.1, size = 3, color = "darkorange") +
  annotate("text", x = 25, y = Inf, label = "Top 25", vjust = 4, hjust = -0.1, size = 3, color = "red") +
  annotate("text", x = 30, y = Inf, label = "Top 30", vjust = 6, hjust = -0.1, size = 3, color = "purple") +

  labs(
    title = "Car count per make (ordered by frequency)",
    x = "Make rank (most frequent = 1)",
    y = "Number of cars"
  )

Limiting the car makes to the top 25 most common and lumping all other makes under “Other”.

Code

car_data <- car_data |>
  mutate(
    make = fct_lump_n(
      make,
      n = 25,
      other_level = "Other"
    )
  )

tibble(car_data |> count(make, sort = TRUE))

How do the other categorical variables relate to price?

4.2.4 Colour

Code

colour_counts <- car_data |>
  count(colour, sort = TRUE)


plot_colour <- ggplot(colour_counts, aes(x = n, y = fct_reorder(colour, n))) +
  geom_col(fill = "pink", color = "black") +
  geom_text(
    aes(label = n),
    hjust = -0.2,
    size = 3
  ) +
  scale_x_continuous(expand = expansion(mult = c(0.01, 0.1))) +
  labs(
    title = "Number of cars per colour",
    x = "Number of cars",
    y = element_blank()
  )

car_data <- car_data |>
    mutate(colour = as.factor(colour))

plot_colour_price <- ggplot(car_data, aes(x = fct_reorder(colour, price, .fun = median), y = price)) +
  geom_boxplot(fill = "lavender", outlier.shape = 21, outlier.fill = "darkorange") + 
  scale_y_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) + 

  coord_flip() +
  labs(
    title = "Price distribution by car colour",
    y = "Price (thousands of SEK)",
    x = element_blank()
  )
plot_colour + plot_colour_price

At this point, it looks like colour does not affect the price much.

Lumping the less common colours into “Other”.

Code

car_data <- car_data |>
  mutate(
    colour = fct_lump_n(
      colour,
      n = 7,
      other_level = "Other"
    )
  )

4.2.5 Drivetrain and transmission type

Code

car_data <- car_data |>
    mutate(
      drive = as.factor(drive),
      transmission = as.factor(transmission)
      )

plot_drive <- ggplot(car_data, aes(x = drive, y = price)) +
  geom_boxplot(fill = "lightblue", outlier.shape = 21, outlier.fill = "darkorange") +
  scale_y_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  labs(
    title = "Price by drivetrain",
    x = "Drivetrain type",
    y = "Price (thousands of SEK)"
  )


plot_transmission <- ggplot(car_data, aes(x = transmission, y = price)) +
  geom_boxplot(fill = "lavender", outlier.shape = 21, outlier.fill = "darkorange") +
  scale_y_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  labs(
    title = "Price by transmission type",
    x = "Transmission type",
    y = element_blank()
  ) 

plot_drive + plot_transmission

Automatic cars are, on average, more expensive than cars with manual transmission. Likewise, 4-wheel-drive cars are more expensive than 2-wheel-drive.

4.2.6 Fuel type

Code

tibble(car_data |> count(fuel, sort = TRUE))

Code

car_data <- car_data |>
    mutate(
      fuel = as.factor(fuel)
      )

plot_fuel <- ggplot(car_data, aes(x = fuel, y = price)) +
  geom_boxplot(fill = "pink", outlier.shape = 21, outlier.fill = "darkorange") +
  scale_y_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  labs(
    title = "Price by fuel type",
    x = "Fuel type",
    y = "Price (thousands of SEK)"
  )
print(plot_fuel)

There is only one electric car in the dataset. Removing the row so that it doesn’t affect the result.

Code

car_data <- car_data |>
  filter(fuel != "el")

car_data |> count(fuel)

4.2.7 Location

What to do about location? There are too many, with one or very few cars in some, but I want to see if price varies regionally.

Code

car_data |> count(location)

Mapping kommun to län to reduce the number of values. Importing an external csv file with all location data.

Code

kommun_län_map <- read_csv("Postort-Kommun-Lan.csv")
kommun_län_map

Code

kommun_lan_map <- kommun_län_map |>
  rename(
    kommun = `KnNamn.kort`,
    lan = `LnNamn.kort`
  ) |>
  select(kommun, lan) |>
  distinct(kommun, .keep_all = TRUE)

car_data <- car_data |>
  rename(kommun = location)

Code

car_data <- car_data |>
  left_join(kommun_lan_map, by = "kommun")

print(head(car_data |> select(kommun, lan)))

# A tibble: 6 × 2
  kommun     lan            
  <chr>      <chr>          
1 Vallentuna Stockholm      
2 Halmstad   Halland        
3 Uppsala    Uppsala        
4 Göteborg   Västra Götaland
5 Halmstad   Halland        
6 Sigtuna    Stockholm

Code

sum(is.na(car_data$lan))

[1] 45

Code

sum(!is.na(car_data$lan))

[1] 6300

Removing the failed joins, 45 rows, and selecting all necessary columns once again.

Code

car_data <- car_data |>
  filter(!is.na(lan))

car_data <- car_data |>
  select(make, year, fuel, transmission, body, mileage, engine_cc, power_hk, drive, colour, lan, price)

car_data

Does the price have regional variations?

Code

car_data <- car_data |>
    mutate(
      lan = as.factor(lan)
      )

avg_price_lan <- car_data |>
  group_by(lan) |>
  summarise(average_price = mean(price, na.rm = TRUE))


plot_avg <- ggplot(
  avg_price_lan,
  aes(x = average_price, y = fct_reorder(lan, average_price))
  ) +
  geom_col(fill = "lightblue", color = "black") +
  geom_text(aes(label = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)(average_price)), hjust = -0.1, size = 3) +
  scale_x_continuous(
    labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1), expand = expansion(mult = c(0.01, 0.15))
  ) +
  labs(
    title = "Average car price by län",
    x = "Average price (thousands of SEK)",
    y = "Län (county)"
  )
  
plot_price <- ggplot(
  car_data,
  aes(x = fct_reorder(lan, price, .fun = median, .desc = FALSE), y = price)) +
  geom_boxplot(fill = "lavender", outlier.shape = 21, outlier.fill = "darkorange") +
  scale_y_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  coord_flip() +
  labs(
    title = "Price distribution by län",
    y = "Price (thousands of SEK)",
    x = element_blank()
  )
  
plot_avg + plot_price

As previously mentioned in the External data section, I wanted to compare the statistics from SCB on car ownership and income with the collected car prices by region. Is there a relationship between average income and car prices or the number of cars owned and car prices? It has to be noted that the SCB data is 2023 totals and the car listings are from April 2025. However, there should not be a noticeable difference as Sweden’s economy has not changed significantly in one year.

Code

scb_data_viz_cleaned <- scb_data_viz |>
  mutate(
    lan = str_replace(region, regex(" län$"), ""),
    lan = trimws(lan),
    lan = str_replace(lan, regex("s$"), "")
  ) |>
  select(-(c("region", "antal_pers", "antal_bil")))

combined_data <- left_join(
  avg_price_lan,
  scb_data_viz_cleaned,
  by = "lan"
)

combined_data

Code

scatter_price_income <- ggplot(combined_data, aes(x = medelinkomst, y = average_price)) +
  geom_point(aes(), color = "deepskyblue4", alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dotted", color = "darkorange", linewidth = 1) +
  geom_text_repel(aes(label = lan), size = 3, max.overlaps = Inf, 
                  box.padding = 0.4, point.padding = 0.3) +
  scale_x_continuous(name = "Median income (SCB, thousands of SEK)",
                     labels = scales::label_number(suffix = "k", accuracy = 1)) +
  scale_y_continuous(name = "Average Blocket car price (SEK)",
                     labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  labs(
    title = "Blocket car price vs. median income by region"
  )

scatter_price_income

Code

scatter_price_cars <- ggplot(combined_data, aes(x = bilar_per_1000, y = average_price)) +
  geom_point(aes(), color = "purple3", alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dotted", color = "darkorange", linewidth = 1) +
  geom_text_repel(aes(label = lan), size = 3, max.overlaps = Inf,
                  box.padding = 0.4, point.padding = 0.3) +
  scale_x_continuous(name = "Cars per 1000 population (SCB)",
                     labels = scales::label_number(accuracy = 1)) +
  scale_y_continuous(name = "Average Blocket car price (SEK)",
                     labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  labs(
    title = "Blocket car price vs car ownership rate by region"
  )

scatter_price_cars

According to the two charts above, there is no clearly defined relationship between car prices on Blocket and car ownership rates and average income by region.

4.2.8 Prepared data

Making sure all categorical predictors are converted into factors.

Code

car_data <- car_data |>
  mutate(
    make = as.factor(make),
    fuel = as.factor(fuel),
    body = as.factor(body),
    transmission = as.factor(transmission),
    drive = as.factor(drive),
    colour = as.factor(colour),
    lan = as.factor(lan)
    )

4.3 Regression model

Split data into training and test sets, 80% training and 20% test, stratifying it on price, to ensure both sets have similar price distribution.

Code

set.seed(26)

data_split <- initial_split(car_data, prop = 0.80, strata = price)
train_data <- training(data_split)
test_data <- testing(data_split)

cat("Training data rows:", nrow(train_data), "\n")

Training data rows: 5038

Code

cat("Test data rows:", nrow(test_data), "\n")

Test data rows: 1262

4.3.1 Initial model

Fit the multiple linear regression model

Code

lm1 <- lm(price ~ ., data = train_data)

tidy(lm1, conf.int = TRUE)

Code

glance(lm1)

Code

vif(lm1)

                 GVIF Df GVIF^(1/(2*Df))
make         6.237500 25        1.037290
year         1.915459  1        1.384001
fuel         2.548018  2        1.263428
transmission 1.609935  1        1.268832
body         4.887600  7        1.120008
mileage      1.790191  1        1.337980
engine_cc    4.093485  1        2.023236
power_hk     4.742742  1        2.177784
drive        1.991989  1        1.411379
colour       1.506636  7        1.029710
lan          1.525369 20        1.010612

Multicollinearity is generally low, below 2, except for engine_cc and power_hk where the values are close to the threshold of 2.0-2.2 This indicates moderate multicollinearity between the two variables, which makes sense, as engine size and horsepower are strongly related. The value is not high enough to be a concern though and both variables are statistically significant (low p-values).

Code

par(mfrow = c(2, 2))
plot(lm1)

The diagnostic plots for the initial linear regression model indicate a need for model refinement. The residuals vs fitted plot shows a curved pattern, the linear model doesn’t capture non-linear relationships in the data. The scale - location plot also shows an upward trend and funnel shape which indicates heteroscedasticity - a non-constant variance of error. The Q-Q plot shows residuals deviating significantly from the line at tails, which means the residuals are not normally distributed. Residuals vs leverage plot highlights several outliers.

4.3.2 Model 2

Code

model_formula_refined <- log(price) ~ poly(mileage, 2) + poly(year, 2) + make + fuel + transmission + body + engine_cc + power_hk + drive + colour + lan

lm2 <- lm(model_formula_refined, data = train_data)

tidy(lm2, conf.int = TRUE)

Code

glance(lm2)

Code

vif(lm2)

                     GVIF Df GVIF^(1/(2*Df))
poly(mileage, 2) 1.952878  2        1.182140
poly(year, 2)    2.124302  2        1.207269
make             6.363259 25        1.037704
fuel             2.621991  2        1.272500
transmission     1.609940  1        1.268834
body             5.004558  7        1.121901
engine_cc        4.106839  1        2.026534
power_hk         4.749674  1        2.179375
drive            2.003381  1        1.415408
colour           1.520810  7        1.030399
lan              1.541155 20        1.010872

Code

par(mfrow = c(2, 2))
plot(lm2)

4.3.3 Model 3

Code

model_formula_refined2 <- log(price) ~ make + year + fuel + transmission + mileage + body + engine_cc + drive

lm3 <- lm(model_formula_refined2, data = train_data)

tidy(lm3, conf.int = TRUE)

Code

glance(lm3)

Code

vif(lm3)

                 GVIF Df GVIF^(1/(2*Df))
make         3.953237 25        1.027872
year         1.813461  1        1.346648
fuel         1.983273  2        1.186713
transmission 1.578909  1        1.256547
mileage      1.763729  1        1.328055
body         3.932630  7        1.102751
engine_cc    1.873838  1        1.368882
drive        1.778742  1        1.333695

Code

par(mfrow = c(2, 2))
plot(lm3)

4.3.4 Model evaluation

Code

test_pred_1 <- predict(lm1, newdata = test_data)
rmse_1 <- rmse(actual = test_data$price, predicted = test_pred_1)
rsq_1 <- cor(test_data$price, test_pred_1)^2


test_actual_sek <- test_data$price
test_actual_log <- log(test_data$price)

test_pred_2 <- predict(lm2, newdata = test_data)
test_pred_sek_2 <- exp(test_pred_2)
rmse_2 <- rmse(actual = test_actual_log, predicted = test_pred_2)
rsq_2 <- cor(test_actual_log, test_pred_2)^2
rmse_sek_2 <- rmse(actual = test_actual_sek, test_pred_sek_2)

test_pred_3 <- predict(lm3, newdata = test_data)
test_pred_sek_3 <- exp(test_pred_3)
rmse_3 <- rmse(actual = test_actual_log, predicted = test_pred_3)
rsq_3 <- cor(test_actual_log, test_pred_3)^2
rmse_sek_3 <- rmse(actual = test_actual_sek, test_pred_sek_3)


cat("Initial model test set RMSE:", round(rmse_1, 4), "\n")

Initial model test set RMSE: 45364.24

Code

cat("Refined model 2 test set RMSE on log scale:", round(rmse_2, 4), "\n")

Refined model 2 test set RMSE on log scale: 0.1859

Code

cat("Refined model 2 test set RMSE on SEK scale:", round(rmse_sek_2, 2), "\n")

Refined model 2 test set RMSE on SEK scale: 46826.87

Code

cat("Refined model 3 test set RMSE on log scale:", round(rmse_3, 4), "\n")

Refined model 3 test set RMSE on log scale: 0.2037

Code

cat("Refined model 3 test set RMSE on SEK scale:", round(rmse_sek_3, 2), "\n\n")

Refined model 3 test set RMSE on SEK scale: 50014.23

Code

cat("Initial model test set R-squared:", round(rsq_1, 4), "\n")

Initial model test set R-squared: 0.849

Code

cat("Refined model 2 test set R-squared:", round(rsq_2, 4), "\n")

Refined model 2 test set R-squared: 0.9

Code

cat("Refined model 3 test set R-squared:", round(rsq_3, 4), "\n")

Refined model 3 test set R-squared: 0.8798

I tested two refined models. Model 2 used log(price) ~ poly(mileage, 2) + poly(year, 2) in addition to all other variables. Model 3 used log(price) as well as make, year, fuel, transmission, mileage, body, engine_cc and drive, that is, removing colour, horsepower, and län variables. Model 3 performed worse than model 2, so the removed variables were actually significant and should be kept.

For model 2, multiple R-squared and adjusted R-squared values of 0.8962 and 0.8947 show a very good fit, the model explains ca. 89% variance in log(price) in the training data. Large value of F-statistic with a very small p-value < 2.2e-16 indicates that the model is overall highly statistically significant and shows an improvement over the initial model.

The model could be simplified by removing the colour variables, most of the colours are not significant, except for white and red. Most of the location (län) variables are also not significant, except for Gotland where the average price is the lowest. Most of the makes are highly significant, except for a few, such as BMW and Honda.

I will attempt a final refined model without the län and colour predictors, as well as adding poly to the numeric predictors.

4.3.5 Final model

Code

model_formula_refined_4 <- log(price) ~ poly(mileage, 2) + year + make + fuel + transmission + body + poly(engine_cc, 2) + poly(power_hk, 2) + drive

lm4 <- lm(model_formula_refined_4, data = train_data)

tidy(lm4, conf.int = TRUE)

Code

glance(lm4)

Code

vif(lm4)

                       GVIF Df GVIF^(1/(2*Df))
poly(mileage, 2)   1.891894  2        1.172801
year               1.922764  1        1.386638
make               6.111173 25        1.036866
fuel               2.906194  2        1.305663
transmission       1.690090  1        1.300035
body               4.524851  7        1.113856
poly(engine_cc, 2) 9.514165  2        1.756276
poly(power_hk, 2)  8.895746  2        1.727013
drive              2.047110  1        1.430772

Code

par(mfrow = c(2, 2))
plot(lm4)

4.3.6 Final model evaluation

Code

test_pred_4 <- predict(lm4, newdata = test_data)
test_pred_sek_4 <- exp(test_pred_4)
rmse_4 <- rmse(actual = test_actual_log, predicted = test_pred_4)
rsq_4 <- cor(test_actual_log, test_pred_4)^2
rmse_sek_4 <- rmse(actual = test_actual_sek, test_pred_sek_4)

cat("Refined model 4 test set RMSE on log scale:", round(rmse_4, 4), "\n")

Refined model 4 test set RMSE on log scale: 0.1822

Code

cat("Refined model 4 test set RMSE on SEK scale:", round(rmse_sek_4, 2), "\n")

Refined model 4 test set RMSE on SEK scale: 40247.15

Code

cat("Refined model 4 test set R-squared:", round(rsq_4, 4), "\n")

Refined model 4 test set R-squared: 0.9037

Model 4 shows an improvement over model 2 in every aspect. Residual standard error is 0.1849 on 4994 degrees of freedom. Multiple R-squared 0.8994 and adjusted R-squared 0.8985 values indicate an even better fit than model 2. It is an acceptable final model for the purpose of this project.

Code

model_filename <- "final_car_model.rds"
saveRDS(lm4, file = model_filename)
cat("Model saved to:", model_filename, "\n")

Model saved to: final_car_model.rds

4.3.7 Testing the model

Testing the model to predict price from new car data. I took random car listings from Blocket on dates that were not initially collected and entered the values manually.

Code

new_car_data <- tibble(
  make = c("Dodge", "Mercedes-Benz", "Peugeot", "Skoda", "Kia", "Volkswagen", "BMW", "Ford", "Volvo", "Porsche"),
  year = c(2020, 2017, 2012, 2015, 2018, 2015, 2019, 2019, 2018, 2020),
  mileage = c(2054, 4060, 19979, 13872, 13200, 15980, 9888, 5507, 8914, 9693),
  fuel = c("bensin", "bensin", "diesel", "diesel", "hybrid", "diesel", "diesel", "bensin", "diesel", "bensin"),
  transmission = c("automat", "automat", "manuell", "manuell", "automat", "automat", "automat", "manuell", "automat", "automat"),
  power_hk = c(396, 510, 112, 90, 206, 263, 191, 101, 236, 245),
  engine_cc = c(5654, 3982, 1560, 1598, 1999, 2967, 1995, 998, 1969, 1984),
  body = c("yrkesfordon", "kombi", "kombi", "halvkombi", "kombi", "SUV", "kombi", "kombi", "SUV", "SUV"),
  drive = c("4WD", "2WD", "2WD", "2WD", "2WD", "4WD", "4WD", "2WD", "4WD", "4WD"),
  actual_price = c(849800, 630000, 36000, 88800, 164900, 219900, 269500, 144800, 479800, 489800)
)

make_train_levels <- levels(train_data$make)
new_car_data <- new_car_data |>
  mutate(
    make = fct_other(make, keep = make_train_levels, other_level = "Other"),
    make = factor(make, levels = make_train_levels)
  )

predicted_log_prices <- predict(lm4, newdata = new_car_data)

predicted_prices_sek <- round(exp(predicted_log_prices), 0)

new_car_table <- new_car_data |>
  select(make, year, actual_price)

tibble(new_car_table, predicted_prices_sek)

Code

comparison_data <- new_car_data |>
  mutate(predicted_price = predicted_prices_sek)

plot_comparison <- ggplot(comparison_data, aes(x = predicted_price, y = actual_price)) +
  geom_point(alpha = 0.7, size = 3, color = "darkorange") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "deepskyblue4", linewidth = 1) +
  geom_text(
    aes(label = make),
    vjust = -0.8,
    hjust = 0.5,
    size = 3,
    check_overlap = TRUE
  ) +
  scale_x_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  scale_y_continuous(labels = scales::label_number(scale = 1/1000, suffix = "k", accuracy = 1)) +
  coord_equal(
    xlim = range(c(comparison_data$predicted_price, comparison_data$actual_price), na.rm = TRUE),
    ylim = range(c(comparison_data$predicted_price, comparison_data$actual_price), na.rm = TRUE)
  ) +
  labs(
    title = "Actual vs. predicted prices for new car listings",
    x = "Predicted price",
    y = "Actual price"
    )
plot_comparison

5 Theory questions

Kolla på följande video: https://www.youtube.com/watch?v=X9_ISJ0YpGw&t=290s , beskriv kortfattat vad en Quantile-Quantile (QQ) plot är.
Din kollega Karin frågar dig följande: ”Jag har hört att i Maskininlärning så är fokus på prediktioner medan man i statistisk regressionsanalys kan göra såväl prediktioner som statistisk inferens. Vad menas med det, kan du ge några exempel?” Vad svarar du Karin?
Vad är skillnaden på ”konfidensintervall” och ”prediktionsintervall” för predikterade värden?
Den multipla linjära regressionsmodellen kan skrivas som: 𝑌= 𝛽0 + 𝛽1𝑥1 + 𝛽1𝑥2+ …+ 𝛽𝑝𝑥𝑝 +𝜀 . Hur tolkas beta parametrarna?
Din kollega Nils frågar dig följande: ”Stämmer det att man i statistisk regressionsmodellering inte behöver använda träning, validering och test set om man nyttjar mått såsom BIC? Vad är logiken bakom detta?” Vad svarar du Hassan?
Förklara algoritmen nedan för ”Best subset selection”
Ett citat från statistikern George Box är: “All models are wrong, some are useful.” Förklara vad som menas med det citatet.

A quantile-quantile plot compares the distributions of two datasets. It plots the quantiles of the first dataset against the quantiles of the second. If the two distributions are similar, the points fall approximately along a straight line. A Q-Q plot is commonly used to check if a dataset follows a normal distribution.
While both machine learning and statistical regression can be used for prediction, the difference is in their primary focus and goals. The goal of machine learning is to build models that are accurate at predicting unknown values on new data. The focus is on minimising the prediction errors and it is less important how exactly the prediction is made. Examples of prediction in machine learning: spam email classification or predicting house prices. In statistical regression the focus is on understanding the relationships between variables and statistical significance and confidence of these relationships. Example: analysis of the effect of a new medicine, controlling for demographics, the goal is to estimate the average effect, test if the effect is statistically significant and the uncertainty of the estimate for general population.
Confidence interval estimates the range where the average value of a variable is likely to fall for a given set of predictor values, that is, the uncertainty of the average outcome. Prediction interval estimates the range where a single individual observation is likely to fall for a given set of predictor values, that is, the uncertainty of a single outcome.
𝛽0 is the intercept, 𝛽1, 𝛽2… 𝛽p are the model coefficients for predictors which are unknown and are estimated with training data, and 𝜀 is the error.
BIC selects a model based on the same data used for fitting the model. It helps select a simpler model and reduces overfitting for the specific dataset. The machine learning approach of training, validation, and test datasets evaluates model performance on previously unseen data because the primary goal is to achieve high accuracy on new data.
This is an algorithm to find the best linear regression model by considering all possible combinations of the predictor variables. It starts with a null model with no predictors. Then it iterates through subset sizes with predictor number from 1 up to p and fits all possible regression models for each. It selects the best model per subset based on the smallest residual sum of squares or the largest R^2. Then it selects the overall best model based on validation criteria such as prediction error on validation set or cross-validation.
“All models are wrong” means that a model is not a perfect representation of reality, it is an abstraction by default, focussing on limited aspects of the reality. “Some models are useful”, even though they’re not perfect, models can help us make accurate predictions that we couldn’t do without them. What is important in a model is that it is a useful tool for a specific task, not that it is a perfect copy of the reality.

6 Self-evaluation

Honestly, in the beginning I disliked R intensely. RStudio was also hard to get used to, after having used VSCode for a long time. The worst thing about R is having to type special characters.

While working on this project, however, I found that it was good at visualising data quite nicely and the export to html and pdf function is very useful. So, it has its uses, but I am glad to go back to using Python for machine learning.