A Deep(er) Dive into Toronto’s Airbnb Market

Pricing determinants, predictive modeling, and hypothesis testing

predictive analytics
short-term rentals
housing economics
regression modeling
feature engineering
hypothesis testing
Python
Author

A. Srikanth

Published

January 2, 2026

Project Spotlight

This analysis was completed as a course project for the Master of Management Analytics (MMA) program at Queen’s University. It is presented for educational purposes and reflects project-scoped assumptions and methods; results may not generalize beyond the dataset and period analyzed. This content is not financial, legal, or investment advice.

Collaborators: Adam Tang (MMA, ’26) • Balamira Gurumoorthy (MMA, ’26) • Bella Wang (MMA, ’26) • Esha Fatim (MMA, ’26) • Mohan Li (MMA, ’26) • Yukang Fu (MMA, ’26)

Context

Short-term rental platforms like Airbnb have reshaped travel and real estate use. In Toronto, demand stays high due to tourism, business travel, and hotel capacity and pricing. Short-term rentals can outperform long-term leasing through higher nightly revenue and flexible occupancy.

From July 2024 to June 2025, Toronto averaged ~11,000 active listings with a 181 CAD average daily rate and 68% occupancy. At a full 30 nights booked, that rate implies ~5,430 CAD in monthly revenue per listing; at the observed 68% occupancy, the figure is closer to ~3,700 CAD. Despite slower listing growth, Airbnb remains a viable alternative for many owners.

Pricing is the main challenge. New hosts often rely on guesswork or copy nearby listings without understanding what drives price. Location matters, but host attributes and amenities also shape outcomes.

Objectives

Quantify which listing and host features actually move nightly price, and use that model-driven signal to guide pricing decisions that improve occupancy and revenue. In parallel, compare neighbourhood-level performance to highlight where returns tend to be strongest and which features consistently support higher pricing across Toronto’s short-term rental market.

Data Sources

The subset of the Toronto Airbnb Listing Dataset used here spans July 2024 to June 2025. It contains more than 189,000 listing records (repeated monthly observations of the same listings) covering 27,029 unique listings (distinct properties counted once by listing ID). This structure supports both cross-sectional analysis, comparing listings at a single point in time or after deduplication, and time-series analysis of month-to-month changes in price, availability, and related outcomes.
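
The two views described above can be sketched with pandas. This is a minimal illustration on a made-up toy panel, not the project's actual preparation step; the column names `id`, `YearMonth`, and `price` follow the code later in this post, while the values and the choice to keep each listing's latest snapshot are assumptions.

```python
import pandas as pd

# Hypothetical toy panel: two listings observed across monthly snapshots.
panel = pd.DataFrame(
    {
        "id": [101, 101, 101, 202, 202],
        "YearMonth": ["2024-07", "2024-08", "2024-09", "2024-07", "2024-08"],
        "price": [150.0, 155.0, 160.0, 220.0, 210.0],
    }
)

# Cross-sectional view: keep the most recent snapshot of each listing.
cross_section = (
    panel.sort_values("YearMonth")
    .drop_duplicates(subset="id", keep="last")
    .reset_index(drop=True)
)

# Time-series view: average price per month across all listings.
monthly_mean = panel.groupby("YearMonth")["price"].mean()

print(len(panel), "records;", panel["id"].nunique(), "unique listings")
print(cross_section)
print(monthly_mean)
```

Keeping the last snapshot is only one reasonable deduplication rule; averaging each listing over its observed months would be another.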

Columns fall into three groups:

  1. Market and activity indicators: price, number_of_reviews, review_scores_rating (price used as the dependent variable; others as predictors).
  2. Property features: neighbourhood, room_type, property_type, bedrooms, bathrooms, accommodates, amenities, availability_365.
  3. Host features: host_id, host_is_superhost, host_listings_count, host_response_time.

Toronto spans 140 neighbourhoods, but Airbnb supply is heavily concentrated downtown. By listing volume, the top five are ‘Waterfront Communities-The Island’ (4,625), ‘Niagara’ (986), ‘Church-Yonge Corridor’ (810), ‘Annex’ (791), and ‘Kensington-Chinatown’ (712). Together, these five areas account for 7,924 listings, roughly 30% of the total, which points to demand clustered in a few hotspots and especially intense competition in those same pockets. A chart below shows how average monthly prices move over time in the top three neighbourhoods by listing count.

Analysis (Part I) - Building a Predictive Model

The Modeling Process

A linear regression model was developed using a standard applied econometrics workflow. Because price is heavily right-skewed, the target was set to log_price. An initial Statsmodels regression produced R² = 0.559. Predictors with p-values above 0.10 were then removed (intercept retained). Fit changed little: the F-statistic rose from 773.5 to 1031 with fewer terms, while R² stayed at 0.559. A Variance Inflation Factor (VIF) check showed no strong overlap among the input variables, so the model was not being distorted by redundant predictors.
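
For readers unfamiliar with the VIF check, here is a minimal sketch of what it measures, written with NumPy only. It is not the project's code: the helper `vif` and the simulated columns `a`, `b`, and `c` are hypothetical, and the project may well have used a library routine instead. Each predictor is regressed on the others, and VIF is 1 / (1 − R²) from that auxiliary regression.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF for each column of X: 1 / (1 - R^2) from regressing it on the others."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Auxiliary regression: column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of `a` (simulated redundancy)

vif_independent = vif(np.column_stack([a, b]))
vif_redundant = vif(np.column_stack([a, b, c]))
print("independent predictors:", vif_independent.round(2))
print("with a redundant copy: ", vif_redundant.round(2))
```

Values close to 1 indicate little overlap; common rules of thumb treat values above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a hard rule.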

Model fit improved modestly through transformations and interactions. Added terms such as bedrooms_squared, beds_squared, cleanliness_squared, persons_reviews, persons_condo, and bedrooms_hotel increased R² from 0.559 to 0.584.

Model Result Description

Final model: nightly rates were modeled on the log scale (log_price) using 47 inputs, a mix of original features and a few engineered ones. Every feature kept in the final version was statistically significant (p-value <= 0.010). The model’s adjusted R² was 0.584, meaning it explained about 58% of the variation in log price across listings.

The key effects aligned with expectations going into this analysis: listings that accommodate more guests were priced higher, with an estimated effect of +0.29 on log_price. Listings with more bedrooms were also priced higher (+0.15), as were listings with more bathrooms (+0.12). Several amenities were associated with higher prices, including having a TV (+0.0947), a dishwasher (+0.1624), and access to a gym (+0.1273). In contrast, heating (−0.0371) and air conditioning (−0.0351) were associated with slightly lower prices, likely because these features are common and therefore do not distinguish higher-end listings. On the host side, superhost status (+0.014) and being instant-bookable (+0.035) were associated with modestly higher prices.
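
Because the target is log_price, each coefficient is easier to read as an approximate percentage change in nightly price, via exp(β) − 1. The sketch below applies that standard conversion to a few coefficients quoted above; the helper name `pct_effect` is hypothetical. One caveat: the final model also includes squared and interaction terms for some variables (accommodates, for example), so the main-effect coefficient alone does not give the full marginal effect for those.

```python
import math

# Coefficients on log_price quoted in the text above.
coefs = {
    "accommodates": 0.29,
    "has_dishwasher": 0.1624,
    "has_heating": -0.0371,
    "instant_bookable": 0.035,
}

def pct_effect(beta: float) -> float:
    """Approximate % change in price for a one-unit increase, other terms held fixed."""
    return (math.exp(beta) - 1.0) * 100.0

for name, beta in coefs.items():
    print(f"{name:>18}: {beta:+.4f} -> {pct_effect(beta):+.1f}%")
```

For small coefficients the percentage is close to 100 × β; the gap widens as β grows, which is why the conversion matters most for the larger effects.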

Model Implications

A simple linear specification underfit observed pricing patterns; introducing non-linear transformations and interaction terms yielded a modest improvement in fit, consistent with the non-linear structure of real-world markets.

Interaction terms involving geography (e.g., region × bedrooms) did not materially improve model performance. This pattern suggests that location primarily shifts the overall price level, rather than altering how other attributes such as size, capacity, or amenities translate into price.
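
The pattern described above, where location shifts the price level without changing slopes, can be illustrated on simulated data. This is a sketch under stated assumptions, not a reproduction of the project's test: the variable `downtown`, the coefficients, and the noise level are all made up, and the data are generated with no true interaction.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
bedrooms = rng.integers(1, 5, size=n).astype(float)
downtown = rng.integers(0, 2, size=n).astype(float)  # hypothetical region dummy

# Simulated log price: region shifts the level; the bedroom slope is shared.
log_price = 4.0 + 0.15 * bedrooms + 0.30 * downtown + rng.normal(0, 0.4, size=n)

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an OLS fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

additive = np.column_stack([bedrooms, downtown])
interacted = np.column_stack([bedrooms, downtown, bedrooms * downtown])

r2_add = r_squared(additive, log_price)
r2_int = r_squared(interacted, log_price)
print(f"additive R^2:   {r2_add:.4f}")
print(f"interacted R^2: {r2_int:.4f}")
print(f"gain:           {r2_int - r2_add:.4f}")
```

When the interaction adds almost nothing to R², as here, a region dummy on its own is enough to capture the location effect.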

The resulting model offers a practical baseline for price-setting, but it remains a partial representation of the market. It does not incorporate real-time demand shocks, seasonal dynamics, or competitor price adjustments that can meaningfully affect short-run pricing.

Analysis (Part II) - Hypothesis Testing

Six statistical tests were used to check common assumptions about pricing and performance in Toronto’s Airbnb market. First, a two-sample t-test compared average nightly prices for superhosts versus non-superhosts. Superhosts showed slightly higher average prices, but the difference was not statistically significant, suggesting the badge is more likely to influence booking behavior than enable premium pricing. Second, a one-way ANOVA tested whether review counts differ across neighbourhoods and found highly significant differences, indicating that location strongly affects how much exposure a listing receives and how quickly reviews accumulate. Third, a one-way ANOVA tested whether prices differ across neighbourhoods and produced very strong evidence that neighbourhood is a major driver of pricing, with central or attraction-rich areas tending to sit at higher price levels than more peripheral areas.
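
A minimal sketch of the two test types above, using SciPy on simulated prices. All numbers here are made up for illustration, and the Welch variant of the t-test (`equal_var=False`) is an assumption; the text does not say which variant the project used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical nightly prices (CAD); group sizes and spreads are made up.
superhost = rng.normal(185, 60, size=400)
non_superhost = rng.normal(180, 60, size=600)

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_val = stats.ttest_ind(superhost, non_superhost, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")

# One-way ANOVA across three hypothetical neighbourhoods.
hood_a = rng.normal(220, 50, size=300)
hood_b = rng.normal(180, 50, size=300)
hood_c = rng.normal(150, 50, size=300)
f_stat, p_anova = stats.f_oneway(hood_a, hood_b, hood_c)
print(f"F = {f_stat:.1f}, p = {p_anova:.3g}")
```

With real listing data, the groups would come from splitting the price column by `host_is_superhost` or by neighbourhood rather than from simulation.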

Host tenure and listing scale were also evaluated. A two-sample t-test comparing newer hosts to longer-tenured hosts found a clear difference: hosts who joined within the last five years averaged 188.39 CAD per night, while longer-tenured hosts averaged 178.87 CAD, and this gap was highly significant (t = 4.95, p < 0.000001). Pearson correlation tests were then used to assess whether higher ratings translate into higher prices and whether hosts with more listings price differently. Rating versus price was statistically significant but negligible in strength, implying that near-perfect ratings do not meaningfully increase nightly rates on their own. Host listings count versus price showed a very weak negative relationship, consistent with portfolio hosts pricing slightly lower to support occupancy across multiple properties.
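
The "statistically significant but negligible" pattern is typical of large samples, and a short simulation shows why. This is a sketch on made-up data: the sample size, rating distribution, and effect size are assumptions chosen only to mimic a very weak relationship.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 20_000

# Simulated ratings clustered near the top of the 1-5 scale.
rating = np.clip(rng.normal(4.6, 0.25, size=n), 1.0, 5.0)
# Price depends only very weakly on rating; most variation is noise.
price = 180 + 15 * (rating - 4.6) + rng.normal(0, 80, size=n)

r, p = stats.pearsonr(rating, price)
print(f"r = {r:.3f}, p = {p:.3g}, shared variance = {r**2:.2%}")
```

With tens of thousands of observations, even a correlation this small clears conventional significance thresholds, so the size of r (and r²) is the more useful number for judging pricing power.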

Taken together, the hypothesis tests point to neighbourhood as the dominant factor shaping both visibility (reviews) and pricing. Host-level signals such as superhost status and ratings likely matter for trust and conversion, but they show limited direct pricing power in the data. Host tenure appears more informative than badges or ratings, with newer hosts charging higher nightly rates on average, potentially reflecting newer inventory, updated amenities, or different pricing expectations.

Methodology

Four sequential procedures were applied: (1) data cleaning and model preparation, (2) exploratory data analysis (EDA), (3) predictive modeling, and (4) hypothesis testing.

Results & Next Steps

Unsurprisingly, pricing in Toronto’s Airbnb universe mostly follows the laws of “listing physics.” Bigger places cost more. More people, more bedrooms, more bathrooms, and the price tends to climb. After that, a small set of amenities actually moves the needle in a noticeable way. Neighbourhood matters, sure, but it’s not the whole story. A well-chosen upgrade can outperform the strategy of simply being in a trendy postal code.

The hypothesis tests add a useful reality check. Superhost status and near-perfect ratings are great for trust, but they don’t reliably buy a meaningful price premium. Meanwhile, location still shows up loud and clear, and host tenure also lines up with measurable differences in pricing.

For investors, dense hotspots like ‘Waterfront Communities-The Island’ and ‘Niagara’ are like busy city intersections: lots of flow, lots of competition, and the margins can get squeezed. The more interesting opportunity can show up in quieter areas where supply is thinner but prices hold up, especially when a listing is engineered with features that consistently support higher rates. For new hosts, instant booking and a smart amenity mix remain some of the cleanest, most practical levers for improving revenue.

Code
from __future__ import annotations

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols

try:
    ROOT = Path(__file__).resolve().parent
except NameError:
    ROOT = Path.cwd()

DATA_DIR = ROOT / "data"

FIG_W = 9
FIG_H = 5
FIG_DPI = 110

color_blue = "#033c73"
color_indigo = "#6610f2"
color_purple = "#6f42c1"
color_red = "#751F2C"

plt.rcParams.update(
    {
        "figure.figsize": (FIG_W, FIG_H),
        "figure.dpi": FIG_DPI,
        "font.family": "Ramabhadra",
        "font.weight": "bold",
        "text.color": "black",
        "axes.labelcolor": "black",
        "axes.titlecolor": "black",
        "xtick.color": "black",
        "ytick.color": "black",
        "axes.titlesize": 14,
        "axes.titleweight": "bold",
        "axes.labelsize": 12,
        "axes.labelweight": "bold",
        "xtick.labelsize": 10,
        "ytick.labelsize": 10,
        "legend.fontsize": 10,
        "axes.grid": True,
        "grid.alpha": 0.25,
    }
)

def require_cols(frame: pd.DataFrame, cols: list[str]) -> None:
    missing = [c for c in cols if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing required column(s): {', '.join(missing)}")

def ensure_log_price(frame: pd.DataFrame) -> None:
    if "log_price" not in frame.columns:
        require_cols(frame, ["price"])
        frame["price"] = pd.to_numeric(frame["price"], errors="coerce")
        frame["log_price"] = np.log(frame["price"].where(frame["price"] > 0))

def fit_and_print(formula: str, frame: pd.DataFrame, label: str) -> object:
    model = ols(formula, data=frame).fit()
    print(f"\n=== {label} ===")
    print(model.summary())
    return model

def ensure_yearmonth(frame: pd.DataFrame) -> None:
    if "YearMonth" in frame.columns:
        frame["YearMonth"] = frame["YearMonth"].astype(str)
        return
    date_col = next(
        (c for c in ["last_scraped", "calendar_last_scraped", "scrape_date", "snapshot_date", "date"] if c in frame.columns),
        None,
    )
    if date_col is None:
        raise KeyError("Missing required column: YearMonth (and no usable scrape date column found to derive it).")
    dt = pd.to_datetime(frame[date_col], errors="coerce")
    frame["YearMonth"] = dt.dt.to_period("M").astype(str)

def plot_price_histogram(frame: pd.DataFrame, max_price: float = 2000) -> None:
    frame = frame.copy()
    frame["price"] = pd.to_numeric(frame["price"], errors="coerce")
    price_filt = frame.loc[frame["price"].le(max_price), "price"].dropna()

    fig, ax = plt.subplots()
    ax.hist(price_filt.values, bins=50, edgecolor="black", color=color_blue, alpha=0.9)
    ax.set_title(f"DISTRIBUTION OF PRICE (FILTERED, <= ${int(max_price)})")
    ax.set_xlabel("PRICE")
    ax.set_ylabel("FREQUENCY")
    fig.tight_layout()
    plt.show()

def plot_log_price_histogram(frame: pd.DataFrame, max_price: float = 2000) -> None:
    frame = frame.copy()
    frame["price"] = pd.to_numeric(frame["price"], errors="coerce")
    frame["log_price"] = np.log(frame["price"].where(frame["price"] > 0))
    filt = frame.loc[frame["price"].le(max_price), "log_price"].dropna()

    fig, ax = plt.subplots()
    ax.hist(filt.values, bins=50, edgecolor="black", color=color_blue, alpha=0.9)
    ax.set_title("DISTRIBUTION OF LOG-TRANSFORMED PRICE")
    ax.set_xlabel("LOG(PRICE)")
    ax.set_ylabel("FREQUENCY")
    fig.tight_layout()
    plt.show()

def plot_top3_neighbourhood_trends(frame: pd.DataFrame) -> None:
    require_cols(frame, ["neighbourhood_cleansed", "price", "YearMonth"])
    top3 = frame["neighbourhood_cleansed"].value_counts().head(3).index.tolist()
    df_top3 = frame[frame["neighbourhood_cleansed"].isin(top3)].copy()

    df_top3["price"] = pd.to_numeric(df_top3["price"], errors="coerce")
    grouped = (
        df_top3.groupby(["YearMonth", "neighbourhood_cleansed"], dropna=False)["price"]
        .mean()
        .unstack()
        .sort_index()
    )

    fig, ax = plt.subplots()
    palette = [color_blue, color_indigo, color_purple]
    for i, col in enumerate(grouped.columns[:3]):
        ax.plot(
            grouped.index,
            grouped[col].values,
            marker="o",
            linewidth=2.0,
            color=palette[i % len(palette)],
            alpha=0.9,
            label=str(col),
        )

    ax.set_title("AVERAGE MONTHLY PRICE FOR TOP 3 NEIGHBOURHOODS")
    ax.set_xlabel("YEAR-MONTH")
    ax.set_ylabel("AVERAGE PRICE (CAD)")
    ax.tick_params(axis="x", rotation=45)
    ax.legend(title="NEIGHBOURHOOD", bbox_to_anchor=(1.05, 1), loc="upper left")
    fig.tight_layout()
    plt.show()

def plot_hosts_joined_per_year(frame: pd.DataFrame) -> None:
    require_cols(frame, ["host_id", "host_since"])
    d = frame.copy()
    d["host_since"] = pd.to_datetime(d["host_since"], errors="coerce")
    d = d.dropna(subset=["host_since"])
    # Deduplicate on host_id (not listing id) so each host is counted once,
    # matching the "unique hosts" title below.
    d = d.drop_duplicates(subset="host_id")
    d["year_joined"] = d["host_since"].dt.year

    counts = d.groupby("year_joined")["host_id"].count().sort_index()
    print(counts)

    fig, ax = plt.subplots()
    ax.plot(counts.index.values, counts.values, marker="o", linewidth=2.0, color=color_indigo, alpha=0.9)
    ax.set_title("UNIQUE HOSTS JOINING PER YEAR")
    ax.set_xlabel("YEAR JOINED")
    ax.set_ylabel("NUMBER OF HOSTS")
    fig.tight_layout()
    plt.show()

def plot_price_vs_review_score(frame: pd.DataFrame, max_price: float = 2000) -> None:
    require_cols(frame, ["price", "review_scores_rating"])
    d = frame.copy()
    d["price"] = pd.to_numeric(d["price"], errors="coerce")
    d["review_scores_rating"] = pd.to_numeric(d["review_scores_rating"], errors="coerce")
    d = d.loc[d["price"].le(max_price)].dropna(subset=["price", "review_scores_rating"])

    fig, ax = plt.subplots()
    ax.scatter(
        d["review_scores_rating"].values,
        d["price"].values,
        s=18,
        alpha=0.55,
        color=color_blue,
        edgecolors="none",
    )
    ax.set_title("NIGHTLY RATE VS. REVIEW SCORE RATING")
    ax.set_xlabel("REVIEW SCORE")
    ax.set_ylabel("PRICE (CAD)")
    ax.set_ylim(0, 500)
    fig.tight_layout()
    plt.show()

def plot_availability_histogram(frame: pd.DataFrame) -> None:
    require_cols(frame, ["availability_365"])
    d = frame.copy()
    d["availability_365"] = pd.to_numeric(d["availability_365"], errors="coerce")
    vals = d["availability_365"].dropna()

    fig, ax = plt.subplots()
    ax.hist(vals.values, bins=30, edgecolor="black", color=color_blue, alpha=0.9)
    ax.set_title("LISTING AVAILABILITY ACROSS THE YEAR (0–365 DAYS)")
    ax.set_xlabel("DAYS AVAILABLE PER YEAR")
    ax.set_ylabel("NUMBER OF LISTINGS")
    fig.tight_layout()
    plt.show()

df = pd.read_csv(DATA_DIR / "Airbnb_02.csv")
ensure_log_price(df)

full_formula = (
    "log_price ~ accommodates + bedrooms + beds + number_of_reviews + review_scores_rating + "
    "review_scores_accuracy + review_scores_cleanliness + review_scores_checkin + review_scores_communication + "
    "review_scores_location + review_scores_value + calculated_host_listings_count + bathroom_count + host_is_superhost_binary + "
    "host_has_profile_pic_binary + host_identity_verified_binary + instant_bookable_binary + has_kitchen + has_wifi + has_tv + "
    "has_dishwasher + has_oven + has_washer + has_dryer + has_gym + has_patio_or_balcony + has_backyard + has_bed_linens + "
    "has_air_conditioning + has_heating + is_hotel_guesthouse + is_others + is_rental_unit_apartment_condo_loft + is_room + "
    "is_villa + room_type_private + in_East_End_Toronto + in_East_York + in_Etobicoke + in_Midtown_Toronto + in_North_York + "
    "in_Scarborough + in_West_End_Toronto + in_York"
)
full_model = fit_and_print(full_formula, df, "Full model")

=== Full model ===
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.559
Model:                            OLS   Adj. R-squared:                  0.558
Method:                 Least Squares   F-statistic:                     773.5
Date:                Fri, 02 Jan 2026   Prob (F-statistic):               0.00
Time:                        19:14:31   Log-Likelihood:                -19405.
No. Observations:               26884   AIC:                         3.890e+04
Df Residuals:                   26839   BIC:                         3.927e+04
Df Model:                          44                                         
Covariance Type:            nonrobust                                         
=======================================================================================================
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
Intercept                               2.8590      0.076     37.821      0.000       2.711       3.007
accommodates                            0.1565      0.003     54.391      0.000       0.151       0.162
bedrooms                                0.0890      0.006     14.162      0.000       0.077       0.101
beds                                   -0.0133      0.005     -2.753      0.006      -0.023      -0.004
number_of_reviews                       0.0005      6e-05      7.839      0.000       0.000       0.001
review_scores_rating                    0.0543      0.022      2.456      0.014       0.011       0.098
review_scores_accuracy                  0.0235      0.018      1.286      0.199      -0.012       0.059
review_scores_cleanliness               0.1396      0.014      9.830      0.000       0.112       0.167
review_scores_checkin                  -0.0747      0.016     -4.751      0.000      -0.105      -0.044
review_scores_communication             0.0104      0.018      0.587      0.558      -0.024       0.045
review_scores_location                  0.0883      0.015      5.727      0.000       0.058       0.119
review_scores_value                    -0.0814      0.016     -4.970      0.000      -0.113      -0.049
calculated_host_listings_count         -0.0055      0.000    -22.636      0.000      -0.006      -0.005
bathroom_count                          0.1045      0.007     15.732      0.000       0.091       0.118
host_is_superhost_binary                0.0109      0.007      1.592      0.111      -0.003       0.024
host_has_profile_pic_binary             0.0033      0.013      0.251      0.802      -0.022       0.029
host_identity_verified_binary          -0.0327      0.013     -2.573      0.010      -0.058      -0.008
instant_bookable_binary                 0.0268      0.007      3.609      0.000       0.012       0.041
has_kitchen                            -0.0126      0.010     -1.327      0.185      -0.031       0.006
has_wifi                                0.0127      0.007      1.788      0.074      -0.001       0.027
has_tv                                  0.1130      0.007     17.009      0.000       0.100       0.126
has_dishwasher                          0.1968      0.008     25.488      0.000       0.182       0.212
has_oven                               -0.0604      0.007     -8.199      0.000      -0.075      -0.046
has_washer                              0.0006      0.008      0.075      0.940      -0.015       0.016
has_dryer                              -0.0258      0.009     -2.804      0.005      -0.044      -0.008
has_gym                                 0.1459      0.013     11.045      0.000       0.120       0.172
has_patio_or_balcony                    0.0114      0.010      1.109      0.267      -0.009       0.031
has_backyard                           -0.0098      0.011     -0.892      0.372      -0.031       0.012
has_bed_linens                         -0.0063      0.008     -0.831      0.406      -0.021       0.009
has_air_conditioning                   -0.0330      0.007     -4.832      0.000      -0.046      -0.020
has_heating                            -0.0452      0.007     -6.041      0.000      -0.060      -0.031
is_hotel_guesthouse                     0.1536      0.014     11.007      0.000       0.126       0.181
is_others                               0.1330      0.086      1.549      0.121      -0.035       0.301
is_rental_unit_apartment_condo_loft     0.2063      0.008     26.034      0.000       0.191       0.222
is_room                                 0.1710      0.121      1.408      0.159      -0.067       0.409
is_villa                               -0.1297      0.061     -2.144      0.032      -0.248      -0.011
room_type_private                       0.5047      0.041     12.241      0.000       0.424       0.586
in_East_End_Toronto                    -0.1190      0.015     -8.157      0.000      -0.148      -0.090
in_East_York                           -0.2236      0.020    -10.933      0.000      -0.264      -0.184
in_Etobicoke                           -0.3293      0.013    -25.145      0.000      -0.355      -0.304
in_Midtown_Toronto                     -0.1210      0.012    -10.093      0.000      -0.144      -0.097
in_North_York                          -0.3074      0.011    -29.002      0.000      -0.328      -0.287
in_Scarborough                         -0.4250      0.012    -34.664      0.000      -0.449      -0.401
in_West_End_Toronto                    -0.1223      0.011    -11.504      0.000      -0.143      -0.101
in_York                                -0.2682      0.016    -16.589      0.000      -0.300      -0.236
==============================================================================
Omnibus:                     6540.729   Durbin-Watson:                   1.858
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            47387.094
Skew:                           0.979   Prob(JB):                         0.00
Kurtosis:                       9.202   Cond. No.                     2.45e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.45e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
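# Keep every non-intercept term whose p-value in the full model is at most 0.10.
sig_terms = [name for name, p in full_model.pvalues.items() if (name != "Intercept") and (p <= 0.10)]
# Fall back to the full formula if nothing passes the threshold.
reduced_formula = ("log_price ~ " + " + ".join(sig_terms)) if sig_terms else full_formula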
_ = fit_and_print(reduced_formula, df, "Reduced model (p ≤ 0.10)")

=== Reduced model (p ≤ 0.10) ===
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.559
Model:                            OLS   Adj. R-squared:                  0.558
Method:                 Least Squares   F-statistic:                     1031.
Date:                Fri, 02 Jan 2026   Prob (F-statistic):               0.00
Time:                        19:14:31   Log-Likelihood:                -19412.
No. Observations:               26884   AIC:                         3.889e+04
Df Residuals:                   26850   BIC:                         3.917e+04
Df Model:                          33                                         
Covariance Type:            nonrobust                                         
=======================================================================================================
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
Intercept                               2.8600      0.073     39.150      0.000       2.717       3.003
accommodates                            0.1567      0.003     54.568      0.000       0.151       0.162
bedrooms                                0.0882      0.006     14.086      0.000       0.076       0.100
beds                                   -0.0135      0.005     -2.804      0.005      -0.023      -0.004
number_of_reviews                       0.0005   5.83e-05      8.543      0.000       0.000       0.001
review_scores_rating                    0.0692      0.020      3.518      0.000       0.031       0.108
review_scores_cleanliness               0.1433      0.014     10.318      0.000       0.116       0.171
review_scores_checkin                  -0.0697      0.015     -4.774      0.000      -0.098      -0.041
review_scores_location                  0.0915      0.015      5.991      0.000       0.062       0.121
review_scores_value                    -0.0753      0.016     -4.749      0.000      -0.106      -0.044
calculated_host_listings_count         -0.0055      0.000    -22.881      0.000      -0.006      -0.005
bathroom_count                          0.1043      0.007     15.725      0.000       0.091       0.117
host_identity_verified_binary          -0.0308      0.013     -2.440      0.015      -0.056      -0.006
instant_bookable_binary                 0.0270      0.007      3.645      0.000       0.012       0.041
has_wifi                                0.0128      0.007      1.815      0.069      -0.001       0.027
has_tv                                  0.1133      0.006     17.455      0.000       0.101       0.126
has_dishwasher                          0.1948      0.007     26.818      0.000       0.181       0.209
has_oven                               -0.0622      0.007     -8.748      0.000      -0.076      -0.048
has_dryer                              -0.0253      0.008     -3.080      0.002      -0.041      -0.009
has_gym                                 0.1482      0.013     11.301      0.000       0.123       0.174
has_air_conditioning                   -0.0333      0.007     -5.039      0.000      -0.046      -0.020
has_heating                            -0.0447      0.007     -6.076      0.000      -0.059      -0.030
is_hotel_guesthouse                     0.1524      0.014     10.949      0.000       0.125       0.180
is_rental_unit_apartment_condo_loft     0.2055      0.008     26.251      0.000       0.190       0.221
is_villa                               -0.1347      0.060     -2.229      0.026      -0.253      -0.016
room_type_private                       0.5029      0.041     12.225      0.000       0.422       0.584
in_East_End_Toronto                    -0.1195      0.015     -8.206      0.000      -0.148      -0.091
in_East_York                           -0.2233      0.020    -10.924      0.000      -0.263      -0.183
in_Etobicoke                           -0.3302      0.013    -25.260      0.000      -0.356      -0.305
in_Midtown_Toronto                     -0.1215      0.012    -10.140      0.000      -0.145      -0.098
in_North_York                          -0.3085      0.011    -29.190      0.000      -0.329      -0.288
in_Scarborough                         -0.4260      0.012    -34.855      0.000      -0.450      -0.402
in_West_End_Toronto                    -0.1232      0.011    -11.616      0.000      -0.144      -0.102
in_York                                -0.2696      0.016    -16.725      0.000      -0.301      -0.238
==============================================================================
Omnibus:                     6550.375   Durbin-Watson:                   1.857
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            47428.768
Skew:                           0.981   Prob(JB):                         0.00
Kurtosis:                       9.204   Cond. No.                     1.56e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.56e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
engineering_inputs = [
    "accommodates","bedrooms","beds","review_scores_accuracy","review_scores_cleanliness",
    "calculated_host_listings_count","number_of_reviews",
    "is_rental_unit_apartment_condo_loft","is_hotel_guesthouse","has_oven","is_others","has_gym",
]
require_cols(df, engineering_inputs)

df = df.assign(
    persons_squared=df["accommodates"] ** 2,
    bedrooms_squared=df["bedrooms"] ** 2,
    beds_squared=df["beds"] ** 2,
    accuracy_squared=df["review_scores_accuracy"] ** 2,
    cleanliness_squared=df["review_scores_cleanliness"] ** 2,
    host_listings_squared=df["calculated_host_listings_count"] ** 2,
    bedrooms_persons=df["bedrooms"] * df["accommodates"],
    persons_reviews=df["number_of_reviews"] * df["accommodates"],
    persons_condo=df["is_rental_unit_apartment_condo_loft"] * df["accommodates"],
    persons_hotel=df["is_hotel_guesthouse"] * df["accommodates"],
    persons_oven=df["has_oven"] * df["accommodates"],
    persons_others=df["is_others"] * df["accommodates"],
    persons_gym=df["has_gym"] * df["accommodates"],
    bedrooms_condo=df["is_rental_unit_apartment_condo_loft"] * df["bedrooms"],
    bedrooms_hotel=df["is_hotel_guesthouse"] * df["bedrooms"],
)

new_formula = (
    "log_price ~ accommodates + bedrooms + number_of_reviews + review_scores_rating + review_scores_accuracy + "
    "review_scores_checkin + review_scores_location + review_scores_value + calculated_host_listings_count + bathroom_count + "
    "host_is_superhost_binary + host_identity_verified_binary + instant_bookable_binary + has_wifi + has_tv + has_dishwasher + "
    "has_oven + has_dryer + has_gym + has_bed_linens + has_air_conditioning + has_heating + is_hotel_guesthouse + is_others + "
    "is_rental_unit_apartment_condo_loft + is_villa + room_type_private + in_East_End_Toronto + in_East_York + in_Etobicoke + "
    "in_Midtown_Toronto + in_North_York + in_Scarborough + in_West_End_Toronto + in_York + persons_squared + "
    "bedrooms_persons + persons_reviews + bedrooms_squared + beds_squared + accuracy_squared + cleanliness_squared + "
    "host_listings_squared + persons_condo + persons_hotel + persons_oven + persons_others + bedrooms_condo + bedrooms_hotel"
)
_ = fit_and_print(new_formula, df, "Transformed model")

=== Transformed model ===
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.584
Model:                            OLS   Adj. R-squared:                  0.584
Method:                 Least Squares   F-statistic:                     770.0
Date:                Fri, 02 Jan 2026   Prob (F-statistic):               0.00
Time:                        19:14:31   Log-Likelihood:                -18612.
No. Observations:               26884   AIC:                         3.732e+04
Df Residuals:                   26834   BIC:                         3.773e+04
Df Model:                          49                                         
Covariance Type:            nonrobust                                         
=======================================================================================================
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
Intercept                               3.1749      0.102     31.117      0.000       2.975       3.375
accommodates                            0.2938      0.006     47.968      0.000       0.282       0.306
bedrooms                                0.1470      0.015      9.814      0.000       0.118       0.176
number_of_reviews                       0.0006      0.000      5.777      0.000       0.000       0.001
review_scores_rating                    0.0575      0.020      2.831      0.005       0.018       0.097
review_scores_accuracy                 -0.1593      0.050     -3.182      0.001      -0.257      -0.061
review_scores_checkin                  -0.0733      0.014     -5.092      0.000      -0.102      -0.045
review_scores_location                  0.0961      0.015      6.402      0.000       0.067       0.126
review_scores_value                    -0.0834      0.016     -5.267      0.000      -0.114      -0.052
calculated_host_listings_count         -0.0094      0.001    -16.141      0.000      -0.011      -0.008
bathroom_count                          0.1218      0.007     18.562      0.000       0.109       0.135
host_is_superhost_binary                0.0144      0.007      2.148      0.032       0.001       0.028
host_identity_verified_binary          -0.0305      0.012     -2.476      0.013      -0.055      -0.006
instant_bookable_binary                 0.0345      0.007      4.754      0.000       0.020       0.049
has_wifi                                0.0096      0.007      1.401      0.161      -0.004       0.023
has_tv                                  0.0952      0.006     14.985      0.000       0.083       0.108
has_dishwasher                          0.1623      0.007     21.720      0.000       0.148       0.177
has_oven                               -0.0301      0.012     -2.497      0.013      -0.054      -0.006
has_dryer                              -0.0124      0.008     -1.548      0.122      -0.028       0.003
has_gym                                 0.1322      0.013     10.350      0.000       0.107       0.157
has_bed_linens                         -0.0142      0.007     -1.931      0.053      -0.029       0.000
has_air_conditioning                   -0.0346      0.006     -5.375      0.000      -0.047      -0.022
has_heating                            -0.0322      0.007     -4.497      0.000      -0.046      -0.018
is_hotel_guesthouse                     0.3893      0.032     12.331      0.000       0.327       0.451
is_others                               0.6169      0.151      4.083      0.000       0.321       0.913
is_rental_unit_apartment_condo_loft     0.4160      0.014     28.794      0.000       0.388       0.444
is_villa                               -0.1031      0.059     -1.756      0.079      -0.218       0.012
room_type_private                       0.4530      0.040     11.299      0.000       0.374       0.532
in_East_End_Toronto                    -0.1500      0.014    -10.562      0.000      -0.178      -0.122
in_East_York                           -0.2525      0.020    -12.686      0.000      -0.292      -0.214
in_Etobicoke                           -0.3339      0.013    -26.237      0.000      -0.359      -0.309
in_Midtown_Toronto                     -0.1307      0.012    -11.190      0.000      -0.154      -0.108
in_North_York                          -0.2989      0.010    -29.011      0.000      -0.319      -0.279
in_Scarborough                         -0.4194      0.012    -35.220      0.000      -0.443      -0.396
in_West_End_Toronto                    -0.1305      0.010    -12.597      0.000      -0.151      -0.110
in_York                                -0.2716      0.016    -17.315      0.000      -0.302      -0.241
persons_squared                        -0.0132      0.001    -23.052      0.000      -0.014      -0.012
bedrooms_persons                        0.0091      0.002      4.399      0.000       0.005       0.013
persons_reviews                     -7.886e-05   3.27e-05     -2.414      0.016      -0.000   -1.48e-05
bedrooms_squared                       -0.0184      0.003     -6.521      0.000      -0.024      -0.013
beds_squared                           -0.0033      0.001     -4.456      0.000      -0.005      -0.002
accuracy_squared                        0.0249      0.006      4.060      0.000       0.013       0.037
cleanliness_squared                     0.0165      0.002      9.813      0.000       0.013       0.020
host_listings_squared                5.017e-05   6.26e-06      8.008      0.000    3.79e-05    6.24e-05
persons_condo                          -0.0310      0.005     -6.197      0.000      -0.041      -0.021
persons_hotel                          -0.0529      0.011     -4.646      0.000      -0.075      -0.031
persons_oven                           -0.0090      0.003     -3.031      0.002      -0.015      -0.003
persons_others                         -0.1259      0.030     -4.157      0.000      -0.185      -0.067
bedrooms_condo                         -0.0945      0.012     -7.911      0.000      -0.118      -0.071
bedrooms_hotel                         -0.0761      0.027     -2.865      0.004      -0.128      -0.024
==============================================================================
Omnibus:                     7483.462   Durbin-Watson:                   1.858
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            59897.984
Skew:                           1.116   Prob(JB):                         0.00
Kurtosis:                       9.963   Cond. No.                     6.38e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.38e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
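
Because the dependent variable is log_price, the coefficients above are easier to read once converted. The sketch below is illustrative and not part of the original analysis: it copies a handful of coefficients from the "Transformed model" summary (rounded as printed), converts two 0/1 features into approximate percent differences in price via exp(beta) - 1, and evaluates the slope of log_price with respect to accommodates, which depends on the squared and interaction terms. The helper names `pct_effect` and `accommodates_slope` are hypothetical, and only the condo interaction is included in the slope; the hotel, oven, and "others" interactions are held at zero. These are conditional associations, not causal effects.

```python
import math

# Coefficients copied from the "Transformed model" summary above (rounded as printed).
coef = {
    "accommodates": 0.2938,
    "persons_squared": -0.0132,
    "bedrooms_persons": 0.0091,
    "persons_reviews": -7.886e-05,
    "persons_condo": -0.0310,
    "has_dishwasher": 0.1623,
    "in_Scarborough": -0.4194,
}


def pct_effect(beta: float) -> float:
    """Percent difference in price implied by a 0/1 feature in a log-price model."""
    return (math.exp(beta) - 1.0) * 100.0


def accommodates_slope(persons, bedrooms=0, reviews=0, condo=0):
    """Derivative of log_price with respect to accommodates.

    Hypothetical helper: only the condo interaction is included; the hotel,
    oven, and "others" interactions are treated as zero.
    """
    return (
        coef["accommodates"]
        + 2 * coef["persons_squared"] * persons
        + coef["bedrooms_persons"] * bedrooms
        + coef["persons_reviews"] * reviews
        + coef["persons_condo"] * condo
    )


for name in ("has_dishwasher", "in_Scarborough"):
    print(f"{name}: {pct_effect(coef[name]):+.1f}% relative to the omitted baseline")

# Guest count where the quadratic in accommodates peaks when every
# interaction term is zero (a stylised baseline, not a real listing).
peak = -coef["accommodates"] / (2 * coef["persons_squared"])
print(f"baseline quadratic peaks near {peak:.1f} guests")

# Slope for a one-bedroom condo that sleeps two and has no reviews.
print(f"slope at 2 guests: {accommodates_slope(2, bedrooms=1, condo=1):.4f}")
```

The negative persons_squared coefficient means each additional guest adds less to the predicted price than the one before, and the negative property-type interactions flatten that slope further for condos and hotels.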
Code
new_reduced_formula = (
    "log_price ~ accommodates + bedrooms + number_of_reviews + review_scores_rating + review_scores_accuracy + "
    "review_scores_checkin + review_scores_location + review_scores_value + calculated_host_listings_count + bathroom_count + "
    "host_is_superhost_binary + host_identity_verified_binary + instant_bookable_binary + has_tv + has_dishwasher + "
    "has_oven + has_gym + has_bed_linens + has_air_conditioning + has_heating + is_hotel_guesthouse + is_others + "
    "is_rental_unit_apartment_condo_loft + is_villa + room_type_private + in_East_End_Toronto + in_East_York + in_Etobicoke + "
    "in_Midtown_Toronto + in_North_York + in_Scarborough + in_West_End_Toronto + in_York + persons_squared + bedrooms_persons + "
    "persons_reviews + bedrooms_squared + beds_squared + accuracy_squared + cleanliness_squared + host_listings_squared + "
    "persons_condo + persons_hotel + persons_oven + persons_others + bedrooms_condo + bedrooms_hotel"
)
_ = fit_and_print(new_reduced_formula, df, "Transformed reduced model")

=== Transformed reduced model ===
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_price   R-squared:                       0.584
Model:                            OLS   Adj. R-squared:                  0.584
Method:                 Least Squares   F-statistic:                     802.6
Date:                Fri, 02 Jan 2026   Prob (F-statistic):               0.00
Time:                        19:14:31   Log-Likelihood:                -18614.
No. Observations:               26884   AIC:                         3.732e+04
Df Residuals:                   26836   BIC:                         3.772e+04
Df Model:                          47                                         
Covariance Type:            nonrobust                                         
=======================================================================================================
                                          coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
Intercept                               3.1819      0.102     31.218      0.000       2.982       3.382
accommodates                            0.2942      0.006     48.046      0.000       0.282       0.306
bedrooms                                0.1475      0.015      9.850      0.000       0.118       0.177
number_of_reviews                       0.0006      0.000      5.774      0.000       0.000       0.001
review_scores_rating                    0.0579      0.020      2.850      0.004       0.018       0.098
review_scores_accuracy                 -0.1585      0.050     -3.167      0.002      -0.257      -0.060
review_scores_checkin                  -0.0736      0.014     -5.109      0.000      -0.102      -0.045
review_scores_location                  0.0962      0.015      6.404      0.000       0.067       0.126
review_scores_value                    -0.0836      0.016     -5.281      0.000      -0.115      -0.053
calculated_host_listings_count         -0.0094      0.001    -16.203      0.000      -0.011      -0.008
bathroom_count                          0.1217      0.007     18.555      0.000       0.109       0.135
host_is_superhost_binary                0.0144      0.007      2.134      0.033       0.001       0.028
host_identity_verified_binary          -0.0303      0.012     -2.462      0.014      -0.054      -0.006
instant_bookable_binary                 0.0351      0.007      4.834      0.000       0.021       0.049
has_tv                                  0.0948      0.006     14.936      0.000       0.082       0.107
has_dishwasher                          0.1624      0.007     21.753      0.000       0.148       0.177
has_oven                               -0.0315      0.012     -2.618      0.009      -0.055      -0.008
has_gym                                 0.1276      0.012     10.217      0.000       0.103       0.152
has_bed_linens                         -0.0130      0.007     -1.770      0.077      -0.027       0.001
has_air_conditioning                   -0.0355      0.006     -5.586      0.000      -0.048      -0.023
has_heating                            -0.0370      0.007     -5.647      0.000      -0.050      -0.024
is_hotel_guesthouse                     0.3896      0.032     12.343      0.000       0.328       0.452
is_others                               0.6215      0.151      4.114      0.000       0.325       0.918
is_rental_unit_apartment_condo_loft     0.4175      0.014     28.935      0.000       0.389       0.446
is_villa                               -0.1055      0.059     -1.797      0.072      -0.221       0.010
room_type_private                       0.4512      0.040     11.256      0.000       0.373       0.530
in_East_End_Toronto                    -0.1498      0.014    -10.553      0.000      -0.178      -0.122
in_East_York                           -0.2532      0.020    -12.720      0.000      -0.292      -0.214
in_Etobicoke                           -0.3344      0.013    -26.278      0.000      -0.359      -0.309
in_Midtown_Toronto                     -0.1311      0.012    -11.224      0.000      -0.154      -0.108
in_North_York                          -0.2995      0.010    -29.089      0.000      -0.320      -0.279
in_Scarborough                         -0.4202      0.012    -35.311      0.000      -0.444      -0.397
in_West_End_Toronto                    -0.1302      0.010    -12.569      0.000      -0.150      -0.110
in_York                                -0.2715      0.016    -17.306      0.000      -0.302      -0.241
persons_squared                        -0.0132      0.001    -23.082      0.000      -0.014      -0.012
bedrooms_persons                        0.0091      0.002      4.394      0.000       0.005       0.013
persons_reviews                     -7.968e-05   3.27e-05     -2.440      0.015      -0.000   -1.57e-05
bedrooms_squared                       -0.0185      0.003     -6.554      0.000      -0.024      -0.013
beds_squared                           -0.0033      0.001     -4.448      0.000      -0.005      -0.002
accuracy_squared                        0.0247      0.006      4.032      0.000       0.013       0.037
cleanliness_squared                     0.0166      0.002      9.823      0.000       0.013       0.020
host_listings_squared                5.018e-05   6.24e-06      8.036      0.000    3.79e-05    6.24e-05
persons_condo                          -0.0312      0.005     -6.239      0.000      -0.041      -0.021
persons_hotel                          -0.0529      0.011     -4.640      0.000      -0.075      -0.031
persons_oven                           -0.0090      0.003     -3.037      0.002      -0.015      -0.003
persons_others                         -0.1267      0.030     -4.181      0.000      -0.186      -0.067
bedrooms_condo                         -0.0949      0.012     -7.945      0.000      -0.118      -0.071
bedrooms_hotel                         -0.0762      0.027     -2.869      0.004      -0.128      -0.024
==============================================================================
Omnibus:                     7466.830   Durbin-Watson:                   1.859
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            59688.570
Skew:                           1.114   Prob(JB):                         0.00
Kurtosis:                       9.952   Cond. No.                     6.38e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.38e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
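
The reduced model drops has_wifi and has_dryer, so the two fits are nested and can be compared with a likelihood-ratio test. The sketch below is illustrative rather than the original workflow: it uses only the values printed in the two summaries above and recomputes AIC and BIC as a consistency check. The printed log-likelihoods are rounded to the nearest unit, so the true statistic could lie anywhere between roughly 2 and 6 and the p-value is a rough guide only. If `fit_and_print` returns the fitted statsmodels results (an assumption; its return value is discarded above), the exact log-likelihood is available as the results object's `llf` attribute.

```python
import math

# Values read from the two summaries above; log-likelihoods are rounded as printed.
n_obs = 26884
llf_full, k_full = -18612.0, 50        # 49 slopes + intercept
llf_reduced, k_reduced = -18614.0, 48  # 47 slopes + intercept


def aic(llf: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * llf


def bic(llf: float, k: int, n: int) -> float:
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2 * llf


# Likelihood-ratio statistic for dropping has_wifi and has_dryer.
lr_stat = 2 * (llf_full - llf_reduced)
df_diff = k_full - k_reduced

# With exactly 2 degrees of freedom, the chi-square survival function
# has the closed form exp(-x / 2), so no statistics library is needed.
assert df_diff == 2
p_value = math.exp(-lr_stat / 2)

print(f"AIC full / reduced: {aic(llf_full, k_full):.0f} / {aic(llf_reduced, k_reduced):.0f}")
print(f"BIC full / reduced: {bic(llf_full, k_full, n_obs):.0f} / {bic(llf_reduced, k_reduced, n_obs):.0f}")
print(f"LR statistic: {lr_stat:.1f} on {df_diff} df, approximate p-value: {p_value:.3f}")
```

On the rounded figures the test does not reject the reduced model at the 5% level, which agrees with the individual p-values for has_wifi (0.161) and has_dryer (0.122) in the full model and with the unchanged adjusted R-squared of 0.584.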
Code
df_v01 = pd.read_csv(DATA_DIR / "Airbnb_01.csv")
ensure_yearmonth(df_v01)

plot_price_histogram(df_v01, max_price=2000)

Code
plot_log_price_histogram(df_v01, max_price=2000)

Code
plot_top3_neighbourhood_trends(df_v01)

Code
df1 = pd.read_csv(DATA_DIR / "Airbnb_03.csv")

plot_hosts_joined_per_year(df1)

year_joined
2008       1
2009      52
2010     152
2011     621
2012     920
2013    1399
2014    2129
2015    3309
2016    3431
2017    2620
2018    2445
2019    2300
2020    1005
2021    1274
2022    2209
2023    3329
2024    3399
2025     580
Name: id, dtype: int64

Code
plot_price_vs_review_score(df1, max_price=2000)

Code
plot_availability_histogram(df1)

Code
library(readr)
library(dplyr)
library(leaflet)
library(htmltools)
library(htmlwidgets)
library(scales)

color_blue   <- "#033c73"
color_indigo <- "#6610f2"
color_purple <- "#6f42c1"
color_red    <- "#751F2C"
color_grey   <- "#9e9e9e"

df <- read_csv(file.path("data", "Airbnb_03.csv"), show_col_types = FALSE)

stopifnot(all(c("id", "latitude", "longitude") %in% names(df)))

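df <- df %>%
  mutate(
    # Normalise listing IDs to plain strings and strip any trailing ".0" (e.g. from IDs exported as floats)
    id = trimws(as.character(id)),
    id = sub("\\.0+$", "", id),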
    latitude  = as.numeric(latitude),
    longitude = as.numeric(longitude),
    price = if ("price" %in% names(.)) parse_number(as.character(price)) else NA_real_
  ) %>%
  filter(
    !is.na(id), id != "",
    is.finite(latitude), is.finite(longitude)
  ) %>%
  mutate(price = ifelse(is.finite(price), price, NA_real_))

map_df <- df %>%
  group_by(id) %>%
  summarise(
    latitude  = first(latitude),
    longitude = first(longitude),
    avg_nightly_price = if (all(is.na(price))) NA_real_ else mean(price, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(avg_nightly_price = ifelse(is.finite(avg_nightly_price), avg_nightly_price, NA_real_))

stopifnot(nrow(map_df) == dplyr::n_distinct(map_df$id))

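# Build the colour-scale range from listings that have a finite average price
domain_vals <- map_df$avg_nightly_price
domain_vals <- domain_vals[is.finite(domain_vals)]

# With at least 10 priced listings, cap the scale at the 2nd and 98th percentiles
# so a few extreme prices do not wash out the colours; otherwise fall back to
# the observed min/max, or to a placeholder 0-1 range when no prices exist
if (length(domain_vals) >= 10) {
  q <- quantile(domain_vals, probs = c(0.02, 0.98), na.rm = TRUE, names = FALSE, type = 8)
  lo <- q[1]
  hi <- q[2]
} else if (length(domain_vals) > 0) {
  lo <- min(domain_vals, na.rm = TRUE)
  hi <- max(domain_vals, na.rm = TRUE)
} else {
  lo <- 0
  hi <- 1
}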

map_df <- map_df %>%
  mutate(
    price_for_color = pmin(pmax(avg_nightly_price, lo), hi),
    popup_txt = paste0(
      "<div class='ram'>",
      "<div style='font-size:14px;font-weight:700;margin-bottom:4px;'>AVERAGE NIGHTLY PRICE</div>",
      "<div style='font-size:13px;'>",
      ifelse(is.na(avg_nightly_price), "PRICE MISSING", dollar(avg_nightly_price)),
      "</div>",
      "<div style='font-size:12px;opacity:0.85;margin-top:4px;'>LISTING ID: ", htmlEscape(id), "</div>",
      "</div>"
    )
  )

pal <- colorNumeric(
  palette = colorRampPalette(c(color_blue, color_indigo, color_purple, color_red))(256),
  domain = c(lo, hi),
  na.color = color_grey
)

m <- leaflet(map_df) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addControl(
    html = paste0(
      "<div class='ram' style='font-size:16px;font-weight:800;'>TORONTO AIRBNB PROPERTIES</div>",
      "<div class='ram' style='font-size:12px;opacity:0.85;margin-top:2px;'>One dot per unique listing ID. Colors use a capped scale (P2–P98).</div>"
    ),
    position = "topleft"
  ) %>%
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    radius = 3,
    stroke = TRUE,
    weight = 0.6,
    color = "#111111",
    fillColor = ~pal(price_for_color),
    fillOpacity = 0.75,
    popup = ~popup_txt
  ) %>%
  addLegend(
    position = "bottomright",
    pal = pal,
    values = c(lo, hi),
    title = "AVG NIGHTLY PRICE (CAPPED)",
    labFormat = labelFormat(prefix = "$", big.mark = ","),
    opacity = 1
  )

m %>%
  prependContent(
    tags$link(rel = "stylesheet", href = "https://fonts.googleapis.com/css2?family=Ramabhadra&display=swap"),
    tags$style(HTML("
      .leaflet-container, .leaflet-control, .leaflet-popup-content, .leaflet-tooltip { font-family: 'Ramabhadra', sans-serif !important; }
      .ram { font-family: 'Ramabhadra', sans-serif !important; }
    "))
  )