Pricing determinants, predictive modeling, and hypothesis testing
predictive analytics
short-term rentals
housing economics
regression modeling
feature engineering
hypothesis testing
Python
Author
A. Srikanth
Published
January 2, 2026
Project Spotlight
Disclaimer
This analysis was completed as a course project for the Master of Management Analytics (MMA) program at Queen’s University. It is presented for educational purposes and reflects project-scoped assumptions and methods; results may not generalize beyond the dataset and period analyzed. This content is not financial, legal, or investment advice.
Collaborators: Adam Tang (MMA, ’26) • Balamira Gurumoorthy (MMA, ’26) • Bella Wang (MMA, ’26) • Esha Fatim (MMA, ’26) • Mohan Li (MMA, ’26) • Yukang Fu (MMA, ’26)
Context
Short-term rental platforms like Airbnb have reshaped travel and real estate use. In Toronto, demand stays high thanks to tourism, business travel, and constrained hotel capacity and pricing. Short-term rentals can outperform long-term leasing through higher nightly revenue and flexible occupancy.
From July 2024 to June 2025, Toronto averaged ~11,000 active listings with a 181 CAD average daily rate and 68% occupancy, translating to ~5,430 CAD in average monthly revenue per listing. Despite slower listing growth, Airbnb remains a viable alternative for many owners.
Pricing is the main challenge. New hosts often rely on guesswork or copy nearby listings without understanding what drives price. Location matters, but host attributes and amenities also shape outcomes.
Objectives
Quantify which listing and host features actually move nightly price, and use that model-driven signal to guide pricing decisions that improve occupancy and revenue. In parallel, compare neighbourhood-level performance to highlight where returns tend to be strongest and which features consistently support higher pricing across Toronto’s short-term rental market.
Data Sources
The subset of the Toronto Airbnb Listing Dataset used here spans July 2024 to June 2025 and contains more than 189,000 listing records, reflecting repeated monthly observations of the same listings, alongside 27,029 unique listings, i.e., distinct properties counted once by listing ID. This dual structure supports both cross-sectional analysis, comparing listings at a single point in time (or after deduplication), and time-series analysis, tracking month-to-month changes in price, availability, and related outcomes.
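As a rough illustration of that dual structure, the pandas sketch below separates the deduplicated cross-section from the monthly panel. The file name and column names (id, month, price) are illustrative assumptions, not the dataset's documented schema.

import pandas as pd

listings = pd.read_csv("toronto_listings_2024_2025.csv")  # hypothetical file name

# Cross-sectional view: one row per property, keeping the latest monthly snapshot
cross_section = listings.sort_values("month").drop_duplicates("id", keep="last")

# Time-series view: month-by-month market averages across all active listings
monthly = listings.groupby("month")[["price"]].mean()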
Columns fall into three groups:
Market and activity indicators: price, number_of_reviews, review_scores_rating (price used as the dependent variable; others as predictors).
Toronto spans 140 neighbourhoods, but Airbnb supply is heavily concentrated downtown. By listing volume, the top five are ‘Waterfront Communities-The Island’ (4,625), ‘Niagara’ (986), ‘Church-Yonge Corridor’ (810), ‘Annex’ (791), and ‘Kensington-Chinatown’ (712). Together, these areas make up more than 30% of all listings, which points to demand clustered in a few hotspots and competition that’s especially intense in those same pockets. A chart below shows how average monthly prices move over time in the top three neighbourhoods by listing count.
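A minimal sketch of how a chart like that could be produced, reusing the listings frame from the sketch above; the neighbourhood_cleansed column name is an assumption about the schema.

import matplotlib.pyplot as plt

top3 = ["Waterfront Communities-The Island", "Niagara", "Church-Yonge Corridor"]
subset = listings[listings["neighbourhood_cleansed"].isin(top3)]
avg_price = subset.groupby(["month", "neighbourhood_cleansed"])["price"].mean().unstack()

avg_price.plot(marker="o")
plt.ylabel("Average nightly price (CAD)")
plt.title("Average monthly price, top three neighbourhoods by listing count")
plt.show()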
Analysis (Part I) - Building a Predictive Model
The Modeling Process
A linear regression model was developed using a standard applied econometrics workflow. Since price is heavily right-skewed, the target was set to log_price. An initial Statsmodels regression produced R² = 0.559. Predictors with p-values of 0.10 or above were removed (the intercept was retained); fit changed little (the F-statistic rose from 774.1 to 1001, with R² unchanged). A Variance Inflation Factor (VIF) check showed no serious multicollinearity among the remaining predictors, so the coefficient estimates were not being distorted by redundant inputs.
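The VIF check can be reproduced along these lines; the statsmodels call is standard, but the specific predictor list shown here is illustrative rather than the project's exact specification.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

X = add_constant(df[["accommodates", "bedrooms", "bathrooms", "number_of_reviews"]].dropna())
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # values well below the usual 5-10 rule of thumb indicate limited redundancy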
Model fit improved modestly through transformations and interactions. Added terms such as bedrooms_squared, beds_squared, cleanliness_squared, persons_reviews, persons_condo, and bedrooms_hotel increased R² to 0.585.
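The engineered terms named above can be built directly in pandas. The sketch below assumes underlying columns such as beds, review_scores_cleanliness, and 0/1 indicators for condo and hotel-room listings; those source column names are assumptions, while the engineered names match the model terms listed above.

# Squared and interaction terms added to the richer specification
df["bedrooms_squared"] = df["bedrooms"] ** 2
df["beds_squared"] = df["beds"] ** 2
df["cleanliness_squared"] = df["review_scores_cleanliness"] ** 2
df["persons_reviews"] = df["accommodates"] * df["number_of_reviews"]
df["persons_condo"] = df["accommodates"] * df["is_condo"]        # is_condo: assumed 0/1 flag
df["bedrooms_hotel"] = df["bedrooms"] * df["is_hotel_room"]      # is_hotel_room: assumed 0/1 flag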
Model Result Description
Final model: a pricing model was built to predict nightly rates after taking the log of price, using 47 inputs (a mix of original features plus a few engineered ones). Every feature kept in the final version showed a clear signal in the data (p-value <= 0.010). The model’s adjusted R² was 0.584, meaning it captured about 58% of the differences in prices across listings after the log transform.
The key effects aligned with expectations going into this analysis: listings that accommodate more guests were priced higher, with an estimated effect of +0.29 on log_price. Listings with more bedrooms were also priced higher (+0.15), as were listings with more bathrooms (+0.12). Several amenities were associated with higher prices, including having a TV (+0.0947), a dishwasher (+0.1624), and access to a gym (+0.1273). In contrast, heating (−0.0371) and air conditioning (−0.0351) were associated with slightly lower prices, likely because these features are common and therefore do not distinguish higher-end listings. On the host side, superhost status (+0.014) and being instant-bookable (+0.035) were associated with modestly higher prices.
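Because the target is log_price, a coefficient β translates into an approximate percentage effect of (e^β − 1) × 100 on the nightly rate; for example, the +0.1624 dishwasher coefficient corresponds to roughly a 17.6% higher price, all else equal. A quick check:

import numpy as np

def pct_effect(beta):
    """Approximate % change in nightly price implied by a log-price coefficient."""
    return (np.exp(beta) - 1) * 100

for name, beta in {"TV": 0.0947, "dishwasher": 0.1624, "gym": 0.1273, "heating": -0.0371}.items():
    print(f"{name}: {pct_effect(beta):+.1f}% vs. an otherwise identical listing")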
Model Implications
A simple linear specification underfit observed pricing patterns; introducing non-linear transformations and interaction terms yielded a modest improvement in fit, consistent with the non-linear structure of real-world markets.
Interaction terms involving geography (e.g., region × bedrooms) did not materially improve model performance. This pattern suggests that location primarily shifts the overall price level, rather than altering how other attributes such as size, capacity, or amenities translate into price.
The resulting model offers a practical baseline for price-setting, but it remains a partial representation of the market. It does not incorporate real-time demand shocks, seasonal dynamics, or competitor price adjustments that can meaningfully affect short-run pricing.
Analysis (Part II) - Hypothesis Testing
Five statistical tests were used to check common assumptions about pricing and performance in Toronto’s Airbnb market. First, a two-sample t-test compared average nightly prices for superhosts versus non-superhosts. Superhosts showed slightly higher average prices, but the difference was not statistically significant, suggesting the badge is more likely to influence booking behavior than enable premium pricing. Second, a one-way ANOVA tested whether review counts differ across neighbourhoods and found highly significant differences, indicating that location strongly affects how much exposure a listing receives and how quickly reviews accumulate. Third, a one-way ANOVA tested whether prices differ across neighbourhoods and produced very strong evidence that neighbourhood is a major driver of pricing, with central or attraction-rich areas tending to sit at higher price levels than more peripheral areas.
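A sketch of how the first three tests can be run with SciPy. The host_is_superhost and neighbourhood_cleansed column names (and the 't'/'f' coding) are assumptions about the schema, and options such as Welch vs. pooled-variance t-tests may differ from the project's implementation.

from scipy import stats

# Test 1: superhost vs. non-superhost nightly prices (Welch's t-test)
super_prices = df.loc[df["host_is_superhost"] == "t", "price"].dropna()
other_prices = df.loc[df["host_is_superhost"] == "f", "price"].dropna()
t_stat, p_super = stats.ttest_ind(super_prices, other_prices, equal_var=False)

# Tests 2 and 3: one-way ANOVA for review counts and prices across neighbourhoods
hoods = df.groupby("neighbourhood_cleansed")
f_reviews, p_reviews = stats.f_oneway(*[g["number_of_reviews"].dropna() for _, g in hoods])
f_price, p_price = stats.f_oneway(*[g["price"].dropna() for _, g in hoods])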
Host tenure and listing scale were also evaluated. A two-sample t-test comparing newer hosts to longer-tenured hosts found a clear difference: hosts who joined within the last five years averaged 188.39 CAD per night, while longer-tenured hosts averaged 178.87 CAD, and this gap was highly significant (t = 4.95, p < 0.000001). Pearson correlation tests were then used to assess whether higher ratings translate into higher prices and whether hosts with more listings price differently. Rating versus price was statistically significant but negligible in strength, implying that near-perfect ratings do not meaningfully increase nightly rates on their own. Host listings count versus price showed a very weak negative relationship, consistent with portfolio hosts pricing slightly lower to support occupancy across multiple properties.
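The remaining tests follow the same pattern; host_since, review_scores_rating, and calculated_host_listings_count are assumed column names, and the five-year cutoff date is an assumption made for illustration.

import pandas as pd
from scipy import stats

# Test 4: hosts who joined within the last five years vs. longer-tenured hosts
cutoff = pd.Timestamp("2020-07-01")  # assumed cutoff relative to the July 2024-June 2025 window
tenure = pd.to_datetime(df["host_since"])
newer = df.loc[tenure >= cutoff, "price"].dropna()
older = df.loc[tenure < cutoff, "price"].dropna()
t_tenure, p_tenure = stats.ttest_ind(newer, older, equal_var=False)

# Test 5: Pearson correlations of price with rating and with host listing count
clean = df[["price", "review_scores_rating", "calculated_host_listings_count"]].dropna()
r_rating, p_rating = stats.pearsonr(clean["review_scores_rating"], clean["price"])
r_count, p_count = stats.pearsonr(clean["calculated_host_listings_count"], clean["price"])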
Taken together, the hypothesis tests point to neighbourhood as the dominant factor shaping both visibility (reviews) and pricing. Host-level signals such as superhost status and ratings likely matter for trust and conversion, but they show limited direct pricing power in the data. Host tenure appears more informative than badges or ratings, with newer hosts charging higher nightly rates on average, potentially reflecting newer inventory, updated amenities, or different pricing expectations.
Methodology
Four sequential procedures were applied: (1) data cleaning and model preparation, (2) exploratory data analysis (EDA), (3) predictive modeling, and (4) hypothesis testing.
Results & Next Steps
Unsurprisingly, pricing in Toronto’s Airbnb universe mostly follows the laws of “listing physics.” Bigger places cost more. More people, more bedrooms, more bathrooms, and the price tends to climb. After that, a small set of amenities actually moves the needle in a noticeable way. Neighbourhood matters, sure, but it’s not the whole story. A well-chosen upgrade can outperform the strategy of simply being in a trendy postal code.
The hypothesis tests add a useful reality check. Superhost status and near-perfect ratings are great for trust, but they don’t reliably buy a meaningful price premium. Meanwhile, location still shows up loud and clear, and host tenure also lines up with measurable differences in pricing.
For investors, dense hotspots like ‘Waterfront Communities-The Island’ and ‘Niagara’ are like busy city intersections: lots of flow, lots of competition, and the margins can get squeezed. The more interesting opportunity can show up in quieter areas where supply is thinner but prices hold up, especially when a listing is engineered with features that consistently support higher rates. For new hosts, instant booking and a smart amenity mix remain some of the cleanest, most practical levers for improving revenue.
The backward-elimination step described in Part I was implemented as follows (full_model, full_formula, and fit_and_print come from earlier in the project code):

# Keep only terms significant at the 10% level, then refit the reduced specification
sig_terms = [name for name, p in full_model.pvalues.items() if (name != "Intercept") and (p <= 0.10)]
reduced_formula = "log_price ~ " + " + ".join(sig_terms) if sig_terms else full_formula
_ = fit_and_print(reduced_formula, df, "Reduced model (p ≤ 0.10)")