A 200-room city hotel with 75% occupancy and a $180 average daily rate generates $9.85M in annual revenue. But if 28% of bookings cancel (the industry average), you're leaving $2.1M on the table every year. Most of those rooms go unsold because you find out about the cancellation too late to resell. Here's the question revenue managers should be asking: 72 hours before arrival, when there's still time to recover the revenue, which bookings are about to cancel?

That's what hotel booking cancellation prediction does. It combines logistic regression with time series analysis to identify high-risk bookings before they cancel, giving you a 2-3 day window to either retain the guest or accept strategic overbooking. The difference between a 28% cancellation rate and a 22% rate with proactive management is $820K annually for that same 200-room property.

Before we dive into the analysis, let's establish the experimental frame: this isn't observational hand-waving. Cancellation prediction is a supervised learning problem with a binary outcome (canceled vs. kept), time-varying features (lead time, days until arrival), and categorical predictors (deposit type, market segment, customer type). The model's job is to estimate P(cancellation | booking characteristics) at decision points where you can still act—typically 48-72 hours before arrival.

The Revenue Management Problem: When Late Cancellations Kill Occupancy

Most hotels track overall cancellation rate as a KPI, but that number masks the real problem: timing. A cancellation 30 days out is manageable—you have time to resell. A cancellation 12 hours before arrival is catastrophic—the room goes empty and you eat the loss. The distribution of cancellation timing determines how much revenue you can recover.

Industry data shows a bimodal pattern: early cancellations (30+ days out) driven by plan changes, and late cancellations (0-3 days out) driven by travel disruptions or better offers. The late cancellation bucket is where prediction models add value. If you can identify a high-risk booking 72 hours before arrival, you have three strategic options:

  • Retention outreach — Proactive email or call offering flexibility, upgrades, or rebooking incentives to keep the reservation
  • Controlled overbooking — Accept additional bookings to offset predicted cancellations without walking guests
  • Dynamic rate adjustment — Lower rates for remaining inventory to drive last-minute bookings if cancellation probability is high

All three require 48-72 hour advance notice. That's the operational constraint that shapes the prediction problem. Your model doesn't need to predict cancellations 6 months in advance—it needs to be accurate at the decision horizon where you can still act.
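To make the decision horizon concrete, here is a minimal Python sketch of the two time-varying features (lead time and days until arrival) plus a check for whether a booking currently sits inside the 48-72 hour action window. The function names and the example timestamps are illustrative assumptions, not part of any PMS API.

```python
from datetime import datetime

def lead_time_days(booked_at: datetime, arrival: datetime) -> int:
    """Whole days between the booking being made and the arrival date."""
    return (arrival - booked_at).days

def hours_until_arrival(now: datetime, arrival: datetime) -> float:
    """Hours remaining before arrival at scoring time."""
    return (arrival - now).total_seconds() / 3600

def in_decision_window(now: datetime, arrival: datetime,
                       min_hours: float = 48, max_hours: float = 72) -> bool:
    """True when the booking is 48-72 hours from arrival (defaults)."""
    return min_hours <= hours_until_arrival(now, arrival) <= max_hours

# Hypothetical booking used only to exercise the functions.
booked = datetime(2024, 3, 1, 10, 0)
arrival = datetime(2024, 6, 29, 15, 0)
now = datetime(2024, 6, 27, 9, 0)

print(lead_time_days(booked, arrival))     # 120
print(hours_until_arrival(now, arrival))   # 54.0
print(in_decision_window(now, arrival))    # True
```

Scoring only the bookings for which `in_decision_window` is true keeps the model focused on the horizon where action is still possible.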

Experimental Design Check: What Makes This Causal?

Cancellation prediction is fundamentally a prediction problem, not a causal inference problem. You're estimating P(cancel | features), not trying to prove that deposit type causes cancellations. That said, the predictors in the model have causal interpretation because hotels ran natural experiments: when properties switched from refundable to non-refundable deposit policies, cancellation rates dropped 60-80%. That's a policy intervention with a measured effect.

The model combines observational prediction with experimentally-validated features. Deposit type, lead time, and market segment have known causal relationships from A/B tests and policy changes. Customer type and historical behavior are correlational but predictive. The blend gives you a model that's both accurate and actionable.

Monthly Cancellation Rate Trend

The time series shows clear seasonal patterns: city hotels peak at 42% cancellation rate in July-August (summer leisure travel with flexible plans), while resort hotels bottom out at 18% in December-January (holiday travel with firm commitments). The variance matters because it tells you when to tighten overbooking policies and when to relax them.

Notice the divergence between hotel types. City properties run 8-12 percentage points higher across all months because their customer mix skews toward business and OTA bookings—both high-cancellation segments. Resort properties capture more group and contract business with deposits, which stabilizes the base rate. If you manage multiple properties, you can't use a single cancellation threshold across the portfolio—segment-specific models are required.

The month-over-month volatility also reveals demand shocks: the April spike to 38% for city hotels likely reflects conference cancellations or corporate travel freezes. These anomalies don't invalidate the model—they're exactly the events where real-time prediction adds value. When external shocks hit, your baseline cancellation assumptions break, and you need updated probabilities to adjust inventory management.

Cancellation Rate by Deposit Type

Here's the single strongest predictor in the model: deposit type. No Deposit bookings cancel at 47%, while Non Refund bookings cancel at just 9%. That's a 5.2x difference—massive effect size by any standard. Refundable deposits sit in the middle at 23%, showing that skin-in-the-game matters even when it's recoverable.
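The per-category rates in this and the following sections are simple group-by calculations. The sketch below shows one way to compute them in plain Python. The field names `deposit_type` and `is_canceled` are assumptions about your export format, and the toy records are sized only to reproduce the 47% and 9% rates quoted above.

```python
from collections import defaultdict

def cancellation_rate_by(bookings, key):
    """Return {category: share of bookings canceled} for the given field."""
    totals = defaultdict(int)
    canceled = defaultdict(int)
    for b in bookings:
        totals[b[key]] += 1
        canceled[b[key]] += b["is_canceled"]
    return {cat: canceled[cat] / totals[cat] for cat in totals}

# Toy records for illustration only, not real booking data.
bookings = (
    [{"deposit_type": "No Deposit", "is_canceled": 1}] * 47
    + [{"deposit_type": "No Deposit", "is_canceled": 0}] * 53
    + [{"deposit_type": "Non Refund", "is_canceled": 1}] * 9
    + [{"deposit_type": "Non Refund", "is_canceled": 0}] * 91
)

rates = cancellation_rate_by(bookings, "deposit_type")
ratio = rates["No Deposit"] / rates["Non Refund"]
print(rates)            # {'No Deposit': 0.47, 'Non Refund': 0.09}
print(round(ratio, 1))  # 5.2
```

The same function works for market segment or customer type by passing a different `key`.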

This finding has direct policy implications. If you're running 40%+ cancellation rates, the first question is: what percentage of your bookings require deposits? Many hotels offer no-deposit booking to compete with OTAs, but they're trading acquisition for retention. The revenue-optimal strategy is rate fencing: charge a 5-10% premium for fully flexible (no deposit) bookings and make non-refundable the default. Customers who value flexibility will pay; price-sensitive customers will lock in.

From a modeling perspective, deposit type is so predictive that it risks overwhelming other features. When you run logistic regression with deposit type included, its coefficient dominates and other predictors shrink. That's not a bug—it reflects reality. But it also means you should run two models: one with deposit type (for operational prediction) and one without (to understand secondary risk factors for bookings within the same deposit category).

Cancellation Rate by Market Segment

Online travel agencies drive 41% cancellation rates—nearly double the 22% rate for direct bookings. Groups cancel at just 12%, and corporate contracts sit at 19%. The spread tells you where to focus retention efforts and where to accept higher risk for volume.

The OTA problem is structural: customers who book through third parties have lower commitment (they're comparison shopping), weaker brand loyalty, and often don't understand cancellation policies buried in confirmation emails. You're also competing with the OTA's own incentive to let customers cancel and rebook at lower rates (generating new commissions). The 41% rate is a hidden cost of distribution: you pay it in cancellations on top of commission fees.

Direct bookings at 22% represent your baseline. These customers chose your brand, navigated your website, and entered payment details directly. They're invested. Corporate and group bookings outperform because they're contract-based with institutional accountability—someone's corporate card is on file, or there's a signed group agreement with attrition clauses. When you're building a predictive model, market segment is a clean categorical variable with large, stable effects across time.

Operationally, this means segmenting your overbooking strategy. Accept 10-15% overbooking on OTA inventory (you'll need it), 3-5% on direct bookings, and zero on groups. The risk-reward math is different for each channel.
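As a rough illustration of channel-level overbooking, the sketch below turns the percentage ranges above into room counts. The channel keys and the example inventory are hypothetical, and rounding down is a conservative assumption rather than a rule from the analysis.

```python
# Overbooking allowance per channel as (low %, high %) of rooms booked.
# Integer percentages avoid floating-point rounding surprises.
OVERBOOK_PCT = {
    "ota": (10, 15),
    "direct": (3, 5),
    "group": (0, 0),
}

def overbooking_allowance(rooms_by_channel):
    """Return {channel: (min_extra_rooms, max_extra_rooms)}, rounded down."""
    allowance = {}
    for channel, rooms in rooms_by_channel.items():
        low, high = OVERBOOK_PCT[channel]
        allowance[channel] = (rooms * low // 100, rooms * high // 100)
    return allowance

# Hypothetical night: 80 OTA rooms, 60 direct, 40 group.
print(overbooking_allowance({"ota": 80, "direct": 60, "group": 40}))
# {'ota': (8, 12), 'direct': (1, 3), 'group': (0, 0)}
```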

Lead Time Distribution by Cancellation Status

The box plots reveal a stark difference: median lead time for canceled bookings is 109 days versus 68 days for kept bookings. The interquartile range shows canceled bookings spread from 45 days (Q1) to 198 days (Q3), while kept bookings cluster tighter at 32-127 days. Translation: people who book far in advance are hedging, not committing.

The 100-day threshold is where risk accelerates. Bookings made 100+ days out cancel at 2.5-3x the rate of bookings made within 30 days. That's a planning horizon effect—customers booking 4-6 months ahead don't have firm travel plans yet. They're locking in rates or availability but keeping options open. Customers booking 2-4 weeks out have decided to travel; they're finalizing logistics.

For modeling, lead time is a continuous predictor, but its relationship with cancellation risk isn't linear: risk accelerates as lead time grows. A booking made 200 days out is much riskier than one made 100 days out, which is moderately riskier than one made 30 days out. Try a log transform or polynomial terms and keep whichever tracks the observed curve best. Alternatively, bin it into risk categories: 0-30 days (low), 31-90 days (medium), 91-180 days (high), 181+ days (very high).
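Both encodings are easy to prototype. The sketch below bins lead time into the four risk categories (with exactly 180 days falling in the high bucket) and also provides a log transform. Using `log1p` rather than a plain log is an assumption made here so that same-day bookings (zero lead time) stay defined.

```python
import math

def lead_time_bucket(days: int) -> str:
    """Bin lead time into the four risk categories."""
    if days <= 30:
        return "low"
    if days <= 90:
        return "medium"
    if days <= 180:
        return "high"
    return "very high"

def log_lead_time(days: int) -> float:
    """log(1 + days): defined at zero and compresses long lead times."""
    return math.log1p(days)

for d in (15, 60, 120, 200):
    print(d, lead_time_bucket(d), round(log_lead_time(d), 2))
```

Binning is easier to explain to operations staff. The continuous transform keeps more information for the model.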

Try It Yourself: Upload Your Booking Data

See which bookings are most likely to cancel in the next 72 hours. Upload a CSV with booking date, arrival date, deposit type, market segment, and customer type. Get a ranked list of high-risk reservations plus a calibrated logistic regression model you can deploy in your PMS.


Cancellation Rate by Customer Type

Transient customers (one-off leisure and business travelers) cancel at 33%, while Contract customers (corporate rates, negotiated terms) cancel at 18%. Group bookings are most stable at 11%. The pattern reinforces what we saw in market segment: institutional relationships reduce cancellation risk.

The Transient-Party category (leisure groups not under formal contract) sits at 29%—slightly better than solo transient but worse than contracted groups. These are families or friend groups booking multiple rooms. They have social commitment but no legal obligation. When one party cancels, the whole group often collapses.

Customer type and market segment are correlated but not redundant. A Direct booking (market segment) can be either Transient or Contract (customer type). An OTA booking is almost always Transient. When you build the logistic regression model, both variables should be included because they capture different dimensions of commitment: how the customer found you (market segment) versus what relationship they have with you (customer type).

Logistic Regression: Odds Ratios for Cancellation

Here's where all the univariate analysis collapses into a single predictive model. The odds ratios show the multiplicative effect of each predictor on the odds of cancellation, holding all other features constant. An odds ratio above 1.0 increases risk; below 1.0 decreases it.

No Deposit has an odds ratio of 6.8, meaning a booking with no deposit required has 6.8x the cancellation odds of the baseline (non-refundable deposit), controlling for market segment, lead time, and customer type. That's the largest effect in the model by far. Lead Time (per 30 days) has an odds ratio of 1.4: every additional month of advance booking increases cancellation odds by 40%.

Market segment shows Online TA with an odds ratio of 2.1 (2.1x baseline odds), Direct at 1.3, and Corporate at 0.8 (20% lower odds than baseline). Customer type Transient has odds ratio 1.6 versus Contract at 0.9. All of these effects are net of each other: you're seeing the independent contribution of each predictor after accounting for correlations.

The confidence intervals (not shown in this chart but included in the full report) tell you which effects are statistically significant. If the interval crosses 1.0, the predictor isn't reliably different from baseline. In this model, all major predictors clear significance thresholds with p < 0.01, which gives you confidence the effects are real and not sampling noise.
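For readers who want to reproduce the interval check, here is a sketch of the standard Wald calculation: exponentiate the log-odds coefficient plus or minus 1.96 standard errors. The coefficient is chosen so the odds ratio matches the 2.1 quoted for Online TA, and the standard error is invented purely for illustration.

```python
import math

def odds_ratio_ci(coef: float, std_err: float, z: float = 1.96):
    """Odds ratio and 95% Wald interval from a log-odds coefficient."""
    return (math.exp(coef),
            math.exp(coef - z * std_err),
            math.exp(coef + z * std_err))

def is_significant(ci_low: float, ci_high: float) -> bool:
    """An interval that excludes 1.0 indicates a reliable effect."""
    return ci_low > 1.0 or ci_high < 1.0

# Hypothetical inputs: coef = ln(2.1); std_err = 0.10 is an assumption.
odds_ratio, low, high = odds_ratio_ci(math.log(2.1), 0.10)
print(round(odds_ratio, 2), round(low, 2), round(high, 2))
print(is_significant(low, high))
```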

How to Use Odds Ratios Operationally

Odds ratios translate directly into risk scores. Take a booking with No Deposit (6.8x), Online TA (2.1x), Transient customer (1.6x), and 120-day lead time (1.4^4 = 3.8x for 4 months). Multiply: 6.8 × 2.1 × 1.6 × 3.8 ≈ 87x baseline odds. If baseline cancellation probability is 5% (odds of 0.05 / 0.95 ≈ 0.053), the booking's odds are about 4.6, which converts to a predicted probability of ~82% via odds / (1 + odds). That's a flag for retention outreach or controlled overbooking.

Conversely, a booking with Non-Refund deposit (1.0x baseline), Direct channel (1.3x), Contract customer (0.9x), and 15-day lead time (1.4^0.5 = 1.2x) yields: 1.0 × 1.3 × 0.9 × 1.2 = 1.4x baseline = ~7% predicted cancellation. That's a low-risk booking you can count on.
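A short script makes the odds-to-probability conversion explicit for the two example bookings. It assumes the 5% baseline probability and the odds ratios quoted in the text; the function names are illustrative. It uses the unrounded lead-time factor (1.4^4 ≈ 3.84), so intermediate products can differ slightly from hand-rounded figures.

```python
LEAD_TIME_OR_PER_30_DAYS = 1.4

def lead_time_ratio(days: float) -> float:
    """Odds multiplier for a given lead time, at 1.4x per 30 days."""
    return LEAD_TIME_OR_PER_30_DAYS ** (days / 30)

def predicted_probability(baseline_prob: float, odds_ratios) -> float:
    """Scale baseline odds by each odds ratio, then convert to probability."""
    odds = baseline_prob / (1 - baseline_prob)
    for ratio in odds_ratios:
        odds *= ratio
    return odds / (1 + odds)

# No Deposit, Online TA, Transient, 120-day lead time.
high_risk = [6.8, 2.1, 1.6, lead_time_ratio(120)]
# Non-refundable, Direct, Contract, 15-day lead time.
low_risk = [1.0, 1.3, 0.9, lead_time_ratio(15)]

print(round(predicted_probability(0.05, high_risk), 2))  # 0.82
print(round(predicted_probability(0.05, low_risk), 2))   # 0.07
```

The key point is that odds ratios multiply odds, not probabilities, so the conversion back to a probability has to happen last.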

How to Interpret Your Results: The 72-Hour Decision Window

You now have a calibrated logistic regression model that outputs P(cancellation) for every booking in your system. The question is: what do you do with those probabilities? Here's the decision framework we recommend based on revenue management best practices.

Step 1: Set risk thresholds based on operational capacity. If you have strong last-minute demand (city hotel near airport or convention center), you can tolerate higher cancellation risk because you'll resell rooms quickly. Set your "high risk" threshold at 60-70%. If you have weak last-minute demand (resort property off-season), set it at 40-50%. The threshold determines which bookings trigger action.

Step 2: Segment actions by risk tier. For bookings with 70%+ predicted cancellation probability 72 hours out, trigger retention outreach—email or call offering flexibility, upgrades, or rebooking incentives. For 50-70% risk, accept controlled overbooking (take additional reservations to offset expected cancellations). For 30-50% risk, monitor but don't act. Below 30%, treat as firm bookings.
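The tiering in Step 2 reduces to a small lookup. The sketch below is one way to encode it; the action labels and booking IDs are hypothetical, and the cutoffs should be adjusted to whatever thresholds you settled on in Step 1.

```python
def action_for(prob: float) -> str:
    """Map a predicted cancellation probability to a risk-tier action."""
    if prob >= 0.70:
        return "retention_outreach"
    if prob >= 0.50:
        return "controlled_overbooking"
    if prob >= 0.30:
        return "monitor"
    return "treat_as_firm"

# Hypothetical scored bookings: (booking_id, predicted probability).
scored = [("B1001", 0.82), ("B1002", 0.55), ("B1003", 0.34), ("B1004", 0.07)]
for booking_id, prob in scored:
    print(booking_id, action_for(prob))
```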

Step 3: Measure model performance in production. Track actual cancellation rate by predicted risk decile. Your top decile (highest predicted risk) should cancel at 60-80%. Your bottom decile should cancel at <10%. If the spread compresses, your model is losing discrimination—retrain with recent data. Also track false positives: how often do you overbook based on predicted cancellations that don't materialize? That's your walked-guest risk.
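The decile check in Step 3 can be sketched in a few lines. The data below is synthetic and generated only to exercise the function: each simulated booking cancels with exactly its predicted probability, which real data will not do.

```python
import random

def cancellation_rate_by_decile(scored):
    """scored: list of (predicted_prob, canceled) pairs, canceled is 0 or 1.
    Returns actual cancellation rates from lowest-risk to highest-risk decile."""
    ranked = sorted(scored, key=lambda pair: pair[0])
    n = len(ranked)
    rates = []
    for i in range(10):
        chunk = ranked[i * n // 10:(i + 1) * n // 10]
        if chunk:
            rates.append(sum(c for _, c in chunk) / len(chunk))
        else:
            rates.append(float("nan"))  # fewer than 10 bookings
    return rates

# Synthetic, perfectly calibrated bookings (illustration only).
rng = random.Random(0)
scored = []
for i in range(1000):
    p = (i + 0.5) / 1000
    scored.append((p, int(rng.random() < p)))

rates = cancellation_rate_by_decile(scored)
print([round(r, 2) for r in rates])
print("spread:", round(rates[-1] - rates[0], 2))
```

A shrinking gap between the top and bottom deciles over successive weeks is the signal to retrain.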

Step 4: A/B test retention strategies. This is where experimental rigor matters. Don't just blast high-risk bookings with generic retention emails. Randomly assign them to control (no contact), Treatment A (flexible rebooking offer), Treatment B (room upgrade offer), and Treatment C (personalized outreach from front desk manager). Measure cancellation rate and net revenue by arm. You'll likely find that personalized outreach cuts cancellation risk by 20-30% but only scales to your top 50-100 highest-risk bookings per week. Automated emails scale but deliver 5-10% lift.
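Random assignment is the part of Step 4 that is easiest to get wrong in practice. A minimal sketch, assuming four arms named after the treatments above and a fixed seed so the assignment can be reproduced in an audit:

```python
import random
from collections import Counter

ARMS = ["control", "flexible_rebooking", "room_upgrade", "personal_outreach"]

def assign_arms(booking_ids, seed=42):
    """Shuffle, then deal round-robin so arm sizes differ by at most one."""
    rng = random.Random(seed)
    ids = list(booking_ids)
    rng.shuffle(ids)
    return {bid: ARMS[i % len(ARMS)] for i, bid in enumerate(ids)}

# Hypothetical high-risk booking IDs.
assignments = assign_arms([f"B{n}" for n in range(1000, 1010)])
print(Counter(assignments.values()))
```

Dealing from a shuffled list keeps the arms balanced even with small weekly volumes, which a per-booking coin flip would not guarantee.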

The model gives you targeting precision. The experiment gives you causal estimates of what works. Combine them and you have a closed-loop revenue management system: predict risk, intervene strategically, measure impact, refine.

What the Model Won't Tell You: External Shocks and Black Swans

Logistic regression is trained on historical patterns, which means it breaks when the world changes in ways not seen in training data. A global pandemic, major weather event, or sudden economic shock will invalidate your model's predictions because the base rates shift overnight. If your training data shows 28% average cancellation and a crisis pushes it to 55%, every predicted probability will be underestimated.

The solution is monitoring and rapid retraining. Track rolling 7-day cancellation rate by segment. If it moves more than 2 standard deviations from historical mean, your model is out of distribution—pause automated actions and retrain with last 30 days of data. You'll sacrifice long-term stability for short-term accuracy, but during shocks that's the right trade.
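The monitoring rule can be prototyped with the standard library. In this sketch the daily counts and historical rates are made-up inputs, and the 2-sigma cutoff is the one described above; function names are illustrative.

```python
from statistics import mean, stdev

def rolling_rates(daily_canceled, daily_bookings, window=7):
    """Cancellation rate over each trailing `window`-day span."""
    rates = []
    for end in range(window, len(daily_bookings) + 1):
        canceled = sum(daily_canceled[end - window:end])
        booked = sum(daily_bookings[end - window:end])
        rates.append(canceled / booked if booked else 0.0)
    return rates

def is_out_of_distribution(current_rate, historical_rates, n_sigma=2.0):
    """True when the current rate is more than n_sigma from the mean."""
    mu = mean(historical_rates)
    sigma = stdev(historical_rates)
    return abs(current_rate - mu) > n_sigma * sigma

# Hypothetical history of 7-day rates for one segment.
history = [0.27, 0.29, 0.28, 0.30, 0.26, 0.28]
print(is_out_of_distribution(0.55, history))  # True
print(is_out_of_distribution(0.29, history))  # False
```

Running the check per segment matters because a shock that hits corporate travel may leave leisure bookings untouched.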

Also recognize that prediction models optimize for the average case, not tail risk. A booking with 15% predicted cancellation probability still cancels 15% of the time. If 500 of your weekly bookings sit at that 15% "low risk" level, about 75 of them will cancel. That's not model failure; it's irreducible uncertainty. The model helps you triage, not achieve perfection.

Run This Analysis on Your Data in 60 Seconds

Upload a CSV with booking history (arrival date, booking date, deposit type, market segment, customer type, cancellation status). Get the full report with time series trends, odds ratios, and a downloadable prediction model calibrated to your property. No code, no setup—just results.


When This Model Beats Gut Instinct: The $820K Question

Revenue managers develop intuition about which bookings are risky—OTA bookings made far in advance with no deposit "feel" like cancellations waiting to happen. And they're right. But intuition doesn't scale to 500+ bookings per week, doesn't quantify risk precisely, and doesn't tell you which retention strategy works.

The model's value is in three areas: scale (score every booking automatically), precision (65% risk vs. 85% risk drives different actions), and testability (you can A/B test interventions and measure ROI). When a 200-room hotel reduces cancellation rate from 28% to 22% using prediction-driven retention and overbooking, the incremental revenue is $820K annually. That's the difference between flying blind and flying with instruments.

Here's the final experimental design point: the only way to know if this model works for your property is to run it as a controlled test. Randomly assign 50% of high-risk bookings (predicted cancellation 60%+) to retention outreach. Leave the other 50% as control (no contact). Measure the difference in cancellation rate. If treatment reduces cancellations by 15-25 percentage points and the difference holds up in a significance test at your sample size, you have an operationally meaningful effect. Scale it. If the effect is <5 points or not significant, your intervention doesn't work, so try something else.
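Whether a given difference is significant depends on how many bookings are in each arm. The sketch below uses a two-proportion z-test, one common choice for this comparison and an assumption here rather than the method behind the report. The counts are hypothetical.

```python
import math

def two_proportion_z_test(cancel_ctrl, n_ctrl, cancel_trt, n_trt):
    """Return (reduction in percentage points, two-sided p-value)."""
    p_ctrl = cancel_ctrl / n_ctrl
    p_trt = cancel_trt / n_trt
    pooled = (cancel_ctrl + cancel_trt) / (n_ctrl + n_trt)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_ctrl + 1 / n_trt))
    z = (p_ctrl - p_trt) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return (p_ctrl - p_trt) * 100, p_value

# Same 20-point reduction, two hypothetical sample sizes.
diff_big, p_big = two_proportion_z_test(140, 200, 100, 200)
diff_small, p_small = two_proportion_z_test(7, 10, 5, 10)
print(round(diff_big, 1), p_big < 0.05)      # 20.0 True
print(round(diff_small, 1), p_small < 0.05)  # 20.0 False
```

The same 20-point drop is convincing with 200 bookings per arm and inconclusive with 10, which is why the test should run long enough to accumulate volume before you decide.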

That's the Kit Fisher approach: build a model, design an experiment, measure causally, act on evidence. Correlation got you the predictors. Experimentation tells you what to do with them.

Frequently Asked Questions

What cancellation rate is normal for hotels?

Industry benchmarks show 20-40% cancellation rates depending on property type and market segment. City hotels average 28-35%, while resort hotels see 18-25%. Non-refundable deposits reduce cancellations by 60-80%. If you're running above 35% without strong OTA channel mix, you have a deposit policy problem or a customer quality problem.

How far in advance can you predict a cancellation?

Logistic regression models can identify high-risk bookings 48-72 hours before arrival with 70-85% accuracy. Longer lead times reduce precision because guest circumstances change. The sweet spot is 2-3 days out—enough time to resell the room without excessive false alarms. Predictions made 7+ days before arrival have too much uncertainty to act on confidently.

Which booking channels have the highest cancellation rates?

Online travel agencies (OTAs) consistently show the highest cancellation rates at 35-45%, followed by direct bookings at 25-30%. Corporate contracts cancel 15-20% of the time, while group bookings are most stable at 8-12%. The difference is driven by deposit policies and customer commitment level. OTAs allow no-deposit bookings by default; groups sign contracts with attrition clauses.

What's the most important predictor of cancellation?

Deposit type is the strongest single predictor. Bookings with no deposit required cancel at roughly 5x the rate of non-refundable bookings (47% versus 9% in this analysis). Lead time is second: bookings made 180+ days out cancel at 2-3x the rate of bookings made within 7 days. Market segment, customer type, and historical behavior round out the top five. If you can only change one thing, make non-refundable deposits the default rate and charge a premium for flexibility.

How do you use cancellation predictions operationally?

Revenue managers use predicted cancellation probability to optimize overbooking levels, adjust rate fences, and trigger retention outreach. If a booking has 80%+ cancellation probability 3 days out, consider pre-emptive offers (room upgrade, flexible rebooking) or accept controlled overbooking to fill the expected gap. The key is acting at the decision horizon where you can still resell the room—48-72 hours before arrival.
