Linear Regression — What Drives Your Numbers?

You have a number you care about — revenue, conversions, test scores, customer satisfaction. Linear regression tells you which factors actually move that number, by how much, and whether the effect is real or just noise in the data.

What Is Linear Regression?

Linear regression is the workhorse of statistical modeling. You pick a target variable — the thing you want to explain or predict — and one or more predictor variables that you think influence it. The model fits a line (or hyperplane, with multiple predictors) through your data and tells you exactly how much each predictor contributes to the outcome. A one-unit increase in ad spend is associated with a $3.20 increase in revenue? That is a coefficient, and the model gives you one for every predictor.

What makes linear regression practical is that it does not just give you a prediction — it tells you whether to trust it. Every coefficient comes with a p-value that answers a simple question: if this predictor truly had no effect, how likely is a result at least this extreme? If the p-value is below 0.05, the predictor has a statistically significant relationship with your target. If it is above 0.05, the data does not provide strong enough evidence that the predictor matters.

Here is a concrete example. You are running ads on Facebook and Google, and you want to know how much revenue each dollar of spend drives. Linear regression takes your daily data — revenue as the target, Facebook spend and Google spend as predictors — and returns two coefficients. Maybe Facebook returns $2.80 per dollar and Google returns $4.10 per dollar. Now you know where to shift budget. That is the kind of decision this analysis supports.
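In R, that analysis is a few lines. The data below is simulated so the example is self-contained (the $2.80 and $4.10 effects are baked in deliberately); in practice you would load your own daily spend and revenue columns instead.

```r
# Simulated daily data with known effects: $2.80 per Facebook dollar,
# $4.10 per Google dollar, plus random day-to-day noise
set.seed(42)
n        <- 200
facebook <- runif(n, 0, 500)
google   <- runif(n, 0, 500)
revenue  <- 1000 + 2.8 * facebook + 4.1 * google + rnorm(n, sd = 50)

fit <- lm(revenue ~ facebook + google)
coef(fit)   # intercept plus one coefficient per channel
```

With enough rows, the estimated coefficients land close to the true per-dollar effects, which is exactly the number you need for the budget decision.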

When to Use Linear Regression

Marketing teams use linear regression constantly. Which channels actually drive sales? If you have weekly data on email sends, paid search spend, social media impressions, and revenue, regression separates the signal from the noise. It answers questions like "does increasing our Google Ads budget by $1,000 actually move revenue?" with a number and a confidence level, not a gut feeling.

Pricing decisions are another natural fit. If you have transaction data with price and quantity sold, regression quantifies the relationship. A coefficient of -12 on price means each $1 price increase is associated with 12 fewer units sold. Combine that with your margin data and you can calculate the optimal price point. HR teams use the same approach — what predicts employee performance? Is it years of experience, training hours completed, or team size? Regression tells you which factors carry weight and which ones are just correlated by coincidence.
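The pricing case above can be sketched the same way. The data here is hypothetical (a demand slope of -12 is built in), and the profit formula assumes a simple linear demand curve with a constant unit cost — a simplification, not a full pricing model.

```r
# Hypothetical transaction data: quantity falls about 12 units per $1 of price
set.seed(1)
price    <- runif(150, 10, 30)
quantity <- 500 - 12 * price + rnorm(150, sd = 15)
demand   <- lm(quantity ~ price)

# For linear demand a + b*price and unit cost c, profit is
# (price - c) * (a + b * price); setting the derivative to zero
# gives the profit-maximizing price (b*c - a) / (2*b)
a <- coef(demand)[[1]]
b <- coef(demand)[[2]]
unit_cost     <- 10
optimal_price <- (b * unit_cost - a) / (2 * b)
optimal_price
```

The estimated slope b recovers the -12 relationship, and the closed-form price drops out of the fitted coefficients plus your margin data.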

Operations teams lean on regression for cost modeling. What drives shipping costs? Package weight, distance, carrier, and time of year might all play a role. Regression quantifies each factor so you can forecast costs accurately and identify which levers you can pull. The common thread across all these use cases: you have a numeric outcome and want to understand what influences it, with evidence you can defend to stakeholders.

What Data Do You Need?

You need a CSV with at least two columns: one numeric target variable (the thing you want to predict or explain) and one or more numeric predictor columns. Revenue and ad spend. Test scores and study hours. Weight and calorie intake. The target must be continuous — if it is a yes/no outcome, you need logistic regression instead.

More rows give you more reliable results. With 30 rows, the model might find a pattern but the confidence intervals will be wide. With 300 rows, the estimates tighten up considerably. A common rule of thumb is at least 10-20 observations per predictor variable. If you have 5 predictors, aim for at least 50-100 rows.

Correlated predictors are not a dealbreaker. If your Facebook spend and Google spend tend to go up and down together, the model can still work — but the report flags this with VIF (Variance Inflation Factor) scores. A VIF above 5-10 means two predictors are so correlated that the model struggles to separate their individual effects. The report tells you which predictors are affected so you can decide whether to drop one or keep both and accept wider confidence intervals.
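The report computes VIF with car::vif(); the sketch below reproduces the same definition in base R so you can see what the number means. VIF for a predictor is 1 / (1 - R²) from regressing that predictor on all the others. The data is simulated, with x2 deliberately made almost a copy of x1.

```r
set.seed(7)
x1 <- rnorm(100)
x2 <- 0.95 * x1 + rnorm(100, sd = 0.3)   # deliberately correlated with x1
x3 <- rnorm(100)                          # independent of the others

# Manual VIF: regress one predictor on the rest, then 1 / (1 - R^2).
# This mirrors what car::vif() reports for an lm() model.
vif_manual <- function(df, target) {
  others <- df[, setdiff(names(df), target), drop = FALSE]
  r2 <- summary(lm(df[[target]] ~ ., data = others))$r.squared
  1 / (1 - r2)
}

predictors <- data.frame(x1 = x1, x2 = x2, x3 = x3)
vifs <- sapply(names(predictors), function(v) vif_manual(predictors, v))
round(vifs, 1)
```

Here x1 and x2 both score well above the 5-10 danger zone while the independent x3 sits near 1 — exactly the pattern the report's flag is designed to surface.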

How to Read the Report

Start with R-squared at the top of the report. This number ranges from 0 to 1 and tells you how much of the variation in your target the model explains. An R-squared of 0.75 means the predictors account for 75% of the variation — the remaining 25% is due to factors not in your data. For business data, anything above 0.5 is often useful. For physical systems (like engineering data), you might expect 0.9+. There is no universal "good" threshold; it depends on your domain.

Next, look at the coefficient table. Each row is a predictor. The "Estimate" column tells you the effect size — how much the target changes for a one-unit increase in that predictor, holding everything else constant. The "p-value" column tells you if the effect is statistically significant. Below 0.05 means strong evidence the predictor matters. Below 0.01 means very strong evidence. Above 0.10 means the data does not support a real effect. Focus your attention on predictors with low p-values and meaningful effect sizes.
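If you want to pull the same numbers programmatically, they all live in summary() of the fitted model. The study-hours data below is simulated for illustration.

```r
# Simulated data: each extra study hour adds about 4 points to the score
set.seed(3)
hours <- rnorm(80, mean = 5, sd = 2)
score <- 50 + 4 * hours + rnorm(80, sd = 6)

fit <- lm(score ~ hours)
s   <- summary(fit)

s$r.squared                           # share of variation explained
s$coefficients                        # Estimate, Std. Error, t value, Pr(>|t|)
s$coefficients["hours", "Pr(>|t|)"]   # the p-value for the hours predictor
```

The "Estimate" and "Pr(>|t|)" columns of s$coefficients are the effect sizes and p-values the report's coefficient table is built from.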

The residual plots check whether the model's assumptions hold. The residuals-vs-fitted plot should look like a random cloud — if you see a curve or a funnel shape, the model is missing something. The Q-Q plot checks whether residuals are approximately normal. The actual-vs-predicted chart gives you a visual sense of fit: points clustered tightly around the diagonal line mean the model predicts well. Points scattered far from the line mean the model struggles with those observations.
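The same checks can be run numerically. The sketch below fits a model on simulated data and inspects the residuals directly; plot(fit) would draw the charts described above in an interactive session.

```r
set.seed(11)
x <- runif(120)
y <- 1 + 3 * x + rnorm(120, sd = 0.5)
fit <- lm(y ~ x)

# plot(fit)   # draws residuals-vs-fitted, Q-Q, and related diagnostics

# Numeric counterparts of the same checks:
mean(resid(fit))               # exactly ~0 for OLS with an intercept
cor(resid(fit), fitted(fit))   # exactly ~0: no leftover linear pattern
shapiro.test(resid(fit))       # a small p-value would flag non-normal residuals
```

A curve or funnel in the residuals-vs-fitted plot shows up numerically as structure the model left behind; on well-behaved data like this, the residuals carry no signal.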

When to Use Something Else

If you have many predictors and suspect several of them are correlated or irrelevant, consider ridge regression or lasso regression. Ridge handles correlated predictors by shrinking coefficients toward zero without eliminating them. Lasso goes further and sets some coefficients exactly to zero, effectively selecting the most important predictors for you. Both are especially useful when you have more than 10-15 predictors.

If your target variable is a yes/no outcome — did the customer convert? did the patient recover? — you need logistic regression instead. Linear regression assumes a continuous numeric target and will give misleading results on binary outcomes. If the relationship between predictors and target is clearly non-linear — for example, revenue grows exponentially with ad spend up to a saturation point — then XGBoost or random forest will capture those patterns better.
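For a binary target, the base-R equivalent is glm() with a binomial family. The conversion data below is simulated to make the example self-contained.

```r
# Simulated conversions: higher spend raises the conversion probability
set.seed(5)
spend     <- runif(300, 0, 10)
p         <- plogis(-2 + 0.6 * spend)   # true conversion probability
converted <- rbinom(300, 1, p)

logit <- glm(converted ~ spend, family = binomial)
coef(logit)                  # coefficients are on the log-odds scale
exp(coef(logit)["spend"])    # odds multiplier per extra $1 of spend
```

Note the interpretation changes: logistic coefficients describe changes in log-odds, not direct unit changes in the target, which is why linear regression's coefficients mislead on yes/no outcomes.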

Sometimes you do not need prediction at all — you just want to see which variables are related to each other. In that case, a simple correlation analysis gives you a matrix of pairwise relationships without building a model. Correlation tells you "these move together"; regression tells you "this drives that." Choose based on whether you need directional, causal-style interpretation or just want to explore your data.
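The model-free alternative is one function call. On simulated data where revenue depends strongly on email and weakly on search:

```r
set.seed(9)
email   <- rnorm(100)
search  <- rnorm(100)
revenue <- 2 * email + 0.5 * search + rnorm(100)

round(cor(cbind(email, search, revenue)), 2)   # pairwise correlation matrix
```

The matrix shows "these move together" at a glance, with no coefficients, p-values, or assumptions to check — which is exactly its appeal for a first look at the data.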

The R Code Behind the Analysis

Every report includes the exact R code used to produce the results — reproducible, auditable, and citable. This is not AI-generated code that changes every run. The same data produces the same analysis every time.

The analysis uses lm() from base R — the same ordinary least squares implementation found in every statistics textbook and university course. Multicollinearity is checked with car::vif(), which calculates the Variance Inflation Factor for each predictor. Diagnostic plots use base R's plot() on the model object, producing residuals-vs-fitted, Q-Q, scale-location, and leverage plots. The summary() function produces the coefficient table with standard errors, t-statistics, and p-values. No custom or experimental code — just standard, well-tested statistical functions that have been validated by decades of use.
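Put together, the pipeline described above is short. This sketch runs on simulated data; a real report would start from your CSV (for example via read.csv), and the car package line is commented out because it needs that package installed.

```r
# End-to-end sketch of the analysis on simulated data
set.seed(2024)
d <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(150)

fit <- lm(y ~ x1 + x2, data = d)

summary(fit)    # coefficient table: estimates, SEs, t-statistics, p-values
# car::vif(fit) # multicollinearity check (requires the car package)
# plot(fit)     # residuals-vs-fitted, Q-Q, scale-location, leverage plots
confint(fit)    # 95% confidence intervals for each coefficient
```

Because every step is a deterministic call into base R's well-tested statistics functions, rerunning the script on the same data reproduces the same report.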