Analysis overview and configuration
| Parameter | Value |
|---|---|
| outlier_method | iqr |
| correlation_method | auto |
| max_categories | 20 |
The dataset contains 5,000 marketing campaign records with zero missing values across 14 columns: 10 categorical (71.4%) and 4 numeric (28.6%), indicating clean, analysis-ready data.
This auto-profiler assessment examines the structural integrity and distribution characteristics of a marketing campaign dataset. The analysis evaluates data completeness, variable types, outlier presence, and inter-variable relationships to determine whether the dataset is suitable for downstream modeling and business intelligence work.
The dataset is clean and well-structured for analysis. The absence of missing data eliminates preprocessing friction. The balanced categorical distributions suggest either controlled experimental conditions or representative population sampling, both favorable for unbiased inference. The weak inter-variable correlations indicate that Clicks, Impressions, Conversion_Rate, and ROI capture distinct dimensions of campaign performance rather than redundant signals. However, the numeric summary shows right skew and a high standard deviation (SD=2,706.21). That figure cannot describe Clicks alone, whose profiled range is 100–1,000, and appears to be computed across all four numeric columns together, so it mostly reflects their different scales. Robust statistical methods remain a sensible precaution.
The outlier_boxplot_data table (4,000 rows across 4 numeric columns) reports values from 0.01 to 9,994. That span matches the four columns pooled together (Conversion_Rate starts at 0.01 and Impressions approaches 10,000) rather than Clicks alone, which the column profile places at 100–1,000; ROI is likewise profiled at 2–8, not 0.01–6.59. The IQR method flags no outliers in any column, so the large values are more plausibly legitimate campaign figures than data errors. The dataset is ready for exploratory analysis, segmentation, and predictive modeling with little additional cleaning.
Data preprocessing and column mapping
| Metric | Value |
|---|---|
| Initial Rows | 5,000 |
| Final Rows | 5,000 |
| Rows Removed | 0 |
| Retention Rate | 100% |
All 5,000 records passed data quality checks with zero rows removed — a clean dataset with 100% retention, but no missing data handling or outlier treatment was documented.
This section reports the data cleaning and preprocessing pipeline applied before analysis. It shows how many rows survived quality checks, whether any records were excluded, and the overall data integrity. A 100% retention rate indicates either exceptionally clean source data or minimal validation rules applied — both scenarios affect confidence in downstream results.
The dataset entered analysis in pristine condition: no duplicates flagged, no null values detected, and no rows deemed invalid. However, the absence of documented outlier treatment is notable. The outlier boxplot data shows large values and right skew when the numeric columns are pooled (max=9,994, skew=1.66), yet all values were retained unchanged. This suggests either the analysis accepts extreme values as legitimate business variation, or outlier handling occurred outside this preprocessing report.
The 100% retention rate is unusually high for real-world marketing data, which typically contains 2–5% invalid or duplicate records. Verify that quality thresholds were appropriately stringent. The lack of train/test split documentation means this analysis may be exploratory profiling rather than predictive modeling. Correlation and outlier findings should be interpreted with awareness that extreme values remain unfiltered.
| Metric | Value |
|---|---|
| Total Rows | 5,000 |
| Total Columns Profiled | 14 |
| Numeric Columns | 4 |
| Categorical Columns | 10 |
| Total Missing Values | 0 |
| Overall Missing % | 0% |
| Data Quality Score | 98 / 100 (Excellent) |
| High Correlation Pairs | 0 |
| Columns with Outliers | 0 |
Your dataset is production-ready: 5,000 rows, a 98/100 data quality score, no high-correlation pairs, and no missing data. Proceed to analysis with high confidence.
This executive summary assesses whether your dataset meets the quality threshold for reliable statistical analysis. A data quality score of 98/100 indicates the foundation is sound. The absence of multicollinearity, missing data, and outliers means you can move directly to modeling without extensive preprocessing delays.
Your dataset requires no remedial data work. The 98/100 quality score reflects complete records, consistent formatting, and valid value ranges across all 14 variables. The absence of high correlations (>0.7) between predictors means you can confidently use all numeric variables in regression or machine learning models without worrying about inflated standard errors or unstable coefficients. The 5,000-row sample size is adequate for detecting effects of practical size (Cohen's d ≥ 0.2) with 80% power.
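The power claim above can be sanity-checked with a short calculation. The sketch below assumes a two-sided, two-sample comparison of means with equal group sizes and alpha = 0.05, none of which the report states, and it uses the normal approximation rather than an exact t-based calculation. The function name `n_per_group` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided, two-sample test.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    The design (two equal groups, alpha = 0.05) is an assumption, not
    something the profiling report specifies.
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


required = n_per_group(0.2)  # smallest effect size named in the report
total_rows = 5_000
print(f"needed per group: {required}, available if split in two: {total_rows // 2}")
```

Under these assumptions roughly 400 rows per group are enough for d = 0.2, so a two-way split of 5,000 rows has ample headroom. Comparisons across many small segments would need to be checked separately.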
This profile covers data structure and completeness only — it does not assess whether variables are correctly measured or whether the data answers your business question. Review the distribution analysis and correlation matrix sections for guidance on which statistical methods suit your specific variables and objective.
Per-column type detection and summary statistics
| Column | Type | Unique_Values | Missing_Count | Missing_Pct | Min | Max | Mean | Median | Top_Categories |
|---|---|---|---|---|---|---|---|---|---|
| Campaign_Type | categorical | 5 | 0 | 0% | | | | | Social Media, Email, Display |
| Clicks | numeric | 901 | 0 | 0% | 100 | 1000 | 555.21 | 558 | |
| Impressions | numeric | 3821 | 0 | 0% | 1001 | 10000 | 5496.69 | 5508.5 | |
| Engagement_Score | categorical | 10 | 0 | 0% | | | | | 2, 3, 1 |
| Customer_Segment | categorical | 5 | 0 | 0% | | | | | Foodies, Outdoor Adventurers, Tech Enthusiasts |
| Company | categorical | 5 | 0 | 0% | | | | | NexGen Systems, Innovate Industries, TechCorp |
| Target_Audience | categorical | 5 | 0 | 0% | | | | | Women 25-34, Men 25-34, Women 35-44 |
| Duration | categorical | 4 | 0 | 0% | | | | | 30 days, 15 days, 45 days |
| Channel_Used | categorical | 6 | 0 | 0% | | | | | Google Ads, Instagram, YouTube |
| Conversion_Rate | numeric | 15 | 0 | 0% | 0.01 | 0.15 | 0.08 | 0.08 | |
| Acquisition_Cost | categorical | 4252 | 0 | 0% | | | | | $14,544.00, $18,187.00, $18,551.00 |
| ROI | numeric | 601 | 0 | 0% | 2 | 8 | 4.99 | 4.96 | |
| Location | categorical | 5 | 0 | 0% | | | | | Houston, Miami, Chicago |
| Language | categorical | 5 | 0 | 0% | | | | | German, French, English |
Your dataset contains 14 columns (4 numeric, 10 categorical) with zero missing values — data quality is clean and ready for analysis.
This section identifies the type and structure of each column in your dataset. Understanding whether columns are numeric or categorical determines which statistical methods are appropriate and how to prepare data for modeling. Clean, complete data is the foundation for reliable analysis.
The 4:10 numeric-to-categorical split is typical for marketing campaign datasets. Your numeric columns enable correlation analysis and regression modeling, while categorical columns support segmentation and stratified analysis. The absence of missing values means every row is complete and usable — no data cleaning overhead. Note that some categorical columns (like Acquisition_Cost) appear to contain currency values stored as text; these should be converted to numeric format before regression analysis to unlock their predictive power.
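The currency conversion mentioned above could look like the following minimal sketch. It assumes every Acquisition_Cost value follows the `$14,544.00` pattern seen in the Top_Categories column (dollar sign, comma thousands separators, two decimals); other formats would need extra handling. The helper name `parse_currency` is hypothetical.

```python
def parse_currency(text: str) -> float:
    """Convert a string such as '$14,544.00' to the float 14544.0.

    Assumes a leading dollar sign and comma thousands separators, as in the
    sample values shown in the column profile. Raises ValueError on anything
    that does not reduce to a plain number.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


# Sample values taken from the Top_Categories column of the profile table.
sample = ["$14,544.00", "$18,187.00", "$18,551.00"]
costs = [parse_currency(s) for s in sample]
print(costs)
```

Once converted, the column can be profiled alongside Clicks, Impressions, Conversion_Rate, and ROI instead of being treated as a categorical with thousands of levels.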
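The per-column profile confirms the type counts. No high-cardinality categoricals (>20 unique values) were flagged, yet Acquisition_Cost has 4,252 unique values, far above the max_categories setting of 20, which is consistent with a numeric currency field stored as text. The remaining categorical features (4 to 10 levels each) are suitable for direct use in models without dimensionality reduction.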
Data distributions: histograms for numeric, bar charts for categorical
Campaign channels and languages are nearly perfectly balanced across the dataset, with Social Media leading at 20.6% and no category exceeding 25.7%, enabling fair comparative analysis.
This section profiles how values are distributed across 14 variables in your marketing dataset—both categorical (campaign types, languages, channels) and numeric (clicks, impressions, costs, ROI). Understanding these distributions is essential for choosing the right statistical tests and identifying whether data transformations are needed before modeling. Skewed or imbalanced data can bias results or violate assumptions in regression and hypothesis testing.
The near-perfect balance in categorical variables (Campaign_Type, Language) is a strength: you can reliably compare performance across channels and languages without sparse categories biasing results. However, the bin counts themselves show mild right skew (1.05) and a wide spread (SD=366.44), which suggests that some numeric variables may have heavy tails or non-normal distributions. Check the individual numeric distributions (Clicks, Impressions, ROI, and Acquisition_Cost once it is converted to numeric) before applying parametric methods such as linear regression.
This profiling covers 14 distinct variables across 133 distribution bins. The absence of missing data (confirmed in the missing_data_heatmap) means no imputation is needed. The boxplot data (max value 9,994 vs. mean 1,546.94) pools columns on very different scales, so skew should be judged column by column; where a column is right-skewed, a log transformation may improve model fit in downstream regression analysis.
Missing data patterns and co-occurrence across columns
Your dataset is 100% complete, with zero missing values across all 14 columns and 5,000 rows. No data cleaning or imputation is required.
This section evaluates data completeness and identifies missing value patterns that could bias downstream analysis. A dataset with no missing data eliminates a major source of analytical risk: you can proceed directly to modeling without imputation decisions, and all statistical tests will use the full sample size without adjustment for missing data mechanisms.
The absence of missing data is rare and valuable. It suggests the data collection process was robust: no dropped fields, no incomplete imports, no survey non-response. Every row can be used in analysis without loss of sample size. Correlation matrices, regression models, and clustering algorithms will operate on the complete 5,000-row dataset, maximizing statistical power and eliminating the need for imputation assumptions that could introduce bias.
This clean state assumes the data was properly validated before profiling. If values like "0," "NA," or "Unknown" are coded as actual entries rather than true missing values, they would not appear in this count — verify that categorical variables like Campaign_Type and Language contain only legitimate categories, not placeholders for missing data.
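A quick way to run that verification is to scan each categorical column for tokens that commonly stand in for missing data. The sketch below is illustrative only: the `PLACEHOLDERS` set is an assumed list, not something the profiler defines, and the sample values are hypothetical. Note that a literal "0" may be a legitimate value in some columns.

```python
from collections import Counter

# Assumed placeholder tokens; extend or trim to fit the actual data source.
PLACEHOLDERS = {"", "na", "n/a", "none", "null", "unknown", "0", "-"}


def placeholder_counts(values):
    """Count entries whose normalized text matches a placeholder token."""
    hits = Counter()
    for value in values:
        token = str(value).strip().lower()
        if token in PLACEHOLDERS:
            hits[token] += 1
    return dict(hits)


# Hypothetical sample of a Language column containing two disguised gaps.
languages = ["German", "French", "English", "Unknown", "NA", "German"]
found = placeholder_counts(languages)
print(found)
```

An empty result for every categorical column would support the zero-missing count reported above.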
Pairwise correlation matrix for numeric variables
None of the four numeric variables is strongly correlated with another: zero high-correlation pairs were detected (|r| ≥ 0.7), so each metric carries distinct information and multicollinearity risk is low.
This section examines whether the four numeric variables in your dataset (Clicks, Impressions, Conversion_Rate, and ROI) move together or independently. High correlations between predictors create multicollinearity, which inflates model uncertainty and makes it hard to isolate individual variable effects. Finding zero problematic correlations is a clean bill of health for downstream modeling.
The near-zero correlations across all variable pairs indicate that Clicks, Impressions, Conversion_Rate, and ROI operate independently in your dataset. This is unusual but valuable: it suggests each metric reflects a distinct business dimension rather than redundant information. For example, high impression volume does not automatically drive clicks, and clicks do not mechanically predict ROI. This independence simplifies modeling—you can safely include all four variables in regression or classification without worrying that one will mask or distort the effect of another.
This clean correlation structure assumes the data are measured without error and that relationships are linear. If you plan to build predictive models, this independence is an asset. However, verify that the near-zero correlations reflect true business reality (e.g., campaigns with high impressions genuinely do not convert at higher rates) rather than data quality issues or missing confounders.
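The high-correlation check described in this section can be sketched as follows. The report's correlation_method is "auto", so the use of Pearson's r here is an assumption, and the toy columns (including the deliberately redundant `Clicks_x2`) are hypothetical values chosen to show one flagged pair.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def high_corr_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical toy data: Clicks_x2 is an exact multiple of Clicks.
toy = {
    "Clicks": [100, 400, 700, 1000, 550],
    "Impressions": [9000, 1500, 6000, 3000, 5200],
    "Clicks_x2": [200, 800, 1400, 2000, 1100],
}
pairs = high_corr_pairs(toy)
print(pairs)
```

Pearson's r only measures linear association, so an empty result rules out strong linear redundancy but not nonlinear dependence between metrics.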
IQR-based outlier detection with box plots for numeric variables
No statistical outliers were detected in any of the 4 numeric columns using the IQR method. The pooled values do show right skew (1.66) and a range of roughly 1,000,000× (0.01 to 9,994), largely because the columns sit on very different scales, so extreme values still warrant caution in modeling.
This section identifies data points that deviate significantly from the typical distribution using the interquartile range (IQR) method—a standard statistical approach that flags values beyond 1.5× the IQR from the quartile boundaries. Understanding outliers is critical because they can distort statistical models, inflate standard errors, and lead to misleading conclusions about campaign performance.
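The fence rule described above can be sketched in a few lines. The quartile convention is an assumption (the profiler's exact method is not documented here; this uses the "inclusive" method of `statistics.quantiles`), and the sample values are hypothetical, with one deliberately extreme entry.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, flagged_values) for the IQR rule.

    A value is flagged when it lies below Q1 - k*IQR or above Q3 + k*IQR.
    Quartiles use the 'inclusive' method; other conventions shift the
    fences slightly.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged


# Hypothetical click counts with one extreme value appended.
clicks = [100, 250, 400, 555, 558, 700, 850, 1000, 9994]
lower, upper, flagged = iqr_outliers(clicks)
print(lower, upper, flagged)
```

For comparison, a column spread roughly evenly over 100–1,000 would have quartiles near 325 and 775, putting the fences near -350 and 1,450, so none of its values would be flagged. That is consistent with a zero-outlier result for a bounded, evenly spread column.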
The absence of IQR-flagged outliers does not mean the data is normally distributed or free from extreme values. The high skewness and massive range suggest the numeric columns (Clicks, Impressions, Conversion_Rate, ROI) contain naturally occurring high-performance campaigns or measurement artifacts. These extreme values are statistically "inside the fence" but may still influence regression coefficients, confidence intervals, and forecast accuracy.
For marketing campaign data, high skew is typical—a few campaigns perform exceptionally well while most cluster near lower values. Before running predictive models, consider log transformation or robust regression methods to reduce the influence of extreme values without removing valid data.
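To illustrate the effect of the suggested log transformation, the sketch below compares skewness before and after applying `log1p` to a right-skewed sample. The data are hypothetical, and the skewness formula is the simple population moment ratio, which may differ slightly from the estimator the profiler uses.

```python
from math import log1p


def skewness(values):
    """Population skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed metric: most values small, a few very large.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [log1p(v) for v in raw]  # log(1 + v) stays defined at zero

raw_skew = skewness(raw)
log_skew = skewness(logged)
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

The transformation pulls in the long right tail without discarding any rows, which is why it is often preferred over deleting extreme but valid campaigns.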
Frequency distributions for all categorical variables
Campaign channels and languages are nearly perfectly balanced across 10 categorical variables, with no single category exceeding 25.68% — ideal for unbiased modeling and group comparisons.
This section examines the distribution of categorical variables across your dataset to identify potential modeling challenges like class imbalance, high cardinality, or rare categories. Balanced categorical data enables reliable statistical tests and prevents models from being biased toward dominant groups. Understanding these distributions is essential before building predictive models or conducting group comparisons.
The near-uniform distribution across Campaign_Type and Language variables (each ~20% per category) eliminates class imbalance concerns that typically plague marketing datasets. This balance makes these variables excellent candidates for ANOVA, chi-square tests, and classification models without requiring resampling or class weights. However, Acquisition_Cost's high cardinality (4,252 unique values) indicates numeric currency data stored as text rather than a true categorical variable; convert it to numeric before analysis.
Among the low-cardinality variables, no categories fall below the 5-occurrence threshold requiring consolidation into "Other." Acquisition_Cost is the exception: with 4,252 unique values in 5,000 rows, most of its values can occur only once or twice. The negative skew in count distribution (-1.48) reflects that most categories are well-represented, with only a few rare ones. Apart from that column, the dataset is well-prepared for categorical analysis without preprocessing complications.
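The consolidation rule referenced above (folding categories with fewer than 5 occurrences into "Other") can be sketched as follows. The sample channel values are hypothetical, and the helper name `consolidate_rare` is illustrative rather than part of the profiler.

```python
from collections import Counter


def consolidate_rare(values, min_count=5, other="Other"):
    """Replace categories seen fewer than min_count times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical channel column with two rare categories.
channels = (
    ["Google Ads"] * 6 + ["Instagram"] * 5 + ["YouTube"] * 2 + ["Podcast"]
)
merged = Counter(consolidate_rare(channels))
print(dict(merged))
```

Applying this to a text-coded numeric column such as Acquisition_Cost would collapse almost every row into "Other", which is another sign the column should be converted to numeric rather than consolidated.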