Overview

Analysis Overview

Analysis overview and configuration

Analysis Type: Auto Profiler
Company: Agency Data Assessment
Objective: Profile this client marketing campaign dataset to assess data quality, column distributions, and analysis readiness
Analysis Date: 2026-03-28
Processing Id: test_1774732419
Total Observations: 5000

Parameter | Value
outlier_method | iqr
correlation_method | auto
max_categories | 20
Interpretation

Headline

The dataset contains 5,000 marketing campaign records with zero missing values across 14 columns (10 categorical, 4 numeric). The 52.6% / 47.4% categorical-to-numeric split reported by the profiler refers to distribution bins (70 and 63 of 133), not to columns. The data is complete and ready for profiling.

Purpose

This auto-profiler assessment examines the structural integrity and distribution characteristics of a marketing campaign dataset. The analysis evaluates data completeness, variable types, outlier presence, and inter-variable relationships to determine whether the dataset is suitable for downstream modeling and business intelligence work.

Key Findings

  • Data Completeness: Zero missing values across all 5,000 observations and 14 columns — no imputation or listwise deletion required
  • Variable Composition: 14 total columns; 10 categorical (Campaign_Type, Language, Channel_Used, Engagement_Score, Acquisition_Cost, and others) and 4 numeric (Clicks, Impressions, Conversion_Rate, ROI)
  • Categorical Balance: Campaign_Type and Language variables show near-uniform distributions (19.78%–20.88% per category), indicating balanced experimental design or representative sampling
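  • Numeric Skewness: The reported right skew (1.05–1.66) and the median/mean gap (median=54 vs. mean=1,546.94) are computed on pooled values, not on any single column. The 1.05 figure is the skew of the distribution bin counts, and the mean and median come from the boxplot table, which stacks all four numeric columns on their different scales. Per column, means and medians nearly coincide (e.g., Clicks mean=555.21 vs. median=558), so the Column Profiles table shows no pronounced skew in any individual column
  • Weak Correlations: Numeric variables show near-zero pairwise correlations (off-diagonal mean and median ≈ 0.01); the matrix-wide mean of 0.26 includes the four diagonal entries of 1.0. Multicollinearity is minimal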

Interpretation

The dataset is exceptionally clean and well-structured for analysis. The absence of missing data eliminates preprocessing friction. The balanced categorical distributions suggest either controlled experimental conditions or representative population sampling, both favorable for unbiased inference. The weak inter-variable correlations indicate that Clicks, Impressions, Conversion_Rate, and ROI are not linearly redundant with one another. The pooled summary statistics (e.g., SD=2,706.21) mix columns on very different scales and should not be read as evidence of outlier campaigns; the IQR check flags no outliers in any numeric column.

Context

The outlier_boxplot_data table (4,000 rows across 4 numeric columns) spans 0.01 to 9,994 because it stacks Conversion_Rate (0.01–0.15), ROI (2–8), Clicks (100–1,000), and Impressions (1,001–10,000) in one table. The IQR method flags no outliers within any single column, so the per-column ranges in the Column Profiles table are the ones to rely on. The dataset is ready for exploratory analysis and segmentation; Acquisition_Cost is stored as text and needs conversion to numeric before predictive modeling.

Data preprocessing and column mapping

Initial Rows: 5000
Final Rows: 5000
Rows Removed: 0
Retention Rate: 100%
Interpretation

Headline

All 5,000 records passed data quality checks with zero rows removed — a clean dataset with 100% retention, but no missing data handling or outlier treatment was documented.

Purpose

This section reports the data cleaning and preprocessing pipeline applied before analysis. It shows how many rows survived quality checks, whether any records were excluded, and the overall data integrity. A 100% retention rate indicates either exceptionally clean source data or minimal validation rules applied — both scenarios affect confidence in downstream results.

Key Findings

  • Retention Rate: 100% (5,000 of 5,000 rows retained) — no records were removed during preprocessing
  • Rows Removed: 0 — no filtering, deduplication, or outlier exclusion occurred
  • Data Completeness: Missing data heatmap shows zero missing values across all 14 columns and 500 sampled rows
  • No Transformations Documented: Train/test split, scaling, encoding, or feature engineering steps are not reported

Interpretation

The dataset entered analysis in pristine condition: no duplicates flagged, no null values detected, and no rows deemed invalid. No outlier treatment is documented, and none appears to have been needed: the IQR check flags no outliers in any numeric column. The extreme figures in the outlier boxplot data (max=9,994, skew=1.66) describe the four numeric columns pooled together rather than any single column.

Context

The 100% retention rate is unusually high for real-world marketing data, which typically contains 2–5% invalid or duplicate records. Verify that quality thresholds were appropriately stringent. The lack of train/test split documentation means this analysis may be exploratory profiling rather than predictive modeling. Correlation and outlier findings should be read with the understanding that no values were filtered before they were computed.

Executive Summary

Executive summary — shareable data quality report for client kickoff meetings

total_rows: 5000
data_quality_score: 98
total_columns: 14

Metric | Value
Total Rows | 5,000
Total Columns Profiled | 14
Numeric Columns | 4
Categorical Columns | 10
Total Missing Values | 0
Overall Missing % | 0%
Data Quality Score | 98 / 100 (Excellent)
High Correlation Pairs | 0
Columns with Outliers | 0

Bottom Line: Dataset contains 5,000 rows and 14 profiled columns. Data quality score: 98/100 (Excellent). No values are missing (0%), so the dataset is ready for analysis.

Key Findings:
• 4 numeric columns detected — suitable for correlation, regression, and outlier analysis
• 10 categorical columns detected — suitable for frequency analysis and group comparisons
• No high-correlation pairs detected — no multicollinearity concerns
• No outliers detected in numeric columns

Recommended Next Steps:
• Review distribution shapes to select appropriate statistical methods
• Confirm that no placeholder values (e.g., "NA" or "Unknown") are masking missing data, since the heatmap itself shows none
• Proceed to analysis: group comparison, regression, clustering, or time series
Interpretation

Headline

Your dataset is production-ready: 5,000 rows, a 98/100 data quality score, no high-correlation pairs, and no missing data. Proceed to analysis with high confidence.

Purpose

This executive summary assesses whether your dataset meets the quality threshold for reliable statistical analysis. A data quality score of 98/100 indicates the foundation is sound. The absence of multicollinearity, missing data, and outliers means you can move directly to modeling without extensive preprocessing delays.

Key Findings

  • Data Quality Score: 98/100 (Excellent) — well above the 85+ threshold for production analysis
  • Dataset Size: 5,000 rows across 14 columns — sufficient for regression, clustering, and group comparisons
  • Missing Data: 0%; no imputation needed, and every row is usable with standard methods
  • Multicollinearity: None detected — numeric predictors are independent; no variable removal required
  • Outliers: None flagged in numeric columns — no data cleaning required
  • Column Composition: 10 categorical + 4 numeric — balanced for mixed-method analysis

Interpretation

Your dataset requires little remedial data work. The 98/100 quality score reflects complete records across all 14 variables, although Acquisition_Cost is stored as text and must be converted before it can be used numerically. The absence of high correlations (|r| ≥ 0.7) between predictors means you can use all numeric variables in regression or machine learning models without worrying about inflated standard errors or unstable coefficients. The 5,000-row sample size is adequate for detecting small effects (Cohen's d ≥ 0.2) with 80% power in a two-group comparison at α = 0.05.
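
The power claim above can be sanity-checked with the standard normal-approximation formula for a two-sample comparison of means. The sketch below is illustrative only: the function name `n_per_group` and the choice of a two-sided test at α = 0.05 with equal group sizes are assumptions, not something the profiler reports.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate rows needed per group to detect effect size d.

    Uses n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2, the usual normal
    approximation for a two-sided, two-sample test with equal group sizes.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


required = n_per_group(0.2)   # small effect, Cohen's d = 0.2
total_rows = 5000             # from the report
print(required, 2 * required <= total_rows)
```

Under these assumptions a two-group comparison needs a few hundred rows per group, so 5,000 rows leave ample room. Comparisons across many small subgroups would need the same check per subgroup.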

Context

This profile covers data structure and completeness only — it does not assess whether variables are correctly measured or whether the data answers your business question. Review the distribution analysis and correlation matrix sections for guidance on which statistical methods suit your specific variables and objective.

Data Table

Column Profiles

Per-column type detection and summary statistics

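Column | Type | Unique_Values | Missing_Count | Missing_Pct | Min | Max | Mean | Median | Top_Categories
Campaign_Type | categorical | 5 | 0 | 0% | - | - | - | - | Social Media, Email, Display
Clicks | numeric | 901 | 0 | 0% | 100 | 1000 | 555.21 | 558 | -
Impressions | numeric | 3821 | 0 | 0% | 1001 | 10000 | 5496.69 | 5508.5 | -
Engagement_Score | categorical | 10 | 0 | 0% | - | - | - | - | 2, 3, 1
Customer_Segment | categorical | 5 | 0 | 0% | - | - | - | - | Foodies, Outdoor Adventurers, Tech Enthusiasts
Company | categorical | 5 | 0 | 0% | - | - | - | - | NexGen Systems, Innovate Industries, TechCorp
Target_Audience | categorical | 5 | 0 | 0% | - | - | - | - | Women 25-34, Men 25-34, Women 35-44
Duration | categorical | 4 | 0 | 0% | - | - | - | - | 30 days, 15 days, 45 days
Channel_Used | categorical | 6 | 0 | 0% | - | - | - | - | Google Ads, Instagram, YouTube
Conversion_Rate | numeric | 15 | 0 | 0% | 0.01 | 0.15 | 0.08 | 0.08 | -
Acquisition_Cost | categorical | 4252 | 0 | 0% | - | - | - | - | $14,544.00, $18,187.00, $18,551.00
ROI | numeric | 601 | 0 | 0% | 2 | 8 | 4.99 | 4.96 | -
Location | categorical | 5 | 0 | 0% | - | - | - | - | Houston, Miami, Chicago
Language | categorical | 5 | 0 | 0% | - | - | - | - | German, French, English
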
Interpretation

Headline

Your dataset contains 14 columns (4 numeric, 10 categorical) with zero missing values — data quality is clean and ready for analysis.

Purpose

This section identifies the type and structure of each column in your dataset. Understanding whether columns are numeric or categorical determines which statistical methods are appropriate and how to prepare data for modeling. Clean, complete data is the foundation for reliable analysis.

Key Findings

  • Total Columns: 14 — a moderately sized feature set for marketing/campaign analysis
  • Numeric Columns: 4 — likely performance metrics (Clicks, Impressions, Conversion_Rate, ROI based on earlier distribution data)
  • Categorical Columns: 10 — likely campaign attributes (Campaign_Type, Channel_Used, Language, Engagement_Score, Acquisition_Cost, and others)
  • Missing Data: 0% across all columns — no imputation or exclusion needed
  • Data Quality Status: Excellent — no critical or warning flags detected

Interpretation

The 4:10 numeric-to-categorical split is typical for marketing campaign datasets. Your numeric columns enable correlation analysis and regression modeling, while categorical columns support segmentation and stratified analysis. The absence of missing values means every row is complete and usable — no data cleaning overhead. Note that some categorical columns (like Acquisition_Cost) appear to contain currency values stored as text; these should be converted to numeric format before regression analysis to unlock their predictive power.
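
Converting the currency text in Acquisition_Cost to numbers is a small parsing step. The following is a minimal sketch, assuming every value follows the `$14,544.00` pattern seen in the top categories; `parse_currency` is a hypothetical helper name, and real data may need handling for other symbols or negative amounts.

```python
def parse_currency(text: str) -> float:
    """Turn a string such as '$14,544.00' into the float 14544.0.

    Assumes a leading dollar sign and comma thousands separators,
    as in the Acquisition_Cost samples shown in the report.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


# The three top categories listed for Acquisition_Cost.
samples = ["$14,544.00", "$18,187.00", "$18,551.00"]
costs = [parse_currency(s) for s in samples]
print(costs)
```

Once converted, the column can be profiled alongside Clicks, Impressions, Conversion_Rate, and ROI instead of being treated as a categorical variable with thousands of levels.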

Context

The Column Profiles table above lists the type, cardinality, and summary statistics for every column. Nine of the ten categorical columns have 10 or fewer unique values and can be used directly in models. Acquisition_Cost is the exception: it has 4,252 unique values, far above the max_categories setting of 20, because it holds currency amounts stored as text.

Visualization

Distribution Analysis

Data distributions: histograms for numeric, bar charts for categorical

Interpretation

Headline

Campaign channels and languages are nearly perfectly balanced across the dataset, with Social Media leading at 20.6% and no category exceeding 25.7%, enabling fair comparative analysis.

Purpose

This section profiles how values are distributed across 14 variables in your marketing dataset, both categorical (campaign types, languages, channels) and numeric (clicks, impressions, conversion rate, ROI). Understanding these distributions is essential for choosing the right statistical tests and identifying whether data transformations are needed before modeling. Skewed or imbalanced data can bias results or violate assumptions in regression and hypothesis testing.

Key Findings

  • Categorical Balance: Campaign_Type and Language variables show near-uniform distributions (989–1,044 observations per category), with no single category dominating. This indicates strong representation across all campaign channels and language segments.
  • Right Skew in Bin Counts: Bin counts exhibit a skewness of 1.05, indicating a moderate right tail. The median (361) is notably lower than the mean (489.22), so a few bins hold substantially more observations than the rest. This describes the histogram and bar-chart bins, not the skew of any numeric variable.
  • Wide Range: Count values span from 3 to 1,284, with a standard deviation of 366.44, which is high relative to the mean and reflects unequal distribution of observations across bins.
  • Data Composition: Of the 133 distribution bins, 70 (52.6%) belong to categorical variables and 63 (47.4%) to numeric variables. By column, the split is 10 categorical and 4 numeric.

Interpretation

The near-perfect balance in categorical variables (Campaign_Type, Language) is a strength: you can compare performance across campaign types and languages without sparse categories biasing results. The moderate right skew in bin counts (1.05) and their wide spread (SD=366.44) describe how observations fall into bins and do not by themselves show that any numeric variable is skewed. Check the individual distributions of Clicks, Impressions, Conversion_Rate, and ROI for heavy tails before applying parametric methods such as linear regression; the per-column means and medians in the Column Profiles table are close, which points to roughly symmetric distributions.

Context

This profiling covers 14 distinct variables across 133 distribution bins. The absence of missing data (confirmed in the missing_data_heatmap) means no imputation is needed. The boxplot summary (max value 9,994 vs. mean 1,546.94) pools all four numeric columns, so it is not a reason on its own to log-transform any predictor; base that decision on each column's own distribution.

Visualization

Missing Data Patterns

Missing data patterns and co-occurrence across columns

Interpretation

Headline

Your dataset is 100% complete: the heatmap shows zero missing values across all 14 columns in its 500-row sample, consistent with the 0 missing values reported for the full 5,000 rows. No imputation is required.

Purpose

This section evaluates data completeness and identifies missing value patterns that could bias downstream analysis. A dataset with no missing data eliminates a major source of analytical risk: you can proceed directly to modeling without imputation decisions, and all statistical tests will use the full sample size without adjustment for missing data mechanisms.

Key Findings

  • Total Missing Values: 0 out of all cells (0.00%)
  • All 14 Columns: Complete across all 500 sampled rows, with no systematic gaps by variable
  • No Missing Patterns: With no missing values, there is no missingness mechanism (MCAR, MAR, or MNAR) to diagnose, and the heatmap shows no bands or speckles to investigate
  • Sample Integrity: All 500 sampled rows retain full information, with no row-level data collection failures

Interpretation

The absence of missing data is rare and valuable. It suggests the data collection process was robust: no dropped fields, no incomplete imports, no survey non-response. Every row can be used in analysis without loss of sample size. Correlation matrices, regression models, and clustering algorithms can operate on the complete 5,000-row dataset (the heatmap itself displays a 500-row sample), maximizing statistical power and avoiding imputation assumptions that could introduce bias.

Context

This clean state assumes the data was properly validated before profiling. If values like "0," "NA," or "Unknown" are coded as actual entries rather than true missing values, they would not appear in this count — verify that categorical variables like Campaign_Type and Language contain only legitimate categories, not placeholders for missing data.
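
One way to run the placeholder check described above is to count tokens that commonly stand in for missing values. This is a sketch under assumptions: the `PLACEHOLDERS` set and the `placeholder_counts` helper are hypothetical, and the set should be adapted to the conventions of the actual source system. The value "0" is deliberately left out because it can be a legitimate entry.

```python
from collections import Counter

# Assumed list of tokens that often mean "missing"; adjust per dataset.
PLACEHOLDERS = {"", "na", "n/a", "null", "none", "unknown", "-", "?"}


def placeholder_counts(values):
    """Count values that match a placeholder token, ignoring case and padding."""
    hits = Counter()
    for value in values:
        token = str(value).strip().lower()
        if token in PLACEHOLDERS:
            hits[token] += 1
    return dict(hits)


# Hypothetical Language column containing three disguised missing values.
language = ["German", "French", "NA", "English", "unknown", "Unknown ", "Spanish"]
found = placeholder_counts(language)
print(found)
```

A non-empty result means the 0% missing figure understates the real gaps in that column.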

Visualization

Correlation Matrix

Pairwise correlation matrix for numeric variables

Interpretation

Headline

All four numeric variables are essentially independent—zero high-correlation pairs detected (|r| ≥ 0.7)—meaning each metric captures distinct information with no multicollinearity risk.

Purpose

This section examines whether the four numeric variables in your dataset (Clicks, Impressions, Conversion_Rate, and ROI) move together or independently. High correlations between predictors create multicollinearity, which inflates model uncertainty and makes it hard to isolate individual variable effects. Finding zero problematic correlations is a clean bill of health for downstream modeling.

Key Findings

  • High-Correlation Pairs: 0 detected — no variable pairs exceed the |r| ≥ 0.7 threshold
  • Strongest Observed Correlation: 0.01 (Clicks–Impressions, Impressions–ROI, Conversion_Rate–ROI) — effectively zero
  • Mean Correlation (off-diagonal): 0.01 — the four variables are nearly orthogonal
  • Multicollinearity Risk: None — all VIF values would be near 1.0
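
The overview reports a mean correlation of 0.26, while this section reports 0.01. The difference is whether the diagonal of the matrix is included. The sketch below assumes, for illustration, that every off-diagonal correlation equals 0.01; the report gives only the strongest and mean values, not the full matrix.

```python
def mean_correlation(matrix, include_diagonal):
    """Average the entries of a square correlation matrix."""
    n = len(matrix)
    values = [
        matrix[i][j]
        for i in range(n)
        for j in range(n)
        if include_diagonal or i != j
    ]
    return sum(values) / len(values)


# Assumed 4x4 matrix: 1.0 on the diagonal, 0.01 everywhere else.
r = 0.01
corr = [[1.0 if i == j else r for j in range(4)] for i in range(4)]

with_diagonal = mean_correlation(corr, include_diagonal=True)    # (4*1 + 12*0.01) / 16
off_diagonal = mean_correlation(corr, include_diagonal=False)    # 12*0.01 / 12
print(round(with_diagonal, 2), round(off_diagonal, 2))
```

Only the off-diagonal mean says anything about relationships between variables, so 0.01 is the figure to quote.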

Interpretation

The near-zero correlations across all variable pairs indicate that Clicks, Impressions, Conversion_Rate, and ROI operate independently in your dataset. This is unusual but valuable: it suggests each metric reflects a distinct business dimension rather than redundant information. For example, high impression volume does not automatically drive clicks, and clicks do not mechanically predict ROI. This independence simplifies modeling—you can safely include all four variables in regression or classification without worrying that one will mask or distort the effect of another.
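
The statement that VIF values would be near 1.0 can be checked directly, because the variance inflation factor of each predictor equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below uses a small pure-Python Gauss-Jordan inversion and again assumes all off-diagonal correlations are 0.01; `invert` and `vifs` are illustrative helper names.

```python
def invert(matrix):
    """Invert a square matrix with Gauss-Jordan elimination and partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [
        [float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def vifs(corr):
    """VIF of each variable: the diagonal of the inverse correlation matrix."""
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(corr))]


# Assumed matrix for Clicks, Impressions, Conversion_Rate, ROI.
corr = [[1.0 if i == j else 0.01 for j in range(4)] for i in range(4)]
print([round(v, 4) for v in vifs(corr)])
```

For comparison, two predictors correlated at 0.9 give a VIF above 5, which is the kind of value that would prompt dropping or combining variables.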

Context

This clean correlation structure assumes the data are measured without error and that relationships are linear. If you plan to build predictive models, this independence is an asset. However, verify that the near-zero correlations reflect true business reality (e.g., campaigns with high impressions genuinely do not convert at higher rates) rather than data quality issues or missing confounders.

Visualization

Outlier Detection

IQR-based outlier detection with box plots for numeric variables

Interpretation

Headline

No statistical outliers were detected in any of the 4 numeric columns using the IQR method. The skew of 1.66 and the 0.01 to 9,994 range (a ratio of roughly 1,000,000×) describe the four columns pooled together, so they reflect differences in scale between columns rather than extreme campaigns.

Purpose

This section identifies data points that deviate significantly from the typical distribution using the interquartile range (IQR) method—a standard statistical approach that flags values beyond 1.5× the IQR from the quartile boundaries. Understanding outliers is critical because they can distort statistical models, inflate standard errors, and lead to misleading conclusions about campaign performance.
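
The IQR rule described above can be sketched in a few lines. The profiler's exact quartile convention is not documented in this report, so the use of `statistics.quantiles` with `method="inclusive"` is an assumption, and the sample values are made up for illustration.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """List the values that fall outside the IQR fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Evenly spread values, like a column bounded between 100 and 1,000.
evenly_spread = list(range(100, 1001, 100))
# The same values plus one extreme observation.
with_extreme = evenly_spread + [9994]

print(iqr_outliers(evenly_spread))
print(iqr_outliers(with_extreme))
```

A column whose values are spread evenly over a bounded range has wide fences and produces no flags, which matches the zero-outlier result reported here.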

Key Findings

  • Columns with outliers flagged: 0 of 4 — No points exceeded IQR thresholds
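  • Data spread: Values range from 0.01 to 9,994 when all four numeric columns are pooled; per column, the ranges are Conversion_Rate 0.01–0.15, ROI 2–8, Clicks 100–1,000, and Impressions 1,001–10,000
  • Skewness: 1.66 for the pooled values, a strong right skew that comes from combining columns on different scales
  • Mean vs. median gap: Pooled mean of 1,546.94 vs. pooled median of 54; within each column the mean and median nearly coincide (e.g., Clicks 555.21 vs. 558)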

Interpretation

The absence of IQR-flagged outliers does not by itself mean the data is normally distributed. Here, though, the high skewness and wide range come from pooling the four numeric columns (Clicks, Impressions, Conversion_Rate, ROI), which sit on very different scales. Within each column the values stay inside the IQR fences and the mean and median nearly coincide, so there is little sign of extreme campaigns that would distort regression coefficients or confidence intervals.

Context

For marketing campaign data, high skew is common: a few campaigns perform exceptionally well while most cluster near lower values. The per-column statistics here do not show that pattern, but if a later extract does, consider log transformation or robust regression methods to reduce the influence of extreme values without removing valid data.
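
To decide whether a log transformation is worthwhile, skewness can be measured per column before and after transforming. The sketch below uses the population (Fisher-Pearson) skewness coefficient and a made-up right-skewed sample; `log1p` is chosen so that zeros would not break the transform, which is an assumption about the data rather than a requirement of the report.

```python
from math import log1p


def skewness(values):
    """Population skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed metric: most values small, a few very large.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [log1p(v) for v in raw]

print(round(skewness(raw), 2), round(skewness(logged), 2))
```

Apply the check to each numeric column separately; computing it on pooled columns measures scale differences between columns instead of the shape of any one distribution.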

Visualization

Categorical Breakdown

Frequency distributions for all categorical variables

Interpretation

Headline

Campaign channels and languages are nearly perfectly balanced across 10 categorical variables, with no single category exceeding 25.68% — ideal for unbiased modeling and group comparisons.

Purpose

This section examines the distribution of categorical variables across your dataset to identify potential modeling challenges like class imbalance, high cardinality, or rare categories. Balanced categorical data enables reliable statistical tests and prevents models from being biased toward dominant groups. Understanding these distributions is essential before building predictive models or conducting group comparisons.

Key Findings

  • Balance Across Categories: Campaign_Type shows near-perfect balance (Social Media 20.6%, Email 19.96%, Display 19.84%, Influencer 19.82%, Search 19.78%) — no single channel dominates
  • Language Distribution: German leads at 20.88%, but all five languages cluster between 19–21%, indicating excellent balance
  • Cardinality: 10 categorical columns with 69 unique categories total — manageable for one-hot encoding without dimensionality explosion
  • Rare Categories: Minimum count is 3 occurrences (0.06%), suggesting some low-frequency categories exist but represent <1% of data
  • Acquisition_Cost Dominance: This column accounts for 28.6% of the categorical bins (20 of 70, the max_categories cap), although it has 4,252 unique values in total; it holds currency amounts stored as text rather than true categories

Interpretation

The near-uniform distribution across Campaign_Type and Language variables (each ~20% per category) eliminates class imbalance concerns that typically plague marketing datasets. This balance makes these variables excellent candidates for ANOVA, chi-square tests, and classification models without requiring resampling or class weights. However, Acquisition_Cost is numeric data stored as currency text (4,252 unique values) rather than a true categorical variable; convert it to numeric before analysis.

Context

At least one category falls below the 5-occurrence threshold (the minimum count is 3), so review low-frequency values for consolidation into "Other"; these most likely come from Acquisition_Cost, whose values are nearly all unique. The negative skew in count distribution (-1.48) reflects that most categories are well-represented, with only a few rare ones. Aside from Acquisition_Cost, the categorical variables need no preprocessing.
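
The balance and rare-category checks in this section amount to counting values per category and comparing each count with a threshold. The sketch below rebuilds Campaign_Type from the percentages reported above (which sum to exactly 5,000 rows); the helper names and the default threshold of 5 are illustrative assumptions.

```python
from collections import Counter


def category_shares(values):
    """Map each category to (count, percentage of all rows)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 2)) for cat, n in counts.items()}


def rare_categories(values, min_count=5):
    """Return categories that occur fewer than min_count times."""
    return sorted(cat for cat, n in Counter(values).items() if n < min_count)


# Counts implied by the reported Campaign_Type percentages of 5,000 rows.
campaign_type = (
    ["Social Media"] * 1030 + ["Email"] * 998 + ["Display"] * 992
    + ["Influencer"] * 991 + ["Search"] * 989
)

shares = category_shares(campaign_type)
print(shares["Social Media"], shares["Search"])
print(rare_categories(campaign_type))
```

Running `rare_categories` on each categorical column shows which column holds the low-frequency values.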
