Context and Data Preparation

Analysis Overview and Data Quality

Analysis Overview

CSV Auto-Profiler

Analysis overview and configuration

Auto Profiler
Analytics User
Profile this CSV dataset automatically
Module Configuration
outlier_method: iqr
outlier_threshold: 1.5
correlation_method: pearson
correlation_threshold: 0.7
alpha: 0.05
max_categories: 20
missing_warn_threshold: 0.05
missing_critical_threshold: 0.3
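
The configuration above could be captured in code roughly as follows. This is a minimal sketch: the `ProfilerConfig` class and the `missing_severity` helper are hypothetical names, and treating the two missing-data thresholds as inclusive lower bounds on a column's missing fraction is an assumption, since the report does not say how they are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfilerConfig:
    """Hypothetical container mirroring the Module Configuration values."""
    outlier_method: str = "iqr"
    outlier_threshold: float = 1.5
    correlation_method: str = "pearson"
    correlation_threshold: float = 0.7
    alpha: float = 0.05
    max_categories: int = 20
    missing_warn_threshold: float = 0.05
    missing_critical_threshold: float = 0.3


def missing_severity(missing_fraction: float, cfg: ProfilerConfig) -> str:
    """Label a column's missing fraction (assumed: thresholds are inclusive)."""
    if missing_fraction >= cfg.missing_critical_threshold:
        return "critical"
    if missing_fraction >= cfg.missing_warn_threshold:
        return "warn"
    return "ok"


cfg = ProfilerConfig()
print(missing_severity(0.0, cfg))  # every column in this report has 0 missing
```

With every column at 0% missing, no column in this dataset would reach either threshold.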

Key Insights

Analysis Overview

Purpose

This analysis provides a data quality and statistical profile of a 500-row, 31-column dataset. The objective is to automatically characterize the dataset’s structure, completeness, distributions, and relationships, establishing a baseline before deeper analytical work. This assessment helps gauge data reliability and surfaces potential issues early.

Key Findings

  • Data Quality Score: 97.7% - Exceptionally clean dataset with zero missing values across all 500 observations and 31 variables
  • Column Composition: 12 numeric and 19 categorical columns with no high-cardinality categorical variables, indicating manageable dimensionality
  • Correlation Structure: 4 high-correlation pairs identified, primarily among tenure-related variables (YearsWithCurrManager correlates 0.77 with YearsAtCompany and 0.74 with YearsInCurrentRole)
  • Outlier Prevalence: 6 columns contain outliers; 97.7% of numeric values are normal, about 2.0% are mild outliers (120 values), and 0.3% are extreme outliers (17 values)
  • Distribution Skewness: Numeric values show moderate positive skew (skewness 1.23), suggesting right-tailed distributions in several variables

Interpretation

The dataset exhibits exceptional data quality with complete coverage and minimal data integrity issues. The low

Data Preprocessing

Zero rows removed - profiling raw data

Data preprocessing and column mapping

Data Pipeline
Initial Records: 500
Clean Records: 500
Final Rows: 500
Column Mapping
Role Column
independent_1 Age
independent_2 Attrition
independent_3 BusinessTravel
independent_4 DailyRate
independent_5 Department
independent_6 DistanceFromHome
independent_7 Education
independent_8 EducationField
independent_9 EnvironmentSatisfaction
independent_10 Gender
independent_11 HourlyRate
independent_12 JobInvolvement
independent_13 JobLevel
independent_14 JobRole
independent_15 JobSatisfaction
independent_16 MaritalStatus
independent_17 MonthlyIncome
independent_18 MonthlyRate
independent_19 NumCompaniesWorked
independent_20 OverTime
independent_21 PercentSalaryHike
independent_22 PerformanceRating
independent_23 RelationshipSatisfaction
independent_24 StockOptionLevel
independent_25 TotalWorkingYears
independent_26 TrainingTimesLastYear
independent_27 WorkLifeBalance
independent_28 YearsAtCompany
independent_29 YearsInCurrentRole
independent_30 YearsSinceLastPromotion
independent_31 YearsWithCurrManager
id EmployeeNumber
500 Records
MCP Analytics

Key Insights

Data Preprocessing

Purpose

This section documents the data preprocessing pipeline applied to the 500-row dataset before analysis. It demonstrates data integrity through the preprocessing stage and establishes the foundation for the statistical analysis and outlier detection performed across 31 columns with a 97.7% data quality score.

Key Findings

  • Retention Rate: 100% (500/500 rows retained) - No observations were removed; the profiler ran on the raw data
  • Rows Removed: 0 - No rows were dropped as duplicates, invalid entries, or quality-threshold failures; because the raw data was profiled as-is, this is consistent with, though not proof of, their absence
  • Data Quality Score: 97.7% - Exceptionally high quality aligns with zero missing values across all 31 columns, supporting reliable downstream analysis
  • Train/Test Split: Not specified - No explicit model training/validation split documented, suggesting this is exploratory analysis rather than predictive modeling

Interpretation

The complete retention of all 500 observations indicates the dataset arrived in excellent condition with no data quality issues requiring removal. Combined with the 0% overall missing rate and 97.7% quality score, this suggests the data preprocessing phase was minimal—primarily validation rather than transformation. This clean state enables confident statistical analysis, correlation assessment, and outlier detection without concerns about data loss biasing results.
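
A retention figure like the one above can be reproduced with a small check. The sketch below is illustrative only: the report does not state which cleaning rules were applied, so exact-duplicate removal is used here as a stand-in, and the function names `drop_exact_duplicates` and `retention_pct` are hypothetical.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def retention_pct(initial_rows: int, final_rows: int) -> float:
    """Share of rows that survive preprocessing, in percent."""
    if initial_rows <= 0:
        raise ValueError("initial_rows must be positive")
    return 100.0 * final_rows / initial_rows


# Toy rows (not taken from the dataset): the third row repeats the first.
rows = [(1, "a"), (2, "b"), (1, "a")]
clean = drop_exact_duplicates(rows)
print(len(rows), len(clean), retention_pct(len(rows), len(clean)))

# The figures reported in this section: 500 rows in, 500 rows out.
print(retention_pct(500, 500))
```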

Context

The absence of documented train/test splits indicates this analysis focuses

Executive Summary

Key Findings and Recommended Next Steps

Executive Summary

Key Findings & Recommendations

Key Performance Indicators

Total rows: 500
Total columns: 31
Numeric columns: 12
Categorical columns: 19
Overall missing pct: 0
Data quality score: 97.7

Dataset Overview

Key findings

Metric Value
Total Rows 500
Total Columns 31
Numeric Columns 12
Categorical Columns 19
Missing Data % 0.00%
Data Quality Score 97.7/100
High Correlations 4
Columns with Outliers 6

Executive Summary

Bottom Line: This dataset contains 500 rows across 31 columns (12 numeric, 19 categorical). Data quality score: 97.7%. Overall missing data: 0.00%. Found 4 high correlation pairs (|r| > 0.70) - potential multicollinearity.

Dataset Characteristics:
• 500 rows × 31 columns (12 numeric, 19 categorical)
• Missing data: 0.00% overall
• Data quality score: 97.7/100

Key Findings:
• 4 high correlation pairs detected
• 6 columns with outliers
• 0 high-cardinality categoricals

Recommended Next Steps:
• ANOVA or Chi-Square tests (categorical grouping variables)
• Address multicollinearity before regression (drop redundant features)
• Outlier investigation (verify data entry, consider robust methods)

Key Insights

Executive Summary

EXECUTIVE SUMMARY

Purpose

This section synthesizes the complete data quality and structural assessment of a 500-row, 31-column dataset. Understanding these foundational metrics is critical for determining whether the data is suitable for downstream analysis and what preprocessing steps are required before modeling or statistical testing.

Key Findings

  • Data Quality Score: 97.7/100 – Exceptionally clean dataset with minimal data integrity issues
  • Missing Data: 0.00% overall – Complete dataset with no gaps requiring imputation
  • High Correlation Pairs: 4 detected (|r| > 0.70) – Indicates potential multicollinearity between features, particularly among tenure-related variables (YearsWithCurrManager, YearsAtCompany, YearsInCurrentRole)
  • Outlier Prevalence: 6 columns contain outliers; 97.7% of numeric values are normal and about 2.3% are mild or extreme outliers – Manageable outlier burden
  • Categorical Structure: 19 categorical variables with 0 high-cardinality columns – Well-balanced categorical feature space

Interpretation

This dataset demonstrates exceptional data quality with zero missing values and a 97.7 quality score, indicating minimal data entry errors or structural problems. The presence of 4 high-correlation pairs suggests redundancy in tenure-related

Column Profiles

Type detection and per-column statistics

Column Profiles

Type detection and statistics per column

Total Columns: 31

Per-column statistics and type detection results

column_name | detected_type | unique_values | missing_count | missing_pct | min_value | max_value | mean_value | median_value | std_dev | top_categories
Age | numeric | 43 | 0 | 0.000 | 18.000 | 60.000 | 36.896 | 36.000 | 9.360 | NA
Attrition | categorical | 2 | 0 | 0.000 | NA | NA | NA | NA | NA | No, Yes
BusinessTravel | categorical | 3 | 0 | 0.000 | NA | NA | NA | NA | NA | Travel_Rarely, Travel_Frequently, Non-Travel
DailyRate | numeric | 415 | 0 | 0.000 | 103.000 | 1499.000 | 838.874 | 843.000 | 408.792 | NA
Department | categorical | 3 | 0 | 0.000 | NA | NA | NA | NA | NA | Research & Development, Sales, Human Resources
DistanceFromHome | numeric | 29 | 0 | 0.000 | 1.000 | 29.000 | 9.120 | 6.000 | 8.255 | NA
Education | categorical | 5 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 4, 2
EducationField | categorical | 6 | 0 | 0.000 | NA | NA | NA | NA | NA | Life Sciences, Medical, Marketing
EnvironmentSatisfaction | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 4, 2
Gender | categorical | 2 | 0 | 0.000 | NA | NA | NA | NA | NA | Male, Female
HourlyRate | numeric | 71 | 0 | 0.000 | 30.000 | 100.000 | 65.742 | 66.000 | 20.604 | NA
JobInvolvement | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 2, 4
JobLevel | categorical | 5 | 0 | 0.000 | NA | NA | NA | NA | NA | 1, 2, 3
JobRole | categorical | 9 | 0 | 0.000 | NA | NA | NA | NA | NA | Sales Executive, Research Scientist, Laboratory Technician
JobSatisfaction | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 4, 3, 2
MaritalStatus | categorical | 3 | 0 | 0.000 | NA | NA | NA | NA | NA | Married, Single, Divorced
MonthlyIncome | numeric | 482 | 0 | 0.000 | 1102.000 | 19999.000 | 6598.644 | 4952.000 | 4814.582 | NA
MonthlyRate | numeric | 497 | 0 | 0.000 | 2094.000 | 26959.000 | 14157.400 | 14174.000 | 7002.615 | NA
NumCompaniesWorked | categorical | 10 | 0 | 0.000 | NA | NA | NA | NA | NA | 1, 0, 4
OverTime | categorical | 2 | 0 | 0.000 | NA | NA | NA | NA | NA | No, Yes
PercentSalaryHike | numeric | 15 | 0 | 0.000 | 11.000 | 25.000 | 15.222 | 14.000 | 3.729 | NA
PerformanceRating | categorical | 2 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 4
RelationshipSatisfaction | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 4, 2
StockOptionLevel | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 0, 1, 2
TotalWorkingYears | numeric | 38 | 0 | 0.000 | 0.000 | 40.000 | 11.464 | 10.000 | 7.777 | NA
TrainingTimesLastYear | categorical | 7 | 0 | 0.000 | NA | NA | NA | NA | NA | 2, 3, 5
WorkLifeBalance | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 2, 4
YearsAtCompany | numeric | 33 | 0 | 0.000 | 0.000 | 40.000 | 7.038 | 5.000 | 6.458 | NA
YearsInCurrentRole | numeric | 19 | 0 | 0.000 | 0.000 | 18.000 | 4.238 | 3.000 | 3.720 | NA
YearsSinceLastPromotion | numeric | 15 | 0 | 0.000 | 0.000 | 15.000 | 2.194 | 1.000 | 3.275 | NA
YearsWithCurrManager | numeric | 16 | 0 | 0.000 | 0.000 | 17.000 | 4.184 | 3.000 | 3.576 | NA
Total columns: 31
Numeric columns: 12

Key Insights

Column Profiles

Purpose

This section provides a structural overview of the dataset’s composition, identifying which columns are numeric versus categorical. Understanding column types is foundational for selecting appropriate statistical methods, interpreting distributions, and detecting data quality issues across the 500-row dataset.

Key Findings

  • Total Columns: 31 columns analyzed with clear type separation
  • Numeric Columns: 12 columns suitable for correlation, regression, and outlier detection
  • Categorical Columns: 19 columns appropriate for frequency analysis and cross-tabulation
  • Missing Data: 0% overall missing rate indicates complete data integrity across all columns
  • Data Quality Score: 97.7% reflects high-quality, analysis-ready data with minimal anomalies

Interpretation

The dataset demonstrates balanced composition between numeric and categorical variables, enabling comprehensive multivariate analysis. The absence of missing values eliminates imputation concerns and ensures all 500 observations are usable across all 31 dimensions. The high data quality score validates that the subsequent correlation analysis (4 high-correlation pairs identified) and outlier detection (6 columns with outliers) are based on reliable, complete information.

Context

Detailed per-column statistics are listed in the column profile table above. The type detection logic applied here (numeric >10 unique values; categorical ≤10
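
The type-detection rule mentioned above (more than 10 unique values for numeric, 10 or fewer for categorical) might look like the following sketch. The function name `detect_type` is hypothetical, and the extra requirement that every value parse as a number is an assumption; the report only states the unique-value cutoff.

```python
def detect_type(values, max_categorical_unique: int = 10) -> str:
    """Classify a column as 'numeric' or 'categorical'.

    Assumed rule: a column is numeric only if it has more than
    `max_categorical_unique` distinct values AND all of them parse as numbers.
    """
    unique = set(values)

    def is_number(v) -> bool:
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    if len(unique) > max_categorical_unique and all(is_number(v) for v in unique):
        return "numeric"
    return "categorical"


# Toy columns shaped like the ones in the profile table.
print(detect_type(list(range(18, 61))))      # 43 distinct ages
print(detect_type([3, 4, 2, 1, 5, 3, 3]))    # 5 distinct education levels
print(detect_type(["No", "Yes", "No"]))      # text labels
```

This matches the table: NumCompaniesWorked has exactly 10 unique values and is reported as categorical, while PercentSalaryHike has 15 and is reported as numeric.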

Distribution Analysis

Histograms for numeric, bar charts for categorical variables

Distribution Analysis

Histograms and bar charts for all columns

Data distributions: histograms for numeric, bar charts for categorical

Key Insights

Distribution Analysis

Purpose

This section visualizes how values are distributed across the 31 variables in the dataset (12 numeric, 19 categorical). Distribution analysis reveals data shape, concentration patterns, and potential data quality issues—essential for understanding whether variables are suitable for modeling and whether transformations are needed.

Key Findings

  • Right Skew (1.23): Numeric distributions show moderate positive skew, indicating right-tailed patterns where extreme high values pull the mean above the median (mean=2325 vs median=24.25; these figures appear to pool columns on very different scales, so they indicate direction rather than magnitude).
  • Bin Frequency Imbalance: Bin counts range from 0 to 422 with high spread (sd=68.13), showing uneven population across bins; some intervals contain zero observations.
  • Categorical Dominance: 75.3% of categorical entries fall in empty/missing categories according to the distribution summary. This conflicts with the 0% missing rate reported elsewhere and likely reflects how the chart data is laid out rather than true sparsity.
  • Multimodal Pattern: Age bins show counts of 10, 14, 23, and 27, which may point to subgroups or cohorts within the workforce.
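
For reference, a skewness value like the 1.23 quoted above can be computed per column with the moment-based (Fisher-Pearson) coefficient. The report does not say which estimator the profiler uses, so the formula choice here is an assumption, and `sample_skewness` is a hypothetical helper name.

```python
def sample_skewness(values) -> float:
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: no asymmetry to measure
    return m3 / m2 ** 1.5


# Toy data, not taken from the dataset.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]
print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

A positive result means the right tail is longer, which is the pattern the report describes for income and tenure variables.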

Interpretation

The moderate right skew across numeric variables suggests natural business metrics (income, rates, tenure) follow typical organizational patterns where most values cluster at lower ranges with occasional high outliers. The uneven bin distribution and categorical sparsity indicate that some variables may have limited discriminative power for analysis. The multi

Missing Data Patterns

Visualize missingness co-occurrence across variables

Missing Data Patterns

Heatmap showing missingness co-occurrence

Missing data patterns and co-occurrence

Total missing: 0
Overall missing pct: 0

Key Insights

Missing Data Patterns

Purpose

This section evaluates data completeness across all 500 rows and 31 columns to identify missing value patterns that could compromise analysis validity. Understanding missingness is critical for determining whether data can be analyzed as-is or requires imputation, and for detecting systematic collection issues that might bias results.

Key Findings

  • Total Missing Values: 0 (0.00% of all cells) – The dataset is fully complete, with no gaps in any row or column
  • Missing Pattern Type: No vertical bands, horizontal bands, or random speckles detected – no systematic or random missingness patterns
  • Data Integrity: All 15,500 cells (500 rows × 31 columns) contain values, so no missingness mechanism (MCAR, MAR, or MNAR) needs to be assessed

Interpretation

The absence of missing data eliminates a major source of analytical bias and simplifies downstream processing. With zero missingness, there is no need to investigate collection failures, survey skip logic, or imputation strategies. This complete dataset enables direct statistical analysis without the complications of handling incomplete information, making correlation analysis, outlier detection, and distribution assessment more straightforward and reliable.
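
The completeness figures above come down to counting empty cells. A minimal sketch follows, assuming rows are dictionaries and that `None`, the empty string, and the literal `"NA"` count as missing; both the representation and the marker set are assumptions, and `missing_summary` is a hypothetical name.

```python
def missing_summary(rows, columns, missing_markers=(None, "", "NA")):
    """Return (per-column missing counts, total missing, overall missing %)."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) in missing_markers:
                counts[c] += 1
    total_cells = len(rows) * len(columns)
    total_missing = sum(counts.values())
    pct = 100.0 * total_missing / total_cells if total_cells else 0.0
    return counts, total_missing, pct


# Toy rows, not taken from the dataset: one of four cells is missing.
rows = [{"a": 1, "b": None}, {"a": 2, "b": "x"}]
counts, total_missing, pct = missing_summary(rows, ["a", "b"])
print(counts, total_missing, pct)

# Cell count for the profiled dataset: 500 rows x 31 columns.
print(500 * 31)
```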

Context

This perfect completeness is relatively rare in real-world datasets and suggests either careful data collection practices or pre-processing that removed incomplete records. The finding directly supports the high data quality score of 97.7%

Correlation Matrix

Identify multicollinearity in numeric variables

Correlation Matrix

Numeric variable correlations

Correlation matrix for numeric variables with hierarchical clustering

Numeric columns: 12
High correlation pairs: 4

Key Insights

Correlation Matrix

Purpose

This section identifies multicollinearity in the dataset by detecting strong linear relationships among numeric variables. Understanding these correlations is critical for predictive modeling, as highly correlated variables can inflate standard errors, reduce model interpretability, and create redundancy in feature sets. It complements the overall data quality assessment (97.7% score), which is based on missingness and outliers, by flagging structural redundancy.

Key Findings

  • High Correlation Pairs: 4 pairs identified with |r| ≥ 0.7, indicating moderate-to-strong redundancy among tenure-related variables
  • YearsWithCurrManager Relationships: Shows strong correlations with YearsAtCompany (0.77) and YearsInCurrentRole (0.74), suggesting these variables capture overlapping information about employee tenure
  • Overall Correlation Range: Mean correlation of 0.22 across all pairs indicates most variables are relatively independent, with only tenure metrics showing clustering
  • Perfect Correlations: Diagonal values of 1.0 represent expected self-correlations, not data issues
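
The pair detection summarized above can be sketched with a plain Pearson correlation and the configured threshold of 0.7. The helper names are hypothetical, and using `>=` for the cutoff is an assumption (the report writes both `|r| > 0.70` and `|r| ≥ 0.7`).

```python
import math
from itertools import combinations


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined for a constant column
    return cov / (sx * sy)


def high_correlation_pairs(columns: dict, threshold: float = 0.7):
    """List (name_a, name_b, r) for each unordered pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Toy columns, not taken from the dataset: x and y move together, z does not.
data = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8.5], "z": [4, 1, 3, 2]}
pairs = high_correlation_pairs(data)
print(pairs)
```

Iterating over unordered pairs skips the diagonal, so self-correlations of 1.0 never appear in the result.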

Interpretation

The dataset exhibits localized multicollinearity concentrated in tenure-related variables rather than systemic redundancy. The four high-correlation pairs suggest that years with current manager, years at company, and years in current role measure related but distinct aspects of employee tenure. This pattern is typical in HR datasets and

Outlier Detection

IQR-based outlier identification for numeric columns

Outlier Detection

Box plots for numeric columns

IQR-based outlier detection for numeric variables

Columns with outliers: 6

Key Insights

Outlier Detection

Purpose

This section identifies and classifies anomalous values in numeric variables using the Interquartile Range (IQR) method. Outlier detection is critical for understanding data distribution shape, identifying potential data quality issues, and determining whether statistical transformations or specialized modeling approaches are needed for accurate analysis.

Key Findings

  • Columns with Outliers: 6 numeric variables flagged - indicates moderate presence of extreme values across the dataset
  • Outlier Distribution: 97.7% normal values, about 2.0% mild outliers (120 cases), 0.3% extreme outliers (17 cases)
  • Value Range Skewness: Mean of 1,812.58 vs. median of 15 suggests right-skewed distributions with high-value outliers pulling the mean upward
  • IQR Bounds Variability: Upper bounds range from 7.5 to 37,446.88, indicating heterogeneous spread across numeric columns
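
The IQR classification described above can be sketched as follows. The mild fence uses the configured multiplier of 1.5; the 3.0 multiplier for extreme outliers follows the common Tukey convention and is an assumption, since the report does not state it. The quartile method (Python's default "exclusive" method) and the name `classify_outliers` are also assumptions.

```python
import statistics


def classify_outliers(values, mild_k: float = 1.5, extreme_k: float = 3.0):
    """Count normal, mild, and extreme values using IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    counts = {"normal": 0, "mild": 0, "extreme": 0}
    for v in values:
        if v < q1 - extreme_k * iqr or v > q3 + extreme_k * iqr:
            counts["extreme"] += 1
        elif v < q1 - mild_k * iqr or v > q3 + mild_k * iqr:
            counts["mild"] += 1
        else:
            counts["normal"] += 1
    return counts


# Toy data, not taken from the dataset: 25 is mild, 100 is extreme.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 25, 100]
print(classify_outliers(values))

# Reported totals: 120 mild + 17 extreme out of 12 x 500 numeric values.
print(round(100 * (120 + 17) / (12 * 500), 1))
```

The last line reproduces the 2.3% overall outlier share quoted in this report.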

Interpretation

The dataset exhibits minimal but meaningful outlier presence (2.3% total), concentrated in 6 numeric columns. The 17 extreme outliers warrant verification but represent only about 0.3% of numeric values, which is more consistent with genuine data points than with systematic errors. The right-skewed distribution pattern (skewness=1.12) indicates

Categorical Breakdown

Frequency distributions and cardinality analysis

Categorical Breakdown

Frequency distributions for categorical variables

Frequency analysis and cardinality warnings for categorical variables

Categorical columns: 19
High-cardinality categoricals: 0

Key Insights

Categorical Breakdown

Purpose

This section evaluates the distribution and quality of categorical variables across the dataset. It identifies class imbalance patterns and cardinality issues that could affect model performance and interpretability. Understanding categorical structure is essential for feature engineering and ensuring meaningful variation in predictive analysis.

Key Findings

  • Categorical Columns: 19 variables analyzed with 0 high-cardinality fields (>20 unique values), indicating clean, manageable categorical data
  • Class Imbalance Detected: Attrition shows 84.4% “No” vs. 15.6% “Yes”, a pronounced imbalance that would complicate predictive modeling of attrition
  • Dominant Categories: BusinessTravel (71.2% “Travel_Rarely”) and WorkLifeBalance (61.6% rating “3”) exhibit strong skew toward single categories
  • Distribution Pattern: Mean category frequency of 114.46 with right skew (0.86) indicates uneven representation across categories

Interpretation

The absence of high-cardinality variables simplifies analysis and reduces preprocessing burden. However, pronounced class imbalance in key variables like Attrition limits discriminative power—models may struggle to learn minority patterns. The concentration of observations in dominant categories (e.g., 71% rarely travel) reflects real-world distributions but reduces variation available for distinguishing between groups.
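
The cardinality and dominance checks discussed above can be sketched with a frequency count. The `categorical_profile` helper is a hypothetical name; the cutoff of 20 comes from `max_categories` in the configuration, and the example column is synthetic, rebuilt from the reported shares (84.4% of 500 rows is 422).

```python
from collections import Counter


def categorical_profile(values, max_categories: int = 20) -> dict:
    """Summarize cardinality and the share of the most frequent category."""
    counts = Counter(values)
    top_category, top_count = counts.most_common(1)[0]
    return {
        "n_unique": len(counts),
        "high_cardinality": len(counts) > max_categories,
        "top_category": top_category,
        "top_share_pct": 100.0 * top_count / len(values),
    }


# Synthetic Attrition column matching the reported 84.4% / 15.6% split.
attrition = ["No"] * 422 + ["Yes"] * 78
profile = categorical_profile(attrition)
print(profile)
```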

Data Quality Assessment

Overall quality score based on missingness and outliers

Data Quality Score

Overall assessment of data quality

Overall data quality assessment

Data quality score: 97.7

Key Insights

Data Quality Score

Purpose

This section evaluates the overall integrity and completeness of your dataset before analysis. A high data quality score indicates minimal data issues, enabling reliable statistical analysis and meaningful insights. This foundational assessment determines whether the dataset is suitable for downstream analytical work without extensive preprocessing.

Key Findings

  • Data Quality Score: 97.7/100 - Falls in the “Excellent” range (90-100), indicating the dataset is ready for analysis
  • Missing Data: 0% across all 500 rows and 31 columns - Complete dataset with no gaps
  • Outliers: 2.3% of numeric values flagged (120 mild, 17 extreme) - Minor outlier presence
  • Data Completeness: No deductions for missing data; the score reflects only the outlier share

Interpretation

The dataset demonstrates exceptional quality with zero missing values and negligible outlier contamination. This pristine condition suggests the data has been well-curated or preprocessed before analysis. The 97.7 score reflects only the natural presence of statistical outliers (2.3%), which is expected in real-world distributions and does not warrant removal. This quality level supports confident analysis across all 12 numeric and 19 categorical variables without data imputation or extensive cleaning.
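
The reported score is consistent with a simple subtraction of the missing and outlier percentages from 100. The sketch below is a guess that happens to reproduce the published number, not the tool's documented scoring method; `quality_score` is a hypothetical name.

```python
def quality_score(missing_pct: float, outlier_pct: float) -> float:
    """Assumed formula: start at 100 and subtract both percentages."""
    return max(0.0, 100.0 - missing_pct - outlier_pct)


# Figures from the report: 12 numeric columns x 500 rows, 120 mild + 17 extreme.
numeric_values = 12 * 500
outlier_pct = 100.0 * (120 + 17) / numeric_values
score = quality_score(0.0, outlier_pct)
print(round(score, 1))  # 97.7
```

Under this assumption, any missing data would lower the score point for point, which fits the statement that the score is based on missingness and outliers.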

Context

The scoring methodology prioritizes completeness and outlier detection. The absence of missing
