Analysis Overview and Data Quality
CSV Auto-Profiler
Analysis overview and configuration
Analysis Overview
This analysis provides a comprehensive data quality and statistical profile of a 500-row, 31-column dataset. The objective is to automatically characterize the dataset’s structure, completeness, distributions, and relationships to establish a baseline understanding before deeper analytical work. This foundational assessment ensures data reliability and identifies potential issues early.
The dataset exhibits exceptional data quality with complete coverage and minimal data integrity issues. The low
Zero rows removed - profiling raw data
Data preprocessing and column mapping
Data Preprocessing
This section documents the data preprocessing pipeline applied to the 500-row dataset before analysis. It demonstrates data integrity through the preprocessing stage and establishes the foundation for the statistical analysis and outlier detection performed across 31 columns with a 97.7% data quality score.
The complete retention of all 500 observations indicates the dataset arrived in excellent condition with no data quality issues requiring removal. Combined with the 0% overall missing rate and 97.7% quality score, this suggests the data preprocessing phase was minimal—primarily validation rather than transformation. This clean state enables confident statistical analysis, correlation assessment, and outlier detection without concerns about data loss biasing results.
The absence of documented train/test splits indicates this analysis focuses
Key Findings and Recommended Next Steps
Key Findings & Recommendations
| Metric | Value |
|---|---|
| Total Rows | 500 |
| Total Columns | 31 |
| Numeric Columns | 12 |
| Categorical Columns | 19 |
| Missing Data % | 0.00% |
| Data Quality Score | 97.7/100 |
| High Correlations | 4 |
| Columns with Outliers | 6 |
Bottom Line: This dataset contains 500 rows across 31 columns (12 numeric, 19 categorical). Data quality score: 97.7/100. Overall missing data: 0.00%. Found 4 high-correlation pairs (|r| > 0.70), which indicates potential multicollinearity.
Dataset Characteristics:
• 500 rows × 31 columns (12 numeric, 19 categorical)
• Missing data: 0.00% overall
• Data quality score: 97.7/100
Key Findings:
• 4 high correlation pairs detected
• 6 columns with outliers
• 0 high-cardinality categoricals
Recommended Next Steps:
• ANOVA or Chi-Square tests (categorical grouping variables)
• Address multicollinearity before regression (drop redundant features)
• Outlier investigation (verify data entry, consider robust methods)
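As an illustration of the first recommended step, the following is a minimal sketch of a chi-square test of independence between two categorical columns (for example, Attrition against OverTime). It uses only the standard library and returns the test statistic and degrees of freedom; converting these to a p-value would need a chi-square distribution function from a statistics library. The function name and the contingency counts in the usage lines are hypothetical and are not taken from the profiled dataset.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table.

    `table` is a list of rows of observed counts, e.g. rows = levels of
    one categorical variable, columns = levels of another.
    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts, for illustration only.
stat, dof = chi_square_statistic([[300, 110], [50, 40]])
print(f"chi2={stat:.2f}, dof={dof}")
```

A large statistic relative to the degrees of freedom suggests the grouping variable and the outcome are not independent. Cells with very small expected counts make the approximation unreliable.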
Executive Summary
This section synthesizes the complete data quality and structural assessment of a 500-row, 31-column dataset. Understanding these foundational metrics is critical for determining whether the data is suitable for downstream analysis and what preprocessing steps are required before modeling or statistical testing.
This dataset demonstrates exceptional data quality with zero missing values and a 97.7 quality score, indicating minimal data entry errors or structural problems. The presence of 4 high-correlation pairs suggests redundancy in tenure-related
Type detection and per-column statistics
| column_name | detected_type | unique_values | missing_count | missing_pct | min_value | max_value | mean_value | median_value | std_dev | top_categories |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | numeric | 43 | 0 | 0.000 | 18.000 | 60.000 | 36.896 | 36.000 | 9.360 | NA |
| Attrition | categorical | 2 | 0 | 0.000 | NA | NA | NA | NA | NA | No, Yes |
| BusinessTravel | categorical | 3 | 0 | 0.000 | NA | NA | NA | NA | NA | Travel_Rarely, Travel_Frequently, Non-Travel |
| DailyRate | numeric | 415 | 0 | 0.000 | 103.000 | 1499.000 | 838.874 | 843.000 | 408.792 | NA |
| Department | categorical | 3 | 0 | 0.000 | NA | NA | NA | NA | NA | Research & Development, Sales, Human Resources |
| DistanceFromHome | numeric | 29 | 0 | 0.000 | 1.000 | 29.000 | 9.120 | 6.000 | 8.255 | NA |
| Education | categorical | 5 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 4, 2 |
| EducationField | categorical | 6 | 0 | 0.000 | NA | NA | NA | NA | NA | Life Sciences, Medical, Marketing |
| EnvironmentSatisfaction | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 4, 2 |
| Gender | categorical | 2 | 0 | 0.000 | NA | NA | NA | NA | NA | Male, Female |
| HourlyRate | numeric | 71 | 0 | 0.000 | 30.000 | 100.000 | 65.742 | 66.000 | 20.604 | NA |
| JobInvolvement | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 2, 4 |
| JobLevel | categorical | 5 | 0 | 0.000 | NA | NA | NA | NA | NA | 1, 2, 3 |
| JobRole | categorical | 9 | 0 | 0.000 | NA | NA | NA | NA | NA | Sales Executive, Research Scientist, Laboratory Technician |
| JobSatisfaction | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 4, 3, 2 |
| MaritalStatus | categorical | 3 | 0 | 0.000 | NA | NA | NA | NA | NA | Married, Single, Divorced |
| MonthlyIncome | numeric | 482 | 0 | 0.000 | 1102.000 | 19999.000 | 6598.644 | 4952.000 | 4814.582 | NA |
| MonthlyRate | numeric | 497 | 0 | 0.000 | 2094.000 | 26959.000 | 14157.400 | 14174.000 | 7002.615 | NA |
| NumCompaniesWorked | categorical | 10 | 0 | 0.000 | NA | NA | NA | NA | NA | 1, 0, 4 |
| OverTime | categorical | 2 | 0 | 0.000 | NA | NA | NA | NA | NA | No, Yes |
| PercentSalaryHike | numeric | 15 | 0 | 0.000 | 11.000 | 25.000 | 15.222 | 14.000 | 3.729 | NA |
| PerformanceRating | categorical | 2 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 4 |
| RelationshipSatisfaction | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 4, 2 |
| StockOptionLevel | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 0, 1, 2 |
| TotalWorkingYears | numeric | 38 | 0 | 0.000 | 0.000 | 40.000 | 11.464 | 10.000 | 7.777 | NA |
| TrainingTimesLastYear | categorical | 7 | 0 | 0.000 | NA | NA | NA | NA | NA | 2, 3, 5 |
| WorkLifeBalance | categorical | 4 | 0 | 0.000 | NA | NA | NA | NA | NA | 3, 2, 4 |
| YearsAtCompany | numeric | 33 | 0 | 0.000 | 0.000 | 40.000 | 7.038 | 5.000 | 6.458 | NA |
| YearsInCurrentRole | numeric | 19 | 0 | 0.000 | 0.000 | 18.000 | 4.238 | 3.000 | 3.720 | NA |
| YearsSinceLastPromotion | numeric | 15 | 0 | 0.000 | 0.000 | 15.000 | 2.194 | 1.000 | 3.275 | NA |
| YearsWithCurrManager | numeric | 16 | 0 | 0.000 | 0.000 | 17.000 | 4.184 | 3.000 | 3.576 | NA |
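The detected_type values in this table are consistent with the unique-value threshold mentioned in the Column Profiles narrative (numeric when a column has more than 10 unique values, categorical otherwise). The sketch below shows one way such a rule could be written. The function name, the `max_categories` parameter, and the requirement that every value must parse as a number are assumptions for illustration; the profiler's actual implementation is not shown in this report.

```python
def detect_type(values, max_categories=10):
    """Classify a column as 'numeric' or 'categorical'.

    Assumed rule: a column is numeric only if every distinct value parses
    as a number AND there are more than `max_categories` distinct values.
    """
    distinct = set(values)
    try:
        for value in distinct:
            float(value)  # raises for non-numeric strings and None
    except (TypeError, ValueError):
        return "categorical"
    return "numeric" if len(distinct) > max_categories else "categorical"


# 15 distinct integers, like PercentSalaryHike in the table above.
print(detect_type(list(range(11, 26))))  # numeric
# 10 distinct integers, like NumCompaniesWorked.
print(detect_type(list(range(10))))      # categorical
```

Under this rule, integer-coded columns with few levels, such as Education or JobLevel, are treated as categorical even though their values are numbers.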
Column Profiles
This section provides a structural overview of the dataset’s composition, identifying which columns are numeric versus categorical. Understanding column types is foundational for selecting appropriate statistical methods, interpreting distributions, and detecting data quality issues across the 500-row dataset.
The dataset demonstrates balanced composition between numeric and categorical variables, enabling comprehensive multivariate analysis. The absence of missing values eliminates imputation concerns and ensures all 500 observations are usable across all 31 dimensions. The high data quality score validates that the subsequent correlation analysis (4 high-correlation pairs identified) and outlier detection (6 columns with outliers) are based on reliable, complete information.
Detailed per-column statistics are listed in the per-column statistics table. The type detection logic applied here (numeric >10 unique values; categorical ≤10
Histograms for numeric, bar charts for categorical variables
Distribution Analysis
This section visualizes how values are distributed across the 31 variables in the dataset (12 numeric, 19 categorical). Distribution analysis reveals data shape, concentration patterns, and potential data quality issues—essential for understanding whether variables are suitable for modeling and whether transformations are needed.
The moderate right skew across numeric variables suggests natural business metrics (income, rates, tenure) follow typical organizational patterns where most values cluster at lower ranges with occasional high outliers. The uneven bin distribution and categorical sparsity indicate that some variables may have limited discriminative power for analysis. The multi
Visualize missingness co-occurrence across variables
Heatmap showing missingness co-occurrence
Missing data patterns and co-occurrence
Missing Data Patterns
This section evaluates data completeness across all 500 rows and 31 columns to identify missing value patterns that could compromise analysis validity. Understanding missingness is critical for determining whether data can be analyzed as-is or requires imputation, and for detecting systematic collection issues that might bias results.
The absence of missing data eliminates a major source of analytical bias and simplifies downstream processing. With zero missingness, there is no need to investigate collection failures, survey skip logic, or imputation strategies. This complete dataset enables direct statistical analysis without the complications of handling incomplete information, making correlation analysis, outlier detection, and distribution assessment more straightforward and reliable.
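For readers who want to reproduce a completeness check on their own data, here is a minimal sketch that computes the per-column missing percentage and the pairwise co-occurrence of missing values. The function name, the set of tokens treated as missing, and the sample rows are all assumptions for illustration; the profiled dataset itself has no missing values.

```python
from itertools import combinations

# Assumed missing-value markers; adjust to match the source file.
MISSING = (None, "", "NA")


def missing_summary(rows, columns):
    """Return (percent missing per column, co-missing row counts per pair).

    `rows` is a list of dicts keyed by column name.
    """
    n = len(rows)
    flags = {c: [row.get(c) in MISSING for row in rows] for c in columns}
    pct = {c: 100.0 * sum(f) / n for c, f in flags.items()}
    # Number of rows in which both columns of a pair are missing.
    co_missing = {
        (a, b): sum(x and y for x, y in zip(flags[a], flags[b]))
        for a, b in combinations(columns, 2)
    }
    return pct, co_missing


# Hypothetical rows, for illustration only.
rows = [
    {"a": 1, "b": None},
    {"a": None, "b": None},
    {"a": 3, "b": 4},
    {"a": 5, "b": "x"},
]
pct, co_missing = missing_summary(rows, ["a", "b"])
print(pct, co_missing)
```

High co-missing counts between two columns would point to a shared cause, such as a skipped survey section, rather than independent gaps.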
This perfect completeness is relatively rare in real-world datasets and suggests either careful data collection practices or pre-processing that removed incomplete records. The finding directly supports the high data quality score of 97.7%
Identify multicollinearity in numeric variables
Numeric variable correlations
Correlation matrix for numeric variables with hierarchical clustering
Correlation Matrix
This section identifies multicollinearity in the dataset by detecting strong linear relationships among numeric variables. Understanding these correlations is critical for predictive modeling, as highly correlated variables can inflate standard errors, reduce model interpretability, and create redundancy in feature sets. This analysis directly supports the overall data quality assessment (97.7% score) by flagging potential structural issues.
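The screening step described here can be sketched as follows: compute the Pearson correlation for every pair of numeric columns and keep the pairs whose absolute value exceeds the 0.70 threshold used in this report. The function names and sample data are hypothetical, and the sketch assumes no column is constant, since a zero variance would cause a division by zero.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def high_correlation_pairs(columns, threshold=0.70):
    """Return (name_a, name_b, r) for every pair with |r| > threshold.

    `columns` maps column names to lists of numeric values.
    """
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson_r(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical data, for illustration only.
data = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}
print(high_correlation_pairs(data))
```

Pearson correlation only captures linear association, so pairs below the threshold can still be related in a non-linear way.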
The dataset exhibits localized multicollinearity concentrated in tenure-related variables rather than systemic redundancy. The four high-correlation pairs suggest that years with current manager, years at company, and years in current role measure related but distinct aspects of employee tenure. This pattern is typical in HR datasets and
IQR-based outlier identification for numeric columns
Box plots for numeric columns
IQR-based outlier detection for numeric variables
Outlier Detection
This section identifies and classifies anomalous values in numeric variables using the Interquartile Range (IQR) method. Outlier detection is critical for understanding data distribution shape, identifying potential data quality issues, and determining whether statistical transformations or specialized modeling approaches are needed for accurate analysis.
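A minimal sketch of IQR-based detection is shown below. It assumes the conventional Tukey fences, with values more than 1.5 × IQR beyond the quartiles counted as mild outliers and values more than 3 × IQR beyond them counted as extreme. The report does not state which multipliers or quartile method the profiler uses, so the defaults here, the function name, and the sample values are assumptions.

```python
from statistics import quantiles


def iqr_outliers(values, mild=1.5, extreme=3.0):
    """Split outliers into 'mild' and 'extreme' using IQR fences."""
    # Quartile definitions differ between tools; 'inclusive' is one choice.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    result = {"mild": [], "extreme": []}
    for value in values:
        # Distance outside the [q1, q3] box; negative for values inside it.
        distance = max(q1 - value, value - q3)
        if distance > extreme * iqr:
            result["extreme"].append(value)
        elif distance > mild * iqr:
            result["mild"].append(value)
    return result


# Hypothetical values, for illustration only.
print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100]))
# {'mild': [20], 'extreme': [100]}
```

In the example, Q1 is 3.5 and Q3 is 8.5, so the upper fences sit at 16 and 23.5. Because the method depends only on quartiles, it stays stable under the right skew noted for several numeric columns, although skewed data will naturally place more points beyond the upper fence.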
The dataset exhibits minimal but meaningful outlier presence (2.3% total), concentrated in 6 numeric columns. The 17 extreme outliers warrant verification but represent <0.6% of observations, suggesting they are genuine data points rather than systematic errors. The right-skewed distribution pattern (skewness=1.12) indicates
Frequency distributions and cardinality analysis
Frequency distributions for categorical variables
Frequency analysis and cardinality warnings for categorical variables
Categorical Breakdown
This section evaluates the distribution and quality of categorical variables across the dataset. It identifies class imbalance patterns and cardinality issues that could affect model performance and interpretability. Understanding categorical structure is essential for feature engineering and ensuring meaningful variation in predictive analysis.
The absence of high-cardinality variables simplifies analysis and reduces preprocessing burden. However, pronounced class imbalance in key variables like Attrition limits discriminative power—models may struggle to learn minority patterns. The concentration of observations in dominant categories (e.g., 71% rarely travel) reflects real-world distributions but reduces variation available for distinguishing between groups.
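The checks discussed above (cardinality and dominance of a single category) can be sketched with a simple frequency count. The function name and both thresholds are hypothetical, since the report does not state the profiler's cut-offs. In the usage example, only the 71% share for Travel_Rarely comes from this report; the split of the remaining 29% is made up for illustration.

```python
from collections import Counter


def categorical_profile(values, high_cardinality=50, dominant_share=0.70):
    """Summarize a categorical column: cardinality and top-category share.

    Both thresholds are illustrative assumptions, not the profiler's own.
    """
    counts = Counter(values)
    top_category, top_count = counts.most_common(1)[0]
    top_share = top_count / len(values)
    return {
        "cardinality": len(counts),
        "top_category": top_category,
        "top_share": top_share,
        "high_cardinality": len(counts) > high_cardinality,
        "imbalanced": top_share >= dominant_share,
    }


# 71% Travel_Rarely as reported; the other two counts are hypothetical.
travel = ["Travel_Rarely"] * 71 + ["Travel_Frequently"] * 19 + ["Non-Travel"] * 10
print(categorical_profile(travel))
```

A column flagged as imbalanced is still usable, but stratified sampling or class weighting may be worth considering when it serves as a target, as Attrition would.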
Overall quality score based on missingness and outliers
Data Quality Score
This section evaluates the overall integrity and completeness of your dataset before analysis. A high data quality score indicates minimal data issues, enabling reliable statistical analysis and meaningful insights. This foundational assessment determines whether the dataset is suitable for downstream analytical work without extensive preprocessing.
The dataset demonstrates exceptional quality with zero missing values and negligible outlier contamination. This pristine condition suggests the data has been well-curated or preprocessed before analysis. The 97.7 score reflects only the natural presence of statistical outliers (2.3%), which is expected in real-world distributions and does not warrant removal. This quality level supports confident analysis across all 12 numeric and 19 categorical variables without data imputation or extensive cleaning.
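The reported numbers are consistent with a simple subtractive score, in which the missing-data percentage and the outlier percentage are deducted from 100. The sketch below encodes that reading. It is an inference from the figures in this report (0.00% missing, 2.3% outliers, score 97.7), not a documented formula, and the function name is hypothetical.

```python
def quality_score(missing_pct, outlier_pct):
    """Assumed scoring rule: 100 minus missing % minus outlier %, floored at 0."""
    return max(0.0, 100.0 - missing_pct - outlier_pct)


# Figures from this report: 0.00% missing, 2.3% outliers.
print(round(quality_score(0.0, 2.3), 1))  # 97.7
```

If the profiler weights the two components differently, the same inputs would produce a different score, so this formula is best treated as a plausible reconstruction only.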
The scoring methodology prioritizes completeness and outlier detection. The absence of missing