Netflix claims it uses data to decide everything from which shows to greenlight to how thumbnails are displayed. But what does the actual catalog distribution tell us? When we ran exploratory data analysis on 8,807 Netflix titles from the canonical Kaggle dataset, the patterns were stark: movies dominate the title count at 69.6%, the US produces 37% of all content (more than the next 10 countries combined), and TV-MA rated content is king. Here's what the data actually shows—and what it reveals about Netflix's content strategy.
This isn't predictive modeling or hypothesis testing. This is pure descriptive analytics: univariate distributions, categorical breakdowns, temporal trends, and missing-value profiling. Before you build a recommendation engine or content acquisition model, you need to know what you're working with. That's what exploratory data analysis delivers.
Why Exploratory Data Analysis Comes Before Experiments
Here's the truth about data analysis: you can't run a proper experiment if you don't understand your baseline distribution. Every A/B test, every causal inference model, every predictive algorithm starts with one question: what does the data actually look like?
Exploratory data analysis (EDA) is the systematic profiling of a dataset before modeling. It covers:
- Univariate distributions — What's the shape of each variable? Are values concentrated or dispersed? Where are the outliers?
- Categorical breakdowns — What are the dominant categories? How skewed is the distribution?
- Temporal trends — How has the data changed over time? Are there growth periods, peaks, or declines?
- Missing-value profiling — Which fields have gaps? Is missingness random or systematic?
- Bivariate relationships — How do variables correlate? Are there unexpected associations?
For the Netflix content catalog, EDA answers foundational questions: Is this a movie platform or a TV platform? Which countries supply the content? What's the rating distribution? When were these titles produced? Without these answers, any downstream analysis is flying blind.
Data source: the shivamb/netflix-shows dataset from Kaggle (9,814 votes, CC0 license), containing 8,807 titles with 12 attributes: show_id, type, title, director, cast, country, date_added, release_year, rating, duration, listed_in, description. The dataset is a snapshot, not a live feed, and reflects Netflix's catalog as of the snapshot date. Some fields (director, cast, country) have substantial missing values.
The Netflix Catalog in Numbers: Dataset Overview
Before we dive into distributions, let's establish the scope. The Netflix content catalog dataset contains 8,807 records across two content types: movies and TV shows. Each record includes metadata spanning production details (country, director, cast), temporal markers (release year, date added to Netflix), content classification (rating, genre), and duration.
The dataset is structured but messy—exactly like real-world data. Director and cast fields have missing values in 30% and 9% of records respectively. Country information is missing in 7% of titles. Genre categories (the listed_in field) can include multiple comma-separated values, requiring careful parsing. This is normal. Clean, perfectly populated datasets don't exist outside of textbooks.
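Missing-value percentages like the ones quoted above come from counting empty cells per field. The sketch below shows one way to do that with only the Python standard library. The sample rows are invented for illustration (they are not real catalog entries), and `missing_share` is a hypothetical helper name.

```python
import csv
import io

# Invented sample rows in the dataset's column layout (not real catalog entries).
SAMPLE_CSV = """show_id,type,title,director,cast,country
s1,Movie,Example A,Jane Doe,"Actor One, Actor Two",United States
s2,TV Show,Example B,,Actor Three,
s3,Movie,Example C,John Roe,,India
s4,TV Show,Example D,,"Actor Four, Actor Five",Japan
"""

def missing_share(rows, fields):
    """Return {field: fraction of rows where the field is empty or whitespace}."""
    total = len(rows)
    return {
        field: sum(1 for row in rows if not (row.get(field) or "").strip()) / total
        for field in fields
    }

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = missing_share(rows, ["director", "cast", "country"])
for field, share in profile.items():
    print(f"{field}: {share:.0%} missing")
```

To profile the real file, pass `csv.DictReader` an open handle to the downloaded CSV instead of the in-memory sample.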
What makes this dataset valuable for exploratory analysis is its categorical richness. With discrete variables like content type, rating, and country, we can produce clear frequency distributions. With continuous variables like release year, we can identify temporal patterns. And with multi-value fields like genre, we can examine category overlap. That's the full scope of descriptive analytics.
Movie vs TV Show Split
The first distribution to examine is content type. Netflix's catalog contains 6,131 movies (69.6%) and 2,676 TV shows (30.4%). If you judged Netflix purely by title count, you'd call it a movie platform. Movies outnumber TV shows by more than 2:1.
But title count doesn't equal viewing time. A single TV show can deliver 50+ hours of content across multiple seasons, while the average movie runs 90-120 minutes. So while movies dominate the catalog numerically, TV shows likely dominate total viewing hours and engagement metrics. This is a critical distinction for content strategy: acquisition teams may prioritize movie quantity for catalog breadth, while retention and engagement depend heavily on serialized TV content.
The split also reflects Netflix's evolution. Early Netflix was a DVD-by-mail service optimized for movies. As streaming took over, TV shows became the engagement driver—binge-watching is fundamentally a TV show behavior. The 70/30 split suggests Netflix still acquires movies aggressively (they're cheaper and faster to license than multi-season shows), but the strategic focus has shifted toward original and licensed TV series.
From an EDA perspective, this distribution sets the baseline for all downstream analysis. If you're building a recommendation model, you need to account for the class imbalance. If you're analyzing genre preferences, you need to stratify by content type first. The 70/30 split isn't just a fun fact—it's a structural feature of the dataset that affects every subsequent calculation.
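The percentages and the "more than 2:1" ratio in this section follow directly from the two title counts. A short worked calculation, using only the counts quoted above, makes the arithmetic explicit:

```python
# Title counts quoted in the text above.
counts = {"Movie": 6131, "TV Show": 2676}

total = sum(counts.values())
shares = {kind: n / total for kind, n in counts.items()}
ratio = counts["Movie"] / counts["TV Show"]

for kind, share in shares.items():
    print(f"{kind}: {share:.1%}")
print(f"movies per TV show: {ratio:.2f}")
```

The same pattern (count, divide by the total, compare) is the baseline you would recompute within each stratum before comparing genres or ratings across content types.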
Release Year Distribution
Netflix's catalog shows extreme recency bias. The distribution of release years is heavily left-skewed: the vast majority of titles were produced after 2010, with a long, thin tail stretching back toward older decades. The peak occurs around 2017-2019, corresponding to Netflix's global expansion phase when the company was simultaneously acquiring international content and ramping up original production.
Pre-2000 titles are sparse. Classic films and older TV shows represent a small fraction of the catalog. This isn't an accident—it reflects licensing economics and viewer preferences. Older content is often locked into long-term distribution deals with cable networks or other streaming platforms. When licenses are available, they're expensive relative to viewership potential. Most Netflix users prefer recent releases, so the company allocates acquisition budgets accordingly.
The 2017-2019 peak is particularly notable. This period coincides with Netflix's transition from licensed content aggregator to original content producer. Titles from this era include both aggressive third-party acquisitions (international films, indie productions) and the first wave of Netflix Originals at scale. The subsequent decline after 2019 likely reflects the dataset's snapshot date rather than an actual production slowdown—newer titles simply hadn't been added yet when the data was collected.
For analysts, the recency bias has implications for temporal modeling. If you're building a time-series forecast or analyzing trends in content attributes (runtime, genre popularity, rating distribution), you need to account for the fact that 80%+ of the catalog was produced in the last 15 years. Historical patterns before 2000 won't have sufficient sample size for robust inference. This is why EDA matters—it tells you where your statistical power actually lives.
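One quick way to see where the sample size sits is to bin release years by decade and compute the share of titles released since a cutoff year. This is a minimal sketch: the list of years is invented for illustration and is not drawn from the dataset, and `by_decade` and `share_since` are hypothetical helper names.

```python
from collections import Counter

# Invented release years for illustration only; not drawn from the dataset.
release_years = [1975, 1994, 2003, 2011, 2014, 2016, 2017, 2018, 2018, 2019, 2019, 2020]

def by_decade(years):
    """Count titles per decade, e.g. 2017 -> 2010."""
    return Counter((year // 10) * 10 for year in years)

def share_since(years, cutoff):
    """Fraction of titles released in or after the cutoff year."""
    return sum(1 for year in years if year >= cutoff) / len(years)

decades = by_decade(release_years)
for decade in sorted(decades):
    print(f"{decade}s: {decades[decade]}")
print(f"released 2010 or later: {share_since(release_years, 2010):.0%}")
```

Decades with only a handful of titles are the ones where any per-decade average or trend estimate should be treated with caution.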
Content Rating Distribution
TV-MA dominates the rating distribution, accounting for the largest share of Netflix's catalog. This is adult-oriented content intended for mature audiences—think violent crime dramas, explicit comedies, and mature-themed documentaries. TV-14 and TV-PG follow, but the gap between TV-MA and everything else is substantial.
What does this tell us about Netflix's content strategy? The platform skews older. While family-friendly content exists (TV-Y, TV-Y7, G, PG), it's a minority of the catalog. Netflix isn't trying to be Disney+. The company's core demographic is adults 18-49, and the content acquisition reflects that. TV-MA content tends to be cheaper to license (fewer restrictions on distribution windows) and generates strong engagement among younger adults who are the platform's heaviest users.
The rating distribution also has regional implications. TV-MA is a US rating system designation. International markets use different classification schemes (UK: 15, 18; Australia: MA15+; etc.). The prevalence of TV-MA ratings suggests this dataset is US-centric in its categorization, even though the catalog itself is global. If you're analyzing international content strategy, you'd need to normalize ratings across regional systems—another layer of data cleaning that EDA helps identify.
From a product perspective, the rating distribution explains Netflix's parental control features. With TV-MA content dominating, the platform needs robust tools for households with children to filter out mature content. The rating distribution isn't just descriptive—it drives product requirements and UX design.
Top Countries by Title Count
The United States dominates Netflix's content catalog with 37% of all titles—more than 3,200 productions. That's more than the next 10 countries combined. India ranks second, followed by the United Kingdom, Japan, South Korea, Canada, and France. But even India, with its massive film industry, accounts for less than half the US total.
This distribution reflects both production capacity and Netflix's historical focus. The company is headquartered in California, started as a US service, and spent its first decade acquiring primarily English-language content. US production companies have decades of experience creating content optimized for international distribution—big budgets, recognizable stars, universally accessible storytelling. That's why US content travels well globally.
But the presence of India, South Korea, and Japan in the top 5 signals Netflix's international pivot. These countries have robust domestic production industries and growing middle classes with streaming budgets. South Korean dramas and Japanese anime have proven global appeal beyond their home markets. India's film industry, of which Bollywood is the best-known part, produces more films annually than Hollywood. Netflix's strategy is to license or commission this content at relatively low cost for local markets, then test it in other regions. When a Korean drama like Squid Game breaks out globally, Netflix wins twice: low cost relative to comparable US productions, and massive international viewership.
The country distribution also highlights a data quality issue: the country field has missing values in 7% of records. For titles with multiple producing countries (common in international co-productions), the dataset may list only the primary country. This means the distribution undercounts international content and overstates US dominance slightly. Still, the 37% US figure is directionally correct—American content is the backbone of Netflix's global catalog.
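How co-productions are counted changes the country ranking. The sketch below assumes that a country value may hold several comma-separated names, and compares two tallies: crediting only the first-listed country versus crediting every listed country. The sample values are invented, and `split_countries` is a hypothetical helper.

```python
from collections import Counter

# Invented values in the style of the country field; blank means missing.
country_field = [
    "United States",
    "United States, Canada",
    "India",
    "United Kingdom, United States",
    "",
    "South Korea",
]

def split_countries(value):
    """Split a comma-separated country string into clean names."""
    return [name.strip() for name in value.split(",") if name.strip()]

first_listed = Counter()
any_credit = Counter()
for value in country_field:
    names = split_countries(value)
    if not names:
        continue  # missing country: counted in neither tally
    first_listed[names[0]] += 1
    any_credit.update(set(names))

print(first_listed.most_common())
print(any_credit.most_common())
```

Whichever convention you choose, state it next to the percentages: shares under the "any credit" tally do not sum to 100% because a co-production counts once for each country.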
Run This Analysis on Your Own Catalog Data
Netflix's catalog has a clear structure: 70% movies, 37% US content, heavy recency bias, and a skew toward mature ratings. Does your content catalog follow the same patterns? Upload your dataset to MCP Analytics' EDA tool and get automated profiling in 60 seconds—no code required.
What This Analysis Doesn't Tell You (And What Comes Next)
Exploratory data analysis describes the dataset. It doesn't explain why the patterns exist, and it doesn't predict what will happen next. That requires different methods.
For example, we know TV-MA content dominates the catalog. But does TV-MA content generate more viewing hours than TV-14 content? That's a different question requiring viewership data (which this dataset doesn't include). We know the US produces 37% of titles, but does US content get recommended more often than international content? That requires algorithmic audit and experimentation.
Here's where you'd go next after EDA:
- Correlation analysis — Do movies from certain countries have different average ratings? Is there a relationship between release year and content rating?
- Cohort analysis — How do titles added in 2015 differ from titles added in 2020 in terms of genre, rating, and country mix?
- Genre clustering — The listed_in field contains comma-separated genres. Use text analysis or clustering to identify genre combinations that frequently co-occur.
- Survival analysis — How long do titles stay on Netflix before being removed? (Requires panel data tracking additions and removals over time.)
- Predictive modeling — Can you predict viewership, engagement, or retention based on title attributes? (Requires performance metrics not in this dataset.)
But all of those methods assume you've already done the foundational work: understanding what's in the dataset, how it's distributed, where the gaps are, and what the baseline patterns look like. That's what this EDA delivers.
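As a starting point for the genre step listed above, co-occurrence can be counted before reaching for any clustering method. This is a minimal sketch rather than a clustering implementation: it tallies how often each pair of genres appears on the same title. The listed_in values are invented for illustration, and `genre_pairs` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Invented listed_in values for illustration only.
listed_in = [
    "Dramas, International Movies, Thrillers",
    "Dramas, International Movies",
    "Comedies, International Movies",
    "Dramas, Thrillers",
]

def genre_pairs(value):
    """Return the sorted, de-duplicated genre pairs for one title."""
    genres = sorted({g.strip() for g in value.split(",") if g.strip()})
    return combinations(genres, 2)

pair_counts = Counter()
for value in listed_in:
    pair_counts.update(genre_pairs(value))

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Sorting the genres first means each pair is stored in one canonical order, so ("Dramas", "Thrillers") and ("Thrillers", "Dramas") are never counted separately.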
How to Interpret Your Own Catalog Results
If you're running this analysis on your own content catalog—whether it's a streaming service, media library, e-commerce inventory, or product database—here's how to interpret the results:
1. Check for concentration risk. Netflix has 37% US content. If one country or supplier accounts for more than 50% of your catalog, you have concentration risk. A single licensing dispute, regulatory change, or supplier issue could decimate your offering. Diversification matters.
2. Look for recency bias. Netflix's catalog peaks at 2017-2019. If your catalog shows extreme recency bias (90%+ of items added in the last 2 years), you might lack depth for users seeking historical content. If your catalog skews old (majority added 5+ years ago), you might struggle to signal freshness and relevance.
3. Assess category balance. Netflix is 70% movies, 30% TV shows. Does your catalog have similar imbalances? A 90/10 split might indicate you're neglecting a category that could drive differentiation. A 50/50 split might indicate strategic indecision—trying to serve two audiences equally often means serving neither well.
4. Profile your ratings distribution. Netflix skews TV-MA. If you're building a family-friendly platform, your ratings distribution should skew TV-Y, TV-G, and PG. If your distribution doesn't match your target audience, you have a content strategy problem.
5. Identify missing value patterns. Netflix has 7% missing country data and 30% missing director data. Are the missing values random or systematic? If premium content has complete metadata but budget content doesn't, that's a signal about data governance priorities. Missing data isn't just a nuisance—it's diagnostic.
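The concentration check in step 1 reduces to finding the largest category's share and comparing it with a threshold. The sketch below uses the 50% threshold from the text as a default. The supplier labels are invented, and `top_share` and `concentration_flag` are hypothetical helper names.

```python
from collections import Counter

def top_share(values):
    """Return (most common value, its share of all non-empty values)."""
    counts = Counter(v for v in values if v)
    label, count = counts.most_common(1)[0]
    return label, count / sum(counts.values())

def concentration_flag(values, threshold=0.5):
    """True when the largest category exceeds the threshold share."""
    _, share = top_share(values)
    return share > threshold

# Invented supplier labels for illustration only.
suppliers = ["A", "A", "A", "B", "C", "A", "B", "A"]
label, share = top_share(suppliers)
print(label, f"{share:.0%}", concentration_flag(suppliers))
```

Empty values are excluded from the denominator here, which is a design choice: if missing suppliers are common, report that missing rate alongside the share so the two numbers are read together.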
Common Pitfalls When Running Content Catalog EDA
Here's where people go wrong when profiling catalog data:
Mistake 1: Confusing title count with strategic importance. Movies outnumber TV shows 2:1, but that doesn't mean movies are twice as important. Total viewing hours, engagement, retention, and revenue per title all matter more than raw counts. Don't optimize for count—optimize for impact.
Mistake 2: Ignoring multi-value fields. The listed_in genre field contains comma-separated values like "Dramas, International Movies, Thrillers." If you treat this as a single categorical variable, you miss the genre combinations. Parse it, create binary flags for each genre, and analyze overlap. That's where the insight lives.
Mistake 3: Treating temporal snapshots as time-series. This dataset is a snapshot of Netflix's catalog at one point in time. You can analyze release year distribution (when content was produced), but you can't analyze growth trends without panel data tracking the catalog over time. Don't confuse cross-sectional data with longitudinal data.
Mistake 4: Skipping missing-value profiling. 7% missing country data might seem trivial until you realize international co-productions are systematically underreported. Always profile missingness by category. If high-budget titles have complete data and low-budget titles don't, that's not random—it's a data collection bias.
Mistake 5: Over-interpreting distributions without context. TV-MA dominates the ratings distribution, but so what? Without knowing Netflix's subscriber demographics, content performance metrics, or competitive positioning, you can't say whether TV-MA dominance is good or bad. Descriptive statistics describe. They don't prescribe.
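Mistake 4 recommends profiling missingness by category. A minimal sketch of that check computes the missing rate of one field separately for each group. The records are invented for illustration, and `missing_rate_by_group` is a hypothetical helper.

```python
from collections import defaultdict

# Invented records for illustration; None marks a missing country.
records = [
    {"type": "Movie", "country": "United States"},
    {"type": "Movie", "country": None},
    {"type": "Movie", "country": "India"},
    {"type": "Movie", "country": "France"},
    {"type": "TV Show", "country": None},
    {"type": "TV Show", "country": None},
    {"type": "TV Show", "country": "Japan"},
    {"type": "TV Show", "country": "South Korea"},
]

def missing_rate_by_group(rows, group_key, field):
    """Return {group: share of rows in that group where field is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if not row.get(field):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

rates = missing_rate_by_group(records, "type", "country")
print(rates)
```

A large gap between groups is a hint, not proof, that missingness is systematic. With real data you would follow up with a test of association on the group-by-missing table.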
Tools and Methods for Automated Exploratory Data Analysis
You don't need to manually code every histogram and frequency table. Automated EDA tools generate profiling reports in seconds. Here's what they cover:
- Summary statistics — Mean, median, standard deviation, min/max for continuous variables; frequency counts and mode for categorical variables.
- Univariate distributions — Histograms for continuous variables, bar charts for categorical variables, with automatic outlier detection.
- Missing value analysis — Percentage missing by field, missing value heatmaps showing which records have incomplete data.
- Correlation matrices — Heatmaps showing pairwise correlations between numeric variables (Pearson or Spearman).
- Categorical associations — Chi-square tests or Cramér's V for relationships between categorical variables.
- Temporal plots — Time-series charts for any date or year fields, showing trends and seasonality.
For the Netflix dataset, an automated EDA tool would flag the following immediately: 70/30 movie/TV split, 37% US content concentration, missing director data in 30% of records, recency bias with peak at 2017-2019, and TV-MA rating dominance. You'd have that within 60 seconds of uploading the CSV. That's the baseline. From there, you ask follow-up questions and dig into specific subsets.
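Of the checks listed above, the categorical association measure is the least self-explanatory. Cramér's V rescales the chi-square statistic to a 0-to-1 range: V = sqrt(chi2 / (n * (k - 1))), where n is the total count and k is the smaller of the number of rows and columns. The sketch below computes it from a contingency table with the standard library only. The counts are invented, no bias correction is applied, and the function assumes every row and column total is non-zero.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals))
    return math.sqrt(chi2 / (n * (k - 1)))

# Invented counts: rows are content types, columns are rating tiers.
independent = [[20, 40], [10, 20]]  # same column proportions in both rows
dependent = [[30, 0], [0, 30]]      # each row maps to exactly one column

print(cramers_v(independent))
print(cramers_v(dependent))
```

The two toy tables mark the extremes: identical proportions across rows give a value of 0, and a perfect one-to-one mapping gives a value of 1. Real catalog tables will fall somewhere in between.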
Frequently Asked Questions
What is the Netflix content catalog dataset?
The Netflix content catalog dataset contains 8,807 titles (movies and TV shows) with attributes including type, release year, date added to Netflix, rating, duration, country of production, and genre categories. It's one of the most comprehensive public datasets for streaming platform content analysis, sourced from Kaggle with over 9,800 community votes. The dataset is a snapshot, not a live feed, and reflects Netflix's catalog as of the snapshot date.
What percentage of Netflix titles are movies vs TV shows?
Movies represent 69.6% of the Netflix catalog (6,131 titles) while TV shows account for 30.4% (2,676 titles). However, TV shows typically offer more total viewing hours due to multiple seasons and episodes. The 70/30 split reflects acquisition economics—movies are cheaper and faster to license in volume—but engagement and retention metrics likely favor TV shows due to their serialized, binge-friendly format.
Which countries produce the most Netflix content?
The United States dominates with 37% of all Netflix titles (over 3,200 productions). India ranks second, followed by the United Kingdom, Japan, South Korea, Canada, and France. The gap between US production and all other countries is substantial, reflecting both Netflix's California headquarters and the historical focus on English-language content. However, the presence of India, South Korea, and Japan in the top 5 signals Netflix's international expansion strategy.
What is the most common content rating on Netflix?
TV-MA (Mature Audiences) is the most prevalent rating on Netflix, indicating the platform skews toward adult-oriented content. TV-14 and TV-PG round out the top three ratings, showing a distribution that favors viewers aged 14 and above. This reflects Netflix's core demographic (adults 18-49) and the economics of content acquisition—TV-MA content tends to be cheaper to license and generates strong engagement among younger adults.
When were most Netflix titles released?
Netflix's catalog shows strong recency bias, with the majority of titles released after 2010. Production peaks around 2017-2019, corresponding to Netflix's aggressive content acquisition and original production phase during its global expansion. Pre-2000 titles are sparse, reflecting licensing economics (older content is locked into long-term deals) and viewer preferences (most users prefer recent releases).
Can I run this analysis on my own content catalog?
Yes. MCP Analytics provides an automated EDA tool that accepts any catalog dataset in CSV format. Upload your file, map the relevant columns (title, type, release date, category, etc.), and get a full profiling report in 60 seconds. The tool generates univariate distributions, categorical breakdowns, temporal trends, and missing-value analysis—no coding required.