Mastering the Pandas describe() Method: A Comprehensive Guide to Descriptive Statistics
Pandas is a cornerstone of data analysis in Python, offering powerful tools to explore and summarize structured data. Among its essential methods is describe(), which generates descriptive statistics for a DataFrame or Series, providing a quick overview of key numerical and categorical metrics. This method is invaluable for understanding data distributions, identifying patterns, and detecting anomalies during exploratory data analysis (EDA). This guide dives deep into the Pandas describe() method, exploring its functionality, parameters, and practical applications, with detailed explanations and examples for both beginners and experienced users.
What is the Pandas describe() Method?
The describe() method in Pandas generates a summary of descriptive statistics for a DataFrame or Series, offering insights into the central tendency, dispersion, and shape of the data’s distribution. For numerical columns, it computes metrics like mean, standard deviation, and percentiles, while for categorical or object columns, it provides counts of unique values and the most frequent value. This method is a key tool for EDA, enabling users to quickly assess data characteristics without writing complex code.
The describe() method is part of Pandas’ data inspection toolkit, complementing methods like info() for metadata, head() for viewing rows, and value_counts() for frequency analysis. It’s widely used after loading data to understand its properties or after transformations to verify changes. For a broader overview of data viewing in Pandas, see viewing-data.
Why Use describe()?
The describe() method offers several benefits:
- Quick Statistical Summary: Provides a snapshot of key metrics (e.g., mean, min, max) in one command.
- Data Quality Insights: Helps identify outliers, missing values, or skewed distributions.
- Categorical Analysis: Summarizes non-numerical data, such as unique values and top categories.
- Workflow Efficiency: Simplifies EDA by automating common statistical calculations.
- Versatility: Customizable to focus on specific data types or percentiles.
By incorporating describe() into your analysis, you can make informed decisions about data cleaning, preprocessing, and modeling.
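As a quick illustration of that one-command snapshot, here is a minimal sketch using made-up data (the column names are illustrative):

```python
import pandas as pd

# A tiny, made-up dataset for illustration
df = pd.DataFrame({"score": [10, 20, 30, 40], "label": ["a", "b", "a", "c"]})

# One call summarizes every numeric column
summary = df.describe()
print(summary)

# The familiar eight rows: count, mean, std, min, 25%, 50%, 75%, max
print(list(summary.index))
```

The string column label is skipped by default; the sections below show how to include it.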
Understanding the describe() Method
The describe() method is available for both DataFrames and Series with the following syntax:
DataFrame.describe(percentiles=None, include=None, exclude=None)
Series.describe(percentiles=None, include=None, exclude=None)
- percentiles: List of numbers between 0 and 1 to include (default is [0.25, 0.5, 0.75] for quartiles; the median is always included even if 0.5 is omitted).
- include: Data types to include (e.g., 'all', ['int64', 'float64'], or numpy.number).
- exclude: Data types to exclude (e.g., ['object']).
Note: older releases also accepted a datetime_is_numeric parameter; it was removed in pandas 2.0, where datetime columns are always treated as numeric.
- Returns: A DataFrame (for DataFrames) or Series (for Series) containing summary statistics.
The method is non-destructive, accessing data without modifying it, and is optimized for quick computation, making it suitable for datasets of varying sizes.
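The parameters can be combined in a single call. A hedged sketch (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [9.5, 12.0, 11.25, 40.0],
    "units": [3, 7, 5, 1],
    "region": ["east", "west", "east", "east"],
})

# Tail percentiles, restricted to numeric columns via include
summary = df.describe(percentiles=[0.05, 0.95], include=[np.number])
print(summary)
```

The string column region is left out because include=[np.number] restricts the summary to numeric dtypes.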
Default Behavior
- Numerical Columns: Computes count, mean, standard deviation, min, max, and quartiles (25%, 50%, 75%).
- Categorical/Object Columns: Computes count, unique values, top value, and frequency of the top value. In a mixed DataFrame this requires include='object' or include='all'; a DataFrame containing only object columns is summarized this way by default.
- Exclusions: By default, only numerical columns are summarized unless include is specified.
For related methods, see insights-info-method and value-counts.
Using the describe() Method
Let’s explore how to use describe() with practical examples, covering DataFrames, Series, and common scenarios.
describe() with DataFrames
For DataFrames, describe() generates statistics for numerical columns by default.
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45],
'Salary': [50000, 60000, 70000, np.nan, 80000],
'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney']
})
print(df.describe())
Output:
Age Salary
count 5.000000 4.000000
mean 35.000000 65000.000000
std 7.905694 12909.944487
min 25.000000 50000.000000
25% 30.000000 57500.000000
50% 35.000000 65000.000000
75% 40.000000 72500.000000
max 45.000000 80000.000000
This output shows:
- count: Number of non-null values (5 for Age, 4 for Salary due to NaN).
- mean: Average value (35 for Age, 65000 for Salary).
- std: Standard deviation, measuring dispersion.
- min/max: Minimum and maximum values.
- 25%/50%/75%: Quartiles (50% is the median).
The Name and City columns are excluded because they are object type. For data creation, see creating-data.
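The summary values can be checked against the individual statistics. A small sketch reproducing the DataFrame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, np.nan, 80000],
})

desc = df.describe()

# The 50% row is the median, std uses the sample (ddof=1) formula,
# and count only tallies non-null values
print(desc.loc["50%", "Age"] == df["Age"].median())         # True
print(np.isclose(desc.loc["std", "Age"], df["Age"].std()))  # True
print(int(desc.loc["count", "Salary"]))                     # 4, the NaN is skipped
```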
Including Non-Numerical Columns
To include categorical or object columns, use include='object' or include='all':
print(df.describe(include='object'))
Output:
Name City
count 5 5
unique 5 5
top Alice New York
freq 1 1
This shows:
- count: Number of non-null values.
- unique: Number of unique values.
- top: Most frequent value.
- freq: Frequency of the top value.
Use include='all' to summarize all columns:
print(df.describe(include='all'))
Output:
Name Age Salary City
count 5 5.000000 4.000000 5
unique 5 NaN NaN 5
top Alice NaN NaN New York
freq 1 NaN NaN 1
mean NaN 35.000000 65000.000000 NaN
std NaN 7.905694 12909.944487 NaN
min NaN 25.000000 50000.000000 NaN
25% NaN 30.000000 57500.000000 NaN
50% NaN 35.000000 65000.000000 NaN
75% NaN 40.000000 72500.000000 NaN
max NaN 45.000000 80000.000000 NaN
Non-applicable metrics (e.g., mean for Name) are shown as NaN.
Customizing Percentiles
Adjust the percentiles included in the output:
print(df.describe(percentiles=[0.1, 0.5, 0.9]))
Output:
Age Salary
count 5.000000 4.000000
mean 35.000000 65000.000000
std 7.905694 12909.944487
min 25.000000 50000.000000
10% 27.000000 53000.000000
50% 35.000000 65000.000000
90% 43.000000 77000.000000
max 45.000000 80000.000000
This includes the 10th, 50th, and 90th percentiles, omitting the default 25% and 75%. For percentile calculations, see quantile-calculation.
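Note that describe() always reports the median, even when 0.5 is left out of the percentiles list. A quick sketch:

```python
import pandas as pd

s = pd.Series([25, 30, 35, 40, 45])

# 0.5 is not requested, but the 50% (median) row still appears
summary = s.describe(percentiles=[0.1, 0.9])
print(list(summary.index))
```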
describe() with Series
For a Series, describe() summarizes a single column:
series = df['Age']
print(series.describe())
Output:
count 5.000000
mean 35.000000
std 7.905694
min 25.000000
25% 30.000000
50% 35.000000
75% 40.000000
max 45.000000
Name: Age, dtype: float64
For a categorical Series:
series = df['City']
print(series.describe())
Output:
count 5
unique 5
top New York
freq 1
Name: City, dtype: object
For Series details, see series.
Handling Datetime Columns
In pandas 2.0 and later, datetime columns are treated as numeric and included in the summary by default, with statistics such as count, mean, min, max, and percentiles reported as timestamps. Older versions (before 2.0) excluded datetime columns unless datetime_is_numeric=True was passed; that parameter has since been removed:
df['Date'] = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])
print(df.describe())  # pandas >= 2.0 includes the Date column automatically
For datetime handling, see datetime-conversion.
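A self-contained sketch, assuming pandas 2.0 or later (where datetime columns are summarized automatically):

```python
import pandas as pd

df = pd.DataFrame({
    "value": [1, 2, 3],
    "when": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
})

# In pandas 2.0+, the datetime column appears alongside the numeric one;
# its mean, min, max, and percentiles are reported as timestamps
summary = df.describe()
print(summary)
```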
Practical Applications of describe()
The describe() method is versatile and supports various data analysis tasks:
Exploratory Data Analysis (EDA)
Use describe() to understand data distributions:
df = pd.read_csv('sales_data.csv')
print(df.describe())
This reveals ranges, averages, and potential outliers in numerical columns like sales or revenue. For data loading, see read-write-csv.
Identifying Outliers
Check min/max and quartiles to spot outliers:
print(df['Salary'].describe())
If max is significantly higher than the 75th percentile, investigate with the common 1.5 × IQR rule (Q3 + 1.5 × IQR, not Q3 × 1.5):
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
outliers = df[df['Salary'] > q3 + 1.5 * (q3 - q1)]
print(outliers)
For outlier handling, see handle-outliers.
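For a reusable check, the 1.5 × IQR rule can flag both tails at once. A sketch with made-up salaries (the helper name is illustrative):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

salaries = pd.Series([50000, 60000, 70000, 80000, 250000])
print(iqr_outliers(salaries))  # flags the 250000 entry
```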
Missing Data Detection
Low count values indicate missing data:
print(df.describe())
The Salary column’s count (4 vs. 5 rows) suggests missing values. Follow up with:
print(df.isnull().sum())
For missing data, see handling-missing-data.
Categorical Data Analysis
Summarize categorical columns to assess diversity:
print(df[['Name', 'City']].describe())
A high unique count (e.g., 5 for City) indicates varied categories. For frequency analysis, see value-counts.
Verifying Transformations
Check statistics after transformations:
df['Bonus'] = df['Salary'] * 0.1
print(df.describe())
This confirms the new Bonus column aligns with expectations. For column addition, see adding-columns.
Debugging Pipelines
Inspect statistics at pipeline stages:
df = pd.read_json('data.json')
print("Original:", df.describe())
df = df.dropna()
print("After dropna:", df.describe())
For JSON handling, see read-json.
Customizing describe() Output
Enhance the describe() experience with these techniques:
Adjusting Display Settings
Customize Pandas’ display for readability:
pd.set_option('display.float_format', '{:.2f}'.format)
print(df.describe())
Reset to defaults:
pd.reset_option('display.float_format')
For display customization, see option-settings.
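To apply a display format only temporarily, pd.option_context scopes the setting to a with block and restores the default automatically. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.23456, 2.34567]})

# The float format applies only inside the block
with pd.option_context("display.float_format", "{:.2f}".format):
    print(df.describe())

# Outside the block, the default formatting is back
print(df.describe())
```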
Combining with Other Methods
Pair describe() with inspection methods:
- info(): View metadata:
print(df.info())
print(df.describe())
See insights-info-method.
- head(): Preview data:
print(df.head())
print(df.describe())
See head-method.
- shape: Check dimensions:
print(df.shape)
print(df.describe())
Focusing on Specific Columns
Summarize selected columns:
print(df[['Age', 'Salary']].describe())
For column selection, see selecting-columns.
Visualizing Statistics
Plot statistics for visual insights:
df.describe().loc[['mean', 'std']].plot(kind='bar')
For visualization, see plotting-basics.
Common Issues and Solutions
While describe() is straightforward, consider these scenarios:
- Missing Values: Low count indicates NaN or None. Use dropna() or fillna() before analysis. See handle-missing-fillna.
- Excluded Columns: Only numerical columns are summarized by default. Use include='all' to cover categorical data.
- Skewed Data: High std or extreme min/max suggest skewness. Visualize with histograms:
df['Salary'].hist()
- Large Datasets: describe() is fast, but wide DataFrames may produce cluttered output. Select columns or adjust display settings.
- MultiIndex Data: describe() works normally, but verify index alignment:
df_multi = pd.DataFrame(
{'Value': [1, 2, 3]},
index=pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)])
)
print(df_multi.describe())
See multiindex-creation.
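To quantify the skewness hinted at by a high std or extreme min/max, the skew() method gives a numeric check. A sketch with made-up salaries:

```python
import pandas as pd

salaries = pd.Series([50000, 52000, 54000, 55000, 250000])

# A mean pulled far above the median is a classic sign of a long right tail
print(salaries.describe()[["mean", "50%"]])

# Skew near 0 suggests symmetry; large positive values indicate right skew
print(salaries.skew())
```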
Advanced Techniques
For advanced users, enhance describe() usage with these approaches:
Custom Statistical Functions
Combine describe() with custom aggregations:
print(df[['Age', 'Salary']].agg(['mean', 'median', 'skew']).T)
Selecting the numeric columns first avoids type errors when the DataFrame also contains strings.
For aggregation, see groupby-agg.
Memory Optimization
Check memory usage with info() and refine dtypes based on describe():
print(df.info(memory_usage='deep'))
print(df.describe())
df['Age'] = df['Age'].astype('int32')
For optimization, see optimize-performance.
Interactive Environments
In Jupyter Notebooks, describe() outputs are formatted as tables:
df.describe() # Displays as a formatted table
Statistical Validation
Compare describe() metrics with manual calculations, using a tolerance rather than == when comparing floating-point results:
print(np.isclose(df['Age'].mean(), df['Age'].describe()['mean']))
For statistical methods, see mean-calculations.
Verifying describe() Output
After using describe(), verify the results:
- Check Structure: Use info() or shape to confirm row/column counts. See data-dimensions-shape.
- Validate Content: Use head() or tail() to inspect data. See tail-method.
- Assess Quality: Use isnull() or nunique() to address missing values or duplicates. See nunique-values.
Example:
print(df.info())
print(df.describe())
print(df.head())
Conclusion
The Pandas describe() method is a powerful tool for summarizing descriptive statistics, offering quick insights into numerical and categorical data. By revealing metrics like mean, standard deviation, and unique value counts, describe() supports exploratory data analysis, outlier detection, and data quality assessment. Its flexibility and integration with other Pandas methods make it essential for understanding datasets and guiding analysis.
To deepen your Pandas expertise, explore insights-info-method for metadata, handling-missing-data for cleaning, or plotting-basics for visualization. With describe(), you’re equipped to unlock the statistical potential of your data with ease.