Mastering the Pandas describe() Method: A Comprehensive Guide to Descriptive Statistics

Pandas is a cornerstone of data analysis in Python, offering powerful tools to explore and summarize structured data. Among its essential methods is describe(), which generates descriptive statistics for a DataFrame or Series, providing a quick overview of key numerical and categorical metrics. This method is invaluable for understanding data distributions, identifying patterns, and detecting anomalies during exploratory data analysis (EDA). This comprehensive guide dives deep into the Pandas describe() method, exploring its functionality, parameters, and practical applications. Designed for both beginners and experienced users, this blog provides detailed explanations and examples to ensure you can effectively leverage describe() in your data analysis workflows.

What is the Pandas describe() Method?

The describe() method in Pandas generates a summary of descriptive statistics for a DataFrame or Series, offering insights into the central tendency, dispersion, and shape of the data’s distribution. For numerical columns, it computes metrics like mean, standard deviation, and percentiles, while for categorical or object columns, it provides counts of unique values and the most frequent value. This method is a key tool for EDA, enabling users to quickly assess data characteristics without writing complex code.

The describe() method is part of Pandas’ data inspection toolkit, complementing methods like info() for metadata, head() for viewing rows, and value_counts() for frequency analysis. It’s widely used after loading data to understand its properties or after transformations to verify changes. For a broader overview of data viewing in Pandas, see viewing-data.

Why Use describe()?

The describe() method offers several benefits:

  • Quick Statistical Summary: Provides a snapshot of key metrics (e.g., mean, min, max) in one command.
  • Data Quality Insights: Helps identify outliers, missing values, or skewed distributions.
  • Categorical Analysis: Summarizes non-numerical data, such as unique values and top categories.
  • Workflow Efficiency: Simplifies EDA by automating common statistical calculations.
  • Versatility: Customizable to focus on specific data types or percentiles.

By incorporating describe() into your analysis, you can make informed decisions about data cleaning, preprocessing, and modeling.

Understanding the describe() Method

The describe() method is available for both DataFrames and Series with the following syntax:

DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Series.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
  • percentiles: List of percentiles to include (default is [0.25, 0.5, 0.75] for quartiles).
  • include: Data types to include (e.g., 'all', ['int64', 'float64'], or numpy.number).
  • exclude: Data types to exclude (e.g., ['object']).
  • datetime_is_numeric: Treat datetime columns as numeric (default is False). This parameter exists only in pandas 1.x; it was removed in pandas 2.0, where datetime columns are always treated as numeric.
  • Returns: A DataFrame (for DataFrames) or Series (for Series) containing summary statistics.

The method is non-destructive: it summarizes the data without modifying it, and it is fast enough to run routinely on datasets of varying sizes.
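
For instance, the include and exclude parameters accept dtype strings, lists of dtypes, or NumPy type objects. Here is a minimal sketch with a small illustrative DataFrame (the column names are hypothetical):

import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3], 'b': [1.5, 2.5, 3.5], 'c': ['x', 'y', 'y']})
print(demo.describe(include=[np.number]))  # only the numerical columns a and b
print(demo.describe(exclude=['object']))   # everything except the object column c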

Default Behavior

  • Numerical Columns: Computes count, mean, standard deviation, min, max, and quartiles (25%, 50%, 75%).
  • Categorical/Object Columns: Computes count, unique values, top value, and frequency of the top value. In a mixed DataFrame these columns require include='object' or include='all'; they are summarized by default only when the DataFrame contains no numerical columns.
  • Exclusions: By default, only numerical columns are summarized unless include is specified.

For related methods, see insights-info-method and value-counts.

Using the describe() Method

Let’s explore how to use describe() with practical examples, covering DataFrames, Series, and common scenarios.

describe() with DataFrames

For DataFrames, describe() generates statistics for numerical columns by default.

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, np.nan, 80000],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney']
})
print(df.describe())

Output:

             Age        Salary
count   5.000000      4.000000
mean   35.000000  65000.000000
std     7.905694  12909.944487
min    25.000000  50000.000000
25%    30.000000  57500.000000
50%    35.000000  65000.000000
75%    40.000000  72500.000000
max    45.000000  80000.000000

This output shows:

  • count: Number of non-null values (5 for Age, 4 for Salary due to NaN).
  • mean: Average value (35 for Age, 65000 for Salary).
  • std: Standard deviation, measuring dispersion.
  • min/max: Minimum and maximum values.
  • 25%/50%/75%: Quartiles (50% is the median).

The Name and City columns are excluded because they are object type. For data creation, see creating-data.

Including Non-Numerical Columns

To include categorical or object columns, use include='object' or include='all':

print(df.describe(include='object'))

Output:

         Name      City
count       5         5
unique      5         5
top     Alice  New York
freq        1         1

This shows:

  • count: Number of non-null values.
  • unique: Number of unique values.
  • top: Most frequent value.
  • freq: Frequency of the top value.

Use include='all' to summarize all columns:

print(df.describe(include='all'))

Output:

          Name        Age        Salary      City
count        5   5.000000      4.000000         5
unique       5        NaN           NaN         5
top      Alice        NaN           NaN  New York
freq         1        NaN           NaN         1
mean       NaN  35.000000  65000.000000       NaN
std        NaN   7.905694  12909.944487       NaN
min        NaN  25.000000  50000.000000       NaN
25%        NaN  30.000000  57500.000000       NaN
50%        NaN  35.000000  65000.000000       NaN
75%        NaN  40.000000  72500.000000       NaN
max        NaN  45.000000  80000.000000       NaN

Non-applicable metrics (e.g., mean for Name) are shown as NaN.

Customizing Percentiles

Adjust the percentiles included in the output:

print(df.describe(percentiles=[0.1, 0.5, 0.9]))

Output:

             Age        Salary
count   5.000000      4.000000
mean   35.000000  65000.000000
std     7.905694  12909.944487
min    25.000000  50000.000000
10%    27.000000  53000.000000
50%    35.000000  65000.000000
90%    43.000000  77000.000000
max    45.000000  80000.000000

This includes the 10th, 50th, and 90th percentiles, omitting the default 25% and 75%. Note that the median (50%) is always included in the output, even if 0.5 is not in the list you pass. For percentile calculations, see quantile-calculation.

describe() with Series

For a Series, describe() summarizes a single column:

series = df['Age']
print(series.describe())

Output:

count     5.000000
mean     35.000000
std       7.905694
min      25.000000
25%      30.000000
50%      35.000000
75%      40.000000
max      45.000000
Name: Age, dtype: float64

For a categorical Series:

series = df['City']
print(series.describe())

Output:

count           5
unique          5
top       New York
freq            1
Name: City, dtype: object

For Series details, see series.

Handling Datetime Columns

In pandas 1.x, datetime columns are excluded from the default summary unless you pass datetime_is_numeric=True:

df['Date'] = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])
print(df.describe(datetime_is_numeric=True))  # pandas 1.x only; the keyword was removed in pandas 2.0

This treats the datetime column as numeric, computing statistics such as mean, min, and max. In pandas 2.0 and later the datetime_is_numeric parameter no longer exists, and datetime columns are always treated as numeric. For datetime handling, see datetime-conversion.
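
On pandas 2.0 and later, the keyword is gone but the same information is still easy to get; a minimal sketch, assuming the Date column created above:

print(df['Date'].describe())       # count, mean, min, percentiles, and max as timestamps
print(df.describe(include='all'))  # the Date column is summarized alongside the other columns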

Practical Applications of describe()

The describe() method is versatile and supports various data analysis tasks:

Exploratory Data Analysis (EDA)

Use describe() to understand data distributions:

df = pd.read_csv('sales_data.csv')
print(df.describe())

This reveals ranges, averages, and potential outliers in numerical columns like sales or revenue. For data loading, see read-write-csv.

Identifying Outliers

Check min/max and quartiles to spot outliers:

print(df['Salary'].describe())

If max sits far above the 75th percentile, flag rows beyond the upper IQR fence (Q3 + 1.5 * IQR):

q1, q3 = df['Salary'].quantile([0.25, 0.75])  # first and third quartiles
outliers = df[df['Salary'] > q3 + 1.5 * (q3 - q1)]
print(outliers)

For outlier handling, see handle-outliers.

Missing Data Detection

Low count values indicate missing data:

print(df.describe())

The Salary column’s count (4 vs. 5 rows) suggests missing values. Follow up with:

print(df.isnull().sum())

For missing data, see handling-missing-data.

Categorical Data Analysis

Summarize categorical columns to assess diversity:

print(df[['Name', 'City']].describe())

A high unique count (e.g., 5 for City) indicates varied categories. For frequency analysis, see value-counts.

Verifying Transformations

Check statistics after transformations:

df['Bonus'] = df['Salary'] * 0.1
print(df.describe())

This confirms the new Bonus column aligns with expectations. For column addition, see adding-columns.

Debugging Pipelines

Inspect statistics at pipeline stages:

df = pd.read_json('data.json')
print("Original:", df.describe())
df = df.dropna()
print("After dropna:", df.describe())

For JSON handling, see read-json.

Customizing describe() Output

Enhance the describe() experience with these techniques:

Adjusting Display Settings

Customize Pandas’ display for readability:

pd.set_option('display.float_format', '{:.2f}'.format)
print(df.describe())

Reset to defaults:

pd.reset_option('all')

For display customization, see option-settings.
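
If you only need the formatting for a single call, pandas also provides option_context, which applies a setting temporarily inside a with block; a small sketch:

with pd.option_context('display.float_format', '{:.2f}'.format):
    print(df.describe())  # two-decimal formatting inside the block only
print(df.describe())      # default formatting is restored here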

Combining with Other Methods

Pair describe() with inspection methods:

  • info(): View metadata:
df.info()  # info() prints its report directly
print(df.describe())

See insights-info-method.

  • head(): Preview data:
print(df.head())
print(df.describe())

See head-method.

  • shape: Check dimensions:
print(df.shape)
print(df.describe())

See data-dimensions-shape.

Focusing on Specific Columns

Summarize selected columns:

print(df[['Age', 'Salary']].describe())

For column selection, see selecting-columns.

Visualizing Statistics

Plot statistics for visual insights:

df.describe().loc[['mean', 'std']].plot(kind='bar')

For visualization, see plotting-basics.

Common Issues and Solutions

While describe() is straightforward, consider these scenarios:

  • Missing Values: Low count indicates NaN or None. Use dropna() or fillna() before analysis, as sketched after this list. See handle-missing-fillna.
  • Excluded Columns: Only numerical columns are summarized by default. Use include='all' (or include='object') to cover categorical data.
  • Skewed Data: High std or extreme min/max suggest skewness. Visualize with histograms:
df['Salary'].hist()
  • Large Datasets: describe() is fast, but wide DataFrames may produce cluttered output. Select columns, transpose the result with df.describe().T, or adjust display settings.
  • MultiIndex Data: describe() works normally, but verify index alignment:
df_multi = pd.DataFrame(
    {'Value': [1, 2, 3]},
    index=pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)])
)
print(df_multi.describe())

See multiindex-creation.
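
Returning to the first point above, imputing missing values before summarizing changes the reported count; a minimal sketch using the sample df from earlier (filling with the column median is just one possible strategy):

filled = df.copy()
filled['Salary'] = filled['Salary'].fillna(filled['Salary'].median())  # impute the single NaN
print(filled['Salary'].describe())  # count is now 5 instead of 4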

Advanced Techniques

For advanced users, enhance describe() usage with these approaches:

Custom Statistical Functions

Combine describe() with custom aggregations:

print(df.select_dtypes(include='number').agg(['mean', 'median', 'skew']).T)  # numeric columns only

For aggregation, see groupby-agg.

Memory Optimization

Check memory usage with info() and refine dtypes based on describe():

df.info(memory_usage='deep')  # info() prints its report; 'deep' counts object data accurately
print(df.describe())
df['Age'] = df['Age'].astype('int32')  # downcast once describe() confirms the value range fits

For optimization, see optimize-performance.

Interactive Environments

In Jupyter Notebooks, describe() outputs are formatted as tables:

df.describe()  # Displays as a formatted table
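
Because the result of describe() is itself a DataFrame, you can also apply Styler formatting for nicer notebook rendering; a small sketch (the styled object only renders in a notebook):

df.describe().style.format('{:.2f}')  # two-decimal display without altering the underlying data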

Statistical Validation

Compare describe() metrics with manual calculations:

print(df['Age'].mean() == df['Age'].describe()['mean'])

For statistical methods, see mean-calculations.
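
Exact == comparisons can fail for results affected by floating point rounding; numpy.isclose is a safer check, sketched here:

import numpy as np

print(np.isclose(df['Age'].std(), df['Age'].describe()['std']))  # True within floating point tolerance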

Verifying describe() Output

After using describe(), verify the results:

  • Check Structure: Use info() or shape to confirm row/column counts. See data-dimensions-shape.
  • Validate Content: Use head() or tail() to inspect data. See tail-method.
  • Assess Quality: Use isnull() to quantify missing values and nunique() to count distinct values per column. See nunique-values.

Example:

df.info()
print(df.describe())
print(df.head())

Conclusion

The Pandas describe() method is a powerful tool for summarizing descriptive statistics, offering quick insights into numerical and categorical data. By revealing metrics like mean, standard deviation, and unique value counts, describe() supports exploratory data analysis, outlier detection, and data quality assessment. Its flexibility and integration with other Pandas methods make it essential for understanding datasets and guiding analysis.

To deepen your Pandas expertise, explore insights-info-method for metadata, handling-missing-data for cleaning, or plotting-basics for visualization. With describe(), you’re equipped to unlock the statistical potential of your data with ease.