How to Merge DataFrames in Python | The School of Code

Settings

Appearance

Choose a typography theme that suits your style

Back to How-to Guides
Python

How to Merge DataFrames in Python

Learn how to combine DataFrames in pandas using merge, join, and concat operations.

PythonPandasDataFramesData Science

Combining DataFrames is essential for data analysis. Pandas provides several methods to merge, join, and concatenate DataFrames.

Setup

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'salary': [50000, 60000, 55000, 70000]
})

Using merge()

Inner Merge (Default)

Keep only matching rows:

# Inner merge on 'id'
result = pd.merge(df1, df2, on='id')
print(result)
#    id     name  salary
# 0   1    Alice   50000
# 1   2      Bob   60000
# 2   3  Charlie   55000

Left Merge

Keep all rows from left DataFrame:

result = pd.merge(df1, df2, on='id', how='left')
print(result)
#    id     name   salary
# 0   1    Alice  50000.0
# 1   2      Bob  60000.0
# 2   3  Charlie  55000.0
# 3   4    David      NaN

Right Merge

Keep all rows from right DataFrame:

result = pd.merge(df1, df2, on='id', how='right')
print(result)
#    id     name  salary
# 0   1    Alice   50000
# 1   2      Bob   60000
# 2   3  Charlie   55000
# 3   5      NaN   70000

Outer Merge

Keep all rows from both DataFrames:

result = pd.merge(df1, df2, on='id', how='outer')
print(result)
#    id     name   salary
# 0   1    Alice  50000.0
# 1   2      Bob  60000.0
# 2   3  Charlie  55000.0
# 3   4    David      NaN
# 4   5      NaN  70000.0

Different Column Names

df1 = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'department': ['Sales', 'IT', 'HR']
})

# Merge with different column names
result = pd.merge(df1, df2, left_on='employee_id', right_on='emp_id')
print(result)

Multiple Join Keys

df1 = pd.DataFrame({
    'year': [2024, 2024, 2025],
    'quarter': [1, 2, 1],
    'sales': [100, 150, 120]
})

df2 = pd.DataFrame({
    'year': [2024, 2024, 2025],
    'quarter': [1, 2, 1],
    'target': [90, 140, 110]
})

result = pd.merge(df1, df2, on=['year', 'quarter'])
print(result)

Using join()

Join on index:

df1 = pd.DataFrame({
    'name': ['Alice', 'Bob']
}, index=[1, 2])

df2 = pd.DataFrame({
    'salary': [50000, 60000]
}, index=[1, 2])

# Join on index
result = df1.join(df2)
print(result)

Using concat()

Vertical Concatenation

Stack DataFrames on top of each other:

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenate vertically
result = pd.concat([df1, df2])
print(result)
#    A  B
# 0  1  3
# 1  2  4
# 0  5  7
# 1  6  8

# Reset index
result = pd.concat([df1, df2], ignore_index=True)
print(result)
#    A  B
# 0  1  3
# 1  2  4
# 2  5  7
# 3  6  8

Horizontal Concatenation

result = pd.concat([df1, df2], axis=1)
print(result)
#    A  B  A  B
# 0  1  3  5  7
# 1  2  4  6  8

Handling Duplicates

Suffixes for Overlapping Columns

df1 = pd.DataFrame({
    'id': [1, 2],
    'value': [10, 20]
})

df2 = pd.DataFrame({
    'id': [1, 2],
    'value': [100, 200]
})

result = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
print(result)
#    id  value_left  value_right
# 0   1          10          100
# 1   2          20          200

Validate Merge

# Ensure one-to-one relationship
try:
    result = pd.merge(df1, df2, on='id', validate='one_to_one')
except Exception as e:
    print(f"Validation failed: {e}")

Practical Examples

Combining Customer and Order Data

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'city': ['NYC', 'LA', 'Chicago']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 1, 3],
    'amount': [100, 200, 150, 300]
})

# All customer orders
result = pd.merge(customers, orders, on='customer_id')

# Customer order summary
summary = pd.merge(
    customers,
    orders.groupby('customer_id')['amount'].sum().reset_index(),
    on='customer_id'
)
print(summary)

Combining Multiple Files

import glob

# Read and combine multiple CSV files
files = glob.glob('data/*.csv')
dfs = [pd.read_csv(f) for f in files]
combined = pd.concat(dfs, ignore_index=True)

Left Join with Default Values

result = pd.merge(df1, df2, on='id', how='left')
result['salary'] = result['salary'].fillna(0)  # Fill missing with 0

Performance Tips

# Sort before merge for better performance
df1_sorted = df1.sort_values('id')
df2_sorted = df2.sort_values('id')

# Use categorical for repeated string values
df1['category'] = df1['category'].astype('category')

Summary

  • merge() - SQL-style joins on columns
  • join() - join on index
  • concat() - stack DataFrames vertically or horizontally
  • Join types: inner, left, right, outer
  • Use suffixes for overlapping column names
  • Use validate to check merge assumptions