How to Merge DataFrames in Python
Learn how to combine DataFrames in pandas using merge, join, and concat operations.
Combining DataFrames is essential for data analysis. Pandas provides several methods to merge, join, and concatenate DataFrames.
Setup
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'id': [1, 2, 3, 5],
'salary': [50000, 60000, 55000, 70000]
})
Using merge()
Inner Merge (Default)
Keep only matching rows:
# Inner merge on 'id'
result = pd.merge(df1, df2, on='id')
print(result)
# id name salary
# 0 1 Alice 50000
# 1 2 Bob 60000
# 2 3 Charlie 55000
Left Merge
Keep all rows from left DataFrame:
result = pd.merge(df1, df2, on='id', how='left')
print(result)
# id name salary
# 0 1 Alice 50000.0
# 1 2 Bob 60000.0
# 2 3 Charlie 55000.0
# 3 4 David NaN
Right Merge
Keep all rows from right DataFrame:
result = pd.merge(df1, df2, on='id', how='right')
print(result)
# id name salary
# 0 1 Alice 50000
# 1 2 Bob 60000
# 2 3 Charlie 55000
# 3 5 NaN 70000
Outer Merge
Keep all rows from both DataFrames:
result = pd.merge(df1, df2, on='id', how='outer')
print(result)
# id name salary
# 0 1 Alice 50000.0
# 1 2 Bob 60000.0
# 2 3 Charlie 55000.0
# 3 4 David NaN
# 4 5 NaN 70000.0
Different Column Names
df1 = pd.DataFrame({
'employee_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'emp_id': [1, 2, 3],
'department': ['Sales', 'IT', 'HR']
})
# Merge with different column names
result = pd.merge(df1, df2, left_on='employee_id', right_on='emp_id')
print(result)
Multiple Join Keys
df1 = pd.DataFrame({
'year': [2024, 2024, 2025],
'quarter': [1, 2, 1],
'sales': [100, 150, 120]
})
df2 = pd.DataFrame({
'year': [2024, 2024, 2025],
'quarter': [1, 2, 1],
'target': [90, 140, 110]
})
result = pd.merge(df1, df2, on=['year', 'quarter'])
print(result)
Using join()
Join on index:
df1 = pd.DataFrame({
'name': ['Alice', 'Bob']
}, index=[1, 2])
df2 = pd.DataFrame({
'salary': [50000, 60000]
}, index=[1, 2])
# Join on index
result = df1.join(df2)
print(result)
Using concat()
Vertical Concatenation
Stack DataFrames on top of each other:
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenate vertically
result = pd.concat([df1, df2])
print(result)
# A B
# 0 1 3
# 1 2 4
# 0 5 7
# 1 6 8
# Reset index
result = pd.concat([df1, df2], ignore_index=True)
print(result)
# A B
# 0 1 3
# 1 2 4
# 2 5 7
# 3 6 8
Horizontal Concatenation
result = pd.concat([df1, df2], axis=1)
print(result)
# A B A B
# 0 1 3 5 7
# 1 2 4 6 8
Handling Duplicates
Suffixes for Overlapping Columns
df1 = pd.DataFrame({
'id': [1, 2],
'value': [10, 20]
})
df2 = pd.DataFrame({
'id': [1, 2],
'value': [100, 200]
})
result = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
print(result)
# id value_left value_right
# 0 1 10 100
# 1 2 20 200
Validate Merge
# Ensure one-to-one relationship
try:
result = pd.merge(df1, df2, on='id', validate='one_to_one')
except Exception as e:
print(f"Validation failed: {e}")
Practical Examples
Combining Customer and Order Data
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'city': ['NYC', 'LA', 'Chicago']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104],
'customer_id': [1, 2, 1, 3],
'amount': [100, 200, 150, 300]
})
# All customer orders
result = pd.merge(customers, orders, on='customer_id')
# Customer order summary
summary = pd.merge(
customers,
orders.groupby('customer_id')['amount'].sum().reset_index(),
on='customer_id'
)
print(summary)
Combining Multiple Files
import glob
# Read and combine multiple CSV files
files = glob.glob('data/*.csv')
dfs = [pd.read_csv(f) for f in files]
combined = pd.concat(dfs, ignore_index=True)
Left Join with Default Values
result = pd.merge(df1, df2, on='id', how='left')
result['salary'] = result['salary'].fillna(0) # Fill missing with 0
Performance Tips
# Sort before merge for better performance
df1_sorted = df1.sort_values('id')
df2_sorted = df2.sort_values('id')
# Use categorical for repeated string values
df1['category'] = df1['category'].astype('category')
Summary
merge()- SQL-style joins on columnsjoin()- join on indexconcat()- stack DataFrames vertically or horizontally- Join types:
inner,left,right,outer - Use
suffixesfor overlapping column names - Use
validateto check merge assumptions