Combining DataFrames is essential for data analysis. Pandas provides several methods to merge, join, and concatenate DataFrames.

Setup

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'id': [1, 2, 3, 5],
    'salary': [50000, 60000, 55000, 70000]
})

Using merge()

Inner Merge (Default)

Keep only matching rows:

# Inner merge on 'id'
result = pd.merge(df1, df2, on='id')
print(result)
#    id     name  salary
# 0   1    Alice   50000
# 1   2      Bob   60000
# 2   3  Charlie   55000

Left Merge

Keep all rows from left DataFrame:

result = pd.merge(df1, df2, on='id', how='left')
print(result)
#    id     name   salary
# 0   1    Alice  50000.0
# 1   2      Bob  60000.0
# 2   3  Charlie  55000.0
# 3   4    David      NaN

Right Merge

Keep all rows from right DataFrame:

result = pd.merge(df1, df2, on='id', how='right')
print(result)
#    id     name  salary
# 0   1    Alice   50000
# 1   2      Bob   60000
# 2   3  Charlie   55000
# 3   5      NaN   70000

Outer Merge

Keep all rows from both DataFrames:

result = pd.merge(df1, df2, on='id', how='outer')
print(result)
#    id     name   salary
# 0   1    Alice  50000.0
# 1   2      Bob  60000.0
# 2   3  Charlie  55000.0
# 3   4    David      NaN
# 4   5      NaN  70000.0

Different Column Names

df1 = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'department': ['Sales', 'IT', 'HR']
})

# Merge with different column names
result = pd.merge(df1, df2, left_on='employee_id', right_on='emp_id')
print(result)

Multiple Join Keys

df1 = pd.DataFrame({
    'year': [2024, 2024, 2025],
    'quarter': [1, 2, 1],
    'sales': [100, 150, 120]
})

df2 = pd.DataFrame({
    'year': [2024, 2024, 2025],
    'quarter': [1, 2, 1],
    'target': [90, 140, 110]
})

result = pd.merge(df1, df2, on=['year', 'quarter'])
print(result)

Using join()

Join on index:

df1 = pd.DataFrame({
    'name': ['Alice', 'Bob']
}, index=[1, 2])

df2 = pd.DataFrame({
    'salary': [50000, 60000]
}, index=[1, 2])

# Join on index
result = df1.join(df2)
print(result)

Using concat()

Vertical Concatenation

Stack DataFrames on top of each other:

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenate vertically
result = pd.concat([df1, df2])
print(result)
#    A  B
# 0  1  3
# 1  2  4
# 0  5  7
# 1  6  8

# Reset index
result = pd.concat([df1, df2], ignore_index=True)
print(result)
#    A  B
# 0  1  3
# 1  2  4
# 2  5  7
# 3  6  8

Horizontal Concatenation

result = pd.concat([df1, df2], axis=1)
print(result)
#    A  B  A  B
# 0  1  3  5  7
# 1  2  4  6  8

Handling Duplicates

Suffixes for Overlapping Columns

df1 = pd.DataFrame({
    'id': [1, 2],
    'value': [10, 20]
})

df2 = pd.DataFrame({
    'id': [1, 2],
    'value': [100, 200]
})

result = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
print(result)
#    id  value_left  value_right
# 0   1          10          100
# 1   2          20          200

Validate Merge

# Ensure one-to-one relationship
try:
    result = pd.merge(df1, df2, on='id', validate='one_to_one')
except Exception as e:
    print(f"Validation failed: {e}")

Practical Examples

Combining Customer and Order Data

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'city': ['NYC', 'LA', 'Chicago']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 1, 3],
    'amount': [100, 200, 150, 300]
})

# All customer orders
result = pd.merge(customers, orders, on='customer_id')

# Customer order summary
summary = pd.merge(
    customers,
    orders.groupby('customer_id')['amount'].sum().reset_index(),
    on='customer_id'
)
print(summary)

Combining Multiple Files

import glob

# Read and combine multiple CSV files
files = glob.glob('data/*.csv')
dfs = [pd.read_csv(f) for f in files]
combined = pd.concat(dfs, ignore_index=True)

Left Join with Default Values

result = pd.merge(df1, df2, on='id', how='left')
result['salary'] = result['salary'].fillna(0)  # Fill missing with 0

Performance Tips

# Sort before merge for better performance
df1_sorted = df1.sort_values('id')
df2_sorted = df2.sort_values('id')

# Use categorical for repeated string values
df1['category'] = df1['category'].astype('category')

Summary

merge() - SQL-style joins on columns
join() - join on index
concat() - stack DataFrames vertically or horizontally
Join types: inner, left, right, outer
Use suffixes for overlapping column names
Use validate to check merge assumptions

Settings

Appearance