Python: pandas merge multiple dataframes
Below is the cleanest, most comprehensible way of merging multiple dataframes if complex queries aren't involved.
Simply merge on the DATE column, using the OUTER method (to keep all the data).
import pandas as pd
from functools import reduce
df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')
Now load all the files you have as dataframes into a list, and then merge them by applying merge pairwise with the reduce function.
# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]
Note: you can add as many dataframes as you like to the list above. This is the good part about this method: no complex queries are involved.
To keep the values that belong to the same date on the same row, merge on DATE:
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames)
# to fill the values that are missing in the merged dataframe, chain fillna() with the string you want:
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames).fillna('void')
- Now the output will have the values from the same date on the same line.
- You can fill the data that is missing from different frames for different columns using fillna().
Then write the merged data to a CSV file if desired.
df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)
This should give you
DATE VALUE1 VALUE2 VALUE3 ....
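A minimal, self-contained sketch of the approach above, using small in-memory frames in place of the CSV files. The sample dates and values are made up for illustration; only the DATE key and the VALUE1..VALUE3 column names follow the answer.

```python
from functools import reduce

import pandas as pd

# Hypothetical sample data standing in for file1.csv, file2.csv and file3.csv.
df1 = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-02"], "VALUE1": [1, 2]})
df2 = pd.DataFrame({"DATE": ["2020-01-02", "2020-01-03"], "VALUE2": [3, 4]})
df3 = pd.DataFrame({"DATE": ["2020-01-01", "2020-01-03"], "VALUE3": [5, 6]})

data_frames = [df1, df2, df3]

# Outer-merge the frames pairwise on DATE so every date from every frame is kept.
df_merged = reduce(
    lambda left, right: pd.merge(left, right, on="DATE", how="outer"),
    data_frames,
)

# Sort by DATE so the row order does not depend on the merge order.
df_merged = df_merged.sort_values("DATE").reset_index(drop=True)
print(df_merged)
```

Dates that are absent from a given frame show up as NaN in that frame's column, which is what fillna() or na_rep then replaces.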
Merging a lot of data.frames
Put them into a list and use merge with Reduce:
Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
# id v1 v2 v3
# 1 1 1 NA NA
# 2 10 4 NA NA
# 3 2 3 4 NA
# 4 43 5 NA NA
# 5 73 2 NA NA
# 6 23 NA 2 1
# 7 57 NA 3 NA
# 8 62 NA 5 2
# 9 7 NA 1 NA
# 10 96 NA 6 NA
You can also use this more concise version:
Reduce(function(...) merge(..., all=TRUE), list(df1, df2, df3))
Efficient way to merge multiple large DataFrames
You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge as well.
df_list = [df1, df2, ...]
for df in df_list:
    df.set_index(['name', 'id'], inplace=True)
df = pd.concat(df_list, axis=1) # join='inner'
df.reset_index(inplace=True)
Alternatively, you can replace the concat (second step) with an iterative join:
from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)
This may or may not be better than the merge.
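To compare the two options side by side, here is a small sketch with made-up frames that share the key columns name and id (the x, y and z columns are hypothetical). One detail worth keeping in mind: pd.concat aligns with an outer join by default, while DataFrame.join defaults to how='left', so the join needs how='outer' to return the same rows.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames sharing the key columns 'name' and 'id'.
df1 = pd.DataFrame({"name": ["a", "b"], "id": [1, 2], "x": [10, 20]})
df2 = pd.DataFrame({"name": ["b", "c"], "id": [2, 3], "y": [0.5, 0.7]})
df3 = pd.DataFrame({"name": ["a", "c"], "id": [1, 3], "z": ["u", "v"]})

# set_index without inplace returns new frames and leaves the inputs untouched.
indexed = [df.set_index(["name", "id"]) for df in (df1, df2, df3)]

# Option 1: a single index-aligned concatenation (outer join by default).
via_concat = pd.concat(indexed, axis=1).sort_index()

# Option 2: an iterative join; how='outer' is needed to match concat's default.
via_join = reduce(lambda x, y: x.join(y, how="outer"), indexed).sort_index()

print(via_concat.reset_index())
```

Which of the two is faster depends on the data, so it is worth timing both on a realistic sample.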
Best strategy for merging a lot of data frames using pandas
I would think the fastest way is to set the column you want to merge on as the index, create a list of the dataframes, and then pd.concat them. Something like this:
import os
import pandas as pd
directory = os.path.expanduser('~/home')
files = os.listdir(directory)
dfs = []
for filename in files:
    # only read the tab-separated files
    if '.tsv' in filename:
        df = pd.read_table(os.path.join(directory, filename), sep='\t').set_index('bird')
        dfs.append(df)
# align all frames column-wise on the shared 'bird' index
master_df = pd.concat(dfs, axis=1)
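The same idea can be wrapped in a small reusable function. This is only a sketch: the function name concat_tsv_files and its index_col parameter are made up here, and it assumes every file has the key column (bird in the answer above).

```python
from pathlib import Path

import pandas as pd


def concat_tsv_files(directory, index_col="bird"):
    """Read every .tsv file in `directory` and align them column-wise on `index_col`."""
    # Sort the paths so the column order of the result is reproducible.
    paths = sorted(Path(directory).expanduser().glob("*.tsv"))
    if not paths:
        raise FileNotFoundError(f"no .tsv files found in {directory}")
    # index_col makes read_csv set the index while parsing each file.
    dfs = [pd.read_csv(path, sep="\t", index_col=index_col) for path in paths]
    return pd.concat(dfs, axis=1)
```

It would be called as master_df = concat_tsv_files('~/home'), with the directory from the answer above.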
Merge multiple DataFrames Pandas
Consider setting the index on each data frame and then running the horizontal merge with pd.concat:
dfs = [df.set_index(['profile', 'depth']) for df in [df1, df2, df3]]
print(pd.concat(dfs, axis=1).reset_index())
# profile depth VAR1 VAR2 VAR3
# 0 profile_1 0.5 38.198002 NaN NaN
# 1 profile_1 0.6 38.198002 0.20440 NaN
# 2 profile_1 1.1 NaN 0.20442 NaN
# 3 profile_1 1.2 NaN 0.20446 15.188
# 4 profile_1 1.3 38.200001 NaN 15.182
# 5 profile_1 1.4 NaN NaN 15.182
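Because column-wise pd.concat aligns rows by index label, it is worth checking that the key combination is unique within each frame before concatenating. The sketch below adds such a check; the helper name concat_on_keys is hypothetical, and the sample values are taken from the first rows of the output above.

```python
import pandas as pd


def concat_on_keys(frames, keys):
    """Set `keys` as the index of each frame and concatenate column-wise.

    Raises ValueError if any frame contains a repeated key combination.
    """
    indexed = []
    for i, df in enumerate(frames):
        dup = df.duplicated(subset=keys)
        if dup.any():
            raise ValueError(f"frame {i} has {int(dup.sum())} duplicated key row(s)")
        indexed.append(df.set_index(keys))
    return pd.concat(indexed, axis=1).sort_index().reset_index()


df1 = pd.DataFrame(
    {"profile": ["profile_1"] * 2, "depth": [0.5, 0.6], "VAR1": [38.198002, 38.198002]}
)
df2 = pd.DataFrame(
    {"profile": ["profile_1"] * 2, "depth": [0.6, 1.1], "VAR2": [0.20440, 0.20442]}
)
merged = concat_on_keys([df1, df2], ["profile", "depth"])
print(merged)
```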
How to merge(efficient way) multiple data frame in one go?
You can use merge in Reduce:
Reduce(merge, list(df, df1, df2, df3))
# ID YEAR MONTH DAY HOUR VALUE1 VALUE2 VALUE3 VALUE4
#1 A 2020 1 16 15 1 3 6 9
#2 B 2020 1 16 15 2 4 7 10
#3 C 2020 1 16 15 3 5 8 11
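For readers working in pandas rather than R, a rough counterpart is functools.reduce with pd.merge. The sketch below rebuilds four small frames from the output shown above; how the original inputs were split across frames is an assumption.

```python
from functools import reduce

import pandas as pd

# Key columns shared by every frame (scalars are broadcast to all three rows).
keys = {"ID": ["A", "B", "C"], "YEAR": 2020, "MONTH": 1, "DAY": 16, "HOUR": 15}

df = pd.DataFrame({**keys, "VALUE1": [1, 2, 3]})
df1 = pd.DataFrame({**keys, "VALUE2": [3, 4, 5]})
df2 = pd.DataFrame({**keys, "VALUE3": [6, 7, 8]})
df3 = pd.DataFrame({**keys, "VALUE4": [9, 10, 11]})

# With no `on` argument, pd.merge joins on all columns the two frames share
# (here ID, YEAR, MONTH, DAY, HOUR) and performs an inner join.
merged = reduce(pd.merge, [df, df1, df2, df3])
print(merged)
```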
Merging data frame and filling missing values
You can put the data frames in a list and use merge with Reduce. Missing values in the new dataframe can be replaced with -1.
new_df <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))
new_df[is.na(new_df)] <- -1
new_df
# Letter Values1 Values2 Values3
#1 A 1 0 -1
#2 B 2 -1 -1
#3 C 3 5 -1
#4 D -1 9 5
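A tidyverse way with the same logic:
library(dplyr)
library(purrr)
library(tidyr)  # provides replace_na
list(df1, df2, df3) %>%
  reduce(full_join) %>%
  # only fill the numeric columns; -1 is not a valid replacement for text columns
  mutate(across(where(is.numeric), ~ replace_na(.x, -1)))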
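The same merge-then-fill logic can be sketched in pandas. The input frames below are reconstructed from the output shown above, so their exact original contents are an assumption.

```python
from functools import reduce

import pandas as pd

# Hypothetical inputs reconstructed from the merged output above.
df1 = pd.DataFrame({"Letter": ["A", "B", "C"], "Values1": [1, 2, 3]})
df2 = pd.DataFrame({"Letter": ["A", "C", "D"], "Values2": [0, 5, 9]})
df3 = pd.DataFrame({"Letter": ["D"], "Values3": [5]})

# Full (outer) join of all frames on the shared Letter column.
new_df = reduce(
    lambda x, y: pd.merge(x, y, on="Letter", how="outer"),
    [df1, df2, df3],
)

# Fill only the value columns with -1, then restore the integer dtype
# that was lost when the outer join introduced NaN.
value_cols = [c for c in new_df.columns if c != "Letter"]
new_df = new_df.fillna({c: -1 for c in value_cols})
new_df = new_df.astype({c: int for c in value_cols})
new_df = new_df.sort_values("Letter").reset_index(drop=True)
print(new_df)
```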