Pandas Merge - How to Avoid Duplicating Columns

Pandas Merge - How to avoid duplicating columns

You can work out the columns that are only in one DataFrame and use this to select a subset of columns in the merge.

cols_to_use = df2.columns.difference(df.columns)

Then perform the merge (note this is an index object but it has a handy tolist() method).

dfNew = merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')

This will avoid any columns clashing in the merge.

Pandas merge without duplicating columns

Use DataFrame.combine_first with indices by postcode in both DataFrames and then if necessary add DataFrame.reindex for same order of columns like original df1:

print (df1)
  postcode  lat  lon  plus  32  more  columns
0      M20  2.3  0.2   NaN NaN   NaN      NaN
1      LS1  NaN  NaN   NaN NaN   NaN      NaN
2      LS1  NaN  NaN   NaN NaN   NaN      NaN
3      LS2  NaN  NaN   NaN NaN   NaN      NaN
4      M21  2.4  0.3   NaN NaN   NaN      NaN

df1 = df1.set_index('postcode')
df2 = df2.set_index('postcode')

df3 = df1.combine_first(df2).reindex(df1.columns, axis=1)
print (df3)
          lat  lon  plus  32  more  columns
postcode                                   
LS1       1.4  0.1   NaN NaN   NaN      NaN
LS1       1.4  0.1   NaN NaN   NaN      NaN
LS2       1.5  0.2   NaN NaN   NaN      NaN
M20       2.3  0.2   NaN NaN   NaN      NaN
M21       2.4  0.3   NaN NaN   NaN      NaN

How to merge Pandas dataframes without duplicating columns

Your problem is that you don't really want to just merge everything. You need to concat your first set of frames, then merge.

import pandas as pd
import numpy as np

base_frame.merge(pd.concat([frame1, frame2]), how='left')

#   id supplier1_match0
#0   1                x
#1   2               2x
#2   3              NaN

Alternatively, you could define base_frame so that it has all of the relevant columns of the other frames and set id to be the index and use .update. This ensures base_frame remains the same size, while the above does not. Though data would be over-written if there are multiple non-null values for a given cell.

base_frame = pd.DataFrame({'id':[1,2,3]}).assign(supplier1_match0 = np.NaN).set_index('id')

for df in [frame1, frame2]:
    base_frame.update(df.set_index('id'))

print(base_frame)

   supplier1_match0
id                 
1                 x
2                2x
3               NaN

Avoid duplicate columns while merging with pandas

I would pd.concat similar structured dataframes then merge the others like this:

df.merge(pd.concat([df1, df3]), on='date_time', how='left')\
  .merge(df2, on='date_time', how='left')

Output:

             date_time  potato  carrot
0  2018-06-01 00:00:00     NaN     NaN
1  2018-06-01 00:30:00    13.0     NaN
2  2018-06-01 01:00:00    21.0     NaN
3  2018-06-01 01:30:00    27.0    14.0

Per comments below:

df = pd.DataFrame({'date_time':['2018-06-01 00:00:00','2018-06-01 00:30:00','2018-06-01 01:00:00','2018-06-01 01:30:00']})

# Dataframes to merge to reference dataframe
df1 = pd.DataFrame({'date_time':['2018-06-01 00:30:00','2018-06-01 01:00:00'],
                'potato':[13,21]})

df2 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00','2018-06-01 02:30:00'],
                'carrot':[14,8,32]})

df3 = pd.DataFrame({'date_time':['2018-06-01 01:30:00', '2018-06-01 02:00:00'],'potato':[27,31], 'zucchini':[11,1]})

df.merge(pd.concat([df1, df3]), on='date_time', how='left').merge(df2, on='date_time', how='left')

Output:

             date_time  potato  zucchini  carrot
0  2018-06-01 00:00:00     NaN       NaN     NaN
1  2018-06-01 00:30:00    13.0       NaN     NaN
2  2018-06-01 01:00:00    21.0       NaN     NaN
3  2018-06-01 01:30:00    27.0      11.0    14.0

Python Pandas merge dataframes without duplicating columns

FYI, you do not need the reduce function, you can simply use:

df_all = df1.merge(df2)

It is duplicating columns because you are merging on 'Name'. If all your columns are the same, you can drop the on='Name' argument and it will merge on all common columns instead of duplicating them.

Alternatively, you can merge only the non-duplicate columns from df2:

df_all = df1.merge(df2[['Name','Age']])

Merging multiple data frames causing duplicate column names

You can do

s = pd.concat([x.set_index('key') for x in df_list],axis = 1,keys=range(len(df_list)))
s.columns = s.columns.map('{0[1]}_{0[0]}'.format)
s = s.reset_index()
s
Out[236]: 
  key   value_0   value_1   value_2   value_3
0   A -1.957968       NaN -0.852135 -0.976960
1   B  1.545932 -0.276838       NaN  0.197615
2   C -2.149727       NaN -0.364382  0.349993
3   D  0.524990 -0.476655       NaN       NaN
4   E       NaN -2.135870  0.798782       NaN
5   F       NaN  1.456544 -0.255705  0.447279

Pandas Merge - How to Avoid Duplicating Columns