Comparing two dataframes and getting the differences
This approach, df1 != df2, works only for dataframes with identically labelled rows and columns. In fact, all dataframe axes are compared with the _indexed_same method, and an exception is raised if any differences are found, even in the order of the columns or indices.
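A minimal sketch of that behaviour, using small made-up frames (the names df3, mask and raised are illustrative only). Comparing frames with the same labels gives an elementwise boolean frame, while merely reordering the columns should be enough to trigger the exception:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [1, 2], "b": [3, 5]})  # same labels, one value changed
df3 = pd.DataFrame({"b": [3, 4], "a": [1, 2]})  # same data, columns reordered

# Identically labelled frames: elementwise comparison works.
mask = df1 != df2

# Differently ordered columns: the comparison raises ValueError.
try:
    df1 != df3
    raised = False
except ValueError:
    raised = True
```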
If I understand you correctly, you are not looking for changed values but for the symmetric difference. For that, one approach is to concatenate the dataframes:
>>> df = pd.concat([df1, df2])
>>> df = df.reset_index(drop=True)
group by
>>> df_gpby = df.groupby(list(df.columns))
get index of unique records
>>> idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
filter
>>> df.reindex(idx)
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
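The steps above can be put together in one self-contained sketch. The fruit data here is made up for illustration, and the approach assumes that neither frame contains duplicate rows of its own and that there are no NaN values in the grouping columns (groupby drops NaN keys by default):

```python
import pandas as pd

# Hypothetical sample data; one row ("Banana") is shared by both frames.
df1 = pd.DataFrame({"Fruit": ["Apple", "Banana"], "Num": [22.1, 5.0]})
df2 = pd.DataFrame({"Fruit": ["Banana", "Orange"], "Num": [5.0, 8.6]})

# Stack both frames and give every row a unique integer label.
df = pd.concat([df1, df2]).reset_index(drop=True)

# Group identical rows together; a group of size 1 occurs in only one frame.
groups = df.groupby(list(df.columns)).groups
idx = sorted(labels[0] for labels in groups.values() if len(labels) == 1)

# Keep only the rows that appear exactly once.
sym_diff = df.loc[idx]
```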
Compare two DataFrames and output their differences side-by-side
The first part is similar to Constantine's answer: you can get a boolean Series showing which rows have changed*:
In [21]: ne = (df1 != df2).any(axis=1)
In [22]: ne
Out[22]:
0 False
1 True
2 True
dtype: bool
Then we can see which entries have changed:
In [23]: ne_stacked = (df1 != df2).stack()
In [24]: changed = ne_stacked[ne_stacked]
In [25]: changed.index.names = ['id', 'col']
In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool
Here the first level is the index label and the second is the column that has changed.
In [27]: difference_locations = np.where(df1 != df2)
In [28]: changed_from = df1.values[difference_locations]
In [29]: changed_to = df2.values[difference_locations]
In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation
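* Note: it's important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index.intersection(df2.index) (older pandas versions also accepted df1.index & df2.index for this, but that operator no longer performs a set intersection), but I think I'll leave that as an exercise.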
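One possible take on that exercise, as a sketch with made-up data (the score values and the name shared are illustrative): restrict both frames to the labels they have in common before comparing.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
df1 = pd.DataFrame({"score": [1.11, 2.22, 3.33]}, index=[0, 1, 2])
df2 = pd.DataFrame({"score": [1.11, 2.50, 4.44]}, index=[0, 1, 3])

# Labels present in both frames.
shared = df1.index.intersection(df2.index)

# Both sides now carry identical labels, so != no longer raises.
ne = (df1.loc[shared] != df2.loc[shared]).any(axis=1)
```

Rows that exist in only one frame (labels 2 and 3 here) are ignored by this comparison and would need to be reported separately.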
Pandas better method to compare two dataframes and find entries that only exist in one
It looks like using 'outer' as the how argument was the solution:
z = pd.merge(ORIGINAL, NEW, on=cols, how = 'outer', indicator=True)
z = z[z._merge != 'both'] # Filter out records from both
Output looks like this (after only showing the columns I care about)
Name Site _merge
Charlie A left_only
Doug B right_only
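For a version that can be run as-is, here is a sketch with invented data. The rows for Alice and Bob, and the choice of cols, are assumptions; only Charlie and Doug appear in the output above.

```python
import pandas as pd

# Hypothetical inputs: Charlie exists only in ORIGINAL, Doug only in NEW.
ORIGINAL = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"],
                         "Site": ["A", "B", "A"]})
NEW = pd.DataFrame({"Name": ["Alice", "Bob", "Doug"],
                    "Site": ["A", "B", "B"]})
cols = ["Name", "Site"]

# indicator=True adds a _merge column: left_only, right_only or both.
z = pd.merge(ORIGINAL, NEW, on=cols, how="outer", indicator=True)

# Keep the rows that are present in only one of the two frames.
only_one = z[z["_merge"] != "both"]
```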
Python Pandas - Find difference between two data frames
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update :
The above method only works for those data frames that don't already have duplicates themselves. For example:
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
It produces the output below, which is wrong.
Wrong Output:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3
Correct Output
Out[656]:
A B
1 2 3
2 3 4
3 3 4
How to achieve that?
Method 1: Using isin with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
Method 2: merge with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
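The pitfall and the fix can be checked side by side with the same df1 and df2 as above. This is only a sketch; the names naive, merged and left_only are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})

# keep=False drops every row that occurs more than once, so the (3, 4)
# rows that are duplicated inside df1 itself disappear as well.
naive = pd.concat([df1, df2]).drop_duplicates(keep=False)

# A left merge with an indicator keeps all rows of df1 that have no
# match in df2, including df1's own duplicates.
merged = df1.merge(df2, indicator=True, how="left")
left_only = merged.loc[merged["_merge"] != "both", ["A", "B"]]
```

Note that both methods above return the rows of df1 that are missing from df2, not the rows of df2 that are missing from df1.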
Compare two DataFrames and get the differences between them as output
Assuming this input:
df1 = pd.DataFrame([['tom', 10],['nick',15], ['juli',14]])
df2 = pd.DataFrame([['juli', 14],['daniel',15], ['tom',10], ['tom',10]])
You could use merge with the indicator option.
The rationale here is to create an additional column with an index per group to identify the duplicates.
cols = list(df1.columns)
(df1.assign(idx=df1.groupby(cols).cumcount())
    .merge(df2.assign(idx=df2.groupby(cols).cumcount()),
           on=list(df1.columns)+['idx'],
           indicator=True,
           how='outer')
    .drop('idx', axis=1)
    .query('_merge != "both"')
    #.to_excel('output.xlsx') ## uncomment to export as xlsx
)
output:
0 1 _merge
1 nick 15 left_only
3 daniel 15 right_only
4 tom 10 right_only
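To see what the helper column does, here is a sketch of the same idea with the intermediate step exposed. The column names name and age are assumptions added for readability (the frames above use the default integer column labels):

```python
import pandas as pd

df1 = pd.DataFrame([["tom", 10], ["nick", 15], ["juli", 14]],
                   columns=["name", "age"])
df2 = pd.DataFrame([["juli", 14], ["daniel", 15], ["tom", 10], ["tom", 10]],
                   columns=["name", "age"])
cols = list(df1.columns)

# cumcount numbers the repeats of each identical row: 0 for the first
# occurrence, 1 for the second, and so on.
numbered = df2.assign(idx=df2.groupby(cols).cumcount())

# Merging on the columns plus idx pairs the first "tom" in df2 with the
# "tom" in df1 and leaves the second "tom" unmatched.
diff = (df1.assign(idx=df1.groupby(cols).cumcount())
           .merge(numbered, on=cols + ["idx"], how="outer", indicator=True)
           .drop(columns="idx")
           .query('_merge != "both"'))
```

Without the idx column, both "tom" rows in df2 would match the single "tom" in df1 and the extra copy would go unreported.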
python pandas - compare two dataframes in multiple ways by custom ID
I am not sure if it is the fastest possible solution, but this problem seems to call for pd.merge. As you say, let's first deal with things that are in one dataframe but not the other:
def get_only_left(df1, df2):
    left_merge = pd.merge(df1, df2, on='ID', suffixes=('', '_other'), how='left')
    added_columns = [c + '_other' for c in df1.columns if c != 'ID']
    # rows with no match in df2 have NaN in every column brought in from df2
    mask = left_merge.loc[:, added_columns].isna().all(axis=1)
    return left_merge[mask].drop(added_columns, axis=1)
pd.concat([get_only_left(prior_df, current_df), get_only_left(current_df, prior_df)])
This gives
Date ID Value Category Subcategory
4 30-Nov 0005 500.0 D D900
4 31-Dec 0006 600.0 D D900
Then, let's deal with properly changing values.
columns = list(current_df.columns)
df = pd.merge(current_df, prior_df, on='ID', suffixes=('', '_prior'), how='inner')
mask = df['Value'] != df['Value_prior']
df[mask].loc[:, columns + ['Value_prior']]
This gives
Date ID Value Category Subcategory Value_prior
3 31-Dec 0004 400.0 E E900 450.0
Then similarly:
mask = df['Category'] != df['Category_prior']
df[mask].loc[:, columns + ['Category_prior']]
gives
Date ID Value Category Subcategory Category_prior
3 31-Dec 0004 400.0 E E900 D
And finally
import numpy as np
mask = np.logical_and(df['Category'] == df['Category_prior'], df['Subcategory'] != df['Subcategory_prior'])
df[mask].loc[:, columns + ['Subcategory_prior']]
gives
Date ID Value Category Subcategory Subcategory_prior
1 31-Dec 0002 200.0 B B101 B120
Comparing two pandas dataframes for differences
You also need to be careful to create a copy of the DataFrame, otherwise the csvdata_old will be updated with csvdata (since it points to the same object):
csvdata_old = csvdata.copy()
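A small sketch of why the copy matters, using a made-up one-column frame (the names alias and snapshot are illustrative). Plain assignment only creates a second name for the same object, so a later change shows up under both names:

```python
import pandas as pd

csvdata = pd.DataFrame({"a": [1, 2]})

alias = csvdata            # second name for the same DataFrame object
snapshot = csvdata.copy()  # independent copy of the data

# Modify the original in place.
csvdata.loc[0, "a"] = 99

# The alias sees the change, the copy still holds the old value.
alias_equal = csvdata.equals(alias)
snapshot_equal = csvdata.equals(snapshot)
```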
To check whether they are equal, you can use assert_frame_equal as in this answer:
from pandas.testing import assert_frame_equal
assert_frame_equal(csvdata, csvdata_old)
You can wrap this in a function with something like:
def frames_equal(csvdata, csvdata_old):  # function name is illustrative
    try:
        assert_frame_equal(csvdata, csvdata_old)
        return True
    except Exception:  # apparently AssertionError doesn't catch everything
        return False
There was discussion of a better way...
Diff of two Dataframes
Merge the two dataframes using how='outer' and pass the parameter indicator=True. This tells you whether each row is present in both, in the left only, or in the right only; you can then filter the merged dataframe:
In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']
Out[22]:
Buyer Quantity _merge
3 Carl 2 right_only
4 Mark 1 right_only