Delete rows that exist in another data frame?
You need the %in% operator. So,
df1[!(df1$name %in% df2$name),]
should give you what you want.
- df1$name %in% df2$name tests whether the values in df1$name are in df2$name.
- The ! operator reverses the result.
How to remove rows from Pandas dataframe if the same row exists in another dataframe but end up with all columns from both df
You can use a left join to get only the ids that are in the first data frame and not in the second, while also keeping all of the second data frame's columns.
import pandas as pd
df1 = pd.DataFrame(
data={"id": [1, 2, 3, 4], "col1": [9, 8, 7, 6], "col2": [5, 4, 3, 2]},
columns=["id", "col1", "col2"],
)
df2 = pd.DataFrame(
data={"id": [3, 4, 7], "col3": [11, 12, 13], "col4": [15, 16, 17]},
columns=["id", "col3", "col4"],
)
df_1_2 = df1.merge(df2, on="id", how="left", indicator=True)
df_1_not_2 = df_1_2[df_1_2["_merge"] == "left_only"].drop(columns=["_merge"])
which returns
   id  col1  col2  col3  col4
0   1     9     5   NaN   NaN
1   2     8     4   NaN   NaN
DataFrame remove rows existing in another DataFrame
Using pyspark:
You can create a list containing the customerId values from DF2 with collect():
from pyspark.sql.functions import col
id_df2 = [id[0] for id in df2.select('customerId').distinct().collect()]
And then filter your DF1 customerId using isin with negation ~:
diff = df1.where(~col('customerId').isin(id_df2))
How to remove rows in a Pandas dataframe if the same row exists in another dataframe?
You can use merge with the indicator parameter and an outer join, query for filtering, and then drop to remove the helper column. The DataFrames are joined on all columns, so the on parameter can be omitted:
print (pd.merge(a,b, indicator=True, how='outer')
.query('_merge=="left_only"')
.drop('_merge', axis=1))
   0   1
0  1  10
2  3  30
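The snippet above does not show how a and b were built. As a minimal sketch, assuming two small frames with default integer column labels (the values below are hypothetical, chosen so the result has the same values as the output shown), the same pipeline can be run end to end:

```python
import pandas as pd

# Hypothetical inputs; the original answer does not show them.
a = pd.DataFrame([[1, 10], [2, 20], [3, 30]])
b = pd.DataFrame([[2, 20]])

# Outer join on all shared columns; _merge records where each row came from.
merged = pd.merge(a, b, indicator=True, how="outer")

# Keep rows found only in a, then drop the helper column.
only_in_a = merged.query('_merge == "left_only"').drop("_merge", axis=1)
print(only_in_a)
```

Because no on argument is given, a row counts as a match only when every shared column agrees.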
In Pandas, how to delete rows from a Data Frame based on another Data Frame?
You can use boolean indexing with an isin condition, inverting the boolean Series with ~:
import pandas as pd
USERS = pd.DataFrame({'email':['a@g.com','b@g.com','b@g.com','c@g.com','d@g.com']})
print (USERS)
email
0 a@g.com
1 b@g.com
2 b@g.com
3 c@g.com
4 d@g.com
EXCLUDE = pd.DataFrame({'email':['a@g.com','d@g.com']})
print (EXCLUDE)
email
0 a@g.com
1 d@g.com
print (USERS.email.isin(EXCLUDE.email))
0 True
1 False
2 False
3 False
4 True
Name: email, dtype: bool
print (~USERS.email.isin(EXCLUDE.email))
0 False
1 True
2 True
3 True
4 False
Name: email, dtype: bool
print (USERS[~USERS.email.isin(EXCLUDE.email)])
email
1 b@g.com
2 b@g.com
3 c@g.com
Another solution with merge:
df = pd.merge(USERS, EXCLUDE, how='outer', indicator=True)
print (df)
email _merge
0 a@g.com both
1 b@g.com left_only
2 b@g.com left_only
3 c@g.com left_only
4 d@g.com both
print (df.loc[df._merge == 'left_only', ['email']])
email
1 b@g.com
2 b@g.com
3 c@g.com
Delete rows from dataframe if column value does not exist in another dataframe
Your question doesn't contain enough information, so I'll try to guess and show you a toy example.
If you're using pandas, then the solution would be:
>>> df1 = pd.DataFrame([x for x in pd.date_range('1/1/2020', '3/1/2020')], columns=['date'])
>>> df2 = pd.DataFrame([x for x in pd.date_range('2/20/2020', '3/1/2020')], columns=['date'])
>>> df1.shape
out: (61, 1)
>>> df2.shape
out: (11, 1)
>>> df1.head()
out:
date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05
>>> df2.head()
out:
date
0 2020-02-20
1 2020-02-21
2 2020-02-22
3 2020-02-23
4 2020-02-24
>>> new_df = df1[df1['date'].isin(df2['date'])]
>>> new_df
out:
date
50 2020-02-20
51 2020-02-21
52 2020-02-22
53 2020-02-23
54 2020-02-24
55 2020-02-25
56 2020-02-26
57 2020-02-27
58 2020-02-28
59 2020-02-29
60 2020-03-01
>>> new_df.shape
out: (11, 1)
Now "new_df" contains only those dates that are present in both dataframes.
How to remove rows of a DataFrame based off of data from another DataFrame?
Use isin on each column and combine the conditions with &:
df.loc[~((df.Product_Num.isin(df2['Product_Num']))&(df.Price.isin(df2['Price']))),:]
Out[246]:
    Product_Num     Date  Description  Price
0            10   1-1-18  FruitSnacks   2.99
1            10   1-2-18  FruitSnacks   2.99
4            10  1-10-18  FruitSnacks   2.99
5            45   1-1-18       Apples   2.99
6            45   1-3-18       Apples   2.99
7            45   1-5-18       Apples   2.99
11           45  1-15-18       Apples   2.99
Update
df.loc[~df.index.isin(df.merge(df2.assign(a='key'),how='left').dropna().index)]
Out[260]:
    Product_Num     Date  Description  Price
0            10   1-1-18  FruitSnacks   2.99
1            10   1-2-18  FruitSnacks   2.99
4            10  1-10-18  FruitSnacks   2.99
5            45   1-1-18       Apples   2.99
6            45   1-3-18       Apples   2.99
7            45   1-5-18       Apples   2.99
11           45  1-15-18       Apples   2.99
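One caveat worth illustrating: combining per-column isin checks with & tests each column independently, so a row can be dropped even when its (Product_Num, Price) pair never appears together in df2. The sketch below uses small hypothetical frames (not the asker's data) to contrast that with a pair-wise match done through merge and indicator:

```python
import pandas as pd

# Hypothetical data, chosen so the two approaches disagree.
df = pd.DataFrame({"Product_Num": [10, 10, 45], "Price": [2.99, 3.99, 3.99]})
df2 = pd.DataFrame({"Product_Num": [10, 45], "Price": [3.99, 2.99]})

# Column-wise check: every value of df appears somewhere in df2,
# so every row is dropped.
colwise = df[~(df.Product_Num.isin(df2.Product_Num) & df.Price.isin(df2.Price))]

# Pair-wise check: a row is dropped only if the exact pair exists in df2.
keys = ["Product_Num", "Price"]
merged = df.merge(df2[keys].drop_duplicates(), on=keys, how="left", indicator=True)
pairwise = df[(merged["_merge"] == "left_only").to_numpy()]

print(colwise)
print(pairwise)
```

Deduplicating df2 on the key columns keeps the left merge at one output row per row of df, which is what lets the boolean mask line up with df by position.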
Pandas delete rows in a dataframe that are not in another dataframe
Please try this:
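import numpy as np
df = pd.merge(df1, df2, how='left', indicator='Exist')
# flag rows of df1 that also appear in df2
df['Exist'] = np.where(df.Exist == 'both', True, False)
# keep the matching rows, then drop the helper column and 'z'
df = df[df['Exist']==True].drop(['Exist','z'], axis=1)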
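The answer above depends on the asker's data, which is not shown. As a rough, self-contained sketch of the same idea, assuming two hypothetical frames that share the columns x and y, the rows of df1 that also exist in df2 can be kept like this:

```python
import pandas as pd

# Hypothetical inputs; column names and values are assumptions.
df1 = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df2 = pd.DataFrame({"x": [2, 3, 4], "y": ["b", "c", "d"]})

# Left join on all shared columns; passing a string to indicator
# names the helper column.
merged = pd.merge(df1, df2, how="left", indicator="Exist")

# Keep only the rows that matched in df2, then drop the helper column.
kept = merged[merged["Exist"] == "both"].drop(columns="Exist")
print(kept)
```

Comparing the indicator column to 'both' directly gives the boolean mask, so the intermediate np.where step is optional.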