How to drop rows from pandas data frame that contains a particular string in a particular column?
pandas has vectorized string operations, so you can just filter out the rows that contain the string you don't want:
In [91]: df = pd.DataFrame(dict(A=[5,3,5,6], C=["foo","bar","fooXYZbar", "bat"]))
In [92]: df
Out[92]:
A C
0 5 foo
1 3 bar
2 5 fooXYZbar
3 6 bat
In [93]: df[~df.C.str.contains("XYZ")]
Out[93]:
A C
0 5 foo
1 3 bar
3 6 bat
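One caveat: if the column contains missing values, str.contains returns NaN for them, which breaks boolean indexing. A minimal sketch (with hypothetical data containing a None) showing the na=False workaround:

```python
import pandas as pd

# Hypothetical frame with a missing value in the string column
df = pd.DataFrame(dict(A=[5, 3, 5], C=["foo", None, "fooXYZbar"]))

# str.contains yields NaN for missing entries; na=False treats them
# as "no match", so the row with None survives the negated filter
kept = df[~df.C.str.contains("XYZ", na=False)]
print(kept)
```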
How to delete ANY row containing specific string in pandas?
You can use isin with any:
df = df[~df.isin(['refused']).any(axis=1)]
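To make that one-liner concrete, here is a small self-contained sketch (the survey-style data is a made-up example) showing how isin marks matching cells and any(axis=1) flags rows:

```python
import pandas as pd

# Hypothetical survey frame where 'refused' can appear in any column
df = pd.DataFrame({
    "q1": ["yes", "no", "refused"],
    "q2": ["no", "refused", "yes"],
    "q3": ["yes", "yes", "no"],
})

# isin marks every cell equal to 'refused'; any(axis=1) flags rows with
# at least one such cell, and ~ keeps only the fully clean rows
clean = df[~df.isin(["refused"]).any(axis=1)]
print(clean)
```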
Drop rows in dataframe if the column matches particular string
Essentially, you are forgetting to pass the boolean series (True/False) into brackets [...], or better, into .loc[...]. Instead, you are re-assigning the values within those chunk columns to the result of your conditions, rather than using the conditions to filter the data frame. Therefore, call .loc[] with the intersection (logical AND) of both conditions:
# ASSIGN BOOLEAN SERIES
fname_jr = ~chunk[0].str.contains("jr", na=False)
lname_jr = ~chunk[1].str.contains("jr", na=False)
# PASS INTO .loc
chunk_sub = chunk.loc[fname_jr & lname_jr]
chunk_sub
# 0 1 ... 9 10
# 0 jane doe ... kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI= cigna_TOKEN_ENCRYPTION_KEY
# 2 jane sr ... kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI= cigna_TOKEN_ENCRYPTION_KEY
And to match multiple values at once, call "|".join to combine a list of items into a pipe-delimited regex pattern:
# ASSIGN BOOLEAN SERIES
fname_jr_sr = ~chunk[0].str.contains("|".join(["sr", "jr"]), na=False)
lname_jr_sr = ~chunk[1].str.contains("|".join(["sr", "jr"]), na=False)
# PASS INTO .loc
chunk_sub = chunk.loc[fname_jr_sr & lname_jr_sr]
chunk_sub
# 0 1 ... 9 10
# 0 jane doe ... kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI= cigna_TOKEN_ENCRYPTION_KEY
Relatedly, your np.where call is not necessary, as .loc accepts a boolean series directly. Be sure to also escape | with backslashes (\\|), since the pipe symbol is the regex alternation operator. Altogether:
chunk = chunk.loc[(chunk[0].astype('str').str.len()>1) &
(chunk[1].astype('str').str.len()>1) &
(chunk[4].astype('str').str.len()>4) &
(chunk[4].astype('str').str.len()<8) &
~chunk[0].str.contains("|".join(["sr", "jr", "\\|", "\\|\\|"]), na=False) &
~chunk[1].str.contains("|".join(["sr", "jr", "\\|", "\\|\\|"]), na=False)]
chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
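Rather than hand-escaping each pipe, re.escape can escape any regex metacharacter in the search terms. A short sketch with made-up data containing a literal pipe:

```python
import re
import pandas as pd

# Hypothetical column containing a literal pipe character
s = pd.Series(["jane|doe", "john smith", "mary jr"])

# re.escape backslash-escapes metacharacters such as |, so the
# joined pattern matches them literally instead of as alternation
bad = ["jr", "|"]
pattern = "|".join(re.escape(w) for w in bad)
mask = ~s.str.contains(pattern, na=False)
print(s[mask])
```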
Python Pandas Dataframe dropping rows based on a column containing a character
IIUC:
df[~df.dates.astype(str).str.contains('/')]
For example
df = pd.DataFrame()
df['dates'] = ['2011-01-20', '2011-01-20', '2011/01/20', '2011-01-20']
dates
0 2011-01-20
1 2011-01-20
2 2011/01/20
3 2011-01-20
Then
df[~df.dates.str.contains('/')]
dates
0 2011-01-20
1 2011-01-20
3 2011-01-20
You can also use map (as you tried), but with bool values rather than int, so that you perform boolean masking:
df[df['dates'].map(lambda x: False if '/' in x else True )]
dates
0 2011-01-20
1 2011-01-20
3 2011-01-20
However, notice that False if '/' in x else True is redundant; it is the same as '/' not in x:
df[df['dates'].map(lambda x: '/' not in x)]
dates
0 2011-01-20
1 2011-01-20
3 2011-01-20
Dropping rows with contain of a list of certain strings in Pandas
The Series.str.contains method accepts a regex pattern:
>>> df
col1
0 24/05/2020
1 May Year 2020
2 Monday
3 May 2020
>>> drop_values = ['Monday','Year', '/']
>>> df[~df['col1'].str.contains('|'.join(drop_values))]
col1
3 May 2020
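Because the joined pattern is a regex, it is safer to run the drop values through re.escape in case one of them contains a metacharacter; case=False additionally makes the match case-insensitive. A sketch using the same frame (the lowercase drop values are an assumption for the case-insensitive variant):

```python
import re
import pandas as pd

df = pd.DataFrame({"col1": ["24/05/2020", "May Year 2020", "Monday", "May 2020"]})

# re.escape keeps each drop value literal even if it contains a regex
# metacharacter; case=False matches regardless of capitalisation
drop_values = ["monday", "year", "/"]
pattern = "|".join(map(re.escape, drop_values))
result = df[~df["col1"].str.contains(pattern, case=False)]
print(result)
```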
Deleting/dropping rows in pandas DataFrame with particular string in ANY column
You can select only the object columns (i.e., strings) with select_dtypes:
df = energy.select_dtypes(object)
# regex=False improves performance, as noted by @jpp - thank you
mask = ~df.apply(lambda series: series.str.contains('Economy 7', regex=False)).any(axis=1)
no_eco = energy[mask]
Sample:
energy = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('adabbb')
})
print (energy)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 d
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df = energy.select_dtypes(object)
mask = ~df.apply(lambda series: series.str.contains('d')).any(axis=1)
no_eco = energy[mask]
print (no_eco)
A B C D E F
0 a 4 7 1 5 a
2 c 4 9 5 6 a
4 e 5 2 1 2 b
5 f 4 3 0 4 b
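If the object columns can contain missing values, it may be worth adding na=False inside the apply so that NaNs do not poison the row mask. A condensed sketch of the same technique (a smaller made-up frame, searching for 'd' as above):

```python
import pandas as pd

energy = pd.DataFrame({
    "A": list("abcdef"),
    "B": [4, 5, 4, 5, 5, 4],
    "F": list("adabbb"),
})

# Restrict the substring search to object (string) columns, then build
# a row mask; na=False guards against missing values, and regex=False
# keeps the match literal (and faster)
obj = energy.select_dtypes(object)
mask = ~obj.apply(lambda s: s.str.contains("d", regex=False, na=False)).any(axis=1)
print(energy[mask])
```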
Pandas Drop Rows when a String is Matched to a Longer String in a Column in an Exact Match
You can create a set from drop_list and use set.isdisjoint on the split words in each row to evaluate whether an exact whole-word match appears.
drop_set = set(drop_list)
msk = df['keyword'].apply(lambda x: drop_set.isdisjoint(x.split()))
df = df[msk]
Output:
keyword
0 adidas socks
2 adidas shoes
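Since drop_list is not shown above, here is a self-contained sketch with a hypothetical drop_list and keyword column, chosen to show that 'nike' matches only as a whole word and not as a substring of 'nikes':

```python
import pandas as pd

# Hypothetical data: drop_list and the keyword column are assumptions
drop_list = ["nike", "ball"]
df = pd.DataFrame({"keyword": ["adidas socks", "nike shoes",
                               "adidas shoes", "nikes on sale"]})

drop_set = set(drop_list)
# split() breaks each phrase into whole words; isdisjoint is True only
# when none of those words appears in drop_set, so substrings like
# 'nikes' do not trigger a drop
msk = df["keyword"].apply(lambda x: drop_set.isdisjoint(x.split()))
print(df[msk])
```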