How do I select rows from a DataFrame based on column values?
To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=, so the parentheses in the last example are necessary. Without the parentheses,
df['column_name'] >= A & df['column_name'] <= B
is parsed as
df['column_name'] >= (A & df['column_name']) <= B
which results in a "Truth value of a Series is ambiguous" error.
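For instance, a minimal sketch (the column name and bounds here are made up) contrasting the two forms:
import pandas as pd

df = pd.DataFrame({'column_name': [1, 5, 10]})
A, B = 2, 8

# With parentheses: each comparison yields a boolean Series, combined element-wise.
print(df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)])

# Without parentheses: & binds first, and the resulting chained comparison
# tries to convert a boolean Series to a single bool, raising ValueError.
try:
    df.loc[df['column_name'] >= A & df['column_name'] <= B]
except ValueError as err:
    print(err)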
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
If you have multiple values you want to include, put them in a list (or, more generally, any iterable) and use isin:
print(df.loc[df['B'].isin(['one','three'])])
yields
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:
df = df.set_index(['B'])
print(df.loc['one'])
yields
A C D
B
one foo 0 0
one bar 1 2
one foo 6 12
or, to include multiple values from the index, use df.index.isin:
df.loc[df.index.isin(['one','two'])]
yields
A C D
B
one foo 0 0
one bar 1 2
two foo 2 4
two foo 4 8
two bar 5 10
one foo 6 12
Filter a DataFrame on a column if a list value is contained in the column value (pandas)
Here you go:
df = pd.DataFrame({'column': ['abc', 'def', 'ghi', 'abc, def', 'ghi, jkl', 'abc']})
filter_list = ['abc', 'jkl']  # example list (assumed; inferred from the output below)
contains_filter = '|'.join(filter_list)
df = df[pd.notna(df.column) & df.column.str.contains(contains_filter)]
Output:
column
0 abc
3 abc, def
4 ghi, jkl
5 abc
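If the values in filter_list may contain regex metacharacters, it is safer to escape them before joining (a small sketch of the same approach using re.escape):
import re
contains_filter = '|'.join(map(re.escape, filter_list))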
Filtering a large data frame based on column values using R
We can reshape to 'long' format with pivot_longer and filter by creating a logical vector from the first character extracted (with substr):
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = starts_with("code"),
values_to = 'code', names_to = NULL) %>%
filter(substr(code, 1, 1) == "E")
-output
# A tibble: 3 × 2
IDs code
<int> <chr>
1 1 E109
2 1 E341
3 3 E131
If the data is really big, we may do a filter before the pivot_longer to keep only rows having at least one code column that starts with 'E':
df1 %>%
filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>%
pivot_longer(cols = starts_with("code"),
values_to = 'code', names_to = NULL) %>%
filter(substr(code, 1, 1) == "E")
If it is a very big dataset, another option is data.table. Convert the data.frame to a data.table (setDT), loop across the columns of interest (.SDcols) with lapply, replace the elements that do not start with "E" with NA, then use fcoalesce via do.call to get the first non-NA element for each row:
library(data.table)
na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce,
    lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E", NA)))),
    .SDcols = patterns("code")])
-output
IDs code
1: 1 E109
2: 1 E341
3: 3 E131
data
df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L),
                      code1 = c("C443", "AX31", "E341", "E131"),
                      code2 = c("E109", "M223", "QWE1", "M223")),
                 class = "data.frame", row.names = c(NA, -4L))
How can I filter a single column in a dataframe on multiple values?
Put all 61 MRNs into a list:
mrnList = [val1, val2, ...,val61]
Then filter on these MRNs:
df_filtered = df[df['MRN'].isin(mrnList)]
Keep note of your MRN values' datatype while building mrnList.
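A quick sketch of that datatype caveat, with made-up MRN values: if the column holds strings but mrnList holds integers, isin matches nothing.
import pandas as pd

df = pd.DataFrame({'MRN': ['1001', '1002', '1003']})  # MRNs stored as strings
print(df[df['MRN'].isin([1001, 1002])])        # empty frame: int 1001 != str '1001'
print(df[df['MRN'].isin(['1001', '1002'])])    # matches the first two rows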
Filter pandas dataframe based on values in multiple columns
UPDATE:
you can replace empty strings with NaN and then use isin with ['7', 'N', np.nan]:
In [196]: df[~df[cols].replace('',np.nan).isin(['7','N', np.nan]).all(axis=1)]
Out[196]:
a b c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0 0 A X W N X
2 7 W N W W 1 Z
4 Y 0 W N X 1
5 N X 1 E 1 Z 7
6 1 X 7 0 A W A
7 X X Z X N A 1
8 7 1 A N X Z N
10 A N Z 7 0 A E
11 E N A Z N N 1
12 E A 1 Z E E W
13 N W Z E X A 0
14 Y 1 A W A E X
OLD answer:
Show rows containing 7 or N:
In [197]: df.loc[df[cols].isin(['7','N']).any(axis=1)]
Out[197]:
a b c dxpoa1 dxpoa2 dxpoa3 dxpoa4
0 0 A X W N X
1 Z W 2 7 7
3 1 7 E N N N N
4 Y 0 W N X 1
5 N X 1 E 1 Z 7
7 X X Z X N A 1
8 7 1 A N X Z N
9 N A Z N N N
10 A N Z 7 0 A E
11 E N A Z N N 1
Remove rows containing 7 or N:
In [198]: df.loc[~df[cols].isin(['7','N']).any(axis=1)]
Out[198]:
a b c dxpoa1 dxpoa2 dxpoa3 dxpoa4
2 7 W N W W 1 Z
6 1 X 7 0 A W A
12 E A 1 Z E E W
13 N W Z E X A 0
14 Y 1 A W A E X
Replace any with all if you want to keep/exclude rows where all columns contain either 7 or N.
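A minimal sketch of the all variant (using the same df and cols as in the setup below):
df.loc[df[cols].isin(['7', 'N']).all(axis=1)]    # keep rows where every dxpoa column is 7 or N
df.loc[~df[cols].isin(['7', 'N']).all(axis=1)]   # exclude those rows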
setup:
rows = 15
s = [''] + list('YWE17N0AZX')
df = pd.DataFrame(np.random.choice(s, size=(rows, 7)), columns=list('abc') + ['dxpoa1', 'dxpoa2', 'dxpoa3', 'dxpoa4'])
cols = df.filter(like='dxpoa').columns
Filtering Dataframe by keeping numeric values of a specific column only in R
You could use a regular expression to filter the relevant rows of your dataframe.
The regular expression ^\\d+(\\.\\d+)?$ matches character values that contain only digits, possibly with . as a decimal separator (e.g. 2, 2.3). You could then convert the Cost column to numeric using as.numeric() if needed.
See the example below:
Group = c("A", "A", "A", "B", "B", "C", "C", "C")
Cost = c(21,22,"closed", 12, 11,"ended", "closing", 13)
Year = c(2017,2016,2015,2017,2016,2017,2016,2015)
df = data.frame(Group, Cost, Year)
df[grep(pattern = "^\\d+(\\.\\d+)?$", df[,"Cost"]), ]
#> Group Cost Year
#> 1 A 21 2017
#> 2 A 22 2016
#> 4 B 12 2017
#> 5 B 11 2016
#> 8 C 13 2015
Note that this technique works even if your Cost column is of factor class, whereas using df[!is.na(as.numeric(df$Cost)), ] does not. For the latter you need to add as.character() first: df[!is.na(as.numeric(as.character(df$Cost))), ]. Both techniques keep factor levels.
Filter a dataframe based on condition in columns selected by name pattern
You can filter multiple columns at once using if_all:
library(dplyr)
df %>%
filter(if_all(matches("_qvalue"), ~ . < 0.05))
In this case I use the filtering condition . < 0.05 on all columns whose name matches _qvalue.
Your second approach can also work if you group by ID first and then use all inside filter:
df_ID = df %>% mutate(ID = 1:n())
df_ID %>%
select(contains("qval"), ID) %>%
gather(variable, value, -ID) %>%
group_by(ID) %>%
filter(all(value < 0.05)) %>%
semi_join(df_ID, by = "ID")