Add column which contains binned values of a numeric column
See ?cut and specify breaks (and maybe labels).
# cut() builds right-closed intervals: (0,4], (4,10], (10,15]
x$bins <- cut(x$rank, breaks=c(0,4,10,15), labels=c("1-4","5-10","11-15"))
x
#   rank  name   info  bins
# 1    1 steve    red   1-4
# 2    3   joe   blue   1-4
# 3    6  john  green  5-10
# 4    3   liz yellow   1-4
# 5   15   jon   pink 11-15
Binning a column with Python Pandas
You can use pandas.cut:
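import pandas as pd

# Sample data matching the output shown below
df = pd.DataFrame({'percentage': [46.50, 44.20, 100.00, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print(df)
   percentage     binned
0       46.50   (25, 50]
1       44.20   (25, 50]
2      100.00  (50, 100]
3       42.12   (25, 50]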
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1, 2, 3, 4, 5, 6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print(df)
   percentage binned
0       46.50      5
1       44.20      5
2      100.00      6
3       42.12      5
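One thing worth checking with either call: pd.cut builds right-closed intervals by default, so a value equal to the lowest edge, or one beyond the last edge, ends up in no bin at all. The sketch below uses made-up sample values (not the data from the answer) to show this, together with the include_lowest and right parameters.

```python
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]
# Hypothetical values chosen to sit on or beyond the bin edges.
values = pd.Series([0, 1, 100, 150])

# Default: intervals are (left, right], so 0 and 150 match no bin (NaN).
default = pd.cut(values, bins)
print(default.isna().tolist())

# include_lowest=True makes the first interval include its left edge.
with_lowest = pd.cut(values, bins, include_lowest=True)
print(with_lowest.isna().tolist())

# right=False switches to [left, right) intervals, so 100 now matches no bin.
left_closed = pd.cut(values, bins, right=False)
print(left_closed.isna().tolist())
```

Which variant is appropriate depends on whether the edge values (here 0 and 100) can actually occur in your data.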
Or numpy.searchsorted:
import numpy as np

bins = [0, 1, 5, 10, 25, 50, 100]
# Returns the index of the first bin edge that is >= each value
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print(df)
   percentage  binned
0       46.50       5
1       44.20       5
2      100.00       6
3       42.12       5
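To see what the index returned by searchsorted means, here is a minimal pure-Python sketch using the standard library's bisect.bisect_left, which follows the same left-insertion rule as np.searchsorted with its default side='left'. The percentages list just repeats the sample values above.

```python
from bisect import bisect_left

bins = [0, 1, 5, 10, 25, 50, 100]
percentages = [46.50, 44.20, 100.00, 42.12]

# bisect_left returns i such that bins[i-1] < value <= bins[i],
# i.e. the value lies in the right-closed interval (bins[i-1], bins[i]].
binned = [bisect_left(bins, p) for p in percentages]
print(binned)  # [5, 5, 6, 5]
```

An index of 0 or len(bins) means the value lies outside the bin range, which pd.cut would report as NaN instead.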
...and then value_counts or groupby and aggregate size:
s = pd.cut(df['percentage'], bins=bins).value_counts()
print(s)
(25, 50]     3
(50, 100]    1
(10, 25]     0
(5, 10]      0
(1, 5]       0
(0, 1]       0
Name: percentage, dtype: int64
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print(s)
percentage
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
dtype: int64
By default, cut returns a categorical. Series methods like Series.value_counts() use all categories, even those that are not present in the data (see the pandas documentation on operations in categorical).
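A small sketch of what that means in practice, assuming the same sample percentages as above: the result of value_counts has one row per interval, and you can filter out the empty intervals if you only want the ones that occur.

```python
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]
df = pd.DataFrame({"percentage": [46.50, 44.20, 100.00, 42.12]})

binned = pd.cut(df["percentage"], bins=bins)
print(binned.dtype.name)  # the result is a categorical Series

counts = binned.value_counts()
# One row per interval, including the four intervals with no data.
print(len(counts))

# Keep only the intervals that actually occur in the data.
non_empty = counts[counts > 0]
print(non_empty)
```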
How do I reassign the values of a column based on different ranges in R?
We could use case_when from the dplyr package:
library(dplyr)
df %>%
  mutate(NEW = case_when(sleep_duration < 5 ~ 3,
                         sleep_duration >= 5 & sleep_duration < 6 ~ 2,
                         sleep_duration >= 6 & sleep_duration < 7 ~ 1,
                         sleep_duration >= 7 ~ 0))
Output:
  sleep_duration NEW
1            6.0   1
2            7.5   0
3            8.0   0
4           10.0   0
5            5.0   2
6            9.0   0
Data:
df <- data.frame(sleep_duration = c(6, 7.5, 8, 10, 5, 9))
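If you are doing the same recoding in pandas rather than R, one possible sketch is pd.cut with right=False, so the intervals are left-closed like the conditions in the case_when call. This is only a rough analogue, not part of the dplyr answer; the edges list is an assumption derived from those conditions.

```python
import pandas as pd

df = pd.DataFrame({"sleep_duration": [6, 7.5, 8, 10, 5, 9]})

# right=False gives left-closed intervals: [-inf, 5), [5, 6), [6, 7), [7, inf),
# mirroring the "<" and ">=" comparisons used in case_when.
edges = [float("-inf"), 5, 6, 7, float("inf")]
df["NEW"] = pd.cut(df["sleep_duration"], bins=edges, labels=[3, 2, 1, 0], right=False)
print(df)
```

Note that NEW is a categorical column here; use df["NEW"].astype(int) if you need plain integers.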
How to bin data based on values in one column, and count occurrences from another column excluding duplicates in R?
Will this work?
df <- data.frame(CNV=c("1:10405137","1:10405137","1:10405137","1:101161140","1:110028467"),
                 r_value=c(0.035118621,0.070643341,0.391963719,0.376573375,0.950231679))
> df # minimal example
          CNV    r_value
1  1:10405137 0.03511862
2  1:10405137 0.07064334
3  1:10405137 0.39196372
4 1:101161140 0.37657337
5 1:110028467 0.95023168
# assign each r_value to an interval
df1 <- transform(df, group=cut(r_value,
                               breaks=c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 1),
                               labels=c("<0.1", "0.1", "0.2", "0.3", "0.4", "0.5<")))
# count the rows in each interval that occurs in the data
res <- do.call(data.frame, aggregate(r_value~group, df1,
                                     FUN=function(x) c(Count=length(x))))
> res # counts of intervals
  group r_value
1  <0.1       2
2   0.3       2
3  0.5<       1
# merge with the full list of intervals so that empty ones show up (as NA)
dNew <- data.frame(group=levels(df1$group))
dNew <- merge(res, dNew, all=TRUE)
colnames(dNew) <- c("interval","count")
> dNew # count of CNV by interval
  interval count
1     <0.1     2
2      0.1    NA
3      0.2    NA
4      0.3     2
5      0.4    NA
6     0.5<     1
Adapted from "Group/bin/bucket data in R and get count per bucket and sum of values per bucket".
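The counts above are row counts per interval. If the goal is to count each CNV only once per interval (the "excluding duplicates" part of the question), one way to think about it is to collect a set of CNV values per interval. The following is a pure-Python sketch of that idea, not a translation of the R answer; the names rows and distinct are made up for illustration.

```python
from bisect import bisect_left

breaks = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 1]
labels = ["<0.1", "0.1", "0.2", "0.3", "0.4", "0.5<"]
rows = [
    ("1:10405137", 0.035118621),
    ("1:10405137", 0.070643341),
    ("1:10405137", 0.391963719),
    ("1:101161140", 0.376573375),
    ("1:110028467", 0.950231679),
]

# One set of distinct CNV values per interval; empty intervals stay empty.
distinct = {label: set() for label in labels}
for cnv, r_value in rows:
    i = bisect_left(breaks, r_value)  # r_value lies in (breaks[i-1], breaks[i]]
    if 1 <= i < len(breaks):          # skip values outside the break range
        distinct[labels[i - 1]].add(cnv)

counts = {label: len(cnvs) for label, cnvs in distinct.items()}
print(counts)
```

With this data the "<0.1" interval counts 1 rather than 2, because both rows in it share the same CNV, and empty intervals come out as 0 instead of NA.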
Add column into a dataframe based on condition
EDIT:
Your code is not wrong.
You just have to convert the result back to factor labels, like this:
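# Since R 4.0, data.frame() no longer turns strings into factors by default,
# so ask for factors explicitly to reproduce the situation described here
df <- data.frame(B=c("A","B","C","C"), C=c("A","C","B","B"), D=c("B","A","C","A"),
                 stringsAsFactors=TRUE)
# ifelse() returns integer codes; look them up in the levels to get the labels back
df$A <- levels(df$B)[with(df, ifelse(B==C, D, C))]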
To see why this happens, look at what ifelse does:
debugonce(ifelse)
ifelse(df$B==df$C,df$D,df$C)
Keep in mind: "Factor variables are stored, internally, as numeric variables together with their levels. The actual values of the numeric variable are 1, 2, and so on."
In particular, ifelse initializes its answer vector from the test, so you start with a logical vector. Then, based on the test comparison, ifelse subsets this ans vector and assigns the "yes" values into it. So R keeps the plain vector representation, and only the numeric codes of the factor are copied.
Briefly, something like this happens and you lose the factor representation:
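a <- c(TRUE, FALSE)
a[1] <- df$D[1]  # copies the underlying integer code, not the label
df$D             # still a factor (assuming D was created as one)
a                # plain vector, the levels are gone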
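The same effect can be imitated with a toy model in Python. This is only an illustration of the idea (labels stored as 1-based codes plus a list of levels), not how R implements factors, and a Python list will not coerce the remaining logical value the way R does.

```python
# Toy model of a factor: sorted unique labels plus 1-based integer codes.
values = ["B", "A", "C", "A"]
levels = sorted(set(values))                   # ['A', 'B', 'C']
codes = [levels.index(v) + 1 for v in values]  # [2, 1, 3, 1]

# Copying a stored value into a plain container keeps only the code.
plain = [True, False]
plain[0] = codes[0]
print(plain)  # [2, False]

# Indexing the levels by code recovers the labels, like levels(df$B)[...] in R.
restored = [levels[c - 1] for c in codes]
print(restored)  # ['B', 'A', 'C', 'A']
```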
Also try this working example (an alternative way to do the same thing):
df <- data.frame(B=c("A","B","C","C"), C=c("A","C","B","B"), D=c("B","A","C","A"))
df
# return z when x equals y, otherwise y
f <- function(x, y, z) {
  if (x == y) {
    z
  } else {
    y
  }
}
# apply f row by row across the three columns
df$A <- unlist(Map(f, df$B, df$C, df$D))
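For comparison, the row-by-row rule is easy to express in Python with zip. This is just a sketch of the same logic on the same sample columns; the function name pick is made up.

```python
B = ["A", "B", "C", "C"]
C = ["A", "C", "B", "B"]
D = ["B", "A", "C", "A"]

def pick(x, y, z):
    """Return z when x equals y, otherwise y (same rule as f above)."""
    return z if x == y else y

# zip walks the three columns in parallel, one row at a time.
A = [pick(x, y, z) for x, y, z in zip(B, C, D)]
print(A)  # ['B', 'C', 'B', 'B']
```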