Count occurrences of value in a set of variables in R (per row)
Try
apply(df,MARGIN=1,table)
where df is your data.frame. This returns a list with one element per row of the data.frame, in the same order. Each element is a table whose names are the values found in that row and whose entries are the number of times each value occurs. (If every row happened to contain the same number of distinct values, apply would simplify the result to a matrix instead of a list.)
For instance:
df=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))
#create a data.frame containing some data
df #show the data.frame
V1 V2 V3
1 10 20 20
2 20 30 10
3 10 20 20
4 20 30 10
apply(df,MARGIN=1,table) #apply the function table on each row (MARGIN=1)
[[1]]
10 20
1 2
[[2]]
10 20 30
1 1 1
[[3]]
10 20
1 2
[[4]]
10 20 30
1 1 1
#desired result
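Because the per-row tables can have different lengths, the list is awkward to attach to the data.frame. As a minimal sketch (not part of the original answer), the counts can instead be collected in a matrix with one column per distinct value by combining sapply and rowSums; the names values and counts are illustrative only.

```r
df <- data.frame(V1 = c(10, 20, 10, 20),
                 V2 = c(20, 30, 20, 30),
                 V3 = c(20, 10, 20, 10))

# All distinct values that occur anywhere in the data.frame
values <- sort(unique(unlist(df)))

# For each distinct value, count per row how many columns hold it
counts <- sapply(values, function(v) rowSums(df == v))
colnames(counts) <- values

counts[, "20"]  # number of 20s in each row
```

Each row of counts sums to the number of columns, which is a quick sanity check on the result.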
Count occurrences of value in a set of variables in R (per row) - with weights
One option is to apply the table function to each row to find how often each value occurs in that row. The frequency found for each column is then multiplied by the corresponding weight in V, and the column with the largest freq*V product is identified. The row's value in that column is the desired value.
#Multiplier for occurrence in each column
V = c(0.25,0.25,0.5)
#data frame
df8=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))
# This function receives all column values of one row. It looks up the
# frequency of each column's value within the row and multiplies it by V
# (column-wise). The value in the column with max(freq*V) is returned.
find_max_freq_val <- function(x){
  freq_df <- as.data.frame(table(x))
  freq_vec <- mapply(function(y) freq_df[freq_df$x == y, "Freq"], x)
  # Multiply the frequencies by V and find the index of max(freq*V);
  # which.max() returns the first index if several columns tie
  x[which.max(freq_vec * V)]
}
# Call the function above to add a column with the desired value
df8$new_val <- apply(df8, 1, find_max_freq_val)
df8
# V1 V2 V3 new_val
#1 10 20 20 20
#2 20 30 10 10
#3 10 20 20 20
#4 20 30 10 10
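To make the arithmetic concrete, here is the same computation traced by hand for the first row only. This is a sketch for illustration; the frequency lookup uses table indexing by name, which is a shorthand and not the code from the answer.

```r
V <- c(0.25, 0.25, 0.5)   # weights per column, as above
x <- c(10, 20, 20)        # values of row 1 (V1, V2, V3)

# Frequency of each column's value within the row: 10 occurs once, 20 twice
freq_vec <- as.vector(table(x)[as.character(x)])

# Weighted score per column
scores <- freq_vec * V

# Value in the column with the highest score
x[which.max(scores)]
```

The third column wins here because it combines the highest frequency with the largest weight.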
R count number of variables with value =mq per row
You can use the 'apply' function to count a particular value per row in your existing dataframe 'df':
df$count.MQ <- apply(df, 1, function(x) length(which(x=="mq")))
Here the second argument is 1 since you want to count for each row. You can read more about it from https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/apply
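As a small self-contained check, the sketch below runs the apply approach next to a vectorised rowSums equivalent. The data and the column names s1 to s3 are made up for illustration, and na.rm = TRUE is an assumption about how missing values should be treated.

```r
# Hypothetical example data; the column names are invented
df <- data.frame(
  s1 = c("mq", "ab", NA),
  s2 = c("mq", "mq", "cd"),
  s3 = c("ef", "mq", "gh"),
  stringsAsFactors = FALSE
)

# Row-wise count with apply, as in the answer
count_apply <- apply(df, 1, function(x) length(which(x == "mq")))

# Vectorised alternative; NA cells are ignored via na.rm = TRUE
count_rowsums <- rowSums(df == "mq", na.rm = TRUE)

df$count.MQ <- count_rowsums
```

Both counts are computed before the new column is added, so the count column itself is never part of the comparison.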
Count occurrence of string values per row in dataframe in R (dplyr)
You can use across with rowSums:
library(dplyr)
df %>% mutate(d9 = rowSums(across(all_of(cols), ~ .x %in% bcde)))
# d1 d2 d3 d4 d5 d6 d7 d8 d9
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#1 b a a a a a a a 0
#2 a a a a c a a a 1
#3 a b a a a a a a 1
#4 a a c a a b a a 2
#5 a a a a a a a a 0
#6 a a b a a a a a 1
#7 a a a a a d a a 1
#8 a a a d a a a a 1
This can also be written in base R -
df$d9 <- rowSums(sapply(df[cols], `%in%`, bcde))
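The snippets above do not show how df, cols and bcde are defined. The sketch below rebuilds df from the printed output and assumes that cols covers d2 to d8 and that bcde holds the letters "b" to "e"; both are guesses that happen to reproduce the d9 column shown, not definitions taken from the question.

```r
# Data rebuilt from the printed output above
df <- data.frame(
  d1 = c("b", "a", "a", "a", "a", "a", "a", "a"),
  d2 = c("a", "a", "b", "a", "a", "a", "a", "a"),
  d3 = c("a", "a", "a", "c", "a", "b", "a", "a"),
  d4 = c("a", "a", "a", "a", "a", "a", "a", "d"),
  d5 = c("a", "c", "a", "a", "a", "a", "a", "a"),
  d6 = c("a", "a", "a", "b", "a", "a", "d", "a"),
  d7 = rep("a", 8),
  d8 = rep("a", 8)
)

cols <- paste0("d", 2:8)          # assumption: d1 is not searched
bcde <- c("b", "c", "d", "e")     # assumption: the values to count

# Base R version from the answer
df$d9 <- rowSums(sapply(df[cols], `%in%`, bcde))
df$d9
```

Row 1 gets a count of 0 under these assumptions because its only "b" sits in d1, which is outside cols.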
How to count occurrences of several strings per row in a data frame in R
If I understand correctly, the OP has several lists of agents, each grouped by purpose (beta blockers, statins, and so on), not just one list of beta blockers. The OP wants to count how many different agents from each cluster each subject is taking, and to append the counts for each cluster to each row.
I suggest computing the counts for all clusters at once rather than doing this manually list by list.
For this, we first need to set up a data frame with the clustering:
cluster
Purpose Agent
1: BETA METOPROLOL
2: BETA BISOPROLOL
3: BETA NEBILET
4: BETA METOHEXAL
5: BETA SOTALEX
6: BETA QUERTO
7: BETA NEBIVOLOL
8: BETA CARVEDILOL
9: BETA METOPROLOLSUCCINAT
10: BETA BELOC
11: STATIN ATORVASTATIN
12: STATIN SIMVASTATIN
13: STATIN LOVASTATIN
14: STATIN PRAVASTATIN
15: STATIN FLUVASTATIN
16: STATIN PITAVASTIN
cluster can be created, e.g., by
library(data.table)
library(magrittr)
cluster <- list(
  BETA = c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
           "QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC"),
  STATIN = c("ATORVASTATIN", "SIMVASTATIN", "LOVASTATIN", "PRAVASTATIN",
             "FLUVASTATIN", "PITAVASTIN")
) %>%
  lapply(data.table) %>%
  rbindlist(idcol = "Purpose") %>%
  setnames("V1", "Agent")
To count the occurrences, we need to join (merge) this table with dat, the list of agents each subject is taking, after dat has been reshaped from wide to long format.
While data in spreadsheet-style wide format, i.e., with one row per subject and many columns, are often suitable for data entry and inspection, the database-style long format is often more suitable for data processing.
taken <- melt(setDT(dat)[, ID := .I], "ID", value.name = "Agent", na.rm = TRUE)[
Agent != ""][
, Agent := toupper(Agent)][]
ID variable Agent
1: 1 Med1 AMLODIPIN
2: 2 Med1 PLAVIX
3: 3 Med1 BISOPROLOL
4: 4 Med1 ASS
5: 5 Med1 ASS
6: 6 Med1 ASS
7: 1 Med2 RAMIPRIL
8: 2 Med2 SIMVASTATIN
9: 3 Med2 AMLODIPIN
10: 4 Med2 ENALAPRIL
11: 5 Med2 ATORVASTATIN
12: 6 Med2 FRAGMIN
13: 1 Med3 METOPROLOL
14: 2 Med3 MIRTAZAPIN
15: 3 Med3 ASS
16: 4 Med3 L-THYROXIN
17: 5 Med3 FOSAMAX
18: 6 Med3 TORASEMID
19: 3 Med4 VALSARTAN
20: 4 Med4 LITALIR
21: 5 Med4 CALCIUM
22: 6 Med4 SPIRONOLACTON
23: 3 Med5 CHLORALDURAT
24: 4 Med5 LITALIR
25: 5 Med5 PANTOZOL
26: 6 Med5 LORZAAR PROTECT
27: 3 Med6 DOXOZOSIN
28: 4 Med6 AMLODIPIN
29: 5 Med6 NOVAMINSULFON
30: 6 Med6 VESIKUR
31: 3 Med7 TAMSULOSIN
32: 4 Med7 CETIRIZIN
33: 6 Med7 ROCALTROL
34: 3 Med8 CIPRAMIL
35: 4 Med8 HCT
36: 6 Med8 ATORVASTATIN
37: 4 Med9 NACL
38: 6 Med9 PREDNISOLON
39: 4 Med10 CARMEN
40: 6 Med10 LACTULOSE
41: 4 Med11 PROTEIN 88
42: 6 Med11 MIRTAZAPIN
43: 4 Med12 NOVALGIN
44: 6 Med12 LANTUS
45: 6 Med13 ACTRAPID
46: 6 Med14 PANTOZOL
47: 6 Med15 SALBUTAMOL
48: 6 Med16 AMPHO MORONAL
ID variable Agent
dat is modified by appending a row number that identifies each subject; then it is reshaped to long format using melt(). Missing or empty entries are removed and agent names are converted to uppercase for consistency.
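For readers without data.table, the reshaping step can be approximated in base R. The sketch below is a swapped-in technique, not the melt() call from the answer; it uses a cut-down excerpt of dat (subjects 1 to 3, columns Med1, Med4 and Med6), and the names dat_small, med_cols and taken_small are illustrative.

```r
# Excerpt of dat: three subjects, three of the Med columns
dat_small <- data.frame(
  Med1 = c("AMLODIPIN", "PLAVIX", "BISOPROLOL"),
  Med4 = c("", "", "VALSARTAN"),
  Med6 = c("", "", "Doxozosin"),
  stringsAsFactors = FALSE
)
dat_small$ID <- seq_len(nrow(dat_small))   # row number identifies the subject
med_cols <- setdiff(names(dat_small), "ID")

# Wide to long: unlist() walks the columns one after another,
# so the IDs repeat once per column
taken_small <- data.frame(
  ID       = rep(dat_small$ID, times = length(med_cols)),
  variable = rep(med_cols, each = nrow(dat_small)),
  Agent    = toupper(unlist(dat_small[med_cols], use.names = FALSE)),
  stringsAsFactors = FALSE
)

# Drop empty entries
taken_small <- taken_small[taken_small$Agent != "", ]
taken_small
```

Only subject 3 keeps more than one row, because the other two subjects have empty Med4 and Med6 entries.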
Edit: In long format, it is also easy to check for duplicate agents per subject:
taken[duplicated(taken, by = c("ID", "Agent"))]
ID variable Agent
1: 4 Med5 LITALIR
and remove the duplicates:
taken <- unique(taken, by = c("ID", "Agent"))
The final step creates what I believe is the expected result:
ID BETA STATIN Med1 Med2 Med3 Med4 Med5 Med6 Med7 Med8
1: 1 1 0 AMLODIPIN RAMIPRIL METOPROLOL
2: 2 0 1 PLAVIX SIMVASTATIN MIRTAZAPIN
3: 3 1 0 BISOPROLOL AMLODIPIN ASS VALSARTAN CHLORALDURAT Doxozosin TAMSULOSIN CIPRAMIL
4: 4 0 0 ASS ENALAPRIL L-THYROXIN LITALIR LITALIR AMLODIPIN CETIRIZIN HCT
5: 5 0 1 ASS ATORVASTATIN FOSAMAX CALCIUM PANTOZOL NOVAMINSULFON
6: 6 0 1 ASS FRAGMIN TORASEMID SPIRONOLACTON LORZAAR PROTECT VESIKUR ROCALTROL ATORVASTATIN
Please note the additional columns with the counts by cluster (due to limited space, not all columns of the result are shown here). This is created by
cluster[taken, on = .(Agent)][
, dcast(.SD, ID ~ Purpose, length)][
dat, on = "ID"][
, "NA" := NULL][]
using the following operations:
- Join cluster and taken to have Purpose appended
- Reshape to wide format, one row per subject and one column per purpose, thereby counting the number of occurrences
- Join this result with the original data dat
- Remove the superfluous column of NA counts
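The join-and-count idea behind these operations can also be sketched in base R with merge and table. This is a rough equivalent under stated assumptions, not the data.table code from the answer: cluster_small and taken_small are small excerpts of the tables shown earlier (subjects 1 to 3, Med1 to Med3), and all other names are illustrative.

```r
# Excerpt of the cluster table
cluster_small <- data.frame(
  Purpose = c("BETA", "BETA", "STATIN", "STATIN"),
  Agent   = c("METOPROLOL", "BISOPROLOL", "ATORVASTATIN", "SIMVASTATIN"),
  stringsAsFactors = FALSE
)

# Excerpt of the long-format table of agents taken
taken_small <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Agent = c("AMLODIPIN", "RAMIPRIL", "METOPROLOL",
            "PLAVIX", "SIMVASTATIN", "MIRTAZAPIN",
            "BISOPROLOL", "AMLODIPIN", "ASS"),
  stringsAsFactors = FALSE
)

# Inner join: keeps only the agents that belong to some cluster
joined <- merge(taken_small, cluster_small, by = "Agent")

# Count per subject and purpose; the factor levels keep subjects
# that have no clustered agent at all, with a count of 0
counts <- table(
  ID      = factor(joined$ID, levels = sort(unique(taken_small$ID))),
  Purpose = joined$Purpose
)
counts_df <- as.data.frame.matrix(counts)
counts_df
```

Because the inner join discards unmatched agents, there is no column of NA counts to remove in this variant.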
Data
dat <- structure(list(Med1 = c("AMLODIPIN", "PLAVIX", "BISOPROLOL",
"ASS", "ASS", "ASS"), Med2 = c("RAMIPRIL", "SIMVASTATIN", "AMLODIPIN",
"ENALAPRIL", "ATORVASTATIN", "FRAGMIN"), Med3 = c("METOPROLOL",
"MIRTAZAPIN", "ASS", "L-THYROXIN", "FOSAMAX", "TORASEMID"), Med4 = c("",
"", "VALSARTAN", "LITALIR", "CALCIUM", "SPIRONOLACTON"), Med5 = c("",
"", "CHLORALDURAT", "LITALIR", "PANTOZOL", "LORZAAR PROTECT"),
Med6 = c("", "", "Doxozosin", "AMLODIPIN", "NOVAMINSULFON",
"VESIKUR"), Med7 = c("", "", "TAMSULOSIN", "CETIRIZIN", "",
"ROCALTROL"), Med8 = c("", "", "CIPRAMIL", "HCT", "", "ATORVASTATIN"
), Med9 = c("", "", "", "NACL", "", "PREDNISOLON"), Med10 = c("",
"", "", "CARMEN", "", "LACTULOSE"), Med11 = c("", "", "",
"PROTEIN 88", "", "MIRTAZAPIN"), Med12 = c("", "", "", "NOVALGIN",
"", "LANTUS"), Med13 = c("", "", "", "", "", "ACTRAPID"),
Med14 = c("", "", "", "", "", "PANTOZOL"), Med15 = c("",
"", "", "", "", "SALBUTAMOL"), Med16 = c("", "", "", "",
"", "AMPHO MORONAL")), class = "data.frame", row.names = c(NA,
-6L))
Counting number of instances of a condition per row R
You can use rowSums.
df$no_calls <- rowSums(df == "nc")
df
# rsID sample1 sample2 sample3 sample1304 no_calls
#1 abcd aa bb nc nc 2
#2 efgh nc nc nc nc 4
#3 ijkl aa ab aa nc 1
Or, as pointed out by MrFlick, to exclude the first column from the row sums, you can slightly modify the approach to
df$no_calls <- rowSums(df[-1] == "nc")
Regarding the row names: they are not counted in rowSums, and you can run a simple test to demonstrate it:
rownames(df)[1] <- "nc" # name first row "nc"
rowSums(df == "nc") # compute the row sums
#nc 2 3
# 2 4 1 # still the same in first row
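For a version that runs on its own, the data frame can be rebuilt from the printed output above. The sketch assumes that only the sample columns should be searched, so the rsID column is dropped before comparing.

```r
# Data rebuilt from the printed output above
df <- data.frame(
  rsID       = c("abcd", "efgh", "ijkl"),
  sample1    = c("aa", "nc", "aa"),
  sample2    = c("bb", "nc", "ab"),
  sample3    = c("nc", "nc", "aa"),
  sample1304 = c("nc", "nc", "nc"),
  stringsAsFactors = FALSE
)

# Count "nc" per row, ignoring the identifier column
df$no_calls <- rowSums(df[-1] == "nc")
df$no_calls
```

Dropping the first column matters only if an identifier could itself be "nc"; with the data shown, both variants give the same counts.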
Count occurrences of a variable having two given values corresponding to one value of another variable
The optimal solution in terms of memory would be one row for each pair, which would be 700*699/2 rows. However, this problem is still relatively small, and the simplicity of manipulating a 700*700 matrix is probably more valuable than the 700*701/2 cells you would save, which works out to about 240 kB at one byte per cell. The footprint could be even smaller if the matrix is sparse (i.e. most pairs of materials are never ordered together) and you use an appropriate data structure.
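The cell counts can be checked with a few lines of arithmetic. The figure of 700 materials comes from the text; the one-byte-per-cell figure is the same simplifying assumption made there.

```r
n <- 700                      # number of materials

n_pairs <- n * (n - 1) / 2    # one cell per unordered pair
n_full  <- n * n              # cells in the full square matrix
n_saved <- n_full - n_pairs   # cells saved by storing pairs only

n_saved == n * (n + 1) / 2    # TRUE: matches the 700*701/2 in the text
n_saved / 1000                # saving in kB at one byte per cell
```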
Here's what the code would look like:
First we want to create a dataframe with as many rows and columns as there are materials. Matrices are easier to create so we create one that we convert to a dataframe afterwards.
all_materials = levels(as.factor(X$Materials))
number_materials = length(all_materials)
Pairs <- as.data.frame(matrix(data = 0, nrow = number_materials, ncol = number_materials))
(Here, X is your dataset)
We then set the row names and column names to be able to access the rows and columns directly with the identifiers of the materials which are apparently not necessarily numbered from 1 to 700.
colnames(Pairs) <- all_materials
rownames(Pairs) <- all_materials
Then we iterate over the dataset
for(order in levels(as.factor(X$Order.number))){
  # getting the materials in each order; as.character() makes sure the cells
  # of Pairs are looked up by name rather than by position
  materials_for_order = as.character(X[X$Order.number==order, "Materials"])
  if (length(materials_for_order)>1) {
    # finding each possible pair from the materials list
    all_pairs_in_order = combn(x=materials_for_order, m=2)
    # incrementing the cell at the row and column corresponding to each pair
    for(i in seq_len(ncol(all_pairs_in_order))){
      Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] = Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] + 1
    }
  }
}
At the end of the loop, the Pairs table should contain everything you need.
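As a cross-check for the loop, a pair-count matrix can also be obtained without explicit iteration by taking the cross-product of an order-by-material incidence matrix. This is a different technique from the one above, shown here as a sketch on made-up data; X, its values and the names incidence and co are illustrative. Unlike the loop, it yields a symmetric matrix and counts each material at most once per order.

```r
# Hypothetical toy data: order 1 holds A, B, C; order 2 holds A, B; order 3 holds C
X <- data.frame(
  Order.number = c(1, 1, 1, 2, 2, 3),
  Materials    = c("A", "B", "C", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Orders in rows, materials in columns; 1 if the order contains the material
incidence <- 1 * (unclass(table(X$Order.number, X$Materials)) > 0)

# Cell [a, b] is the number of orders containing both a and b;
# the diagonal holds the number of orders containing each material
co <- crossprod(incidence)
co["A", "B"]
```

With the toy data, A and B share two orders, while C shares one order with each of them.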