Count Occurrences of Value in a Set of Variables in R (Per Row)

Try

apply(df,MARGIN=1,table)

Here df is your data.frame. When the rows contain different numbers of distinct values, as in the example below, this returns a list with one element per row of the data.frame, in the same order. Each element is a table whose names are the values found in that row and whose entries are the number of times each value occurs. (If every row happens to produce a table of the same length, apply() simplifies the result to a matrix instead of a list.)

For instance:

df=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))
#create a data.frame containing some data
df #show the data.frame
  V1 V2 V3
1 10 20 20
2 20 30 10
3 10 20 20
4 20 30 10
apply(df,MARGIN=1,table) #apply the function table on each row (MARGIN=1)
[[1]]

10 20 
 1  2 

[[2]]

10 20 30 
 1  1  1 

[[3]]

10 20 
 1  2 

[[4]]

10 20 30 
 1  1  1 

#desired result
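If a rectangular result is easier to work with than a list, one possible variation is to tabulate every row against the same set of levels. This is only a sketch: the names all_values and counts are made up for illustration, and it assumes all columns of df hold comparable values.

```r
df <- data.frame(V1 = c(10, 20, 10, 20),
                 V2 = c(20, 30, 20, 30),
                 V3 = c(20, 10, 20, 10))

# every distinct value that occurs anywhere in df
all_values <- sort(unique(unlist(df)))

# tabulate each row against the same levels, so all tables have equal length
# and apply() returns a matrix (one column per row of df); t() flips it so
# that there is one row per row of df and one column per value
counts <- t(apply(df, 1, function(r) table(factor(r, levels = all_values))))
counts
```

Values that do not occur in a row then show up as explicit zeros instead of being absent from that row's table.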

Count occurrences of value in a set of variables in R (per row) - with weights

One option is to apply table() to each row to find how often each value occurs, and then to look up that frequency for the value in each column. The weights defined in V are applied column-wise to these frequencies, and the function returns the row value found in the column where freq * V is largest.

#Multiplier for occurrence in each column
V = c(0.25,0.25,0.5)

#data frame
df8=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))

# This function takes the values of one row. It looks up how often each
# value occurs in the row, multiplies these frequencies by V (column-wise),
# and returns the row value at the position where freq * V is largest.

find_max_freq_val <- function(x){
  freq_df <- as.data.frame(table(x))
  # frequency of each element of x, in column order
  freq_vec <- mapply(function(y) freq_df[freq_df$x == y, "Freq"], x)
  # which.max() returns the first position of the maximum, so exactly one
  # value is returned even if several columns tie for max(freq * V)
  x[which.max(freq_vec * V)]
}

# call the function above to add a column with the desired value
df8$new_val <- apply(df8, 1, find_max_freq_val)

df8
#  V1 V2 V3 new_val
#1 10 20 20      20
#2 20 30 10      10
#3 10 20 20      20
#4 20 30 10      10
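To see where the value for the first row comes from, the intermediate steps can be traced by hand. The sketch below uses the first row of df8 and the weights V from above; the names x, freq and score are only illustrative.

```r
V <- c(0.25, 0.25, 0.5)   # column weights, as above
x <- c(10, 20, 20)        # first row of df8

# frequency of each element of x within the row: 10 occurs once, 20 twice
freq <- as.vector(table(x)[as.character(x)])

# weighted score per column
score <- freq * V

# value in the column with the highest weighted score
x[which.max(score)]
```

Here freq is 1, 2, 2 and score is 0.25, 0.5, 1, so the third column wins and the result is 20, matching new_val in the first row.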

R count number of variables with value =mq per row

You can use the apply function to count a particular value in each row of your existing data frame df:

df$count.MQ <- apply(df, 1, function(x) length(which(x=="mq")))

Here the second argument is 1 because you want to count within each row. You can read more about apply at https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/apply
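A vectorised alternative is to compare the whole data frame at once and sum the matches per row. The sketch below runs on made-up data (the columns a and b are hypothetical) and assumes missing values should simply not be counted.

```r
# toy data, invented for illustration
df <- data.frame(a = c("mq", "x", "mq"),
                 b = c("mq", "mq", NA),
                 stringsAsFactors = FALSE)

# df == "mq" gives a logical matrix; rowSums() counts the TRUEs per row.
# na.rm = TRUE skips cells where the comparison is NA.
df$count.MQ <- rowSums(df == "mq", na.rm = TRUE)
df
```

On large data frames this usually avoids the per-row function calls that apply makes.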

Count occurrence of string values per row in dataframe in R (dplyr)

You can use across with rowSums -

library(dplyr)

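# cols: the columns to check; bcde: the values to count.
# A lambda is used because dplyr 1.1.0 deprecated passing extra
# arguments to the function through across().
df %>% mutate(d9 = rowSums(across(all_of(cols), ~ .x %in% bcde)))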

#  d1    d2    d3    d4    d5    d6    d7    d8       d9
#  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#1 b     a     a     a     a     a     a     a         0
#2 a     a     a     a     c     a     a     a         1
#3 a     b     a     a     a     a     a     a         1
#4 a     a     c     a     a     b     a     a         2
#5 a     a     a     a     a     a     a     a         0
#6 a     a     b     a     a     a     a     a         1
#7 a     a     a     a     a     d     a     a         1
#8 a     a     a     d     a     a     a     a         1

This can also be written in base R -

df$d9 <- rowSums(sapply(df[cols], `%in%`, bcde))
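Since cols and bcde are not shown above, here is a small self-contained run of the base R version. The data and the contents of cols and bcde are assumptions made up for illustration, not the original poster's values.

```r
# toy data, invented for illustration
df <- data.frame(d1 = c("b", "a", "a"),
                 d2 = c("a", "c", "b"),
                 d3 = c("a", "a", "d"),
                 stringsAsFactors = FALSE)

cols <- c("d2", "d3")          # hypothetical: columns to check
bcde <- c("b", "c", "d", "e")  # hypothetical: values to count

# one logical column per checked column, then count the TRUEs per row
df$d9 <- rowSums(sapply(df[cols], `%in%`, bcde))
df
```

Note that the "b" in d1 is not counted because d1 is not in cols. With a single-row data frame, sapply returns a plain vector rather than a matrix, so rowSums would need the result wrapped in a matrix first.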

How to count occurrences of several strings per row in a data frame in R

If I understand correctly, the OP has not just one list of beta blockers but several lists of agents, each grouped by purpose (the OP also mentions statins, for example). The OP wants to count how many different agents from each cluster each subject is taking, and to append the counts for each cluster to each row.

I suggest computing the sums for all clusters at once rather than doing this manually, list by list.

For this, we first need to set up a data frame with the clustering:

cluster

    Purpose              Agent
 1:    BETA         METOPROLOL
 2:    BETA         BISOPROLOL
 3:    BETA            NEBILET
 4:    BETA          METOHEXAL
 5:    BETA            SOTALEX
 6:    BETA             QUERTO
 7:    BETA          NEBIVOLOL
 8:    BETA         CARVEDILOL
 9:    BETA METOPROLOLSUCCINAT
10:    BETA              BELOC
11:  STATIN       ATORVASTATIN
12:  STATIN        SIMVASTATIN
13:  STATIN         LOVASTATIN
14:  STATIN        PRAVASTATIN
15:  STATIN        FLUVASTATIN
16:  STATIN         PITAVASTIN

cluster can be created, e.g., by

library(data.table)
library(magrittr)
cluster <- list(
  BETA = c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
           "QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC"),
  STATIN = c("ATORVASTATIN", "SIMVASTATIN", "LOVASTATIN", "PRAVASTATIN", 
           "FLUVASTATIN", "PITAVASTIN")
  ) %>% 
  lapply(data.table) %>% 
  rbindlist(idcol = "Purpose") %>% 
  setnames("V1", "Agent")
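If data.table and magrittr are not at hand, a similar lookup table can be sketched in base R with stack(). This is an alternative, not the approach used in this answer; the names cluster_list and cluster_df are made up, and the agent lists are shortened for brevity.

```r
# shortened, illustrative subset of the agent lists above
cluster_list <- list(
  BETA   = c("METOPROLOL", "BISOPROLOL", "NEBIVOLOL"),
  STATIN = c("ATORVASTATIN", "SIMVASTATIN")
)

# stack() turns a named list into a two-column data frame (values, ind);
# reorder the columns and rename them to match the table shown above
cluster_df <- setNames(stack(cluster_list)[2:1], c("Purpose", "Agent"))
cluster_df
```

The result is a plain data.frame (with Purpose as a factor), so it would need setDT() before being used in the data.table joins below.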

To count the occurrences, we need to join (merge) this table with the list of agents each subject is taking. That list comes from dat, after dat has been reshaped from wide to long format.

While data in spreadsheet-style wide format, i.e., with one row per subject and many columns, are often suitable for data entry and inspection, the database-style long format is often more suitable for data processing.

taken <- melt(setDT(dat)[, ID := .I], "ID", value.name = "Agent", na.rm = TRUE)[
  Agent != ""][
    , Agent := toupper(Agent)][]

    ID variable           Agent
 1:  1     Med1       AMLODIPIN
 2:  2     Med1          PLAVIX
 3:  3     Med1      BISOPROLOL
 4:  4     Med1             ASS
 5:  5     Med1             ASS
 6:  6     Med1             ASS
 7:  1     Med2        RAMIPRIL
 8:  2     Med2     SIMVASTATIN
 9:  3     Med2       AMLODIPIN
10:  4     Med2       ENALAPRIL
11:  5     Med2    ATORVASTATIN
12:  6     Med2         FRAGMIN
13:  1     Med3      METOPROLOL
14:  2     Med3      MIRTAZAPIN
15:  3     Med3             ASS
16:  4     Med3      L-THYROXIN
17:  5     Med3         FOSAMAX
18:  6     Med3       TORASEMID
19:  3     Med4       VALSARTAN
20:  4     Med4         LITALIR
21:  5     Med4         CALCIUM
22:  6     Med4   SPIRONOLACTON
23:  3     Med5    CHLORALDURAT
24:  4     Med5         LITALIR
25:  5     Med5        PANTOZOL
26:  6     Med5 LORZAAR PROTECT
27:  3     Med6       DOXOZOSIN
28:  4     Med6       AMLODIPIN
29:  5     Med6   NOVAMINSULFON
30:  6     Med6         VESIKUR
31:  3     Med7      TAMSULOSIN
32:  4     Med7       CETIRIZIN
33:  6     Med7       ROCALTROL
34:  3     Med8        CIPRAMIL
35:  4     Med8             HCT
36:  6     Med8    ATORVASTATIN
37:  4     Med9            NACL
38:  6     Med9     PREDNISOLON
39:  4    Med10          CARMEN
40:  6    Med10       LACTULOSE
41:  4    Med11      PROTEIN 88
42:  6    Med11      MIRTAZAPIN
43:  4    Med12        NOVALGIN
44:  6    Med12          LANTUS
45:  6    Med13        ACTRAPID
46:  6    Med14        PANTOZOL
47:  6    Med15      SALBUTAMOL
48:  6    Med16   AMPHO MORONAL
    ID variable           Agent

dat is modified by appending a row number which identifies each subject, then it is reshaped to long format using melt(). Missing or empty entries are removed and agent names are converted to uppercase for consistency.
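For readers unfamiliar with melt(), the same reshaping idea can be sketched in base R on a tiny made-up table. The names dat_small and long are hypothetical, and this is meant to illustrate the steps rather than replace the data.table code.

```r
# tiny wide table, invented for illustration
dat_small <- data.frame(Med1 = c("Amlodipin", "Plavix"),
                        Med2 = c("Ramipril", ""),
                        stringsAsFactors = FALSE)
n <- nrow(dat_small)

# unlist() walks through the table column by column, so the subject ID
# repeats 1..n for every column and the column name repeats n times
long <- data.frame(
  ID       = rep(seq_len(n), times = ncol(dat_small)),
  variable = rep(names(dat_small), each = n),
  Agent    = toupper(unlist(dat_small, use.names = FALSE)),
  stringsAsFactors = FALSE
)

# drop missing and empty entries
long <- long[!is.na(long$Agent) & long$Agent != "", ]
long
```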

Edit: In long format, it is also easy to check for duplicate agents per subject:

taken[duplicated(taken, by = c("ID", "Agent"))]

   ID variable   Agent
1:  4     Med5 LITALIR

and remove the duplicates:

taken <- unique(taken, by = c("ID", "Agent"))

The final step creates what I believe is the expected result:

   ID BETA STATIN       Med1         Med2       Med3          Med4            Med5          Med6       Med7         Med8
1:  1    1      0  AMLODIPIN     RAMIPRIL METOPROLOL                                                                    
2:  2    0      1     PLAVIX  SIMVASTATIN MIRTAZAPIN                                                                    
3:  3    1      0 BISOPROLOL    AMLODIPIN        ASS     VALSARTAN    CHLORALDURAT     Doxozosin TAMSULOSIN     CIPRAMIL
4:  4    0      0        ASS    ENALAPRIL L-THYROXIN       LITALIR         LITALIR     AMLODIPIN  CETIRIZIN          HCT
5:  5    0      1        ASS ATORVASTATIN    FOSAMAX       CALCIUM        PANTOZOL NOVAMINSULFON                        
6:  6    0      1        ASS      FRAGMIN  TORASEMID SPIRONOLACTON LORZAAR PROTECT       VESIKUR  ROCALTROL ATORVASTATIN

Please note the additional columns with the counts by cluster. (Due to limited space, not all columns of the result are shown here.) The result is created by

cluster[taken, on = .(Agent)][
  , dcast(.SD, ID ~ Purpose, length)][
    dat, on = "ID"][
      , "NA" := NULL][]

using the following operations:

Join cluster and taken to have Purpose appended
Reshape to wide format, one row per subject and one column per purpose, thereby counting the number of occurrences
Join this result with the original data dat
Remove the superfluous column of NA counts
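The join-and-count logic of these operations can also be traced in base R on a few made-up rows. This sketch uses merge() and table() instead of the data.table join and dcast(); the small taken and cluster tables here are invented stand-ins, and joined, counts and wide are hypothetical names.

```r
# small stand-ins for the long table and the cluster table
taken   <- data.frame(ID    = c(1, 1, 2, 3),
                      Agent = c("METOPROLOL", "RAMIPRIL", "SIMVASTATIN", "ASS"))
cluster <- data.frame(Purpose = c("BETA", "STATIN"),
                      Agent   = c("METOPROLOL", "SIMVASTATIN"))

# left join: agents without a cluster get Purpose = NA
joined <- merge(taken, cluster, by = "Agent", all.x = TRUE)

# count per subject and purpose; table() ignores the NA purposes, and the
# explicit factor levels keep subjects that have no clustered agent at all
counts <- table(factor(joined$ID, levels = sort(unique(taken$ID))),
                joined$Purpose)

# one row per subject, one column per purpose
wide <- as.data.frame.matrix(counts)
wide
```

Because table() drops the NA purposes by default, there is no superfluous NA column to remove in this variant.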

Data

dat <- structure(list(Med1 = c("AMLODIPIN", "PLAVIX", "BISOPROLOL", 
"ASS", "ASS", "ASS"), Med2 = c("RAMIPRIL", "SIMVASTATIN", "AMLODIPIN", 
"ENALAPRIL", "ATORVASTATIN", "FRAGMIN"), Med3 = c("METOPROLOL", 
"MIRTAZAPIN", "ASS", "L-THYROXIN", "FOSAMAX", "TORASEMID"), Med4 = c("", 
"", "VALSARTAN", "LITALIR", "CALCIUM", "SPIRONOLACTON"), Med5 = c("", 
"", "CHLORALDURAT", "LITALIR", "PANTOZOL", "LORZAAR PROTECT"), 
    Med6 = c("", "", "Doxozosin", "AMLODIPIN", "NOVAMINSULFON", 
    "VESIKUR"), Med7 = c("", "", "TAMSULOSIN", "CETIRIZIN", "", 
    "ROCALTROL"), Med8 = c("", "", "CIPRAMIL", "HCT", "", "ATORVASTATIN"
    ), Med9 = c("", "", "", "NACL", "", "PREDNISOLON"), Med10 = c("", 
    "", "", "CARMEN", "", "LACTULOSE"), Med11 = c("", "", "", 
    "PROTEIN 88", "", "MIRTAZAPIN"), Med12 = c("", "", "", "NOVALGIN", 
    "", "LANTUS"), Med13 = c("", "", "", "", "", "ACTRAPID"), 
    Med14 = c("", "", "", "", "", "PANTOZOL"), Med15 = c("", 
    "", "", "", "", "SALBUTAMOL"), Med16 = c("", "", "", "", 
    "", "AMPHO MORONAL")), class = "data.frame", row.names = c(NA, 
-6L))

Counting number of instances of a condition per row R

You can use rowSums.

df$no_calls <- rowSums(df == "nc")
df
#  rsID sample1 sample2 sample3 sample1304 no_calls
#1 abcd      aa      bb      nc         nc        2
#2 efgh      nc      nc      nc         nc        4
#3 ijkl      aa      ab      aa         nc        1

Or, as pointed out by MrFlick, to exclude the first column from the row sums, you can slightly modify the approach to

df$no_calls <- rowSums(df[-1] == "nc")

Regarding the row names: They are not counted in rowSums and you can make a simple test to demonstrate it:

rownames(df)[1] <- "nc"  # name first row "nc"
rowSums(df == "nc")      # compute the row sums
#nc  2  3             
# 2  4  1        # still the same in first row
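For a reproducible run, the data frame can be rebuilt from the printed output above. The sketch below selects the sample columns by name instead of by position; the helper name sample_cols is an assumption made for illustration.

```r
# data reconstructed from the printed output above
df <- data.frame(rsID       = c("abcd", "efgh", "ijkl"),
                 sample1    = c("aa", "nc", "aa"),
                 sample2    = c("bb", "nc", "ab"),
                 sample3    = c("nc", "nc", "aa"),
                 sample1304 = c("nc", "nc", "nc"),
                 stringsAsFactors = FALSE)

# restrict the comparison to columns whose name starts with "sample"
sample_cols <- grepl("^sample", names(df))
df$no_calls <- rowSums(df[sample_cols] == "nc")
df
```

Selecting by name keeps the count correct even if further non-sample columns are added later.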

Count occurrences of a variable having two given values corresponding to one value of another variable

In terms of memory, the optimal solution would store one cell per pair, i.e. 700*699/2 cells. However, this problem is still relatively small, and the simplicity of manipulating a full 700*700 matrix is probably worth more than the 700*701/2 cells you would save, which works out to about 240 kB at one byte per cell. Memory use could be lower still if the matrix is sparse (i.e. most pairs of materials are never ordered together) and you use an appropriate data structure.

Here's what the code would look like:

First we want to create a dataframe with as many rows and columns as there are materials. Matrices are easier to create so we create one that we convert to a dataframe afterwards.

all_materials = levels(as.factor(X$Materials))
number_materials = length(all_materials)
Pairs <- as.data.frame(matrix(data = 0, nrow = number_materials, ncol = number_materials))

(Here, X is your dataset)

We then set the row names and column names so that rows and columns can be accessed directly by the identifiers of the materials, which are apparently not necessarily numbered from 1 to 700.

colnames(Pairs) <- all_materials
rownames(Pairs) <- all_materials

Then we iterate over the dataset

for(order in levels(as.factor(X$Order.number))){
  # getting the materials in each order; as.character() makes sure the
  # identifiers are used as row/column names (not as positions), and sort()
  # makes a given pair land in the same cell whatever its order in the data
  materials_for_order = sort(as.character(X[X$Order.number==order, "Materials"]))
  if (length(materials_for_order)>1) {
    # finding each possible pair from the materials list
    all_pairs_in_order = combn(x=materials_for_order, m=2)
    # incrementing the cell at the line and column corresponding to each pair
    for(i in seq_len(ncol(all_pairs_in_order))){
      Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] = Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] + 1
    }
  }
}

At the end of the loop, the Pairs table should contain everything you need.
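As a cross-check, or as an alternative on larger data, the same pair counts can be obtained without loops from an order-by-material incidence matrix. This is a sketch on made-up data, not the method described above; the names incidence and co are hypothetical, and it assumes a material should count at most once per order.

```r
# toy data, invented for illustration
X <- data.frame(Order.number = c(1, 1, 1, 2, 2, 3),
                Materials    = c("A", "B", "C", "A", "B", "C"))

# 1 if the material appears in the order, 0 otherwise
incidence <- (unclass(table(X$Order.number, X$Materials)) > 0) * 1

# t(incidence) %*% incidence: cell [a, b] is the number of orders that
# contain both material a and material b
co <- crossprod(incidence)

# the diagonal holds the number of orders per material, not a pair count
diag(co) <- 0
co
```

Unlike the Pairs table built by the loop, this matrix is symmetric, so each pair count appears in both [a, b] and [b, a].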