Advice on a loop function to subset data according to variables
Based on your description, I assume your data looks something like this:
country_year <- c("Australia_2013", "Australia_2014", "Bangladesh_2013")
health <- matrix(nrow = 3, ncol = 3, data = runif(9))
dataset <- data.frame(rbind(country_year, health), row.names = NULL, stringsAsFactors = FALSE)
dataset
# X1 X2 X3
#1 Australia_2013 Australia_2014 Bangladesh_2013
#2 0.665947273839265 0.677187719382346 0.716064820764586
#3 0.499680359382182 0.514755881391466 0.178317369660363
#4 0.730102791683748 0.666969108628109 0.0719663293566555
First, move your row 1 (e.g., Australia_2013, Australia_2014 etc.) to the column names, and then apply the loop to create country-based data frames.
library(dplyr)
# move header
dataset2 <- dataset %>%
  `colnames<-`(unlist(dataset[1, ])) %>% # uses row 1 as column names
  slice(-1) %>% # removes row 1 from data
  mutate_all(type.convert, as.is = TRUE) # converts data to appropriate type
# apply loop
for (country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
  assign(country, select(dataset2, starts_with(country))) # makes subsets
}
Regarding the loop,
gsub("_\\d+", "", colnames(dataset2))
extracts the country names by replacing "_[year]" with nothing (i.e., removing it), and the unique() function applied to the result keeps one copy of each country name.
assign(country, select(dataset2, starts_with(country)))
creates a variable named after the country; this variable contains only the columns from dataset2 that start with the country name.
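To see what the name-extraction step produces on its own, here is a minimal sketch. The vector nms is only an illustrative stand-in for colnames(dataset2), reusing the column names from the example above.

```r
# Illustrative stand-in for colnames(dataset2)
nms <- c("Australia_2013", "Australia_2014", "Bangladesh_2013")

# Strip the "_<year>" suffix, then keep one copy of each country name
countries <- unique(gsub("_\\d+", "", nms))
countries
# countries is c("Australia", "Bangladesh")
```

The loop then iterates over these two strings, so it creates one data frame per country.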
Edit: Responding to Comment
The question in the comment asked how to add row-wise summaries (e.g., rowSums(), rowMeans()) as new columns in the country-based data frames while still using this for loop.
Here is one solution that requires minimal changes:
for (country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
  assign(country,
    select(dataset2, starts_with(country)) %>% # makes subsets
      mutate( # creates new columns
        rowSums = rowSums(select(., starts_with(country))),
        rowMeans = rowMeans(select(., starts_with(country)))
      )
  )
}
mutate() adds new columns to a dataset.
select(., starts_with(country)) selects the columns that start with the country name from the current object (represented as . in the function).
How to write a loop in R to create multiple different subsets of data based on column names?
The base function combn is ideal for this. You can get all pairwise combinations of the remaining column names and call a function on each of those combinations.
First, some data.
set.seed(1234)
df1 <- matrix(rnorm(5*(4+5)), nrow = 5)
df1 <- as.data.frame(df1)
Now the code. Note that I will keep just the first 4 columns as the common ones, not 9. You should also change the default value of the DF argument of function fun from DF = df1 to DF = yourdata.
first_cols <- 1:4
fun <- function(nms, DF = df1, fc = first_cols) {
  cols <- c(names(DF)[fc], nms)
  outfile <- paste(nms, collapse = 'x')
  outfile <- paste(outfile, 'txt', sep = '.')
  write.table(DF[cols], outfile,
              row.names = FALSE, col.names = FALSE,
              quote = FALSE, sep = ' ')
  cols
}
combn(names(df1)[-first_cols], 2, fun)
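As a small sketch of how combn drives the function above, the following builds only the file stems and writes nothing to disk. The vector extra is a hypothetical set of non-common column names, not taken from the answer's data.

```r
# Hypothetical column names left over after the common columns are set aside
extra <- c("V5", "V6", "V7")

# combn() calls the function once per pair of names;
# here it only builds the "AxB" stem used for the output file name
stems <- combn(extra, 2, function(nms) paste(nms, collapse = "x"))
stems
# stems is c("V5xV6", "V5xV7", "V6xV7")
```

Three names give three pairs, so the full function would write three files.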
How to create a loop which creates multiple subset dataframes from a larger data frame?
Your code works fine. Just remove list so that you create a vector of color names and not a list. If you only want distinct values, use unique.
mydata <- data.frame(x = c(1, 2, 3), y = c('a', 'b', 'c'), z = c('red', 'red', 'yellow'))
colors <- unique(mydata$z)
# seq_along() is safe even if colors is empty, unlike 1:length(colors)
for (i in seq_along(colors)) {
  assign(paste0("mydata_", i), subset(mydata, z == colors[[i]]))
}
R: loop through data frame extracting subset of data depending on date
Is this what you want?
df_list <- split(data, as.factor(data$date))
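A minimal sketch of what split returns, assuming a data frame with a date column. The data frame and its values are made up for illustration.

```r
# Hypothetical data with a repeated date column
data <- data.frame(
  date  = c("2020-01-01", "2020-01-01", "2020-01-02"),
  value = c(10, 20, 30)
)

# One list element per distinct date, named after that date
df_list <- split(data, as.factor(data$date))
names(df_list)
nrow(df_list[["2020-01-01"]]) # 2 rows share this date
```

Each element is an ordinary data frame, so no loop is needed to build the subsets.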
R: Subset data using for-loop
No!
That's not the way it works in R. ;) You want to use vectorized code because it's much more concise and faster (in R). Here are two solutions:
df = subset(CBS, `Wijken en buurten` %in% c("Oud-Overdie", "Overdie-West", "Overdie-Oost", "Oosterhout", "De Hoef III en IV"))
df = CBS[CBS$`Wijken en buurten` %in% c("Oud-Overdie", "Overdie-West", "Overdie-Oost", "Oosterhout", "De Hoef III en IV"),]
Subsetting a data set inside for loop
It is generally not advisable to use assign in R. Yes, the function is available, but its use is not recommended. I believe the results you are looking for could be generated in a much simpler manner.
The lapply command performs the same function as the for loops above.
# out <- your data frame
# define a vector of string values
iter <- c("COD1", "COD2", "COD3")
# create a list of data frames, one per subset
ans <- lapply(iter, function(x) subset(out, TestId == x))
# name the list elements
names(ans) <- iter
# access each subset with any of the following:
ans[[1]]
ans["COD1"]
ans$COD1
ans[iter[1]]
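To try the lapply approach end to end, here is a sketch on a toy data frame. The contents of out (the TestId values and the score column) are assumptions made up for illustration.

```r
# Hypothetical stand-in for the real data
out <- data.frame(
  TestId = c("COD1", "COD2", "COD1", "COD3"),
  score  = c(1, 2, 3, 4)
)

iter <- c("COD1", "COD2", "COD3")
# One data frame per TestId, collected in a named list
ans <- lapply(iter, function(x) subset(out, TestId == x))
names(ans) <- iter

nrow(ans$COD1)     # 2 rows have TestId "COD1"
class(ans[[1]])    # a data frame
class(ans["COD1"]) # a list of length 1, because single brackets keep the list
```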
Subset data frame within a for loop
Don't use assign, use a list instead!
# for loop approach
results = list()
for (nm in names(data)[-1]) { # omit the first column
  results[[nm]] = data[data[[nm]] %in% "Y", "Column I want", drop = FALSE]
}
# lapply approach
results = lapply(data[-1], function(col) data[col %in% "Y", "Column I want", drop = FALSE])
The drop = FALSE argument makes sure you get 1-column data frames, not vectors, as the result.
As for the issue in your approach, names[i] is just a string, so you're testing whether, say, "var2" == "Y", which is false.
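The difference between comparing a column name and comparing a column's values can be sketched on a toy data frame. The column names id, var2 and var3 are hypothetical, with id standing in for "Column I want".

```r
# Hypothetical data: an id column plus two Y/N flag columns
data <- data.frame(
  id   = 1:3,
  var2 = c("Y", "N", "Y"),
  var3 = c("N", "N", "Y")
)

nm <- names(data)[2]
nm == "Y"           # compares the string "var2" itself: FALSE
data[[nm]] %in% "Y" # compares the column's values: TRUE FALSE TRUE

# One 1-column data frame per flag column, keeping only the "Y" rows
results <- lapply(data[-1], function(col) data[col %in% "Y", "id", drop = FALSE])
results$var2$id # 1 3
results$var3$id # 3
```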