Extract text after a symbol in R
x <- c('>>xyz>>hello>>mate 1', '>>xyz>>hello>>mate 2', '>>xyz>>mate 3', ' >>xyz>>mate 4' ,'>>xyz>>hello>>mate 5')
sub('.*>>', '', x)
#[1] "mate 1" "mate 2" "mate 3" "mate 4" "mate 5"
Get the characters after a certain pattern in R - regex
You may use
df <- data.frame(cat = c("c(\\\"BPT\\\", \"BP\")", "c(\"BP2\", \"BP\")", "c(\"BPT\", \"BP\")", "c(\"CN\", \"NC\")"))
df$cat <- as.character(df$cat)
unlist(lapply(gsub('\\', '', df$cat, fixed=TRUE), function(x) eval(parse(text=x))[[1]]))
## => [1] "BPT" "BP2" "BPT" "CN"
Notes
gsub('\\', '', df$cat, fixed=TRUE) removes all backslashes. You may use gsub('\\\"', '"', df$cat, fixed=TRUE) if you only plan to remove backslashes before ".
eval(parse(text=x))[[1]] parses the vector and returns the first item.
lapply helps traverse the whole data you have. See Using sapply and lapply.
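To make the notes concrete, here is a minimal sketch on a single string. The names raw, cleaned, quotes_only and first_item are illustrative and not part of the original answer; the input is the first value of df$cat above.

```r
# One value in the same shape as df$cat: c(\"BPT\", "BP") with literal backslashes
raw <- "c(\\\"BPT\\\", \"BP\")"

# fixed = TRUE treats '\\' as a literal backslash, not as a regex escape
cleaned <- gsub('\\', '', raw, fixed = TRUE)

# Narrower variant: only turn \" into "
quotes_only <- gsub('\\\"', '"', raw, fixed = TRUE)

# The cleaned text is valid R code that evaluates to a character vector
first_item <- eval(parse(text = cleaned))[[1]]
print(cleaned)
print(first_item)
```

Since eval(parse()) runs whatever code the string contains, this approach is only suitable for data you trust.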
How to extract everything after a specific string?
With str_extract. \\b is a zero-length token that matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string):
library(stringr)
str_extract(test, '\\b\\w+$')
# [1] "Pomme" "Poire" "Fraise"
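The test vector is not shown in the answer. The sketch below assumes a hypothetical test whose elements end in a hyphen followed by a word, and uses base R's regexpr and regmatches in place of str_extract so that it runs without stringr.

```r
# Hypothetical input; the original test vector is not shown
test <- c("Fruit-Pomme", "Fruit-Poire", "Fruit-Fraise")

# regexpr locates the first match of the pattern in each element;
# regmatches pulls the matched text out. perl = TRUE selects PCRE.
m <- regexpr('\\b\\w+$', test, perl = TRUE)
last_word <- regmatches(test, m)
print(last_word)
```

One difference to keep in mind: regmatches drops elements that do not match, whereas str_extract returns NA for them.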
We can also use a back reference with sub. \\1 refers to the string matched by the first capture group (.+), which is any character one or more times following a - at the end:
sub('.+-(.+)', '\\1', test)
# [1] "Pomme" "Poire" "Fraise"
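A short sketch of how this pattern behaves on edge cases may help. The input below is hypothetical (the original test vector is not shown) and adds an element with several hyphens and one with none.

```r
# Hypothetical inputs: one hyphen, several hyphens, and no hyphen at all
test <- c("Fruit-Pomme", "Fruit-Rouge-Fraise", "Banane")

# The leading .+ is greedy, so it runs up to the LAST hyphen;
# \\1 then holds whatever follows that hyphen.
# Elements without a match are returned unchanged by sub.
after_hyphen <- sub('.+-(.+)', '\\1', test)
print(after_hyphen)
```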
This also works with str_replace if stringr is already loaded:
library(stringr)
str_replace(test, '.+-(.+)', '\\1')
# [1] "Pomme" "Poire" "Fraise"
A third option would be using strsplit and extracting the second word from each element of the list (similar to word from @akrun's answer):
sapply(strsplit(test, '-'), `[`, 2)
# [1] "Pomme" "Poire" "Fraise"
stringr also has a str_split variant of this:
str_split(test, '-', simplify = TRUE)[,2]
# [1] "Pomme" "Poire" "Fraise"
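The splitting approach can be sketched with an explicit return type. The input is hypothetical, and vapply is swapped in for sapply here as a type-checked alternative, not because the original answer uses it.

```r
# Hypothetical inputs; the second element has no separator
test <- c("Fruit-Pomme", "Banane")

# strsplit returns a list with one character vector per input element
pieces <- strsplit(test, '-', fixed = TRUE)

# Indexing past the end of a vector yields NA rather than an error,
# so elements without a second piece come back as NA
second <- vapply(pieces, function(p) p[2], character(1))
print(second)
```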
R get rid of string before/after special characters (pipe and >) using regex
You can extract text between > and |. Special characters can be escaped with \\.
sub('>(.*)\\|.*', '\\1', test)
#[1] "P01923" "P19405orf"
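The input for this answer is not shown, so the sketch below assumes hypothetical strings that start with > and contain one or more pipes. It also shows a variant with a negated character class, which is an addition here and not part of the original answer, for the case where only the text up to the first pipe is wanted.

```r
# Hypothetical inputs; the second one contains two pipes
test <- c(">P01923|description", ">P19405orf|a|b")

# Greedy (.*) runs up to the LAST pipe
greedy <- sub('>(.*)\\|.*', '\\1', test)

# [^|]* cannot cross a pipe, so the capture stops at the FIRST one
first_field <- sub('>([^|]*)\\|.*', '\\1', test)
print(greedy)
print(first_field)
```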
R Returning all characters after the first underscore
In the pattern, we can change the zero or more of any character (.*, where . is a metacharacter that matches any character) to zero or more characters that are not a _ ([^_]*), anchored at the start (^) of the string.
sub("^[^_]*_", "", x)
#[1] "binloop_v6" "binloopv2"
If we don't specify it as such, the greedy .* will match up to the last _ in the string, and everything up to that point will be lost, returning 'v6' and 'binloopv2'.
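The difference between the two patterns can be seen side by side. The vector x below is hypothetical, chosen only to be consistent with the outputs quoted in the answer.

```r
# Hypothetical input, consistent with the outputs shown in the answer
x <- c("foo_binloop_v6", "bar_binloopv2")

# Anchored negated class: removes only up to the FIRST underscore
after_first <- sub("^[^_]*_", "", x)

# Greedy .* : removes everything up to the LAST underscore
after_last <- sub(".*_", "", x)
print(after_first)
print(after_last)
```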
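An easier option would be word from stringr. Note that it returns only the second _-separated piece, so the first element gives "binloop" rather than "binloop_v6":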
library(stringr)
word(x, 2, sep = "_")
#[1] "binloop" "binloopv2"
Characters before/after a symbol
It could be that this suffices:
unlist(strsplit("xxx, yyy. zzz","[,.]"))[2] # get yyy with space, or:
gsub(" ","",unlist(strsplit("xxx, yyy. zzz","[,.]")))[2] # remove space
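Removing every space with gsub would also delete spaces inside a piece. As a possible alternative, this sketch uses base R's trimws, which strips only leading and trailing whitespace; the variable names are illustrative.

```r
s <- "xxx, yyy. zzz"

# Split on a comma or a period; the pieces keep their leading spaces
parts <- unlist(strsplit(s, "[,.]"))

# trimws strips leading and trailing whitespace only
trimmed <- trimws(parts)
print(trimmed[2])
```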
Extract characters after the last appearance of a certain symbol in a vector
A possible solution, using stringr::str_extract:
$ means the end of the string.
\\d+ means one or more numeric digits.
(?<=\\.) looks behind, to check whether there is a dot immediately before the digits.
You can learn more at: Lookahead and lookbehind regex tutorial
library(stringr)
x <- c("1.22.33.444","11.22.333.4","1e.3e.3444.45", "g.78.in.89")
str_extract(x, "(?<=\\.)\\d+$")
#> [1] "444" "4" "45" "89"
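For readers without stringr, a base R sketch of the same lookbehind is shown below, together with a sub-based variant that avoids lookaround entirely. Both are offered as assumed equivalents for this input, not as the answer's own method.

```r
x <- c("1.22.33.444", "11.22.333.4", "1e.3e.3444.45", "g.78.in.89")

# Lookbehind needs PCRE in base R, hence perl = TRUE
m <- regexpr("(?<=\\.)\\d+$", x, perl = TRUE)
digits_after_last_dot <- regmatches(x, m)

# Without lookaround: delete everything up to and including the last dot
after_last_dot <- sub(".*\\.", "", x)
print(digits_after_last_dot)
print(after_last_dot)
```

The sub variant keeps whatever follows the last dot, digits or not, so the two only agree when the tail is numeric, as it is here.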
Extract digits and next string after from a character vector in R
Use the pattern to match one or more digits (\\d+) followed by one or more spaces (\\s+) and a word (\\w+):
library(stringr)
str_extract_all(my_text, "\\d+\\s+\\w+")[[1]]
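Since my_text is not defined in the answer, here is a minimal sketch with a hypothetical sentence, using base R's gregexpr and regmatches as a stand-in for str_extract_all.

```r
# Hypothetical input; my_text is not defined in the answer
my_text <- "I bought 3 apples and 12 oranges"

# gregexpr finds every match; [[1]] selects the matches for the
# first (and only) string in my_text
hits <- regmatches(my_text, gregexpr("\\d+\\s+\\w+", my_text, perl = TRUE))[[1]]
print(hits)
```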