Extract English words from a text in R
You could use qdapDictionaries, a package I maintain (the parent package qdap does not need to be installed). If your data is more complex, you may need tools like tolower etc. to make it work. The idea is to see where a known word list (see ?GradyAugmented) intersects with your words. Here are two very similar approaches; the first is likely slightly faster, depending on the data:
vector <- c("picture", "carpet", "lamp", "notaword", "anothernotaword")
library(qdapDictionaries)
vector[vector %in% GradyAugmented]
## [1] "picture" "carpet" "lamp"
intersect(vector, GradyAugmented)
## [1] "picture" "carpet" "lamp"
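If the input has mixed case or stray punctuation, a light normalization step before the lookup helps. The following is a minimal base R sketch; `word_list` is a hypothetical stand-in for GradyAugmented so that it runs without any packages, and the input vector is made up for illustration:

```r
# Hypothetical stand-in for GradyAugmented (assumption, not the real dictionary)
word_list <- c("picture", "carpet", "lamp")

# Made-up input with mixed case and punctuation
raw <- c("Picture", "CARPET,", "lamp", "Notaword")

# Lowercase and strip punctuation before matching
cleaned <- tolower(gsub("[[:punct:]]", "", raw))

# Keep only the cleaned entries that appear in the word list
known <- cleaned[cleaned %in% word_list]
known
```

With the real dictionary you would replace `word_list` with GradyAugmented after loading qdapDictionaries.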
The error you are receiving when installing qdap suggests that @Ben Bolker is correct: you need a newer version of data.table (I'd suggest the latest). Use packageVersion("data.table") to check which version you have. That is an oversight on my part in not requiring a minimum version of data.table. I thought setDT (a function in the data.table package) had always been around, but it appears not to be in your version. To solve this particular problem, though, you don't need to install the parent qdap package, just qdapDictionaries.
How to split English letters, numbers and Chinese characters in R?
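To extract only the Chinese words, we could use str_extract with the pattern "[:alpha:]+". Note that this pattern matches the first run of letters of any script; it returns the Chinese text here only because no Latin letters come before it in these strings: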
library(stringr)
string <- c("123-321-中文.jpg", "001-123你好.png")
str_extract(string, "[:alpha:]+")
output:
[1] "中文" "你好"
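When Latin letters may appear before the Chinese text, a script-specific class is safer. A small sketch, assuming stringr's default ICU regex engine (which supports Unicode script properties such as `\p{Han}`) and a made-up input vector:

```r
library(stringr)

# Made-up input: the first file name starts with Latin letters
mixed <- c("abc-中文.jpg", "001-123你好.png")

# [:alpha:] matches any letter, so the first run in "abc-中文.jpg" is Latin
first_letters <- str_extract(mixed, "[:alpha:]+")

# \p{Han} restricts the match to Han-script characters
han_only <- str_extract(mixed, "\\p{Han}+")

first_letters
han_only
```

Here `first_letters` picks up "abc" for the first string, while `han_only` returns the Chinese characters for both.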
Extract only words containing ASCII characters from vector of strings
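Use str_extract_all to pull out the runs of ASCII letters, then sapply with paste to join them back together, as in:
library(stringr)
b <- str_extract_all(c('hello ringпрг','trust'), regex("[a-z]+", ignore_case = TRUE))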
sapply(b, paste, collapse = " ")
## [1] "hello ring" "trust"
stringr: extract words containing a specific word
You seem to want to remove all words that do not contain WIFF, along with the trailing ; if there is one. Use
> library(stringr)
> dataframe <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> dataframe$text <- str_replace_all(dataframe$text, "(?i)\\b(?!\\w*WIFF)\\w+;?", "")
> dataframe
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
The pattern (?i)\\b(?!\\w*WIFF)\\w+;? matches:
(?i) - a case insensitive inline modifier
\\b - a word boundary
(?!\\w*WIFF) - the negative lookahead fails any match where a word contains WIFF anywhere inside it
\\w+ - 1 or more word chars
;? - an optional ; (? matches 1 or 0 occurrences of the pattern it modifies)
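The same pattern can be tried in base R with gsub and perl = TRUE, which also supports lookaheads. This is only a sketch on made-up input; it also shows one edge case: when the removed word is the last one, the ; before it stays behind and needs a separate trim.

```r
# Made-up input; the second string ends with a word that lacks WIFF
x <- c("WAFF;WIFF200;WIFF12", "WIFF1;WAFF")

# Remove every word that does not contain WIFF, plus its trailing ";"
removed <- gsub("(?i)\\b(?!\\w*WIFF)\\w+;?", "", x, perl = TRUE)

# A removed final word leaves the preceding ";" behind, so trim it
cleaned <- sub(";$", "", removed)

removed
cleaned
```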
If for some reason you want to use str_extract, note that your regex could not work because \bWIFF\b matches a whole word WIFF and nothing else, and you do not have such words in your DF. You may use "(?i)\\b\\w*WIFF\\w*\\b" to match any words with WIFF inside (case insensitively), use str_extract_all to get multiple occurrences, and do not forget to join the matches into a single "string":
> df <- data.frame(text = c('WAFF;WOFF;WIFF200;WIFF12', 'WUFF;WEFF;WIFF2;BIGWIFF'))
> res <- str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b")
> res
[[1]]
[1] "WIFF200" "WIFF12"
[[2]]
[1] "WIFF2" "BIGWIFF"
> df$text <- sapply(res, function(s) paste(s, collapse=';'))
> df
text
1 WIFF200;WIFF12
2 WIFF2;BIGWIFF
You may "shrink" the code by placing str_extract_all inside the sapply call; I separated them for better visibility.
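The shortened form could look like the sketch below, which reuses the same sample data and pattern; stringsAsFactors = FALSE is added only as a precaution for older R versions.

```r
library(stringr)

df <- data.frame(
  text = c("WAFF;WOFF;WIFF200;WIFF12", "WUFF;WEFF;WIFF2;BIGWIFF"),
  stringsAsFactors = FALSE
)

# Extract all words containing WIFF and join them per row in one step
df$text <- sapply(
  str_extract_all(df$text, "(?i)\\b\\w*WIFF\\w*\\b"),
  paste, collapse = ";"
)

df
```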
regex: extract segments of a string containing a word, between symbols
With stringr
...
library(stringr)
library(dplyr)
dataframe %>%
mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Created on 2022-02-01 by the reprex package (v2.0.1)
Extract letters from a string in R
You can try sub with a capture group that keeps only the leading letters:
sub("^([[:alpha:]]*).*", "\\1", x)
[1] "AB" "GF" "ABC"
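To make the behaviour of the pattern concrete, here is a sketch with a hypothetical input vector (the original x is not shown, so these values are an assumption chosen to give the same result). It also contrasts the leading-letters pattern with removing every non-letter.

```r
# Hypothetical input; the original x is not shown in the answer
x <- c("AB12", "GF3X", "ABC")

# Keep only the run of letters at the start of each string
leading <- sub("^([[:alpha:]]*).*", "\\1", x)

# Alternative: drop every non-letter, keeping letters from anywhere
all_letters <- gsub("[^[:alpha:]]", "", x)

leading
all_letters
```

The two differ for "GF3X": the first gives "GF", the second "GFX".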
Extract string up to a different word in each row - R
Loop over the 'words' column, get the first matching 'stringlist' value with grep, then use sub to capture the characters up to and including the word and replace the whole string with the backreference (\\1) of the captured group:
df$new_words <- sapply(df$words, function(x)
sub(sprintf("(.*%s).*", x), "\\1", grep(x, stringlist,
value = TRUE)[1]))
-output
> df
words new_words
1 apple eukaryote;plant;apple
2 plant eukaryote;plant
3 banana eukaryote;plant;banana
4 animal eukaryote;animal
5 fly eukaryote;insect;fly
6 ecoli prokaryote;bacterium;ecoli
data
df <- structure(list(words = c("apple", "plant", "banana", "animal",
"fly", "ecoli")), class = "data.frame", row.names = c(NA, -6L
))
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
"eukaryote;animal;dog",
"eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly",
"prokaryote;bacterium;ecoli")