Regex to match only commas not in parentheses?
Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):
Pattern regex = Pattern.compile(
    ", # Match a comma\n" +
    "(?! # only if it's not followed by...\n" +
    " [^(]* # any number of characters except opening parens\n" +
    " \\) # followed by a closing parens\n" +
    ") # End of lookahead",
    Pattern.COMMENTS);
This regex uses a negative lookahead assertion to ensure that the next parenthesis after the comma (if any) is not a closing parenthesis. Only then is the comma allowed to match.
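For comparison, the same lookahead can be sketched in Python, whose re module accepts the pattern unchanged. The function name and the sample input are illustrative assumptions, and the same no-nesting restriction applies.

```python
import re

# Same pattern as the Java version, written with re.VERBOSE for the comments.
TOP_LEVEL_COMMA = re.compile(r"""
    ,          # match a comma
    (?!        # only if it is not followed by...
      [^(]*    #   any number of characters except an opening paren
      \)       #   followed by a closing paren
    )          # end of lookahead
""", re.VERBOSE)

def split_outside_parens(text):
    """Split on commas that are not inside a flat (non-nested) (...) group."""
    return [part.strip() for part in TOP_LEVEL_COMMA.split(text)]

print(split_outside_parens("a, (b, c), d"))  # ['a', '(b, c)', 'd']
```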
perl regex to get comma not in parenthesis or nested parenthesis
A single regex for this is massively overcomplicated and difficult to maintain or extend. Here is an iterative parser approach:
use strict;
use warnings;
my $str = 'a , (b) , (d$_,c) , ((,),d,(,))';
my $nesting = 0;
my $buffer = '';
my @vals;
while ($str =~ m/\G([,()]|[^,()]+)/g) {
    my $token = $1;
    if ($token eq ',' and !$nesting) {
        push @vals, $buffer;
        $buffer = '';
    } else {
        $buffer .= $token;
        if ($token eq '(') {
            $nesting++;
        } elsif ($token eq ')') {
            $nesting--;
        }
    }
}
push @vals, $buffer if length $buffer;
print "$_\n" for @vals;
You can use Parser::MGC to construct this sort of parser more abstractly.
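The same token loop can be sketched in Python as a rough port of the Perl above. The function name is an assumption, and like the Perl version it does no error handling for unbalanced parentheses.

```python
import re

def split_top_level(text):
    """Split on commas at nesting depth zero; nested parens are allowed."""
    depth = 0
    buffer = []
    parts = []
    # Each token is a single comma or paren, or a run of other characters.
    for token in re.findall(r"[,()]|[^,()]+", text):
        if token == "," and depth == 0:
            parts.append("".join(buffer))
            buffer = []
            continue
        buffer.append(token)
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
    if buffer:
        parts.append("".join(buffer))
    return parts

print(split_top_level("a , (b) , (d$_,c) , ((,),d,(,))"))
# ['a ', ' (b) ', ' (d$_,c) ', ' ((,),d,(,))']
```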
RegEx for matching all commas unless they are enclosed between parentheses or brackets
This looks more like a job for a custom parser than a single regex. I would love to be proved wrong, but while we're waiting, here's a very pedestrian parsing function that gets the job done.
parse_nested <- function(string) {
  chars <- strsplit(string, "")[[1]]
  parentheses <- numeric(length(chars))
  parentheses[chars == "("] <- 1
  parentheses[chars == ")"] <- -1
  parentheses <- cumsum(parentheses)
  brackets <- numeric(length(chars))
  brackets[chars == "["] <- 1
  brackets[chars == "]"] <- -1
  brackets <- cumsum(brackets)
  split_on <- which(brackets == 0 & parentheses == 0 & chars == ",")
  split_on <- c(0, split_on, length(chars) + 1)
  result <- character()
  for(i in seq_along(head(split_on, -1))) {
    x <- paste0(chars[(split_on[i] + 1):(split_on[i + 1] - 1)], collapse = "")
    result <- c(result, x)
  }
  trimws(result)
}
Which produces:
parse_nested(x)
#> [1] "A" "B (C, D, E)" "F"
#> [4] "G [H, I, J]" "K (L (M, N), O)" "P (Q (R, S (T, U)))"
Regex to match only commas not in parentheses or square brackets
Maybe you want something like this:
(?!<(?:\(|\[)[^)\]]+),(?![^(\[]+(?:\)|\]))
When fed to Java with the input (note additional ] and ( inserted at random positions to make it well-formed):
Potatoes, Vegetable Oil (Sunflower, Corn, And/or Canola Oil), Honey BBQ Seasoning [Sugar, Salt, Dextrose, Torula Yeast], Onion Powder, Spices, Maltodextrin Fructose, Yeast Extract, Molasses, Natural Flavor [Including Milk], Corn Starch, Honey, Gum Arabic, Paprika Extracts, Caramel Color (Garlic Powder, Citric Acid, And Sunflower Oil).
it produces the output:
Potatoes
Vegetable Oil (Sunflower, Corn, And/or Canola Oil)
Honey BBQ Seasoning [Sugar, Salt, Dextrose, Torula Yeast]
Onion Powder
Spices
Maltodextrin Fructose
Yeast Extract
Molasses
Natural Flavor [Including Milk]
Corn Starch
Honey
Gum Arabic
Paprika Extracts
Caramel Color (Garlic Powder, Citric Acid, And Sunflower Oil).
which is exactly the desired split at top-level commas.
However, note that this regex is really inefficient. Counting parentheses with regex-lookarounds is not a very good idea. It seems as if it could be solved with a simple scan-left followed by simple split.
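A minimal sketch of that scan-then-split idea, written in Python under the assumption that the input is well-formed. The function name is hypothetical, and for brevity the scan treats ( and [ as interchangeable openers rather than checking that each closer matches its opener.

```python
from itertools import accumulate

def split_top_level_commas(text):
    """Split at commas that sit outside all parentheses and brackets."""
    # Scan-left step: running nesting depth after each character.
    deltas = [1 if c in "([" else -1 if c in ")]" else 0 for c in text]
    depths = list(accumulate(deltas))
    # A comma is a cut position only while the depth is zero.
    cuts = [i for i, c in enumerate(text) if c == "," and depths[i] == 0]
    bounds = [-1] + cuts + [len(text)]
    return [text[a + 1:b].strip() for a, b in zip(bounds, bounds[1:])]

sample = "Potatoes, Oil (Sunflower, Corn), Seasoning [Sugar, Salt], Spices"
print(split_top_level_commas(sample))
# ['Potatoes', 'Oil (Sunflower, Corn)', 'Seasoning [Sugar, Salt]', 'Spices']
```

This makes one linear pass over the input, whereas the lookahead regex may rescan the rest of the string for every comma it tries.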
Regex to match only comma's but not inside multiple parentheses
Here is the regex which works perfectly for your input.
,(?![^()]*(?:\([^()]*\))?\))
Explanation:
,              a literal ','
(?!            negative lookahead, to check that what follows is not:
  [^()]*       any character except '(' or ')' (0 or more times)
  (?:          group, but do not capture (optional):
    \(         '('
    [^()]*     any character except '(' or ')' (0 or more times)
    \)         ')'
  )?           end of grouping; the ? after it makes the whole non-capturing group optional
  \)           ')'
)              end of lookahead
Limitations:
This regex works based on the assumption that parentheses will not be nested at a depth greater than 2, i.e. paren within paren. It could also fail if unbalanced, escaped, or quoted parentheses occur in the input, because it relies on the assumption that each closing paren corresponds to an opening paren and vice versa.
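The limits are easy to probe with a short Python sketch, since Python's re module accepts the pattern as written. The sample strings below are made up for illustration. The last one suggests the pattern can also mis-split at depth 2 when the inner group is not the final item before the closing paren.

```python
import re

DEPTH2 = re.compile(r",(?![^()]*(?:\([^()]*\))?\))")

def split_depth2(text):
    """Split with the depth-2 lookahead regex and trim each piece."""
    return [part.strip() for part in DEPTH2.split(text)]

# Depth 2 with the inner group last: splits only at top-level commas.
print(split_depth2("a, b(c, d(e, f)), g"))  # ['a', 'b(c, d(e, f))', 'g']

# Depth 3: the comma after "b" is wrongly treated as top-level.
print(split_depth2("a(b, c(d(e)))"))        # ['a(b', 'c(d(e)))']

# Depth 2, but more items follow the inner group: also mis-split.
print(split_depth2("f(b, g(c), e), h"))     # ['f(b', 'g(c), e)', 'h']
```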
Regex to match commas that are not in an array (enclosed in square brackets)
You could use the following regex to match commas not in arrays:
(,)(?![^[]*\])
(Explanation on regex101.)
This says to match any comma which, if it is followed by a close bracket, has an opening bracket before that close bracket.
Example in JS:
outPut = outPut.replace(/(,)(?![^[]*\])/g, '\n');
gives:
"{glossary:{title:example glossary
GlossDiv:{title:S
GlossList:{GlossEntry:{ID:SGML
SortAs:SGML
GlossTerm:Standard Generalized Markup Language
Acronym:SGML
Abbrev:ISO 8879:1986
GlossDef:{para:A meta-markup language
used to create markup languages such as DocBook.
GlossSeeAlso:[GML,XML]}
GlossSee:markup}}}}}"
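A small Python sketch of the same replacement, on a shortened, made-up input. It drops the capture group, which the replacement does not need. The second call shows that the lookahead only handles flat arrays: a comma directly before a nested opening bracket is replaced as well.

```python
import re

# Comma that is not followed by "non-[ characters then ]".
COMMA_OUTSIDE_ARRAY = re.compile(r",(?![^\[]*\])")

def break_on_commas(text):
    """Replace commas outside flat [...] arrays with newlines."""
    return COMMA_OUTSIDE_ARRAY.sub("\n", text)

print(break_on_commas("title:S,GlossSeeAlso:[GML,XML],GlossSee:markup"))
# title:S
# GlossSeeAlso:[GML,XML]
# GlossSee:markup

# Nested arrays are not handled: the comma after "a" is replaced too.
print(repr(break_on_commas("[a,[b,c]]")))  # '[a\n[b,c]]'
```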
match all commas that are outside parentheses and square brackets in perl regex
The problem here is identifying "balanced" pairs, of parentheses/brackets in this case. This is a well-recognized problem, for which there are libraries. They can find the top-level matching pairs, (...)/[...], with all that's inside, and everything else outside the parens; we can then process the "else."
One way, using Regexp::Common:
use warnings;
use strict;
use feature 'say';
use Regexp::Common;
my $str = shift // q{A, t(a,b(c,))u B, C, p(d,)q D,};
my @all_parts = split /$RE{balanced}{-parens=>'()[]'}/, $str;
my @no_paren_parts = grep { not /\(.*\) | \[.*\]/x } @all_parts;
say for @no_paren_parts;
This uses split's property to return the list with separators included when the regex in the separator pattern captures.† The library regex captures so we get it all back -- the parts obtained by splitting the string by what regex matched but also the parts matched by the regex. The separators contain the paired delimiters while other terms cannot, by construction, so I filter them out by that.‡ Prints
A, t
u B, C, p
q D,
The paren/bracket terms are gone, but how the string is split is otherwise a bit arbitrary.
The above is somewhat "generic," using the library merely to extract the balanced pairs ()/[], along with all other parts of the string. Or, we can remove those patterns from the string
$str =~ s/$RE{balanced}{-parens=>'()[]'}//g;
which leaves
A, tu B, C, pq D,
Now one can simply split by commas
my @terms = split /\s*,\s*/, $str;
say for @terms;
which prints
A
tu B
C
pq D
This is the desired result in this case, as clarified in comments.
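Python's built-in re module has no recursive patterns and no Regexp::Common equivalent, so the following sketch swaps in a different technique to get the same result: it deletes innermost groups repeatedly until none remain, then splits on commas. The function names are assumptions, and unbalanced input is simply left in place.

```python
import re

# An innermost (...) or [...] group: no further parens or brackets inside.
INNERMOST = re.compile(r"\([^()\[\]]*\)|\[[^()\[\]]*\]")

def strip_balanced(text):
    """Delete innermost groups repeatedly until no group is left."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text

def top_level_terms(text):
    """Remove all balanced groups, then split what is left on commas."""
    stripped = strip_balanced(text)
    # Unlike Perl's split, re.split keeps trailing empty fields; drop them.
    return [term for term in re.split(r"\s*,\s*", stripped) if term]

print(top_level_terms("A, t(a,b(c,))u B, C, p(d,)q D,"))
# ['A', 'tu B', 'C', 'pq D']
```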
Another notable library, in many ways more fundamental, is the core Text::Balanced. See Shawn's answer here, and this post, this one, and this one for examples.
† An example. With
my $str = q(it, is; surely);
my @terms = split /[,;]/, $str;
one gets 'it', ' is', ' surely' in the array @terms, while with
my @terms = split /([,;])/, $str;
we get in @terms all of: 'it', ',', ' is', ';', ' surely'
‡ Also by construction, the list contains what the regex matched at odd indices. So for all other parts we can fetch the elements at even indices
my @other_than_matched_parts = @all_parts[ grep { not $_ & 1 } 0..$#all_parts ];
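Python's re.split follows the same rule for capturing groups, so the footnote example can be reproduced as a quick sketch. The variable names are illustrative.

```python
import re

text = "it, is; surely"

# Without a capturing group the separators are dropped.
plain = re.split(r"[,;]", text)
print(plain)           # ['it', ' is', ' surely']

# With a capturing group the separators are returned as well.
parts = re.split(r"([,;])", text)
print(parts)           # ['it', ',', ' is', ';', ' surely']

# Separators land at odd indices, everything else at even indices.
separators = parts[1::2]
others = parts[0::2]
print(separators)      # [',', ';']
print(others)          # ['it', ' is', ' surely']
```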
Replace a comma that is not in parentheses using regex
Use a negative lookahead to achieve this:
,(?![^()]*\))
Explanation:
, # Match a literal ','
(?! # Start of negative lookahead
[^()]* # Match any character except '(' & ')', zero or more times
\) # Followed by a literal ')'
) # End of lookahead
Regex split by comma not inside parenthesis (.NET)
This PCRE regex - (\((?:[^()]++|(?1))*\))(*SKIP)(*F)|, - uses recursion, which .NET does not support, but there is a way to do the same thing using the balancing construct. Of the PCRE verbs (*SKIP) and (*FAIL), only (*FAIL) can be written as (?!) (it causes an unconditional fail at the place where it stands); .NET does not support skipping a match at a specific position and resuming the search from that failed position.
I suggest replacing all commas that are not inside nested parentheses with some temporary value, and then splitting the string with that value:
var s = Regex.Replace(text, @"\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)|(,)", m =>
m.Groups[1].Success ? "___temp___" : m.Value);
var results = s.Split("___temp___");
Details
\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\) - a pattern that matches nested parentheses:
  \( - a ( char
  (?>[^()]+|(?<o>)\(|(?<-o>)\))* - 0 or more occurrences of
    [^()]+| - 1+ chars other than ( and ), or
    (?<o>)\(| - a ( and a value is pushed on to the Group "o" stack, or
    (?<-o>)\) - a ) and a value is popped from the Group "o" stack
  (?(o)(?!)) - a conditional construct that fails the match if Group "o" stack is not empty
  \) - a ) char
| - or
(,) - Group 1: a comma
Only the comma captured in Group 1 is replaced with a temp substring since the m.Groups[1].Success check is performed in the match evaluator part.
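The match-evaluator idea (match a whole parenthesized group or a comma, and replace only when the comma group took part) can be sketched in Python with a callback passed to re.sub. Python's re has no balancing groups, so this sketch assumes flat, non-nested parentheses. The placeholder character and the function name are hypothetical choices.

```python
import re

PLACEHOLDER = "\x00"  # hypothetical marker, assumed absent from the input

# Either a flat (...) group, which is kept as-is, or a comma in group 1.
GROUP_OR_COMMA = re.compile(r"\([^()]*\)|(,)")

def split_outside_flat_parens(text):
    """Mark commas outside flat (...) groups, then split on the marker."""
    marked = GROUP_OR_COMMA.sub(
        # Replace only when group 1 (the comma) participated in the match.
        lambda m: PLACEHOLDER if m.group(1) else m.group(0),
        text,
    )
    return [part.strip() for part in marked.split(PLACEHOLDER)]

print(split_outside_flat_parens("a, f(b, c), d"))  # ['a', 'f(b, c)', 'd']
```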