Python: How to match nested parentheses with regex?
The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn't look for additional matches of the regular expression on parts of that string. That's why you only get one answer.
The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solution. If you really just want to capture the pieces within parentheses, just loop over the characters, incrementing a counter when you see ( and decrementing it when you see ).
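The counting loop described above could look roughly like this. It is only a sketch: the function name top_level_groups is made up for illustration, and ignoring a stray ) is an assumption about how unbalanced input should be handled.

```python
def top_level_groups(text):
    """Return the contents of each outermost (...) group in text."""
    groups = []
    depth = 0      # how many parentheses are currently open
    start = None   # index just after the outermost "("
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")":
            if depth == 0:
                continue  # stray ")": ignored here (an assumption)
            depth -= 1
            if depth == 0:
                groups.append(text[start:i])
    return groups


print(top_level_groups("f(a, (b + c)) * (d)"))
# ['a, (b + c)', 'd']
```

Because the counter only reaches zero when the outermost parenthesis closes, inner pairs stay inside the captured piece.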
Regex nested parenthesis in python
Regex
(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}
Text used for test
Name1 Name2 Name3 (2000) {Education (#3.2)}
Name1 Name2 Name3 (2000) (ok) {edu (#1.1)}
Name1 Name2 (2002) {edu (#1.1)}
Name1 Name2 Name3 (2000) (V) {variation (#4.12)}
Othername California (2000) (T) (S) (ok) {state (#2.1)}
Test
>>> import re
>>> regex = re.compile(r"(.+)\s+\(\d+\).+?(?:\(([^)]{2,})\)\s+(?={))?\{.+\(#(\d+\.\d+)\)\}")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0x54e2105f36c16a48>
>>> regex.match(string)
<_sre.SRE_Match object at 0x54e2105f36c169e8>
# Run findall
>>> regex.findall(string)
[
(u'Name1 Name2 Name3' , u'' , u'3.2'),
(u'Name1 Name2 Name3' , u'ok', u'1.1'),
(u'Name1 Name2' , u'' , u'1.1'),
(u'Name1 Name2 Name3' , u'' , u'4.12'),
(u'Othername California', u'ok', u'2.1')
]
How to handle nested parentheses with regex?
Standard[1] regular expressions are not sophisticated enough to match nested structures like that. The best way to approach this is probably to traverse the string and keep track of opening / closing bracket pairs.
[1] I said standard, but not all regular expression engines are indeed standard. You might be able to do this with Perl, for instance, by using recursive regular expressions. For example:
$str = "[hello [world]] abc [123] [xyz jkl]";
my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;
foreach (@matches) {
print "$_\n";
}
[hello [world]]
abc
[123]
[xyz jkl]
EDIT: I see you're using Python; check out pyparsing.
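For comparison, the traversal approach mentioned at the start of this answer can be sketched in plain Python. The function name split_outside_brackets is hypothetical, and the sketch is not equivalent to the Perl pattern in every case (for example, it keeps a[b] as a single token); it only reproduces the output for the sample string above.

```python
def split_outside_brackets(text):
    """Split text on whitespace, but only where no [ ... ] is open."""
    tokens, current, depth = [], [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        if ch.isspace() and depth == 0:
            # Whitespace outside all brackets ends the current token.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


for token in split_outside_brackets("[hello [world]] abc [123] [xyz jkl]"):
    print(token)
# [hello [world]]
# abc
# [123]
# [xyz jkl]
```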
Regex to find texts between nested parenthesis
The workaround pattern can be one that matches a line starting with {{info and then matches any 0+ chars, as few as possible, up to the line with just }} on it:
re.findall(r'(?sm)^{{[^\S\r\n]*info\s*(.*?)^}}$', s)
See the regex demo.
Details
- (?sm) - re.DOTALL (now, . matches a newline) and re.MULTILINE (^ now matches line start and $ matches line end positions) flags
- ^ - start of a line
- {{ - a {{ substring
- [^\S\r\n]* - 0+ horizontal whitespaces
- info - a substring
- \s* - 0+ whitespaces
- (.*?) - Group 1: any 0+ chars, as few as possible
- ^}}$ - start of a line, }} and end of the line.
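A small check of the pattern, using a made-up sample string (the field names in it are purely illustrative). The inner {{foo}} does not end the match because its }} is not alone on a line:

```python
import re

# Hypothetical sample input; only the {{info ... }} layout matters.
s = "{{info\n| name = X\n| nested = {{foo}}\n}}\nother text\n"

pattern = re.compile(r'(?sm)^{{[^\S\r\n]*info\s*(.*?)^}}$')
matches = pattern.findall(s)
print(matches)
# ['| name = X\n| nested = {{foo}}\n']
```

This relies on the closing }} sitting on its own line, so it is a workaround for that layout and not a general solution for nested braces.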
Regular expression to return string split up respecting nested parentheses
Using regex only for the task might work but it wouldn't be straightforward.
Another possibility is writing a simple algorithm to track the parentheses in the string:
- Split the string at all parentheses, while returning the delimiter (e.g. using re.split)
- Keep counters tracking the parentheses: start_parens_count for ( and end_parens_count for )
- Using the counters, proceed by either splitting at white spaces or adding the current data into a temp var (term)
- When the leftmost parenthesis has been closed, append term to the list of values & reset the counters/temp vars.
Here's an example:
import re
string = "1 2 3 (test 0, test 0) (test (0 test) 0)"
result, start_parens_count, end_parens_count, term = [], 0, 0, ""
for x in re.split(r"([()])", string):
    if not x.strip():
        continue
    elif x == "(":
        if start_parens_count > 0:
            term += "("
        start_parens_count += 1
    elif x == ")":
        end_parens_count += 1
        if end_parens_count == start_parens_count:
            result.append(term)
            end_parens_count, start_parens_count, term = 0, 0, ""
        else:
            term += ")"
    elif start_parens_count > end_parens_count:
        term += x
    else:
        result.extend(x.strip(" ").split(" "))
print(result)
# ['1', '2', '3', 'test 0, test 0', 'test (0 test) 0']
Not very elegant, but works.
Extract string between two brackets, including nested brackets in python
>>> import re
>>> s = """res = sqr(if((a>b)&(a<c),(a+b)*c,(a-b)*c)+if()+if()...)"""
>>> re.findall(r'if\((?:[^()]*|\([^()]*\))*\)', s)
['if((a>b)&(a<c),(a+b)*c,(a-b)*c)', 'if()', 'if()']
For such patterns, it is better to use the VERBOSE flag:
>>> lvl2 = re.compile(r'''
... if\( #literal if(
... (?: #start of non-capturing group
... [^()]* #non-parentheses characters
... | #OR
... \([^()]*\) #non-nested pair of parentheses
... )* #end of non-capturing group, 0 or more times
... \) #literal )
... ''', flags=re.X)
>>> re.findall(lvl2, s)
['if((a>b)&(a<c),(a+b)*c,(a-b)*c)', 'if()', 'if()']
To match any number of nested pairs, you can use the third-party regex module; see Recursive Regular Expressions.
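If installing a third-party module is not an option, the level-2 idea above can be extended to a fixed maximum depth with the standard re module. The following is a minimal sketch, assuming a hypothetical helper named build_if_pattern. It is not recursive matching, only a pattern generated for a chosen depth, and it is written in an "unrolled" form to limit backtracking:

```python
import re


def build_if_pattern(max_depth):
    """Compile a pattern for if(...) whose arguments nest up to max_depth levels.

    build_if_pattern is a hypothetical helper, not part of the re module.
    """
    # Depth 0: no parentheses allowed inside the if(...) at all.
    inner = r"[^()]*"
    for _ in range(max_depth):
        # Plain characters interleaved with groups nested one level less deep.
        inner = r"[^()]*(?:\(" + inner + r"\)[^()]*)*"
    return re.compile(r"if\(" + inner + r"\)")


s = "res = sqr(if((a>b)&(a<c),(a+b)*c,(a-b)*c)+if()+if()...)"
print(build_if_pattern(1).findall(s))
# ['if((a>b)&(a<c),(a+b)*c,(a-b)*c)', 'if()', 'if()']
```

Anything nested deeper than max_depth is simply not matched, which is the trade-off compared with a truly recursive pattern.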
How can I make a regular expression that only matches the middle bracket of nested brackets?
The easiest way to capture something that does not contain certain characters is a negated character set, [^ ....]. The ^ disallows everything listed inside the []; as a special feature you do not need to escape parentheses inside it. So by declaring your regex as r'(\([^()]+\))' you essentially capture a literal ( followed by 1+ characters that are neither ( nor ), followed by a literal ).
See https://regexr.com/3nsfg
From Regex Syntax:
- Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it's not the first character in the set.
- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will both match a parenthesis.
Code:
t = "x(x+3(x+3))"
import re
m = re.findall(r"(\([^()]+\))", t)
print(m[0])
Output:
(x+3)
Python Regex match parenthesis but not nested parenthesis
If (foo) in x(foo)x shall be matched, but (foo) in ((foo)) not, what you want is not possible with regular expressions in general: regular expressions describe regular languages, and a regular language cannot keep track of how deeply a substring is nested. But that context (or 'state', as Jonathon Reinhart called it in his comment) is necessary for the distinction between the (foo) substrings in x(foo)x and ((foo)).
If you only want to match strings that only consist of a parenthesized substring, without any parentheses (matched or unmatched) in that substring, the following regex will do:
^\([^()]*\)$
- ^ and $ 'glue' the pattern to the beginning and end of the string, respectively, thereby excluding partial matches
- note the arbitrary number of repetitions (… *) of the non-parenthesis character inside the parentheses.
- note how special characters are not escaped inside a character set, but still have their literal meaning. (Putting backslashes in there would put literal backslashes in the character set. Or in this case out of the character set, due to the negation.)
- note how the [ starting the character set isn't escaped, because we actually want its special meaning, rather than its literal meaning
The last two points might be specific to the dialect of regular expressions Python uses.
So this will match () and (foo) completely, but not (not even partially) (foo)bar), (foo(bar), x(foo), (foo)x or ()().
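The cases listed above can be checked quickly with the re module. This is just a sketch; the variable names are illustrative:

```python
import re

pattern = re.compile(r"^\([^()]*\)$")

should_match = ["()", "(foo)"]
should_not_match = ["(foo)bar)", "(foo(bar)", "x(foo)", "(foo)x", "()()"]

# Map each test string to whether the anchored pattern matches it.
results = {text: bool(pattern.search(text))
           for text in should_match + should_not_match}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Only the first two strings should report True.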