Pyspark - Regex - Extract value from last brackets
To extract the substring between parentheses at the end of the string, with no other parentheses inside it, you may use
from pyspark.sql.functions import col, regexp_extract
tmp = tmp.withColumn("new", regexp_extract(col("txt"), r"\(([^()]+)\)$", 1))
Details
- \( - matches a ( char
- ([^()]+) - captures into Group 1 any one or more chars other than ( and )
- \) - a ) char
- $ - at the end of the string.
The 1 argument tells regexp_extract to extract the Group 1 value.
NOTE: To allow trailing whitespace, add \s* right before $: r"\(([^()]+)\)\s*$"
NOTE2: To match the last occurrence of such a substring in a longer string, keep exactly the same code as above and change only the pattern to r"(?s).*\(([^()]+)\)"
The greedy .* first grabs all the text up to the end of the string, and backtracking then steps back until the final parenthesized group can match.
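The three patterns above can be compared quickly outside Spark. What follows is a minimal sketch that uses Python's built-in re module as a stand-in for regexp_extract. This is an assumption of convenience: Spark uses Java regular expressions, which treat these particular patterns the same way. The helper name group1 and the sample strings are made up for illustration.

```python
import re

# Patterns quoted from the answer above.
AT_END = r"\(([^()]+)\)$"
AT_END_WS = r"\(([^()]+)\)\s*$"
LAST_ANYWHERE = r"(?s).*\(([^()]+)\)"


def group1(pattern, text):
    """Return Group 1 of the first match, or '' if there is no match.

    regexp_extract also yields an empty string when the pattern does not match.
    """
    match = re.search(pattern, text)
    return match.group(1) if match else ""


print(group1(AT_END, "total (net) price (USD)"))                 # USD
print(repr(group1(AT_END, "total (net) price (USD)  ")))         # '' (trailing spaces)
print(group1(AT_END_WS, "total (net) price (USD)  "))            # USD
print(group1(LAST_ANYWHERE, "total (net) price (USD) approx."))  # USD
```

The second call shows why the \s* variant exists: without it, any whitespace after the closing parenthesis prevents a match.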
Extract text between brackets and create rows for each bit of text
If you can assign the output of Series.str.findall back to the column, use DataFrame.explode to get one row per match, then DataFrame.reset_index with drop=True to get a unique index:
df2['text'] = df2['text'].str.findall(r"(?<=\[)([^]]+)(?=\])")
df4 = df2.explode('text').reset_index(drop=True)
Alternatively, use Series.str.extractall, remove the second level of the resulting MultiIndex, and finally use DataFrame.join to append the matches to the original:
s = (df2.pop('text').str.extractall(r"(?<=\[)([^]]+)(?=\])")[0]
        .reset_index(level=1, drop=True)
        .rename('text'))
df4 = df2.join(s).reset_index(drop=True)
print(df4)
   studyid Question       text
0      101       Q1    Bananas
1      101       Q1    oranges
2      101       Q1       figs
3      101       Q2     Apples
4      102       Q1     Grapes
5      103       Q3  Mandarins
6      103       Q3    oranges
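Both approaches can be tried side by side. The sketch below assumes a small df2 whose contents are hypothetical: the original input is not shown, so the sentences in the text column were invented to be consistent with the output above.

```python
import pandas as pd

# Hypothetical input, chosen to be consistent with the output shown above.
df2 = pd.DataFrame({
    "studyid": [101, 101, 102, 103],
    "Question": ["Q1", "Q2", "Q1", "Q3"],
    "text": [
        "I like [Bananas], [oranges] and [figs]",
        "[Apples]",
        "Only [Grapes]",
        "[Mandarins] or [oranges]",
    ],
})
pattern = r"(?<=\[)([^]]+)(?=\])"

# Approach 1: findall gives one list per row, explode gives one row per item.
via_explode = (
    df2.assign(text=df2["text"].str.findall(pattern))
       .explode("text")
       .reset_index(drop=True)
)

# Approach 2: extractall gives one row per match under a two-level index.
base = df2.copy()  # pop() mutates, so work on a copy
matches = (
    base.pop("text").str.extractall(pattern)[0]
        .reset_index(level=1, drop=True)
        .rename("text")
)
via_extractall = base.join(matches).reset_index(drop=True)

print(via_explode)
print(via_extractall["text"].tolist() == via_explode["text"].tolist())
```

One difference worth knowing: a row with no bracketed text survives explode as a missing value, and also survives the join in the second approach because DataFrame.join defaults to a left join.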
How to extract content from the regex output which has square bracket in python
You can use str.strip if the type of the values is string:
print(type(df.at[0, 'email']))
<class 'str'>
df['email'] = df.email.str.strip("[]'")
print(df)
email
0 jsaw@yahoo.com
1 jfsjhj@yahoo.com
2 jwrk@yahoo.com
3 rankw@yahoo.com
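Series.str.strip follows the same rule as Python's built-in str.strip: the argument is a set of characters to remove from both ends, not a literal prefix or suffix. A minimal sketch on plain strings, assuming the values are stored as a list rendered to text (the exact stored form is not shown in the question):

```python
raw = "['jsaw@yahoo.com']"  # assumed stored form: a list rendered as text

cleaned = raw.strip("[]'")
print(cleaned)  # jsaw@yahoo.com

# Only the ends are trimmed; a second address keeps its inner quotes and comma.
two = "['rankw@yahoo.com', 'fsffsnl@gmail.com']".strip("[]'")
print(two)  # rankw@yahoo.com', 'fsffsnl@gmail.com
```

The second case is why this approach only suits cells that hold a single address.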
If the type is list, apply Series:
print(type(df.at[0, 'email']))
<class 'list'>
df['email'] = df.email.apply(pd.Series)
print(df)
email
0 jsaw@yahoo.com
1 jfsjhj@yahoo.com
2 jwrk@yahoo.com
3 rankw@yahoo.com
EDIT: If you have multiple values in the array, you can use:
df1 = df['email'].apply(pd.Series).fillna('')
print(df1)
                  0                  1                 2
0    jsaw@yahoo.com
1  jfsjhj@yahoo.com
2    jwrk@yahoo.com
3   rankw@yahoo.com  fsffsnl@gmail.com
4   mklcu@yahoo.com   riserk@gmail.com  funkdl@yahoo.com
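The list case can be reproduced with a small frame. This is a sketch under assumptions: the frame below is hypothetical and reuses two rows of addresses from the output above, and the .str[0] line is an added alternative, not part of the original answer.

```python
import pandas as pd

# Hypothetical list-valued column, reusing addresses from the output above.
df = pd.DataFrame({
    "email": [
        ["jsaw@yahoo.com"],
        ["rankw@yahoo.com", "fsffsnl@gmail.com"],
    ]
})

# One output column per list position; shorter lists are padded with ''.
wide = df["email"].apply(pd.Series).fillna("")
print(wide)

# If only the first address matters, .str[0] picks it without widening.
first_only = df["email"].str[0]
print(first_only.tolist())  # ['jsaw@yahoo.com', 'rankw@yahoo.com']
```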
How to remove square brackets from dataframe
Try with apply, explode and groupby:
>>> df.apply(lambda x: x.explode().astype(str).groupby(level=0).agg(", ".join))
column1 column2 column3
0 data1 data1 data1
1 nan data2 data2
2 data2 data3 data3, data3, testing how are you guys hope yo...
3 data3 data3 data4, dummy text to test to test test test
4 nan data4 data5
- Use Series.explode() to transform each list element to its own row, replicating index values.
- Then groupby identical index values and aggregate using str.join().
- Use apply to apply the same function to all columns of the DataFrame.
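The three steps above can be sketched on a tiny frame. The data here is hypothetical (the original DataFrame is not shown) and the helper name join_lists is made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose cells are lists, so they print with square brackets.
df = pd.DataFrame({
    "column1": [["data1"], ["data2"]],
    "column2": [["data1"], ["data3", "data3"]],
})


def join_lists(col):
    # explode: one row per list element, index values repeated.
    # groupby(level=0): regroup rows sharing an index value, then join them.
    return col.explode().astype(str).groupby(level=0).agg(", ".join)


flat = df.apply(join_lists)
print(flat)
```

Each cell ends up as a plain string, so the brackets disappear from the printed frame.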
Pyspark - Regex_Extract value between forward slash (/)
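How about split? You can:
from pyspark.sql.functions import split
# df is assumed to be the DataFrame that holds column1
df = (df.withColumn("Acode", split("column1", "/")[0])
        .withColumn("Bcode", split("column1", "/")[1])
        .withColumn("Ccode", split("column1", "/")[2]))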
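To see what each new column would hold, here is a plain-Python stand-in for the same indexing. It uses str.split rather than Spark, and the sample value is made up. Spark's split treats its second argument as a regular expression, but "/" has no special meaning there, so the pieces are the same.

```python
value = "A1/B2/C3"  # made-up sample value for column1

parts = value.split("/")
acode, bcode, ccode = parts[0], parts[1], parts[2]
print(acode, bcode, ccode)  # A1 B2 C3
```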
How do I get the value without the square brackets
first returns a Row object, and you can use the getString method to extract elements from the row as strings:
sigh.select("accountId").first.getString(0)