if else in pyspark for collapsing column values
Try this:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def modify_values(r):
    if r == "A" or r == "B":
        return "dispatch"
    else:
        return "non-dispatch"

ol_val = udf(modify_values, StringType())
new_df = df.withColumn("wo_flag", ol_val(df.wo_flag))
Things you are doing wrong:
- You are trying to modify Rows (Rows are immutable).
- When a map operation is done on a DataFrame, the resulting data structure is a PipelinedRDD, not a DataFrame. You have to apply .toDF() to get a DataFrame back, as in the sketch below.
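A minimal sketch of that map-then-toDF route (assuming an active SparkSession and a df with a wo_flag column; for brevity only wo_flag is kept, and the udf version above is usually simpler):
from pyspark.sql import Row

mapped = df.rdd.map(
    lambda r: Row(wo_flag="dispatch" if r.wo_flag in ("A", "B") else "non-dispatch")
)
new_df = mapped.toDF()  # mapped is a PipelinedRDD until toDF() turns it back into a DataFrame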
If-If statement Scala Spark
Creating the dataframe:
val df1 = Seq((1, true, 1), (1, true, 0), (1, false, 1), (2, true, 1), (2, true, 0)).toDF("A", "B", "C")
df1.show()
// +---+-----+---+
// | A| B| C|
// +---+-----+---+
// | 1| true| 1|
// | 1| true| 0|
// | 1|false| 1|
// | 2| true| 1|
// | 2| true| 0|
// +---+-----+---+
The code:
val condition1 = ($"A" === 1) && ($"B" === true)
val condition2 = condition1 && ($"C" === 1)
val arr1 = array(when(condition1, "A"), when(condition2, "B"))
val arr2 = when(element_at(arr1, 2).isNull, slice(arr1, 1, 1)).otherwise(arr1)
val df2 = df1.withColumn("D", explode(arr2))
df2.show()
// +---+-----+---+----+
// | A| B| C| D|
// +---+-----+---+----+
// | 1| true| 1| A|
// | 1| true| 1| B|
// | 1| true| 0| A|
// | 1|false| 1|null|
// | 2| true| 1|null|
// | 2| true| 0|null|
// +---+-----+---+----+
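If you need the same logic from the Python API, here is a rough PySpark sketch of the same array/when/explode approach (assuming a DataFrame df1 with columns A, B, C; element_at and slice need Spark 2.4+):
from pyspark.sql import functions as F

cond1 = (F.col("A") == 1) & F.col("B")
cond2 = cond1 & (F.col("C") == 1)
arr1 = F.array(F.when(cond1, "A"), F.when(cond2, "B"))
arr2 = F.when(F.element_at(arr1, 2).isNull(), F.slice(arr1, 1, 1)).otherwise(arr1)
df2 = df1.withColumn("D", F.explode(arr2))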
Pyspark Conditional statement
Possible duplicate of this question.
You need to use when, with (or without) otherwise, from pyspark.sql.functions.
from pyspark.sql.functions import when, col
df = srcdf.withColumn(
    "sum",
    when(col("flag") == 1, col("sal") * 2)
    .when(col("flag") == 2, col("sal") * 4)
)
OR
from pyspark.sql.functions import when, col
df = srcdf.withColumn(
    "sum",
    when(col("flag") == 1, col("sal") * 2)
    .otherwise(col("sal") * 4)
)
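Both forms can also be combined, checking each flag in turn and then falling back to a default (a sketch using the same srcdf with flag and sal columns; keeping the original sal as the fallback is an arbitrary choice for illustration):
df = srcdf.withColumn(
    "sum",
    when(col("flag") == 1, col("sal") * 2)
    .when(col("flag") == 2, col("sal") * 4)
    .otherwise(col("sal"))
)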
SPARK SQL - case when then
Before Spark 1.2.0
The supported syntax (which I just tried out on Spark 1.0.2) seems to be
SELECT IF(1=1, 1, 0) FROM table
This recent thread http://apache-spark-user-list.1001560.n3.nabble.com/Supported-SQL-syntax-in-Spark-SQL-td9538.html links to the SQL parser source, which may or may not help depending on your comfort with Scala. At the very least the list of keywords starting (at time of writing) on line 70 should help.
Here's the direct link to the source for convenience: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala.
Update for Spark 1.2.0 and beyond
As of Spark 1.2.0, the more traditional syntax is supported, in response to SPARK-3813: search for "CASE WHEN" in the test source. For example:
SELECT CASE WHEN key = 1 THEN 1 ELSE 2 END FROM testData
Update on where to find the current SQL parser syntax
The parser source can now be found here.
Update for more complex examples
In response to a question below, the modern syntax supports complex Boolean conditions.
SELECT
CASE WHEN id = 1 OR id = 2 THEN "OneOrTwo" ELSE "NotOneOrTwo" END AS IdRedux
FROM customer
You can involve multiple columns in the condition.
SELECT
CASE WHEN id = 1 OR state = 'MA'
THEN "OneOrMA"
ELSE "NotOneOrMA" END AS IdRedux
FROM customer
You can also nest CASE WHEN THEN expressions.
SELECT
CASE WHEN id = 1
THEN "OneOrMA"
ELSE
CASE WHEN state = 'MA' THEN "OneOrMA" ELSE "NotOneOrMA" END
END AS IdRedux
FROM customer
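If you are working from the DataFrame API rather than a raw SQL query, the same CASE WHEN text can be passed through expr (a sketch assuming a customer DataFrame with id and state columns):
from pyspark.sql.functions import expr

customer.withColumn(
    "IdRedux",
    expr("CASE WHEN id = 1 OR state = 'MA' THEN 'OneOrMA' ELSE 'NotOneOrMA' END")
)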
Pyspark apply function to column value if condition is met
I hope this helps:
from pyspark.sql.functions import when
from pyspark.sql.types import IntegerType

def myFun(x):
    return (x ** 2).cast(IntegerType())

df2 = df.withColumn("y", when(df.col1 == 1, myFun(df.col2)).otherwise(None))
df2.show()
+----+----+----+
|col1|col2| y|
+----+----+----+
| 1| 2| 4|
| 2| 7|null|
| 1| 3| 9|
| 2| -6|null|
| 1| 3| 9|
| 1| 5| 25|
| 1| 4| 16|
| 2| 7|null|
+----+----+----+
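The otherwise(None) is optional here: rows that match no when branch already come back as null, so this shorter variant of the same expression gives the same result:
df2 = df.withColumn("y", when(df.col1 == 1, myFun(df.col2)))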
IF Statement Pyspark
How to make it work (pass a struct):
from pyspark.sql.functions import struct

df_4.withColumn("y", y_udf(
    # Include the columns you want
    struct(df_4['tot_amt'], df_4['purch_class'])
))
What would make more sense:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

y_udf = udf(lambda y: 1 if y < -50 else 0, IntegerType())
df_4.withColumn("y", y_udf('tot_amt'))
How it should be done:
from pyspark.sql.functions import when
df_4.withColumn("y", when(df_4['tot_amt'] < -50, 1).otherwise(0))