Best way to get the max value in a Spark DataFrame column
>df1.show()
+-----+--------------------+--------+----------+-----------+
|floor| timestamp| uid| x| y|
+-----+--------------------+--------+----------+-----------+
| 1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
| 1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
| 1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
| 1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|
+-----+--------------------+--------+----------+-----------+
>row1 = df1.agg({"x": "max"}).collect()[0]
>print(row1)
Row(max(x)=110.33613)
>print(row1["max(x)"])
110.33613
This answer is almost the same as method 3, but it seems the asDict() in method 3 can be removed.
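For reference, a Row supports both direct indexing and an explicit asDict() conversion, so the extra call is indeed optional; a minimal sketch:

row1 = df1.agg({"x": "max"}).collect()[0]
print(row1["max(x)"])           # direct Row indexing
print(row1.asDict()["max(x)"])  # same value via an explicit dict conversion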
Is there any way to get the max value from a column in PySpark other than collect()?
No need to sort, you can just select the maximum:
from pyspark.sql.functions import col, max
res = df.select(max(col('col1')).alias('max_col1')).first().max_col1
Or you can use selectExpr:
res = df.selectExpr('max(col1) as max_col1').first().max_col1
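Note that importing max from pyspark.sql.functions shadows Python's built-in max. A minimal equivalent sketch using the common F alias avoids that:

from pyspark.sql import functions as F

# F.max leaves Python's built-in max untouched
res = df.select(F.max(F.col('col1')).alias('max_col1')).first().max_col1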
How to get the rows with the max value in a Spark DataFrame
You can do this easily by extracting the MAX High value and then applying a filter against that value on the entire DataFrame.
Data Preparation
import pandas as pd
import pyspark.sql.functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2021-02-09', '2009-09-19'],
    'High': [89, 90, 96],
    'Low': [43, 54, 50]
})
sparkDF = spark.createDataFrame(df)  # assumes an active SparkSession named spark
sparkDF.show()
+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2021-01-23| 89| 43|
|2021-02-09| 90| 54|
|2009-09-19| 96| 50|
+----------+----+---+
Filter
max_high = sparkDF.select(F.max(F.col('High')).alias('High')).collect()[0]['High']
# max_high is now 96
sparkDF.filter(F.col('High') == max_high).orderBy(F.col('Date').desc()).limit(1).show()
+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2009-09-19| 96| 50|
+----------+----+---+
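As an aside, the same row can be found in a single pass with a window function; a minimal sketch (ordering without partitionBy pulls all rows into one partition, which is fine for small data):

from pyspark.sql import functions as F
from pyspark.sql import Window

# Rank rows by High (with Date as the tiebreak) and keep the top one
w = Window.orderBy(F.col('High').desc(), F.col('Date').desc())
sparkDF.withColumn('rn', F.row_number().over(w)) \
       .filter(F.col('rn') == 1) \
       .drop('rn') \
       .show()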
Add column to Spark dataframe with the max value that is less than the current record's value
You may try the following, which uses max as a window function with when (a case expression) but focuses on the preceding rows:
from pyspark.sql import functions as F
from pyspark.sql import Window

df = df.withColumn('previous_service_date', F.max(
    # Only rows with status PD contribute a service_date; others yield null
    F.when(F.col("status") == "PD", F.col("service_date")).otherwise(None)
).over(
    Window.partitionBy("product")
          .orderBy('service_date')  # ordering makes "preceding rows" well-defined
          .rowsBetween(Window.unboundedPreceding, -1)
))
df.orderBy('service_date').show(truncate=False)
+---+--------------------+-------------------+------+-------+---------------------+
|id |claim_id |service_date |status|product|previous_service_date|
+---+--------------------+-------------------+------+-------+---------------------+
|123|10606134411906233408|2018-09-17 00:00:00|PD |blue |null |
|123|10606147900401009928|2019-01-24 00:00:00|PD |yellow |null |
|123|10606160940704723994|2019-05-23 00:00:00|RV |yellow |2019-01-24 00:00:00 |
|123|10606171648203079553|2019-08-29 00:00:00|RJ |blue |2018-09-17 00:00:00 |
|123|10606186611407311724|2020-01-13 00:00:00|PD |blue |2018-09-17 00:00:00 |
+---+--------------------+-------------------+------+-------+---------------------+
Edit 1
You may also use last, as shown below (the trailing True is the ignorenulls flag):
df = df.withColumn('previous_service_date', F.last(
    F.when(F.col("status") == "PD", F.col("service_date")).otherwise(None),
    True  # ignorenulls: skip the nulls produced for non-PD rows
).over(
    Window.partitionBy("product")
          .orderBy('service_date')
          .rowsBetween(Window.unboundedPreceding, -1)
))
Let me know if this works for you.
Get min and max from a specific column in a Scala Spark DataFrame
How about getting the column name from the metadata:
import org.apache.spark.sql.functions.{min, max}

val selectedColumnName = df.columns(q) // pull the (q + 1)th column from the columns array
df.agg(min(selectedColumnName), max(selectedColumnName))
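For comparison, a rough PySpark sketch of the same idea, assuming q holds the 0-based column index:

from pyspark.sql import functions as F

selectedColumnName = df.columns[q]  # pull the (q + 1)th column from the columns list
df.agg(F.min(selectedColumnName), F.max(selectedColumnName)).show()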
Getting the maximum value of a column in an Apache Spark DataFrame (Scala)
In your code, Spark's max was being shadowed by Scala's max, so I fully qualified max as the Spark SQL function.
This will give you the max of the id column as an integer value:
%scala
import org.apache.spark.sql.functions.col
val max = df.agg(org.apache.spark.sql.functions.max(col("id"))).collect()(0)(0).asInstanceOf[Int]
Output: max: Int = 6
OR
If you want to create a column to store the max value of id:
%scala
import org.apache.spark.sql.functions._
import spark.implicits._ // for $"id" and the Int encoder used by .as[Int]

df.withColumn("max", lit(df.agg(org.apache.spark.sql.functions.max($"id")).as[Int].first))
How to get the max value of a date column in PySpark
I don't understand why you used try/except; the if-statement should be enough. Also, you need to use the Spark SQL min/max instead of the Python built-ins, and avoid naming your variables min/max, which shadows the built-in functions.
import pyspark.sql.functions as F

for col in df.columns:
    if dict(df.dtypes)[col] == 'string':
        minval, maxval = df.select(F.min(col), F.max(col)).first()
        print(maxval)
    else:
        print(col, 'NA')
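If the dates are stored as strings, it may be safer to cast before aggregating so the comparison happens on real dates rather than text; a minimal sketch, assuming a hypothetical column date_col in yyyy-MM-dd format:

import pyspark.sql.functions as F

# to_date parses the string; F.max then compares DateType values
max_date = df.select(F.max(F.to_date('date_col', 'yyyy-MM-dd'))).first()[0]
print(max_date)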
How to find the max value of all columns in a Spark DataFrame
The code below will work irrespective of how many columns there are or what mix of datatypes they have.
Note: the OP suggested in her comments that, for string columns, we take the first non-null value while aggregating.
# Import relevant functions
from pyspark.sql.functions import max, first, col
# Take an example DataFrame
values = [('Alice',10,5,None,50),('Bob',15,15,'Simon',10),('Jack',5,1,'Timo',3)]
df = spark.createDataFrame(values, ['col1','col2','col3','col4','col5'])
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice| 10| 5| null| 50|
| Bob| 15| 15|Simon| 10|
| Jack| 5| 1| Timo| 3|
+-----+----+----+-----+----+
# Lists all columns in the DataFrame
seq_of_columns = df.columns
print(seq_of_columns)
['col1', 'col2', 'col3', 'col4', 'col5']
# Using List comprehensions to create a list of columns of String DataType
string_columns = [i[0] for i in df.dtypes if i[1]=='string']
print(string_columns)
['col1', 'col4']
# Using Set function to get non-string columns by subtracting one list from another.
non_string_columns = list(set(seq_of_columns) - set(string_columns))
print(non_string_columns)
['col2', 'col3', 'col5']
Read about first() and its ignorenulls argument in the Spark SQL functions documentation.
# Aggregating both string and non-string columns
df = df.select(*[max(col(c)).alias(c) for c in non_string_columns],
               *[first(col(c), ignorenulls=True).alias(c) for c in string_columns])
df = df.select(seq_of_columns)  # restore the original column order
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice| 15| 15|Simon| 50|
+-----+----+----+-----+----+
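If a lexicographic maximum is acceptable for the string columns as well, the aggregation collapses to a one-liner, since Spark's max also works on strings; a sketch, run against the original (pre-aggregation) DataFrame:

df.select(*[max(col(c)).alias(c) for c in df.columns]).show()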