Is there a weighted.median() function?
The following packages all have a function to calculate a weighted median: 'aroma.light', 'isotone', 'limma', 'cwhmisc', 'ergm', 'laeken', 'matrixStats', 'PSCBS', and 'bigvis' (on github).
To find them I used the invaluable findFn() in the 'sos' package which is an extension for R's inbuilt help.
findFn('weighted median')
Or,
???'weighted median'
since ??? is a shortcut for findFn(), in the same way that ?some.function is a shortcut for help(some.function).
Weighted median in the spatstat package
I believe this is a flaw in the package, and I'll explain why.
Firstly, weighted.median actually just calls weighted.quantile with the probs vector set to 0.5. But if you call weighted.quantile with your data, you get very strange results:
weighted.quantile(x, w)
#> 0% 25% 50% 75% 100%
#> 10.00 10.00 10.50 11.25 12.00
That's not right.
If you look at the body of this function using body(weighted.quantile) and follow the logic through, there seems to be a problem with the way the weights are normalized on line 10 into a variable called Fx. To work properly, the normalized weights should be a vector of the same length as x, starting at 0 and ending at 1, with the spacing in between proportional to the weights.
But if you look at how this is actually calculated:
body(weighted.quantile)[[10]]
#> Fx <- cumsum(w)/sum(w)
You can see it doesn't start at 0. In your case, the first element would be 0.3333.
So to show this is the case, let's overwrite that line with the correct expression. (First we need to unlock the binding to get write access to the function.)
unlockBinding("weighted.quantile", asNamespace("spatstat"))
body(weighted.quantile)[[10]] <- substitute(Fx <- (cumsum(w) - min(w))/(sum(w) - min(w)))
Now we get the correct result for weighted quantiles (including the correct median)
weighted.quantile(x, w)
#> 0% 25% 50% 75% 100%
#> 10.0 10.5 11.0 11.5 12.0
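The effect of that normalization bug is easy to demonstrate outside R. A minimal Python sketch (using hypothetical equal weights) contrasts the original cumsum(w)/sum(w) with the corrected (cumsum(w) - min(w))/(sum(w) - min(w)):

```python
def fx_original(w):
    """spatstat's original normalization: cumsum(w) / sum(w)."""
    total, c, out = sum(w), 0.0, []
    for wi in w:
        c += wi
        out.append(c / total)
    return out

def fx_fixed(w):
    """Corrected normalization: (cumsum(w) - min(w)) / (sum(w) - min(w))."""
    total, mn = sum(w), min(w)
    c, out = 0.0, []
    for wi in w:
        c += wi
        out.append((c - mn) / (total - mn))
    return out

# With three equal weights the original starts at 1/3 instead of 0:
fx_original([1.0, 1.0, 1.0])  # [0.333..., 0.666..., 1.0]
fx_fixed([1.0, 1.0, 1.0])     # [0.0, 0.5, 1.0]
```

The fixed version spans 0 to 1, which is what the interpolation inside weighted.quantile needs.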
Python: define function to get the weighted median
Try to stack only one level:
wmedian = lambda x: x.loc[x['weight'].cumsum().gt(0.5), 'close'].head(1)
out = df1.stack(level=0).groupby(level=0).apply(wmedian) \
.reset_index(level=[1, 2], drop=True)
Output:
>>> out
01-01-2020 23
01-02-2020 21
01-03-2020 44
Name: close, dtype: int64
>>> df1.stack(level=0)
close weight
01-01-2020 A 10 0.1
B 20 0.2
C 23 0.3
D 45 0.5
01-02-2020 A 12 0.3
B 19 0.1
C 21 0.4
D 47 0.2
01-03-2020 A 15 0.1
B 29 0.2
C 4 0.1
D 44 0.6
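The lambda's selection rule — take the first close value whose cumulative weight exceeds 0.5 — can be sketched in plain Python (assuming, as in the frame above, that the weights within each date sum to 1):

```python
def first_past_half(values, weights):
    """Return the first value whose cumulative weight exceeds 0.5.

    Mirrors x.loc[x['weight'].cumsum().gt(0.5), 'close'].head(1);
    assumes the weights sum to 1.
    """
    c = 0.0
    for v, w in zip(values, weights):
        c += w
        if c > 0.5:
            return v

# The 01-01-2020 rows from the frame above:
first_past_half([10, 20, 23, 45], [0.1, 0.2, 0.3, 0.5])  # 23
```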
How to calculate weighted mean and median in python?
First, install the weightedstats library:
pip install weightedstats
Then import it and do the following -
import weightedstats as ws
Weighted Mean
ws.weighted_mean(state['Murder.Rate'], weights=state['Population'])
4.445833981123394
Weighted Median
ws.weighted_median(state['Murder.Rate'], weights=state['Population'])
4.4
It also has special weighted mean and median methods for use with numpy arrays. The methods above will work either way, but the numpy-based variants are there if you need them.
my_data = [1, 2, 3, 4, 5]
my_weights = [10, 1, 1, 1, 9]
ws.numpy_weighted_mean(my_data, weights=my_weights)
ws.numpy_weighted_median(my_data, weights=my_weights)
KDB: weighted median
For values v and weights w, med v where w gobbles space for larger values of w.
Instead, sort w into ascending order of v and look for where the cumulative sums reach half their total.
q)show v:10?100
17 23 12 66 36 37 44 28 20 30
q)show w:.001*10?1000
0.418 0.126 0.077 0.829 0.503 0.12 0.71 0.506 0.804 0.012
q)med v where "j"$w*1000
36f
q)w iasc v / sort w into ascending order of v
0.077 0.418 0.804 0.126 0.506 0.012 0.503 0.12 0.71 0.829
q)0.5 1*(sum;sums)@\:w iasc v / half the sum and cumulative sums of w
2.0525
0.077 0.495 1.299 1.425 1.931 1.943 2.446 2.566 3.276 4.105
q).[>]0.5 1*(sum;sums)@\:w iasc v / compared
1111110000b
q)v i sum .[>]0.5 1*(sum;sums)@\:w i:iasc v / weighted median
36
q)\ts:1000 med v where "j"$w*1000
18 132192
q)\ts:1000 v i sum .[>]0.5 1*(sum;sums)@\:w i:iasc v
2 2576
q)wmed:{x i sum .[>]0.5 1*(sum;sums)@\:y i:iasc x}
Some vector techniques worth noticing:
- Applying two functions with Each Left, (sum;sums)@\:, then using Apply with an operator, .[>], on the result, rather than setting a variable, e.g. (0.5*sum yi)>sums yi:y i, or defining an inner lambda {sums[x]<0.5*sum x}y i
- Grading one list with iasc to sort another
- Multiple mappings through juxtaposition: v i sum ..
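For readers less familiar with q, the wmed algorithm translates almost line for line into Python (a sketch, using the same v and w as above): grade the values, sort the weights by that grade, then count how many cumulative sums fall short of half the total — that count indexes the weighted median.

```python
def wmed(v, w):
    """Weighted median: the value at which cumulative weight,
    taken in ascending order of v, first reaches half the total."""
    i = sorted(range(len(v)), key=lambda k: v[k])  # iasc v
    sw = [w[k] for k in i]                         # w iasc v
    half = 0.5 * sum(sw)                           # 0.5*sum w
    c, below = 0.0, 0
    for wi in sw:                                  # count cumulative sums < half
        c += wi
        if c < half:
            below += 1
    return v[i[below]]

v = [17, 23, 12, 66, 36, 37, 44, 28, 20, 30]
w = [0.418, 0.126, 0.077, 0.829, 0.503, 0.12, 0.71, 0.506, 0.804, 0.012]
wmed(v, w)  # 36
```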
Python: weighted median algorithm with pandas
If you want to do this in pure pandas, here's a way. It does not interpolate either. (@svenkatesh, you were missing the cumulative sum in your pseudocode)
df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
median = df.impwealth[cumsum >= cutoff].iloc[0]
This gives a median of 925000.
Calculate median from x, y data R
Without transforming:
lapply(df[,2:3], function(y) median(rep(df$Size, times = y)))
$val1
[1] 49
$val2
[1] 47
data:
set.seed(99)
df <- data.frame(Size = c(1:100),
                 val1 = sample(0:9, 100, replace = TRUE),
val2 = sample(0:9,100,replace = TRUE))
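The rep() trick — repeat each Size by its count, then take an ordinary median — is easy to mirror in Python (with hypothetical values and counts standing in for df$Size and a val column):

```python
from statistics import median

sizes  = [10, 20, 30, 40]   # hypothetical values (df$Size)
counts = [1, 3, 0, 2]       # hypothetical frequencies (a val column)

# Expand each size by its count, like rep(df$Size, times = y) in R
expanded = [s for s, c in zip(sizes, counts) for _ in range(c)]
# expanded == [10, 20, 20, 20, 40, 40]
median(expanded)  # 20
```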