Make Pandas DataFrame apply() use all cores?
You may use the swifter
package:
pip install swifter
(Note that you may want to use this in a virtualenv to avoid version conflicts with installed dependencies.)
Swifter works as a plugin for pandas, allowing you to reuse the apply
function:
import swifter
def some_function(data):
return data * 10
data['out'] = data['in'].swifter.apply(some_function)
It will automatically figure out the most efficient way to parallelize the function, no matter if it's vectorized (as in the above example) or not.
More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.
Also note that this will not work automatically for string columns. When using strings, Swifter will fallback to a “simple” Pandas apply
, which will not be parallel. In this case, even forcing it to use dask
will not create performance improvements, and you would be better off just splitting your dataset manually and parallelizing using multiprocessing
.
Python: using multiprocessing on a pandas dataframe
What's wrong
This line from your code:
pool.map(calc_dist, ['lat','lon'])
spawns 2 processes - one runs calc_dist('lat')
and the other runs calc_dist('lon')
. Compare the first example in doc. (Basically, pool.map(f, [1,2,3])
calls f
three times with arguments given in the list that follows: f(1)
, f(2)
, and f(3)
.) If I'm not mistaken, your function calc_dist
can only be called calc_dist('lat', 'lon')
. And it doesn't allow for parallel processing.
Solution
I believe you want to split the work between processes, probably sending each tuple (grp, lst)
to a separate process. The following code does exactly that.
First, let's prepare for splitting:
grp_lst_args = list(df.groupby('co_nm').groups.items())
print(grp_lst_args)
[('aa', [0, 1, 2]), ('cc', [7, 8, 9]), ('bb', [3, 4, 5, 6])]
We'll send each of these tuples (here, there are three of them) as an argument to a function in a separate process. We need to rewrite the function, let's call it calc_dist2
. For convenience, it's argument is a tuple as in calc_dist2(('aa',[0,1,2]))
def calc_dist2(arg):
grp, lst = arg
return pd.DataFrame(
[ [grp,
df.loc[c[0]].ser_no,
df.loc[c[1]].ser_no,
vincenty(df.loc[c[0], ['lat','lon']],
df.loc[c[1], ['lat','lon']])
]
for c in combinations(lst, 2)
],
columns=['co_nm','machineA','machineB','distance'])
And now comes the multiprocessing:
pool = mp.Pool(processes = (mp.cpu_count() - 1))
results = pool.map(calc_dist2, grp_lst_args)
pool.close()
pool.join()
results_df = pd.concat(results)
results
is a list of results (here data frames) of calls calc_dist2((grp,lst))
for (grp,lst)
in grp_lst_args
. Elements of results
are later concatenated to one data frame.
print(results_df)
co_nm machineA machineB distance
0 aa 1 2 156.876149391 km
1 aa 1 3 313.705445447 km
2 aa 2 3 156.829329105 km
0 cc 8 9 156.060165391 km
1 cc 8 0 311.910998169 km
2 cc 9 0 155.851498134 km
0 bb 4 5 156.665641837 km
1 bb 4 6 313.214333025 km
2 bb 4 7 469.622535339 km
3 bb 5 6 156.548897414 km
4 bb 5 7 312.957597466 km
5 bb 6 7 156.40899677 km
BTW, In Python 3 we could use a with
construction:
with mp.Pool() as pool:
results = pool.map(calc_dist2, grp_lst_args)
Update
I tested this code only on linux. On linux, the read only data frame df
can be accessed by child processes and is not copied to their memory space, but I'm not sure how it exactly works on Windows. You may consider splitting df
into chunks (grouped by co_nm
) and sending these chunks as arguments to some other version of calc_dist
.
Related Topics
How to Trace the Path in a Breadth-First Search
Redirect While Passing Arguments
Beautifulsoup:Difference Between .Find() and .Select()
Selenium Webdriver: How to Download a PDF File with Python
Why Do I Get "Pickle - Eoferror: Ran Out of Input" Reading an Empty File
Using a Python Subprocess Call to Invoke a Python Script
Why Is 'Self' in Python Objects Immutable
Correct Style for Python Functions That Mutate the Argument
Why Apply Sometimes Isn't Faster Than For-Loop in a Pandas Dataframe
Selenium Unable to Locate Element Only When Using Headless Chrome (Python)
Access Data in Package Subdirectory
Reading Two Text Files Line by Line Simultaneously
Simulate Python Keypresses for Controlling a Game
How to Create Collapsible Box in Pyqt
Matrix Multiplication in Pure Python
Valueerror: Could Not Broadcast Input Array from Shape (224,224,3) into Shape (224,224)
SchröDinger's Variable: the _Class_ Cell Magically Appears If You'Re Checking for Its Presence