Python Performance (2): Calling multiprocessing Workers with Multiple Arguments

Following on from Python Performance (1): multiprocessing, a common problem when using multiple processes is that the worker function takes several arguments. Forcing this through pool.map() is awkward; instead you can use Pool.starmap or Pool.map_async. The latter, together with apply_async, differs from map and starmap and will get its own short write-up later.

starmap usage

Usage is similar to before, except that starmap adds a convenient step: you pack the arguments into tuples, and each tuple is automatically unpacked into the function call.

from multiprocessing.pool import Pool

def myfunction(outpath, x, y, z):
    # processing happens here
    ...

outpath = r"C:\Users\huangs\Desktop"  # raw string so the backslashes are not treated as escapes
outpaths = [outpath] * 4
X = [1, 2, 3, 4]
Y = [3, 4, 5, 6]
Z = [1, 4, 5, 6]
# X, Y, Z are the argument collections; tuples also work, e.g. X = (1, 2, 3, 4)
# zip the parallel lists so each element of paras is one call's argument tuple
paras = list(zip(outpaths, X, Y, Z))
with Pool(8) as pool:
    pool.starmap(myfunction, paras)

# Recommended: build the argument list explicitly in a loop
paras_in = []
for i in range(len(X)):
    paras_in.append((outpath, X[i], Y[i], Z[i]))
with Pool(8) as pool:
    pool.starmap(myfunction, paras_in)

This effectively runs, across multiple processes:

myfunction(outpaths[0],X[0],Y[0],Z[0])
myfunction(outpaths[1],X[1],Y[1],Z[1])
......
myfunction(outpaths[i],X[i],Y[i],Z[i])
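To make the unpacking behavior concrete, here is a minimal runnable sketch; the add(x, y) worker is hypothetical and only stands in for myfunction:

```python
from multiprocessing.pool import Pool

def add(x, y):
    # toy stand-in for the real worker: takes two arguments, returns one value
    return x + y

if __name__ == '__main__':
    X = [1, 2, 3, 4]
    Y = [3, 4, 5, 6]
    pairs = list(zip(X, Y))  # [(1, 3), (2, 4), (3, 5), (4, 6)]
    with Pool(4) as pool:
        # each tuple is unpacked into add(x, y); results keep input order
        results = pool.starmap(add, pairs)
    print(results)  # [4, 6, 8, 10]
```

Note that starmap returns the results in the same order as the input list, regardless of which process finishes first.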

Parallelizing pandas over groups of rows

If the computation works row by row, a function passed to pandas' apply is enough. But sometimes a result is derived from several rows at once, which does not fit apply directly; in that case you can ==split the table into multiple sub-tables and process them in parallel==. Here is an example I just finished:

import pandas as pd
from multiprocessing.pool import Pool

def main(tids, track_ids, user_ids, cities):
    # processing happens here
    newrows = []
    for i in range(len(tids)):
        ...

if __name__ == '__main__':
    # original trajectory table; trackinfo holds the CSV path
    tf = pd.read_csv(trackinfo, header=0, encoding='utf-8')
    paras = []
    # split into 367 sub-tables by city; each group becomes one element of paras
    for city, table in tf.groupby('city'):
        tids = list(table['tid'])
        track_ids = list(table['track_id'])
        user_ids = list(table['user_id'])
        cities = list(table['city'])
        paras.append((tids, track_ids, user_ids, cities))
    # paras is now a list of 367 tuples, one group's inputs per tuple
    # each worker call handles a whole group; the per-row loop lives inside main
    with Pool(8) as pool:
        pool.starmap(main, paras)
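The same split-by-group pattern can be sketched end to end with a toy DataFrame; the column names tid/user_id/city mirror the example above, but the data and the process_group worker are made up for illustration:

```python
from multiprocessing.pool import Pool
import pandas as pd

def process_group(tids, user_ids, cities):
    # stand-in for the real per-group computation: report the city and its row count
    return cities[0], len(tids)

if __name__ == '__main__':
    tf = pd.DataFrame({
        'tid':     [1, 2, 3, 4, 5],
        'user_id': [10, 11, 12, 13, 14],
        'city':    ['A', 'A', 'B', 'B', 'B'],
    })
    paras = []
    # one argument tuple per city group, as in the example above
    for city, table in tf.groupby('city'):
        paras.append((list(table['tid']),
                      list(table['user_id']),
                      list(table['city'])))
    with Pool(2) as pool:
        results = pool.starmap(process_group, paras)
    print(results)  # [('A', 2), ('B', 3)]
```

Because each group is processed independently, this scales well when the per-group work dominates; if the groups are tiny, the pickling overhead of shipping the lists to worker processes can outweigh the gain.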