Pandas (Misc): Filtering Lists and pandas Text by Substring

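Working with text or string data often means filtering by keyword: selecting the rows of a table df that contain any of several keywords, selecting every sentence in a list texts that contains one of the keywords in keywords, or selecting the rows whose value belongs to a given list.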

Strings

Filtering rows by substring

For a single string, test membership with if substring in text.
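As a minimal sketch of the single-string check, the snippet below wraps the in operator in a small helper. The helper name contains_any and its ignore_case parameter are hypothetical, introduced only for illustration.

```python
def contains_any(text, substrings, ignore_case=False):
    """Return True if text contains at least one of the substrings."""
    if ignore_case:
        # Lowercase both sides so "Hato", "HATO" and "hato" compare equal.
        text = text.lower()
        substrings = [s.lower() for s in substrings]
    return any(s in text for s in substrings)


text = "Typhoon Hato made landfall"

# Plain membership test on one substring.
found_exact = "Hato" in text

# Case-sensitive by default, so the lowercase keyword does not match.
found_lower = contains_any(text, ["hato", "pakhar"])

# With ignore_case=True the same keywords match.
found_any_case = contains_any(text, ["hato", "pakhar"], ignore_case=True)
```

The in operator is case-sensitive, which is why the keyword lists in this post spell out several capitalizations of the same word.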

In pandas, use df['text'].str.contains("keyword1|keyword2..."). The pattern is treated as a regular expression by default, so | means OR.

Code example:

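# Filter table rows: keep rows whose text matches any keyword (| is regex OR)
keywords = "天鸽|台风|hato|HATO|Hato|帕卡|Pakhar|pakhar"
# na=False treats missing values as non-matches instead of raising an error
df = df.loc[df['content_cleaned'].str.contains(keywords, na=False)]
# List filtering example: convert the column to a list first
# (for illustration only; in practice you would use contains())
contents = df['content_cleaned'].to_list()
keywords = ["天鸽","台风","hato","HATO","Hato"]
contents2 = [content for content in contents if any(w in content for w in keywords)]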
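The keyword string above lists every capitalization by hand. A possible alternative, sketched here on hypothetical sample data, is to build the pattern from a list and let str.contains ignore case. The sample rows are invented for illustration; case and na are existing parameters of Series.str.contains.

```python
import re

import pandas as pd

# Hypothetical sample data; the column name follows the example above.
df = pd.DataFrame(
    {"content_cleaned": ["台风天鸽登陆", "Typhoon HATO update", "sunny day", None]}
)

keywords = ["天鸽", "台风", "hato", "帕卡", "pakhar"]

# re.escape guards against keywords that contain regex metacharacters.
pattern = "|".join(re.escape(k) for k in keywords)

# case=False matches Hato/HATO/hato alike;
# na=False counts missing values as non-matches.
mask = df["content_cleaned"].str.contains(pattern, case=False, na=False)
matched = df.loc[mask]
```

Building the pattern with join keeps the keyword list reusable for the list-comprehension variant as well.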

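For AND logic (every keyword must be present), combine one str.contains() mask per keyword with & in pandas, and replace any() with all() in the list comprehension. Replacing | with & inside the pattern string does not work, because & has no special meaning in a regular expression.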
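The AND case can be sketched as follows, assuming hypothetical sample rows and a hypothetical list named required. One boolean mask is built per keyword and the masks are combined with &; the list version swaps any() for all().

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"content_cleaned": ["台风天鸽登陆", "台风来了", "天鸽"]})
required = ["天鸽", "台风"]

# Start from an all-True mask and AND in one condition per keyword.
mask = pd.Series(True, index=df.index)
for kw in required:
    # regex=False does a literal substring match for each keyword.
    mask &= df["content_cleaned"].str.contains(kw, regex=False, na=False)
rows_with_all = df.loc[mask]

# Plain-list equivalent: all() requires every keyword to be present.
contents = df["content_cleaned"].to_list()
contents_all = [c for c in contents if all(w in c for w in required)]
```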

List membership

List comprehension

newlist = [text for text in texts if any(w in text for w in keywords)]

To filter on whether a column's values belong to a list:

  • df2 = df1.loc[df1['ID'].isin(keywords)]
  • To negate, prefix the condition with ~. For example, filtering users:
vipusers = [...]
# Rows whose ID is in the VIP list
wb_vip = wb.loc[wb['ID'].isin(vipusers)]
# Negate with ~ to keep the rows whose ID is not in the list
wb_nonvip = wb.loc[~wb['ID'].isin(vipusers)]
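A self-contained version of the same idea, using invented sample data: the IDs and the contents of vipusers below are hypothetical, and the mask is stored once so it can be reused for both selections.

```python
import pandas as pd

# Hypothetical sample table and VIP list for illustration.
wb = pd.DataFrame({"ID": ["u1", "u2", "u3", "u4"], "text": ["a", "b", "c", "d"]})
vipusers = ["u2", "u4"]

# isin() gives one boolean per row: True where ID appears in vipusers.
is_vip = wb["ID"].isin(vipusers)

wb_vip = wb.loc[is_vip]       # rows from VIP users
wb_nonvip = wb.loc[~is_vip]   # ~ inverts the mask: everyone else
```

Unlike str.contains, isin compares whole values, so "u2" does not match an ID such as "u20".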