我可以在每个单元格都是列表列表的Python Pandas列上使用正则表达式搜索或匹配吗?

I have a somewhat large CSV file (>2,000 rows) I've read into Pandas and want to create a new indicator column based on whether or not a specific word appears in one of the data columns. I have been trying to use regex search, which may be overkill because the word will always appear split by spaces, but the cells of the DataFrame are lists of lists of strings. I've tried iterating using a double list comprehension, but there are errors, and I am also curious, as a Python newbie, if there is a generalized solution for an unspecified number of nested lists to flatten. Here is an example, where my end goal is a new column with 1 in the rows where the word 'saddle' appears anywhere in the cell from the selected column, and 0 if not.

我的DataFrame看起来像这样

import pandas as pd
import numpy as np

cycling = pd.DataFrame(
    {
        'qty' : [1,0,2,1,1],
        'item' : ['frame','frame',np.nan,'order including a saddle and other things','brake'],
        'desc' : [np.nan,['bike','wheel'],['bike',['tire','tube']],['saddle',['seatpost','bag']],['bike','brakes']]
    }
)

Here is the DataFrame

I can search the item column to achieve my goal, using this code (efficiency and other suggestions VERY welcome!!):

cycling['saddle1'] = [int(bool(re.search(r"saddle",x))) for x in cycling['item'].replace(np.nan,'missing')]

My original dataset has missing values that I want to resolve to 0 in the indicator column; I don't care about them otherwise. The above code works great for a column with strings in each cell the fourth row is correctly identified, but I can't modify it to work when the cell contains a list, or a list of lists, like the desc column. I tried:

cycling['saddle2'] = [int(bool(re.search(r"saddle",x))) for y in cycling['desc'].replace(np.nan,'missing') for x in y]

但我收到以下错误

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-4c72cdaa87a4> in <module>()
----> 1 cycling['saddle2'] = [int(bool(re.search(r"saddle",x))) for y in cycling['desc'].replace(np.nan,'missing') for x in y]
      2 cycling.head()

1 frames
/usr/lib/python3.6/re.py in search(pattern, string, flags)
    180     """Scan through string looking for a match to the pattern, returning
    181     a match object, or None if no match was found."""
--> 182     return _compile(pattern, flags).search(string)
    183 
    184 def sub(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

I think that error is that it doesn't like receiving non-strings for regex (maybe the unflattened lists?). Is there a way to search a column in Pandas for a specific word (possibly using regex), where some cells are lists of strings, some are lists of strings that also contain nested lists, and some cells are missing, to create an indicator column, with a 1 for anywhere it appears (whether it's nested or not), and 0 otherwise?

评论