均值分为两列,窗口时间为3个月,NaN时间少于3个月

I have to apply the mean calculation in this dataset by customer, account but this mean needs to be applied to each 3 months in these groups. For the customer A1200 that don't has 3 months, the result need to be NaN.

customer    account    month        invoice
C1000       A1100      2019-10-01   34000
                       2019-11-01   55000
                       2019-12-01   80000
            A1200      2019-10-01   90000
                       2019-11-01   55000
            A1300      2019-10-01   10000
                       2019-11-01   10000
                       2019-12-01   20000
C2000       A2100      2019-10-01   78000
                       2019-11-01   55000
                       2019-12-01   80000

我尝试使用此命令,但平均值看起来不正确。

df_3m.groupby(['customer','account']).mean()

Are there some ideias in pandas or pyspark?

评论
kautem
kautem

数据

+----------+---------+----------+----------+
| customer | account |  month   |  invoice |
+----------+---------+----------+----------+
| C1000    | A1100   | 01-10-19 |    34000 |
| C1000    | A1100   | 01-11-19 |    55000 |
| C1000    | A1100   | 01-12-19 |    80000 |
| C1000    | A1200   | 01-10-19 |    90000 |
| C1000    | A1200   | 01-11-19 |    55000 |
| C1000    | A1300   | 01-10-19 |    10000 |
| C1000    | A1300   | 01-11-19 |    10000 |
| C1000    | A1300   | 01-12-19 |    20000 |
| C2000    | A2100   | 01-10-19 |    78000 |
| C2000    | A2100   | 01-11-19 |    55000 |
| C2000    | A2100   | 01-12-19 |    80000 |
+----------+---------+----------+----------+

您的查询

res = df_3m.groupby(['customer','account']).mean()

Query to filter accounts with less than 3 months

lt_3 = df.groupby(['account']).count() >2

最后结果

res [lt_3]

结果

+----------+---------+--------------+
| customer | account |   invoice    |
+----------+---------+--------------+
| C1000    | A1100   | 56333.333333 |
|          | A1200   | NaN          |
|          | A1300   | 13333.333333 |
| C2000    | A2100   | 71000.000000 |
+----------+---------+--------------+
点赞
评论