我有一个问题。我想调查我的假设是否正确,可以说 receiver X
经常收到来自 consignor Y
的 express .
我分组了 receiver
和 consignor
为此目的。
但是,我现在想做的是,我想输出所有receivers
给自己。哪里最频繁consignor
包含所有 consignors
的 80% 以上在receiver
.
例如我们有receiver c
这是从consignor 2, 3
收到的什么。来自 consignor 2
只有 1 包来自 consignor 3
4 包。因此他总共收到了 5 个数据包。因此 consignor 3
接受所有 consignor
的 80%在 receiver c
内.
因此 receiver c
应该用表格表示。这应该在下面为所有其他人完成。
简单来说,我想要所有receivers
超过 80% 的 consignor
.
我该怎么做?
我想要什么
receiver consignor count_with_most_consignor count_all
c 3 4 5
d 2 4 5
数据框
consignor receiver
0 1 a
1 2 a
2 1 a
3 2 c
4 3 c
5 1 b
6 3 c
7 3 c
8 3 c
9 2 d
10 2 d
11 2 d
12 1 d
13 2 d
代码
import pandas as pd
d = {'consignor': [1, 2, 1, 2, 3, 1, 3, 3, 3, 2, 2, 2, 1, 2],
'receiver': ['a', 'a', 'a', 'c', 'c', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'd']}
df = pd.DataFrame(data=d)
print(df)
df_ = df.groupby(['receiver','consignor']).size().reset_index(name='counts').sort_values(by=['counts'], ascending=False)
print(df_)
X = 'c'
print(df_[df_['receiver'] == X].shape[1])
print(df_[df_['receiver'] == X])
出去
[OUT]
3
receiver consignor counts
4 c 3 4
3 c 2 1
最佳答案
交叉表
# frequency table
s = pd.crosstab(df['receiver'], df['consignor'])
# agg columns along axis=1
d = {
'consignor': s.idxmax(1),
'count_with_most_consignor': s.max(1),
'count_all': s.sum(1)
}
(
s.assign(**d) # assign the agg columns
.drop(s.columns, axis=1) # drop the unwanted columns
.query('count_with_most_consignor / count_all >= .8') # filter the rows
)
结果
consignor consignor count_with_most_consignor count_all
receiver
b 1 1 1
c 3 4 5
d 2 4 5
https://stackoverflow.com/questions/74170120/