选择数据
1 | import numpy as np |
既可以通过下标来截取 也可以通过标签来选取
运行结果:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A列的数据
2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int32 2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int32
0-3下标,20130102到20130104行
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11 A B C D
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
loc
1 | import numpy as np |
运行结果:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
以标签名义选择20130102
A 4
B 5
C 6
D 7
Name: 2013-01-02 00:00:00, dtype: int32
保留所有行 选取A B两列
A B
2013-01-01 0 1
2013-01-02 4 5
2013-01-03 8 9
2013-01-04 12 13
2013-01-05 16 17
2013-01-06 20 21
iloc
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
以下标选择 下标为3的数据
A 12
B 13
C 14
D 15
Name: 2013-01-04 00:00:00, dtype: int32
选下标3到5行 下标1到3列的数据 左包含右不包含
B C
2013-01-04 13 14
2013-01-05 17 18
条件选择
可以通过表达式来进行选择1
2
3
4
5
6
7
8
9
10
11
12import numpy as np
import pandas as pd
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("df数据")
print(df)
# Boolean indexing
print("以条件来筛选 选出df A列中所有大于8的数据 BCD列也会显示出来")
print(df[df.A > 8]) #以条件来筛选 选出df A列中所有大于8的数据 BCD列也会显示出来
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
以条件来筛选 选出df A列中所有大于8的数据 BCD列也会显示出来
A B C D
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
设置值
同样可以用loc和iloc来进行选取 并设置1
2
3
4
5
6
7
8
9
10
11
12
13
14
15import numpy as np
import pandas as pd
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("df数据")
print(df)
print("通过iloc选取设置")
df.iloc[2,2] = 1111 #以下标设置值
print(df)
print("通过loc选取设置")
df.loc['20130101','B'] = 222 #以标签设置值
print(df)
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
通过iloc选取设置
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 1111 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
通过loc选取设置
A B C D
2013-01-01 0 222 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 1111 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
条件语句也同样适用1
2
3
4
5
6
7
8
9
10
11
12import numpy as np
import pandas as pd
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("df数据")
print(df)
print("A列大于4的的所有行的数据都变为0")
df[df.A>4] = 0 #A列大于4的的所有行的数据都变为0
print(df)
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A列大于4的的所有行的数据都变为0
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 0 0 0 0
2013-01-04 0 0 0 0
2013-01-05 0 0 0 0
2013-01-06 0 0 0 01
2
3
4
5
6
7
8
9
10
11
12import numpy as np
import pandas as pd
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("df数据")
print(df)
print("A列大于4的数据都变为0")
df.A[df.A>4] = 0 #A列大于4的所有数据都变为0
print(df)
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A列大于4的数据都变为0
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 0 9 10 11
2013-01-04 0 13 14 15
2013-01-05 0 17 18 19
2013-01-06 0 21 22 23
注意这两个所选取的范围不同 变化也会不同
还可以新增一列1
2
3
4
5
6
7
8
9
10
11
12
13
14
15import numpy as np
import pandas as pd
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print("df数据")
print(df)
print("定义一个新的列F 值都为NAN")
df['F'] = np.nan #定义一个新的列F 值都为NAN
print(df)
print("设置一个新列E 行的标签要保持一致 赋值")
df['E'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130101',periods=6)) #设置一个新列E 行的标签要保持一致 赋值
print(df)
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
定义一个新的列F 值都为NAN
A B C D F
2013-01-01 0 1 2 3 NaN
2013-01-02 4 5 6 7 NaN
2013-01-03 8 9 10 11 NaN
2013-01-04 12 13 14 15 NaN
2013-01-05 16 17 18 19 NaN
2013-01-06 20 21 22 23 NaN
设置一个新列E 行的标签要保持一致 赋值
A B C D F E
2013-01-01 0 1 2 3 NaN 1
2013-01-02 4 5 6 7 NaN 2
2013-01-03 8 9 10 11 NaN 3
2013-01-04 12 13 14 15 NaN 4
2013-01-05 16 17 18 19 NaN 5
2013-01-06 20 21 22 23 NaN 6
导入导出数据
导入导出数据有很多 直接参考网站 都差不多
处理丢失数据
dropna 丢弃
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22df数据
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
丢掉行 1丢掉列 how={'any','all'} 该行有nan就丢掉
A B C D
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
丢掉行 1丢掉列 how={'any','all'} 只有该行所有数都为nan才丢掉
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
fillna 指定填充
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16df数据
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
将丢失的数据变为0
A B C D
2013-01-01 0 0.0 2.0 3
2013-01-02 4 5.0 0.0 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
isnull 判断缺失
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16df数据
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
是否有缺失 缺失为True 不缺失为False
A B C D
2013-01-01 False True False False
2013-01-02 False False True False
2013-01-03 False False False False
2013-01-04 False False False False
2013-01-05 False False False False
2013-01-06 False False False False
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10df数据
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
至少有一个为True 就意味着至少丢失了一个数据 True丢失了 False没有丢失
True
concat合并
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20df1
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
df2
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
df3
a b c d
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
res
0 1 2 3 4 5 6 7 8 9 10 11
0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
1 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
join参数
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31df1
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
df2
b c d e
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
对于标签不一样的 会把彼此没有的标签那一栏用nan填充 join默认outer模式
a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
进行裁剪 将两者标签相同的部分保留
b c d
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
按df1的索引 进行水平合并
a b c d a b c d
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 NaN 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 NaN 1.0 1.0 1.0
append方法
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46df1
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
df2
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
df3
a b c d
2 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0
4 2.0 2.0 2.0 2.0
将df2添加到df1并合并 重新排序
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
将df2,df3都添加到df1中 不重新排序
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
2 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0
4 2.0 2.0 2.0 2.0
设置一个s1矩阵并添加到df1中 重新排序
s1
a 1
b 2
c 3
d 4
dtype: int64
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 2.0 3.0 4.0
merge合并
简单的例子1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17import numpy as np
import pandas as pd
# merging two df by key/keys (may be used in database)
# simple example
left = pd.DataFrame({'key':['K0','K1','K2','K3'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})
print(left)
print(right)
res = pd.merge(left,right,on='key')
print(res)
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15 key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
两个key
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14 key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
indicator 显示以何做标签
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12 col1 col_left
0 0 a
1 1 b
col1 col_right
0 1 2
1 2 2
2 2 2
col1 col_left col_right _merge
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
按索引合并
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16 A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
处理重叠
1 | import numpy as np |
运行结果1
2
3
4
5
6
7
8
9
10
11 k age
0 K0 1
1 K1 2
2 K2 3
k age
0 K0 4
1 K0 5
2 K3 6
k age_boy age_girl
0 K0 1 4
1 K0 1 5
plot画图
简单举例 详细会单独写一篇1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.DataFrame(np.random.randn(1000,4),
# index=np.arange(1000),
columns=list("ABCD"))
data = data.cumsum()
ax = data.plot.scatter(x='A',y='B',color='DarkBlue',label="Class1")
data.plot.scatter(x='A',y='C',color='DarkGreen',label='Class2',ax=ax)
plt.show()
运行结果