选择数据

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])

print("df数据")
print(df)

print("A列的数据")
print(df['A'],df.A)
print("0-3下标,20130102到20130104行")
print(df[0:3],df['20130102':'20130104']) #可以通过下标来截取 也可以通过标签来选取

既可以通过下标来截取 也可以通过标签来选取

运行结果:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A列的数据
2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int32 2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int32
0-3下标,20130102到20130104行
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11 A B C D
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15

loc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])

print("df数据")
print(df)

# select by label: loc
print("以标签名义选择20130102")
print(df.loc['20130102']) #以标签的名义来选择
print("保留所有行 选取A B两列")
print(df.loc[:,['A','B']]) #保留所有行 选取A B两列

运行结果:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
以标签名义选择20130102
A 4
B 5
C 6
D 7
Name: 2013-01-02 00:00:00, dtype: int32
保留所有行 选取A B两列
A B
2013-01-01 0 1
2013-01-02 4 5
2013-01-03 8 9
2013-01-04 12 13
2013-01-05 16 17
2013-01-06 20 21

iloc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])

print("df数据")
print(df)

# select by position: iloc
print("以下标选择 下标为3的数据")
print(df.iloc[3]) #以下标选择
print("选下标3到5行 下标1到3列的数据 左包含右不包含")
print(df.iloc[3:5,1:3]) #选下标3到5行 下标1到3列的数据 左包含右不包含

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
以下标选择 下标为3的数据
A 12
B 13
C 14
D 15
Name: 2013-01-04 00:00:00, dtype: int32
选下标35行 下标13列的数据 左包含右不包含
B C
2013-01-04 13 14
2013-01-05 17 18

条件选择

可以通过表达式来进行选择

1
2
3
4
5
6
7
8
9
10
11
12
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])

print("df数据")
print(df)

# Boolean indexing
print("以条件来筛选 选出df A列中所有大于8的数据 BCD列也会显示出来")
print(df[df.A > 8]) #以条件来筛选 选出df A列中所有大于8的数据 BCD列也会显示出来

运行结果
1
2
3
4
5
6
7
8
9
10
11
12
13
df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
以条件来筛选 选出df A列中所有大于8的数据 BCD列也会显示出来
A B C D
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23

设置值

同样可以用loc和iloc来进行选取 并设置

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])

print("df数据")
print(df)

print("通过iloc选取设置")
df.iloc[2,2] = 1111 #以下标设置值
print(df)
print("通过loc选取设置")
df.loc['20130101','B'] = 222 #以标签设置值
print(df)

运行结果
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
通过iloc选取设置
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 1111 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
通过loc选取设置
A B C D
2013-01-01 0 222 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 1111 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23

条件语句也同样适用

1
2
3
4
5
6
7
8
9
10
11
12
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])

print("df数据")
print(df)

print("A列大于4的的所有行的数据都变为0")
df[df.A>4] = 0 #A列大于4的的所有行的数据都变为0
print(df)

运行结果
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A列大于4的的所有行的数据都变为0
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 0 0 0 0
2013-01-04 0 0 0 0
2013-01-05 0 0 0 0
2013-01-06 0 0 0 0

1
2
3
4
5
6
7
8
9
10
11
12
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])

print("df数据")
print(df)

print("A列大于4的数据都变为0")
df.A[df.A>4] = 0 #A列大于4的所有数据都变为0
print(df)

运行结果
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
A列大于4的数据都变为0
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 0 9 10 11
2013-01-04 0 13 14 15
2013-01-05 0 17 18 19
2013-01-06 0 21 22 23

注意这两个所选取的范围不同 变化也会不同

还可以新增一列

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])

print("df数据")
print(df)

print("定义一个新的列F 值都为NAN")
df['F'] = np.nan #定义一个新的列F 值都为NAN
print(df)
print("设置一个新列E 行的标签要保持一致 赋值")
df['E'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130101',periods=6)) #设置一个新列E 行的标签要保持一致 赋值
print(df)

运行结果
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
df数据
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
定义一个新的列F 值都为NAN
A B C D F
2013-01-01 0 1 2 3 NaN
2013-01-02 4 5 6 7 NaN
2013-01-03 8 9 10 11 NaN
2013-01-04 12 13 14 15 NaN
2013-01-05 16 17 18 19 NaN
2013-01-06 20 21 22 23 NaN
设置一个新列E 行的标签要保持一致 赋值
A B C D F E
2013-01-01 0 1 2 3 NaN 1
2013-01-02 4 5 6 7 NaN 2
2013-01-03 8 9 10 11 NaN 3
2013-01-04 12 13 14 15 NaN 4
2013-01-05 16 17 18 19 NaN 5
2013-01-06 20 21 22 23 NaN 6

导入导出数据

导入导出数据有很多 直接参考网站 都差不多

pandas_I/O

处理丢失数据

dropna 丢弃

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan #01列数据为空
df.iloc[1,2] = np.nan

print("df数据")
print(df)

print("丢掉行 1丢掉列 how={'any','all'} 该行有nan就丢掉")
print(df.dropna(axis=0,how='any')) #丢掉行 1丢掉列 how={'any','all'} 该行有nan就丢掉
print("丢掉行 1丢掉列 how={'any','all'} 只有该行所有数都为nan才丢掉")
print(df.dropna(axis=0,how='all')) #丢掉行 1丢掉列 how={'any','all'} 只有该行所有数都为nan才丢掉

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
df数据
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
丢掉行 1丢掉列 how={'any','all'} 该行有nan就丢掉
A B C D
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
丢掉行 1丢掉列 how={'any','all'} 只有该行所有数都为nan才丢掉
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23

fillna 指定填充

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan #01列数据为空
df.iloc[1,2] = np.nan

print("df数据")
print(df)

print("将丢失的数据变为0")
print(df.fillna(value=0)) #将丢失的数据变为0

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
df数据
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
将丢失的数据变为0
A B C D
2013-01-01 0 0.0 2.0 3
2013-01-02 4 5.0 0.0 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23

isnull 判断缺失

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan #01列数据为空
df.iloc[1,2] = np.nan

print("df数据")
print(df)

print("是否有缺失 缺失为True 不缺失为False")
print(df.isnull()) #是否有缺失 缺失为True 不缺失为False

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
df数据
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
是否有缺失 缺失为True 不缺失为False
A B C D
2013-01-01 False True False False
2013-01-02 False False True False
2013-01-03 False False False False
2013-01-04 False False False False
2013-01-05 False False False False
2013-01-06 False False False False

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np
import pandas as pd

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
df.iloc[0,1] = np.nan #01列数据为空
df.iloc[1,2] = np.nan

print("df数据")
print(df)

print("至少有一个为True 就意味着至少丢失了一个数据 True丢失了 False没有丢失")
print(np.any(df.isnull())==True) #至少有一个为True 就意味着至少丢失了一个数据 True丢失了 False没有丢失

运行结果

1
2
3
4
5
6
7
8
9
10
df数据
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
至少有一个为True 就意味着至少丢失了一个数据 True丢失了 False没有丢失
True

concat合并

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import numpy as np
import pandas as pd

# concatenating
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])

print("df1")
print(df1)
print("df2")
print(df2)
print("df3")
print(df3)

print("res")
res = pd.concat([df1,df2,df3],axis=1,ignore_index=True) #水平合并df1 df2 df3 0为垂直合并 ignore_index=True 就是忽略之前的标签 重新开始排序
print(res)

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
df1
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
df2
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
df3
a b c d
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
res
0 1 2 3 4 5 6 7 8 9 10 11
0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
1 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0

join参数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import numpy as np
import pandas as pd

# join,['inner','outer']
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'],index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['b','c','d','e'],index=[2,3,4])

print("df1")
print(df1)
print("df2")
print(df2)

print("对于标签不一样的 会把彼此没有的标签那一栏用nan填充 join默认outer模式")
res = pd.concat([df1,df2],join='outer') #对于标签不一样的 会把彼此没有的标签那一栏用nan填充 join默认outer模式
print(res)

print("进行裁剪 将两者标签相同的部分保留")
res = pd.concat([df1,df2],join='inner',ignore_index=True) #进行裁剪 将两者标签相同的部分保留
print(res)

print("按df1的索引 进行水平合并")
res = pd.concat([df1,df2.reindex_like(df1)],axis=1) #按df1的索引 进行水平合并
print(res)

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
df1
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
df2
b c d e
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
对于标签不一样的 会把彼此没有的标签那一栏用nan填充 join默认outer模式
a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
进行裁剪 将两者标签相同的部分保留
b c d
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
按df1的索引 进行水平合并
a b c d a b c d
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 NaN 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 NaN 1.0 1.0 1.0

append方法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import numpy as np
import pandas as pd

# append
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'],index=[2,3,4])

print("df1")
print(df1)
print("df2")
print(df2)
print("df3")
print(df3)


print("将df2添加到df1并合并 重新排序")
res = df1.append(df2,ignore_index=True)
print(res)

print("将df2,df3都添加到df1中 不重新排序")
res = df1.append([df2,df3])
print(res)

print("设置一个s1并添加到df1中 重新排序")
print("s1")
s1 = pd.Series([1,2,3,4],index=['a','b','c','d'])
print(s1)
res = df1.append(s1,ignore_index=True)
print(res)

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
df1
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
df2
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
df3
a b c d
2 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0
4 2.0 2.0 2.0 2.0
将df2添加到df1并合并 重新排序
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
将df2,df3都添加到df1中 不重新排序
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
2 2.0 2.0 2.0 2.0
3 2.0 2.0 2.0 2.0
4 2.0 2.0 2.0 2.0
设置一个s1矩阵并添加到df1中 重新排序
s1
a 1
b 2
c 3
d 4
dtype: int64
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 2.0 3.0 4.0

merge合并

简单的例子

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import numpy as np
import pandas as pd

# merging two df by key/keys (may be used in database)
# simple example

left = pd.DataFrame({'key':['K0','K1','K2','K3'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})

print(left)
print(right)
res = pd.merge(left,right,on='key')
print(res)

运行结果
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
  key   A   B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3

两个key

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import numpy as np
import pandas as pd

# merging two df by key/keys (may be used in database)
# consider two keys

left = pd.DataFrame({'key1':['K0','K0','K1','K2'],
'key2':['K0','K1','K0','K1'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key1':['K0','K1','K1','K2'],
'key2':['K0','K0','K0','K0'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})

print(left)
print(right)
#how=['inner','outer','left','right']
res = pd.merge(left,right,on=['key1','key2'],how='inner') #默认inner 只合并相同的数据
print(res)

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
  key1 key2   A   B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2

indicator 显示以何做标签

1
2
3
4
5
6
7
8
9
10
11
12
13
14
import numpy as np
import pandas as pd

# merging two df by key/keys (may be used in database)
# indicator
df1 = pd.DataFrame({'col1':[0,1],'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
print(df1)
print(df2)

res = pd.merge(df1,df2,on='col1',how='outer',indicator=True) #显示合并是以哪一个df作为标签
# give the indicator a custom name
# res = pd.merge(df1,df2,on='col1',how='outer',indicator='indicator_column')
print(res)

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
   col1 col_left
0 0 a
1 1 b
col1 col_right
0 1 2
1 2 2
2 2 2
col1 col_left col_right _merge
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only

按索引合并

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import numpy as np
import pandas as pd

# merging two df by key/keys (may be used in database)
# merged by index

left = pd.DataFrame({'A':['A0','A1','A2'],
'B':['B0','B1','B2']},
index=['K0','K1','K2'])
right = pd.DataFrame({'C':['C0','C2','C3'],
'D':['D0','D2','D3']},
index=['K0','K2','K3'])

print(left)
print(right)
#left_index and right_index
res = pd.merge(left,right,left_index=True,right_index=True,how='outer')
print(res)
res = pd.merge(left,right,left_index=True,right_index=True,how='inner')
print(res)

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
     A   B
K0 A0 B0
K1 A1 B1
K2 A2 B2
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2

处理重叠

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np
import pandas as pd

# merging two df by key/keys (may be used in database)
# handle overlapping
boys = pd.DataFrame({'k':['K0','K1','K2'],'age':[1,2,3]})
print(boys)

girls = pd.DataFrame({'k':['K0','K0','K3'],'age':[4,5,6]})
print(girls)

res = pd.merge(boys,girls,on='k',suffixes=['_boy','_girl'],how='inner')
print(res)

运行结果

1
2
3
4
5
6
7
8
9
10
11
    k  age
0 K0 1
1 K1 2
2 K2 3
k age
0 K0 4
1 K0 5
2 K3 6
k age_boy age_girl
0 K0 1 4
1 K0 1 5

plot画图

简单举例 详细会单独写一篇

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#plot data

# Series
# data = pd.Series(np.random.randn(1000),index=np.arange(1000))
# data = data.cumsum() #累加

#DataFrame
data = pd.DataFrame(np.random.randn(1000,4),
# index=np.arange(1000),
columns=list("ABCD"))
data = data.cumsum()
# print(data.head())
# data.plot()
# plot methods:
# 'bar': 柱状图
# 'hist'
# 'box'
# 'kde'
# 'scatter'
# hexbin'
# 'pie'
ax = data.plot.scatter(x='A',y='B',color='DarkBlue',label="Class1")
data.plot.scatter(x='A',y='C',color='DarkGreen',label='Class2',ax=ax)

plt.show()

运行结果

avatar