前言

李宏毅老师的宝可梦分类挺有意思的,所以我想尝试一下走这方面的分类,因为查看了一下网上并没有源代码,那就自己写试试吧,,数据集来源于kaggle。

分析数据

首先我们可以看一下我们的数据

abilities against_bug against_dark against_dragon against_electric against_fairy against_fight against_fire against_flying against_ghost percentage_male pokedex_number sp_attack sp_defense speed type1 type2 weight_kg generation is_legendary
0 [‘Overgrow’, ‘Chlorophyll’] 1.0 1.0 1.0 0.5 0.5 0.5 2.0 2.0 1.0 88.1 1 65 65 45 grass poison 6.9 1 0
1 [‘Overgrow’, ‘Chlorophyll’] 1.0 1.0 1.0 0.5 0.5 0.5 2.0 2.0 1.0 88.1 2 80 80 60 grass poison 13.0 1 0
2 [‘Overgrow’, ‘Chlorophyll’] 1.0 1.0 1.0 0.5 0.5 0.5 2.0 2.0 1.0 88.1 3 122 120 80 grass poison 100.0 1 0
3 [‘Blaze’, ‘Solar Power’] 0.5 1.0 1.0 1.0 0.5 1.0 0.5 1.0 1.0 88.1 4 60 50 65 fire NaN 8.5 1 0
4 [‘Blaze’, ‘Solar Power’] 0.5 1.0 1.0 1.0 0.5 1.0 0.5 1.0 1.0 88.1 5 80 65 80 fire NaN 19.0 1 0

然后通过pokemon_df.info()
了解一下数据的信息,可见有一些缺损数据,一共有801个信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 41 columns):
abilities 801 non-null object
against_bug 801 non-null float64
against_dark 801 non-null float64
against_dragon 801 non-null float64
against_electric 801 non-null float64
against_fairy 801 non-null float64
against_fight 801 non-null float64
against_fire 801 non-null float64
against_flying 801 non-null float64
against_ghost 801 non-null float64
against_grass 801 non-null float64
against_ground 801 non-null float64
against_ice 801 non-null float64
against_normal 801 non-null float64
against_poison 801 non-null float64
against_psychic 801 non-null float64
against_rock 801 non-null float64
against_steel 801 non-null float64
against_water 801 non-null float64
attack 801 non-null int64
base_egg_steps 801 non-null int64
base_happiness 801 non-null int64
base_total 801 non-null int64
capture_rate 801 non-null object
classfication 801 non-null object
defense 801 non-null int64
experience_growth 801 non-null int64
height_m 781 non-null float64
hp 801 non-null int64
japanese_name 801 non-null object
name 801 non-null object
percentage_male 703 non-null float64
pokedex_number 801 non-null int64
sp_attack 801 non-null int64
sp_defense 801 non-null int64
speed 801 non-null int64
type1 801 non-null object
type2 417 non-null object
weight_kg 781 non-null float64
generation 801 non-null int64
is_legendary 801 non-null int64
dtypes: float64(21), int64(13), object(7)
memory usage: 256.7+ KB

可见一共有800只宝可梦,去掉首行的话。

hp percentage_male pokedex_number sp_attack sp_defense speed weight_kg generation is_legendary
count 781.000000 801.000000 703.000000 801.000000 801.000000 801.000000 801.000000 781.000000 801.000000
mean 1.163892 68.958801 55.155761 401.000000 71.305868 70.911361 66.334582 61.378105 3.690387
std 1.080326 26.576015 20.261623 231.373075 32.353826 27.942501 28.907662 109.354766 1.930420
min 0.100000 1.000000 0.000000 1.000000 10.000000 20.000000 5.000000 0.100000 1.000000
25% 0.600000 50.000000 50.000000 201.000000 45.000000 50.000000 45.000000 9.000000 2.000000
50% 1.000000 65.000000 50.000000 401.000000 65.000000 66.000000 65.000000 27.300000 4.000000
75% 1.500000 80.000000 50.000000 601.000000 91.000000 90.000000 85.000000 64.800000 5.000000
max 14.500000 255.000000 100.000000 801.000000 194.000000 230.000000 180.000000 999.900000 7.000000
简单分布

然后画个图,这样我们对数据就有大概的了解,可见各种属性的分布情况![](10. Classification代码实现/image-20210813112559905.png)

传说宝可梦
1
2
3
0    731
1 70
Name: is_legendary, dtype: int64

计数传说宝可梦的数量
传说宝可梦类似于特殊值,因为和一般宝可梦不同,各种值都会异常高,因此我们预测的时候要排除传说宝可梦

分析属性

李宏毅老师的课里,是统计了水系一般系宝可梦的二分类问题,我们的数据里有一些宝可梦有两个属性,比如说草+水,因此我们得看看这两个属性都分别有哪些属性。

对属性1和属性2画图,发现最多的就是水系和一般系宝可梦

统计属性1+属性2结合一起的宝可梦属性分类图

可以用bie,也可以用pie来画图

可见属性1和属性2是有重叠的,有我们共同需要的水系和一般系统计。因此需要叠加统计属性1和属性2都是水系的宝可梦

数据处理

删除传说宝可梦
1
2
3
4
pokemon_df['classfication']
pokemon_df_ = pokemon_df
pokemon_df_ = pokemon_df_.drop(pokemon_df_[pokemon_df_['is_legendary']==1].index)
pokemon_df_
统计水系一般系宝可梦数量

这里的399号是属性1为一般,属性2为水,非常干扰分类,我把这个删了。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
abilities            ['Simple', 'Unaware', 'Moody']
against_bug 1
against_dark 1
against_dragon 1
against_electric 2
against_fairy 1
against_fight 2
against_fire 0.5
against_flying 1
against_ghost 0
against_grass 2
against_ground 1
against_ice 0.5
against_normal 1
against_poison 1
against_psychic 1
against_rock 1
against_steel 0.5
against_water 0.5
attack 85
base_egg_steps 3840
base_happiness 70
base_total 410
capture_rate 127
classfication Beaver Pokémon
defense 60
experience_growth 1000000
height_m 1
hp 79
japanese_name Beadaruビーダル
name Bibarel
percentage_male 50
pokedex_number 400
sp_attack 55
sp_defense 60
speed 71
type1 normal
type2 water
weight_kg 31.5
generation 4
is_legendary 0
Name: 399, dtype: object
1
2
3
4
##这里有个一般系+水系的,太干扰了,排除这个玩意儿

water2 = water2[~water2['name'].isin(['Bibarel'])]
water2['name']

最后统计完结,一共有123只宝可梦为水系, 106只一般系

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
6       Squirtle
7 Wartortle
8 Blastoise
53 Psyduck
54 Golduck
...
689 Skrelp
746 Mareanie
747 Toxapex
766 Wimpod
767 Golisopod
Name: name, Length: 123, dtype: object

15 Pidgey
16 Pidgeotto
17 Pidgeot
18 Rattata
19 Raticate
...
779 Drampa
666 Litleo
667 Pyroar
693 Helioptile
694 Heliolisk
Name: name, Length: 106, dtype: object

将水系加一般系结合在一起就是我们需要的数据集了

做标准化处理
1
2
3
4
5
6
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
labels = le.fit_transform(total['type1'])
print(len(le.classes_)) ## len() -> length, classes is number of classes
print(le.classes_) ## prints what those classes are from the labels

然后提取只有sp_defense,和defense的列来预测

1
2
3
4
5
6
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
numerical = pd.DataFrame(sc_X.fit_transform(total[['defense', 'sp_defense',
]]),columns=['defense','sp_defense'],index= total.index
)
numerical

得到如下数据

defense sp_defense
6 -0.054458 -0.070344
7 0.490912 0.546262
8 1.945231 1.895089
53 -0.672543 -0.609875
54 0.418196 0.546262
779 0.672702 0.970179
666 -0.308963 -0.455723
667 0.200048 0.006732
693 -1.217913 -0.879641
694 -0.527111 1.085793

重新排列数据

对于这个数据可见index不是按顺序来的,因此需要重新排序,并且把type1也作为分类的y属性连接

defense sp_defense type1
0 -0.054458 -0.070344 water
1 0.490912 0.546262 water
2 1.945231 1.895089 water
3 -0.672543 -0.609875 water
4 0.418196 0.546262 water
224 0.672702 0.970179 normal
225 -0.308963 -0.455723 normal
226 0.200048 0.006732 normal
227 -1.217913 -0.879641 normal
228 -0.527111 1.085793 normal
设置独热码

我们的y标签改为用one-hot label, 水就是1,一般就是0。改完后的y为

1
2
3
4
5
6
7
8
9
10
11
12
0      1
1 1
2 1
3 1
4 1
..
224 0
225 0
226 0
227 0
228 0
Name: water, Length: 229, dtype: uint8

改完后的x为

defense sp_defense
0 -0.054458 -0.070344
1 0.490912 0.546262
2 1.945231 1.895089
3 -0.672543 -0.609875
4 0.418196 0.546262
224 0.672702 0.970179
225 -0.308963 -0.455723
226 0.200048 0.006732
227 -1.217913 -0.879641
228 -0.527111 1.085793

229 rows × 2 columns

开始训练设置

打包好训练集测试集

1
2
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y,random_state = 2,test_size=0.4,stratify=y)
训练结果

训练模型的话,用分类模型knn做实验的结果是

1
2
Max train score 87.59124087591242 % and k = [1]
Max test score 64.13043478260869 % and k = [23]

画一下loss,train图可见train的分数一直提升不上去,说明模型太简单了。

增加模型训练特征

我试着可以增加一些参数,比如hp,attack,sp_attack看预测效果如何

训练集准确率提升明显,测试集也提升了,达到了百分之77.17准确率

1
2
Max test score 77.17391304347827 % and k = [9]
Max train score 100.0 % and k = [1]

增加到七个特征

之后有空再加入吧