Implementing Pokémon Classification and Analyzing Pokémon Data

Preface

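Prof. Hung-yi Lee's Pokémon classification example is quite interesting, so I wanted to try this kind of classification myself. I looked around online and could not find any source code, so I decided to write it on my own. The dataset comes from Kaggle.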

Analyzing the data

First, let's take a look at the data.

	abilities	against_bug	against_dark	against_dragon	against_electric	against_fairy	against_fight	against_fire	against_flying	against_ghost	…	percentage_male	pokedex_number	sp_attack	sp_defense	speed	type1	type2	weight_kg	generation
0	[‘Overgrow’, ‘Chlorophyll’]	1.0	1.0	1.0	0.5	0.5	0.5	2.0	2.0	1.0	…	88.1	1	65	65	45	grass	poison	6.9	1
1	[‘Overgrow’, ‘Chlorophyll’]	1.0	1.0	1.0	0.5	0.5	0.5	2.0	2.0	1.0	…	88.1	2	80	80	60	grass	poison	13.0	1
2	[‘Overgrow’, ‘Chlorophyll’]	1.0	1.0	1.0	0.5	0.5	0.5	2.0	2.0	1.0	…	88.1	3	122	120	80	grass	poison	100.0	1
3	[‘Blaze’, ‘Solar Power’]	0.5	1.0	1.0	1.0	0.5	1.0	0.5	1.0	1.0	…	88.1	4	60	50	65	fire	NaN	8.5	1
4	[‘Blaze’, ‘Solar Power’]	0.5	1.0	1.0	1.0	0.5	1.0	0.5	1.0	1.0	…	88.1	5	80	65	80	fire	NaN	19.0	1

Next, call pokemon_df.info() to get a summary of the data. Some columns have missing values, and there are 801 entries in total.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 41 columns):
abilities            801 non-null object
against_bug          801 non-null float64
against_dark         801 non-null float64
against_dragon       801 non-null float64
against_electric     801 non-null float64
against_fairy        801 non-null float64
against_fight        801 non-null float64
against_fire         801 non-null float64
against_flying       801 non-null float64
against_ghost        801 non-null float64
against_grass        801 non-null float64
against_ground       801 non-null float64
against_ice          801 non-null float64
against_normal       801 non-null float64
against_poison       801 non-null float64
against_psychic      801 non-null float64
against_rock         801 non-null float64
against_steel        801 non-null float64
against_water        801 non-null float64
attack               801 non-null int64
base_egg_steps       801 non-null int64
base_happiness       801 non-null int64
base_total           801 non-null int64
capture_rate         801 non-null object
classfication        801 non-null object
defense              801 non-null int64
experience_growth    801 non-null int64
height_m             781 non-null float64
hp                   801 non-null int64
japanese_name        801 non-null object
name                 801 non-null object
percentage_male      703 non-null float64
pokedex_number       801 non-null int64
sp_attack            801 non-null int64
sp_defense           801 non-null int64
speed                801 non-null int64
type1                801 non-null object
type2                417 non-null object
weight_kg            781 non-null float64
generation           801 non-null int64
is_legendary         801 non-null int64
dtypes: float64(21), int64(13), object(7)
memory usage: 256.7+ KB
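
The missing values that info() hints at can also be counted directly. This is a minimal sketch, assuming a tiny stand-in DataFrame (`toy_df`, with made-up values) instead of the real Kaggle file:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for pokemon_df; the values are illustrative, not from the dataset
toy_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "type2": ["poison", np.nan, np.nan, "water"],
    "height_m": [0.7, np.nan, 2.0, 1.0],
})

# Count missing entries per column and keep only the columns that have any
missing = toy_df.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Run against the real data, the same two lines should list height_m, weight_kg, percentage_male and type2, the only columns whose non-null count above is below 801.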

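So there are 801 Pokémon in total: the 801 entries are indexed from 0 to 800, and the header row is not counted as an entry.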

	height_m	hp	percentage_male	pokedex_number	sp_attack	sp_defense	speed	weight_kg	generation
count	781.000000	801.000000	703.000000	801.000000	801.000000	801.000000	801.000000	781.000000	801.000000
mean	1.163892	68.958801	55.155761	401.000000	71.305868	70.911361	66.334582	61.378105	3.690387
std	1.080326	26.576015	20.261623	231.373075	32.353826	27.942501	28.907662	109.354766	1.930420
min	0.100000	1.000000	0.000000	1.000000	10.000000	20.000000	5.000000	0.100000	1.000000
25%	0.600000	50.000000	50.000000	201.000000	45.000000	50.000000	45.000000	9.000000	2.000000
50%	1.000000	65.000000	50.000000	401.000000	65.000000	66.000000	65.000000	27.300000	4.000000
75%	1.500000	80.000000	50.000000	601.000000	91.000000	90.000000	85.000000	64.800000	5.000000
max	14.500000	255.000000	100.000000	801.000000	194.000000	230.000000	180.000000	999.900000	7.000000

Simple distributions

Then we plot the data to get a rough picture of how each attribute is distributed.![](10. Classification代码实现/image-20210813112559905.png)

Legendary Pokémon


0    731
1     70
Name: is_legendary, dtype: int64

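This counts the Legendary Pokémon.
Legendary Pokémon behave like outliers: unlike ordinary Pokémon, their stats are abnormally high across the board, so we exclude them when making predictions.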
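
The counts above look like the result of value_counts() on the is_legendary column. The call itself does not appear in the text, so the following is an assumed reconstruction, run on a made-up stand-in (`toy_df`) rather than the real data:

```python
import pandas as pd

# Illustrative stand-in for pokemon_df with only the column we need
toy_df = pd.DataFrame({"is_legendary": [0, 0, 1, 0, 1, 0]})

# Count ordinary (0) versus Legendary (1) Pokémon
legendary_counts = toy_df["is_legendary"].value_counts()
print(legendary_counts)
```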

Analyzing types

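In Prof. Hung-yi Lee's course, the task is a binary classification between Water-type and Normal-type Pokémon. Some Pokémon in our data have two types, for example Grass + Water, so we need to check which values appear in each of the two type columns.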

Plotting type1 and type2 shows that Water-type and Normal-type Pokémon are the most common.

A chart of Pokémon types with type1 and type2 counted together

Either a bar chart or a pie chart works for this plot.

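Type1 and type2 clearly overlap, and both contain the Water and Normal entries we need. So the Water count has to include Pokémon whose type1 or type2 is Water, and likewise for Normal.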
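
One way to count both type columns together is to stack them into a single Series before counting. This is only a sketch on a made-up stand-in (`toy_df`); the author's actual plotting code is not shown in the text:

```python
import pandas as pd

# Illustrative stand-in: type2 is missing for single-type Pokémon
toy_df = pd.DataFrame({
    "type1": ["water", "normal", "grass", "normal", "bug"],
    "type2": [None, "water", "poison", None, "water"],
})

# Stack both type columns, drop the missing second types, then count each type
combined_counts = pd.concat([toy_df["type1"], toy_df["type2"]]).dropna().value_counts()
print(combined_counts)
```

The resulting Series can then be drawn with `combined_counts.plot.bar()` or `combined_counts.plot.pie()`.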

Data processing

Removing Legendary Pokémon

# Drop every row flagged as Legendary and keep the rest in a new DataFrame
pokemon_df_ = pokemon_df.drop(pokemon_df[pokemon_df['is_legendary'] == 1].index)
pokemon_df_

Counting Water-type and Normal-type Pokémon

Row 399 here has type1 Normal and type2 Water, which interferes badly with the classification, so I removed it.

abilities            ['Simple', 'Unaware', 'Moody']
against_bug                                       1
against_dark                                      1
against_dragon                                    1
against_electric                                  2
against_fairy                                     1
against_fight                                     2
against_fire                                    0.5
against_flying                                    1
against_ghost                                     0
against_grass                                     2
against_ground                                    1
against_ice                                     0.5
against_normal                                    1
against_poison                                    1
against_psychic                                   1
against_rock                                      1
against_steel                                   0.5
against_water                                   0.5
attack                                           85
base_egg_steps                                 3840
base_happiness                                   70
base_total                                      410
capture_rate                                    127
classfication                        Beaver Pokémon
defense                                          60
experience_growth                           1000000
height_m                                          1
hp                                               79
japanese_name                           Beadaruビーダル
name                                        Bibarel
percentage_male                                  50
pokedex_number                                  400
sp_attack                                        55
sp_defense                                       60
speed                                            71
type1                                        normal
type2                                         water
weight_kg                                      31.5
generation                                        4
is_legendary                                      0
Name: 399, dtype: object

## There is one Normal + Water Pokémon here; it interferes too much, so exclude it

water2 = water2[~water2['name'].isin(['Bibarel'])]
water2['name']

With the counting done, there are 123 Water-type Pokémon and 106 Normal-type Pokémon.

6       Squirtle
7      Wartortle
8      Blastoise
53       Psyduck
54       Golduck
         ...    
689       Skrelp
746     Mareanie
747      Toxapex
766       Wimpod
767    Golisopod
Name: name, Length: 123, dtype: object

15         Pidgey
16      Pidgeotto
17        Pidgeot
18        Rattata
19       Raticate
          ...    
779        Drampa
666        Litleo
667        Pyroar
693    Helioptile
694     Heliolisk
Name: name, Length: 106, dtype: object

Combining the Water-type and Normal-type sets gives the dataset we need.
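
The selection and combination steps are not shown in full, so here is a hedged sketch of how they might look. The helper `with_type`, the stand-in `toy_df`, and the choice to drop Bibarel from both groups and to overwrite type1 with the group label are all assumptions for illustration:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "name": ["Squirtle", "Skrelp", "Bibarel", "Pidgey", "Litleo", "Bulbasaur"],
    "type1": ["water", "poison", "normal", "normal", "fire", "grass"],
    "type2": [None, "water", "water", "flying", "normal", "poison"],
})

def with_type(df, type_name):
    """Rows whose type1 or type2 equals type_name (hypothetical helper)."""
    return df[(df["type1"] == type_name) | (df["type2"] == type_name)]

# Bibarel is Normal + Water, so it is removed from both groups (assumption)
water = with_type(toy_df, "water")
water = water[~water["name"].isin(["Bibarel"])]
normal = with_type(toy_df, "normal")
normal = normal[~normal["name"].isin(["Bibarel"])]

# Overwrite type1 with the group label so it can serve as the target column
total = pd.concat([water.assign(type1="water"), normal.assign(type1="normal")])
print(total[["name", "type1"]])
```

Keeping the original index at this stage matches the listings above, where row numbers such as 689 or 779 still refer to the full table.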

Standardization

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
labels = le.fit_transform(total['type1'])
print(len(le.classes_)) ## len() -> length, classes is number of classes
print(le.classes_) ## prints what those classes are from the labels
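
To see what LabelEncoder does with these two classes, here is a small sketch on a hand-written list (`types` is made up for illustration). The encoder sorts the class names, so "normal" becomes 0 and "water" becomes 1:

```python
from sklearn.preprocessing import LabelEncoder

types = ["water", "normal", "water", "normal", "normal"]

le = LabelEncoder()
labels = le.fit_transform(types)

# classes_ is sorted, so "normal" -> 0 and "water" -> 1
print(list(le.classes_), list(labels))
# inverse_transform maps the integers back to the original strings
print(list(le.inverse_transform([0, 1])))
```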

Then extract only the sp_defense and defense columns to use for prediction.

import pandas as pd
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
# Standardize the two features and keep the original row index
numerical = pd.DataFrame(sc_X.fit_transform(total[['defense', 'sp_defense']]),
                         columns=['defense', 'sp_defense'], index=total.index)
numerical

This gives the following data

	defense	sp_defense
6	-0.054458	-0.070344
7	0.490912	0.546262
8	1.945231	1.895089
53	-0.672543	-0.609875
54	0.418196	0.546262
…	…	…
779	0.672702	0.970179
666	-0.308963	-0.455723
667	0.200048	0.006732
693	-1.217913	-0.879641
694	-0.527111	1.085793
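
The scaled values come from z = (x - mean) / std, computed per column with the population standard deviation. The sketch below checks this by hand on a small made-up frame (`toy`), not on the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative values with a non-contiguous index, like the real table
toy = pd.DataFrame({"defense": [65, 80, 100, 48], "sp_defense": [64, 80, 105, 50]},
                   index=[6, 7, 8, 53])

scaled = StandardScaler().fit_transform(toy)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

Note that pandas defaults to ddof=1 in std(), so the explicit ddof=0 is needed to match StandardScaler.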

Reordering the data

The index of this data is not sequential, so it needs to be reset, and type1 is attached as the y column for classification.

	defense	sp_defense	type1
0	-0.054458	-0.070344	water
1	0.490912	0.546262	water
2	1.945231	1.895089	water
3	-0.672543	-0.609875	water
4	0.418196	0.546262	water
…	…	…	…
224	0.672702	0.970179	normal
225	-0.308963	-0.455723	normal
226	0.200048	0.006732	normal
227	-1.217913	-0.879641	normal
228	-0.527111	1.085793	normal
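
A possible way to do this step, sketched on a three-row stand-in (the names `numerical`, `labels` and `data` here are illustrative and the values are rounded from the table above):

```python
import pandas as pd

numerical = pd.DataFrame(
    {"defense": [-0.05, 0.49, 0.67], "sp_defense": [-0.07, 0.55, 0.97]},
    index=[6, 7, 779],
)
labels = pd.Series(["water", "water", "normal"], index=[6, 7, 779], name="type1")

# Join on the shared original index first, then renumber the rows 0..n-1
data = pd.concat([numerical, labels], axis=1).reset_index(drop=True)
print(data)
```

Concatenating before resetting the index matters: pd.concat aligns rows by index, so resetting only one of the two pieces first would pair features with the wrong labels or introduce NaN rows.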

Setting up one-hot encoding

We change the y label to a one-hot label: Water is 1 and Normal is 0. After the change, y is

0      1
1      1
2      1
3      1
4      1
      ..
224    0
225    0
226    0
227    0
228    0
Name: water, Length: 229, dtype: uint8

After the change, x is

	defense	sp_defense
0	-0.054458	-0.070344
1	0.490912	0.546262
2	1.945231	1.895089
3	-0.672543	-0.609875
4	0.418196	0.546262
…	…	…
224	0.672702	0.970179
225	-0.308963	-0.455723
226	0.200048	0.006732
227	-1.217913	-0.879641
228	-0.527111	1.085793

229 rows × 2 columns
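
The series name "water" and dtype uint8 in the output suggest that y was taken from pd.get_dummies, though the call is not shown. A sketch under that assumption, on a made-up three-row `data` frame:

```python
import pandas as pd

data = pd.DataFrame({
    "defense": [-0.05, 0.49, 0.67],
    "sp_defense": [-0.07, 0.55, 0.97],
    "type1": ["water", "water", "normal"],
})

# get_dummies makes one indicator column per class; keep only the "water" column.
# The cast keeps the uint8 dtype on newer pandas versions, which return bool.
y = pd.get_dummies(data["type1"])["water"].astype("uint8")
x = data[["defense", "sp_defense"]]
print(y.tolist())
```

With only two classes, a single indicator column already carries all the label information, which is why the "normal" column can be dropped.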

Training setup

Split the data into a training set and a test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=2, test_size=0.4, stratify=y)
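
To check what this split produces, the sketch below repeats the same call on synthetic data. The features are random numbers, an assumption for illustration; only the class sizes (123 Water, 106 Normal) are taken from the text:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class sizes as the text: 123 Water (1), 106 Normal (0)
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.normal(size=(229, 2)), columns=["defense", "sp_defense"])
y = pd.Series([1] * 123 + [0] * 106, name="water")

X_train, X_test, y_train, y_test = train_test_split(
    x, y, random_state=2, test_size=0.4, stratify=y
)

# stratify=y keeps the Water share nearly equal in both parts
print(len(X_train), len(X_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Stratifying is a sensible choice with only 229 samples, since a purely random 60/40 split could leave the two classes noticeably unbalanced between the training and test sets.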