Machine Learning

ゼロからやさしくはじめるPython入門　クジラ飛行机

Machine learning library: scikit-learn (Machine Learning in Python, built on NumPy, SciPy, and matplotlib)

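Task: iris classification

Given four input features x of an iris flower (sepal length, sepal width, petal length, and petal width), determine the species y (0: Iris-Setosa, 1: Iris-Versicolour, 2: Iris-Virginica).

y = f(p, x)    input x: 4-dimensional vector; output y: 0 (setosa), 1 (versicolor), 2 (virginica); p: parameters
Using labeled (supervised) training data, the parameters p are adjusted so that the function f gives the correct answers.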
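
To make y = f(p, x) concrete, here is a minimal sketch that assumes f is a linear scoring function followed by an argmax. The names `f`, `W`, `b` and the random parameter values are hypothetical illustrations, not the classifier trained later in this notebook. Training would adjust `p` until the outputs match the labels.

```python
import numpy as np

def f(p, x):
    """Return the class index (0, 1, or 2) with the highest linear score."""
    W, b = p                      # W: (3, 4) weights, b: (3,) biases
    scores = W @ x + b            # one score per class
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
p = (rng.normal(size=(3, 4)), np.zeros(3))   # untrained, arbitrary parameters
x = np.array([5.1, 3.5, 1.4, 0.2])           # one iris sample
y = f(p, x)
print(y)   # some value in {0, 1, 2}; meaningless until p is trained
```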

from IPython.display import Image
Image("./Iris_setosa.jpg", width=200)  # Iris setosa

Image("./Iris_versicolor.jpg", width=200)  # Iris versicolor

Image("./Iris_virginica.jpg", width=200)  # Iris virginica

1) Loading the dataset

# Import the datasets module from sklearn
from sklearn import datasets

# Load the iris dataset from datasets and bind it to the name iris
iris = datasets.load_iris()

# Print the keys of the iris dataset
print(iris.keys())

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

# Print the description (DESCR) of the iris dataset
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

# iris.target_names: names of the classes
print('target_names:', iris.target_names)
# iris.feature_names: names of the features
print('feature_names:', iris.feature_names)

target_names: ['setosa' 'versicolor' 'virginica']
feature_names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

# Sepal length and width, petal length and width (cm): each iris is a point in a 4-dimensional vector space.
print('size of iris.data =', len(iris.data))
iris.data[:10]  # first 10 samples

size of iris.data = 150

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

print('size of iris.target =', len(iris.target))    # number of target labels (supervised labels)
print('iris.target\n', iris.target)    # labels: iris species

size of iris.target = 150
iris.target
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
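
The printed labels show that the samples are stored grouped by class. A short sketch to confirm the array shapes and the per-class counts, using `np.bincount` (a NumPy function not used elsewhere in this notebook):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)             # (150, 4): 150 samples, 4 features
print(iris.target.shape)           # (150,)
counts = np.bincount(iris.target)  # number of samples per class label
print(counts)                      # [50 50 50]
```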

2) Inspecting the data

Projecting the iris data onto 2-dimensional space

gL = [ x[0] for x in iris.data]    # sepal length
gW = [ x[1] for x in iris.data]    # sepal width
hL = [ x[2] for x in iris.data]    # petal length
hW = [ x[3] for x in iris.data]    # petal width
print('gL:', gL[:10] )   # first 10 sepal lengths
print('gW:', gW[:10] )   # first 10 sepal widths
print('hL:', hL[:10] )   # first 10 petal lengths
print('hW:', hW[:10] )   # first 10 petal widths

gL: [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
gW: [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1]
hL: [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]
hW: [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1]
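
The four list comprehensions above can also be written with NumPy slicing. This sketch assumes it is acceptable for the columns to be NumPy arrays rather than Python lists; `gL_loop` is a name introduced here only for the comparison.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# Transposing gives one row per feature, so the four columns unpack directly
gL, gW, hL, hW = iris.data.T
gL_loop = [x[0] for x in iris.data]    # the list-comprehension version
print(np.array_equal(gL, gL_loop))     # True
print(gL[:3])                          # [5.1 4.9 4.7]
```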

%matplotlib inline
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(16, 8))    # 16 x 8 inches (1600 x 800 px at 100 dpi)

fig.add_subplot(231)
plt.title('gL-gW')
plt.scatter(gL, gW, c=iris.target)
plt.colorbar()

fig.add_subplot(232)
plt.title('hL-hW')
plt.scatter(hL, hW, c=iris.target)
plt.colorbar()

fig.add_subplot(233)
plt.title('gL-hL')
plt.scatter(gL, hL, c=iris.target)
plt.colorbar()

fig.add_subplot(234)
plt.title('gW-hW')
plt.scatter(gW, hW, c=iris.target)
plt.colorbar()

fig.add_subplot(235)
plt.title('gL-hW')
plt.scatter(gL, hW, c=iris.target)
plt.colorbar()

fig.add_subplot(236)
plt.title('gW-hL')
plt.scatter(gW,hL, c=iris.target)
plt.colorbar()

plt.show()
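
The six panels repeat the same three calls, so they can be generated in a loop. A possible sketch follows. The helper name `plot_feature_pairs` is hypothetical, the panels come out in the order produced by `itertools.combinations` (not the order used above), and the colorbars are omitted for brevity.

```python
from itertools import combinations

import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
names = ['gL', 'gW', 'hL', 'hW']          # same short names as above

def plot_feature_pairs(data, target):
    """Scatter-plot all 6 pairs of the 4 features on a 2 x 3 grid."""
    fig, axes = plt.subplots(2, 3, figsize=(16, 8))
    pairs = list(combinations(range(4), 2))   # (0, 1), (0, 2), ..., (2, 3)
    for ax, (i, j) in zip(axes.ravel(), pairs):
        ax.scatter(data[:, i], data[:, j], c=target)
        ax.set_title(f'{names[i]}-{names[j]}')
    return fig

fig = plot_feature_pairs(iris.data, iris.target)
```

In a notebook the returned figure is rendered inline; in a script, call `plt.show()` afterwards.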

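Projecting the iris data onto 3-dimensional space

# 3D scatter plot
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection (only needed on older Matplotlib)

fig = plt.figure()
# Axes3D(fig) no longer attaches the axes to the figure in recent Matplotlib, so use add_subplot
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel("gW")
ax.set_ylabel("hL")
ax.set_zlabel("hW")
ax.scatter(gW, hL, hW, c = iris.target)
plt.show()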

3) Shuffling the data, then splitting it into training and test sets

iris.data.tolist()[:10]    # convert the array to a list

[[5.1, 3.5, 1.4, 0.2],
 [4.9, 3.0, 1.4, 0.2],
 [4.7, 3.2, 1.3, 0.2],
 [4.6, 3.1, 1.5, 0.2],
 [5.0, 3.6, 1.4, 0.2],
 [5.4, 3.9, 1.7, 0.4],
 [4.6, 3.4, 1.4, 0.3],
 [5.0, 3.4, 1.5, 0.2],
 [4.4, 2.9, 1.4, 0.2],
 [4.9, 3.1, 1.5, 0.1]]

iris.target[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

import random, numpy as np

# iris.data.tolist() converts the array to a list; zip pairs each sample in iris.data with its species in iris.target
d = list(zip(iris.data.tolist(), iris.target))
d[:10]

[([5.1, 3.5, 1.4, 0.2], 0),
 ([4.9, 3.0, 1.4, 0.2], 0),
 ([4.7, 3.2, 1.3, 0.2], 0),
 ([4.6, 3.1, 1.5, 0.2], 0),
 ([5.0, 3.6, 1.4, 0.2], 0),
 ([5.4, 3.9, 1.7, 0.4], 0),
 ([4.6, 3.4, 1.4, 0.3], 0),
 ([5.0, 3.4, 1.5, 0.2], 0),
 ([4.4, 2.9, 1.4, 0.2], 0),
 ([4.9, 3.1, 1.5, 0.1], 0)]

random.shuffle(d)    # the original data is grouped by species, so shuffle it to mix the classes
d[:10]

[([4.7, 3.2, 1.3, 0.2], 0),
 ([6.3, 2.5, 5.0, 1.9], 2),
 ([4.6, 3.4, 1.4, 0.3], 0),
 ([6.9, 3.2, 5.7, 2.3], 2),
 ([6.9, 3.1, 5.1, 2.3], 2),
 ([5.8, 2.7, 3.9, 1.2], 1),
 ([5.4, 3.4, 1.7, 0.2], 0),
 ([7.1, 3.0, 5.9, 2.1], 2),
 ([6.5, 2.8, 4.6, 1.5], 1),
 ([7.7, 3.8, 6.7, 2.2], 2)]
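
`random.shuffle` gives a different order on every run, so the split and the scores below change from run to run. If reproducibility matters, one option is a seeded generator. A sketch follows, where the helper `shuffled` and the three sample pairs are illustrative only:

```python
import random

# One (sample, label) pair per species, in the same format as d
pairs = [([5.1, 3.5, 1.4, 0.2], 0), ([7.0, 3.2, 4.7, 1.4], 1), ([6.3, 3.3, 6.0, 2.5], 2)]

def shuffled(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items

a = shuffled(pairs, seed=0)
b = shuffled(pairs, seed=0)
print(a == b)                      # True: same seed, same order
print(sorted(a) == sorted(pairs))  # True: pairs stay intact, only the order changes
```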

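train_data = d[0:120]    # use 4/5 of the data for training
test_data = d[120:150]    # use 1/5 of the data for testing

data_train, target_train = zip(*train_data)    # train_data is a list of 120 (sample, label) pairs
# zip(*train_data) undoes the earlier zip: the * unpacks the list so that each
# pair is passed to zip as a separate argument, and zip regroups the samples
# into one tuple and the labels into another.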

data_train[:10]

([4.7, 3.2, 1.3, 0.2],
 [6.3, 2.5, 5.0, 1.9],
 [4.6, 3.4, 1.4, 0.3],
 [6.9, 3.2, 5.7, 2.3],
 [6.9, 3.1, 5.1, 2.3],
 [5.8, 2.7, 3.9, 1.2],
 [5.4, 3.4, 1.7, 0.2],
 [7.1, 3.0, 5.9, 2.1],
 [6.5, 2.8, 4.6, 1.5],
 [7.7, 3.8, 6.7, 2.2])

target_train[:10]

(0, 2, 0, 2, 2, 1, 0, 2, 1, 2)

data_test, target_test = zip(*test_data)  # the * passes each element of the iterable to zip as a separate argument
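
A small stand-alone illustration of the `zip(*...)` idiom, using two made-up pairs. Note that the `*` in a function call (argument unpacking) is a different feature from the `*` on an assignment target (which collects the remaining elements into a list).

```python
pairs = [([4.7, 3.2, 1.3, 0.2], 0), ([6.3, 2.5, 5.0, 1.9], 2)]

# Argument unpacking: identical to zip(pairs[0], pairs[1])
data, target = zip(*pairs)
print(data)    # ([4.7, 3.2, 1.3, 0.2], [6.3, 2.5, 5.0, 1.9])
print(target)  # (0, 2)

# Starred assignment, for contrast: the starred name collects the rest
first, *rest = [1, 2, 3]
print(first, rest)  # 1 [2, 3]
```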

# Instead of the steps above, sklearn's train_test_split function does the same shuffle-and-split more concisely
from sklearn.model_selection import train_test_split as split
x_train, x_test, y_train, y_test = split(iris.data, iris.target, train_size=0.8, test_size = 0.2)
list(zip(x_train, y_train))[:10]

[(array([5.8, 2.7, 5.1, 1.9]), 2),
 (array([6.5, 3. , 5.8, 2.2]), 2),
 (array([7.2, 3.6, 6.1, 2.5]), 2),
 (array([4.9, 3.1, 1.5, 0.1]), 0),
 (array([4.8, 3. , 1.4, 0.3]), 0),
 (array([6.8, 2.8, 4.8, 1.4]), 1),
 (array([6.1, 2.8, 4. , 1.3]), 1),
 (array([6.7, 3. , 5.2, 2.3]), 2),
 (array([6.3, 2.3, 4.4, 1.3]), 1),
 (array([4.9, 3.6, 1.4, 0.1]), 0)]
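
`train_test_split` also accepts `random_state` and `stratify`. The notebook above uses neither, so the following is an optional variant rather than the author's setup: it makes the split reproducible and keeps the three classes equally represented in both parts.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.2,
    random_state=0,         # fixed seed: the same split on every run
    stratify=iris.target,   # keep class proportions equal in both parts
)
print(np.bincount(y_train))  # [40 40 40]
print(np.bincount(y_test))   # [10 10 10]
```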

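4) Machine learning with an SVM (support vector machine)

A support vector machine is, at its core, a method for building a two-class pattern classifier from a linear threshold unit. The unit's parameters are learned from the training samples by finding the maximum-margin hyperplane, that is, the separating hyperplane whose distance to the nearest training points is as large as possible (cf. the hyperplane separation theorem). When the classes are not linearly separable, a kernel function is used to transform the variables.
https://data-science.gr.jp/implementation/iml_sklearn_svm.html
https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python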
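
To illustrate the maximum-margin idea in the two-class case, the sketch below fits a linear-kernel SVC to setosa and versicolor only. This setup is an assumption made for illustration: the notebook itself trains on all three classes with the RBF kernel, and the large `C` value is chosen here to approximate a hard margin.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
mask = iris.target < 2                 # keep setosa (0) and versicolor (1) only
X, y = iris.data[mask], iris.target[mask]

lin = svm.SVC(kernel='linear', C=1000.0)   # large C approximates a hard margin
lin.fit(X, y)

w, b = lin.coef_[0], lin.intercept_[0]     # hyperplane: w . x + b = 0
margin = 2.0 / np.linalg.norm(w)           # width of the margin band
print('training accuracy:', lin.score(X, y))
print('number of support vectors:', len(lin.support_vectors_))
print('margin width:', margin)
```

Only the support vectors, the training points lying on the edge of the margin, determine the hyperplane; the remaining samples could be removed without changing it.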

from sklearn import svm
# Create an SVC classifier object and bind it to the name clf (classifier)
clf = svm.SVC(gamma = 'auto')  # scikit-learn 0.20/0.21 warns if gamma is omitted; from 0.22 the default is 'scale'
# Train clf on the training data
clf.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

# Classify the test data with the trained classifier clf
y_pred = clf.predict(x_test)
print('y_pred =', y_pred)
print('y_test =', y_test)
# Accuracy
result = list(y_pred == y_test).count(True)/len(y_test)
print('accuracy =', result)
# Show the misclassified samples
for i in range(len(y_test)):
    if y_pred[i] != y_test[i]:
        print(i+1, 'misclassified:', x_test[i], 'predicted:', y_pred[i], ', actual:', y_test[i])

y_pred = [1 2 0 2 2 2 0 1 1 2 0 1 2 1 1 1 2 0 0 0 1 0 1 2 0 0 2 2 2 1]
y_test = [1 2 0 2 2 2 0 1 1 2 0 1 2 1 1 1 2 0 0 0 1 0 1 2 0 0 2 2 2 1]
accuracy = 1.0
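
The hand-computed ratio above is the same quantity returned by `sklearn.metrics.accuracy_score` and by the classifier's own `score` method. A self-contained sketch comparing the three (the `random_state=0` split is an assumption made here so the example is reproducible; the resulting value depends on the split):

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

clf = svm.SVC(gamma='auto').fit(x_train, y_train)
y_pred = clf.predict(x_test)

manual = (y_pred == y_test).mean()            # fraction of matching labels
by_metric = accuracy_score(y_test, y_pred)    # same value from sklearn.metrics
by_score = clf.score(x_test, y_test)          # predicts internally, then scores
print(manual, by_metric, by_score)
```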

# Given a single iris sample, classify it and output its species
xdata = x_test[8]       # take one iris sample
print('xdata:', xdata)
clf.predict([xdata])    # clf.predict: predict with the classifier clf

xdata: [6.7 3.1 4.7 1.5]

array([1])
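
`predict` returns an integer label; `iris.target_names` turns it into a species name. A minimal sketch, which for brevity trains on all 150 samples (unlike the train/test split used above):

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC(gamma='auto').fit(iris.data, iris.target)  # all samples, for brevity

xdata = iris.data[0]                    # a setosa sample: [5.1 3.5 1.4 0.2]
label = clf.predict([xdata])[0]         # integer class label
name = iris.target_names[label]         # look up the species name
print(label, name)
```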

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(y_test)
print(y_pred)
print(cm)

import seaborn as sbn
sbn.heatmap(cm)

[1 2 0 2 2 2 0 1 1 2 0 1 2 1 1 1 2 0 0 0 1 0 1 2 0 0 2 2 2 1]
[1 2 0 2 2 2 0 1 1 2 0 1 2 1 1 1 2 0 0 0 1 0 1 2 0 0 2 2 2 1]
[[ 9  0  0]
 [ 0 10  0]
 [ 0  0 11]]
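
In the matrix returned by `confusion_matrix`, rows are the true classes and columns are the predicted classes, so a perfect classifier produces a diagonal matrix like the one above. Because that result contains no errors, here is a small example with made-up labels (`y_true` and `y_hat` are not from the notebook) that shows where mistakes appear:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_hat  = [0, 1, 1, 1, 2, 0]   # made-up predictions with two mistakes
cm = confusion_matrix(y_true, y_hat)
print(cm)
# [[1 1 0]    one class-0 sample was predicted as class 1
#  [0 2 0]
#  [1 0 1]]   one class-2 sample was predicted as class 0
accuracy = np.trace(cm) / cm.sum()   # correct predictions lie on the diagonal
print(accuracy)                      # 4 of 6 correct
```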

<matplotlib.axes._subplots.AxesSubplot at 0x25d86baf188>

5) What happens if the data is not shuffled?

# Prepare unshuffled training and test data
n = 40
nx_train = np.concatenate([iris.data[0:n], iris.data[50:50+n], iris.data[100: 100+n] ])
nx_test = np.concatenate([iris.data[n:50], iris.data[50+n:100], iris.data[100+n: 150] ])
ny_train =  np.concatenate([iris.target[0:n], iris.target[50:50+n], iris.target[100: 100+n] ])
ny_test = np.concatenate([iris.target[n:50], iris.target[50+n:100], iris.target[100+n: 150] ])
ny_test

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2])

# Create an SVC classifier object and bind it to the name clf2 (classifier)
clf2 = svm.SVC(gamma = 'auto')  # scikit-learn 0.20/0.21 warns if gamma is omitted; from 0.22 the default is 'scale'
clf2.fit(nx_train, ny_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

ny_pred = clf2.predict(nx_test)
print('ny_pred =', ny_pred)
print('ny_test =', ny_test)
# Accuracy
result2 = list(ny_pred == ny_test).count(True)/len(ny_test)
print('accuracy =', result2)
# Show the misclassified samples
for i in range(len(ny_test)):
    if ny_pred[i] != ny_test[i]:
        print(i+1, 'misclassified:', nx_test[i], 'predicted:', ny_pred[i], ', actual:', ny_test[i])

ny_pred = [0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2]
ny_test = [0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2]
accuracy = 1.0

ny_train # confirm that the data has not been shuffled

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
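
The split above is unshuffled but still takes 40 training and 10 test samples from every class, so the order of the rows does not hurt. The situation is different if the `d[0:120]` / `d[120:150]` slicing from step 3 is applied without the shuffle. A sketch of that naive variant (the names `ux_train`, `uy_train`, `ux_test`, `uy_test`, and `clf3` are introduced here for illustration):

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
# Naive split with no shuffling: first 120 rows for training, last 30 for testing
ux_train, uy_train = iris.data[:120], iris.target[:120]
ux_test, uy_test = iris.data[120:], iris.target[120:]

print(np.bincount(uy_train))   # [50 50 20]: class 2 is under-represented
print(np.unique(uy_test))      # [2]: the test set contains only virginica

clf3 = svm.SVC(gamma='auto').fit(ux_train, uy_train)
print('accuracy =', clf3.score(ux_test, uy_test))   # measured on class 2 only
```

Whatever score this prints, it says nothing about how well setosa and versicolor are recognized, because neither class appears in the test set.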