
[AI] 2-3. Optimizer

Rain Hu

Gradient Descent

y_pred = dot(W, x)
loss_value = loss(y_pred, y_true)

If we treat W as the only variable, we can view loss_value as a function of W.

loss_value = loss(dot(W, x), y_true) = f(W)

Suppose the current value of W is W0. Then the derivative of f at W0 is

\( f'(W_0)=\left.\frac{d(\text{loss\_value})}{dW}\right|_{W=W_0} \)

We write the expression above as

g = grad(loss_value, W0)

This tensor gives the direction of steepest ascent of loss_value = f(W) near W0, so we can reduce f(W) by moving in the direction opposite to the slope.

if g > 0:
    W' = W0 - dW
elif g < 0:
    W' = W0 + dW

Substituting g directly into the update rule, we introduce \(\eta\) (eta), which can be thought of as the step size of each move:

W' = W0 - g * eta
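As a minimal sketch of this update rule (my own illustration, not from the original example), here is gradient descent on the toy function f(w) = (w - 3)^2, whose gradient 2(w - 3) we compute by hand:

# Gradient descent on a toy function: f(w) = (w - 3)^2, minimum at w = 3.
def f_grad(w):
    return 2 * (w - 3)   # derivative computed analytically

w = 0.0       # initial value W0
eta = 0.1     # step size (learning rate)
for step in range(50):
    g = f_grad(w)
    w = w - eta * g      # W' = W0 - g * eta

print(w)  # close to 3 after enough steps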
Illustration

(Figure: gradient_descent)

Stochastic Gradient Descent
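In stochastic (mini-batch) gradient descent, each update is computed on a small random batch of samples rather than on the whole dataset, which makes every step cheap while still pointing, on average, in the right direction. A minimal sketch of one such update step, assuming a hypothetical loss_gradient(W, x, y_true) helper that returns dL/dW for the given batch:

import numpy as np

# Mini-batch SGD sketch: each step uses a random subset of the data.
# `loss_gradient` is a hypothetical helper, not defined in this post.
def sgd_step(W, x, y_true, eta, batch_size, loss_gradient):
    idx = np.random.choice(len(x), size=batch_size, replace=False)  # draw a random mini-batch
    g = loss_gradient(W, x[idx], y_true[idx])                       # gradient on that batch only
    return W - eta * g                                               # same update rule as before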

Backpropagation

The functions above are simple, and their derivatives (gradients) are easy to compute. In practice, however, we need to handle the gradients of much more complex functions.

Chain Rule

Backpropagation starts from the derivatives of simple operations (e.g. addition, relu, or tensor products) and from them obtains the gradient of a complex composition of these simple operations.
For example, as in the diagram below, we use two dense layers as transformations:

\( \begin{array}{ccccc} && \text{Input data X} && \\ && \downarrow && \\ \boxed{\text{Weights } W_1, b_1} & \rightarrow & \boxed{\text{Layer (relu)}} && \\ \uparrow && \downarrow && \\ \boxed{\text{Weights } W_2, b_2} & \rightarrow & \boxed{\text{Layer (softmax)}} && \\ \uparrow && \downarrow && \\ && \boxed{\text{Prediction Y'}} & \rightarrow \; \boxed{\text{Loss function}} & \leftarrow \; \boxed{\text{Ground truth Y}} \\ &&& \downarrow & \\ && \boxed{\text{Optimizer}} & \leftarrow \; \boxed{\text{Loss score}} & \end{array} \)

We can express these functions as:

y1 = relu(dot(W1, X)+b1)
y2 = softmax(dot(W2, y1)+b2)
loss_value = loss(y_true, y2)

loss_value = loss(y_true, softmax(dot(W2, relu(dot(W1, X)+b1))+b2))

We can obtain the derivative of a chained function through the chain rule. Suppose we have two functions \(f\) and \(g\); their composition \(fg\) satisfies \(fg(x)=f(g(x))\).

def fg(x):
    x1 = g(x)
    y = f(x1)
    return y

According to the chain rule:

\( y=f(x_1,x_2,x_3,\ldots,x_n),\quad \frac{dy}{dx}=\frac{dy}{dx_1}\times\frac{dx_1}{dx_2}\times\frac{dx_2}{dx_3}\times\cdots\times\frac{dx_n}{dx} \)

In other words, we only need to compute \(\frac{dy}{dx_1}\times\frac{dx_1}{dx_2}\times\frac{dx_2}{dx_3}\times\cdots\times\frac{dx_n}{dx}\) to obtain the derivative of the composite function \(y\).
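As a small sketch (my own illustration, with an arbitrary concrete choice of \(f\) and \(g\)), we can check the chain rule numerically for the fg defined above:

# Numerical check of the chain rule for a concrete f and g (illustrative choice).
def g(x):
    return x ** 2          # g(x) = x^2,  g'(x) = 2x

def f(x1):
    return 3 * x1 + 1      # f(x1) = 3*x1 + 1,  f'(x1) = 3

def fg(x):
    x1 = g(x)
    y = f(x1)
    return y

x = 2.0
h = 1e-6
numeric = (fg(x + h) - fg(x - h)) / (2 * h)   # numerical derivative of the composition
chain_rule = 3 * (2 * x)                      # dy/dx1 * dx1/dx = f'(g(x)) * g'(x)
print(numeric, chain_rule)                    # both are (approximately) 12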

In Detail

Forward Propagation
For a single neuron, the input \(x\) is linearly combined with the weight \(w\) and the bias \(b\), then passed through the activation function relu:
\( z = wx + b \)
\( y = \text{relu}(z) \)

Loss Function
Assume mean squared error (MSE) is used as the loss function:
\( L = \frac{1}{2}(y - y_\text{true})^2 \)

Backward Propagation
Use the chain rule to compute the partial derivative of the loss with respect to the weight:
\( \frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w} \)
Breaking down each term:

\( \frac{\partial L}{\partial y} = y - y_\text{true} \qquad \frac{\partial y}{\partial z} = \text{relu}'(z) \qquad \frac{\partial z}{\partial w} = x \)

Therefore:
\( \frac{\partial L}{\partial w} = (y - y_\text{true}) \cdot \text{relu}'(z) \cdot x \)

Weight Update
Update the weight with gradient descent:
\( w' = w - \eta \frac{\partial L}{\partial w} \)

  • During forward propagation, the computation order is \( z \rightarrow y \rightarrow L \).
  • When deriving the gradients, the computation runs in exactly the reverse order, which is why this process is called "backpropagation": \( \frac{\partial L}{\partial y} \rightarrow \frac{\partial y}{\partial z} \rightarrow \frac{\partial z}{\partial w} \).
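Putting the four steps above together, here is a minimal sketch of one training step for a single neuron (my own illustration; the input, target, initial weight, and learning rate are arbitrary):

# Single-neuron forward pass, manual backpropagation, and weight update (illustrative values).
x, y_true = 2.0, 1.0          # input and target
w, b, eta = 0.2, 0.0, 0.1     # initial weight, bias, learning rate

# Forward propagation
z = w * x + b                 # z = wx + b
y = max(z, 0.0)               # y = relu(z)
L = 0.5 * (y - y_true) ** 2   # MSE loss

# Backward propagation (chain rule)
dL_dy = y - y_true            # ∂L/∂y
dy_dz = 1.0 if z > 0 else 0.0 # ∂y/∂z = relu'(z)
dz_dw = x                     # ∂z/∂w
dL_dw = dL_dy * dy_dz * dz_dw # ∂L/∂w

# Weight update
w = w - eta * dL_dw           # w' = w - η ∂L/∂w
print(L, dL_dw, w)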

Nowadays, modern frameworks let us build neural networks with automatic differentiation, such as TensorFlow's GradientTape, so we can focus solely on the forward pass.

TensorFlow's GradientTape

With TensorFlow's GradientTape, we can obtain derivatives quickly:

import tensorflow as tf
x = tf.Variable(0.)
with tf.GradientTape() as tape:
  y = 2 * x + 3

grad = tape.gradient(y, x)
grad

>>> <tf.Tensor: shape=(), dtype=float32, numpy=2.0>
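tape.gradient also accepts a list of variables and returns one gradient per variable, which is how the training loop below obtains the gradients of all model weights at once. A short sketch continuing the snippet above (shapes chosen arbitrarily for illustration):

W = tf.Variable(tf.random.uniform((2, 2)))
b = tf.Variable(tf.zeros((2,)))
x_batch = tf.random.uniform((3, 2))
with tf.GradientTape() as tape:
  y = tf.matmul(x_batch, W) + b
  loss = tf.reduce_mean(tf.square(y))

grad_W, grad_b = tape.gradient(loss, [W, b])  # one gradient per variable, in the same order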

Implementing a Simple Model

Dense Layer

import tensorflow as tf

class NaiveDense:
  def __init__(self, input_size, output_size, activation):
    self.activation = activation
    # W: shape (input_size, output_size), initialized with small random values
    w_shape = (input_size, output_size)
    w_initial_value = tf.random.uniform(w_shape, minval=0, maxval=0.1)
    self.W = tf.Variable(w_initial_value)
    # b: shape (output_size,), initialized with zeros
    b_shape = (output_size, )
    b_initial_value = tf.zeros(b_shape)
    self.b = tf.Variable(b_initial_value)

  def __call__(self, inputs):
    # output = activation(inputs · W + b)
    return self.activation(tf.matmul(inputs, self.W) + self.b)

  @property
  def weights(self):
    return [self.W, self.b]
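A quick usage check (my own example, with a dummy batch of 2 inputs):

layer = NaiveDense(input_size=28*28, output_size=512, activation=tf.nn.relu)
dummy = tf.random.uniform((2, 28*28))
print(layer(dummy).shape)  # (2, 512)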

Sequential

class NaiveSequential:
  def __init__(self, layers):
    self.layers = layers

  def __call__(self, inputs):
    x = inputs
    for layer in self.layers:
      x = layer(x)
    return x

  @property
  def weights(self):
    weights = []
    for layer in self.layers:
      weights += layer.weights
    return weights

Build Model

model = NaiveSequential([
    NaiveDense(input_size=28*28, output_size=512, activation=tf.nn.relu),
    NaiveDense(input_size=512, output_size=10, activation=tf.nn.softmax)
])
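As a quick sanity check (my own addition), the model should expose four weight tensors, W and b for each of the two layers:

assert len(model.weights) == 4  # W1, b1, W2, b2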

Batch Generator

import math

class BatchGenerator:
  def __init__(self, images, labels, batch_size=128):
    assert len(images) == len(labels)
    self.index = 0
    self.images = images
    self.labels = labels
    self.batch_size = batch_size
    self.num_batches = math.ceil(len(images) / batch_size)

  def next(self):
    # Return the next batch and advance the index
    images = self.images[self.index : self.index + self.batch_size]
    labels = self.labels[self.index : self.index + self.batch_size]
    self.index += self.batch_size
    return images, labels

Update Weights

# Manual update: plain (mini-batch) SGD
def update_weights(gradients, weights, learning_rate=1e-3):
  for g, w in zip(gradients, weights):
    w.assign_sub(g * learning_rate)

# Equivalent update using a built-in Keras optimizer
from tensorflow.keras import optimizers

optimizer = optimizers.SGD(learning_rate=1e-3)

def update_weights(gradients, weights):
  optimizer.apply_gradients(zip(gradients, weights))

Training

def one_training_step(model, images_batch, labels_batch):
  with tf.GradientTape() as tape:
    # y' = f(wx+b)
    predictions = model(images_batch)
    # per-sample loss between y_true and y'
    per_sample_losses = tf.keras.losses.sparse_categorical_crossentropy(
        labels_batch, predictions)
    # average the per-sample losses over the batch
    average_loss = tf.reduce_mean(per_sample_losses)
  # g = grad(L, w)
  gradients = tape.gradient(average_loss, model.weights)
  # w' = w - ηg
  update_weights(gradients, model.weights)
  return average_loss

def fit(model, images, labels, epochs, batch_size=128):
  for epoch_counter in range(epochs):
    print(f"Epoch {epoch_counter}")
    batch_generator = BatchGenerator(images, labels, batch_size=batch_size)
    for batch_counter in range(batch_generator.num_batches):
      images_batch, labels_batch = batch_generator.next()
      loss = one_training_step(model, images_batch, labels_batch)
      if batch_counter % 100 == 0:
        print(f"loss at batch {batch_counter}: {loss:.2f}")

from tensorflow.keras.datasets import mnist
(train_images, train_labels),(test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28*28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28*28))
test_images = test_images.astype("float32") / 255

fit(model, train_images, train_labels, epochs=10, batch_size=128)

Evaluation

import numpy as np

predictions = model(test_images)
predictions = predictions.numpy()

predicted_labels = np.argmax(predictions, axis=1)
matches = predicted_labels == test_labels
print(f"accuracy: {matches.mean():.2f}")
