A quaternion is a four-dimensional hypercomplex number that can represent an arbitrary orientation of frame B relative to frame A.
The coordinate system {A} is rotated by θ relative to the coordinate system {B}
Angle
normalized four-dimensional vector
the quaternion parametrization obeys
hypercomplex numbers {i,j,k}
A quaternion conjugate
A quaternion product ⊗
Hamilton rule
proof
Euler axis
Vector's Elements
Half Angle Identities
Norm
Inverse
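As a small illustration of the operations listed above (conjugate, Hamilton product, norm, inverse), here is a minimal NumPy sketch; the scalar-first component order q = [q0, q1, q2, q3] and the function names are assumptions made for illustration.

```python
import numpy as np

def quat_conj(q):
    # Conjugate negates the vector part: q* = [q0, -q1, -q2, -q3]
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_prod(a, b):
    # Hamilton product a (x) b, scalar-first convention
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return np.array([
        a0*b0 - a1*b1 - a2*b2 - a3*b3,
        a0*b1 + a1*b0 + a2*b3 - a3*b2,
        a0*b2 - a1*b3 + a2*b0 + a3*b1,
        a0*b3 + a1*b2 - a2*b1 + a3*b0,
    ])

def quat_inv(q):
    # Inverse = conjugate / squared norm; equals the conjugate for unit quaternions
    return quat_conj(q) / np.dot(q, q)

# Unit quaternion for a rotation of theta about the z axis (half-angle form)
theta = np.pi / 3
q = np.array([np.cos(theta/2), 0.0, 0.0, np.sin(theta/2)])
print(quat_prod(q, quat_inv(q)))   # ~ [1, 0, 0, 0], the identity quaternion
```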
A three dimensional vector
Derivative of Quaternion
Orthogonal matrix
Transpose = Inverse
A three dimensional vector
Inner Product Space
Representation of orientation
The coordinate system {A} is rotated by θ relative to the coordinate system {B}
Rotation Matrix: Direction Cosine Matrix
Properties of rotation matrix
Z-Y-X Euler Angles (moving coordinate frame)
→ Since every rotation is a relative transformation about the moving coordinate frame, the corresponding transformation matrices are multiplied sequentially from the front.
Roll-Pitch-Yaw (fixed coordinate frame)
→ Rotate by φ about the X axis of the fixed reference frame {A}, then by θ about the Y axis of the fixed frame, then by ψ about its Z axis.
→ Since every rotation is an absolute transformation about the fixed reference frame, the corresponding transformation matrices are multiplied in reverse order, from back to front.
Rotation Matrix
→the three RPY angles are (θ∈(-π/2,π/2))
→or (θ∈(π/2,3 π/2))
x,y,z
Rotation Matrix
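A minimal sketch of the elementary rotation matrices about the x, y, and z axes and of the fixed-frame Roll-Pitch-Yaw composition described above; the example angle values and the extraction formulas for θ ∈ (-π/2, π/2) are illustrative assumptions.

```python
import numpy as np

def Rx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

phi, theta, psi = 0.1, 0.5, -0.3   # roll, pitch, yaw (example values)

# Fixed-frame Roll-Pitch-Yaw: rotate about X, then Y, then Z of the fixed frame,
# so the matrices are multiplied from back to front: R = Rz @ Ry @ Rx.
R = Rz(psi) @ Ry(theta) @ Rx(phi)

# Recover the RPY angles for theta in (-pi/2, pi/2)
theta_r = np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0]))
phi_r   = np.arctan2(R[2, 1], R[2, 2])
psi_r   = np.arctan2(R[1, 0], R[0, 0])
print(phi_r, theta_r, psi_r)   # ~ (0.1, 0.5, -0.3)
```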
Euler Angle (Roll φ, Pitch θ, Yaw ψ)
Rotation Matrix
Rotation Matrix
Orientation
Half-Angle Identities
Orientation from angular rate
A tri-axis gyroscope measures the angular rate
Quaternion
Measured at time t
Orientation of the earth frame at time t
The sampling period Δt
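A minimal sketch of integrating the orientation quaternion from the gyroscope angular rate over the sampling period Δt; the scalar-first quaternion convention, the quat_prod helper from the earlier sketch, and the example rates are assumptions.

```python
import numpy as np

def quat_prod(a, b):
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return np.array([
        a0*b0 - a1*b1 - a2*b2 - a3*b3,
        a0*b1 + a1*b0 + a2*b3 - a3*b2,
        a0*b2 - a1*b3 + a2*b0 + a3*b1,
        a0*b3 + a1*b2 - a2*b1 + a3*b0,
    ])

def integrate_gyro(q, omega, dt):
    """One integration step: q_t = q_{t-1} + q_dot * dt, then renormalize.
    omega is the angular rate [wx, wy, wz] in rad/s measured at time t."""
    q_dot = 0.5 * quat_prod(q, np.array([0.0, *omega]))   # quaternion derivative
    q = q + q_dot * dt                                     # numerical integration
    return q / np.linalg.norm(q)                           # keep it a unit quaternion

q = np.array([1.0, 0.0, 0.0, 0.0])        # initial orientation
omega = np.array([0.0, 0.0, np.pi / 2])   # 90 deg/s about z
dt = 0.01                                  # sampling period (s)
for _ in range(100):                       # integrate 1 second of data
    q = integrate_gyro(q, omega, dt)
print(q)   # ~ rotation of 90 degrees about z: [cos(45deg), 0, 0, sin(45deg)]
```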
Gyroscope bias drift compensation: the bias drifts with temperature and motion
Kalman-based approaches → estimate the gyroscope bias
Mahony et al. → compensate gyroscope bias drift through the integral feedback of the error
Normalized direction of the estimated error in the rate of change of orientation
Gyroscope bias
The integral gain ζ
DC component of
Gyroscope measurements
Filter gains
Filter gain
Estimated rate of gyroscope bias drift in each axis
Filter gain
Jacobian matrix and determinant
Gradient descent (Linear system)
Orientation from vector observations
A tri-axis accelerometer (linear accelerations due to motion)
A tri-axis magnetometer (Local magnetic flux and distortions)
Quaternion
Predefined reference direction
Measured direction
Objective function
Quaternion may be found
Gradient descent algorithm
Orientation estimation of
Step-size: α
Gradient of the solution surface (general form)
Objective function
Jacobian matrix
Gradient of the solution surface (General form)
Calculation induction
Direction of gravity (vertical axis, z axis)
Appropriate convention (the equations simplify)
Normalized accelerometer measurement
Objective function
Jacobian matrix
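A hedged sketch of the gravity-only objective function and its Jacobian, in the form used by Madgwick-style gradient-descent orientation filters (an assumption about which filter the outline follows); the scalar-first quaternion and the earth-frame gravity reference [0, 0, 1] are also assumptions.

```python
import numpy as np

def f_gravity(q, acc):
    """Objective function: rotated reference gravity minus the normalized
    accelerometer measurement acc = [ax, ay, az]. q = [q0, q1, q2, q3], scalar first."""
    q0, q1, q2, q3 = q
    ax, ay, az = acc
    return np.array([
        2.0 * (q1 * q3 - q0 * q2) - ax,
        2.0 * (q0 * q1 + q2 * q3) - ay,
        2.0 * (0.5 - q1**2 - q2**2) - az,
    ])

def J_gravity(q):
    """Jacobian of f_gravity with respect to the quaternion components."""
    q0, q1, q2, q3 = q
    return np.array([
        [-2.0 * q2,  2.0 * q3, -2.0 * q0, 2.0 * q1],
        [ 2.0 * q1,  2.0 * q0,  2.0 * q3, 2.0 * q2],
        [ 0.0,      -4.0 * q1, -4.0 * q2, 0.0     ],
    ])

q = np.array([1.0, 0.0, 0.0, 0.0])
acc = np.array([0.0, 0.0, 1.0])            # already normalized, pure gravity
grad = J_gravity(q).T @ f_gravity(q, acc)  # gradient of the solution surface
print(grad)                                # zero: q already aligns gravity with +z
```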
Earth's magnetic field (one horizontal axis, vertical axis)
Appropriate convention (the equations simplify)
Normalized magnetometer measurement
Objective function
Jacobian matrix
Magnetic distortion compensation
Declination errors: in the horizontal plane relative to the earth's surface (heading)
Inclination errors: in the vertical plane relative to the earth's surface (sensor's attitude)
The measured direction of the earth's magnetic field in the earth frame at time t
Solution surface → minimum
Objective function
Jacobian matrix
Gradient of the solution surface (General form)
Estimate orientation
Convergence rate governed by
Objective function gradient
Optimal value of the step-size
Filter fusion algorithm
Estimated orientation
η,
Magnitude of a quaternion derivative == Gyroscope measurement error
Optimal fusion
Simplify Estimated orientation
Estimated orientation rate
Rate of change of orientation measured by gyroscopes
Direction of the estimated error
Simplified to equation
Estimated rate of change of orientation
Quaternion derivative measured at time t
Direction of error of
Objective function
Jacobian matrix
Gradient of the solution surface (General form)
Calculation induction
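Putting the pieces together, a minimal sketch of one fusion step: the gyroscope-derived quaternion derivative corrected by the normalized gradient of the accelerometer objective, scaled by a filter gain β. The helpers repeat the earlier sketches, and the gain value, measurements, and function names are illustrative assumptions.

```python
import numpy as np

# Helpers from the earlier sketches (Hamilton product, gravity objective, Jacobian).
def quat_prod(a, b):
    a0, a1, a2, a3 = a; b0, b1, b2, b3 = b
    return np.array([a0*b0 - a1*b1 - a2*b2 - a3*b3,
                     a0*b1 + a1*b0 + a2*b3 - a3*b2,
                     a0*b2 - a1*b3 + a2*b0 + a3*b1,
                     a0*b3 + a1*b2 - a2*b1 + a3*b0])

def f_gravity(q, a):
    q0, q1, q2, q3 = q
    return np.array([2*(q1*q3 - q0*q2) - a[0],
                     2*(q0*q1 + q2*q3) - a[1],
                     2*(0.5 - q1**2 - q2**2) - a[2]])

def J_gravity(q):
    q0, q1, q2, q3 = q
    return np.array([[-2*q2, 2*q3, -2*q0, 2*q1],
                     [ 2*q1, 2*q0,  2*q3, 2*q2],
                     [ 0.0, -4*q1, -4*q2, 0.0]])

def fusion_step(q, gyro, acc, dt, beta=0.1):
    """One IMU fusion step: gyroscope integration corrected by the normalized
    gradient of the accelerometer objective, scaled by the filter gain beta."""
    acc = acc / np.linalg.norm(acc)                # normalized measurement
    grad = J_gravity(q).T @ f_gravity(q, acc)      # gradient of the solution surface
    grad = grad / np.linalg.norm(grad)             # direction of the estimated error
    q_dot = 0.5 * quat_prod(q, np.array([0.0, *gyro])) - beta * grad
    q = q + q_dot * dt                             # integrate the estimated rate
    return q / np.linalg.norm(q)

q = np.array([1.0, 0.0, 0.0, 0.0])
q = fusion_step(q, gyro=np.array([0.01, -0.02, 0.03]),
                acc=np.array([0.02, 0.01, 9.8]), dt=0.01)
print(q)
```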
IMU (Inertial Measurement Unit) algorithm
Least Squares LR (Linear Regression)
Multivariate Regression
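A minimal least-squares linear-regression sketch using the normal equations; the synthetic data and variable names are illustrative assumptions.

```python
import numpy as np

# Least-squares fit of y = w0 + w1*x via the normal equations w = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)   # noisy line

X = np.column_stack([np.ones_like(x), x])            # design matrix with bias column
w = np.linalg.solve(X.T @ X, X.T @ y)                # solves the normal equations
print(w)   # ~ [2.0, 0.5]
```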
Magnetic sensor: soft/hard iron distortion, elliptical sphere (ellipsoid) vs. sphere
Sphere
Elliptical Sphere
Artificial Intelligence (AI)
Machine Learning
Deep Learning
Function Composition
Fully-connected network (FC-net)
Convolutional neural network (CNN)
Recurrent neural network (RNN)
Gradient Explosion
Linear Models
Universal Approximation
Model = Approximation
Linear Models
Linear regression
Regularization
Binary Classification
Logistic Regression
Linear models for regression
Problem Setup
Basis Function
Linear models
Neural networks
Kernel regression
Least Squares Method
Regularization
Ridge Regression
Find λ
Illustration
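A minimal ridge-regression sketch: the closed-form solution with an L2 penalty λ, plus a simple held-out sweep to find λ; the data and names are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: w = (X^T X + lam*I)^{-1} X^T y.
    The penalty lam shrinks the weights and stabilizes ill-conditioned problems."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
w_true = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
y = X @ w_true + rng.normal(scale=0.1, size=30)

# "Find lambda": sweep candidate values and compare held-out error.
for lam in [0.0, 0.1, 1.0, 10.0]:
    w = ridge_fit(X[:20], y[:20], lam)
    err = np.mean((X[20:] @ w - y[20:])**2)
    print(lam, err)
```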
Logistic Regression
Error function
Cross entropy error
Example
Logistic Regression:MLE
Logistic Regression:IRLS
Multiclass Extension: Softmax Regression
Feedforward Nets
Linear classification: a linear discriminant function, which has the form f[x] = wᵀx + b
Decision rule is given by sgn[f[x]]
Separating hyperplane
Perceptron: a single-layer neural network
The first iterative algorithm for learning linear classification
Perceptron convergence Theorem
Perceptron Criterion
objective function
Gradient descent
Perceptron Learning: A Basic Idea
Perceptron Learning: Algorithm Outline
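A minimal sketch of the perceptron learning algorithm outlined above (update on misclassified samples until a separating hyperplane is found); the toy data and names are illustrative assumptions.

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, epochs=100):
    """Perceptron learning: for each misclassified sample (sgn(w.x) != y),
    update w <- w + lr * y * x. Converges if the data are linearly separable."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:      # misclassified (perceptron criterion)
                w += lr * yi * xi
                errors += 1
        if errors == 0:                       # separating hyperplane found
            break
    return w

# Toy linearly separable data; the first column is a constant bias feature.
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(w, np.sign(X @ w))   # predictions match y
```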
McCulloch-Pitts Model
Activation Functions
Multilayer Perceptron (MLP) Structure: bipartite
Square loss (error)
Semantic space
Backpropagation
Image Classification
Training Optimization for deep learning
Gradient descent
A first-order iterative optimization algorithm for finding a local minimum of the objective function J[θ]
Moves from the current values of parameters,
(Full) Batch Gradient Descent ⇔ Vanilla Gradient Descent: resorts to the entire training dataset to compute the gradient of the objective function
The accuracy of the parameter update is high, but it can be slow
Intractable for datasets that do not fit in memory
Does not allow us to update the model online (with new examples on the fly)
Mini-Batch Gradient Descent ⇔ Stochastic Gradient Descent (SGD)
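A minimal mini-batch SGD sketch contrasting with full-batch gradient descent: the gradient is computed on small shuffled batches rather than the entire dataset; the linear-regression objective and names are illustrative assumptions.

```python
import numpy as np

def sgd(grad_fn, theta, data, lr=0.1, batch_size=8, epochs=20, rng=None):
    """Mini-batch SGD: shuffle the data each epoch, then update the parameters
    with the gradient computed on one small batch at a time."""
    rng = rng or np.random.default_rng(0)
    n = len(data[0])
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = tuple(d[idx[start:start + batch_size]] for d in data)
            theta = theta - lr * grad_fn(theta, *batch)
    return theta

# Example objective: linear regression, J(theta) = mean (X theta - y)^2 / 2.
def grad_linreg(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=200)
print(sgd(grad_linreg, np.zeros(3), (X, y)))   # ~ [1.0, -2.0, 0.5]
```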
Gradient=Steepest Descent Direction
By the Lagrangian method, we have
Convex function
Neural Network
Iterative methods
Problem
General form of iteration methods
Definition
Lemma
Exponentially Weighted Moving Average
Bias Correction
Gradient Descent with Momentum
Manhattan-Learning Rule
Resilient Backprop (Rprop)
AdaGrad: Adaptive Gradient
RMSProp
ADAM Optimization
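A minimal ADAM sketch showing the exponentially weighted moving averages, bias correction, and the adaptive per-parameter step; the quadratic objective and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update: exponentially weighted averages of the gradient (m) and
    its square (v), bias-corrected, then a per-parameter adaptive step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias correction (EWMA starts at zero)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize a simple quadratic J(theta) = ||theta - target||^2.
target = np.array([3.0, -1.0])
theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 5001):
    grad = 2 * (theta - target)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
print(theta)   # ~ [3.0, -1.0]
```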
Learning Rate Decay (Step Size)
Dropout
Normalization
Batch Normalization
BN is applied to each dimension individually for each mini-batch of size M
Rescale and shift by learnable parameters
BN in Inference Phase
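A minimal batch-normalization sketch: per-dimension normalization over a mini-batch of size M with learnable rescale/shift, and the inference-phase variant that uses running statistics; names and values are illustrative assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-time BN: normalize each feature dimension over the mini-batch
    of size M, then rescale and shift by the learnable gamma and beta."""
    mu = x.mean(axis=0)                   # per-dimension mean over the batch
    var = x.var(axis=0)                   # per-dimension variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var

def batch_norm_infer(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """Inference-time BN: use running (population) statistics instead of batch ones."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

M, D = 32, 4                              # mini-batch size and feature dimension
x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(M, D))
y, mu, var = batch_norm_train(x, gamma=np.ones(D), beta=np.zeros(D))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))   # ~0 mean, ~1 std per dimension
```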
Layer Normalization
CNN (Convolutional Neural Network)
Pre-trained CNNs: freeze the earlier layers and change the output layer (transfer learning)
LeNet-5
Operations
ResNet
Inception Net
Convolutions
Padding
Strided Convolution
Convolution over Volume
1×1 Convolution
Max Pooling
Semantic Segmentation
Fully Convolutional Networks
Conv+Deconv
Upsampling
Nearest Neighbor
Bed of Nails
Max unpooling
Deconvolution or Transposed Convolution
Ex. 1
Ex. 2
Residual Block
Mitigate vanishing gradients:
Skip Connections: identity shortcuts
When dimensions change, the shortcut must be adapted to match the new dimensions
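A minimal residual-block sketch, using fully-connected layers for brevity (ResNet itself uses convolutions): the identity skip connection is added before the final nonlinearity; names and sizes are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Residual block: output = ReLU(F(x) + x), where F is two linear layers.
    The identity skip connection lets gradients flow directly, which
    mitigates vanishing gradients in deep stacks."""
    h = relu(x @ W1)          # first transformation
    f = h @ W2                # second transformation (same width as x)
    return relu(f + x)        # identity shortcut added before the nonlinearity

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(1, d))
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
print(residual_block(x, W1, W2).shape)   # (1, 8): same shape, so blocks can be stacked
```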
Inception Nets: Naive Version
With Dimension Reduction
Object Detection: Localization + Classification
R-CNN: Regions with CNN features
YOLO
CNN for Time Series Classification
RNN (Recurrent Neural Network)
IID Data (Independent and Identically Distributed)
Sequence Modeling
Non IID Data
Feedforward Net
Vanilla RNN: Unfolding Computational Graph
Hidden Markov Model
Many to Many: Encoder-Decoder
Seq-to-Seq Learning
Alignment Model
Vanilla RNN: Gradient Flow
Attention in Encoder and Decoder
Encoder
Decoder
Encoder-decoder attention
Hidden vector
Long Short Term Memory (LSTM)
Permutation-Equivariant Attention Modules (SAB & ISAB)
MAB (multihead attention block)
SAB (set attention block)
ISAB (induced set attention block)
Gates ∈ [0,1]
LSTM Cell Updates
LSTM: Gradient Flow
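A minimal sketch of the LSTM cell update with forget/input/output gates in [0, 1] and the additive cell-state path that helps gradient flow; the weight shapes and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM cell update. The forget (f), input (i), and output (o) gates lie
    in [0, 1]; the cell state is c = f * c_prev + i * g, and h = o * tanh(c).
    W maps [x, h_prev] to the four stacked gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = len(h_prev)
    f = sigmoid(z[0:H])            # forget gate
    i = sigmoid(z[H:2*H])          # input gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate cell update
    c = f * c_prev + i * g         # additive cell-state update (good gradient flow)
    h = o * np.tanh(c)
    return h, c

D, H = 3, 4                         # input and hidden sizes (example values)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                  # run a short sequence through the cell
    h, c = lstm_cell(rng.normal(size=D), h, c, W, b)
print(h, c)
```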
Gated Recurrent Unit (GRU)
Generative RNN
Likelihood
Loss function
VAE, GAN
Generating Sequences
Training RNNs for sequence Prediction
Training
Teacher-Forcing
TF
Without TF
Attention in RNN-Encoder-Decoder
Visual Attention
Encoder
Decoder
Transformer models
RNN
CNN (ByteNet, ConvS2S)
Vanilla Transformer
Self-Attention: A sequence-to-sequence operation
Query, Key, Value
Scaled Dot-Product Attention
Multi-Head Attention
Position-wise Feedforward Networks
Positional Encoding
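A minimal sketch of scaled dot-product self-attention and sinusoidal positional encoding; this is single-head and omits the learned Q/K/V projections of multi-head attention, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Each output row is a weighted average of the value rows, with weights given
    by the scaled dot products between the query and every key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(...). Added to embeddings so the model sees token order."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16)) + positional_encoding(5, 16)   # 5 tokens, d_model = 16
# Self-attention: queries, keys, and values all come from the same sequence X.
out = scaled_dot_product_attention(X, X, X)
print(out.shape)   # (5, 16)
```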
Reformer: Efficient Transformer
Transformer models
Reformer models
Locality Sensitive Hashing (LSH)
Reversible Network
Autoencoder
Permutation Invariance and Equivariance
Wanted
Definition (Permutation invariance)
Definition (Permutation equivariance)
Permutation Equivariant Functions
Amortized Clustering
Attention Operators
Dot-product attention
Multihead attention
Set Transformer: Encoder & Decoder
Encoder (X → Z)
Pooling by Multihead Attention (PMA)
Decoder ( Z → y )
Deep Generative models
Latent Space = Hidden Space = Invisible Space
Observed space
A powerful model for unsupervised learning
Image Inpainting
eCommerceGAN
Linear Generative Models: Earlier Days
Sparse Coding
Recognizing data (via discriminative models)
Creating data (via generative models)
Density Estimation
Prescribed models
Deep learning
Implicit models
Deep learning
Variational Autoencoders (VAE)
Autoencoder
Limitation
Variational Autoencoder
Training VAE with Reparameterization Trick
Variational Autoencoder
Probabilistic decoder (generator network)
Probabilistic encoder (inference network) for amortized variational inference
Training VAE
Variational lower-bound
Reconstruction cost
Penalty
Maximize the variational lower-bound on the average log-likelihood
Given
Goal
Model
Variational lower-bound
Two problems to be addressed
Stochastic Gradient Variational Bayes
Noisy Gradients
Reparameterization Trick
Score function gradients
Reparameterization gradients
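A minimal sketch of the reparameterization trick and the diagonal-Gaussian KL penalty used in the variational lower-bound; the encoder outputs here are stand-in values, and names are illustrative assumptions.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: instead of sampling z ~ N(mu, sigma^2) directly
    (which blocks gradients), sample eps ~ N(0, I) and set z = mu + sigma * eps.
    Gradients can then flow through mu and log_var to the encoder."""
    eps = rng.normal(size=mu.shape)           # noise independent of the parameters
    return mu + np.exp(0.5 * log_var) * eps   # deterministic transform of (mu, log_var, eps)

def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian, the penalty term of the ELBO."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu, log_var = np.array([0.5, -1.0]), np.array([0.1, -0.2])   # encoder outputs (example)
z = reparameterize(mu, log_var, rng)
print(z, kl_to_standard_normal(mu, log_var))
```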
VAE: Revisited
Practice of VAEs
Shortcomings of VAEs
Over-regularization
Heuristics
Practical Implementation of VAEs:Summary
Noise Injection
Regularization
Regularized Autoencoder (RAE)
Deterministic Regularized Autoencoders: no noise injection
RAE
The loss for RAE is given by
Examples of Tikhonov regularization
Gradient penalty
Spectral normalization
Ex-Post Density Estimation
No KL divergence term in RAE
Ex-post density estimation
ES-CVAE: Echo-State Conditional Variational Autoencoder
Echo State Networks
Our Model: ES-CVAE
Variational Lower-Bound F on Log[p[
Neural Statistician
Amortized inference
Variational inference = per-sample inference
Neural Statistician: A Bayesian Hierarchical Model
Neural Statistician = VAE for sets
5-way 1-shot
GAN (Generative Adversarial Network)
Adversarial training
Generative Adversarial Network
Generator, G[z;θ]:
Discriminator, D[x;φ]:
Training GAN
Training D
Training G
Two-player minimax game (for Nash equilibrium)
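A minimal sketch of the two alternating objectives in the minimax game, using the common non-saturating generator loss; the discriminator outputs are stand-in values, and names are illustrative assumptions.

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy for probabilities p and labels target (0 or 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def d_loss(d_real, d_fake):
    """Discriminator objective: push D(x) toward 1 on real data and D(G(z)) toward 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def g_loss(d_fake):
    """Non-saturating generator objective: push D(G(z)) toward 1."""
    return bce(d_fake, np.ones_like(d_fake))

# Dummy discriminator outputs just to show the two alternating loss computations.
d_real = np.array([0.9, 0.8, 0.95])   # D on real samples
d_fake = np.array([0.1, 0.3, 0.2])    # D on generated samples
print(d_loss(d_real, d_fake), g_loss(d_fake))
```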
Unrolled GANs
Generating Images by GANs DCGAN
Progressive Growing of GANs
Interesting Applications of GANs
GAN for Single Image Super-Resolution
eCommerceGAN
Improved Techniques for Training GANs
Convergence Issues in Training GANs
Problems in GAN Training
Non-Convergence
Mode collapsing
Diminished gradient
Feature Matching to Train G
Denoising Auto-Encoder
GAN Trained with Denoising Feature Matching
Training G
An Information-Theoretic Extension of GAN
Disentangled Representation
Disentangled = Interpretable and Factorized
Information Maximization: InfoGAN
InfoGAN
Training InfoGAN
Train the discriminator D[x]
Train the generator G[z,c]
Variational Infomax
Conditional GAN
Generator
Discriminator
Optimization
Image-to-Image Translation
Map Edges to Photo via cGAN
Unpaired Image-to-Image Translation
Cycle-Consistency
Adversarial Loss + Cycle Consistency Loss
Adversarial loss for G: X→Y and F: Y→X
Cycle consistency loss
Summary
GANs with Encoder Networks
Adversarially Learned Inference
Encoder joint distribution
Decoder joint distribution
Match these two joint distributions
The minimax game
Semi-Supervised Learning with GANs
Small amount of labeled data
Semi-Supervised Learning with GAN
Classifier for K classes
GAN
Loss
Hyperparameter Optimization
Bayesian Optimization
Optimization of Black-Box Functions
Regret
Instantaneous regret
Cumulative regret (in the bandit setting)
Simple regret: (in the optimization setting)
No regret algorithms in the bandit setting:
Hyperparameter Optimization
Hyperparameters
SigOpt
Objective
Search space
Observations
Automated machine learning
Objective
Search space
Observations
Clinical drug trials in healthcare
Objective
Search space
Observations
Active user modeling
Objective
Search space (x)
Observations (f[x])
Hyperparameters
Model parameters
Hyperparameters
AutoML ⊃ Hyperparameter Optimization ⊃ Neural Architecture Search
Search over Configuration Space
Grid search
Random search
Any more efficient method in terms of the number of evaluations?
AutoML
Feature processing
Model/algorithm selection
Hyperparameter tuning
Many companies are using AutoML
Microsoft
Amazon
AutoGluon: Introduced by Amazon in January, 2020
Democratizes the task of ML
Bayesian Optimization
Surrogate function
Acquisition function
Surrogate model
GP regression
Random function = Gaussian Process
Choice of Kernels
Squared exponential kernel (Gaussian kernel)
Matern Kernel
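Minimal sketches of the squared exponential (Gaussian) kernel and a Matern 5/2 kernel; the lengthscale and variance values are illustrative assumptions.

```python
import numpy as np

def squared_exponential(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel: k(x, x') = s^2 exp(-|x - x'|^2 / (2 l^2))."""
    d2 = np.sum((x1[:, None, :] - x2[None, :, :])**2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def matern52(x1, x2, lengthscale=1.0, variance=1.0):
    """Matern 5/2 kernel, a common, less smooth alternative for Bayesian optimization."""
    d = np.sqrt(np.sum((x1[:, None, :] - x2[None, :, :])**2, axis=-1))
    a = np.sqrt(5.0) * d / lengthscale
    return variance * (1.0 + a + a**2 / 3.0) * np.exp(-a)

X = np.linspace(0, 1, 5).reshape(-1, 1)
print(squared_exponential(X, X).round(3))
print(matern52(X, X).round(3))
```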
A Few Things about GPs
Pros
Cons
Besides GP regression, as surrogate models, you can also use
Random forests
Neural networks
Alternative Surrogate Model: Random Forests
More alternative surrogate models include Mondrian forest regression
Neural processes
Algorithm Outline: Bayesian Optimization
Handling Categorical or Integer-Valued Variables (Spearmint)
Integer-valued variables
Categorical variables
A naive approach
BayesOpt
A naive approach
Acquisition Functions
Utility Function
Utility and Acquisition Functions
Probability of Improvement (PI)
Expected Improvement (EI)
GP Upper Confidence Bound (GP-UCB)
Expected Improvement
Exploration-Exploitation Trade-Off in EI
GP-UCB
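A minimal end-to-end sketch: a GP regression posterior over a 1-D black-box function and two acquisition scores (Expected Improvement and GP-UCB) used to pick the next evaluation point; the kernel, objective, and values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def rbf(x1, x2, l=0.2):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / l**2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-6):
    """GP regression posterior mean and standard deviation at the query points."""
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_query, X_obs)
    Kss = rbf(X_query, X_query)
    mu = Ks @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mu, np.sqrt(np.clip(np.diag(cov), 1e-12, None))

def expected_improvement(mu, sigma, y_best):
    """EI for maximization: E[max(f - y_best, 0)] under the GP posterior."""
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def gp_ucb(mu, sigma, kappa=2.0):
    """GP-UCB: optimistic score mu + kappa * sigma (explore where sigma is large)."""
    return mu + kappa * sigma

# Black-box objective observed at a few points; pick the next evaluation by max EI.
f = lambda x: np.sin(3 * x) + 0.5 * x
X_obs = np.array([0.1, 0.4, 0.9])
y_obs = f(X_obs)
X_query = np.linspace(0, 1, 200)
mu, sigma = gp_posterior(X_obs, y_obs, X_query)
ei = expected_improvement(mu, sigma, y_obs.max())
print("next point to evaluate:", X_query[np.argmax(ei)])
print("GP-UCB maximizer:      ", X_query[np.argmax(gp_ucb(mu, sigma))])
```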
Compressed Sensing
NAS in practice
Application to Soft-Voting in Ensemble
Neural Process
Generative Query Network (GQN)
Generalizations of the GQN framework
Wanted
Gaussian Processes
(non-Bayesian) deep neural networks (DNNs)
Neural processes
Generative Query Network
Conditional Neural Processes
Motivation: CNPs combine benefits of NNs and GPs
Supervised Learning: Data Description
Observed data
Target inputs
Underlying ground truth function
Task
Supervised Learning
CNP
Embedding
Aggregation
Parameterized approximating function (e.g., neural networks)
GP vs CNP
Gaussian processes
Conditional neural processes
CNP: Model
CNP: Architecture
The mean aggregation is used
Neural Processes
Motivation
Neural Networks vs Gaussian Processes
NNs
GPs
Architecture
Encoder
Aggregator
Conditional decoder
Training