The definition of machine learning

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

In most cases, the machine learning process is typed into two: Supervised learning and Unsupervised learning.

Supervisor learning

Giving the 'right answers' to the model learning.

Supervised learning problems are classified into two types of problems:

Regression problem and Classification problem.



Predict continuous output is obtained via single prediction function.


Discrete valued output is obtained via classification.

Unsupervised learning

Learning without 'right answer'. This algorithm can be applied for example, clustering problem.

We do not need previous idea about what the label would be.

e.g. Cocktail party problem algorithm

Linear regression problem

Suppose the date set is expressed with input x(i) and output y(i) , and the hypothesis $ h $ is expressed as

$$ h_\theta(x) = \theta_0 + \theta_1 x $$

the problem is converted into a problem to figure out a set of parameter $ (\theta_0, \theta_1) $ so that this set is the best prediction accuracy. The prediction accuracy can be estimated with cost function $ J(\theta_0, \theta_1) $ as

$$ J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y_i}-y_i)^2 = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta (x_i)-y_i)^2 $$

The cost function is called 'mean square error'. Now, the problem is as follows:

"Find the set of parameter $ (\theta_0, \theta_1) so that [$ J $ becomes minimum."]

contour plot

If you plot $ J $ as a countour plot for $ (\theta_0, \theta_1) $, you will see a bowl-like shape for $ J $.

Gradient descent (勾配降下法)

Outline: start with some $ (\theta_0, \theta_1) $ and keep changing $ (\theta_0, \theta_1) $ to reduce $ J (\theta_0, \theta_1) $ until it becomes minimum.

Algorithm: Repeat until convergence:

$$ \theta_j:=\theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1) $$

where $ j=0,1 $ represents the feature index number, and $ \alpha $ is learnign rate.

Note that in gradient descent method, there sometimes happen that from slightly different initial parameter set, they would differ into completely different local optimum.

Learning rate $ \alpha $ should be 'appropriate' size. If $ \alpha $ is too small, the learning cost a lot of steps, while $ \alpha $ is too large, it doesn't converge (or at worst, it diverges).

Gradient Descent For Linear Regression

"Batch" Gradient Descent

"Batch": Each step of gradient descent uses all the training examples.

Linear Algebra

Dimension of the matrix: numbers of rows times number of columns. 行列の次元は行の数×列の数

$$ A_{ij} = i(row), j(column) $$

Vector : Special case of matrix, which has only one column

1-indexed vs 0-indexed: in mathmatical context, we use 1-indexed expression while in machine learning context, we use 0-indexed expressions.

Matrix multiplication: $ (m, n) \mathrm{Matrix} \times (n, k)\mathrm{Matrix} = (m, k)\mathrm{Matrix} $

Matrix-Matrix multiplication is not commutative. Even though the dimension can be different.

$$ A \times B \neq B \times A $$

Matrix-Matrix multiplication is associative.

$$ (A \times B) \times C = A \times (B \times C) $$

Identity Matrix $ I $: Diagonal elements are all $ 1 $ and other elements are all $ 0 $

Matrix inverse $ A^{-1} $: $ A^{-1} $ which satisfies $ AA^{-1} = A^{-1}A = I $

Maricies that don't have an inverse are called 'singular' or 'degenerate'

Matrix transpose $ A^{T} $ : $ A_{ji} = A^{T} $


Multiple Features

Suppose the number of feature $ n $ and the $ i $ th feature, of the total number of examples $ m $ and the $ j $th example, its value is indicated as: $ x^{(j)}_i $

Multivariable form of hypothesis function is as follows

$$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n = \theta ^Tx $$

Gradient descent for multiple variables is as follows: cost function is now $ J(\theta ) $ as

$$ J(\theta ) = \frac{1}{2m} \sum ^{m}{i=1} (h\theta (x^{(i)}) - y^{(i)})^2 $$

Feature scaling and mean normalization

To quick convergence for regression, the feature $ x $ shall be scaled and normalized as follows:

$$ x_i := \frac{x_i - \mu_i}{s_i} $$

$ \mu_i $ : average of all values $ x $. $ s_i $ : the range of values (max - min)

Learning rate $ \alpha $

Too small $ \alpha $ will lead slow convergence.

Too large $ \alpha $ will lead divergence.

Polynomial regression can be converted into linear regression problem as follows:

$$ h_\theta (x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 $$

where $ x_1 = x, x_2 = x^2, x_3 = x^3 $. Note parameter $ (x_1, x_2, x_3) $ should be appropriatly scaled and normarized.

Normal equation

Suppose $ J(\theta ) = \theta X $ where $ X $ is example parameters matrix. The optimal parameter $ \theta $ is obtained as follows:

$$ \theta = (X^TX)^{-1}X^Ty $$

In Octave, just pinv(X'*X)*X'*y

Normal equation is slow when $ \theta $ is large (>1000).

What if $ X^TX $ is non-invertible?

Redundant features (linealy dependent)

Too many features (> examples)


Linear logistics

$$ h_\theta (x) = g(\theta ^Tx)] where [$ g(z) = \frac{1}{1+e^{-z}} $$

This function is called Sigmoid functino or Logistic fucntion.

In binary classification problem, the probability of being A + the probabilty of beringB.

$ h_\theta (x) \leq 0.5 $ is usually used for boundary.

Decision Binary

$$ h_\theta (x) = g(\theta ^Tx)] : Predict $ "y=1" $ if [$ (\theta ^Tx) >threshold $$

Nonlinear decision boundaries

Same as the case of Multivariables linear regression, high order term will be fitted via linear desicion binary.

Generalization of cost function

$$ J(\theta ) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta (x), y) $$

convex: 下に凸(grobal minimum に全域から到達可能), non convex: そうではない

Logistic regression cost function

$$ \mathrm{Cost}(h_\theta (x), y) = \lbrace \begin{matrix} -\mathrm{log}h_\theta (x) & \mathrm{if} y=1 \\ -\mathrm{log}(1-h_\theta (x)) & \mathrm{if} y=0 \end{matrix} $$

$$ J(\theta ) = -\frac{1}{m} \sum_{i=1}^{m} \lbrack y^{(i)}\mathrm{log}(h_\theta (x^{(i)})) + (1-y^{(i)})\mathrm{log}(1-h_\theta (x^{(i)})) \rbrack $$

Multiclass classification

ex. Email tagging: Work, Friends, Family, Hobby

Nedical diagrams: not ill, cold, flu...


Argorithm to devide multiclasses into identical classes.

pick up one class, mark it, and mark other datapoints are treated as theme.

To make a prediction on a new x, pick the class that maximizes $ h_\theta (x) $


Linear logistics ではロジスティック関数の形$ h_\theta (x) = g(\theta ^Tx) $ where $ g(z) = \frac{1}{1+e^{-z}} $ とコスト関数の形

が大事で、この形にすることによりGradient descentで線形回帰と同じ形の発展の方程式

$$ \theta_j:=\theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta ) $$



Don't do it. Data set is too few.

Underfit = High bias ... Overfit = High variance

Model selection algorithm would be useful (you'll be learning afterword)


$$ J(\theta ) = \frac{1}{2m} \sum ^{m}{i=1} (h\theta (x^{(i)}) - y^{(i)})^2 + \lambda \sum^{m}{i=1} \thetaj^2 $$

The last term is regularization term to penalize large parameter to avoid overfitting.

$ \lambda $ is regularization parameter.

Regularization with normal equation

$ \theta = (X^TX + \lambda (I - A))^{-1}X^Ty $ where $ A $ is a matrix with only $ (1,1) $ element is 1, which means $ I-A $ is an identical matrix without $ (1, 1) $ element.

Regularized logistic regression

$$ J(\theta ) = -\frac{1}{m} \sum_{i=1}^{m} \lbrack y^{(i)}\mathrm{log}(h_\theta (x^{(i)})) + (1-y^{(i)})\mathrm{log}(1-h_\theta (x^{(i)})) \rbrack + \frac{\lambda}{2m}\sum_{i=1}^{n}\theta_j^2 $$


Neural Network

Simulate the neuron in brain.

'Weight' or 'Parameter' terminology is used for $ \theta $ in input $ \theta ^Tx $

If network has $ s_j $ units in layer $ j $, $ s_{j+1} $units in layer $ j+1 $, then $ \Theta ^{(j)} $ will be dimension $ s_{j+1}\times (s_j+1) $

Forward propagation

Vectorized implementation

where $ g(z) = \frac{1}{1+e^{-z}} $


Neural network cost function


suppose training set $ {(x^{(i)}, y^{(i)}}) $

We want to calculate $ \frac{\partial}{\partial \Theta ^{(l)}_{i,j}}J(\Theta) $ to minimize $ J(\Theta ) $.

Backpropagation algorhythm is like this.

Only if we use sigmoid function for activation function, these delivations are right.

$$ \delta ^{(l)} = ((\Theta ^{(l)})^T \delta ^{(l+1)}) .\ast a^{(l)} .\ast (1-a^{(l)}) $$

Backpropagation in practice

Unrolling parameters -> we use 'vector' minimize function in Octave, so we convert matrices into vector.

Gradient checking -> Backpropagation is fast but complicated method. Numerical gradient is slow but easy method, which can be used for checking if backpropagation implement is collect.

Random initialization -> for breaking symmetry. Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON will obtain $ (-\epsilon, \epsilon ) $ random seed.

putting it together ->


Deciding what to try next

Predict housing prices -> regularized linear regression

what you should not do without thinking

GEt more training examples

Try smaller sets of features

Try getting additional features

Try adding polynomial features

Incerease or decrease $ \lambda $... etc.


Testing and giving some insight to give your machine learning to improve performance.

Evaluating hypothesis

Test set error

divide test set into 70% training data set and 30% test data set.

Cross validation error

Diagnosing Bias vs Variance

Regularization and Bias/Variance


Support Vector Machine (SVM)

SVM is an algorythm to find a hyperplane which maximizes the margin to each class'es nearest sample.

$$ \mathrm{min}\theta \lbrack C\sum{i=1}^{m}y^{(i)}cost_1(\theta^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta^Tx^{(i)})\rbrack +\frac{1}{2}\sum_{i=1}^n\theta_j^2 $$

SVM can be extended to nonlinear problem.


$$ f_1 = \mathrm{similarity}(x, l^{(1)}) = \mathrm{exp} (-\frac{||x-l^{(1)}||^2}{2\sigma^2}) $$

Predict "1" when $ \theta_0 + \theta_1f_1 + \theta_2f_2 + \theta_3f_3 + ... \ge 0 $


Non-supervised algorythm

K-means method

  1. Randomly intialize centroids for k-classes.

  2. Assign each data points for the nearest centroids.

  3. Move each centroid to the average (mean) point of each data point.

  4. Iterate 2. and 3. until $ J $ converges. ( $ J = \sum_{i=1}^{k} ||x^{(i)} - \mu_{C^{(i)}}|| $ . where $ ||a-b|| $ means the distance between $ a $and$ b $)

Principal Component Analysis algorythm (PCA)

Decrease the dimension of raw data.

  1. Normalize and scale feature(optional), $ Sigma = \frac{1}{m}\sum_{i=1}^{m}(x^{i})(x^{i})^T $

  2. U, S, V = svd(Sigma)

  3. Minimize $ 1-\frac{\sum_{i=1}^{k}||x^{i}-x_{approx}^{i}||}{\sum_{i=1}^{n}||x^{i}||} $

Just use to compress or visualize data.


Anomaly detection

Anomaly detection usually uses multivariable Gaussian function.

$ x^{(i)} $ : the $ i $ th feature

$$ p(x) = \Pi_{j=1}^n p(x^{(j)};\mu_j, \sigma_j^2)] where [$ p(x; \mu, \sigma ^2) =\frac{1}{\sqrt{2\pi}\sigma}\exp{(-\frac{(x-\mu)^2}{2\sigma^2})} $$

Anomaly detection with Multivariable Gaussian usually has a lot of normal data compared to anormal data.

So predoction accuracy is not good index. Use F1 score instead.

Multivariate Gaussian

$$ p(x; \mu, \sigma ^2) =\frac{1}{(2\pi)^{\frac{n}{2}}|\sigma|^\frac{1}{2}}\exp{(-\frac{1}{2}(x-\mu)^T\sigma^{-1}(x-\mu))} $$



$ r(i,j) = 1 $ if user $ j $ has scored target $ i $ ($ 0 $ otherwise)

$$ y^{(i,j)} =] score by user $ j $ on target [$ i $$

$$ \theta^{(j)} =] parameter vector for user [$ j $$

$$ x^{(i)} =] feature vector for target [$ i $$

For user $ j $, target $ i $, predicted rating: $ (\theta^{(j)})^T(x^{(i)}) $

$$ m^{(j)} =] number of target scored by user [$ j $$

To learn $ \theta^{(j)} $:

$$ \mathrm{min}{\theta(j)} \frac{1}{2}\sum{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n(\theta_k^{(j)})^2 $$

To learn $ \theta^{(1)}, \theta^{(2)},...,\theta^{(n_u)} $:

$$ \mathrm{min}{\theta(1),...,\theta(nu)} \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2 $$

Gradient descent update:

$ \theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)} $ (for$ k = 0 $)

$ \theta_k^{(j)} := \theta_k^{(j)} - \alpha( \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)}) $ (for$ k\neq 0 $ )

Collaborative filtering optimization objective

Minimizing $ x^{(1)}, x^{(2)},...,x^{(n_t)} $and$ \theta^{(1)}, \theta^{(2)},...,\theta^{(n_u)} $simultaneously :

$$ J(x^{(1)}, x^{(2)},...,x^{(n_t)},\theta^{(1)}, \theta^{(2)},...,\theta^{(n_u)}) = \frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})+\frac{\lambda}{2}\sum_{i=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)}) $$


Large data set

"It's not who has the best algorithm that wins. It's who has the most data."

batch gradient descent

do gradient descent mehtod with all the features

stochastic gradient descent

process gradient method for each feature.

The tendency is to converge around the local minimum for starting point, which would give well-optimized parameters.

mini-batch gradient descent

intermediate of these two methods, to devide features into small unit and proceed gradient descent within the unit.

Online learning

learn $ p(y=1|x;\theta) $ online, which enables to follow the trend.


Machine learning pipeline

Usually, the whole machine learning system is expressed as a pipeline of some machine learning steps.

e.g. Photo OCR system

Law picture -> Text area detection -> Character segmentation -> Character classification

Sliding windows

Sliding windows technique is used in some machine learning apllication, especially for photo detection problem.

To detect some components in a picture, you would configure a detection window, and then slide it along two axises.

i.e. If you want to analyze 200 * 200 pixel picture with 10 * 10, 15 * 15 and 20 * 20 size windows with 4 pixels per each "step", you would process it with about 750000 times.

Getting lots of data and artificail data

If you have not enough data to train your machine learning system, one way to obtain more positive data is to make artificial modification to original positive datasets. For example, rotating, distorting, stretching, and resizing. Linearly independent modifications are not efficient.

Adding purely random noise to your data usually doesn't help improveing training.

Ceiling analysis

If you have a machine learning pipeline and you want to improve the entire accuracy, but don't know where to be improved, the ceiling analysis is applicable. To each component, if its accuracy is 100%, how the entire accuracy becomes. If the value is large, that part is where you should put resource on.