WEEK1
The definition of machine learning
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
In most cases, machine learning is divided into two types: supervised learning and unsupervised learning.
Supervised learning
The model is trained with the 'right answers' (labels) given.
Supervised learning problems are classified into two types: regression problems and classification problems.
Regression
A continuous-valued output is predicted via a prediction function.
Classification
A discrete-valued output is predicted.
Unsupervised learning
Learning without the 'right answer'. This type of algorithm can be applied, for example, to clustering problems.
We do not need a prior idea of what the labels would be.
e.g. Cocktail party problem algorithm
Linear regression problem
Suppose the data set is expressed with input $ x^{(i)} $ and output $ y^{(i)} $, and the hypothesis $ h $ is expressed as
$$ h_\theta(x) = \theta_0 + \theta_1 x $$
the problem is converted into finding the parameter set $ (\theta_0, \theta_1) $ that gives the best prediction accuracy. The prediction accuracy can be estimated with the cost function $ J(\theta_0, \theta_1) $ as
$$ J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2 = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta (x_i) - y_i)^2 $$
The cost function is called the 'mean squared error' (the factor $ \frac{1}{2} $ is for convenience when differentiating). Now, the problem is as follows:
"Find the set of parameter $ (\theta_0, \theta_1) $ so that $ J $ becomes minimum."
contour plot
If you plot $ J $ as a contour plot over $ (\theta_0, \theta_1) $, you will see a bowl-like shape for $ J $.
Gradient descent
Outline: start with some $ (\theta_0, \theta_1) $ and keep changing $ (\theta_0, \theta_1) $ to reduce $ J(\theta_0, \theta_1) $ until it reaches a minimum.
Algorithm: Repeat until convergence:
$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1) $$
where $ j=0,1 $ is the feature index number, and $ \alpha $ is the learning rate. Both parameters must be updated simultaneously.
Note that with gradient descent, slightly different initial parameter sets can end up in completely different local optima.
The learning rate $ \alpha $ should be an 'appropriate' size. If $ \alpha $ is too small, learning takes many steps; if $ \alpha $ is too large, it may fail to converge (or, at worst, diverge).
Gradient Descent For Linear Regression
"Batch" Gradient Descent
"Batch": Each step of gradient descent uses all the training examples.
Linear Algebra
Dimension of a matrix: number of rows × number of columns.
$ A_{ij} $ : the element in the $ i $th row, $ j $th column
Vector : Special case of matrix, which has only one column
1-indexed vs 0-indexed: in a mathematical context, we use 1-indexed expressions, while in a machine learning context, we use 0-indexed expressions.
Matrix multiplication: $ (m, n) \mathrm{Matrix} \times (n, k)\mathrm{Matrix} = (m, k)\mathrm{Matrix} $
Matrix-matrix multiplication is not commutative (in general, the dimensions of the two products may not even match).
$$ A \times B \neq B \times A $$
Matrix-matrix multiplication is associative.
$$ (A \times B) \times C = A \times (B \times C) $$
Identity Matrix $ I $: Diagonal elements are all $ 1 $ and other elements are all $ 0 $
Matrix inverse $ A^{-1} $ : the matrix $ A^{-1} $ which satisfies $ AA^{-1} = A^{-1}A = I $
Matrices that don't have an inverse are called 'singular' or 'degenerate'
Matrix transpose $ A^{T} $ : $ (A^{T})_{ij} = A_{ji} $
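The facts above can be checked with small made-up matrices, e.g. with NumPy:

```python
import numpy as np

# Checking the matrix properties above with small made-up matrices.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

AB = A @ B                  # matrix multiplication is generally NOT commutative
BA = B @ A

A_inv = np.linalg.inv(A)    # A is non-singular here, so A^{-1} exists
identity = np.eye(2)

At = A.T                    # transpose: (A^T)_ij = A_ji
```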
WEEK2
Multiple Features
Suppose there are $ n $ features and $ m $ examples; the value of the $ i $th feature of the $ j $th example is denoted $ x^{(j)}_i $.
The multivariable form of the hypothesis function is as follows (with $ x_0 = 1 $):
$$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta ^T x $$
Gradient descent for multiple variables is as follows: cost function is now $ J(\theta ) $ as
$$ J(\theta ) = \frac{1}{2m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)})^2 $$
Feature scaling and mean normalization
For quicker convergence of the regression, the features $ x $ should be scaled and mean-normalized as follows:
$$ x_i := \frac{x_i - \mu_i}{s_i} $$
$ \mu_i $ : the average of all values of feature $ i $. $ s_i $ : the range of values (max - min)
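A minimal sketch of this normalization (the feature values are hypothetical):

```python
# Mean normalization and feature scaling: x := (x - mu) / s,
# where mu is the average and s the range (max - min). Values are made up.

def scale_feature(values):
    mu = sum(values) / len(values)
    s = max(values) - min(values)
    return [(v - mu) / s for v in values]

scaled = scale_feature([100.0, 200.0, 300.0, 400.0])
# every scaled value now lies in [-0.5, 0.5], centered on 0
```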
Learning rate $ \alpha $
Too small an $ \alpha $ leads to slow convergence.
Too large an $ \alpha $ can lead to divergence.
Polynomial regression can be converted into linear regression problem as follows:
$$ h_\theta (x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 $$
where $ x_1 = x, x_2 = x^2, x_3 = x^3 $. Note the features $ (x_1, x_2, x_3) $ should be appropriately scaled and normalized.
Normal equation
Let $ X $ be the matrix of training examples (the design matrix) and $ y $ the output vector. The optimal parameter $ \theta $ is obtained in closed form as follows:
$$ \theta = (X^TX)^{-1}X^Ty $$
In Octave, just pinv(X'*X)*X'*y
The normal equation is slow when the number of features $ n $ is large (e.g. $ n > 1000 $), since it inverts an $ n \times n $ matrix.
What if $ X^TX $ is noninvertible?
Redundant features (linearly dependent)
Too many features (more features than examples)
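A sketch of the normal equation with NumPy, on toy data generated from $ y = 1 + 2x $ (mirroring the Octave one-liner above):

```python
import numpy as np

# Normal equation theta = pinv(X'X) X' y on toy data from y = 1 + 2x.
# X is the design matrix with a leading column of ones for theta_0.

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y  # mirrors Octave's pinv(X'*X)*X'*y
```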
WEEK3
Logistic regression
$ h_\theta (x) = g(\theta ^Tx) $ where $ g(z) = \frac{1}{1+e^{-z}} $
This function $ g $ is called the sigmoid function or the logistic function.
In a binary classification problem, the probability of being class A and the probability of being class B sum to 1: $ P(y=0|x;\theta ) + P(y=1|x;\theta ) = 1 $.
$ h_\theta (x) \geq 0.5 $ is usually used as the boundary for predicting $ y=1 $.
Decision boundary
$ h_\theta (x) = g(\theta ^Tx) $ : predict $ "y=1" $ if $ \theta^Tx $ exceeds the threshold (for the usual 0.5 boundary, $ \theta^Tx \geq 0 $)
Nonlinear decision boundaries
As in multivariable linear regression, higher-order terms can be included, so that a linear boundary in the new features gives a nonlinear decision boundary in the original ones.
Generalization of cost function
$$ J(\theta ) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta (x), y) $$
convex: bowl-shaped (the global minimum can be reached from anywhere); non-convex: not so (gradient descent may get stuck in a local minimum)
Logistic regression cost function
$$ \mathrm{Cost}(h_\theta (x), y) = \begin{cases} -\log (h_\theta (x)) & \text{if } y=1 \\ -\log (1-h_\theta (x)) & \text{if } y=0 \end{cases} $$
$$ J(\theta ) = -\frac{1}{m} \sum_{i=1}^{m} \lbrack y^{(i)}\log (h_\theta (x^{(i)})) + (1-y^{(i)})\log (1-h_\theta (x^{(i)})) \rbrack $$
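A sketch of this cost function in Python for a 1-D toy dataset (data and parameter values are made up):

```python
import math

# Logistic regression cost J(theta) for a made-up 1-D dataset,
# with h_theta(x) = g(theta0 + theta1 * x).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta0, theta1, xs, ys):
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(theta0 + theta1 * x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m

xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
J_weak = cost(0.0, 1.0, xs, ys)
J_confident = cost(0.0, 5.0, xs, ys)  # steeper boundary fits this data better
```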
Multiclass classification
ex. Email tagging: Work, Friends, Family, Hobby
Medical diagnosis: not ill, cold, flu...
Onevsall
An algorithm that divides a multiclass problem into several binary ones:
pick one class as the positive class, treat all other data points as a single negative class, and train one classifier per class.
To make a prediction on a new x, pick the class that maximizes $ h_\theta (x) $
Note
In logistic regression, what matters is the form of the logistic function, $ h_\theta (x) = g(\theta ^Tx) $ where $ g(z) = \frac{1}{1+e^{-z}} $, and the form of the cost function;
with these forms, gradient descent uses an update equation of the same form as linear regression:
$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta ) $$
Overfitting
Overfitting tends to happen when there are too many features for too few training examples.
Underfit = high bias ... Overfit = high variance
A model selection algorithm can help (you'll learn this later).
Regularization
$$ J(\theta ) = \frac{1}{2m} \lbrack \sum_{i=1}^{m} (h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \rbrack $$
The last term is the regularization term, which penalizes large parameters to avoid overfitting.
$ \lambda $ is regularization parameter.
Regularization with normal equation
$ \theta = (X^TX + \lambda (I - A))^{-1}X^Ty $ where $ A $ is a matrix whose only nonzero element is a 1 at $ (1,1) $, i.e. $ I-A $ is the identity matrix with its $ (1,1) $ element set to 0 (so $ \theta_0 $ is not regularized).
Regularized logistic regression
$$ J(\theta ) = -\frac{1}{m} \sum_{i=1}^{m} \lbrack y^{(i)}\log (h_\theta (x^{(i)})) + (1-y^{(i)})\log (1-h_\theta (x^{(i)})) \rbrack + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$
WEEK4
Neural Network
Simulates the neurons in the brain.
'Weight' or 'parameter' is the terminology used for $ \theta $ in the input $ \theta ^Tx $.
If a network has $ s_j $ units in layer $ j $ and $ s_{j+1} $ units in layer $ j+1 $, then $ \Theta ^{(j)} $ will have dimension $ s_{j+1}\times (s_j+1) $.
Forward propagation
Vectorized implementation
In vectorized form, $ z^{(j+1)} = \Theta^{(j)} a^{(j)} $ and $ a^{(j+1)} = g(z^{(j+1)}) $, where $ g(z) = \frac{1}{1+e^{-z}} $ is applied element-wise.
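A minimal sketch of one vectorized forward-propagation step (layer sizes and the all-zero weights are arbitrary illustrations):

```python
import numpy as np

# One vectorized forward-propagation step: a^(j+1) = g(Theta^(j) [1; a^(j)]).
# Layer sizes and the (all-zero) weights are made up for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(theta, a):
    a_with_bias = np.concatenate([[1.0], a])  # prepend the bias unit
    return sigmoid(theta @ a_with_bias)

# s_2 = 2 units, s_1 = 3 inputs -> Theta^(1) has dimension 2 x (3 + 1)
theta1 = np.zeros((2, 4))
a2 = forward_layer(theta1, np.array([0.5, -0.5, 1.0]))
# with zero weights, every activation is sigmoid(0) = 0.5
```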
WEEK5
Neural network cost function
Backpropagation
Suppose a training set $ \lbrace (x^{(i)}, y^{(i)}) \rbrace $.
We want to calculate $ \frac{\partial}{\partial \Theta ^{(l)}_{i,j}}J(\Theta) $ to minimize $ J(\Theta ) $.
The backpropagation algorithm computes the error terms layer by layer, from the output back to the input.
These derivations hold only when the sigmoid function is used as the activation function.
$$ \delta ^{(l)} = ((\Theta ^{(l)})^T \delta ^{(l+1)}) .\ast a^{(l)} .\ast (1-a^{(l)}) $$
Backpropagation in practice
Unrolling parameters > Octave's minimization functions work on vectors, so we unroll the parameter matrices into a single vector.
Gradient checking > Backpropagation is fast but complicated; the numerical gradient is slow but simple, and can be used to check whether the backpropagation implementation is correct.
Random initialization > for breaking symmetry. Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON
will give random values in $ (-\epsilon, \epsilon ) $.
putting it together >
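Gradient checking can be sketched as a two-sided numerical difference; the function `J` below is a toy stand-in for the real cost:

```python
# Gradient checking: approximate dJ/dtheta_j with a two-sided difference.
# J below is a toy cost whose analytic gradient is known exactly.

EPSILON = 1e-4

def numerical_gradient(J, theta, j):
    theta_plus = list(theta)
    theta_minus = list(theta)
    theta_plus[j] += EPSILON
    theta_minus[j] -= EPSILON
    return (J(theta_plus) - J(theta_minus)) / (2 * EPSILON)

J = lambda t: t[0] ** 2 + 3 * t[1]     # analytic gradient: (2*t[0], 3)
theta = [1.5, -2.0]
g0 = numerical_gradient(J, theta, 0)   # should be close to 2 * 1.5 = 3.0
g1 = numerical_gradient(J, theta, 1)   # should be close to 3.0
```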
WEEK6
Deciding what to try next
Predict housing prices > regularized linear regression
What you should not try blindly:
Get more training examples
Try smaller sets of features
Try getting additional features
Try adding polynomial features
Increase or decrease $ \lambda $... etc.
Diagnostics
Tests that give insight into how to improve the performance of your machine learning system.
Evaluating hypothesis
Test set error
Divide the data set into a 70% training set and a 30% test set.
Cross validation error
Diagnosing Bias vs Variance
Regularization and Bias/Variance
WEEK7
Support Vector Machine (SVM)
SVM is an algorithm to find a hyperplane that maximizes the margin to each class's nearest samples.
$$ \mathrm{min}_\theta \lbrack C\sum_{i=1}^m y^{(i)}\mathrm{cost}_1(\theta^Tx^{(i)}) + (1-y^{(i)})\mathrm{cost}_0(\theta^Tx^{(i)})\rbrack +\frac{1}{2}\sum^n_{j=1}\theta_j^2 $$
SVM can be extended to nonlinear problem.
Kernel
$$ f_1 = \mathrm{similarity}(x, l^{(1)}) = \exp \left( -\frac{\| x-l^{(1)} \|^2}{2\sigma^2} \right) $$
Predict "1" when $ \theta_0 + \theta_1f_1 + \theta_2f_2 + \theta_3f_3 + ... \ge 0 $
WEEK8
Unsupervised algorithms
K-means method
1. Randomly initialize centroids for $ k $ classes.
2. Assign each data point to the nearest centroid.
3. Move each centroid to the average (mean) position of the data points assigned to it.
4. Iterate 2. and 3. until $ J $ converges ( $ J = \frac{1}{m}\sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2 $, where $ \| a-b \| $ means the distance between $ a $ and $ b $).
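The steps above can be sketched in Python for $ k = 2 $ in one dimension (the data points and initial centroids are made up):

```python
# K-means for k = 2 in one dimension, following the steps above.
# The data points and initial centroids are made up.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # move step: set each centroid to the mean of its assigned points
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

centroids = kmeans([1.0, 1.2, 0.8, 9.0, 9.4, 8.6], [0.0, 5.0])
# the two centroids settle near the two clusters, at about 1.0 and 9.0
```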
Principal Component Analysis (PCA)
Decreases the dimension of raw data.
1. Normalize and scale the features (optional), then compute $ \Sigma = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)})(x^{(i)})^T $
2. [U, S, V] = svd(Sigma)
3. Choose the smallest number of components $ k $ such that $ \frac{\sum_{i=1}^{m}\| x^{(i)}-x_{approx}^{(i)} \|^2}{\sum_{i=1}^{m}\| x^{(i)} \|^2} \leq 0.01 $ (i.e. 99% of the variance is retained).
Use PCA only to compress or visualize data.
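The PCA steps can be sketched with NumPy on toy 2-D data lying roughly along a line, so one component captures almost all the variance:

```python
import numpy as np

# PCA via SVD on toy 2-D data lying roughly along the line y = x:
# projecting onto k = 1 component loses very little information.

X = np.array([[2.0, 2.1], [0.0, -0.1], [-2.0, -2.0]])
m = X.shape[0]

X = X - X.mean(axis=0)        # mean-normalize each feature
Sigma = (X.T @ X) / m         # covariance matrix
U, S, V = np.linalg.svd(Sigma)

k = 1
Z = X @ U[:, :k]              # compressed, k-dimensional representation
X_approx = Z @ U[:, :k].T     # reconstruction back in the original space
```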
WEEK9
Anomaly detection
Anomaly detection usually models the data with Gaussian distributions.
$ x^{(i)} $ : the $ i $ th feature
$ p(x) = \prod_{j=1}^n p(x_j;\mu_j, \sigma_j^2) $ where $ p(x; \mu, \sigma ^2) = \frac{1}{\sqrt{2\pi}\sigma}\exp \left( -\frac{(x-\mu)^2}{2\sigma^2} \right) $
An anomaly detection data set usually has far more normal data than anomalous data.
So prediction accuracy is not a good index. Use the F1 score instead.
Multivariate Gaussian
$$ p(x; \mu, \Sigma ) = \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp \left( -\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) \right) $$
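A sketch of the per-feature Gaussian model $ p(x) = \prod_j p(x_j; \mu_j, \sigma_j^2) $, fit on made-up "normal" data (the test points are hypothetical):

```python
import math

# Per-feature Gaussian anomaly score p(x) = prod_j p(x_j; mu_j, sigma_j^2).
# The "normal" training rows and the test points are made up.

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit(rows):
    n = len(rows[0])
    mus = [sum(r[j] for r in rows) / len(rows) for j in range(n)]
    variances = [sum((r[j] - mus[j]) ** 2 for r in rows) / len(rows)
                 for j in range(n)]
    return mus, variances

train = [[1.0, 10.0], [1.2, 9.8], [0.8, 10.2], [1.1, 9.9], [0.9, 10.1]]
mus, variances = fit(train)

def p(x):
    return math.prod(gaussian(x[j], mus[j], variances[j]) for j in range(len(x)))

normal_score = p([1.0, 10.0])
anomaly_score = p([2.0, 9.0])  # far from the training data -> much smaller p(x)
```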
Recommendation
Suppose:
$ r(i,j) = 1 $ if user $ j $ has scored target $ i $ ($ 0 $ otherwise)
$ y^{(i,j)} = $ score by user $ j $ on target $ i $
$ \theta^{(j)} = $ parameter vector for user $ j $
$ x^{(i)} = $ feature vector for target $ i $
For user $ j $, target $ i $, predicted rating: $ (\theta^{(j)})^T(x^{(i)}) $
$ m^{(j)} = $ number of target scored by user $ j $
To learn $ \theta^{(j)} $:
$$ \mathrm{min} \frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n(\theta_k^{(j)})^2 $$
To learn $ \theta^{(1)}, \theta^{(2)},...,\theta^{(n_u)} $:
$$ \mathrm{min} \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2 $$
Gradient descent update:
$ \theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)} $ (for $ k = 0 $)
$ \theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)} \right) $ (for $ k\neq 0 $)
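One step of the update above for a single user, on made-up ratings and features (`ALPHA` and `LAMBDA` are illustrative values):

```python
# One gradient descent step for a single user's theta^(j), per the update above.
# Ratings, features and the hyperparameters ALPHA, LAMBDA are made up.

ALPHA, LAMBDA = 0.1, 0.01

X = [[1.0, 0.5], [1.0, -0.5]]  # x^(i) for two rated targets (x_0 = 1)
Y = [4.0, 1.0]                 # y^(i,j): this user's scores for those targets
theta = [0.0, 0.0]             # theta^(j), initialized to zero

# accumulate the gradient over the targets this user has rated
grads = [0.0] * len(theta)
for x, y in zip(X, Y):
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    for k, xi in enumerate(x):
        grads[k] += err * xi

# k = 0 is the intercept term and is not regularized
theta = [t - ALPHA * (g + (LAMBDA * t if k != 0 else 0.0))
         for k, (t, g) in enumerate(zip(theta, grads))]
```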
Collaborative filtering optimization objective
Minimizing $ x^{(1)}, x^{(2)},...,x^{(n_m)} $ and $ \theta^{(1)}, \theta^{(2)},...,\theta^{(n_u)} $ simultaneously:
$$ J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}) = \frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2 $$
WEEK10
Large data set
"It's not who has the best algorithm that wins. It's who has the most data."
batch gradient descent
performs each gradient descent step using all the training examples at once.
stochastic gradient descent
updates the parameters using one training example at a time.
It tends to wander around near the minimum rather than settle exactly, which still gives well-optimized parameters.
minibatch gradient descent
an intermediate between the two: divide the training set into small batches and perform a gradient descent step on each batch.
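Stochastic gradient descent can be sketched as follows: unlike the batch version, the parameters are updated after each example (data generated from $ y = 1 + 2x $; the hyperparameters are illustrative):

```python
import random

# Stochastic gradient descent for h(x) = theta0 + theta1 * x: the parameters
# are updated after EACH example, not after a full pass over the data.
# Data comes from y = 1 + 2x; alpha, epochs and seed are made-up choices.

def sgd(data, alpha=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)            # visit the examples in random order
        for x, y in data:
            err = theta0 + theta1 * x - y
            theta0 -= alpha * err    # update from this single example only
            theta1 -= alpha * err * x
    return theta0, theta1

data = [(0.5 * i, 1.0 + 2.0 * (0.5 * i)) for i in range(8)]
theta0, theta1 = sgd(data)
# theta0, theta1 approach 1 and 2
```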
Online learning
Learn $ p(y=1|x;\theta) $ online, which enables the model to follow trends in the data.
WEEK11
Machine learning pipeline
Usually, the whole machine learning system is expressed as a pipeline of some machine learning steps.
e.g. Photo OCR system
Raw picture > Text area detection > Character segmentation > Character classification
Sliding windows
The sliding-window technique is used in some machine learning applications, especially for photo detection problems.
To detect components in a picture, you configure a detection window and slide it along both axes.
i.e. If you analyze a 200 * 200 pixel picture with 10 * 10, 15 * 15 and 20 * 20 size windows, stepping 4 pixels at a time, you end up evaluating several thousand window positions.
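The number of window evaluations in that example can be computed directly (a sketch; it assumes each window must lie fully inside the image):

```python
# Counting sliding-window evaluations for the example above: a 200 x 200 image,
# window sizes 10, 15 and 20, stepped 4 pixels along both axes.
# Assumes the window must fit entirely inside the image.

def window_count(image_size, window_size, step):
    positions = (image_size - window_size) // step + 1  # positions along one axis
    return positions * positions                        # slide along both axes

total = sum(window_count(200, w, 4) for w in (10, 15, 20))
# total is 6629, i.e. several thousand window evaluations
```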
Getting lots of data and artificial data
If you have not enough data to train your machine learning system, one way to obtain more positive data is to make artificial modification to original positive datasets. For example, rotating, distorting, stretching, and resizing. Linearly independent modifications are not efficient.
Adding purely random noise to your data usually doesn't help improve training.
Ceiling analysis
If you have a machine learning pipeline and want to improve the overall accuracy but don't know which component to improve, ceiling analysis is applicable: for each component, measure how the overall accuracy would change if that component were 100% accurate. Where the gain is large is where you should put your resources.