Machine Learning [** WEEK1] [[The definition of machine learning]] A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. In most cases, machine learning is divided into two types: supervised learning and unsupervised learning.

[[Supervised learning]] The 'right answers' are given to the model during learning. Supervised learning problems are classified into two types: regression problems and classification problems.

[[Regression]] A continuous-valued output is predicted via a single prediction function.

[[Classification]] A discrete-valued output is obtained via classification.

[[Unsupervised learning]] Learning without the 'right answers'. These algorithms can be applied, for example, to clustering problems; we need no prior idea of what the labels would be. e.g. the cocktail party problem algorithm

[[Linear regression problem]] Suppose the data set is expressed with input [$ x^{(i)}] and output [$ y^{(i)}], and the [[hypothesis]] [$ h] is expressed as [$ h_\theta(x) = \theta _0 + \theta _1 x]. The problem is then converted into finding the set of parameters [$ (\theta _0, \theta _1)] that gives the best prediction accuracy. The prediction accuracy can be estimated with the [[cost function]] [$ J(\theta _0, \theta _1)] as [$ J(\theta _0, \theta _1) = \frac{1}{2m}\sum _{i=1}^{m}(\hat{y}_i-y_i)^2 = \frac{1}{2m}\sum _{i=1}^{m}(h_\theta (x_i)-y_i)^2] This cost function is called the 'mean squared error'. Now, the problem is as follows: [* "Find the set of parameters [$ (\theta _0, \theta _1)] so that [$ J] becomes minimum."]
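The cost function above can be sketched in a few lines of NumPy (a minimal sketch; the function and variable names are my own, not from the course):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Mean squared error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)
    predictions = theta0 + theta1 * x          # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Points lying exactly on the line y = 2x have zero cost at (theta0, theta1) = (0, 2)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
```

Any other parameter pair yields a strictly positive cost on this data.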

[[contour plot]] If you plot [$ J] as a contour plot over [$ (\theta _0, \theta _1)], you will see that [$ J] has a bowl-like shape.

[[Gradient descent]] Outline: start with some [$ (\theta _0, \theta _1)] and keep changing [$ (\theta _0, \theta _1)] to reduce [$ J (\theta _0, \theta _1)] until it reaches a minimum. Algorithm: Repeat until convergence: [$ \theta _j:=\theta _j - \alpha \frac{\partial}{\partial \theta _j}J(\theta _0, \theta _1)] where [$ j=0,1 ] is the feature index number and [$ \alpha] is the learning rate. Note that in gradient descent, slightly different initial parameter sets can sometimes end up in completely different local optima. The learning rate [$ \alpha] should be an 'appropriate' size: if [$ \alpha] is too small, learning takes many steps, while if [$ \alpha] is too large, it doesn't converge (or, at worst, diverges).
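The update rule above, applied to the univariate linear regression cost, can be sketched as follows (a minimal sketch with my own names; the learning rate and iteration count are illustrative choices):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=500):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        error = theta0 + theta1 * x - y        # h_theta(x^(i)) - y^(i)
        grad0 = error.sum() / m                # dJ/dtheta0
        grad1 = (error * x).sum() / m          # dJ/dtheta1
        # Simultaneous update of both parameters
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Noise-free data on the line y = 1 + 2x: the algorithm should recover (1, 2)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
t0, t1 = gradient_descent(x, y)
```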

[[Gradient Descent For Linear Regression]] [["Batch" Gradient Descent]] "Batch": each step of gradient descent uses all the training examples.

[[Linear Algebra]] Dimension of a matrix: number of [[rows]] times number of [[columns]]. [$ A_{ij}]: the element in row [$ i], column [$ j]. Vector: a special case of a matrix which has only one [[column]]. 1-indexed vs 0-indexed: in a mathematical context we use 1-indexed expressions, while in a machine learning context we use 0-indexed expressions. Matrix multiplication: [$ (m, n) \mathrm{Matrix} \times (n, k)\mathrm{Matrix} = (m, k)\mathrm{Matrix} ] Matrix-matrix multiplication is [[not]] commutative (the dimensions of the two products can even be different): [$ A \times B \neq B \times A] Matrix-matrix multiplication is associative: [$ (A \times B) \times C = A \times (B \times C)] Identity matrix [$ I]: diagonal elements are all [$ 1] and the other elements are all [$ 0]. Matrix inverse [$ A^{-1}]: the matrix which satisfies [$ AA^{-1} = A^{-1}A = I]. Matrices that don't have an inverse are called 'singular' or 'degenerate'. Matrix transpose [$ A^{T}]: [$ (A^{T})_{ij} = A_{ji}]
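These properties are easy to check numerically; a small NumPy sketch (matrices chosen only for illustration):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

AB = A @ B                      # multiplying by B on the right swaps A's columns
BA = B @ A                      # multiplying by B on the left swaps A's rows
A_inv = np.linalg.inv(A)        # exists because det(A) = -2 != 0
```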

[** WEEK2] [[Multiple Features]] Suppose there are [$ n] features and [$ m] examples; the value of the [$ i]th feature in the [$ j]th example is written [$ x_i^{(j)}]. The multivariable form of the hypothesis function is as follows: [$ h_\theta (x) = \theta _0 + \theta _1 x_1 + \theta _2 x_2 + ... + \theta _n x_n = \theta ^Tx] Gradient descent for multiple variables: the cost function is now [$ J(\theta )] as [$ J(\theta ) = \frac{1}{2m} \sum ^{m}_{i=1} (h_\theta (x^{(i)}) - y^{(i)})^2]

[[Feature scaling and mean normalization]] For quick convergence of the regression, the features [$ x] shall be scaled and normalized as follows: [$ x_i := \frac{x_i - \mu _i}{s_i}] [$ \mu _i]: average of all values of feature [$ x_i]. [$ s_i]: the range of values (max - min)
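A minimal NumPy sketch of this normalization (the housing-style numbers are hypothetical examples):

```python
import numpy as np

def scale_features(X):
    """Mean normalization: subtract each column's mean, divide by its range."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)   # range (max - min) of each feature
    return (X - mu) / s, mu, s

# Hypothetical housing features: size in square feet, number of bedrooms
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
X_scaled, mu, s = scale_features(X)
```

After scaling, every feature has mean zero and magnitude at most one, so gradient descent takes comparably sized steps in every direction.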

[[Learning rate]] [$ \alpha] Too small an [$ \alpha] leads to slow convergence. Too large an [$ \alpha] can lead to divergence. Polynomial regression can be converted into a linear regression problem as follows: [$ h_\theta (x) = \theta _0 + \theta _1 x + \theta _2 x^2 + \theta _3 x^3 = \theta _0 + \theta _1 x_1 + \theta _2 x_2 + \theta _3 x_3 ] where [$ x_1 = x, x_2 = x^2, x_3 = x^3]. Note that the features [$ (x_1, x_2, x_3)] should be appropriately scaled and normalized.

[[Normal equation]] Let [$ X] be the design matrix whose rows are the training examples. The optimal parameter [$ \theta] is obtained in closed form as: [$ \theta = (X^TX)^{-1}X^Ty] In Octave, just pinv(X'*X)*X'*y The normal equation is slow when the number of features is large (>1000). What if [$ X^TX] is non-invertible? Redundant features (linearly dependent), or too many features (more features than examples).
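The same computation in NumPy (a sketch on hypothetical noise-free data, mirroring the Octave one-liner):

```python
import numpy as np

# Hypothetical data generated by y = 1 + 2x; prepend a bias column to form X
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # design matrix, shape (m, n+1)
y = 1.0 + 2.0 * x

# Normal equation (Octave: pinv(X'*X)*X'*y)
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```

Using `pinv` rather than `inv` keeps the computation well defined even when [$ X^TX] is non-invertible.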

[** WEEK3] [[Logistic regression]] [$ h_\theta (x) = g(\theta ^Tx)] where [$ g(z) = \frac{1}{1+e^{-z}}] This function is called the sigmoid function or logistic function. In a binary classification problem, the probability of being class A plus the probability of being class B equals 1. [$ h_\theta (x) = 0.5] is usually used as the boundary.
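The sigmoid is one line of NumPy; note its symmetry [$ g(z) + g(-z) = 1], which is what makes the two class probabilities sum to 1:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))
```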

[[Decision boundary]] [$ h_\theta (x) = g(\theta ^Tx)]: Predict [$ "y=1"] if [$ \theta ^Tx \geq 0] (equivalently, [$ h_\theta (x) \geq 0.5]).

[[Nonlinear decision boundaries]] As in the case of multivariable linear regression, higher-order terms can be added to the features to fit a nonlinear decision boundary.

[[Generalization of cost function]] [$ J(\theta ) = \frac{1}{m} \sum _{i=1}^{m} \mathrm{Cost}(h_\theta (x^{(i)}), y^{(i)})] convex: bowl-shaped (the global minimum is reachable from any starting point); non-convex: not so.

[[Logistic regression cost function]] [$ \mathrm{Cost}(h_\theta (x), y) = \lbrace \begin{matrix} -\mathrm{log}(h_\theta (x)) & \mathrm{if}\ y=1 \\ -\mathrm{log}(1-h_\theta (x)) & \mathrm{if}\ y=0 \end{matrix}] [$ J(\theta ) = -\frac{1}{m} \sum _{i=1}^{m} \lbrack y^{(i)}\mathrm{log}(h_\theta (x^{(i)})) + (1-y^{(i)})\mathrm{log}(1-h_\theta (x^{(i)})) \rbrack]

[[Multiclass classification]] e.g. Email tagging: Work, Friends, Family, Hobby. Medical diagnosis: not ill, cold, flu...

[[One-vs-all]] An algorithm to reduce a multiclass problem to binary problems: pick one class, treat it as the positive class, and treat all other data points as the negative class; repeat for each class. To make a prediction on a new [$ x], pick the class that maximizes [$ h_\theta (x)]

[[Note]] In logistic regression, what matters is the form of the logistic function [$ h_\theta (x) = g(\theta ^Tx)] where [$ g(z) = \frac{1}{1+e^{-z}}] and the form of the cost function; with these choices, gradient descent can use an update equation of the same form as in linear regression: [$ \theta _j:=\theta _j - \alpha \frac{\partial}{\partial \theta _j}J(\theta )]

[[Overfitting]] Don't do it. It happens, for example, when the data set is too small. Underfit = high bias ... Overfit = high variance. A model selection algorithm would be useful (you'll be learning it afterward).

[[Regularization]] [$ J(\theta ) = \frac{1}{2m} \lbrack \sum ^{m}_{i=1} (h_\theta (x^{(i)}) - y^{(i)})^2 + \lambda \sum^{n}_{j=1} \theta _j^2 \rbrack] The last term is the regularization term, which penalizes large parameters to avoid overfitting. [$ \lambda] is the regularization parameter.

[[Regularization with normal equation]] [$ \theta = (X^TX + \lambda (I - A))^{-1}X^Ty] where [$ A] is a matrix whose only nonzero element is a [$ 1] in the [$ (1,1)] position, which means [$ I-A] is the identity matrix with its [$ (1, 1)] element set to [$ 0] (so [$ \theta _0] is not regularized).
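A NumPy sketch of this closed form (my own function name; the toy data is hypothetical):

```python
import numpy as np

def normal_eq_regularized(X, y, lam):
    """theta = (X'X + lam*L)^(-1) X'y, where L is the identity with L[0,0] = 0
    so that the bias term theta_0 is not penalized."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
```

With lam = 0 this reproduces the unregularized fit; with a huge lam the slope is shrunk to zero while the unpenalized bias settles at the mean of y.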

[[Regularized logistic regression]] [$ J(\theta ) = -\frac{1}{m} \sum _{i=1}^{m} \lbrack y^{(i)}\mathrm{log}(h_\theta (x^{(i)})) + (1-y^{(i)})\mathrm{log}(1-h_\theta (x^{(i)})) \rbrack + \frac{\lambda}{2m}\sum _{j=1}^{n}\theta _j^2]

[** WEEK4] [[Neural Network]] Simulates the neurons in the brain. The terminology 'weight' or 'parameter' is used for [$ \theta] in the input [$ \theta ^Tx] If the network has [$ s_j] units in layer [$ j] and [$ s_{j+1}] units in layer [$ j+1], then [$ \Theta ^{(j)}] will have dimension [$ s_{j+1}\times (s_j+1)]

[[Forward propagation]] Vectorized implementation: [$ z^{(j+1)} = \Theta ^{(j)} a^{(j)}, \quad a^{(j+1)} = g(z^{(j+1)})] where [$ g(z) = \frac{1}{1+e^{-z}}] is applied element-wise and [$ a^{(j)}] includes the bias unit [$ a_0^{(j)} = 1].

[** WEEK5] [[Neural network cost function]] [[Backpropagation]] Suppose the training set [$ \lbrace (x^{(i)}, y^{(i)}) \rbrace]. We want to calculate [$ \frac{\partial}{\partial \Theta ^{(l)}_{i,j}}J(\Theta)] to minimize [$ J(\Theta )]. The backpropagation algorithm propagates error terms [$ \delta] backward through the layers: [$ \delta ^{(l)} = ((\Theta ^{(l)})^T \delta ^{(l+1)}) .\ast a^{(l)} .\ast (1-a^{(l)})] These derivations are right only if we use the sigmoid function as the activation function.

[[Backpropagation in practice]] [[Unrolling parameters]] -> we use a 'vector' minimize function in Octave, so we convert the matrices into a vector. [[Gradient checking]] -> Backpropagation is a fast but complicated method. The numerical gradient is a slow but easy method, which can be used to check whether the backpropagation implementation is correct. [[Random initialization]] -> for breaking symmetry. Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON yields random values in [$ (-\epsilon, \epsilon )]. [[Putting it together]]
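Both gradient checking and random initialization can be sketched in NumPy (a sketch with my own names; the quadratic J is a stand-in for a real network cost, used only because its analytic gradient is known):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided difference (J(theta+e) - J(theta-e)) / (2*eps) per component."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# Check against a known analytic gradient: J(t) = sum(t^2) has gradient 2t
J = lambda t: np.sum(t ** 2)
theta = np.array([1.0, -2.0, 3.0])
num_grad = numerical_gradient(J, theta)

# Random initialization for symmetry breaking (NumPy equivalent of the Octave line)
INIT_EPSILON = 0.12
Theta1 = np.random.rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON
```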

[** WEEK6] [[Deciding what to try next]] Predict housing prices -> regularized linear regression [[what you should not do without thinking]] Get more training examples Try smaller sets of features Try getting additional features Try adding polynomial features Increase or decrease [$ \lambda]... etc.

[[Diagnostics]] Tests that give insight into how to improve the performance of your machine learning system.

[*** Evaluating hypothesis] [[Test set error]] Divide the data set into 70% [[training data set]] and 30% [[test data set]].

[[Cross validation error]] [[Diagnosing Bias vs Variance]] High bias (underfit): both training and cross-validation errors are high. High variance (overfit): training error is low but cross-validation error is much higher. [[Regularization and Bias/Variance]] Large [$ \lambda] tends toward high bias; small [$ \lambda] tends toward high variance.

[** WEEK7] [[Support Vector Machine (SVM)]] SVM is an algorithm to find a hyperplane which maximizes the margin to each class's nearest samples. [$ \mathrm{min}_\theta \ C\sum_{i=1}^{m}\lbrack y^{(i)}\mathrm{cost}_1(\theta^Tx^{(i)}) + (1-y^{(i)})\mathrm{cost}_0(\theta^Tx^{(i)})\rbrack +\frac{1}{2}\sum_{j=1}^n\theta_j^2]

SVM can be extended to nonlinear problems. [[Kernel]] [$ f_1 = \mathrm{similarity}(x, l^{(1)}) = \mathrm{exp} (-\frac{||x-l^{(1)}||^2}{2\sigma^2})] Predict "1" when [$ \theta_0 + \theta_1f_1 + \theta_2f_2 + \theta_3f_3 + ... \ge 0]
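The Gaussian similarity feature itself is a one-liner (a sketch; the example points are hypothetical):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    """Similarity feature f = exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))
```

The feature equals 1 when x sits exactly on the landmark l and decays toward 0 as x moves away, at a rate controlled by sigma.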

[** WEEK8] [**** Unsupervised algorithms] [[K-means method]]

  1. Randomly initialize centroids for the k classes.
  2. Assign each data point to the nearest centroid.
  3. Move each centroid to the average (mean) point of its assigned data points.
  4. Iterate 2. and 3. until [$ J] converges. ( [$ J = \frac{1}{m}\sum_{i=1}^{m} ||x^{(i)} - \mu _{c^{(i)}}||^2] , where [$ ||a-b||] means the distance between [$ a] and [$ b])
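The four steps above can be sketched in NumPy (a minimal sketch with my own names, iterating a fixed number of times rather than testing [$ J] for convergence):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means following the four steps above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()   # step 1
    for _ in range(iters):                                            # step 4
        # step 2: index of the nearest centroid for every data point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned points
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return centroids, labels

# Two well-separated toy clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, 2)
```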

[[Principal Component Analysis (PCA)]] Reduces the dimension of raw data.

  1. Mean-normalize (and optionally scale) the features, then compute [$ \Sigma = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)})(x^{(i)})^T]
  2. [U, S, V] = svd(Sigma)
  3. Choose the smallest [$ k] such that [$ \frac{\sum_{i=1}^{m}||x^{(i)}-x_{approx}^{(i)}||^2}{\sum_{i=1}^{m}||x^{(i)}||^2}] is small (e.g. [$ \le 0.01], i.e. 99% of the variance is retained). Use only to compress or visualize data.
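The three steps can be sketched in NumPy (a sketch; the 2-D data set is hypothetical, chosen to be almost 1-D so a single component captures nearly all the variance):

```python
import numpy as np

# Hypothetical 2-D data lying near the line y = x
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.0]])
X = X - X.mean(axis=0)                  # step 1: mean-normalize

Sigma = (X.T @ X) / len(X)              # covariance matrix (1/m) * sum x x^T
U, S, V = np.linalg.svd(Sigma)          # step 2

k = 1                                   # keep only the first principal component
Z = X @ U[:, :k]                        # compressed representation
X_approx = Z @ U[:, :k].T               # reconstruction from k dimensions

# step 3: fraction of variance lost by the projection
lost = np.sum((X - X_approx) ** 2) / np.sum(X ** 2)
```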

[** WEEK9] [[Anomaly detection]] Anomaly detection usually models each feature with a Gaussian. [$ x_j] : the [$ j]th feature. [$ p(x) = \prod _{j=1}^n p(x_j;\mu _j, \sigma _j^2)] where [$ p(x; \mu, \sigma ^2) =\frac{1}{\sqrt{2\pi}\sigma}\exp{(-\frac{(x-\mu)^2}{2\sigma^2})}] Anomaly detection data sets usually have a lot of normal data compared to anomalous data, so prediction accuracy is not a good index. Use the F1 score instead.

[[Multivariate Gaussian]] [$ p(x; \mu, \Sigma) =\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^\frac{1}{2}}\exp{(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu))}] where [$ \Sigma] is the covariance matrix.
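The density above translates directly into NumPy (a minimal sketch; anomalies would be flagged by comparing the returned density against a threshold [$ \epsilon] chosen on a validation set):

```python
import numpy as np

def multivariate_gaussian(x, mu, Sigma):
    """Density of the multivariate normal N(mu, Sigma) at point x."""
    n = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
```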

[[Recommendation]] Suppose: [$ r(i,j) = 1] if user [$ j] has scored target [$ i] ([$ 0] otherwise). [$ y^{(i,j)} =] score by user [$ j] on target [$ i]. [$ \theta^{(j)} =] parameter vector for user [$ j]. [$ x^{(i)} =] feature vector for target [$ i]. For user [$ j] and target [$ i], the predicted rating is [$ (\theta^{(j)})^T(x^{(i)})]. [$ m^{(j)} =] number of targets scored by user [$ j].

To learn [$ \theta^{(j)}]: [$ \mathrm{min}_{\theta^{(j)}} \frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n(\theta_k^{(j)})^2] To learn [$ \theta^{(1)}, \theta^{(2)},...,\theta^{(n_u)} ]: [$ \mathrm{min}_{\theta^{(1)},...,\theta^{(n_u)}} \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2]

Gradient descent update: [$ \theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}] (for [$ k = 0]) [$ \theta_k^{(j)} := \theta_k^{(j)} - \alpha( \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)})] (for [$ k\neq 0])
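One user's update step can be sketched in NumPy (a sketch with my own names; the feature matrix and ratings are hypothetical, generated from a known parameter vector so the result can be checked):

```python
import numpy as np

def user_theta_step(theta_j, X, y_j, r_j, alpha, lam):
    """One gradient-descent step for one user's parameter vector theta^(j).
    X: (num_items, n) feature matrix (column 0 is the bias feature),
    y_j: this user's ratings, r_j: boolean mask of rated items."""
    err = X[r_j] @ theta_j - y_j[r_j]       # (theta^T x - y) over rated items only
    grad = X[r_j].T @ err
    grad[1:] += lam * theta_j[1:]           # theta_0 (k = 0) is not regularized
    return theta_j - alpha * grad

# Hypothetical setup: ratings generated exactly by a true parameter vector
X = np.array([[1.0, 0.9], [1.0, 0.1], [1.0, 0.5]])
y = X @ np.array([0.0, 5.0])
r = np.array([True, True, True])
theta = np.zeros(2)
for _ in range(500):
    theta = user_theta_step(theta, X, y, r, alpha=0.1, lam=0.0)
```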

Collaborative filtering optimization objective Minimizing over [$ x^{(1)}, x^{(2)},...,x^{(n_m)} ] and [$ \theta^{(1)}, \theta^{(2)},...,\theta^{(n_u)} ] simultaneously: [$ J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}) = \frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2]

[** WEEK10] [[Large data set]] "It's not who has the best algorithm that wins. It's who has the most data."

[[batch gradient descent]] do the gradient descent method using all the training examples at each step

[[stochastic gradient descent]] perform the gradient step for each training example individually. It tends to wander around the local minimum rather than converge exactly, which usually still gives well-optimized parameters.

[[mini-batch gradient descent]] an intermediate of these two methods: divide the training examples into small batches and perform a gradient descent step on each batch.
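Mini-batch gradient descent for linear regression can be sketched as follows (a sketch with my own names; batch size, learning rate, and epoch count are illustrative choices, and the data is noise-free so the exact parameters are recoverable):

```python
import numpy as np

def minibatch_gd(X, y, batch_size=2, alpha=0.1, epochs=2000, seed=0):
    """Mini-batch gradient descent for linear regression (X includes a bias column)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)                  # shuffle examples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ theta - y[idx]
            theta -= alpha * X[idx].T @ err / len(idx)
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x                                   # noise-free line
theta = minibatch_gd(X, y)
```

Setting batch_size = m recovers batch gradient descent; batch_size = 1 recovers stochastic gradient descent.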

[[Online learning]] learn [$ p(y=1|x;\theta)] online, which enables the model to follow changing trends.

[** WEEK 11] [[Machine learning pipeline]] Usually, a whole machine learning system is expressed as a pipeline of machine learning steps. e.g. [[Photo OCR system]] [[Raw picture]] -> [[Text area detection]] -> [[Character segmentation]] -> [[Character classification]]

[[Sliding windows]] The sliding windows technique is used in some machine learning applications, especially for object detection in photos. To detect some component in a picture, you configure a detection window and then slide it along both axes. e.g. if you analyze a 200 * 200 pixel picture with 10 * 10, 15 * 15, and 20 * 20 size windows with a 4-pixel "step", you would process it about 750,000 times.

[[Getting lots of data and artificial data]] If you don't have enough data to train your machine learning system, one way to obtain more positive data is to make artificial modifications to the original positive data sets: for example, rotating, distorting, stretching, and resizing. The modifications should reflect variation you actually expect to see; adding purely random noise to your data usually doesn't help improve training.

[[Ceiling analysis]] If you have a machine learning pipeline and want to improve the overall accuracy but don't know which part to improve, ceiling analysis is applicable: for each component, measure how the overall accuracy would change if that component's accuracy were 100%. If the gain is large, that component is where you should put resources.