slama.dev

Introduction to Machine Learning

Preface

This website contains my lecture notes from a lecture by Ullrich Köthe from the academic year 2022/2023 (University of Heidelberg). If you find something incorrect/unclear, or would like to contribute, feel free to submit a pull request (or let me know via email).

The notes end at the decision tree/forest lecture (inclusive)!

Also, special thanks to Lucia Zhang, some of the notes are shamelessly stolen from her.

Resources

Introduction

Setting:

Idea:

Traditional approach:

Machine learning:

Basic ML workflow:

  1. collect data in two datasets: training dataset (TS) to train and test dataset to validate
    • extremely important, if you don’t do this you die instantly
    • necessary for recognizing overfitting ($F$ is polynomials and we fit points…)
  2. select function family $F$
    • prior knowledge about what the data looks like
    • trial and error (there are tools to help you with this)
  3. find the best $\hat{\theta}$ by fitting $F$ to the training set
  4. validate the quality of $f_{\hat{\theta}}(X)$ on the test set
  5. deploy model in practice
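
Below is a minimal numpy sketch of this workflow (data and the polynomial family are made up for illustration): the held-out test error exposes the overfitting that the training error hides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
Y = np.sin(3 * X) + 0.1 * rng.normal(size=100)    # unknown "true" function + noise

X_train, Y_train = X[:70], Y[:70]                 # 1. training dataset (TS)
X_test, Y_test = X[70:], Y[70:]                   #    and held-out test dataset

for degree in (1, 3, 15):                         # 2. function family F (polynomials)
    theta = np.polyfit(X_train, Y_train, degree)  # 3. fit theta-hat to TS
    train_err = np.mean((np.polyval(theta, X_train) - Y_train) ** 2)
    test_err = np.mean((np.polyval(theta, X_test) - Y_test) ** 2)
    print(degree, train_err, test_err)            # 4. validate on the test set
```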

Probabilistic prediction:

Problems:

Notation

| $i \setminus j$ | 1 (name) | 2 (height) | 3 (gender) |
| --- | --- | --- | --- |
| 1 | Alice | 1.7 m | f |
| 2 | Bob | 1.8 m | m |
| 3 | Max | 1.9 m | m |

We usually (when programming) want a float matrix: drop names, discretize labels and use one-hot encoding (one-hot because exactly one of the indicator features is 1).

| $i \setminus j$ | 1 (height) | 2 (f) | 3 (m) | 4 (o) |
| --- | --- | --- | --- | --- |
| 1 | 1.7 | 1 | 0 | 0 |
| 2 | 1.8 | 0 | 1 | 0 |
| 3 | 1.9 | 0 | 1 | 0 |
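
A plain-numpy sketch of this encoding (values taken from the table above; pandas would also work):

```python
import numpy as np

heights = np.array([1.7, 1.8, 1.9])
genders = ["f", "m", "m"]
categories = ["f", "m", "o"]

# one-hot: exactly one 1 per row within the gender block
one_hot = np.array([[float(g == c) for c in categories] for g in genders])
X = np.column_stack([heights, one_hot])
print(X)
# [[1.7 1. 0. 0.]
#  [1.8 0. 1. 0.]
#  [1.9 0. 1. 0.]]
```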

Types of learning approaches

  1. supervised learning – true responses are known in $\mathrm{TS} = \left\{(X_i, Y_i)\right\}_{i = 1}^N$
  2. weakly supervised learning
    1. we have some information (know $Y_i$ for some instances)
    2. we have worse information (there is a tumor but we don’t know where)
  3. unsupervised learning – only features are known in $\mathrm{TS} = \left\{X_i\right\}_{i = 1}^N$
    • learning algorithm must find the structure in the data on its own (data mining or, for humans, “research”)
    • only a few solutions that are guaranteed to work (this is unsurprisingly difficult to do)
      1. representation learning – compute new features that are better suited for prediction
      2. clustering – group similar instances into “clusters” with a single representative
  4. self-supervised learning – define an auxiliary task where $Y_i$ are easy to determine

Classification

Deterministic classifier

Quality measured by collecting the “confusion matrix”

| $\hat{Y} \setminus Y^*$ | $-1$ | $+1$ |
| --- | --- | --- |
| $-1$ | true negative | false negative |
| $+1$ | false positive | true positive |

  1. false positive/negative fraction – “how many out of all events are false”: $\frac{\# \mathrm{FP}\ \text{or}\ \# \mathrm{FN}}{N} \in \left[0, 1\right]$
  2. false positive/negative rate – “how many out of the negative/positive events are false”: $$\frac{\# \mathrm{FP}}{\# \mathrm{FP} + \# \mathrm{TN}} \approx p(\hat{Y} = 1 \mid Y^* = -1) \qquad \frac{\# \mathrm{FN}}{\# \mathrm{FN} + \# \mathrm{TP}} \approx p(\hat{Y} = -1 \mid Y^* = 1)$$
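
These quantities are straightforward to compute from predictions; a sketch with made-up labels in $\{-1, +1\}$:

```python
import numpy as np

Y_true = np.array([1, 1, -1, -1, 1, -1])
Y_hat  = np.array([1, -1, -1, 1, 1, -1])

TP = np.sum((Y_hat == 1) & (Y_true == 1))
TN = np.sum((Y_hat == -1) & (Y_true == -1))
FP = np.sum((Y_hat == 1) & (Y_true == -1))
FN = np.sum((Y_hat == -1) & (Y_true == 1))

N = len(Y_true)
print("FP fraction:", FP / N, "FN fraction:", FN / N)
print("FP rate:", FP / (FP + TN))   # ≈ p(Ŷ = 1 | Y* = -1)
print("FN rate:", FN / (FN + TP))   # ≈ p(Ŷ = -1 | Y* = 1)
```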

Probabilistic classifier

Quality measured by “calibration”

Important: calculate confusion matrix or calibration from test set, not train set (bias)!

Threshold classifier

Bayes rule of conditional probability

Is hugely important for a number of reasons:

Some history behind ML:

How good can it be?

Definition (Bayes classifier): uses Bayes rule (LHS or RHS) with the true probabilities $p^*$

Theorem: no learned classifier using $\hat{p}$ can be better than the Bayes classifier.

How bad can it be?

Linear classification

  1. use a linear formula to reduce all features to a scalar score
  2. apply a threshold classifier to the score ($C = 2$ for now)

$$\hat{Y}_i = \mathrm{sign}\left(\sum_{j} \beta_j X_{ij} + b\right)$$ for feature weights $\beta$ and intercept $b$.

Perceptron

Gradient descent looks as follows: $$\beta^{(t)} = \beta^{(t - 1)} - \tau \underbrace{\frac{\partial \mathcal{L}^{(t - 1)}}{\partial \beta}}_{\text{loss derivative}}$$ for learning rate $\tau \ll 1$. In our case, the derivative (for a single instance) is $$\frac{\partial \mathcal{L}^{(t - 1)}}{\partial \beta} = \frac{\partial\, \mathrm{ReLU} (-Y_i^* X_i \beta)}{\partial \beta} = \begin{cases} 0 & -Y_i^* X_i \beta < 0\ \ \text{(correct)} \\ -Y_i^* X_i^T & \text{otherwise} \end{cases}$$

Rosenblatt’s algorithm

Algorithm (Rosenblatt’s algorithm):

  1. initialization of $\beta^{(0)}$ – random/uniform numbers
  2. for $t = 1, \ldots, T$ (or until convergence)
    • update $\beta$ using the gradient descent formula above, with minor changes:
      • additionally, use $\tau / N$ instead of $\tau$ (so it doesn’t change when the learning set changes)
      • sum over only the incorrect guesses from the previous iteration (see the sketch below): $$\boxed{\beta^{(t)} = \beta^{(t - 1)} + \frac{\tau}{N} \sum_{i: \hat{Y}_i^{(t - 1)} \neq Y_i^*} Y_i^* X_i^T}$$
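
A direct translation of the boxed update into numpy (a sketch; the intercept $b$ can be absorbed by appending a constant-$1$ feature to $X$):

```python
import numpy as np

def perceptron(X, Y, tau=0.1, T=1000, rng=np.random.default_rng(0)):
    """X: (N, D) features, Y: (N,) labels in {-1, +1}."""
    N, D = X.shape
    beta = rng.normal(size=D)              # 1. random initialization
    for _ in range(T):                     # 2. gradient descent steps
        wrong = np.sign(X @ beta) != Y     # incorrect guesses of the last iteration
        if not wrong.any():
            break                          # converged: everything is classified correctly
        beta += (tau / N) * (Y[wrong, None] * X[wrong]).sum(axis=0)
    return beta
```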

Linear classifier.

Support Vector Machine (SVM)

Improved algorithm (popular around 1995): (Linear) Support Vector Machine (SVM)

Reminder: the distance of a point $X_i$ to the plane $(\beta, b)$ is $d(X_i, (\beta, b)) = \left| \frac{X_i \beta + b}{\|\beta\|} \right|$

The radius of the safety region (margin) is the smallest distance of a point to the decision plane (also called the “Hausdorff distance between $H$ and $\mathrm{TS}$”): $m_H = \min_{i = 1, \ldots, N} d(X_i, (\beta_H, b_H))$

Let $\hat{i}$ be the closest point. Then $Y_{\hat{i}}^* (X_{\hat{i}} \beta_H + b_H) = m_H \|\beta_H\|$

Now we choose a representative such that the equation above equals $1$. The decision plane is then correct when $\forall i: Y_i^* (X_i \beta_H + b_H) \ge 1$ (with equality for $\hat{i}$). To make it as general as possible, we want the plane that maximizes the margin: $$\begin{aligned} H &= \argmax_H m_H \\ &= \argmax_H \left(\min_i \frac{Y_i^* (X_i \beta_H + b_H)}{\|\beta_H\|}\right) \\ &= \argmax_H \left(\frac{1}{\|\beta_H\|} \underbrace{\min_i Y_i^* (X_i \beta_H + b_H)}_{= 1}\right) \\ &= \argmax_H \frac{1}{\|\beta_H\|} \end{aligned}$$

This yields an optimization problem: we want $$\argmax_H \frac{1}{\|\beta_H\|} \quad \text{s.t.} \quad \forall i: Y_i^* (X_i \beta_H + b_H) \ge 1$$

This is inconvenient – change to a more convenient equivalent optimization problem: $$\argmin_H \frac{1}{2} \beta_H^T \beta_H \quad \text{s.t.} \quad \forall i: Y_i^* (X_i \beta_H + b_H) \ge 1$$

Now we define slack variables $\xi_i \ge 0$ that measure how much the constraint was violated, $$\argmin_H \frac{1}{2} \beta_H^T \beta_H \quad \text{s.t.} \quad \forall i: Y_i^* (X_i \beta_H + b_H) \ge 1 - \xi_i$$ and we also want to minimize those, which we’ll do using Lagrange multipliers: $$\boxed{\argmin_{H, \xi} \underbrace{\frac{1}{2} \beta_H^T \beta_H}_{\text{maximize margins}} + \underbrace{\frac{\lambda}{N} \sum_{i = 1}^{N} \mathrm{ReLU} (1 - Y_i^* (X_i \beta_H + b_H))}_{\text{minimize penalties for misclassified points}}}$$

We again optimize by derivative + gradient descent:

$$\frac{\partial \mathcal{L}}{\partial \beta} = \beta + \frac{\lambda}{N} \sum_{i: Y_i^* (X_i \beta + b) < 1} -Y_i^* X_i^T \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{\lambda}{N} \sum_{i: Y_i^* (X_i \beta + b) < 1} -Y_i^*$$

The iteration step looks as follows: $$\boxed{\begin{aligned} \beta^{(t)} &= \beta^{(t - 1)} - \tau \left(\beta^{(t - 1)} + \frac{\lambda}{N} \sum_{i: Y_i^* (X_i \beta + b) < 1} -Y_i^* X_i^T\right) \\ b^{(t)} &= b^{(t - 1)} - \tau \left(\frac{\lambda}{N} \sum_{i: Y_i^* (X_i \beta + b) < 1} -Y_i^*\right) \end{aligned}}$$
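
The boxed iteration as a sketch (full-batch gradient descent; real implementations usually use stochastic subgradients):

```python
import numpy as np

def linear_svm(X, Y, lam=1.0, tau=0.01, T=1000):
    """X: (N, D), Y: (N,) labels in {-1, +1}; returns (beta, b)."""
    N, D = X.shape
    beta, b = np.zeros(D), 0.0
    for _ in range(T):
        margin = Y * (X @ beta + b)
        viol = margin < 1                  # points inside the margin (or misclassified)
        grad_beta = beta - (lam / N) * (Y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -(lam / N) * Y[viol].sum()
        beta -= tau * grad_beta
        b -= tau * grad_b
    return beta, b
```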

Linear Discriminant Analysis (LDA)

To get any ellipse, we start with a unit circle $z$ and stretch ($\lambda$), rotate ($Q$) and shift ($\mu$) it: $$Q \lambda z + \mu = \begin{pmatrix} \cos \varphi & -\sin \varphi \\ \sin \varphi & \cos \varphi \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} z + \mu$$

Derivation of an ellipse from a unit circle.

Solving for $z$ (and using the fact that $Q$ is orthogonal), we get $z = \lambda^{-1} Q^{-1} (X - \mu) = \lambda^{-1} Q^T (X - \mu)$

The (one-dimensional) Gaussian distribution can be defined as $$\mathcal{N}(x \mid \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2} \left((x - \mu)\sigma^{-1}\right)^2\right)$$

For higher dimensions (with our circle $z$), we get the generalized Gaussian $$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{\det\left(2 \pi \Sigma\right)}} \exp\left(-\frac{1}{2} (x - \mu) \underbrace{Q^{-1}\lambda^{-1}\lambda^{-1}Q^{-T}}_{\Sigma^{-1}} (x - \mu)^T\right)$$

Let $\left\{X_i\right\}_{i = 1}^{N_1}$ be the features of class $1$. Then $$\mathcal{N}(X \mid \mu_1, \Sigma_1) = \frac{1}{\sqrt{\det(2 \pi \Sigma_1)}} \exp\left(-\frac{1}{2} (X - \mu_1) \Sigma_1^{-1} (X - \mu_1)^T\right)$$

To derive the learning method, we’ll use two things:

  1. maximum likelihood principle: choose $\mu_1$ and $\Sigma_1$ such that TS will be a typical outcome of the resulting model (i.e. the best model maximizes the likelihood of TS)
  2. i.i.d. assumption: training instances drawn independently from the same distribution
    • independently – joint distribution is the product
    • identically distributed – all instances come from the same distribution

For the probability, we get $$\begin{aligned} p(\mathrm{TS}) &= p(X_1, \ldots, X_N) \\ &= \prod_{i = 1}^{N} p_i(X_i) \qquad \text{independently} \\ &= \prod_{i = 1}^{N} p(X_i) \qquad \text{identically distributed} \\ &= \prod_{i = 1}^{N} \mathcal{N}(X_i \mid \mu, \Sigma) \end{aligned}$$

Using the maximum likelihood principle, the problem becomes $$\hat{\mu}, \hat{\Sigma} = \argmax_{\mu, \Sigma} p(\mathrm{TS})$$

It’s mathematically simpler to minimize the negative logarithm of $p(\mathrm{TS})$ (applying a monotonic function to an optimization problem doesn’t change the $\argmin$, and $\max = -\min$). We get

$$\begin{aligned} \hat{\mu}, \hat{\Sigma} &= \argmin_{\mu, \Sigma} -\log(p(\mathrm{TS})) \\ &= - \sum_{i = 1}^{N} \left[\log \left(\frac{1}{\sqrt{\det(2 \pi \Sigma)}}\right) - \frac{1}{2} (X_i - \mu) \Sigma^{-1} (X_i - \mu)^T\right] \\ &= \sum_{i = 1}^{N} \left[\log \left(\det(2 \pi \Sigma)\right) + (X_i - \mu) \Sigma^{-1} (X_i - \mu)^T\right] = \mathcal{L}(\mathrm{TS}) \end{aligned}$$ (dropping the constant factor $\frac{1}{2}$, which doesn’t change the $\argmin$)

This is our loss. We now take the derivative and set it to $0$, since that will find the optimum. First, we differentiate with respect to $\mu$, which gets rid of the first part of the loss completely and we get

$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial \mu} = \sum_{i = 1}^{N} \overbrace{\Sigma^{-1} (X_i - \mu)^T}^{\frac{\partial vAv^T}{\partial v} = 2Av^T} &= 0 \\ \sum_{i = 1}^{N} (X_i - \mu)^T &= 0 \\ \sum_{i = 1}^{N} X_i &= N \mu \\ \frac{1}{N} \sum_{i = 1}^{N} X_i &= \mu \end{aligned}$$

In other words, the mean μ\mu is the average (shocking, I know).

For $\Sigma$, this will be a little more complicated. We differentiate with respect to $\Sigma^{-1} = K$ instead, which gives the following: $$\begin{aligned} \frac{\partial \mathcal{L}}{\partial K} = \sum_{i = 1}^{N} \left[\overbrace{-(K^T)^{-1}}^{\frac{\partial \log \det K}{\partial K} = (K^T)^{-1}} + \overbrace{(X_i - \mu)^T (X_i - \mu)}^{\frac{\partial vAv^T}{\partial A} = v^Tv}\right] &= 0 \\ \sum_{i = 1}^{N} \left[-K^{-1} + (X_i - \mu)^T (X_i - \mu)\right] &= 0 \qquad K\ \text{symmetric} \\ -N K^{-1} + \sum_{i = 1}^{N} (X_i - \mu)^T (X_i - \mu) &= 0 \\ \frac{1}{N} \sum_{i = 1}^{N} (X_i - \mu)^T (X_i - \mu) &= \Sigma \end{aligned}$$

Again, in other words, the covariance is the average of the squared (outer-product) deviations from the mean, which also makes sense.

Now we have $2$ classes with the same covariance (by the assumption of LDA) and we can:

  1. determine the two means ($\mu_1, \mu_{-1}$) as $$\boxed{\mu_1 = \frac{1}{N_1} \sum_{i: Y_i^* = 1} X_i \qquad \mu_{-1} = \frac{1}{N_{-1}} \sum_{i: Y_i^* = -1} X_i}$$
  2. calculate the covariance (which is the same for both classes): $$\boxed{\Sigma = \frac{1}{N} \left(\sum_{i: Y_i^* = 1} (X_i - \mu_1)^T (X_i - \mu_1) + \sum_{i: Y_i^* = -1} (X_i - \mu_{-1})^T (X_i - \mu_{-1})\right)}$$
  3. use the Bayes RHS and our calculations to calculate the LHS: $$\boxed{\begin{aligned} \hat{Y}_i = \mathrm{sign}(X_i \beta + b) \quad \text{with} \quad &\beta = 2 \Sigma^{-1} (\mu_1 - \mu_{-1})^T \\ &b = \mu_{-1} \Sigma^{-1} \mu_{-1}^T - \mu_1 \Sigma^{-1} \mu_1^T \end{aligned}}$$
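
The three steps above as a short numpy sketch (labels assumed in $\{-1, +1\}$):

```python
import numpy as np

def lda_fit(X, Y):
    X1, Xm1 = X[Y == 1], X[Y == -1]
    mu1, mum1 = X1.mean(axis=0), Xm1.mean(axis=0)      # 1. class means
    C1, Cm1 = X1 - mu1, Xm1 - mum1
    Sigma = (C1.T @ C1 + Cm1.T @ Cm1) / len(X)         # 2. shared covariance
    beta = 2 * np.linalg.solve(Sigma, mu1 - mum1)      # 3. decision rule
    b = mum1 @ np.linalg.solve(Sigma, mum1) - mu1 @ np.linalg.solve(Sigma, mu1)
    return beta, b

def lda_predict(X, beta, b):
    return np.sign(X @ beta + b)
```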

The derivation of (3) looks as follows, assuming $p(Y = 1) = p(Y = -1) = \frac{1}{2}$ for simpler computations (the case $p(Y = -1 \mid X)$ being analogous):

$$\begin{aligned} p(Y=1 \mid X)&=\frac{p(X \mid Y=1)\,p(Y=1)}{p(X \mid Y=1)\,p(Y=1)+p(X \mid Y=-1)\,p(Y=-1)}\\ &=\frac{p(X \mid Y=1)}{p(X \mid Y=1)+p(X \mid Y=-1)}\\ &=\frac{1}{1+\frac{p(X \mid Y=-1)}{p(X \mid Y=1)}} \end{aligned}$$

Now we substitute the Gaussian function:

$$\begin{aligned} \frac{p(X_i \mid Y=-1)}{p(X_i \mid Y=1)}&=\frac{\cancel{\frac{1}{\sqrt{\det(2\pi\Sigma)}}}\exp\left(-\frac{1}{2}(X_i-\mu_{-1})\Sigma^{-1}(X_i-\mu_{-1})^T\right)}{\cancel{\frac{1}{\sqrt{\det(2\pi\Sigma)}}}\exp\left(-\frac{1}{2}(X_i-\mu_{1})\Sigma^{-1}(X_i-\mu_{1})^T\right)}\\ &=\exp\left(-\frac{1}{2}(X_i-\mu_{-1})\Sigma^{-1}(X_i-\mu_{-1})^T+\frac{1}{2}(X_i-\mu_{1})\Sigma^{-1}(X_i-\mu_{1})^T\right) \\ &=\exp\left(-\frac{1}{2}\left[\cancel{X\Sigma^{-1}X^T}-2X\Sigma^{-1}\mu_{-1}^T+\mu_{-1}\Sigma^{-1}\mu_{-1}^T-\cancel{X\Sigma^{-1}X^T}+2X\Sigma^{-1}\mu_1^T-\mu_1\Sigma^{-1}\mu_{1}^T\right]\right)\\ &=\exp\Big(-X\underbrace{\Sigma^{-1}(\mu_1^T-\mu_{-1}^T)}_\beta-\underbrace{\tfrac{1}{2}(\mu_{-1}\Sigma^{-1}\mu_{-1}^T-\mu_1\Sigma^{-1}\mu_1^T)}_b\Big)=\exp(-(X\beta+b)) \end{aligned}$$

Plugging the result into the previous equation, we obtain the sigmoid function σ\sigma:

$$p(Y = 1 \mid X) = \frac{1}{1 + \exp(-(X \beta + b))} = \sigma(X\beta + b) \qquad \text{where} \qquad \sigma(t) = \frac{1}{1 + \exp(-t)}$$

Two of its properties will be helpful for us: $$\sigma(-t) = 1 - \sigma(t) \qquad \frac{\partial \sigma(t)}{\partial t} = \sigma(t) \sigma(-t)$$

Doing an analogous derivation for $p(Y = -1 \mid X)$, we get $$\begin{aligned} \hat y=\argmax_k p(Y=k \mid X) \Longleftrightarrow& \begin{cases}1 &\text{if } \sigma(X\beta+b)>\frac{1}{2}\\ -1 &\text{if } \sigma(X\beta+b)<\frac{1}{2}\end{cases}\\ \Longleftrightarrow& \begin{cases}1 &\text{if } X\beta+b>0\\ -1 &\text{if } X\beta+b<0\end{cases}\\ \Longleftrightarrow&\ \hat y = \mathrm{sign}(X\beta+b) \end{aligned}$$

Alternatives for training $\beta$ and $b$:

  1. fit the mean and covariance of the clusters
  2. least-squares regression: $\mathcal{L}(Y_i^*, \hat{Y}_i) = (Y_i^* - (X_i \beta + b))^2$ (same solution as 1)
  3. Fisher’s idea: define 1-D scores $Z_i = X_i \beta$ and choose $\beta$ such that a threshold on $Z_i$ has minimum error
    • define the projections of the means $\hat{\mu}_{1} = \mu_1 \beta,\ \hat{\mu}_{-1} = \mu_{-1}\beta$
    • intuition: $\hat{\mu}_1$ and $\hat{\mu}_{-1}$ should be as far apart as possible: $\beta = \argmax_\beta (\hat{\mu}_1 - \hat{\mu}_{-1})^2$
    • doesn’t quite work, because scaling $\beta$ scales the objective: $\tau \beta \implies \tau^2 (\hat{\mu}_{1} - \hat{\mu}_{-1})^2$
    • solution: scale by the variances $\hat{\sigma}_1^2 = \mathrm{Var}\left(Z_i \mid Y_i^* = 1\right),\ \hat{\sigma}_{-1}^2 = \mathrm{Var}\left(Z_i \mid Y_i^* = -1\right)$; then we get $$\hat{\beta} = \argmax_{\beta} \frac{\left(\hat{\mu}_1 - \hat{\mu}_{-1}\right)^2}{\hat{\sigma}_1^2 + \hat{\sigma}_{-1}^2}$$
    • again gives the same solution as 1 and 2
  4. logistic regression (LR): same posterior as LDA, but learn the LHS of Bayes rule
    • gives a different solution than LDA

Logistic regression (LR)

We again have i.i.d. assumptions – all labels are drawn independently from the same posterior: $$p\left(\left(Y_i^*\right)_{i = 1}^N \mid \left(X_i\right)_{i = 1}^N\right) = \prod_{i = 1}^{N} p(Y_i^* \mid X_i)$$

For LR, $Y^*_i \in \left\{0, 1\right\}$, which allows us to rewrite (with the $\sigma$ results from LDA) like so: $$\boxed{\hat{\beta}, \hat{b} = \argmin_{\beta, b} -\sum_{i=1}^N \Big[Y_i^* \log \sigma\left(X_i\beta+b\right) + \left(1-Y_i^*\right) \log \sigma\left(-\left(X_i\beta+b\right)\right)\Big]}$$

Here $-\log(\sigma(-t)) = \log(1 + \exp(t))$, which is called the softplus function:

ReLU + softplus graphs.

Simplifying for $b = 0$ and using the following properties:

$$\begin{aligned} \frac{\partial \sigma\left(X\beta\right)}{\partial \beta}&=\sigma^{\prime}\left(X\beta\right) X=\sigma(X \beta) \sigma\left(-X\beta\right) \cdot X \\ \frac{\partial \log \sigma\left(X\beta\right)}{\partial \beta}&=\frac{1}{\sigma\left(X\beta\right)} \sigma\left(X\beta\right) \sigma\left(-X\beta\right) \cdot X=\sigma\left(-X\beta\right) \cdot X \\ \frac{\partial \log \sigma\left(-X\beta\right)}{\partial \beta}&=\frac{1}{\sigma\left(-X\beta\right)} \sigma\left(X\beta\right) \sigma\left(-X\beta\right) \cdot(-X)=-\sigma\left(X\beta\right) \cdot X \end{aligned}$$

we get the derivative with respect to $\beta$: $$\begin{aligned} \frac{\partial\, \mathrm{Loss}(\mathrm{TS})}{\partial \beta}&=-\sum_{i=1}^N\left[Y_i^* \underbrace{\sigma\left(-X_i \beta\right)}_{1-\sigma\left(X_i \beta\right)} \cdot X_i+\left(1-Y_i^*\right) \sigma\left(X_i \beta\right) \cdot\left(-X_i\right)\right] \\ &=-\sum_{i=1}^N\left[Y_i^* X_i-\cancel{Y_i^* \sigma\left(X_i \beta\right) X_i}-\sigma\left(X_i \beta\right) X_i+\cancel{Y_i^* \sigma\left(X_i \beta\right) X_i}\right] \end{aligned}$$

obtaining

$$\boxed{\frac{\partial\, \mathrm{Loss}(\mathrm{TS})}{\partial \beta}=\sum_{i=1}^N \big(\underbrace{\sigma\left(X_i \beta\right)-Y_i^*}_{\text{error}}\big) \cdot X_i \stackrel{!}{=} 0}$$

The four cases that can occur are as follows:

| Real value \ Classifier result | $\sigma(X_i\beta)\approx 1$ | $\sigma(X_i\beta)\approx 0$ |
| --- | --- | --- |
| $Y^*_i=1$ | no correction | error $\approx -1$: pulls $\sigma(X_i\beta) \to 1$ |
| $Y^*_i=0$ | error $\approx 1$: pulls $\sigma(X_i\beta) \to 0$ | no correction |
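
A sketch of training with the boxed gradient (labels in $\{0, 1\}$, intercept absorbed into $X$ via a constant-$1$ column; the gradient is averaged over $N$ here, which only rescales the learning rate):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression(X, Y, tau=0.1, T=1000):
    beta = np.zeros(X.shape[1])
    for _ in range(T):
        error = sigmoid(X @ beta) - Y         # the "error" factor from the box
        beta -= tau * X.T @ error / len(Y)    # gradient descent step
    return beta
```
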
Summary

Perceptron: $\mathrm{ReLU}(-Y_i^* (X_i \beta + b)) \qquad \lambda = 0$

SVM: $\mathrm{ReLU}(1 - Y_i^* (X_i \beta + b)) \qquad \lambda > 0$

LDA: $(Y_i^* - (X_i \beta + b))^2 \qquad \lambda = 0\ (\text{fit } \mu, \Sigma),\ \lambda > 0\ (\text{fit objective})$

LR: $\mathrm{softplus}(-Y_i^* (X_i \beta + b)) \qquad \lambda = 0\ (\text{or} > 0)$

Linear classifier.

Multi-class classification

  1. one-against-rest classification
    • for $k = 1, \ldots, C$, define $\beta_k, b_k \implies$ score $s_k = X_i \beta_k + b_k$
    • train by treating $Y_i^* = k$ as “class $+1$” and $Y_i^* \neq k$ as “class $-1$” (rest)
    • make the scores comparable by normalization: $\hat{\beta}_k = \frac{\beta_k}{\|\beta_k\|}$ (same for $b$)
    • classify according to the biggest score, or “don’t know” if all scores are bad: $$\hat{Y}_i = \begin{cases} \text{“unknown”} & \forall k: s_k < \varepsilon \\ \argmax_k s_k & \text{otherwise}\end{cases}$$
  2. all-pairs: train a linear model for all pairs $k, k'$ ($\approx C^2$ models)
    • $s_{k, k'} = X_i \beta_{k, k'} + b_{k, k'},\ \forall k \neq k'$
    • if $s_{k, k'} > 0$ then one vote for class $k$, if $s_{k, k'} < 0$ then one vote for $k'$
    • $\hat{Y}_i$ is the label with the most votes (or “unknown” for ties)
  3. define the posterior via the “softmax function” of the scores from (1) (see the sketch after this list): $$\boxed{p(\hat Y_i = k \mid X_i) = \frac{\exp(s_k)}{\sum_{k' = 1}^{C} \exp(s_{k'})} = \mathrm{softmax}(s_1, \ldots, s_C)_k}$$
    • standard for neural network classification
    • easily provable: $\mathrm{softmax}(s_1, s_2)_1 = \sigma(s_1 - s_2)$
    • new: train all $\beta_k, b_k$ jointly (at the same time)
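
The softmax from (3) in code (a sketch; subtracting the maximum is a standard trick against overflow and doesn’t change the result):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())                  # numerical stability
    return e / e.sum()

s = np.array([2.0, -1.0])
print(softmax(s))                            # posterior over the C classes
print(1 / (1 + np.exp(-(s[0] - s[1]))))      # equals softmax(s)[0]: σ(s1 - s2)
```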

Non-linear classification

  1. measure more features and increase the dimension
    • higher dimensional spaces tend to be more linearly separable
    • for $N = D + 1$, it’s always separable (but maybe not useful)
    • for $D > N$, use sparse learning methods
  2. use a non-linear classifier
    • QDA, which is a non-linear version of LDA
    • kernel SVM (non-linear SVM, less popular but cool math)
    • decision trees/forests (recursively subdivide $X$, we’ll discuss them later)
  3. non-linearly transform the features: $\tilde{X}_i = \varphi(X_i)$
    • XOR function: $x_1, x_2 \in \left\{-1, 1\right\}$, $y_i^* = \begin{cases} +1 & x_1 \neq x_2 \\ -1 & \text{otherwise} \end{cases}$
    • the four points become classifiable with $\tilde{x} = x_1 \cdot x_2$ and $\hat{y} = \begin{cases} +1 & \tilde x < 0 \\ -1 & \text{otherwise} \end{cases}$ (see the sketch below)
    • BMI (non-linear formula, linear classification)
    • problem: hand-crafting $\varphi$ is difficult
    • solution: learn $\varphi$ (multi-layer neural networks)
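
The XOR example in code (sketch): the four points are not linearly separable in $(x_1, x_2)$, but a threshold on the transformed feature $\tilde x$ classifies them perfectly.

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
Y = np.array([-1, 1, 1, -1])               # +1 iff x1 != x2

x_tilde = X[:, 0] * X[:, 1]                # non-linear feature x̃ = x1·x2
Y_hat = np.where(x_tilde < 0, 1, -1)       # a simple threshold now suffices
print(np.all(Y_hat == Y))                  # True
```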

Neural networks

A neural network is a collection of neurons in parallel layers

Network example.

Previously, NNs were believed not to be a good idea, but

Activation functions:

Important theorems:

  1. neural networks with $L \ge 2$ layers ($= 1$ hidden layer) are universal approximators (can approximate any function to any accuracy), if there are enough neurons in the hidden layer
    • purely existential proof, but nice to know
    • up until ~2005, such shallow networks were the standard, since one hidden layer is in principle enough
      • bad idea – deeper networks are easier to train (many good local optima)
  2. finding the optimal network is NP-hard $\implies$ we need approximation algorithms

Training by backpropagation (chain rule)

  1. $\frac{\partial\, \mathrm{Loss}}{\partial z_L}$ (application-dependent loss)
    • regression: $$\mathrm{Loss} = \frac{1}{2} \left(z_L - Y_i^*\right)^2 \qquad \frac{\partial\, \mathrm{Loss}}{\partial z_L} = z_L - Y_i^*$$
    • classification: $$\mathrm{Loss} = -\sum_{k = 1}^{C} \mathbb{1}\left[k = Y_i^*\right] \log \underbrace{p(Y = k \mid X)}_{z_{Lk}} \qquad \frac{\partial\, \mathrm{Loss}}{\partial z_{Lk}} = \begin{cases} -\frac{1}{z_{Lk}} & k = Y_i^* \\ 0 & \text{otherwise} \end{cases}$$
  2. back-propagate through the output activation $\varphi_L (\tilde z_L)$
    • here we define $\tilde{\delta}_L = \frac{\partial\, \mathrm{Loss}}{\partial \tilde z_L}$ since we’ll use it a lot
    • for regression, we had the identity activation function, so $z_L = \varphi_L(\tilde z_L) = \tilde z_L \implies \tilde{\delta}_L = \tilde z_L - Y_i^*$
    • for classification, this is a bit more complicated but ends up being $$\boxed{\tilde{\delta}_{Lk} = \begin{cases} z_{Lk} - 1 & k = Y_i^* \\ z_{Lk} & \text{otherwise} \end{cases}}$$
  3. for every $l = 1, \ldots, L$, let $\tilde \delta_l = \frac{\partial\, \mathrm{Loss}}{\partial \tilde z_l}$; the recursion starts with $\tilde \delta_L$ (previous step)
    • use the chain rule (and simplify/compute using previous layers): $$\boxed{\tilde \delta_{l - 1} = \frac{\partial\, \mathrm{Loss}}{\partial \tilde z_{l - 1}}} = \frac{\partial\, \mathrm{Loss}}{\partial \tilde z_l} \cdot \frac{\partial \tilde z_l}{\partial z_{l - 1}} \cdot \frac{\partial z_{l - 1}}{\partial \tilde z_{l - 1}} = \boxed{\tilde \delta_l \cdot B_l^T \cdot \mathrm{diag}(\varphi'(\tilde z_{l - 1}))}$$
  4. another chain rule, finally with what we want to calculate: $$\boxed{\frac{\partial\, \mathrm{Loss}}{\partial B_l}} = \frac{\partial\, \mathrm{Loss}}{\partial \tilde z_l} \cdot \frac{\partial \tilde z_l}{\partial B_l} = \boxed{\tilde z_{l - 1}^T \cdot \tilde \delta_l}$$

Training algorithm:

  1. init $B_l$ randomly (explained later)
  2. for $t = 1, \ldots, T$
    1. forward pass – for $i$ in batch
      • $z_0 = [1, X_i]$
      • for $l = 1, \ldots, L$, compute + store $z_l, \tilde{z}_l$ along the way
    2. backward pass: $\Delta B_l = 0$, for $i$ in batch
      • compute $\tilde \delta_L$
      • for $l = L, \ldots, 1$, compute
        • $\Delta B_l \mathrel{+}= \tilde z_{l - 1}^T \cdot \tilde \delta_l$
        • $\tilde \delta_{l - 1} = \tilde \delta_l \cdot \left(B_{l}^{(t - 1)}\right)^T \cdot \mathrm{diag}\left(\varphi'_{l - 1}\left(\tilde z_{l - 1}\right)\right)$
    3. update: $B_l^{(t)} = B_l^{(t - 1)} - \tau \Delta B_l$
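
A minimal end-to-end sketch of this algorithm: one hidden ReLU layer, identity output, squared loss, full-batch updates (toy data; the layer size and learning rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.sin(3 * X)                          # toy regression target

Z0 = np.hstack([np.ones((len(X), 1)), X])  # z_0 = [1, X_i] for every instance
B1 = rng.normal(0.0, 0.5, size=(2, 32))    # random initialization (step 1)
B2 = rng.normal(0.0, 0.5, size=(32, 1))
tau = 0.05

for t in range(2000):
    # forward pass: compute and store z-tilde and z for every layer
    Zt1 = Z0 @ B1                          # pre-activation z̃_1
    Z1 = np.maximum(Zt1, 0.0)              # ReLU activation z_1
    Z2 = Z1 @ B2                           # identity output z_L

    # backward pass (the delta recursion from above)
    delta2 = (Z2 - Y) / len(X)             # δ̃_L = z_L - Y* for squared loss
    dB2 = Z1.T @ delta2                    # ΔB_l = z_{l-1}^T · δ̃_l
    delta1 = (delta2 @ B2.T) * (Zt1 > 0)   # δ̃_{l-1} = δ̃_l B_l^T diag(φ')
    dB1 = Z0.T @ delta1

    # update step
    B1 -= tau * dB1
    B2 -= tau * dB2

print("final MSE:", np.mean((Z2 - Y) ** 2))
```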

Convolutional Neural Networks (CNNs)

Definition (filter): consumes an image and outputs another “filtered” image, $\tilde f(t) = \mathrm{Filter}\left[f(t)\right]$

For our filter, we want translation equivariance, $$\tilde f(t) = \mathrm{Filter}\left[f(t)\right] \implies \tilde f(t + t_0) = \mathrm{Filter}\left[f(t + t_0)\right]$$ and linearity, $$\alpha_1 \tilde f(t) + \alpha_2 \tilde g(t) = \mathrm{Filter}\left[\alpha_1 f(t) + \alpha_2 g(t)\right]$$

Theorem: any such filter can be written as a convolution integral, i.e. $$\tilde f(t) = \int_{-\infty}^{\infty} f(t') \cdot \underbrace{g(t - t')}_{\text{filter kernel}}\, dt' = (f \otimes g)(t)$$

Substituting $t'' = t - t'$, we see that the operation is commutative: $$\int_{-\infty}^{\infty} f(t - t'') \cdot g(t'')\, dt'' = (g \otimes f)(t)$$

Since we don’t have infinite images, $t = \ldots, -2, -1, 0, 1, 2, \ldots$, the integral becomes a sum: $$\boxed{\tilde f_t = \sum_{t' = -\infty}^{\infty} f_{t'} \cdot g_{t - t'}}$$

Trick: if we want to find $g(t)$ of a black-box filter, take $f(t) = \delta(t)$ (the Dirac delta function); then $$\tilde f(t) = \int_{-\infty}^{\infty} \delta(t') \cdot g(t - t')\, dt' = g(t)$$

Usually, $g(t)$ has finite support, i.e. $g(t) \equiv 0$ for $|t| > T$; then we get $$\tilde f_t = \sum_{t' = -T}^{T} f_{t - t'}\, g_{t'}$$
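
The finite-support sum in code (a sketch; `np.convolve` implements the same operation for 1-D signals):

```python
import numpy as np

def conv1d(f, g):
    """f: signal, g: kernel with finite support 2T+1; 'same'-size output."""
    T = len(g) // 2
    f_pad = np.pad(f, T)                   # zero padding at the boundary
    return np.array([
        sum(f_pad[t + T - tp] * g[tp + T] for tp in range(-T, T + 1))
        for t in range(len(f))
    ])

f = np.array([0.0, 0, 1, 0, 0])            # Dirac impulse
g = np.array([1.0, 2, 3])                  # kernel
print(conv1d(f, g))                        # recovers the kernel (the Dirac trick)
print(np.convolve(f, g, mode="same"))      # same result
```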

Why is this useful? – “matched filter principle”

Matched filter example.

If we repeat this in multiple layers and learn the ggs, we can do very powerful things.

Fun fact: in practice, CNNs do not implement convolution but instead correlation, $$\tilde f(t) = \int_{-\infty}^{\infty} f(t + t') \cdot g(t')\, dt'$$ for maths reasons (the maths requires the mirrored kernel, and convolution with a mirrored kernel is correlation).

To build a CNN (with 3-tap filters):

  1. each neuron only connects to 3 neurons of previous layer
    • image boundaries have to be resolved:
      • cut off the offending neurons
      • zero padding
      • reflect boundaries
  2. weights are shared in each layer based on if the connection is left/middle/right
    • high response for the given pixel is a match

Example: LeNet architecture (Yann LeCun, 1998, 7 layers)

LeNet network diagram.

Comparing a CNN to a fully-connected (FC) network, the weight matrices look like this:

$$B_{\text{FC}} = \begin{pmatrix} \beta_{1,1} & \beta_{1,2} & \ldots & \beta_{1,6} \\ \beta_{2,1} & \beta_{2,2} & \ldots & \beta_{2,6} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{6,1} & \beta_{6,2} & \ldots & \beta_{6,6} \end{pmatrix} \qquad B_{\text{CNN}} = \begin{pmatrix} w_2 & w_1 & 0 & 0 & 0 & 0 \\ w_3 & w_2 & w_1 & 0 & 0 & 0 \\ 0 & w_3 & w_2 & w_1 & 0 & 0 \\ 0 & 0 & w_3 & w_2 & w_1 & 0 \\ 0 & 0 & 0 & w_3 & w_2 & w_1 \\ 0 & 0 & 0 & 0 & w_3 & w_2 \end{pmatrix}$$

Here we handle the boundary with zeroes, but it can be done differently:

Here we’ve seen one of the major operations, convolution, the other being pooling:

CNNs usually alternate between convolution, non-linear layers (ReLU), pooling, dropout layers, batch normalization layers and (towards the end of the network), fully-connected layers and softmax layers.

Receptive field of a neuron

Some history

Residual Networks (ResNet)

Training

Data augmentation
Self-supervised training

U-Net

U-net network structure.

Regression

Linear regression

Written another way, we get $\bar y = \bar X \beta + \hat b$, meaning that the regression always goes through the center of mass of TS, which allows us to first center the data ($X - \bar X$, $Y - \bar Y$), making the center of mass the origin and getting rid of $\hat b$ (same as classification).

$$\frac{\partial\, \mathrm{loss}}{\partial \beta} = \sum_{i = 1}^{N} -2 (Y_i - X_i \beta) X_i^T \overset{!}{=} 0 \qquad\Longrightarrow\qquad \boxed{\beta = \frac{\sum_{i = 1}^{N} Y_i X_i^T}{\sum_{i = 1}^{N} X_i X_i^T}}$$

Now we want to generalize to feature vectors – $X_i \in \mathbb{R}^{1 \times D}$ (row vector) $\implies$ rewrite the objective in matrix form (still assuming centered data):

$$\hat \beta = \argmin_\beta \sum_{i = 1}^{N} (Y_i - X_i \beta)^2 = \argmin_\beta (Y - X \beta)^T (Y - X\beta)$$ where $Y \in \mathbb{R}^{N \times 1}$, $X \in \mathbb{R}^{N \times D}$, so we get $$\frac{\partial\, \mathrm{loss}}{\partial \beta} = -2 X^T (Y - X \beta) \overset{!}{=} 0 \qquad\Longrightarrow\qquad \boxed{(X^T X) \hat \beta = X^T Y}$$

which are called the normal equations of the OLS problem, since $\beta$ is the normal of the regression plane, and we can solve them as $$\boxed{\hat \beta = \underbrace{(X^T X)^{-1} X^T}_{\text{pseudoinverse}\ X^+} Y}$$
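
In code, the normal equations and a library solver agree (a sketch; `np.linalg.lstsq` solves the problem via SVD, which is numerically preferable to forming $X^T X$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.01 * rng.normal(size=100)

beta_ne = np.linalg.solve(X.T @ X, X.T @ Y)        # normal equations
beta_svd, *_ = np.linalg.lstsq(X, Y, rcond=None)   # SVD-based solution
print(beta_ne, beta_svd)                           # both ≈ beta_true
```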

Pseudoinverse is a generalization of the inverse for rectangular matrices:

We can instead solve OLS by SVD:

Third solution: QR decomposition

What do we do if NN and DD are very big?

LSQR
  1. trick: only use the subroutines to exploit the structure
  2. trick: rearrange the computation such that only 2 rows or columns of the result matrices are needed simultaneously (in memory)

The lecture here has an interlude into computer tomography, which can be solved using least-squares. Most of it is explained in homework 4 so I’m not going to repeat it here.

Least Squares v2.0

What do we do when noise variance is not constant?

$$\begin{aligned} \theta = \argmax_\theta p(\mathrm{TS}) &= \argmax_\theta \prod_{i = 1}^{N} p_i(Y_i \mid X_i) \\ &= \argmin_\theta \sum_{i = 1}^{N} - \log p_i(Y_i \mid X_i) \\ &= \argmin_\theta \sum_{i = 1}^{N} - \log \mathcal{N}(Y_i - X_i \beta \mid 0, \sigma_i^2) \\ &= \argmin_\theta \sum_{i = 1}^{N} \left[-\log \frac{1}{\sqrt{2 \pi \sigma_i^2}} + \frac{\left(Y_i - X_i \beta\right)^2}{2 \sigma_i^2} \right] \\ &= \boxed{\argmin_\theta \sum_{i = 1}^{N} \left[\cancel{\frac{1}{2}}\log \sigma_i^2 + \cancel{\frac{1}{2}} \frac{\left(Y_i - X_i \beta\right)^2}{\sigma_i^2} \right]} \end{aligned}$$

Here we distinguish multiple cases:

  1. OLS (all $\sigma$s are the same) – we can simplify further and get the results from the lectures above
  2. weighted LS ($\sigma_i$ are known) – then $$\argmin_\beta \sum_{i = 1}^{N} \frac{(Y_i - X_i \beta)^2}{\sigma_i^2}$$
    • the $\sigma$s act as weights for the residuals
    • we can rewrite in matrix notation ($\sigma$s as a diagonal) and get $$\hat \beta = \argmin_\beta (Y - X \beta)^T \Sigma^{-1} (Y - X \beta)$$
    • this can be solved by setting the derivative to zero, and we get a weighted pseudo-inverse $$\boxed{\hat \beta = \left(X^T \Sigma^{-1} X\right)^{-1} X^T \Sigma^{-1} Y}$$
  3. iteratively reweighted LS ($\sigma_i$ are not constant and unknown) – learn the $\sigma$s jointly with $\beta$
    • $\sigma_i$ is unsupervised, $\beta$ is supervised – gives rise to interesting algorithms
    • the problem can be formulated as $$\argmin_\theta \sum_{i = 1}^{N} \left[\log \sigma_i^2 + \frac{\left(Y_i - X_i \beta\right)^2}{\sigma_i^2} \right]$$ usually called the David-Sebastian score or heteroscedastic loss
    • if $(Y_i - X_i \beta)^2$ is big $\implies$ increase $\sigma_i$ to make the loss smaller, but we pay the penalty of $\log \sigma_i^2$ for big $\sigma$s (the optimal solution selects $\sigma_i^2$ for the best trade-off)

Alternating optimization

  1. define an initial guess for $\theta_2^{(0)}$
  2. for $t = 1, \ldots, T$ (or until convergence)
    • optimize $\theta_1^{(t)}$, keeping $\theta_2^{(t - 1)}$ fixed
    • optimize $\theta_2^{(t)}$, keeping $\theta_1^{(t)}$ fixed

To solve IRLS using this method, we do the following (see the sketch after this list):

  1. define the initial guess as $\tau_i = 1$ (where $\tau_i = \sigma_i^2$)
  2. for $t = 1, \ldots, T$ (or until convergence)
    • obtain $\beta^{(t)}$ via weighted least squares (since the $\sigma$s are known and fixed)
    • since $\beta$ is fixed, we get $$\left\{\tau_i^{(t)}\right\} = \argmin_{\left\{\tau_i\right\}} \sum_{i = 1}^{N} \frac{(Y_i - X_i \beta^{(t)})^2}{\tau_i} + \log \tau_i$$ which we can solve by setting the derivative to $0$, obtaining $$\boxed{\tau_i^{(t)} = (Y_i - X_i \beta^{(t)})^2}$$ i.e. the standard deviations are equal to the magnitude of the error
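
A compact sketch of this IRLS loop (the small `eps` is a practical guard against division by zero, not part of the derivation):

```python
import numpy as np

def irls(X, Y, T=20, eps=1e-8):
    tau = np.ones(len(Y))                           # initial guess τ_i = 1
    for _ in range(T):
        # weighted least squares with fixed τ (weighted pseudo-inverse)
        W = np.diag(1.0 / (tau + eps))
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
        # closed-form update with fixed β: τ_i = (Y_i - X_i β)²
        tau = (Y - X @ beta) ** 2
    return beta, tau
```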

How many iterations are required?

Regularized regression

Ridge regression

Uses a standard regularization idea to penalize $\beta_j$ with large (squared) magnitude:

$$\hat \beta = \argmin_\beta (Y - X\beta)^T (Y - X\beta) \quad \text{s.t.} \quad \|\beta\|^2 \le t$$

$$\hat \beta_{RR} = \argmin_\beta (Y - X\beta)^T (Y - X\beta) + \tau \|\beta\|^2$$

Setting the derivative to zero, we obtain the regularized pseudo-inverse: $$\hat \beta_{RR} = (X^T X + \tau \mathbb{I})^{-1} X^T Y$$

An alternative solution via an augmented feature matrix – reduces to OLS by modifying $X$ and $Y$ to be $$\tilde X = \begin{pmatrix} X \\ \sqrt{\tau}\, \mathbb{I} \end{pmatrix} \qquad \tilde Y = \begin{pmatrix} Y \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
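
Both routes give the same $\hat\beta_{RR}$; a sketch:

```python
import numpy as np

def ridge(X, Y, tau):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + tau * np.eye(D), X.T @ Y)

def ridge_augmented(X, Y, tau):
    D = X.shape[1]
    X_tilde = np.vstack([X, np.sqrt(tau) * np.eye(D)])   # augmented features
    Y_tilde = np.concatenate([Y, np.zeros(D)])           # zero-padded targets
    return np.linalg.lstsq(X_tilde, Y_tilde, rcond=None)[0]
```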

Another alternative solution is via SVD.

Bias-variance trade-off

When using regularization, we’d like to know two things:

  1. what’s the price that we pay for regularization?
  2. how to choose τ\tau to minimize disadvantages

Both are solved via the bias-variance trade-off: we have $M$ teams working on the same problem, each with their own $\mathrm{TS}$:

Definitions:

Now we’ll measure the quality of the outcomes by $$\begin{aligned} \mathrm{MSE} &= \mathbb{E}_m \left[\left(\hat \beta_m - \beta^*\right)^2\right] \\ &= \mathbb{E}_m \left[\left(\hat \beta_m - \mathbb{E}[\hat \beta_m] + \mathbb{E}[\hat \beta_m] - \beta^*\right)^2\right] \\ &= \underbrace{\mathbb{E}_m \left[\left(\hat \beta_m - \mathbb{E}_m [\hat \beta_m]\right)^2\right]}_{\text{variance}} + \underbrace{\mathbb{E}_m \left[\left(\beta^* - \mathbb{E}_m [\hat\beta_m]\right)^2\right]}_{\text{bias}^2} \end{aligned}$$ (the cross term vanishes because $\mathbb{E}_m\left[\hat\beta_m - \mathbb{E}_m[\hat\beta_m]\right] = 0$)

This is useful because we can see that $\tau$ (regularization) has opposite effects on bias and variance – the “bias-variance trade-off”

Bias-variance tradeoffs.

Other regularization terms

Orthogonal Matching Pursuit (OMP)

  1. initialization: set of active features $A^{(0)} = \emptyset$
  2. for $\tau = 1, \ldots, t$ (max. number of features)
    • compute the residual of the current guess: $R = Y^* - X \beta^{(\tau - 1)}$
    • find the best inactive feature: $j^{(\tau)} = \argmax_{j \not\in A^{(\tau - 1)}} |X_j^T R|$
      • if the direction of the new feature is similar to the residual, it improves the solution by a lot
      • if it’s orthogonal, then it can’t do much
    • add the new feature to the active set: $A^{(\tau)} = A^{(\tau - 1)} \cup \left\{j^{(\tau)}\right\}$
    • compute the new guess by solving OLS only with the active features (see the sketch below)
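
The algorithm as a numpy sketch (sklearn’s `OrthogonalMatchingPursuit` is a production implementation):

```python
import numpy as np

def omp(X, Y, t):
    N, D = X.shape
    active, beta = [], np.zeros(D)
    for _ in range(t):
        R = Y - X @ beta                            # residual of the current guess
        scores = np.abs(X.T @ R)
        scores[active] = -np.inf                    # consider only inactive features
        j = int(np.argmax(scores))                  # best inactive feature
        active.append(j)
        # re-fit OLS using only the active features
        beta_active, *_ = np.linalg.lstsq(X[:, active], Y, rcond=None)
        beta = np.zeros(D)
        beta[active] = beta_active
    return beta, active
```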

Non-linear regression

Two general approaches:

  1. transform features such that we can run linear regression (and use OLS)
    • hand-crafted
    • learned (the first layers of a NN learn the parameters) – same as classification, except
      • squared loss (instead of cross-entropy)
      • the last layer is linear (no activation)
  2. design algorithms to solve non-linear least squares
    • if $\dim(\theta)$ is not very high ($\mathcal{O}(100)$), use Levenberg-Marquardt
    • regression trees and forests / Gaussian processes
      • good results if $\mathrm{TS}$ is small (so a NN cannot be trained)

I missed a few lectures here, the notes of the next chapter are taken from the recording of the last year’s “Fundamentals of Machine Learning class”.

Decision trees / forests

How to choose bins?

For regression/classification, we can build a tree on top of it by doing the following:

Algorithm (generic tree building):

  1. create the root node with all data
    • also put all of the data on the stack
  2. while the stack is not empty:
    • take the top node
    • check if the termination criterion is reached
      • create a leaf node which stores the desired prediction
    • otherwise
      • find the best possible split and create 2 children
      • put the data in the respective child
      • put the children on the stack
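
A sketch of this generic construction for regression (the split rule here is a toy placeholder; real trees choose splits by optimizing a problem-specific criterion, as discussed below):

```python
import numpy as np

def median_split(X, Y):
    """Toy split rule: highest-variance feature, threshold at its median."""
    feat = int(np.argmax(X.var(axis=0)))
    return feat, float(np.median(X[:, feat]))

def build_tree(X, Y, min_samples=5, best_split=median_split):
    root = {"idx": np.arange(len(X))}
    stack = [root]                                    # root node with all data
    while stack:                                      # while the stack is not empty
        node = stack.pop()                            # take the top node
        idx = node.pop("idx")
        feat, thresh = best_split(X[idx], Y[idx])
        mask = X[idx, feat] <= thresh
        if len(idx) <= min_samples or mask.all() or not mask.any():
            node["prediction"] = Y[idx].mean()        # leaf stores the prediction
            continue                                  # termination criterion reached
        node.update(feature=feat, threshold=thresh,   # best split -> 2 children
                    left={"idx": idx[mask]}, right={"idx": idx[~mask]})
        stack += [node["left"], node["right"]]        # put the children on the stack
    return root
```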

For specific problems:

For classification trees:

Since trees tend to overfit, we can train several trees (i.e. forests). They have to be different, for which there are a few tricks: