Introduction to Machine Learning


This website contains my lecture notes from a lecture by Ullrich Köthe from the academic year 2022/2023 (University of Heidelberg). If you find something incorrect/unclear, or would like to contribute, feel free to submit a pull request (or let me know via email).

The notes end at the decision tree/forest lecture (inclusive)!

Also, special thanks to Lucia Zhang, some of the notes are shamelessly stolen from her.





Traditional approach:

Machine learning:

Basic ML workflow:

  1. collect data in two datasets: training dataset (TS) to train and test dataset to validate
    • extremely important, if you don’t do this you die instantly
    • necessary for recognizing overfitting (FF is polynomials and we fit points…)
  2. select function family FF
    • prior knowledge about what the data looks like
    • trial and error (there are tools to help you with this)
  3. find the best θ^\hat{\theta} by fitting FF to the training set
  4. validate the quality of fθ^(X)f_{\hat{\theta}}(X) on the test set
  5. deploy model in practice

Probabilistic prediction:



iji \setminus j 11 (name) 22 (height) 33 (gender)
11 Alice 1.71.7m f
22 Bob 1.81.8m m
33 Max 1.91.9m m

We usually (when Programming) want a float matrix: drop names, discretize labels and use one-hot encoding (one-hot because there is only ones in the particular features).

iji \setminus j 11 (height) 22 (f) 33 (m) 44 (o)
11 1.71.7 11 00 00
22 1.81.8 00 11 00
33 1.91.9 00 11 00

Types of learning approaches

  1. supervised learningtrue responses are known in TS={(Xi,Yi)}i=1N\mathrm{TS} = \left\{(X_i, Y_i)\right\}_{i = 1}^N
  2. weakly supervised learning
    1. we have some information (know YiY_i for some instances)
    2. we have worse information (there is a tumor but we don’t know where)
  3. unsupervised learningonly features are known in TS={Xi}i=1N\mathrm{TS} = \left\{X_i\right\}_{i = 1}^N
    • learning algorithm must find the structure in the data on its own (data mining or, for humans, “research”)
    • only a few solutions that are guaranteed to work (this is unsurprisingly difficult to do)
      1. representation learning – compute new features that are better to predict
      2. clustering – group similar instances into “clusters” with a single representative
  4. self-supervised learning – define an auxiliary task where YiY_i are easy to determine


Deterministic classifier

Quality measured by collecting the “confusion matrix”

Y^Y\hat{Y} \setminus Y^* 1-1 +1+1
1-1 true negative false negative
+1+1 false positive true positive
  1. false positive/negative fraction – “how many out of all events are false” #FP or #FNN[0,1]\frac{\# \mathrm{FP}\ \text{or}\ \# \mathrm{FN}}{\mathrm{N}} \in \left[0, 1\right]
  2. false positive/negative rate – “how many out of positive/negative events are false” #FP#FP+#TNp(Y^=1Y=1)#FN#FN+#TPp(Y^=1Y=1)\frac{\# \mathrm{FP}}{\# \mathrm{FP} + \# \mathrm{TN}} \approx p(\hat{Y} = 1 \mid Y^* = -1) \qquad \frac{\# \mathrm{FN}}{\# \mathrm{FN} + \# \mathrm{TP}} \approx p(\hat{Y} = -1 \mid Y^* = 1)

Probabilistic classifier

Quality measured by “calibration”

Important: calculate confusion matrix or calibration from test set, not train set (bias)!

Threshold classifier

Bayes rule of conditional probability

Is hugely important for a number of reasons:

Some history behind ML:

How good can it be?

Definition (Bayes classifier): uses Bayes rule (LHS or RHS) with true probabilities pp^*

Theorem: no learned classifier using p^\hat{p} can be better than the Bayes classifier.

How bad can it be?

Linear classification

  1. use a linear formula to reduce all features to scaled score
  2. apply threshold classifier to score (C=2C = 2 for now)

Y^i=sign(jβjXij+b)\hat{Y}_i = \mathrm{sign}\left(\sum_{j} \beta_j X_{ij} + b\right) for feature weights β\beta and intercept bb.


Gradient descent looks as follows: β(t)=β(t1)τL(t1)βloss derivative\beta^{(t)} = \beta^{(t - 1)} - \tau \underbrace{\frac{\partial \mathcal{L}^{(t - 1)}}{\partial \beta}}_{\text{loss derivative}} for learning rate τ1\tau \ll 1 In our case, the derivative (for a single instance) is L(t1)β=ReLU(YiXiβ)β={0YiXiβ<0  (correct)YiXiTotherwise\frac{\partial \mathcal{L}^{(t - 1)}}{\partial \beta} = \frac{\partial \mathrm{ReLU} (-Y_i X_i \beta)}{\partial \beta} = \begin{cases} 0 & -Y_i^* X_i \beta < 0\ \ \text{(correct)} \\ -Y_i^* X_i^T & \text{otherwise} \end{cases}

Rosenblatt’s algorithm

Algorithm (Rosenblatt’s algorithm):

  1. initialzation of β(0)\beta^{(0)} – random/uniform numbers
  2. for t=1,,Tt = 1, \ldots, T (or until convergence)
    • update β\beta using the gradient descent formula above, with minor changes:
      • additionally, use τ/N\tau / N instead of τ\tau (so it doesn’t change when learning set changes)
      • sum over only incorrect guesses from the previous iteration β(t)=β(t1)+τNi:Y^i(t1)YiYiXiT\boxed{\beta^{(t)} = \beta^{(t - 1)} + \frac{\tau}{N} \sum_{i: \hat{Y}_i^{(t - 1)} \neq Y_i^*} Y_i^* X_i^T}

Linear classifier.

Support Vector Machine (SVM)

Improved algorithm (popular around 1995): (Linear) Support Vector Machine (SVM)

Reminder: distance of a point XiX_i and plane β,b\beta, b is d(Xi,(β,b))=Xiβ+bβd(X_i, (\beta, b)) = \left| \frac{ X_i \beta + b}{|| \beta ||} \right|

Radius of the safety region (margin) is the smallest distance distance of a point to the decision plane (also called the “Hausdorff distance between H and TS”): mH=mini=1Nd(Xi,(βH,bH))m_H = \min_{i = 1}^N d(X_i, (\beta_H, b_H))

Let i^\hat{i} be the closest point. Then Yi^(Xi^βH+bH)=mHβHY_{\hat{i}}^* (X_{\hat{i}} \beta_H + b_H) = m_H || \beta_H ||

Now we chose a representative such that the equation above is 11. The decision plane is then correct when i:Yi(XiβH+bH)1\forall i: Y_i^* (X_i \beta_H + b_H) \ge 1 (specifically 11 for i^\hat{i}). To make it as general as possible, we want one that maximizes the distance H=arg maxHmH=arg maxH(miniYi(XiβH+bH)βH)=arg maxH(1βHminiYi(XiβH+bH)))=argmaxH1βH \begin{aligned} H &= \argmax_H m_H \\ &= \argmax_H \left(\min_i \frac{Y_i^* (X_i \beta_H + b_H)}{ ||\beta_H ||}\right) \\ &= \argmax_H \left(\frac{1}{|| \beta_H ||} \min_i Y_i^* (X_i \beta_H + b_H))\right) \\ &= \arg \max_H \frac{1}{|| \beta_H || } \end{aligned}

This yields an optimization problem: we want to arg maxH1βHs.t.i:Yi(XiβH+bH)1\argmax_H \frac{1}{|| \beta_H || } \quad \text{s.t.} \quad \forall i: Y_i^* (X_i \beta_H + b_H) \ge 1

This is inconvenient – change to a more convenient equivalent optimization problem arg minH12βHTβHs.t.i:Yi(XiβH+bH)1\argmin_H \frac{1}{2} \beta_H^T \beta_H \quad \text{s.t.} \quad \forall i: Y_i^* (X_i \beta_H + b_H) \ge 1

Now we define slack variables ξi0\xi_i \ge 0 that measure how much it was violated arg minH12βHTβHs.t.i:Yi(XiβH+bH)1ξi\argmin_H \frac{1}{2} \beta_H^T \beta_H \quad \text{s.t.} \quad \forall i: Y_i^* (X_i \beta_H + b_H) \ge 1 - \xi_i and we also want to minimize those, which we’ll do by using Lagrange multipliers: arg minH,ξ12βHTβHmaximize margins+λNi=1NReLU(1Yi(XiβH+bH))minimize penalties for misclassified points\boxed{\argmin_{H, \xi} \underbrace{\frac{1}{2} \beta_H^T \beta_H}_{\text{maximize margins}} + \underbrace{\frac{\lambda}{N} \sum_{i = 1}^{N} \mathrm{ReLU} (1 - Y_i^* (X_i \beta_H + b_H))}_{\text{minimize penalties for misclassified points}}}

We again optimize by derivative + gradient descent:

Lβ=β+λNi=YiXiβ+b<1NYiXiT\frac{\partial \mathcal{L}}{\partial \beta} = \beta + \frac{\lambda }{N} \sum_{i = Y_i^* X_i \beta + b < 1}^{N} -Y_i^* X_i^T Lb=λNi=YiXiβ+b<1NYi\frac{\partial \mathcal{L}}{\partial b} = \frac{\lambda }{N} \sum_{i = Y_i^* X_i \beta + b < 1}^{N} -Y_i^*

The iteration step for β\beta looks as follows: β(t)=β(t1)τ(β(t1)+λNi=YiXiβ+b<1NYiXiT)b(t)=b(t1)τ(λNi=YiXiβ+b<1NYi) \boxed{ \begin{aligned} \beta^{(t)} &= \beta^{(t - 1)} - \tau \left(\beta^{(t - 1)} + \frac{\lambda }{N} \sum_{i = Y_i^* X_i \beta + b < 1}^{N} -Y_i^* X_i^T\right) \\ b^{(t)} &= b^{(t - 1)} - \tau \left(\frac{\lambda}{N} \sum_{i = Y_i^* X_i \beta + b < 1}^{N} -Y_i^*\right) \end{aligned} }

Linear Discriminant Analysis (LDA)

To get any ellipse, we start with a unit circle zz and stretch (λ\lambda), rotate (QQ) and shift (μ\mu) it: Qλz+μ=(cosφsinφsinφcosφ)(x100x2)z+μQ \lambda z + \mu = \begin{pmatrix} \cos \varphi & -\sin \varphi \\ \sin \varphi & \cos \varphi \end{pmatrix} \begin{pmatrix} x_1 & 0 \\ 0 & x_2 \end{pmatrix} z + \mu

Derivation of an ellipse from a unit circle.

Solving for zz (and using the fact that QQ is orthogonal), we get z=λ1Q1(Xμ)=λ1QT(Xμ)z = \lambda^{-1} Q^{-1} (X - \mu) = \lambda^{-1} Q^T (X - \mu)

Gaussian distribution can be defined asN(xμ,σ)=12πσ2exp(12((xμ)σ1)2)\mathcal{N}(x \mid \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2} \left((x - \mu)\sigma^{-1}\right)^2\right)

For higher dimensions (with our circle zz), we get the generalized Gaussian N(zμ,Σ)=1det(2πΣ)exp(12(xμ)Q1λ1λ1QTΣ1(xμ)T)\mathcal{N}(z \mid \mu, \Sigma) = \frac{1}{\sqrt{\det\left(2 \pi \Sigma\right)}} \exp\left(-\frac{1}{2} (x - \mu) \underbrace{Q^{-1}\lambda^{-1}\lambda^{-1}Q^{-T}}_{\Sigma^{-1}} (x - \mu)^T\right)

Let {Xi}i=1N1\left\{X-i\right\}_{i = 1}^{N_1} be features of class 11. Then N(Xμ1,Σ1)=1det(2πΣ1)exp(12(x1μ1)Σ11(x1μ1)T)N(X \mid \mu_1, \Sigma_1) = \frac{1}{\sqrt{\det(2 \pi \Sigma_1)}} \exp\left(-\frac{1}{2} (x_1 - \mu_1) \Sigma_1^{-1} (x_1 - \mu_1)^T\right)

To derive the learning method, we’ll use two things:

  1. maximum likelihood principle: choose μ1\mu_1 and Σ1\Sigma_1 such that TS will be a typical outcome of the resulting model (i.e. best model maximizes the likelihood of TS)
  2. i.i.d. assumption: training instances drawn independently from the same distribution
    • independently – joint distribution is the product
    • identically distributed – all instances come from the same distribution

For the probability, we get p(TS)=p(X1,,Xn)=i=1Npi(Xi)independently=i=1Np(Xi)identically distributed=i=1NN(Xiμ,Σ) \begin{aligned} p(\mathrm{TS}) &= p(X_1, \ldots, X_n) \\ &= \prod_{i = 1}^{N} p_i(X_i) \qquad \text{independently} \\ &= \prod_{i = 1}^{N} p(X_i) \qquad \text{identically distributed} \\ &= \prod_{i = 1}^{N} \mathcal{N(X_i \mid \mu, \Sigma)} \end{aligned}

Using the maximum likelihood principle, the problem becomes μ^,Σ^=arg maxμ,Σp(TS)\hat{\mu}, \hat{\Sigma} = \argmax_{\mu, \Sigma} p(\mathrm{TS})

It’s mathematically simpler to minimize negative logarithm of p(TS)p(\mathrm{TS}) (applying monotonic function to an optimization problem doesn’t change arg min\argmin and max=min\max = -\min). We get

μ^,Σ^=arg minμ,Σlog(p(TS))=i=1N[log(1det(2πΣ))12(Xiμ)Σ1(Xiμ)T]=i=1N[log(det(2πΣ))+(Xiμ)Σ1(Xiμ)T]=L(TS) \begin{aligned} \hat{\mu}, \hat{\Sigma} &= \argmin_{\mu, \Sigma} -\log(p(\mathrm{TS})) \\ &= - \sum_{i = 1}^{N} \left[\log \left(\frac{1}{\sqrt{\det(2 \pi \Sigma)}}\right) - \frac{1}{2} (X_i - \mu) \Sigma^{-1} (X_i - \mu)^T\right] \\ &= \sum_{i = 1}^{N} \left[\log \left(\det(2 \pi \Sigma)\right) + (X_i - \mu) \Sigma^{-1} (X_i - \mu)^T\right] = \mathcal{L}(\mathrm{TS}) \end{aligned}

This is our loss. We now do derivative and set it to 00, since that will find the optimum. First, we’ll derivate by μ\mu, which gets rid of the first part of loss completely and we get

Lμ=i=1NΣ1(Xiμ)TvAvTv=2AvT=0i=1N(Xiμ)T=0i=1NXi=Nμ1Ni=1NXi=μ \begin{aligned} \frac{\partial \mathcal{L}}{ \partial \mu} = \sum_{i = 1}^{N} \overbrace{\Sigma^{-1} (X_i -\mu)^T}^{\frac{\partial vAv^T}{\partial v} = 2Av^T} &= 0 \\ \sum_{i = 1}^{N} (X_i -\mu)^T &= 0 \\ \sum_{i = 1}^{N} X_i &= N \mu \\ \frac{1}{N} \sum_{i = 1}^{N} X_i &= \mu \\ \end{aligned}

In other words, the mean μ\mu is the average (shocking, I know).

For Σ\Sigma, this will be a little more complicated. We’ll do the partial derivation by Σ1=K\Sigma^{-1} = K instead, which gives the following: LK=i=1N[(KT)1logdetKK=(KT)1+(Xiμ)T(Xiμ)vAvTA=vTv]=0i=1N[K1+(Xiμ)T(Xiμ)]=0K symmetricNK1+i=1N(Xiμ)T(Xiμ)=01Ni=1N(Xiμ)T(Xiμ)=Σ \begin{aligned} \frac{\partial \mathcal{L}}{ \partial K} = \sum_{i = 1}^{N} [ \overbrace{-(K^T)^{-1}}^{\frac{\partial \log \det K}{\partial K} = (K^T)^{-1}} + \overbrace{(X_i - \mu)^T (X_i - \mu)}^{\frac{\partial vAv^T}{\partial A} = v^Tv}] &= 0 \\ \sum_{i = 1}^{N} \left[ -K^{-1} + (X_i - \mu)^T (X_i - \mu)\right] &= 0 \qquad K\ \text{symmetric} \\ -N K^{-1} + \sum_{i = 1}^{N} (X_i - \mu)^T (X_i - \mu) &= 0 \\ \frac{1}{N} \sum_{i = 1}^{N} (X_i - \mu)^T (X_i - \mu) &= \Sigma \\ \end{aligned}

Again, in other words, the variance is the average over the quared vectors offset by the mean, which too makes sense.

Now we have 22 clases but with same covariance (by assumption of LDA) and we can:

  1. determine two means (μ1,μ1\mu_1, \mu_{-1}) as μ1=1N1i:Yi=1Xiμ1=1N1i:Yi=1\boxed{\mu_1 = \frac{1}{N_1} \sum_{i: Y_i^* = 1} X_i \qquad \mu_{-1} = \frac{1}{N_{-1}} \sum_{i: Y_i^* = -1}}
  2. to calculate covariance (which is the same for both classes): Σ=1N(i:Yi=1(Xiμ1)T(Xiμ1)+i:Yi=1(Xiμ1)T(Xiμ1))\boxed{\Sigma = \frac{1}{N} \left(\sum_{i: Y_i^* = 1} (X_i - \mu_1)^T (X_i - \mu_1) + \sum_{i: Y_i^* = -1} (X_i - \mu_{-1})^T (X_i - \mu_{-1})\right)}
  3. use Bayes RHS and our calculations to calculate the LHS: Y^i=sign(Xiβ+b)withβ=2Σ1(μ1μ1)Tb=μ1Σ1μ1Tμ1Σ1μ1T\boxed{\begin{aligned} \hat{Y}_i = \mathrm{sign}(X_i \beta + b) \quad \text{with} \quad &\beta = 2 \Sigma^{-1} (\mu_1 - \mu_{-1})^T \\ & b = \mu_{-1} \Sigma^{-1} \mu_{-1}^T - \mu_1 \Sigma^{-1} \mu_1^T \end{aligned}}

The derivation of (3) looks as follows: assuming p(Y=1)=p(Y=1)=12p(Y = 1) = p(Y = -1) = \frac{1}{2} for simpler computations and p(Y=1X)p(Y = -1 \mid X) being analogous:

p(Y=1X)=p(Y=1X)p(Y=1)p(Y=1X)p(Y=1)+p(Y=1X)p(Y=1)=p(Y=1X)p(Y=1X)+p(Y=1X)=11+p(XY=1)p(XY=1) \begin{aligned} p(Y=1|X)&=\frac{p(Y=1|X){p(Y=1)}}{p(Y=1|X){p(Y=1)}+p(Y=-1|X){p(Y=-1)}}\\ &=\frac{p(Y=1|X)}{p(Y=1|X)+p(Y=-1|X)}\\ &=\frac{1}{1+\frac{p(X|Y=-1)}{p(X|Y=1)}} \end{aligned}

Now we substitute the Gaussian function:

p(XiY=1)p(XiY=1)=1det(2πΣ)exp(12(Xiμ1)Σ1(Xiμ1)T)1det(2πΣ)exp(12(Xiμ1)Σ1(Xiμ1)T)=exp(12(Xiμ1)Σ1(Xiμ1)T+12(Xiμ1)Σ1(Xiμ1)T)=exp(12[XΣ1XT2XΣ1μ1T+μ1Σ1μ1TXΣ1XT+2XΣ1μ1Tμ1Σ1μ1T])=exp(XΣ1(μ1Tμ1T)β12(μ1Σ1μ1Tμ1Σ1μ1T)b)=exp((Xβ+b)) \begin{aligned} \frac{p(X_i|{Y=-1})}{p(X_i|{Y=1})}&=\frac{\cancel{\frac{1}{\sqrt{\det(2\pi\Sigma)}}}\exp(-\frac{1}{2}(X_i-\mu_{-1})\Sigma^{-1}(X_i-\mu_{-1})^T)}{\cancel{\frac{1}{\sqrt{\det(2\pi\Sigma)}}}\exp(-\frac{1}{2}(X_i-\mu_{1})\Sigma^{-1}(X_i-\mu_{1})^T)}\\ &=\exp(-\frac{1}{2}(X_i-\mu_{-1})\Sigma^{-1}(X_i-\mu_{-1})^T+\frac{1}{2}(X_i-\mu_{1})\Sigma^{-1}(X_i-\mu_{1})^T) \\ &=\exp(-\frac{1}{2}\left[\cancel{X\Sigma^{-1}X^T}-2X\Sigma^{-1}\mu_{-1}^T+\mu_{-1}\Sigma^{-1}\mu_{-1}^T-\cancel{X\Sigma^{-1}X^T}+2X\Sigma^{-1}\mu_1^T-\mu_1\Sigma^{-1}\mu_{1}^T\right])\\ &=\exp(-X\underbrace{\Sigma^{-1}(\mu_1^T-\mu_{-1}^T)}_\beta-\frac{1}{2}\underbrace{(\mu_{-1}\Sigma^{-1}\mu_{-1}^T-\mu_1\Sigma^{-1}\mu_1^T)}_b)=\exp(-(X\beta+b)) \end{aligned}

Plugging the result into the previous equation, we obtain the sigmoid function σ\sigma:

p(Y=1X)=11+exp((Xβ+b))σ(t)=11+exp(t)p(Y = 1 \mid X) = \frac{1}{1 + \mathrm{exp}(-(X \beta + b))} \qquad \sigma(t) = \frac{1}{1 + \exp(-t)}

Two of its properties will be helpful for us: σ(t)=1σ(t)σ(t)t=σ(t)σ(t)\sigma(-t) = 1 - \sigma(t) \qquad \frac{\partial \sigma(t)}{\partial t} = \sigma(t) \sigma(-t)

Doing an analogous derivation for p(Y=1X)p(Y = -1 \mid X), we get y^=argmaxkp(Y=kX){1if σ(Xβ+b)>121if σ(Xβ+b)<12{1if Xβ+b>01if Xβ+b<0sign(Xβ+b) \begin{aligned} \hat y=\arg\max_k p(Y=k|X) \Longleftrightarrow& \begin{cases}1 &\text{if } \sigma(X\beta+b)>\frac{1}{2}\\ -1 &\text{if } \sigma(X\beta+b)<\frac{1}{2}\end{cases}\\ \Longleftrightarrow& \begin{cases}1 &\text{if } X\beta+b>0\\ -1 &\text{if } X\beta+b<0\end{cases}\\ \Longleftrightarrow& \text{sign}(X\beta+b) \end{aligned}

Alternatives to training beta and b:

  1. fit mean and covariance of clusters
  2. least-squares regression: L(Yi,Y^i)=(Yi(Xβ+b))2\mathcal{L}(Y_i^*, \hat{Y}_i) = (Y_i^* - (X \beta + b))^2 (same solution as 11)
  3. Fisher’s idea: define 1D scores Zi=XiβZ_i = X_i \beta and chose β\beta such that a threshold on ZiZ_i has minimum error
    • define projection of the means μ1^=μ1β,μ1^=μ1β\hat{\mu_{1}} = \mu_1 \beta, \hat{\mu_{-1}} = \mu_{-1}\beta
    • intuition: μ^1\hat{\mu}_1 and μ1^\hat{\mu_{-1}} should be as far away as possible β=arg maxβ(μ^1μ^1)2\beta = \argmax_\beta (\hat{\mu}_1 - \hat{\mu}_{-1})^2
    • doesn’t quite work, because τβ    τ2(μ^1μ^1)\tau \beta \implies \tau^2 (\hat{\mu}_{1} \hat{\mu}_{-1})
    • solution: scale by the variance σ^\hat{\sigma}: σ^1=Var(Z1Yi=1),σ^1Var(ZiYi=1)\hat{\sigma}_1 = \mathrm{Var}\left(Z_1 \mid Y_i^* = 1\right), \hat{\sigma}_{-1} \mathrm{Var}\left(Z_i \mid Y_i^* = -1\right), then we get β^=arg maxβ(μ^1μ1^)2σ^12+σ^12\hat{\beta} = \argmax_{\beta} \frac{\left(\hat{\mu}_1 - \hat{\mu_{-1}}\right)^2}{\hat{\sigma}_1^2 + \hat{\sigma}_{-1}^2}
    • again gives the same solution as 11 and 22
  4. Logistic regression (LR): same posterior as LDA, but learn LHS of Bayes rule
    • gives different solution to LDA
Logistic regression (LR)

We again have i.i.d. assumptions – all labels are drawn independently form the same posterior: p((Yi)i=1N(Xi)i=1N)=i=1Np(YiXi)p\left(\left(Y_i^*\right)_{i = 1}^N \mid \left(X_i\right)_{i = 1}^N\right) = \prod_{i = 1}^{N} p(Y_i^* \mid X_i)

For LR, Yi{0,1}Y^*_i \in \left\{0, 1\right\}, which allows us to rewrite (with the σ\sigma results from LDA) like so:  β^,b^=argminβ,bi=1N[Yilogσ(Xβ+b)+(1Yi)log(σ((Xβ+b))]  \boxed{\ \hat{\beta}, \hat{b}=\arg\min_{\beta, b}-\sum_{i=1}^N\Big[Y_i^* \log \sigma\left(X\beta+b\right)+\left(1-Y_i^*\right) \log \left(\sigma\left(-\left(X\beta+b\right)\right)\Big]\right.\ }

Here we have log(σ(t))=log(1+exp(t))-\log(\sigma(-t)) = \log(1 + \exp(t)), which is called the softplus function:

ReLu + Softplus graphs.

Simplifying for b=0b = 0 and using the following properties:

σ(Xβ)β=σ(Xβ)X=σ(Xβ)σ(Xβ)Xlogσ(Xβ)β=1σ(Xβ)σ(Xβ)σ(Xβ)X=σ(Xβ)Xlogσ(Xβ)β=1σ(Xβ)σ(Xβ)σ(Xβ)(X)=σ(Xβ)X \begin{aligned} \frac{\partial \sigma\left(X\beta\right)}{\partial \beta}&=\sigma^{\prime}\left(X\beta\right) X=\sigma(X \beta) \sigma\left(-X\beta\right) \cdot X \\ \frac{\partial \log \sigma\left(X\beta\right)}{\partial \beta}&=\frac{1}{\sigma\left(X\beta\right)} \sigma\left(X\beta\right) \sigma\left(-X\beta\right) \cdot X=\sigma\left(X\beta\right) \cdot X \\ \frac{\partial \log \sigma\left(-X\beta\right)}{\partial \beta}&=\frac{1}{\sigma\left(-X\beta\right)} \sigma\left(X\beta\right) \sigma\left(-X\beta\right) \cdot(-X)=-\sigma\left(X\beta\right) \cdot X \end{aligned}

We get the derivatives β\beta: Loss(TS)β=i=1N[Yiσ(Xiβ)1σ(Xiβ)Xi+(1Yi)σ(Xiβ)(Xi)]=i=1N[YiXiYiσ(Xiβ)Xiσ(Xiβ)Xi+Yiσ(Xiβ)Xi] \begin{aligned} \frac{\partial \mathcal Loss(T S)}{\partial \beta}&=-\sum_{i=1}^N\left[Y_i^* \underbrace{\sigma\left(-X_i \beta\right)}_{1-\sigma\left(X_i \beta\right)} \cdot X_i+\left(1-Y_i^*\right) \sigma\left(X_{i \beta}\right) \cdot\left(-X_i\right)\right] \\ & =-\sum_{i=1}^N\left[Y_i^* X_i-\cancel{Y_i^* \sigma\left(X_i \beta\right) X_i}-\sigma\left(X_i \beta\right) X_i+\cancel{Y_i^* \sigma\left(X_i \beta\right) X_i}\right] \end{aligned}


 Loss(TS)β=i=1N(σ(Xiβ)Yierror )Xi=!0  \boxed{\ \frac{\partial \mathcal Loss(T S)}{\partial \beta}=\sum_{i=1}^N(\underbrace{\sigma\left(X_i \beta\right)-Y_i^*}_{\text {error }}) \cdot X_i \stackrel{!}{=} 0\ }

The four cases that can occur are as follows:

Real value \ Classifier result σ(Xiβ)1\sigma(X_i\beta)\approx 1 σ(Xiβ)0\sigma(X_i\beta)\approx 0
Yi=1Y^*_i=1 no correction error 1\approx-1: pulls σ(Xiβ)1\sigma(X_i\beta)\rightarrow1
Yi=0Y^*_i=0 error 1\approx 1: pulls σ(Xiβ)0\sigma(X_i\beta)\rightarrow0 no correction

Perceptron: ReLU(Yi(Xiβ+b))λ=0\mathrm{ReLU}(-Y_i^* (X_i \beta + b)) \qquad \lambda = 0

SVM: ReLU(1Yi(Xiβ+b))λ>0\mathrm{ReLU}(1 - Y_i^* (X_i \beta + b)) \qquad \lambda > 0

LDA: (Yi(Xiβ+b))2λ=0 (fit μ,Σ), λ>0 (fit objective)(Y_i^* - (X_i \beta + b))^2 \qquad \lambda = 0\ (\text{fit } \mu, \Sigma),\ \lambda > 0\ (\text{fit objective})

LR: softplus(Yi(Xiβ+b))λ=0 (or >)\mathrm{softplus}(-Y_i^* (X_i \beta + b)) \qquad \lambda = 0\ (\text{or >})

Linear classifier.

Multi-class classification
  1. one-against-rest classification
    • for k=1,,Ck = 1, \ldots, C, define βk,bk    \beta_k, b_k \implies score sk=Xiβk+bks_k = X_i \beta_k + b_k
    • train by treating Yi=kY_i^* = k as “class + 1” and YikY_i^* \neq k as “class - 1” (rest)
    • make them comparable by normalization: β^k=βkβk\hat{\beta}_k = \frac{\beta_k}{\|\beta_k\|} (same for bb)
    • classify according to biggest score, or “don’t know” if all scores are bad: Y^i={"unknown"sk<ε, karg maxkskotherwise\hat{Y}_i = \begin{cases} \text{"unknown"} & s_k < \varepsilon,\ \forall k \\ \argmax_k s_k & \text{otherwise}\end{cases}
  2. all-pairs: train a linear model for all k,kk, k' (n2\approx n^2)
    • sk,k=Xiβk,k+bk,k, kks_{k, k'} = X_i \beta_{k, k'} + b_{k, k'},\ \forall k \neq k'
    • if sk,k>0s_{k, k'} > 0 then one vote for class kk, if sk,k<0s_{k, k'} < 0 then one vote for kk'
    • Y^i\hat{Y}_i is the label with most votes (or “unknown” for ties)
  3. define posterior as “softmax-function” scores as in (1): p(Y^i=kXi)=exp(sk)k=1Cexp(sk)=softmax(si,sk)\boxed{p(\hat Y_i = k \mid X_i) = \frac{\exp(s_k)}{\sum_{k' = 1}^{C} \exp(s_k')} = \mathrm{softmax}(s_i, s_k)}
    • standard for neural network classification
    • easily provable: softmax(s1,s2)=σ(s1s2)\mathrm{softmax}(s_1, s_2) = \sigma(s_1 - s_2)
    • new: train all βk,bk\beta_k, b_k jointly (at the same time)

Non-linear classification

  1. measure more features and increase dimension
    • higher dimensional spaces tend to be more linearly separable
    • for N=D+1N = D + 1, it’s always separable (but maybe not useful)
    • for D>ND > N, use sparse learning methods
  2. use non-linear classifier
    • QDA, which is a non-linear version of LDA
    • Kernel-SVM (non-linear SVM, less popular but cool math)
    • Decision trees/forests (recursively subdivide XX, we’ll discuss them later)
  3. non-linearly transform features X~i=φ(Xi)\tilde{X}_i = \varphi(X_i)
    • XOR function: x1,x2{1,1}x_1, x_2 \in \left\{-1, 1\right\}, yi={+1x1x21otherwisey_i^* = \begin{cases} +1 & x_1 \neq x_2 \\ -1 & \text{otherwise} \end{cases}
    • four points classification: x~=x1x2,y^={+1x~<01otherwise\tilde{x} = x_1 \cdot x_2, \hat{y} = \begin{cases} +1 & \tilde x < 0 \\ -1 & \text{otherwise} \end{cases}
    • BMI (non-linear formula, linear classification)
    • problem: hand-crafting φ\varphi is difficult
    • solution: learn φ\varphi (multi-layer neural networks)

Neural networks

A neural network is a collection of neurons in parallel layers

Network example.

Previously, NN were believed to not be a good idea, but

Activation functions:

Important theorems:

  1. neural networks with L2L \ge 2 (=1= 1 hidden layer) are universal approximators (can approximate any function to any accuracy), if there are enough neurons in the hidden layer
    • purely existential proof, but nice to know
    • up till 2005, this was used, since they were enough
      • bad idea – deeper networks are easier to train (many good local optima)
  2. finding the optimal network is NP-hard     \implies we need approximation algorithms

Training by backpropagation (chain rule)