This website contains my lecture notes from a lecture by Ullrich Köthe from the academic year 2022/2023 (University of Heidelberg). If you find something incorrect/unclear, or would like to contribute, feel free to submit a pull request (or let me know via email).
The notes end at the decision tree/forest lecture (inclusive)!
Also, special thanks to Lucia Zhang, some of the notes are shamelessly stolen from her.
Resources
TensorFlow playground – a really cool visualization of designing and training a neural network for classification/regression, with a number of standard parameters to play with (activation functions, learning rates, etc.)
Introduction
Setting:
quantities Y that we would like to know but can’t easily measure
quantities X that we can measure and are related to Y
Idea:
learn a function $\hat{Y} = f(X)$ such that $\hat{Y}_i \approx Y_i^*$ (we're close to the real value)
Traditional approach:
ask an expert to figure out f(X), but for many approaches ML works better
Machine learning:
choose a universal function family F with parameters θ
ex. $\mathcal{F} = \{ax^2 + bx + c\}$, $\theta = \{a, b, c\}$
find parameters $\hat\theta$ that fit the data, so $\hat{Y} = f_{\hat\theta}(X)$ and $\hat{Y}_i \approx Y_i^*$
Basic ML workflow:
collect data in two datasets: training dataset (TS) to train and test dataset to validate
extremely important, if you don’t do this you die instantly
necessary for recognizing overfitting (e.g. a high-degree polynomial from $\mathcal{F}$ can fit the training points exactly yet generalize terribly)
select a function family $\mathcal{F}$
prior knowledge about what the data looks like
trial and error (there are tools to help you with this)
find the best θ^ by fitting F to the training set
validate the quality of $f_{\hat\theta}(X)$ on the test set
deploy model in practice
Probabilistic prediction:
reality is often more complicated
no unique true response $Y^*$; instead, a set of possible values (possibly infinite)
in that case we learn a conditional probability (instead of a function): $\hat{Y} \sim p(Y \mid X)$
to learn this, we define a "universal probability family" with parameters $\theta$ and choose $\hat\theta$ such that $Y \sim p_{\hat\theta}(Y \mid X)$ matches the data
strict generalization: the deterministic case is recovered by defining $p(Y \mid X) = \delta(Y - f(X))$
the $\delta$ distribution puts all probability mass on a single value
often we derive deterministic results from probabilistic predictions (e.g. the mode/median or a random sample of the distribution)
generally, mode is the best, has some really nice properties
Problems:
nature is not fully predictable
quantum physics (randomness)
deterministic chaos (pendulum)
combinatorial explosion (amino acids)
data:
finite
noisy
incomplete
ambiguous
model:
function family is imperfect (too small)
may not have converged
nature has changed in the meantime
goal:
disagreement about what is relevant (desirable)
handling uncertainty is a central challenge of AI and ML
Notation
feature vector $X_i$ for each instance $i \in 1, \dots, N$
feature matrix $X$ (rows are instances, columns are features $j \in 1, \dots, D$)
$X \in \mathbb{R}^{N \times D}$
| $i \setminus j$ | 1 (name) | 2 (height) | 3 (gender) |
| --- | --- | --- | --- |
| 1 | Alice | 1.7 m | f |
| 2 | Bob | 1.8 m | m |
| 3 | Max | 1.9 m | m |
We usually (when programming) want a float matrix: drop the names, discretize the labels and use one-hot encoding (one-hot because exactly one entry per instance is 1); a code sketch follows the table below.
| $i \setminus j$ | 1 (height) | 2 (f) | 3 (m) | 4 (o) |
| --- | --- | --- | --- | --- |
| 1 | 1.7 | 1 | 0 | 0 |
| 2 | 1.8 | 0 | 1 | 0 |
| 3 | 1.9 | 0 | 1 | 0 |
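As a minimal sketch of this conversion in plain numpy (the data and the category order are made up for illustration):

```python
import numpy as np

# Hypothetical raw data: heights plus a categorical gender label.
heights = np.array([1.7, 1.8, 1.9])
genders = ["f", "m", "m"]
categories = ["f", "m", "o"]  # fixed category order defines the columns

# One-hot encode: exactly one 1 per row, at the index of the category.
one_hot = np.zeros((len(genders), len(categories)))
for i, g in enumerate(genders):
    one_hot[i, categories.index(g)] = 1.0

# Final float feature matrix X (N x D): names dropped, height + one-hot columns.
X = np.column_stack([heights, one_hot])
print(X)  # rows match the table above
```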
response $Y_i$: a row vector for instance $i$, with response elements $m \in 1, \dots, M$
most of the time, M=1 (scalar response)
exception is one-hot encoding of discrete labels
tasks according to type of response
$Y_i \in \mathbb{R}^M$ … $Y \approx f_\theta(X)$ is regression
$Y_i = k$, $k \in \{1, \dots, C\}$ for $C$ categories … classification
labels are unordered ("apples", "oranges") – categorical classification
training set $TS = \{(X_i, Y_i)\}_{i=1}^N$ – supervised learning
Yi is the known response for instance i
Types of learning approaches
supervised learning – true responses are known in $TS = \{(X_i, Y_i)\}_{i=1}^N$
weakly supervised learning
we have some information (know Yi for some instances)
we have worse information (there is a tumor but we don’t know where)
unsupervised learning – only features are known in $TS = \{X_i\}_{i=1}^N$
learning algorithm must find the structure in the data on its own (data mining or, for humans, “research”)
only a few solutions that are guaranteed to work (this is unsurprisingly difficult to do)
representation learning – compute new features that are better suited for prediction
clustering – group similar instances into “clusters” with a single representative
self-supervised learning – define an auxiliary task where $Y_i$ are easy to determine
Classification
$Y_i = k$, $k \in \{1, \dots, C\}$
common special case $C = 2$, then $k \in \{0, 1\}$ or $k \in \{-1, 1\}$
always supervised!
Deterministic classifier
one hard prediction for each $i$: $\hat{Y}_i = f(X_i)$, with the goal $\hat{Y}_i = Y_i^*$
Quality measured by collecting the “confusion matrix”
| $\hat{Y} \setminus Y^*$ | $-1$ | $+1$ |
| --- | --- | --- |
| $-1$ | true negative | false negative |
| $+1$ | false positive | true positive |
false positive/negative fraction – "how many out of all events are false": $\frac{\#FP}{N}$ or $\frac{\#FN}{N} \in [0, 1]$
false positive/negative rate – "how many of the actually negative/positive events are misclassified": $\frac{\#FP}{\#FP + \#TN} \approx p(\hat{Y} = 1 \mid Y^* = -1)$ and $\frac{\#FN}{\#FN + \#TP} \approx p(\hat{Y} = -1 \mid Y^* = 1)$
Probabilistic classifier
returns a probability vector $\in [0, 1]^C$ (soft response) for every label $\equiv$ the posterior distribution $p(Y = k \mid X)$
more general than hard classification – we can recover a decision function by returning the most probable label: $f(X) = \arg\max_k p(Y = k \mid X)$
Quality measured by “calibration”
if $p(Y = k \mid X) = v$, then label $k$ should be correct a fraction $v$ of the time
if actual accuracy is higher, then the classifier is “under confident”
otherwise it’s “over confident” (often happens)
Important: calculate the confusion matrix or calibration from the test set, not the training set (bias)! A minimal calibration check is sketched below.
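One simple way to check calibration is binning the predicted probabilities and comparing each bin's mean confidence to its empirical accuracy (the bin count of 10 is an arbitrary choice here):

```python
import numpy as np

def calibration_check(probs, labels, bins=10):
    """probs: predicted p(Y=1|X) on the test set; labels: true labels in {0, 1}."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            # well calibrated: mean predicted probability ~ empirical frequency;
            # higher actual accuracy = under-confident, lower = over-confident
            print(f"[{lo:.1f}, {hi:.1f}): predicted {probs[mask].mean():.2f}, "
                  f"actual {labels[mask].mean():.2f}")
```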
Threshold classifier
single feature X∈R
usually not enough to predict the response reasonably
$\hat{Y} = \operatorname{sign}(X - T) = \begin{cases} 1 & X > T \\ -1 & X < T \end{cases}$ for some threshold $T$
three classical solutions for multiple features
design a formula to combine features into a single score (e.g. BMI for obese/not obese)
hard and expensive for most problems, not to mention inaccurate
linear classification: compute score as a linear combination and learn coefficients
$S_i = \sum_{j=1}^D \beta_j X_{ij} \implies \hat{Y}_i = \operatorname{sign}(S_i - T)$
less powerful than a hand-designed formula (restricted to linear combinations)
but the coefficients are learned, so it usually performs better in practice
nearest neighbour: classify a test instance $X_{test}$ according to the most similar training instance (a minimal sketch follows this list)
$\hat{i} = \arg\min_i \operatorname{dist}(X_i, X_{test})$
then $\hat{Y}_{test} = Y_{\hat{i}}^*$
expensive – we must store the entire training set and scan all of it for every $X_{test}$
doesn't scale well to many features (speedups like generalized binary search degrade in high dimensions)
hard to define the distance function to reflect true similarity
“metric learning” is a hard task of its own…
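A minimal 1-NN sketch; Euclidean distance is just one possible choice of dist, per the metric-learning caveat above, and the data is made up:

```python
import numpy as np

def nearest_neighbour_predict(X_train, Y_train, x_test):
    """1-NN: return the label of the most similar training instance."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # dist(X_i, X_test) for all i
    i_hat = np.argmin(dists)                          # index of the closest instance
    return Y_train[i_hat]

# Toy example with hypothetical (height, weight) features:
X_train = np.array([[1.7, 60.0], [1.8, 90.0], [1.9, 95.0]])
Y_train = np.array([-1, 1, 1])
print(nearest_neighbour_predict(X_train, Y_train, np.array([1.75, 88.0])))  # 1
```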
Bayes rule of conditional probability
joint distribution of features and labels p(X,Y)
can be decomposed by the chain rule to
$p(X)\,p(Y \mid X)$ – first measure the features, then determine the response
$p(Y)\,p(X \mid Y)$ – first determine the response, then get compatible features
for classification, we want to use Bayes' rule (a numeric example follows this list): $$\underbrace{p(Y = k \mid X)}_{\text{posterior}} = \frac{\overbrace{p(X \mid Y = k)}^{\text{likelihood}} \cdot \overbrace{p(Y = k)}^{\text{prior}}}{\underbrace{p(X)}_{\text{marginal}}}$$
posterior – we want to update our judgement based on measuring the features
prior – we know this (1% for a disease in the general population)
likelihood – the probability of observing the features (fever, cough) given the disease
marginal – can be recovered by summing over all possibilities: $p(X) = \sum_{k=1}^C p(X \mid Y = k)\, p(Y = k)$
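A small worked example of the rule (all numbers are made up for illustration):

```python
# Hypothetical disease test: 1% prior, 90% chance of these symptoms if sick,
# 10% chance of the same symptoms if healthy.
prior = 0.01
likelihood_sick = 0.90      # p(X | Y = sick)
likelihood_healthy = 0.10   # p(X | Y = healthy)

# marginal: sum over all classes, each weighted by its prior
marginal = likelihood_sick * prior + likelihood_healthy * (1 - prior)
posterior = likelihood_sick * prior / marginal
print(posterior)  # ~0.083: the disease stays unlikely despite matching symptoms
```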
Bayes' rule is hugely important for a number of reasons:
fundamental equation for probabilistic machine learning
allows clean uncertainty handling (among others)
puts model accuracy into perspective (what is good or bad)
can be used to evaluate a model, but doesn't tell us how to improve it
better idea – perceptron loss: weigh the penalty by the error magnitude (using ReLU): $$L(\hat{Y}_i, Y_i^*) = \operatorname{ReLU}(-Y_i^* X_i \beta) = \begin{cases} 0 & \hat{Y}_i = Y_i^* \\ |X_i \beta| = -Y_i^* X_i \beta & \hat{Y}_i \ne Y_i^* \end{cases}$$
note that this is only for one instance; averaged over all instances, the error is $$\frac{1}{N} \sum_{i=1}^N \operatorname{ReLU}(-Y_i^* X_i \beta) = \frac{1}{N} \sum_{i: \hat{Y}_i \ne Y_i^*} -Y_i^* X_i \beta$$
Gradient descent looks as follows: $$\beta^{(t)} = \beta^{(t-1)} - \tau \underbrace{\frac{\partial L^{(t-1)}}{\partial \beta}}_{\text{loss derivative}}$$ for a learning rate $\tau \ll 1$. In our case, the derivative (for a single instance) is $$\frac{\partial L^{(t-1)}}{\partial \beta} = \frac{\partial \operatorname{ReLU}(-Y_i^* X_i \beta)}{\partial \beta} = \begin{cases} 0 & -Y_i^* X_i \beta < 0 \text{ (correct)} \\ -Y_i^* X_i^T & \text{otherwise} \end{cases}$$
Rosenblatt’s algorithm
Algorithm (Rosenblatt’s algorithm):
initialization of $\beta^{(0)}$ – random/uniform numbers
for t=1,…,T (or until convergence)
update β using the gradient descent formula above, with minor changes:
additionally, use $\tau / N$ instead of $\tau$ (so the step size doesn't change when the training set size changes)
sum over only incorrect guesses from the previous iteration
$$\beta^{(t)} = \beta^{(t-1)} + \frac{\tau}{N} \sum_{i: \hat{Y}_i^{(t-1)} \ne Y_i^*} Y_i^* X_i^T$$ (see the numpy sketch at the end of this section)
only converges when the data is "linearly separable" (i.e. $\exists \beta$ with loss 0)
(+) first practical learning algorithm, established the gradient descent principle
(-) only converges when the training set is linearly separable
(-) tends to overfit, bad at generalization
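A minimal numpy version of the algorithm (the uniform random initialization and the fixed iteration cap are arbitrary choices here):

```python
import numpy as np

def rosenblatt(X, Y, tau=0.1, T=1000, seed=0):
    """Perceptron training; X is (N, D), Y holds labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    beta = rng.uniform(-1.0, 1.0, size=D)  # initialization of beta^(0)
    for _ in range(T):
        wrong = np.sign(X @ beta) != Y     # incorrect guesses of the previous step
        if not wrong.any():                # loss 0: data linearly separated
            break
        # gradient step, summing only over the misclassified instances
        beta += (tau / N) * (Y[wrong] @ X[wrong])
    return beta
```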
Support Vector Machine (SVM)
Improved algorithm (popular around 1995): (Linear) Support Vector Machine (SVM)
maintain a safety region around data where solution should not be
closest points – support vectors
learns $\hat{Y} = \arg\max_k p(Y = k \mid X)$, i.e. the LHS of Bayes' rule – it is discriminative
Reminder: the distance between a point $X_i$ and a plane $(\beta, b)$ is $d(X_i, (\beta, b)) = \frac{X_i \beta + b}{||\beta||}$
notice that scaling $\beta$ doesn't change the distance – pairs $(\tau\beta, \tau b)$ define the same plane – so we can define an equivalence class $H = \{\beta', b' \mid \beta' = \tau \beta_H, b' = \tau b_H\}$ for representatives $\beta_H, b_H$ (which can be chosen arbitrarily)
The radius of the safety region (the margin) is the smallest distance of a point to the decision plane (also called the "Hausdorff distance between $H$ and $TS$"): $$m_H = \min_{i=1,\dots,N} d(X_i, (\beta_H, b_H))$$
Let $\hat{i}$ be the closest point. Then $Y_{\hat{i}}^* (X_{\hat{i}} \beta_H + b_H) = m_H ||\beta_H||$
we don’t use minus because we want distance, not penalty (like perceptron)
we again use the trick of multiplying by the label, and we bring the norm to the other side
Now we choose the representative such that the left-hand side of the equation above equals 1.
The decision plane is then correct when $\forall i: Y_i^*(X_i \beta_H + b_H) \ge 1$ (with equality exactly for $\hat{i}$). To generalize as well as possible, we want the plane that maximizes the margin: $$H = \arg\max_H m_H = \arg\max_H \left( \min_i \frac{Y_i^*(X_i \beta_H + b_H)}{||\beta_H||} \right) = \arg\max_H \left( \frac{1}{||\beta_H||} \min_i Y_i^*(X_i \beta_H + b_H) \right) = \arg\max_H \frac{1}{||\beta_H||}$$
This yields an optimization problem: $$\arg\max_H \frac{1}{||\beta_H||} \quad \text{s.t.} \quad \forall i: Y_i^*(X_i \beta_H + b_H) \ge 1$$
This is inconvenient – change to an equivalent, more convenient optimization problem: $$\arg\min_H \frac{1}{2} \beta_H^T \beta_H \quad \text{s.t.} \quad \forall i: Y_i^*(X_i \beta_H + b_H) \ge 1$$
Now we define slack variables $\xi_i \ge 0$ that measure how much the constraint is violated: $$\arg\min_H \frac{1}{2} \beta_H^T \beta_H \quad \text{s.t.} \quad \forall i: Y_i^*(X_i \beta_H + b_H) \ge 1 - \xi_i$$ We also want to minimize the slack, which we do via Lagrange multipliers: $$\arg\min_{H, \xi} \underbrace{\frac{1}{2} \beta_H^T \beta_H}_{\text{maximize margin}} + \underbrace{\frac{\lambda}{N} \sum_{i=1}^N \operatorname{ReLU}\left(1 - Y_i^*(X_i \beta_H + b_H)\right)}_{\text{minimize penalties for misclassified points}}$$
adjusting $\lambda$ trades off being right vs. being robust (hyperparameter tuning)
The iteration step looks as follows: $$\beta^{(t)} = \beta^{(t-1)} - \tau \left( \beta^{(t-1)} + \frac{\lambda}{N} \sum_{i: Y_i^*(X_i \beta + b) < 1} -Y_i^* X_i^T \right) \qquad b^{(t)} = b^{(t-1)} - \tau \frac{\lambda}{N} \sum_{i: Y_i^*(X_i \beta + b) < 1} -Y_i^*$$ (a code sketch follows below)
note that we can't get stuck in a local minimum, since the objective function is convex
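A sketch of this iteration in numpy ((sub)gradient descent on the objective above; the default hyperparameters are arbitrary):

```python
import numpy as np

def linear_svm(X, Y, lam=1.0, tau=0.01, T=1000):
    """Linear SVM via (sub)gradient descent; Y holds labels in {-1, +1}."""
    N, D = X.shape
    beta, b = np.zeros(D), 0.0
    for _ in range(T):
        violated = Y * (X @ beta + b) < 1  # points inside the safety margin
        grad_beta = beta - (lam / N) * (Y[violated] @ X[violated])
        grad_b = -(lam / N) * Y[violated].sum()
        beta -= tau * grad_beta
        b -= tau * grad_b
    return beta, b
```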
Linear Discriminant Analysis (LDA)
idea: assume that features for each class form a cluster
assume they’re elliptic and model each one as a Gaussian distribution
is RHS of Bayes (generative) – calculate the posterior from Bayes formula
Linear – clusters have the same shape
there is also Quadratic Discriminant Analysis, which permits different shapes but isn't linear
To get any ellipse, we start with a unit circle $z$ and stretch ($\lambda$), rotate ($Q$) and shift ($\mu$) it: $$Q \lambda z + \mu = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} z + \mu$$
Solving for $z$ (and using the fact that $Q$ is orthogonal), we get $z = \lambda^{-1} Q^{-1} (X - \mu) = \lambda^{-1} Q^T (X - \mu)$
The Gaussian distribution can be defined as $$N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \left( (x - \mu)\, \sigma^{-1} \right)^2 \right)$$
with mean $\mu$ and standard deviation $\sigma$ (variance $\sigma^2$)
standard normal distribution is for μ=0 and σ=1
For higher dimensions (with our circle $z$), we get the generalized Gaussian $$N(x \mid \mu, \Sigma) = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)\, \Sigma^{-1} (x - \mu)^T \right)$$
$\Sigma^{-1} = K$ is the precision matrix
since $\Sigma^{-1} = Q^{-1} \lambda^{-1} \lambda^{-1} Q^{-T}$, we see a decomposition into eigenvalues (corresponding to the squared radii of the ellipse) and eigenvectors (the corresponding radius directions)
Let $\{X_i\}_{i=1}^{N_1}$ be the features of class 1. Then $$N(X \mid \mu_1, \Sigma_1) = \frac{1}{\sqrt{\det(2\pi\Sigma_1)}} \exp\left( -\frac{1}{2} (X - \mu_1)\, \Sigma_1^{-1} (X - \mu_1)^T \right)$$
learning: find $\mu_1$ and $\Sigma_1$
To derive the learning method, we’ll use two things:
maximum likelihood principle: choose $\mu_1$ and $\Sigma_1$ such that the TS is a typical outcome of the resulting model (i.e. the best model maximizes the likelihood of the TS)
i.i.d. assumption: training instances drawn independently from the same distribution
independently – joint distribution is the product
identically distributed – all instances come from the same distribution
For the probability of the training set, we get $$p(TS) = p(X_1, \dots, X_N) \overset{\text{independent}}{=} \prod_{i=1}^N p_i(X_i) \overset{\text{identically distr.}}{=} \prod_{i=1}^N p(X_i) = \prod_{i=1}^N N(X_i \mid \mu, \Sigma)$$
Using the maximum likelihood principle, the problem becomes $$\hat\mu, \hat\Sigma = \arg\max_{\mu, \Sigma} p(TS)$$
It's mathematically simpler to minimize the negative logarithm of $p(TS)$ (applying a monotonic function to an optimization problem doesn't change the arg min, and $\max = -\min$). We get $$L = -\log p(TS) = \sum_{i=1}^N \left[ \frac{1}{2} \log \det(2\pi\Sigma) + \frac{1}{2} (X_i - \mu)\, \Sigma^{-1} (X_i - \mu)^T \right]$$
This is our loss. We take the derivative and set it to 0, since that finds the optimum. Differentiating with respect to $\mu$ first eliminates the log-det part of the loss entirely: $$\frac{\partial L}{\partial \mu} = -\sum_{i=1}^N (X_i - \mu)\, \Sigma^{-1} = 0 \implies \hat\mu = \frac{1}{N} \sum_{i=1}^N X_i$$
In other words, the mean μ is the average (shocking, I know).
For $\Sigma$, this will be a little more complicated. We differentiate with respect to $K = \Sigma^{-1}$ instead, which gives the following:
$$\begin{aligned} \frac{\partial L}{\partial K} &= \sum_{i=1}^N \Big[ -\underbrace{(K^T)^{-1}}_{\frac{\partial \log\det K}{\partial K} = (K^T)^{-1}} + \underbrace{(X_i - \mu)^T (X_i - \mu)}_{\frac{\partial\, v A v^T}{\partial A} = v^T v} \Big] = 0 \\ &\overset{K \text{ symmetric}}{\implies} \sum_{i=1}^N \left[ -K^{-1} + (X_i - \mu)^T (X_i - \mu) \right] = 0 \\ &\implies -N K^{-1} + \sum_{i=1}^N (X_i - \mu)^T (X_i - \mu) = 0 \\ &\implies \Sigma = \frac{1}{N} \sum_{i=1}^N (X_i - \mu)^T (X_i - \mu) \end{aligned}$$
Again, in other words, the covariance is the average of the outer products of the mean-centered vectors, which also makes sense.
Now we have 2 classes with the same covariance (by the LDA assumption) and we can:
determine the two means $(\mu_1, \mu_{-1})$ as $$\mu_1 = \frac{1}{N_1} \sum_{i: Y_i^* = 1} X_i \qquad \mu_{-1} = \frac{1}{N_{-1}} \sum_{i: Y_i^* = -1} X_i$$
calculate the covariance (which is the same for both classes): $$\Sigma = \frac{1}{N} \left[ \sum_{i: Y_i^* = 1} (X_i - \mu_1)^T (X_i - \mu_1) + \sum_{i: Y_i^* = -1} (X_i - \mu_{-1})^T (X_i - \mu_{-1}) \right]$$
use the RHS of Bayes' rule and our calculations to get the LHS: $$\hat{Y}_i = \operatorname{sign}(X_i \beta + b) \quad \text{with} \quad \beta = 2 \Sigma^{-1} (\mu_1 - \mu_{-1})^T, \quad b = \mu_{-1} \Sigma^{-1} \mu_{-1}^T - \mu_1 \Sigma^{-1} \mu_1^T$$ (a code sketch follows the derivation below)
The derivation of (3) (the last step) looks as follows: assume $p(Y = 1) = p(Y = -1) = \frac{1}{2}$ for simpler computations, with $p(Y = -1 \mid X)$ being analogous:
Plugging the result into the previous equation, we obtain the sigmoid function σ:
$$p(Y = 1 \mid X) = \frac{1}{1 + \exp(-(X\beta + b))} \qquad \sigma(t) = \frac{1}{1 + \exp(-t)}$$
Two of its properties will be helpful for us: $$\sigma(-t) = 1 - \sigma(t) \qquad \frac{\partial \sigma(t)}{\partial t} = \sigma(t)\,\sigma(-t)$$
Doing an analogous derivation for $p(Y = -1 \mid X)$, we get $$\hat{y} = \arg\max_k p(Y = k \mid X) \iff \begin{cases} 1 & \text{if } \sigma(X\beta + b) > \frac{1}{2} \\ -1 & \text{if } \sigma(X\beta + b) < \frac{1}{2} \end{cases} \iff \begin{cases} 1 & \text{if } X\beta + b > 0 \\ -1 & \text{if } X\beta + b < 0 \end{cases} \iff \hat{y} = \operatorname{sign}(X\beta + b)$$
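Putting the steps together, a minimal two-class LDA fit in numpy, following the formulas above (function names are my own):

```python
import numpy as np

def lda_fit(X, Y):
    """Two-class LDA with shared covariance; Y holds labels in {-1, +1}."""
    X1, Xm1 = X[Y == 1], X[Y == -1]
    mu1, mum1 = X1.mean(axis=0), Xm1.mean(axis=0)  # the two class means
    # pooled covariance: mean-centered outer products, averaged over all N
    Sigma = ((X1 - mu1).T @ (X1 - mu1) + (Xm1 - mum1).T @ (Xm1 - mum1)) / len(X)
    K = np.linalg.inv(Sigma)                       # precision matrix
    beta = 2 * K @ (mu1 - mum1)
    b = mum1 @ K @ mum1 - mu1 @ K @ mu1
    return beta, b

def lda_predict(X, beta, b):
    return np.sign(X @ beta + b)
```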
Alternatives for training $\beta$ and $b$:
fit mean and covariance of clusters
least-squares regression: $L(Y_i^*, \hat{Y}_i) = (Y_i^* - (X_i\beta + b))^2$ (same solution as 1)
Fisher's idea: define 1D scores $Z_i = X_i \beta$ and choose $\beta$ such that a threshold on $Z_i$ has minimum error
define the projections of the means: $\hat\mu_1 = \mu_1 \beta$, $\hat\mu_{-1} = \mu_{-1} \beta$
intuition: $\hat\mu_1$ and $\hat\mu_{-1}$ should be as far apart as possible: $\beta = \arg\max_\beta (\hat\mu_1 - \hat\mu_{-1})^2$
doesn't quite work, because scaling $\beta \to \tau\beta$ scales the objective: $(\hat\mu_1 - \hat\mu_{-1})^2 \to \tau^2 (\hat\mu_1 - \hat\mu_{-1})^2$
solution: scale by the variances $\hat\sigma_1^2 = \operatorname{Var}(Z_i \mid Y_i^* = 1)$, $\hat\sigma_{-1}^2 = \operatorname{Var}(Z_i \mid Y_i^* = -1)$, which gives $$\hat\beta = \arg\max_\beta \frac{(\hat\mu_1 - \hat\mu_{-1})^2}{\hat\sigma_1^2 + \hat\sigma_{-1}^2}$$
again gives the same solution as 1 and 2
Logistic regression (LR): same posterior as LDA, but learn LHS of Bayes rule
gives a different solution than LDA
Logistic regression (LR)
We again make the i.i.d. assumption – all labels are drawn independently from the same posterior: $$p\left( (Y_i^*)_{i=1}^N \mid (X_i)_{i=1}^N \right) = \prod_{i=1}^N p(Y_i^* \mid X_i)$$
as a reminder, this was swapped for LDA (we had $p(X \mid Y)$)
use maximum likelihood: choose $\hat\beta, \hat{b}$ such that the posterior of the TS is maximized: $$\hat\beta, \hat{b} = \arg\max_{\beta, b} \prod_{i=1}^N p(Y_i^* \mid X_i) \overset{-\log}{\iff} \hat\beta, \hat{b} = \arg\min_{\beta, b} -\sum_{i=1}^N \log p(Y_i^* \mid X_i)$$
For LR, $Y_i^* \in \{0, 1\}$, which allows us to rewrite (with the $\sigma$ results from LDA): $$\hat\beta, \hat{b} = \arg\min_{\beta, b} -\sum_{i=1}^N \left[ Y_i^* \log \sigma(X_i\beta + b) + (1 - Y_i^*) \log \sigma(-(X_i\beta + b)) \right]$$
there is no analytic solution, but the problem is convex (local optimum = global optimum, solve via gradient descent)
Here we have $-\log(\sigma(-t)) = \log(1 + \exp(t))$, which is called the softplus function.
Simplifying for $b = 0$ and using the $\sigma$ properties above, we get the derivative with respect to $\beta$:
$$\frac{\partial\, \text{Loss}(TS)}{\partial \beta} = -\sum_{i=1}^N \Big[ Y_i^* \underbrace{(1 - \sigma(X_i\beta))}_{\sigma(-X_i\beta)} X_i + (1 - Y_i^*)\, \sigma(X_i\beta)\, (-X_i) \Big] = -\sum_{i=1}^N \left[ Y_i^* X_i - Y_i^* \sigma(X_i\beta) X_i - \sigma(X_i\beta) X_i + Y_i^* \sigma(X_i\beta) X_i \right]$$
obtaining $$\frac{\partial\, \text{Loss}(TS)}{\partial \beta} = \sum_{i=1}^N \underbrace{(\sigma(X_i\beta) - Y_i^*)}_{\text{error}} \cdot X_i \overset{!}{=} 0$$ (a code sketch follows the table below)
The four cases that can occur are as follows:

| real value $\setminus$ classifier result | $\sigma(X_i\beta) \approx 1$ | $\sigma(X_i\beta) \approx 0$ |
| --- | --- | --- |
| $Y_i^* = 1$ | no correction | error $\approx -1$: pulls $\sigma(X_i\beta) \to 1$ |
| $Y_i^* = 0$ | error $\approx 1$: pulls $\sigma(X_i\beta) \to 0$ | no correction |
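A minimal gradient-descent sketch using this error form ($b$ folded away as in the simplification above; dividing the gradient by $N$ is just a step-size convention):

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression(X, Y, tau=0.1, T=1000):
    """Gradient descent on the LR loss; Y holds labels in {0, 1}, b = 0."""
    N, D = X.shape
    beta = np.zeros(D)
    for _ in range(T):
        error = sigma(X @ beta) - Y        # (sigma(X_i beta) - Y_i*) for every i
        beta -= (tau / N) * (X.T @ error)  # step along the negative gradient
    return beta
```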
Summary
with $Y \in \{-1, 1\}$, all methods have the same decision rule $\hat{Y}_i = \operatorname{sign}(X_i\beta + b)$, but they differ in how they define and find the optimal $\beta, b$
common objective function: $$\hat\beta, \hat{b} = \arg\min_{\beta, b} \underbrace{\frac{\lambda}{2} \beta^T \beta}_{\text{regularization}} + \frac{1}{N} \sum_{i=1}^N \underbrace{\text{Loss}(Y_i^*, X_i\beta + b)}_{\text{data term}}$$
four-points classification: $\tilde{x} = x_1 \cdot x_2$, $\hat{y} = \begin{cases} +1 & \tilde{x} < 0 \\ -1 & \text{otherwise} \end{cases}$
BMI (non-linear formula, linear classification)
problem: hand-crafting φ is difficult
solution: learn φ (multi-layer neural networks)
Neural networks
definition: inputs $z_{in}$ (e.g. $z_{in} = X$, or $z_{in}$ = outputs of other neurons)
computation (activation): $z_{out} = \varphi(z_{in}\beta + b)$ (a linear function plugged into a non-linear one)
pre-activation: $\tilde{z} = z_{in}\beta + b$
popular activation functions $\varphi$ (numpy definitions follow this list):
identity: used as output of regression networks
classic choices: $\sigma$ or $\tanh$, almost the same when scaled/offset
modern choices:
ReLU – not differentiable at 0 but nice in practice
leaky ReLU: $\operatorname{LReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & \text{otherwise} \end{cases}$
exponential linear unit: $\operatorname{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & \text{otherwise} \end{cases}$
swish function: $x \cdot \sigma(x)$
usually, b is not used but turned into a “bias neuron” for each layer
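For concreteness, the activation functions above as numpy one-liners (the default $\alpha$ values are conventional choices, not prescribed by the lecture):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x):
    return x * sigma(x)
```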
A neural network is a collection of neurons in parallel layers
fully-connected: all neurons of previous layer connect to those of the next
for 4 layers, we have $z_0 = [1, x]$ (with bias), hidden $z_1, z_2$ and output $z_3 = y$
a sample computation would then be $$y = \varphi_3([1; z_2] B_3) = \varphi_3([1; \varphi_2([1; z_1] B_2)] B_3) = \varphi_3([1; \varphi_2([1; \varphi_1(z_0 B_1)] B_2)] B_3)$$
(figure: an example network where (1) is the input layer, (2–5) are hidden layers and (6) is the output layer)
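A sketch of this forward pass with a bias neuron prepended per layer (the layer sizes and random weights are made up for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, Bs, phis):
    """Fully-connected forward pass: z -> phi([1; z] B) per layer."""
    z = x
    for B, phi in zip(Bs, phis):
        z = np.concatenate([[1.0], z])  # prepend the bias neuron
        z = phi(z @ B)                  # pre-activation, then activation
    return z

# Hypothetical 2-16-16-1 regression network with random weights:
rng = np.random.default_rng(0)
Bs = [rng.normal(size=(3, 16)), rng.normal(size=(17, 16)), rng.normal(size=(17, 1))]
phis = [relu, relu, lambda t: t]        # identity output for regression
print(forward(np.array([0.5, -1.2]), Bs, phis))
```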
Previously, NNs were not believed to be a good idea, but:
~2005: GPUs turned out to be very efficient at computing them
big data became available (to train the big network)
~2012: starts outperforming everything else
Activation functions:
$\varphi_1, \dots, \varphi_{L-1}$ (not for the input): chosen by the designer
nowadays we usually use one of the modern choices described above
$\varphi_L$: determined by the application
regression ($Y \in \mathbb{R}$): identity
classification ($p(Y = k \mid X)$): softmax
Important theorems:
neural networks with $L \ge 2$ (i.e. one hidden layer) are universal approximators (can approximate any function to any accuracy), if there are enough neurons in the hidden layer
purely existential proof, but nice to know
until ~2005, such shallow networks were the standard, since in theory they are enough
bad idea – deeper networks are easier to train (many good local optima)
finding the optimal network is NP-hard ⟹ we need approximation algorithms