IDC410 Machine Learning

2022-01-11.md

Machine Learning

Data collected from a real world process in form of input output pairs.
ML algorithm gives an approximation to $f$ which gives output for input. $f$ is the target function
$\hat f$ is a family of functions from the Hypothesis Space and the function is chosen by finding the argmin of some cost function.

Structure of Data

Most data can be represented as a table of input columns and an output column.
Data can be of
- Continuous Variable
  - We want a PDF
    - # parameters depends on PDF
- Categorical Variable
  - We want the probability distribution for the variable
    - k-1 parameters
  - There is no obvious order on them
- Ordinal Variable
  - Categorical with order

But there are an infinite number of input parameters, as the input may be Continuous.

Conditional Probability

We want to estimate the Conditional Probability $P(y|\vec{x})$
This is called a discriminative model.
In the binary categorical output, it is a bernoulli distribution.

Bayes Theorem

\[P(Y\|X) = \frac{P(X\|Y)P(Y)}{P(X)}\]

Conditional Independance Assumption

Given an output Y, all input parametersare independant of each other, i.e

$P(X|y_i) = \prod_j P(x_j|y_i)$

By modelling $P(x_j|y_i)$ as a Gaussian function, we have a total of $2n$ parameters for every $y_i$. In the case of the binary output, there are a total of $2n + 2n + 1$ parameters only, which reduces the complexity of the problem. $P(x_j|y_1) = 2n$ $P(x_j|y_2) = 2n$ $P(y_1) = \theta,\ P(y_2)=1-\theta$

This is called the Naive Bayes algorithm.

Since we are looking at the joint probability distribution, and if we know it, we can generate a sample. Hence this model is also known as a generative sample.

Defining without Conditional Independance

We can also model the entire problem as an N dimensional Gaussian without the [#Conditional Independance Assumption].

For this, we need an N dimensional $\vec{\mu}$ and a covarience matrix which has $\mathcal{O}(n^2)$

So here, we have a total of $\mathcal{O}(n^2)$. Hence, [#Conditional Independance Assumption] makes the parameter space linear in $n$.

Basically, constraining the Hypothesis space reduces the complexity of the ML problem.

Logistic Approach

We can further simplify the model further, by considering

$P(y_i|X) = \frac{1}{1+\exp(\sum\beta_i x_i)}$

This model has only $n+1$ parameters.

Equation of decision boundary

P(y_1|X) = P(y_2|X) $\implies 0 = \sum \beta_i x_i$

That is, our discrimination boundary is a plane. If the data cannot be cut by the plane, then we cannot use this assumption.

Constraint defined on our hypothesis set is known as a Language Bias, and it is our choice which Language Bias to pick.