Linear Regression, Loss, and Bias
Simplest setting
- Input: set of points $(x_1, y_1), \ldots, (x_n, y_n)$ in $\mathbb{R}^2$
- What function $f$ explains the relationship of the $y$'s with the $x$'s?
- What linear function $y = ax + b$ describes it best?
Multivariate version
- Input: set of points $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ in $\mathbb{R}^d \times \mathbb{R}$
- Find a linear function $f(\mathbf{x}) = w_1 x_1 + \cdots + w_d x_d + b$ that describes the $y$'s in terms of the $\mathbf{x}$'s.
Why linear regression?
- Have to assume something!
- Models hidden linear relationship + noise
Typical objective: minimize square error
- Points can rarely be described exactly by a linear relationship
- How do we decide between several non-ideal options?
- Typically we want to find the $f$ that minimizes the total square error: $\sum_{i=1}^{n} (f(x_i) - y_i)^2$
What about Categorical Data?
- Convert to numerical first
- One way is to simply convert to unique integers
- A better way is to create N new columns (one for each category) and make them boolean (is_x, is_y, is_z, etc.) -- one-hot encoding
For example with 3 categories:
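With a hypothetical `color` feature taking the values red, green, and blue, one-hot encoding produces:

| color | is_red | is_green | is_blue |
|:-:|:-:|:-:|:-:|
| red | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| blue | 0 | 0 | 1 |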
1-D Example
We'll start with an example from the Rust Machine Learning Book.
// Use a crate built from source on github
:dep linfa-book = { git = "https://github.com/rust-ml/book" }
:dep ndarray = { version = "^0.15.6" }
// create_curve implemented at https://github.com/rust-ml/book/blob/main/src/lib.rs#L52
use linfa_book::create_curve;
use ndarray::Array2;
use ndarray::s;
fn generate_data(output: bool) -> Array2<f64> {
/*
* Generate a dataset of x and y values
*
* - Randomly generate 50 points between 0 and 7
* - Calculate y = m * x^power + b + noise
* where noise is uniformly random between -0.5 and 0.5
*
* m = 1.0 (slope)
* power = 1.0 (straight line)
* b = 0.0 (y-intercept)
* num_points = 50
* x_range = [0.0, 7.0]
*
* This produces a 50x2 Array2<f64> with the first column being x and the
* second being y
*/
let array: Array2<f64> = create_curve(1.0, 1.0, 0.0, 50, [0.0, 7.0]);
// Converting from an array to a Linfa Dataset can be the trickiest part of this process
// The first part has to be an array of arrays even if they have a single entry for the 1-D case
// The underlying type is kind of ugly but thankfully the compiler figures it out for us
// let data: ArrayBase<OwnedRepr<f64>, Ix2> = .... is the actual type
// The second part has to be an array of values.
// let targets: Array1<f64> is the actual type
let data = array.slice(s![.., 0..1]).to_owned();
let targets = array.column(1).to_owned();
if output {
println!("The original array is:");
println!("{:.2?}", array);
println!("The data is:");
println!("{:.2?}", data);
println!("The targets are:");
println!("{:.2?}", targets);
}
array
}
generate_data(true);
The original array is:
[[6.59, 6.85],
[2.49, 2.39],
[0.77, 0.32],
[1.85, 2.25],
[2.13, 2.46],
[0.93, 0.60],
[6.17, 6.51],
[1.91, 2.02],
[2.82, 3.20],
[0.55, 0.23],
[4.31, 4.06],
[4.04, 3.54],
[6.15, 6.51],
[0.10, -0.33],
[0.68, 0.77],
[1.03, 1.18],
[0.51, 0.68],
[3.75, 3.38],
[1.59, 1.57],
[5.77, 6.26],
[1.18, 1.15],
[2.99, 2.99],
[4.33, 4.69],
[1.36, 1.52],
[5.20, 5.67],
[6.18, 6.41],
[6.10, 6.46],
[6.82, 6.68],
[4.27, 3.88],
[1.67, 1.71],
[5.42, 5.62],
[2.47, 2.02],
[4.48, 4.95],
[1.82, 2.27],
[5.53, 5.27],
[0.52, 0.53],
[5.63, 5.35],
[0.13, 0.43],
[4.47, 4.83],
[0.58, 1.00],
[3.13, 3.28],
[6.84, 6.69],
[6.46, 6.15],
[6.04, 6.27],
[5.63, 6.12],
[1.60, 1.86],
[0.18, 0.08],
[6.24, 6.70],
[0.20, 0.06],
[4.44, 4.10]], shape=[50, 2], strides=[2, 1], layout=Cc (0x5), const ndim=2
The data is:
[[6.59],
[2.49],
[0.77],
[1.85],
[2.13],
[0.93],
[6.17],
[1.91],
[2.82],
[0.55],
[4.31],
[4.04],
[6.15],
[0.10],
[0.68],
[1.03],
[0.51],
[3.75],
[1.59],
[5.77],
[1.18],
[2.99],
[4.33],
[1.36],
[5.20],
[6.18],
[6.10],
[6.82],
[4.27],
[1.67],
[5.42],
[2.47],
[4.48],
[1.82],
[5.53],
[0.52],
[5.63],
[0.13],
[4.47],
[0.58],
[3.13],
[6.84],
[6.46],
[6.04],
[5.63],
[1.60],
[0.18],
[6.24],
[0.20],
[4.44]], shape=[50, 1], strides=[1, 1], layout=CFcf (0xf), const ndim=2
The targets are:
[6.85, 2.39, 0.32, 2.25, 2.46, 0.60, 6.51, 2.02, 3.20, 0.23, 4.06, 3.54, 6.51, -0.33, 0.77, 1.18, 0.68, 3.38, 1.57, 6.26, 1.15, 2.99, 4.69, 1.52, 5.67, 6.41, 6.46, 6.68, 3.88, 1.71, 5.62, 2.02, 4.95, 2.27, 5.27, 0.53, 5.35, 0.43, 4.83, 1.00, 3.28, 6.69, 6.15, 6.27, 6.12, 1.86, 0.08, 6.70, 0.06, 4.10], shape=[50], strides=[1], layout=CFcf (0xf), const ndim=1
Let's plot these points to see what they look like
// 2021
//:dep plotters={version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"]}
// 2024
:dep plotters={version = "^0.3.0", default-features = false, features = ["evcxr", "all_series"]}
:dep ndarray = { version = "^0.15.6" }
extern crate plotters;
use plotters::prelude::*;
{
let array: Array2<f64> = generate_data(false);
let x_values = array.column(0);
let y_values = array.column(1);
evcxr_figure((640, 480), |root| {
let mut chart = ChartBuilder::on(&root)
// the caption for the chart
.caption("2-D Plot", ("Arial", 20).into_font())
.x_label_area_size(40)
.y_label_area_size(40)
// the X and Y coordinates spaces for the chart
.build_cartesian_2d(0f64..8f64, 0f64..8f64)?;
chart.configure_mesh()
.x_desc("X Values")
.y_desc("Y Values")
.draw()?;
chart.draw_series(
x_values.iter().zip(y_values.iter()).map(|(&x, &y)| {
Circle::new((x, y), 3, RED.filled())
}),
)?;
Ok(())
})
}
Let's fit a linear regression to it
// There are multiple packages in this git repository so we have to declare
// all the ones we care about
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = { version = "^0.15.6" }
// Now we need to declare which function we are going to use
use linfa::Dataset;
use linfa_linear::LinearRegression;
use linfa_linear::FittedLinearRegression;
use linfa::prelude::*;
fn fit_model(array: &Array2<f64>) -> FittedLinearRegression<f64> {
// Split the array into features (data) and targets
let data = array.slice(s![.., 0..1]).to_owned();
let targets = array.column(1).to_owned();
// And finally let's fit a linear regression to it
println!("Data: {:?} \n Targets: {:?}", data, targets);
let dataset = Dataset::new(data, targets);
let lin_reg: LinearRegression = LinearRegression::new();
let model: FittedLinearRegression<f64> = lin_reg.fit(&dataset).unwrap();
let ypred = model.predict(&dataset);
let loss = (dataset.targets() - ypred)
.mapv(|x| x.abs())
.mean();
println!("{:?}", loss);
model
}
let array: Array2<f64> = generate_data(false);
let model = fit_model(&array);
println!("{:?}", model);
Data: [[6.381616680690672],
[4.396361662331629],
[4.3324365575773305],
[0.8289740051702756],
[6.319103537997485],
[0.3335321841860137],
[5.095836624614474],
[3.6492182454480435],
[1.3364352957815282],
[0.09639507860613405],
[5.303183512825285],
[4.2951694397460285],
[5.4057438081138915],
[0.10960286365816452],
[4.603044369250634],
[4.024864366257263],
[4.281935499419392],
[4.06157301198745],
[5.755333890324657],
[1.3359866701598488],
[2.3904258537567324],
[4.029842664366427],
[2.2848232403062685],
[0.12639520987084052],
[2.9318196797275835],
[3.4846905959349144],
[2.1132653187529598],
[2.409170009823532],
[1.294104964714346],
[3.068189601192284],
[1.5698399649061598],
[4.585070111032277],
[0.9080582927284744],
[2.3500918290692145],
[5.412834219057823],
[0.5732492542928671],
[3.1955234173749907],
[0.6011335403952247],
[0.06756207483950094],
[0.4973826225929374],
[3.5708462015304976],
[5.016449358044835],
[6.275456045849482],
[0.24134187530345663],
[0.19382724586379352],
[3.900026524938423],
[1.977100227563272],
[4.934898600624889],
[3.970926362145382],
[1.5789373737594155]], shape=[50, 1], strides=[1, 1], layout=CFcf (0xf), const ndim=2
Targets: [6.74548949826222, 4.178753589022698, 3.9296433463393803, 0.9101170752817256, 6.422143609882557, -0.115321608937899, 5.126622670184417, 4.001921017212379, 1.2316076794171344, 0.3251034025687174, 5.291327977564027, 4.740353143641885, 5.005473769290836, -0.3802551719385241, 4.723355959069144, 3.740167334494597, 3.869317273792087, 3.583182714685261, 5.960145733182379, 1.4154622808579058, 2.395483563478454, 3.780181818550531, 2.195179330427658, -0.0780379205920585, 3.3559033104968132, 3.595446926448168, 1.9151631312465982, 2.5817866309609263, 1.1115468223706244, 3.2978733650316716, 1.7303235376317656, 4.577920815363182, 0.7328128405941892, 1.8555370714021548, 5.259719826144993, 0.46486256898256517, 3.661623687684834, 0.24442021204114717, 0.26510910851412994, 0.8471086226984335, 3.373961998065203, 4.696005829425596, 5.854185825360939, 0.3863210554131675, 0.29817841644956, 3.7047013726196827, 2.027075695584828, 5.171080561424032, 3.949876872423843, 2.0560366719493564], shape=[50], strides=[1], layout=CFcf (0xf), const ndim=1
Some(0.23628401752752)
FittedLinearRegression { intercept: -0.016055504752757018, params: [0.9953569138666288], shape=[1], strides=[1], layout=CFcf (0xf), const ndim=1 }
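Note that the fitted slope (≈0.995) and intercept (≈−0.016) are close to the generating values m = 1.0 and b = 0.0, and the Some(0.236...) line is the mean absolute error printed inside fit_model.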
Finally let's put everything together and plot the data points and model line
extern crate plotters;
use plotters::prelude::*;
{
let array: Array2<f64> = generate_data(false);
let model = fit_model(&array);
println!("{:?}", model);
let x_values = array.column(0);
let y_values = array.column(1);
evcxr_figure((640, 480), |root| {
let mut chart = ChartBuilder::on(&root)
// the caption for the chart
.caption("Linear Regression", ("Arial", 20).into_font())
.x_label_area_size(40)
.y_label_area_size(40)
// the X and Y coordinates spaces for the chart
.build_cartesian_2d(0f64..8f64, 0f64..8f64)?;
chart.configure_mesh()
.x_desc("X Values")
.y_desc("Y Values")
.draw()?;
chart.draw_series(
x_values.iter().zip(y_values.iter()).map(|(&x, &y)| {
Circle::new((x, y), 3, RED.filled())
}),
)?;
let mut line_points = Vec::with_capacity(8);
for i in 0..8 {
line_points.push((i as f64, (i as f64 * model.params()[0]) + model.intercept()));
}
// We can configure the rounded precision of our result here
let precision = 2;
let label = format!(
"y = {:.p$}x + {:.p$}",
model.params()[0],
model.intercept(),
p = precision
);
chart.draw_series(LineSeries::new(line_points, &BLACK))?
.label(&label)
.legend(|(x, y)| PathElement::new(vec![(x, y), (x + 20, y)], &BLACK));
// Without drawing the series-label area, the label above would never be shown
chart.configure_series_labels()
.background_style(&WHITE.mix(0.8))
.border_style(&BLACK)
.draw()?;
Ok(())
})
}
Data: [[2.04750415763924],
[6.367873757109327],
[6.398058085002853],
[2.768792455803591],
[0.1718068124456904],
[5.410322501484772],
[1.16759081573306],
[2.7536979999018953],
[4.950802437871839],
[0.6025334876065953],
[3.2764831065673183],
[1.721556072918826],
[2.944603890742717],
[1.1679420030984646],
[1.4583480833464897],
[5.603000967206449],
[0.7792958771030705],
[0.9903895595193106],
[5.849216682555504],
[5.039598826115972],
[6.79519856325383],
[5.9811481574554595],
[5.848135244005015],
[3.2303889196020474],
[5.180333900515606],
[6.441584502170909],
[2.1595411034915752],
[0.7565248539761933],
[6.575497606063534],
[4.5995576454693055],
[4.3608367164107875],
[5.548520783489524],
[6.525459030458023],
[0.6937753120972134],
[1.9522048348466015],
[2.2341967770269227],
[3.637038438411855],
[2.2190157648753184],
[2.681992233899634],
[5.566572415187922],
[0.9357604528754049],
[0.8490168837359822],
[6.448593818753438],
[2.6758983352983177],
[1.049091617939221],
[1.5851990026054852],
[6.098462409705635],
[2.585742190451155],
[0.17671676593123875],
[0.5246679878163918]], shape=[50, 1], strides=[1, 1], layout=CFcf (0xf), const ndim=2
Targets: [2.408804027014255, 6.469323578131448, 6.193104604093066, 2.418915373474102, 0.029226368788591195, 4.9972110392473255, 1.1150716816343442, 2.926060993136029, 4.490694341852246, 0.9502150297756105, 3.4740768014974783, 1.251263397177235, 3.4148633041576995, 1.4681943013079881, 0.9890758429931192, 5.391821032945574, 1.211034580394566, 1.3854290271879017, 5.431409730804429, 4.699669709042999, 6.998217208006695, 6.109111375789134, 5.75094664694154, 3.658429812887407, 5.289801849698538, 6.105335877939171, 1.8358314904060669, 0.36311224569624856, 6.342541141255715, 4.691208846854487, 4.250255794021059, 5.675447984854635, 6.793126113768862, 0.5931705314387281, 2.111835341368409, 2.144381384764876, 3.8510929693234646, 1.7549171320182217, 2.563236265429576, 5.214375657851614, 0.8020824408564826, 0.7138795774936122, 6.183001994443131, 2.633962608410515, 1.4663428201689852, 1.6261732010691712, 6.5133103394465035, 2.1115834556736774, 0.5083760324119229, 0.7979776364877389], shape=[50], strides=[1], layout=CFcf (0xf), const ndim=1
Some(0.25864023660929314)
FittedLinearRegression { intercept: 0.05790152043058407, params: [0.9754302443444729], shape=[1], strides=[1], layout=CFcf (0xf), const ndim=1 }
Note the `params` and `intercept` fields in the fitted model above; the plotting cell reads them back via `model.params()` and `model.intercept()` to draw the regression line.
Coefficient of determination (or $R^2$)
- How good is my function $f$?
- Input: points $(x_1, y_1), \ldots, (x_n, y_n)$ and predictions $\hat{y}_i = f(x_i)$
- Idea: compare the variance of the $y_i$'s to the deviation of the $y_i$'s from the $\hat{y}_i$'s
- Formally: $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
- Range: $(-\infty, 1]$ (should be in $[0, 1]$ for linear regression)
Let's compute $R^2$ for our data
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = { version = "^0.15.6" }
:dep ndarray-stats = { version = "^0.5.1" }
let array: Array2<f64> = generate_data(false);
let data = array.slice(s![.., 0..1]).to_owned();
let targets = array.column(1).to_owned();
// And finally let's fit a linear regression to it
let dataset = Dataset::new(data, targets).with_feature_names(vec!["x"]);
let lin_reg: LinearRegression = LinearRegression::new();
let model: FittedLinearRegression<f64> = lin_reg.fit(&dataset).unwrap();
let ypred = model.predict(&dataset);
// R^2 = 1 - MSE / Var(targets)
let variance:f64 = dataset.targets().var(0.0);
let mse:f64 = (dataset.targets() - ypred)
.mapv(|x| x.powi(2))
.mean().unwrap();
let r2 = 1.0 - mse/variance;
println!("variance = {:.3}, mse = {:.3}, R2 = {:.3}", variance, mse, r2);
variance = 2.971, mse = 0.080, R2 = 0.973
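An $R^2$ of 0.973 means the linear model explains about 97% of the variance in the targets, which is expected since the data was generated from a straight line plus bounded noise.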
Multivariable linear regression
What if you have multiple input variables and one output variable?
It's actually quite simple: use the same code as above, but make sure each row on the X side contains one value for each of the variables in your function. The code will compute as many coefficients as there are variables.
For 3-D input:
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = { version = "^0.15.6" }
use linfa::Dataset;
use linfa::traits::Fit;
use ndarray::{Array1, Array2, array};
use linfa_linear::LinearRegression;
use linfa_linear::FittedLinearRegression;
use linfa::prelude::*;
fn main() {
// Example data: 4 samples with 3 features each
let x: Array2<f64> = array![[1.0, 2.0, 3.0],
//[2.0, 3.0, 1.0],
[2.0, 3.0, 4.0],
[3.0, 4.0, 5.0],
[4.0, 5.0, 6.0],
//[6.0, 11.0, 12.0]
];
// Target values
let y: Array1<f64> = array![6.0,
//6.0,
9.0,
12.0,
15.0,
//29.0,
];
// Create dataset
let dataset = Dataset::new(x.clone(), y.clone());
// Fit linear regression model
let lin_reg = LinearRegression::new();
let model = lin_reg.fit(&dataset).unwrap();
// Print coefficients
println!("Coefficients: {:.3?}", model.params());
println!("Intercept: {:.3?}", model.intercept());
// Predict using the fitted model
let predictions = model.predict(&x);
println!("Predictions: {:.3?}", predictions);
}
main();
Coefficients: [22.667, -8.333, -11.333], shape=[3], strides=[1], layout=CFcf (0xf), const ndim=1
Intercept: 34.000
Predictions: [6.000, 9.000, 12.000, 15.000], shape=[4], strides=[1], layout=CFcf (0xf), const ndim=1
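Note that the coefficients are not the "obvious" [1.000, 1.000, 1.000]: in this toy dataset the three features are perfectly collinear (each row has the form x, x+1, x+2), so infinitely many coefficient/intercept combinations reproduce the targets exactly, and the solver returns just one of them. You can check that 22.667x - 8.333(x+1) - 11.333(x+2) + 34 ≈ 3x + 3, which equals x + (x+1) + (x+2).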
General Least Squares Fit
We can generalize the ordinary least squares regression problem by trying, for example, to find the parameters $\beta_0, \beta_1, \beta_2$ in an equation like
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2$$
to minimize the differences between some targets $y_i$ and the $f(x_i)$.
Note that these are still linear in the parameters.
We can rewrite this in matrix form: find $\beta$ minimizing $\lVert X\beta - y \rVert^2$,
where the matrix $X$ of basis-function values (here a row $[1, x_i, x_i^2]$ per point) is sometimes called the Design Matrix.
- We are going to use a different library familiar to us from the ndarray lecture.
- ndarray-linalg can compute the parameters to fit an arbitrary function as long as you have some idea of what the function might be.
  - e.g. express it as a design matrix
- Then solve the (generally overdetermined) system of linear equations with that matrix as the left-hand side and our observed values as the right-hand side.
- The result is the missing parameters of our assumed function.
:dep ndarray = { version = "^0.15.6" }
// See ./README.md for ndarray-linalg prereqs
// This is the version for MAC
:dep ndarray-linalg = { version = "^0.16", features = ["openblas-system"] }
// Alternative for Mac if you installed netlib
//:dep ndarray-linalg = { version = "^0.14" , features = ["netlib"]}
// This works for linux
// :dep ndarray-linalg = { version = "^0.14" , features = ["openblas"]}
use ndarray::{Array1, Array2, ArrayView1};
use ndarray_linalg::Solve;
use ndarray::array;
use ndarray::Axis;
use ndarray_linalg::LeastSquaresSvd;
// Define an arbitrary function (e.g., a quadratic function)
fn arbitrary_function(x: f64, params: &ArrayView1<f64>) -> f64 {
params[0] + params[1] * x + params[2] * x * x
}
// Compute the design matrix for the arbitrary function
fn design_matrix(x: &Array1<f64>) -> Array2<f64> {
let mut dm = Array2::<f64>::zeros((x.len(), 3));
for (i, &xi) in x.iter().enumerate() {
dm[(i, 0)] = 1.0;
dm[(i, 1)] = xi;
dm[(i, 2)] = xi * xi;
}
dm
}
// Perform least squares fit
fn least_squares_fit(x: &Array1<f64>, y: &Array1<f64>) -> Array1<f64> {
let dm = design_matrix(x);
let y_col = y.to_owned().insert_axis(Axis(1)); // Convert y to a column vector
let params = dm.least_squares(&y_col).unwrap().solution; // Use least squares solver
params.column(0).to_owned() // Convert params back to 1D array
}
fn main() {
// Example data
let x: Array1<f64> = array![1.0, 2.0, 3.0, 4.0];
let y: Array1<f64> = array![1.0, 4.0, 9.0, 16.0]; // y = x^2 for this example
// Perform least squares fit
let params = least_squares_fit(&x, &y);
println!("Fitted parameters: {:.3}", params);
// Predict using the fitted parameters
let predictions: Array1<f64> = x.mapv(|xi| arbitrary_function(xi, &params.view()));
println!("Predictions: {:.3?}", predictions);
}
main();
Fitted parameters: [0.000, -0.000, 1.000]
Predictions: [1.000, 4.000, 9.000, 16.000], shape=[4], strides=[1], layout=CFcf (0xf), const ndim=1
Example General Function
Imagine a case where you have a function like:
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 \ln x$$
Then set up your functions like this:
// Define an arbitrary function (here, a quadratic plus a log term)
fn arbitrary_function(x: f64, params: &ArrayView1<f64>) -> f64 {
params[0] + params[1] * x + params[2] * x * x + params[3] * x.ln()
}
// Compute the design matrix for the arbitrary function
fn design_matrix(x: &Array1<f64>) -> Array2<f64> {
let mut dm = Array2::<f64>::zeros((x.len(), 4));
for (i, &xi) in x.iter().enumerate() {
dm[(i, 0)] = 1.0;
dm[(i, 1)] = xi;
dm[(i, 2)] = xi * xi;
dm[(i, 3)] = xi.ln(); // note: requires xi > 0
}
dm
}
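As a quick, hypothetical sanity check (reusing least_squares_fit from the previous cell, which now picks up the redefined four-column design_matrix), we can generate data from known parameters and recover them:
// Data generated from y = 2 + 0.5x + 0.25x^2 + 3 ln(x); assumed example values
let x: Array1<f64> = array![1.0, 2.0, 3.0, 4.0, 5.0];
let y: Array1<f64> = x.mapv(|xi| 2.0 + 0.5 * xi + 0.25 * xi * xi + 3.0 * xi.ln());
let params = least_squares_fit(&x, &y);
println!("Fitted parameters: {:.3}", params); // expect roughly [2.000, 0.500, 0.250, 3.000]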
It gets messier if x appears in an exponent, which is outside the scope of this lecture, but it is possible to do a non-linear least squares fit for completely arbitrary functions!
Train/Test Splits
To test the generality of your models, it is recommended to split your data into
- training dataset (e.g. 80% of the data)
- testing dataset (e.g. 20% of the data)
Then train with the training dataset and evaluate using the test dataset.
It is a bit more cumbersome in Rust than in Python's scikit-learn, but in the end not that hard.
The smartcore crate implements this function for you, and you can use it as follows:
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = {version = "0.15"}
:dep smartcore = {version = "0.2", features=["ndarray-bindings"]}
use linfa::Dataset;
use linfa::traits::Fit;
use linfa_linear::LinearRegression;
use ndarray::{Array1, Array2, array};
use linfa::DatasetBase;
use smartcore::model_selection::train_test_split; // This is the function we need
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
use smartcore::linalg::BaseMatrix;
fn main() {
// Example data: 4 samples with 3 features each
let x: Array2<f64> = array![[1.0, 2.0, 3.0],
[2.0, 3.0, 4.0],
[3.0, 4.0, 5.0],
[4.0, 5.0, 6.0]];
// Target values
let y: Array1<f64> = array![6.0, 9.0, 12.0, 15.0];
// Split the data into training and testing sets
let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.5, true);
let train_dataset = Dataset::new(x_train.clone(), y_train.clone());
let test_dataset = Dataset::new(x_test.clone(), y_test.clone());
// Fit linear regression model to the training data
let lin_reg = LinearRegression::new();
let model = lin_reg.fit(&train_dataset).unwrap();
// Print coefficients
println!("Coefficients: {:.3?}", model.params());
println!("Intercept: {:.3?}", model.intercept());
// Predict from the test set
let predictions = model.predict(&x_test);
println!("X_test: {:.3?}, Predictions: {:.3?}", x_test, predictions);
}
main();
Coefficients: [1.000, 1.000, 1.000], shape=[3], strides=[1], layout=CFcf (0xf), const ndim=1
Intercept: -0.000
X_test: [[4.000, 5.000, 6.000],
[3.000, 4.000, 5.000]], shape=[2, 3], strides=[3, 1], layout=Cc (0x5), const ndim=2, Predictions: [15.000, 12.000], shape=[2], strides=[1], layout=CFcf (0xf), const ndim=1
Loss Functions, Bias & Cross-Validation
Reminders
Typical predictive data analysis pipeline:
- Very important: split your data into a training and test part
- Train your model on the training part
- Use the testing part to evaluate accuracy
Measuring errors for regression
- Usually, the predictor is not perfect.
- How do I evaluate different options and choose the best one?
Mean Squared Error (or $\ell_2$ loss): $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Mean Absolute Error (or $\ell_1$ loss): $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
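As a quick sketch with toy numbers (using ndarray as in the cells above), both losses are one-liners over the residuals:
use ndarray::array;
let targets = array![1.0, 2.0, 3.0];
let predictions = array![1.1, 1.8, 3.4];
let residuals = &targets - &predictions;
let mse = residuals.mapv(|r| r * r).mean().unwrap();
let mae = residuals.mapv(f64::abs).mean().unwrap();
println!("MSE = {:.3}, MAE = {:.3}", mse, mae); // MSE = 0.070, MAE = 0.233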
Plots of MSE and MAE
// 2021
//:dep plotters={version = "^0.3.0", default_features = false, features = ["evcxr", "all_series","full_palette"]}
// 2024
:dep plotters={version = "^0.3.0", default-features = false, features = ["evcxr", "all_series","full_palette"]}
:dep ndarray = { version = "^0.16.0" }
:dep ndarray-rand = { version = "0.15.0" }
:dep linfa = { version = "^0.7.0" }
:dep linfa-linear = { version = "^0.7.0" }
:dep linfa-datasets = { version = "^0.7.0" }
:dep ndarray-stats = { version = "^0.5.1" }
// Helper code for plotting
extern crate plotters;
use plotters::prelude::*;
use ndarray::Array1;
use plotters::evcxr::SVGWrapper;
use plotters::style::colors::full_palette;
fn plotter_scatter(sizes: (u32, u32), x_range: (f64, f64), y_range: (f64, f64), scatters: &[(&Array1<f64>, &Array1<f64>, &RGBColor, &str, &str)], lines: &[(&Array1<f64>, &Array1<f64>, &str, &RGBColor)]) -> SVGWrapper {
evcxr_figure((sizes.0, sizes.1), |root| {
let mut chart = ChartBuilder::on(&root)
// the caption for the chart
.caption("2-D Plot", ("Arial", 20).into_font())
.x_label_area_size(40)
.y_label_area_size(40)
// the X and Y coordinates spaces for the chart
.build_cartesian_2d(x_range.0..x_range.1, y_range.0..y_range.1)?;
chart.configure_mesh()
.x_desc("X Values")
.y_desc("Y Values")
.draw()?;
for scatter in scatters {
if scatter.3 == "x" {
chart.draw_series(
scatter.0.iter().zip(scatter.1.iter()).map(|(&a, &b)| {
Cross::new((a, b), 3, scatter.2.filled())
}),
)?
.label(scatter.4)
.legend(|(x, y)| Cross::new((x, y), 3, scatter.2.filled()));
} else {
chart.draw_series(
scatter.0.iter().zip(scatter.1.iter()).map(|(&a, &b)| {
Circle::new((a, b), 3, scatter.2.filled())
}),
)?
.label(scatter.4)
.legend(|(x, y)| Circle::new((x, y), 3, scatter.2.filled()));
}
}
if !lines.is_empty() {
for line in lines {
chart
.draw_series(LineSeries::new(line.0.iter().cloned().zip(line.1.iter().cloned()), line.3))?
.label(line.2)
.legend(|(x, y)| PathElement::new(vec![(x, y), (x + 20, y )], line.3));
}
}
// Configure and draw the legend
chart
.configure_series_labels()
.background_style(&WHITE.mix(0.8))
.border_style(&BLACK)
.draw()?;
Ok(())
})
}
use ndarray::Array;
let xs = Array::linspace(-1.3, 1.3, 300);
let abs_xs = xs.mapv(f64::abs);
let xs_squared = xs.mapv(|a| a.powi(2));
plotter_scatter(
(500, 400), (-1.5,1.5), (0.,1.75), //plot size and ranges
&[], //any scatters
&[(&xs,&abs_xs, "MAE", &full_palette::RED), (&xs,&xs_squared, "MSE", &full_palette::BLUE)] //any lines
)
Observe that MAE is more sensitive to small differences, while MSE penalizes big differences more heavily.
Definition of an outlier
Important difference between error measures: they pay different amounts of attention to outliers.
Generalizing linear regression to higher powers of the absolute error ($\ell_p$ loss): $\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|^p$
In the limit...
- As $p \to \infty$, this converges to minimizing the maximum difference between the $y_i$ and the $\hat{y}_i$
- This is called the $\ell_\infty$ loss
- Another way to express it: minimize $t$ such that $|y_i - \hat{y}_i| \le t$ for all $i$
- Linear programming formulation: minimize $t$ such that $y_i - \hat{y}_i \le t$ and $\hat{y}_i - y_i \le t$ for all $i$
1D insight into the outlier sensitivity
- Input: set of points $y_1, \ldots, y_n$ in $\mathbb{R}$
- What single point minimizes the MSE, MAE, ... as a representative of these points?
- MSE aka $\ell_2$: the mean
- MAE aka $\ell_1$: the median
- $\ell_\infty$: the mean of the maximum and minimum
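A minimal sketch with a toy sample (the outlier 100.0 drags the mean and the min/max midpoint far more than the median):
let mut points = vec![1.0_f64, 2.0, 3.0, 4.0, 100.0];
points.sort_by(|a, b| a.partial_cmp(b).unwrap());
let mean = points.iter().sum::<f64>() / points.len() as f64; // minimizes MSE
let median = points[points.len() / 2]; // minimizes MAE (odd-length sample)
let midrange = (points[0] + points[points.len() - 1]) / 2.0; // minimizes the max error
println!("mean = {mean}, median = {median}, midrange = {midrange}");
// mean = 22, median = 3, midrange = 50.5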
Something in between MSE and MAE?
- $\ell_p$ loss for $1 < p < 2$?
- Huber loss: $L_\delta(a) = \frac{1}{2} a^2$ if $|a| \le \delta$, else $\delta \left( |a| - \frac{1}{2} \delta \right)$
  - quadratic for small distances
  - linear for large distances
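A minimal sketch of the Huber loss under this standard definition:
// Quadratic for |residual| <= delta, linear beyond
fn huber(residual: f64, delta: f64) -> f64 {
    let a = residual.abs();
    if a <= delta {
        0.5 * a * a
    } else {
        delta * (a - 0.5 * delta)
    }
}
// Small residuals behave like (half the) squared error, large ones grow linearly
println!("{:.3} {:.3}", huber(0.5, 1.0), huber(3.0, 1.0)); // 0.125 2.500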
Underfitting
- Model not expressive enough to capture the problem
- Or a solution found does not match the problem
Possible solutions:
- Try harder to find a better solution
- Add more parameters
- Try a different model that captures the solution
Overfitting
- Predictions adjusted too well to training data
- Error on test data $\gg$ error on training data
Possible solutions:
- Don't optimize the model on the training data too much
- Remove features that are too noisy
- Add more training data
- Reduce model complexity
Bias and variance
Bias:
The simplifying assumptions made by the model to make the target function easier to approximate.
Mathematically it is the difference of the average value of predictions from the true function: $\mathrm{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$
Variance:
The amount that the output of the model changes given different training data.
Mathematically it is the expected squared deviation of the prediction function from its expected value: $\mathrm{Var}[\hat{f}(x)] = \mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]$
Bias: error due to the model being unable to match the complexity of the problem
Variance: how much the prediction will vary in response to the data points
Overfitting: high variance, low bias
Underfitting: high bias, low variance
Important in practice:
- detecting the source of problems: variance vs. bias, overfitting vs. underfitting
- navigating the trade-off and finding the sweet spot
Some examples
| Algorithm | Bias | Variance |
|:-:|:-:|:-:|
| Linear Regression | High | Low |
| Decision Tree | Low | High |
| Random Forest | Low | High (less than tree) |
Terminology
Parameters
- Variables fixed in a specific instantiation of a model
- Examples:
- coefficients in linear regression
- decision tree structure and thresholds
- weights and thresholds in a neural network
Hyperparameters
- Also parameters, but higher level
- Examples:
  - number of leaves in a decision tree
  - number of layers and their structure in a neural network
  - degree of a polynomial
Hyperparameter tuning
- Adjusting hyperparameters before training the final model
Model selection
- Deciding on the type of model to be used (linear regression? decision trees? ...)
Decision Tree discussion
- Tree structure and thresholds for splits are parameters (learned by the algorithm)
- Many of the others are hyperparameters
- split_quality: Sets the metric used to decide the feature on which to split a node
- min_impurity_decrease: How much reduction in gini or other metric should we see before we allow a split
- min_weight_split: Sets the minimum weight of samples required to split a node.
- min_weight_leaf: Sets the minimum weight of samples that a split has to place in each leaf
- max_depth: Affects the structure of the tree and how elements can be assigned to nodes
All documented at great length at https://docs.rs/linfa-trees/latest/linfa_trees/struct.DecisionTreeParams.html
Challenges of training and cross-validation
Big goal: train a model that can be used for predicting
Intermediate goal: select the right model and hyperparameters
Information leak danger!
- If we do it adaptively, information from the test set could affect the model selection
Tune your parameters by using portions of the training set and preserve the test set for only a final evaluation
Holdout method
- Partition the training data again: training set and validation set
- Use the validation part to estimate accuracy whenever needed
Pros:
- Very efficient
- Fine with lots of data when losing a fraction is not a problem
Cons:
- Yet another part of data not used for training
- Problematic when the data set is small
- Testing part could contain important information
$k$-fold cross-validation
- Partition the training set into $k$ folds at random
- Repeat $k$ times:
  - train on $k - 1$ folds
  - estimate the accuracy on the remaining fold
- Return the mean accuracy
Pros:
- Every data point used for training most of the time
- Less variance in the estimate
Cons:
- $k$ times slower
LOOCV: Leave-one-out cross-validation
- Extreme case of the previous approach: a separate fold for each data point
- For each data point $i$:
  - train on the data without point $i$
  - estimate the accuracy on point $i$
- Return the mean of the accuracies
Cons:
- Even more expensive
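A hedged sketch using linfa's fold (as in the Iris example later in this lecture): LOOCV is just $k$-fold with $k$ equal to the number of samples.
// Assumes `iris`, DecisionTree, and the linfa prelude are in scope as in the Iris example
let n = iris.nsamples();
let scores: Vec<f32> = iris.fold(n).into_iter().map(|(train, valid)| {
    let model = DecisionTree::params().fit(&train).unwrap();
    model.predict(&valid).confusion_matrix(&valid).unwrap().accuracy()
}).collect();
println!("LOOCV mean accuracy: {}", scores.iter().sum::<f32>() / n as f32);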
Many other options
- Generalization: leave-$p$-out cross-validation enumerates over all subsets of size $p$
- Sampling instead of trying all options
- A variation that ensures that all classes are evenly distributed across folds (stratified cross-validation)
- ...
Training and Cross Validation with Iris Dataset
Classic dataset
- 3 Iris species
- 50 examples in each class
- 4 features (sepal length/width, petal length/width) for each sample
:dep linfa = {version = "0.7.0"}
:dep linfa-datasets = { version = "0.7.0", features = ["iris"] }
:dep linfa-trees = { version = "0.7.0" }
:dep ndarray = { version = "0.16.1" }
:dep ndarray-rand = {version = "0.15.0" }
:dep rand = { version = "0.8.5", features = ["small_rng"] }
:dep smartcore = { version = "0.3.2", features = ["datasets", "ndarray-bindings"] }
use linfa::prelude::*;
use linfa_trees::DecisionTree;
use ndarray_rand::rand::SeedableRng;
use rand::rngs::SmallRng;
use linfa_trees::DecisionTreeParams;
fn crossvalidate() -> Result<(), Box<dyn std::error::Error>> {
// Load the Iris dataset
let mut iris = linfa_datasets::iris();
let mut rng = SmallRng::seed_from_u64(42);
// Split the data into training and testing sets
let (train, test) = iris.clone()
.shuffle(&mut rng)
.split_with_ratio(0.8);
// Extract the features (X) and target (y) for training and testing sets
let X_train = train.records();
let y_train = train.targets();
let X_test = test.records();
let y_test = test.targets();
// Print the shape of the training and testing sets
println!("X_train shape: ({}, {})", X_train.nrows(), X_train.ncols());
println!("y_train shape: ({})", y_train.len());
println!("X_test shape: ({}, {})", X_test.nrows(), X_test.ncols());
println!("y_test shape: ({})", y_test.len());
// Train the model on the training data
let model = DecisionTree::params()
.max_depth(Some(3))
.fit(&train)?;
// Evaluate the model's accuracy on the training set
let train_accuracy = model.predict(&train)
.confusion_matrix(&train)?
.accuracy();
println!("Training accuracy: {:.2}%", train_accuracy * 100.0);
// Evaluate the model's accuracy on the test set
let test_accuracy = model.predict(&test)
.confusion_matrix(&test)?
.accuracy();
println!("Test accuracy: {:.2}%", test_accuracy * 100.0);
// Define two models with depths 3 and 2
let dt_params1 = DecisionTree::params().max_depth(Some(3));
let dt_params2 = DecisionTree::params().max_depth(Some(2));
// Create a vector of models
let models = vec![dt_params1, dt_params2];
// Train and cross-validation using the models
let scores = iris.cross_validate_single(
5,
&models,
|prediction, truth|
Ok(prediction.confusion_matrix(truth.to_owned())?.accuracy()))?;
println!("Cross-validation scores: {:?}", scores);
// Perform cross-validation using fold
let scores: Vec<_> = iris.fold(5).into_iter().map(|(train, valid)| {
let model = DecisionTree::params()
.max_depth(Some(3))
.fit(&train).unwrap();
let accuracy = model.predict(&valid).confusion_matrix(&valid).unwrap().accuracy();
accuracy
}).collect();
println!("Cross-validation scores general: {:?} {}", scores, scores.iter().sum::<f32>()/scores.len() as f32);
Ok(())
}
crossvalidate();
Let's take a closer look at:
// Evaluate the model's accuracy on the training set
let train_accuracy = model.predict(&train)
    .confusion_matrix(&train)?
    .accuracy();
println!("Training accuracy: {:.2}%", train_accuracy * 100.0);
We're creating the confusion matrix, for example:
                 Predicted
           Class1  Class2  Class3
Actual
  Class1    TP1    FP12    FP13
  Class2    FP21   TP2     FP23
  Class3    FP31   FP32    TP3
Where:
- TP1, TP2, TP3 are True Positives for each class (correct predictions)
- FPxy are False Positives (predicting class y when it's actually class x)
And then calculating accuracy as:
Accuracy = (TP1 + TP2 + TP3) / Total Predictions
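As a quick illustration with hypothetical counts (not the Iris results), accuracy is the trace of the confusion matrix divided by the total:
use ndarray::array;
// Rows = actual class, columns = predicted class (made-up counts)
let cm = array![[16.0, 0.0, 0.0],
                [0.0, 14.0, 1.0],
                [0.0, 2.0, 17.0]];
let accuracy = cm.diag().sum() / cm.sum(); // trace / total
println!("Accuracy = {:.2}", accuracy); // (16 + 14 + 17) / 50 = 0.94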
For the second training experiment:
// Define two models with depths 3 and 2
let dt_params1 = DecisionTree::params().max_depth(Some(3));
let dt_params2 = DecisionTree::params().max_depth(Some(2));
// Create a vector of models
let models = vec![dt_params1, dt_params2];
// Train and cross-validate using the models
let scores = iris.cross_validate_single(
    5,
    &models,
    |prediction, truth|
    Ok(prediction.confusion_matrix(truth.to_owned())?.accuracy()))?;
println!("Cross-validation scores: {:?}", scores);
dt_params1 and dt_params2 are just parameter configurations, not yet trained models.
We'll split the training set into 5 folds:
Initial Split: [Fold1, Fold2, Fold3, Fold4, Fold5]
Then train on 4 of the folds and validate on the 5th:
For Model 1 (max_depth=3):
Iteration 1: Train on [Fold2, Fold3, Fold4, Fold5], Validate on Fold1
Iteration 2: Train on [Fold1, Fold3, Fold4, Fold5], Validate on Fold2
Iteration 3: Train on [Fold1, Fold2, Fold4, Fold5], Validate on Fold3
Iteration 4: Train on [Fold1, Fold2, Fold3, Fold5], Validate on Fold4
Iteration 5: Train on [Fold1, Fold2, Fold3, Fold4], Validate on Fold5
For Model 2 (max_depth=2): Same process repeated with the same folds
And the last training experiment:
// Perform cross-validation using fold
let scores: Vec<_> = iris.fold(5).into_iter().map(|(train, valid)| {
    let model = DecisionTree::params()
        .max_depth(Some(3))
        .fit(&train).unwrap();
    let accuracy = model.predict(&valid).confusion_matrix(&valid).unwrap().accuracy();
    accuracy
}).collect();
This manually implements 5-fold cross-validation.
Step-by-step description courtesy of Cursor Agent.
- Creating the Folds:
iris.fold(5)
- This splits the Iris dataset into 5 folds
- Returns tuples of `(train, valid)` where:
  - `train` contains 4/5 of the data
  - `valid` contains 1/5 of the data
- Each iteration will use a different fold as the validation set
- The Main Loop:
.into_iter().map(|(train, valid)| { ... })
- Converts the folds into an iterator
- For each iteration, we get a training set and validation set
- Model Training:
let model = DecisionTree::params()
    .max_depth(Some(3))
    .fit(&train).unwrap();
- Creates a decision tree model with max depth of 3
- Trains the model on the current training fold
- Uses `unwrap()` to handle potential errors (in production code, you'd want better error handling)
- Model Evaluation:
let accuracy = model.predict(&valid)
    .confusion_matrix(&valid)
    .unwrap()
    .accuracy();
- Makes predictions on the validation set
- Creates a confusion matrix comparing predictions to true labels
- Calculates the accuracy from the confusion matrix
- Collecting Results:
}).collect();
- Collects all 5 accuracy scores into a vector
- Each score represents the model's performance on a different validation fold
The key differences between this and the previous cross_validate_single method are:
- This is a more manual implementation
- It only evaluates one model configuration (max_depth=3)
- It gives you direct control over the training and evaluation process
- The results are collected into a simple vector of accuracy scores
This approach is useful when you want more control over the cross-validation process or when you need to implement custom evaluation metrics.