Linear Regression, Loss, and Bias
Simplest setting
- Input: set of points $(x_1, y_1), \ldots, (x_n, y_n)$ in $\mathbb{R}^2$
- What function $f$ explains the relationship of the $y$'s with the $x$'s?
- What linear function $y = ax + b$ describes it best?
Multivariate version
- Input: set of points $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ in $\mathbb{R}^d \times \mathbb{R}$
- Find a linear function $f(\mathbf{x}) = w_1 x_1 + \cdots + w_d x_d + b$ that describes the $y$'s in terms of the $\mathbf{x}$'s.
Why linear regression?
- Have to assume something!
- Models hidden linear relationship + noise
Typical objective: minimize square error
- Points can rarely be described exactly by a linear relationship
- How do we decide between several non-ideal options?
- Typically we want to find the $f$ that minimizes the total square error: $\sum_{i=1}^{n} (f(x_i) - y_i)^2$
What about Categorical Data?
- Convert to numerical first
- One way is to simply convert to unique integers
- A better way is to create N new columns (one for each category) and make them boolean (is_x, is_y, is_z, etc.) -- one-hot encoding
For example with 3 categories:
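With a hypothetical `color` feature taking the values red, green, and blue, one-hot encoding produces:

| color | is_red | is_green | is_blue |
|:-:|:-:|:-:|:-:|
| red | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| blue | 0 | 0 | 1 |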
1-D Example
We'll start with an example from the Rust Machine Learning Book.
// Use a crate built from source on github
:dep linfa-book = { git = "https://github.com/rust-ml/book" }
:dep ndarray = { version = "^0.15.6" }
// create_curve implemented at https://github.com/rust-ml/book/blob/main/src/lib.rs#L52
use linfa_book::create_curve;
use ndarray::Array2;
use ndarray::s;
fn generate_data(output: bool) -> Array2<f64> {
/*
* Generate a dataset of x and y values
*
* - Randomly generate 50 points between 0 and 7
* - Calculate y = m * x^power + b + noise
* where noise is uniformly random between -0.5 and 0.5
*
* m = 1.0 (slope)
* power = 1.0 (straight line)
* b = 0.0 (y-intercept)
* num_points = 50
* x_range = [0.0, 7.0]
*
* This produces a 50x2 Array2<f64> with the first column being x and the
* second being y
*/
let array: Array2<f64> = create_curve(1.0, 1.0, 0.0, 50, [0.0, 7.0]);
// Converting from an array to a Linfa Dataset can be the trickiest part of this process
// The first part has to be an array of arrays even if they have a single entry for the 1-D case
// The underlying type is kind of ugly but thankfully the compiler figures it out for us
// let data: ArrayBase<OwnedRepr<f64>, Ix2> = .... is the actual type
// The second part has to be an array of values.
// let targets: Array1<f64> is the actual type
let data = array.slice(s![.., 0..1]).to_owned();
let targets = array.column(1).to_owned();
if output {
println!("The original array is:");
println!("{:.2?}", array);
println!("The data is:");
println!("{:.2?}", data);
println!("The targets are:");
println!("{:.2?}", targets);
}
array
}
generate_data(true);
The original array is:
[[6.59, 6.85],
[2.49, 2.39],
[0.77, 0.32],
[1.85, 2.25],
[2.13, 2.46],
[0.93, 0.60],
[6.17, 6.51],
[1.91, 2.02],
[2.82, 3.20],
[0.55, 0.23],
[4.31, 4.06],
[4.04, 3.54],
[6.15, 6.51],
[0.10, -0.33],
[0.68, 0.77],
[1.03, 1.18],
[0.51, 0.68],
[3.75, 3.38],
[1.59, 1.57],
[5.77, 6.26],
[1.18, 1.15],
[2.99, 2.99],
[4.33, 4.69],
[1.36, 1.52],
[5.20, 5.67],
[6.18, 6.41],
[6.10, 6.46],
[6.82, 6.68],
[4.27, 3.88],
[1.67, 1.71],
[5.42, 5.62],
[2.47, 2.02],
[4.48, 4.95],
[1.82, 2.27],
[5.53, 5.27],
[0.52, 0.53],
[5.63, 5.35],
[0.13, 0.43],
[4.47, 4.83],
[0.58, 1.00],
[3.13, 3.28],
[6.84, 6.69],
[6.46, 6.15],
[6.04, 6.27],
[5.63, 6.12],
[1.60, 1.86],
[0.18, 0.08],
[6.24, 6.70],
[0.20, 0.06],
[4.44, 4.10]], shape=[50, 2], strides=[2, 1], layout=Cc (0x5), const ndim=2
The data is:
[[6.59],
[2.49],
[0.77],
[1.85],
[2.13],
[0.93],
[6.17],
[1.91],
[2.82],
[0.55],
[4.31],
[4.04],
[6.15],
[0.10],
[0.68],
[1.03],
[0.51],
[3.75],
[1.59],
[5.77],
[1.18],
[2.99],
[4.33],
[1.36],
[5.20],
[6.18],
[6.10],
[6.82],
[4.27],
[1.67],
[5.42],
[2.47],
[4.48],
[1.82],
[5.53],
[0.52],
[5.63],
[0.13],
[4.47],
[0.58],
[3.13],
[6.84],
[6.46],
[6.04],
[5.63],
[1.60],
[0.18],
[6.24],
[0.20],
[4.44]], shape=[50, 1], strides=[1, 1], layout=CFcf (0xf), const ndim=2
The targets are:
[6.85, 2.39, 0.32, 2.25, 2.46, 0.60, 6.51, 2.02, 3.20, 0.23, 4.06, 3.54, 6.51, -0.33, 0.77, 1.18, 0.68, 3.38, 1.57, 6.26, 1.15, 2.99, 4.69, 1.52, 5.67, 6.41, 6.46, 6.68, 3.88, 1.71, 5.62, 2.02, 4.95, 2.27, 5.27, 0.53, 5.35, 0.43, 4.83, 1.00, 3.28, 6.69, 6.15, 6.27, 6.12, 1.86, 0.08, 6.70, 0.06, 4.10], shape=[50], strides=[1], layout=CFcf (0xf), const ndim=1
Let's plot these points to see what they look like
// 2021
//:dep plotters={version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"]}
// 2024
:dep plotters={version = "^0.3.0", default-features = false, features = ["evcxr", "all_series"]}
:dep ndarray = { version = "^0.15.6" }
extern crate plotters;
use plotters::prelude::*;
{
let array: Array2<f64> = generate_data(false);
let x_values = array.column(0);
let y_values = array.column(1);
evcxr_figure((640, 480), |root| {
let mut chart = ChartBuilder::on(&root)
// the caption for the chart
.caption("2-D Plot", ("Arial", 20).into_font())
.x_label_area_size(40)
.y_label_area_size(40)
// the X and Y coordinates spaces for the chart
.build_cartesian_2d(0f64..8f64, 0f64..8f64)?;
chart.configure_mesh()
.x_desc("X Values")
.y_desc("Y Values")
.draw()?;
chart.draw_series(
x_values.iter().zip(y_values.iter()).map(|(&x, &y)| {
Circle::new((x, y), 3, RED.filled())
}),
)?;
Ok(())
})
}
Let's fit a linear regression to it
// There are multiple packages in this git repository so we have to declare
// all the ones we care about
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = { version = "^0.15.6" }
// Now we need to declare which function we are going to use
use linfa::Dataset;
use linfa_linear::LinearRegression;
use linfa_linear::FittedLinearRegression;
use linfa::prelude::*;
fn fit_model(array: &Array2<f64>) -> FittedLinearRegression<f64> {
// Split the array into features (data) and targets
let data = array.slice(s![.., 0..1]).to_owned();
let targets = array.column(1).to_owned();
// And finally let's fit a linear regression to it
println!("Data: {:?} \n Targets: {:?}", data, targets);
let dataset = Dataset::new(data, targets);
let lin_reg: LinearRegression = LinearRegression::new();
let model: FittedLinearRegression<f64> = lin_reg.fit(&dataset).unwrap();
let ypred = model.predict(&dataset);
let loss = (dataset.targets() - ypred)
.mapv(|x| x.abs())
.mean();
println!("{:?}", loss);
model
}
let array: Array2<f64> = generate_data(false);
let model = fit_model(&array);
println!("{:?}", model);
Data: [[6.381616680690672],
[4.396361662331629],
[4.3324365575773305],
[0.8289740051702756],
[6.319103537997485],
[0.3335321841860137],
[5.095836624614474],
[3.6492182454480435],
[1.3364352957815282],
[0.09639507860613405],
[5.303183512825285],
[4.2951694397460285],
[5.4057438081138915],
[0.10960286365816452],
[4.603044369250634],
[4.024864366257263],
[4.281935499419392],
[4.06157301198745],
[5.755333890324657],
[1.3359866701598488],
[2.3904258537567324],
[4.029842664366427],
[2.2848232403062685],
[0.12639520987084052],
[2.9318196797275835],
[3.4846905959349144],
[2.1132653187529598],
[2.409170009823532],
[1.294104964714346],
[3.068189601192284],
[1.5698399649061598],
[4.585070111032277],
[0.9080582927284744],
[2.3500918290692145],
[5.412834219057823],
[0.5732492542928671],
[3.1955234173749907],
[0.6011335403952247],
[0.06756207483950094],
[0.4973826225929374],
[3.5708462015304976],
[5.016449358044835],
[6.275456045849482],
[0.24134187530345663],
[0.19382724586379352],
[3.900026524938423],
[1.977100227563272],
[4.934898600624889],
[3.970926362145382],
[1.5789373737594155]], shape=[50, 1], strides=[1, 1], layout=CFcf (0xf), const ndim=2
Targets: [6.74548949826222, 4.178753589022698, 3.9296433463393803, 0.9101170752817256, 6.422143609882557, -0.115321608937899, 5.126622670184417, 4.001921017212379, 1.2316076794171344, 0.3251034025687174, 5.291327977564027, 4.740353143641885, 5.005473769290836, -0.3802551719385241, 4.723355959069144, 3.740167334494597, 3.869317273792087, 3.583182714685261, 5.960145733182379, 1.4154622808579058, 2.395483563478454, 3.780181818550531, 2.195179330427658, -0.0780379205920585, 3.3559033104968132, 3.595446926448168, 1.9151631312465982, 2.5817866309609263, 1.1115468223706244, 3.2978733650316716, 1.7303235376317656, 4.577920815363182, 0.7328128405941892, 1.8555370714021548, 5.259719826144993, 0.46486256898256517, 3.661623687684834, 0.24442021204114717, 0.26510910851412994, 0.8471086226984335, 3.373961998065203, 4.696005829425596, 5.854185825360939, 0.3863210554131675, 0.29817841644956, 3.7047013726196827, 2.027075695584828, 5.171080561424032, 3.949876872423843, 2.0560366719493564], shape=[50], strides=[1], layout=CFcf (0xf), const ndim=1
Some(0.23628401752752)
FittedLinearRegression { intercept: -0.016055504752757018, params: [0.9953569138666288], shape=[1], strides=[1], layout=CFcf (0xf), const ndim=1 }
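Note that the fitted slope (≈0.995) and intercept (≈−0.016) are close to the generating values m = 1.0 and b = 0.0, and the Some(0.236...) line is the mean absolute error printed inside fit_model.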
Finally let's put everything together and plot the data points and model line
extern crate plotters;
use plotters::prelude::*;
{
let array: Array2<f64> = generate_data(false);
let model = fit_model(&array);
println!("{:?}", model);
let x_values = array.column(0);
let y_values = array.column(1);
evcxr_figure((640, 480), |root| {
let mut chart = ChartBuilder::on(&root)
// the caption for the chart
.caption("Linear Regression", ("Arial", 20).into_font())
.x_label_area_size(40)
.y_label_area_size(40)
// the X and Y coordinates spaces for the chart
.build_cartesian_2d(0f64..8f64, 0f64..8f64)?;
chart.configure_mesh()
.x_desc("X Values")
.y_desc("Y Values")
.draw()?;
chart.draw_series(
x_values.iter().zip(y_values.iter()).map(|(&x, &y)| {
Circle::new((x, y), 3, RED.filled())
}),
)?;
let mut line_points = Vec::with_capacity(8);
for i in 0..8 {
line_points.push((i as f64, (i as f64 * model.params()[0]) + model.intercept()));
}
// We can configure the rounded precision of our result here
let precision = 2;
let label = format!(
"y = {:.p$}x + {:.p$}",
model.params()[0],
model.intercept(),
p = precision
);
chart.draw_series(LineSeries::new(line_points, &BLACK))?
.label(&label)
.legend(|(x, y)| PathElement::new(vec![(x, y), (x + 20, y)], &BLACK));
// Without drawing the series-label area, the label above would never be shown
chart.configure_series_labels()
.background_style(&WHITE.mix(0.8))
.border_style(&BLACK)
.draw()?;
Ok(())
})
}
Data: [[2.04750415763924],
[6.367873757109327],
[6.398058085002853],
[2.768792455803591],
[0.1718068124456904],
[5.410322501484772],
[1.16759081573306],
[2.7536979999018953],
[4.950802437871839],
[0.6025334876065953],
[3.2764831065673183],
[1.721556072918826],
[2.944603890742717],
[1.1679420030984646],
[1.4583480833464897],
[5.603000967206449],
[0.7792958771030705],
[0.9903895595193106],
[5.849216682555504],
[5.039598826115972],
[6.79519856325383],
[5.9811481574554595],
[5.848135244005015],
[3.2303889196020474],
[5.180333900515606],
[6.441584502170909],
[2.1595411034915752],
[0.7565248539761933],
[6.575497606063534],
[4.5995576454693055],
[4.3608367164107875],
[5.548520783489524],
[6.525459030458023],
[0.6937753120972134],
[1.9522048348466015],
[2.2341967770269227],
[3.637038438411855],
[2.2190157648753184],
[2.681992233899634],
[5.566572415187922],
[0.9357604528754049],
[0.8490168837359822],
[6.448593818753438],
[2.6758983352983177],
[1.049091617939221],
[1.5851990026054852],
[6.098462409705635],
[2.585742190451155],
[0.17671676593123875],
[0.5246679878163918]], shape=[50, 1], strides=[1, 1], layout=CFcf (0xf), const ndim=2
Targets: [2.408804027014255, 6.469323578131448, 6.193104604093066, 2.418915373474102, 0.029226368788591195, 4.9972110392473255, 1.1150716816343442, 2.926060993136029, 4.490694341852246, 0.9502150297756105, 3.4740768014974783, 1.251263397177235, 3.4148633041576995, 1.4681943013079881, 0.9890758429931192, 5.391821032945574, 1.211034580394566, 1.3854290271879017, 5.431409730804429, 4.699669709042999, 6.998217208006695, 6.109111375789134, 5.75094664694154, 3.658429812887407, 5.289801849698538, 6.105335877939171, 1.8358314904060669, 0.36311224569624856, 6.342541141255715, 4.691208846854487, 4.250255794021059, 5.675447984854635, 6.793126113768862, 0.5931705314387281, 2.111835341368409, 2.144381384764876, 3.8510929693234646, 1.7549171320182217, 2.563236265429576, 5.214375657851614, 0.8020824408564826, 0.7138795774936122, 6.183001994443131, 2.633962608410515, 1.4663428201689852, 1.6261732010691712, 6.5133103394465035, 2.1115834556736774, 0.5083760324119229, 0.7979776364877389], shape=[50], strides=[1], layout=CFcf (0xf), const ndim=1
Some(0.25864023660929314)
FittedLinearRegression { intercept: 0.05790152043058407, params: [0.9754302443444729], shape=[1], strides=[1], layout=CFcf (0xf), const ndim=1 }
Note the `params` and `intercept` fields in the fitted model above; the plotting cell reads them back via `model.params()` and `model.intercept()` to draw the regression line.
Coefficient of determination (or $R^2$)
- How good is my function $f$?
- Input: points $(x_1, y_1), \ldots, (x_n, y_n)$ and predictions $\hat{y}_i = f(x_i)$
- Idea: compare the variance of the $y_i$'s to the deviation of the $y_i$'s from the $\hat{y}_i$'s
- Formally: $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
- Range: $(-\infty, 1]$ (should be in $[0, 1]$ for linear regression)
Let's compute $R^2$ for our data
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = { version = "^0.15.6" }
:dep ndarray-stats = { version = "^0.5.1" }
let array: Array2<f64> = generate_data(false);
let data = array.slice(s![.., 0..1]).to_owned();
let targets = array.column(1).to_owned();
// And finally let's fit a linear regression to it
let dataset = Dataset::new(data, targets).with_feature_names(vec!["x"]);
let lin_reg: LinearRegression = LinearRegression::new();
let model: FittedLinearRegression<f64> = lin_reg.fit(&dataset).unwrap();
let ypred = model.predict(&dataset);
// R^2 = 1 - MSE / Var(targets)
let variance:f64 = dataset.targets().var(0.0);
let mse:f64 = (dataset.targets() - ypred)
.mapv(|x| x.powi(2))
.mean().unwrap();
let r2 = 1.0 - mse/variance;
println!("variance = {:.3}, mse = {:.3}, R2 = {:.3}", variance, mse, r2);
variance = 2.971, mse = 0.080, R2 = 0.973
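An $R^2$ of 0.973 means the linear model explains about 97% of the variance in the targets, which is expected since the data was generated from a straight line plus bounded noise.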
Multivariable linear regression
What if you have multiple input variables and one output variable?
It's actually quite simple: use the same code as above, but make sure each row on the X side contains one value for each of the variables in your function. The code will compute as many coefficients as there are variables.
For 3-D input:
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = { version = "^0.15.6" }
use linfa::Dataset;
use linfa::traits::Fit;
use ndarray::{Array1, Array2, array};
use linfa_linear::LinearRegression;
use linfa_linear::FittedLinearRegression;
use linfa::prelude::*;
fn main() {
// Example data: 4 samples with 3 features each
let x: Array2<f64> = array![[1.0, 2.0, 3.0],
//[2.0, 3.0, 1.0],
[2.0, 3.0, 4.0],
[3.0, 4.0, 5.0],
[4.0, 5.0, 6.0],
//[6.0, 11.0, 12.0]
];
// Target values
let y: Array1<f64> = array![6.0,
//6.0,
9.0,
12.0,
15.0,
//29.0,
];
// Create dataset
let dataset = Dataset::new(x.clone(), y.clone());
// Fit linear regression model
let lin_reg = LinearRegression::new();
let model = lin_reg.fit(&dataset).unwrap();
// Print coefficients
println!("Coefficients: {:.3?}", model.params());
println!("Intercept: {:.3?}", model.intercept());
// Predict using the fitted model
let predictions = model.predict(&x);
println!("Predictions: {:.3?}", predictions);
}
main();
Coefficients: [22.667, -8.333, -11.333], shape=[3], strides=[1], layout=CFcf (0xf), const ndim=1
Intercept: 34.000
Predictions: [6.000, 9.000, 12.000, 15.000], shape=[4], strides=[1], layout=CFcf (0xf), const ndim=1
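Note that the coefficients are not the "obvious" [1.000, 1.000, 1.000]: in this toy dataset the three features are perfectly collinear (each row has the form x, x+1, x+2), so infinitely many coefficient/intercept combinations reproduce the targets exactly, and the solver returns just one of them. You can check that 22.667x - 8.333(x+1) - 11.333(x+2) + 34 ≈ 3x + 3, which equals x + (x+1) + (x+2).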
General Least Squares Fit
We can generalize the ordinary least squares regression problem by trying, for example, to find the parameters $\beta_0, \beta_1, \beta_2$ in an equation like
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2$$
to minimize the differences between some targets $y_i$ and the $f(x_i)$.
Note that these are still linear in the parameters.
We can rewrite this in matrix form: find $\beta$ minimizing $\lVert X\beta - y \rVert^2$,
where the matrix $X$ of basis-function values (here a row $[1, x_i, x_i^2]$ per point) is sometimes called the Design Matrix.
- We are going to use a different library familiar to us from the ndarray lecture.
- ndarray-linalg can compute the parameters to fit an arbitrary function as long as you have some idea of what the function might be.
  - e.g. express it as a design matrix
- Then solve the (generally overdetermined) system of linear equations with that matrix as the left-hand side and our observed values as the right-hand side.
- The result is the missing parameters of our assumed function.
:dep ndarray = { version = "^0.15.6" }
// See ./README.md for ndarray-linalg prereqs
// This is the version for MAC
:dep ndarray-linalg = { version = "^0.16", features = ["openblas-system"] }
// Alternative for Mac if you installed netlib
//:dep ndarray-linalg = { version = "^0.14" , features = ["netlib"]}
// This works for linux
// :dep ndarray-linalg = { version = "^0.14" , features = ["openblas"]}
use ndarray::{Array1, Array2, ArrayView1};
use ndarray_linalg::Solve;
use ndarray::array;
use ndarray::Axis;
use ndarray_linalg::LeastSquaresSvd;
// Define an arbitrary function (e.g., a quadratic function)
fn arbitrary_function(x: f64, params: &ArrayView1<f64>) -> f64 {
params[0] + params[1] * x + params[2] * x * x
}
// Compute the design matrix for the arbitrary function
fn design_matrix(x: &Array1<f64>) -> Array2<f64> {
let mut dm = Array2::<f64>::zeros((x.len(), 3));
for (i, &xi) in x.iter().enumerate() {
dm[(i, 0)] = 1.0;
dm[(i, 1)] = xi;
dm[(i, 2)] = xi * xi;
}
dm
}
// Perform least squares fit
fn least_squares_fit(x: &Array1<f64>, y: &Array1<f64>) -> Array1<f64> {
let dm = design_matrix(x);
let y_col = y.to_owned().insert_axis(Axis(1)); // Convert y to a column vector
let params = dm.least_squares(&y_col).unwrap().solution; // Use least squares solver
params.column(0).to_owned() // Convert params back to 1D array
}
fn main() {
// Example data
let x: Array1<f64> = array![1.0, 2.0, 3.0, 4.0];
let y: Array1<f64> = array![1.0, 4.0, 9.0, 16.0]; // y = x^2 for this example
// Perform least squares fit
let params = least_squares_fit(&x, &y);
println!("Fitted parameters: {:.3}", params);
// Predict using the fitted parameters
let predictions: Array1<f64> = x.mapv(|xi| arbitrary_function(xi, &params.view()));
println!("Predictions: {:.3?}", predictions);
}
main();
Fitted parameters: [0.000, -0.000, 1.000]
Predictions: [1.000, 4.000, 9.000, 16.000], shape=[4], strides=[1], layout=CFcf (0xf), const ndim=1
Example General Function
Imagine a case where you have a function like:
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 \ln x$$
Then set up your functions like this:
// Define an arbitrary function (here, a quadratic plus a log term)
fn arbitrary_function(x: f64, params: &ArrayView1<f64>) -> f64 {
params[0] + params[1] * x + params[2] * x * x + params[3] * x.ln()
}
// Compute the design matrix for the arbitrary function
fn design_matrix(x: &Array1<f64>) -> Array2<f64> {
let mut dm = Array2::<f64>::zeros((x.len(), 4));
for (i, &xi) in x.iter().enumerate() {
dm[(i, 0)] = 1.0;
dm[(i, 1)] = xi;
dm[(i, 2)] = xi * xi;
dm[(i, 3)] = xi.ln(); // note: requires xi > 0
}
dm
}
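As a quick, hypothetical sanity check (reusing least_squares_fit from the previous cell, which now picks up the redefined four-column design_matrix), we can generate data from known parameters and recover them:
// Data generated from y = 2 + 0.5x + 0.25x^2 + 3 ln(x); assumed example values
let x: Array1<f64> = array![1.0, 2.0, 3.0, 4.0, 5.0];
let y: Array1<f64> = x.mapv(|xi| 2.0 + 0.5 * xi + 0.25 * xi * xi + 3.0 * xi.ln());
let params = least_squares_fit(&x, &y);
println!("Fitted parameters: {:.3}", params); // expect roughly [2.000, 0.500, 0.250, 3.000]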
It gets messier if x appears in an exponent, which is outside the scope of this lecture, but it is possible to do a non-linear least squares fit for completely arbitrary functions!
Train/Test Splits
To test the generality of your models, it is recommended to split your data into
- training dataset (e.g. 80% of the data)
- testing dataset (e.g. 20% of the data)
Then train with the training dataset and evaluate using the test dataset.
It is a bit more cumbersome in Rust than in Python's scikit-learn, but in the end not that hard.
The smartcore crate implements this function for you, and you can use it as follows:
:dep linfa = { git = "https://github.com/rust-ml/linfa" }
:dep linfa-linear = { git = "https://github.com/rust-ml/linfa" }
:dep ndarray = {version = "0.15"}
:dep smartcore = {version = "0.2", features=["ndarray-bindings"]}
use linfa::Dataset;
use linfa::traits::Fit;
use linfa_linear::LinearRegression;
use ndarray::{Array1, Array2, array};
use linfa::DatasetBase;
use smartcore::model_selection::train_test_split; // This is the function we need
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
use smartcore::linalg::BaseMatrix;
fn main() {
// Example data: 4 samples with 3 features each
let x: Array2<f64> = array![[1.0, 2.0, 3.0],
[2.0, 3.0, 4.0],
[3.0, 4.0, 5.0],
[4.0, 5.0, 6.0]];
// Target values
let y: Array1<f64> = array![6.0, 9.0, 12.0, 15.0];
// Split the data into training and testing sets
let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.5, true);
let train_dataset = Dataset::new(x_train.clone(), y_train.clone());
let test_dataset = Dataset::new(x_test.clone(), y_test.clone());
// Fit linear regression model to the training data
let lin_reg = LinearRegression::new();
let model = lin_reg.fit(&train_dataset).unwrap();
// Print coefficients
println!("Coefficients: {:.3?}", model.params());
println!("Intercept: {:.3?}", model.intercept());
// Predict from the test set
let predictions = model.predict(&x_test);
println!("X_test: {:.3?}, Predictions: {:.3?}", x_test, predictions);
}
main();
Coefficients: [1.000, 1.000, 1.000], shape=[3], strides=[1], layout=CFcf (0xf), const ndim=1
Intercept: -0.000
X_test: [[4.000, 5.000, 6.000],
[3.000, 4.000, 5.000]], shape=[2, 3], strides=[3, 1], layout=Cc (0x5), const ndim=2, Predictions: [15.000, 12.000], shape=[2], strides=[1], layout=CFcf (0xf), const ndim=1
Loss Functions, Bias & Cross-Validation
Reminders
Typical predictive data analysis pipeline:
- Very important: split your data into a training and test part
- Train your model on the training part
- Use the testing part to evaluate accuracy
Measuring errors for regression
- Usually, the predictor is not perfect.
- How do I evaluate different options and choose the best one?
Mean Squared Error (or $\ell_2$ loss): $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Mean Absolute Error (or $\ell_1$ loss): $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
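As a quick sketch with toy numbers (using ndarray as in the cells above), both losses are one-liners over the residuals:
use ndarray::array;
let targets = array![1.0, 2.0, 3.0];
let predictions = array![1.1, 1.8, 3.4];
let residuals = &targets - &predictions;
let mse = residuals.mapv(|r| r * r).mean().unwrap();
let mae = residuals.mapv(f64::abs).mean().unwrap();
println!("MSE = {:.3}, MAE = {:.3}", mse, mae); // MSE = 0.070, MAE = 0.233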
Plots of MSE and MAE
// 2021
//:dep plotters={version = "^0.3.0", default_features = false, features = ["evcxr", "all_series","full_palette"]}
// 2024
:dep plotters={version = "^0.3.0", default-features = false, features = ["evcxr", "all_series","full_palette"]}
:dep ndarray = { version = "^0.16.0" }
:dep ndarray-rand = { version = "0.15.0" }
:dep linfa = { version = "^0.7.0" }
:dep linfa-linear = { version = "^0.7.0" }
:dep linfa-datasets = { version = "^0.7.0" }
:dep ndarray-stats = { version = "^0.5.1" }
// Helper code for plotting
extern crate plotters;
use plotters::prelude::*;
use ndarray::Array1;
use plotters::evcxr::SVGWrapper;
use plotters::style::colors::full_palette;
fn plotter_scatter(sizes: (u32, u32), x_range: (f64, f64), y_range: (f64, f64), scatters: &[(&Array1<f64>, &Array1<f64>, &RGBColor, &str, &str)], lines: &[(&Array1<f64>, &Array1<f64>, &str, &RGBColor)]) -> SVGWrapper {
evcxr_figure((sizes.0, sizes.1), |root| {
let mut chart = ChartBuilder::on(&root)
// the caption for the chart
.caption("2-D Plot", ("Arial", 20).into_font())
.x_label_area_size(40)
.y_label_area_size(40)
// the X and Y coordinates spaces for the chart
.build_cartesian_2d(x_range.0..x_range.1, y_range.0..y_range.1)?;
chart.configure_mesh()
.x_desc("X Values")
.y_desc("Y Values")
.draw()?;
for scatter in scatters {
if scatter.3 == "x" {
chart.draw_series(
scatter.0.iter().zip(scatter.1.iter()).map(|(&a, &b)| {
Cross::new((a, b), 3, scatter.2.filled())
}),
)?
.label(scatter.4)
.legend(|(x, y)| Cross::new((x, y), 3, scatter.2.filled()));
} else {
chart.draw_series(
scatter.0.iter().zip(scatter.1.iter()).map(|(&a, &b)| {
Circle::new((a, b), 3, scatter.2.filled())
}),
)?
.label(scatter.4)
.legend(|(x, y)| Circle::new((x, y), 3, scatter.2.filled()));
}
}
if !lines.is_empty() {
for line in lines {
chart
.draw_series(LineSeries::new(line.0.iter().cloned().zip(line.1.iter().cloned()), line.3))?
.label(line.2)
.legend(|(x, y)| PathElement::new(vec![(x, y), (x + 20, y )], line.3));
}
}
// Configure and draw the legend
chart
.configure_series_labels()
.background_style(&WHITE.mix(0.8))
.border_style(&BLACK)
.draw()?;
Ok(())
})
}
use ndarray::Array;
let xs = Array::linspace(-1.3, 1.3, 300);
let abs_xs = xs.mapv(f64::abs);
let xs_squared = xs.mapv(|a| a.powi(2));
plotter_scatter(
(500, 400), (-1.5,1.5), (0.,1.75), //plot size and ranges
&[], //any scatters
&[(&xs,&abs_xs, "MAE", &full_palette::RED), (&xs,&xs_squared, "MSE", &full_palette::BLUE)] //any lines
)
Observe that MAE is more sensitive to small differences, while MSE penalizes big differences more heavily.
Definition of an outlier
Important difference between error measures: they pay different amounts of attention to outliers.
Generalizing linear regression to higher powers of the absolute error ($\ell_p$ loss): $\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|^p$
In the limit...
- As $p \to \infty$, this converges to minimizing the maximum difference between the $y_i$ and the $\hat{y}_i$
- This is called the $\ell_\infty$ loss
- Another way to express it: minimize $t$ such that $|y_i - \hat{y}_i| \le t$ for all $i$
- Linear programming formulation: minimize $t$ such that $y_i - \hat{y}_i \le t$ and $\hat{y}_i - y_i \le t$ for all $i$
1D insight into the outlier sensitivity
- Input: set of points $y_1, \ldots, y_n$ in $\mathbb{R}$
- What single point minimizes the MSE, MAE, ... as a representative of these points?
- MSE aka $\ell_2$: the mean
- MAE aka $\ell_1$: the median
- $\ell_\infty$: the mean of the maximum and minimum
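A minimal sketch with a toy sample (the outlier 100.0 drags the mean and the min/max midpoint far more than the median):
let mut points = vec![1.0_f64, 2.0, 3.0, 4.0, 100.0];
points.sort_by(|a, b| a.partial_cmp(b).unwrap());
let mean = points.iter().sum::<f64>() / points.len() as f64; // minimizes MSE
let median = points[points.len() / 2]; // minimizes MAE (odd-length sample)
let midrange = (points[0] + points[points.len() - 1]) / 2.0; // minimizes the max error
println!("mean = {mean}, median = {median}, midrange = {midrange}");
// mean = 22, median = 3, midrange = 50.5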
Something in between MSE and MAE?
- $\ell_p$ loss for $1 < p < 2$?
- Huber loss: $L_\delta(a) = \frac{1}{2} a^2$ if $|a| \le \delta$, else $\delta \left( |a| - \frac{1}{2} \delta \right)$
  - quadratic for small distances
  - linear for large distances
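A minimal sketch of the Huber loss under this standard definition:
// Quadratic for |residual| <= delta, linear beyond
fn huber(residual: f64, delta: f64) -> f64 {
    let a = residual.abs();
    if a <= delta {
        0.5 * a * a
    } else {
        delta * (a - 0.5 * delta)
    }
}
// Small residuals behave like (half the) squared error, large ones grow linearly
println!("{:.3} {:.3}", huber(0.5, 1.0), huber(3.0, 1.0)); // 0.125 2.500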
Underfitting
- Model not expressive enough to capture the problem
- Or a solution found does not match the problem
Possible solutions:
- Try harder to find a better solution
- Add more parameters
- Try a different model that captures the solution
Overfitting
- Predictions adjusted too well to training data
- Error on test data $\gg$ error on training data
Possible solutions:
- Don't optimize the model on the training data too much
- Remove features that are too noisy
- Add more training data
- Reduce model complexity
Bias and variance
Bias:
The simplifying assumptions made by the model to make the target function easier to approximate.
Mathematically it is the difference of the average value of predictions from the true function: $\mathrm{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$
Variance:
The amount that the output of the model changes given different training data.
Mathematically it is the expected squared deviation of the prediction function from its expected value: $\mathrm{Var}[\hat{f}(x)] = \mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]$
Bias: error due to the model being unable to match the complexity of the problem
Variance: how much the prediction will vary in response to the data points
Overfitting: high variance, low bias
Underfitting: high bias, low variance
Important in practice:
- detecting the source of problems: variance vs. bias, overfitting vs. underfitting
- navigating the trade-off and finding the sweet spot
Some examples
| Algorithm | Bias | Variance |
|:-:|:-:|:-:|
| Linear Regression | High | Low |
| Decision Tree | Low | High |
| Random Forest | Low | High (less than tree) |
Terminology
Parameters
- Variables fixed in a specific instantiation of a model
- Examples:
- coefficients in linear regression
- decision tree structure and thresholds
- weights and thresholds in a neural network
Hyperparameters
- Also parameters, but higher level
- Examples:
  - number of leaves in a decision tree
  - number of layers and their structure in a neural network
  - degree of a polynomial
Hyperparameter tuning
- Adjusting hyperparameters before training the final model
Model selection
- Deciding on the type of model to be used (linear regression? decision trees? ...)
Decision Tree discussion
- Tree structure and thresholds for splits are parameters (learned by the algorithm)
- Many of the others are hyperparameters
- split_quality: Sets the metric used to decide the feature on which to split a node
- min_impurity_decrease: How much reduction in gini or other metric should we see before we allow a split
- min_weight_split: Sets the minimum weight of samples required to split a node.
- min_weight_leaf: Sets the minimum weight of samples that a split has to place in each leaf
- max_depth: Affects the structure of the tree and how elements can be assigned to nodes
All documented at great length at https://docs.rs/linfa-trees/latest/linfa_trees/struct.DecisionTreeParams.html
Challenges of training and cross-validation
Big goal: train a model that can be used for predicting
Intermediate goal: select the right model and hyperparameters
Information leak danger!
- If we do it adaptively, information from the test set could affect the model selection
Tune your parameters by using portions of the training set and preserve the test set for only a final evaluation
Holdout method
- Partition the training data again: training set and validation set
- Use the validation part to estimate accuracy whenever needed
Pros:
- Very efficient
- Fine with lots of data when losing a fraction is not a problem
Cons:
- Yet another part of data not used for training
- Problematic when the data set is small
- Testing part could contain important information
$k$-fold cross-validation
- Partition the training set into $k$ folds at random
- Repeat $k$ times:
  - train on $k - 1$ folds
  - estimate the accuracy on the remaining fold
- Return the mean accuracy
Pros:
- Every data point used for training most of the time
- Less variance in the estimate
Cons:
- $k$ times slower
LOOCV: Leave-one-out cross-validation
- Extreme case of the previous approach: a separate fold for each data point
- For each data point $i$:
  - train on the data without point $i$
  - estimate the accuracy on point $i$
- Return the mean of the accuracies
Cons:
- Even more expensive
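A hedged sketch using linfa's fold (as in the Iris example later in this lecture): LOOCV is just $k$-fold with $k$ equal to the number of samples.
// Assumes `iris`, DecisionTree, and the linfa prelude are in scope as in the Iris example
let n = iris.nsamples();
let scores: Vec<f32> = iris.fold(n).into_iter().map(|(train, valid)| {
    let model = DecisionTree::params().fit(&train).unwrap();
    model.predict(&valid).confusion_matrix(&valid).unwrap().accuracy()
}).collect();
println!("LOOCV mean accuracy: {}", scores.iter().sum::<f32>() / n as f32);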
Many other options
- Generalization: leave-$p$-out cross-validation enumerates over all subsets of size $p$
- Sampling instead of trying all options
- A variation that ensures that all classes are evenly distributed across folds (stratified cross-validation)
- ...
Training and Cross Validation with Iris Dataset
Classic dataset
- 3 Iris species
- 50 examples in each class
- 4 features (sepal length/width, petal length/width) for each sample
:dep linfa = {version = "0.7.0"}
:dep linfa-datasets = { version = "0.7.0", features = ["iris"] }
:dep linfa-trees = { version = "0.7.0" }
:dep ndarray = { version = "0.16.1" }
:dep ndarray-rand = {version = "0.15.0" }
:dep rand = { version = "0.8.5", features = ["small_rng"] }
:dep smartcore = { version = "0.3.2", features = ["datasets", "ndarray-bindings"] }
use linfa::prelude::*;
use linfa_trees::DecisionTree;
use ndarray_rand::rand::SeedableRng;
use rand::rngs::SmallRng;
use linfa_trees::DecisionTreeParams;
fn crossvalidate() -> Result<(), Box<dyn std::error::Error>> {
// Load the Iris dataset
let mut iris = linfa_datasets::iris();
let mut rng = SmallRng::seed_from_u64(42);
// Split the data into training and testing sets
let (train, test) = iris.clone()
.shuffle(&mut rng)
.split_with_ratio(0.8);
// Extract the features (X) and target (y) for training and testing sets
let X_train = train.records();
let y_train = train.targets();
let X_test = test.records();
let y_test = test.targets();
// Print the shape of the training and testing sets
println!("X_train shape: ({}, {})", X_train.nrows(), X_train.ncols());
println!("y_train shape: ({})", y_train.len());
println!("X_test shape: ({}, {})", X_test.nrows(), X_test.ncols());
println!("y_test shape: ({})", y_test.len());
// Train the model on the training data
let model = DecisionTree::params()
.max_depth(Some(3))
.fit(&train)?;
// Evaluate the model's accuracy on the training set
let train_accuracy = model.predict(&train)
.confusion_matrix(&train)?
.accuracy();
println!("Training accuracy: {:.2}%", train_accuracy * 100.0);
// Evaluate the model's accuracy on the test set
let test_accuracy = model.predict(&test)
.confusion_matrix(&test)?
.accuracy();
println!("Test accuracy: {:.2}%", test_accuracy * 100.0);
// Define two models with depths 3 and 2
let dt_params1 = DecisionTree::params().max_depth(Some(3));
let dt_params2 = DecisionTree::params().max_depth(Some(2));
// Create a vector of models
let models = vec![dt_params1, dt_params2];
// Train and cross-validation using the models
let scores = iris.cross_validate_single(
5,
&models,
|prediction, truth|
Ok(prediction.confusion_matrix(truth.to_owned())?.accuracy()))?;
println!("Cross-validation scores: {:?}", scores);
// Perform cross-validation using fold
let scores: Vec<_> = iris.fold(5).into_iter().map(|(train, valid)| {
let model = DecisionTree::params()
.max_depth(Some(3))
.fit(&train).unwrap();
let accuracy = model.predict(&valid).confusion_matrix(&valid).unwrap().accuracy();
accuracy
}).collect();
println!("Cross-validation scores general: {:?} {}", scores, scores.iter().sum::<f32>()/scores.len() as f32);
Ok(())
}
crossvalidate();
Let's take a closer look at:
// Evaluate the model's accuracy on the training set
let train_accuracy = model.predict(&train)
    .confusion_matrix(&train)?
    .accuracy();
println!("Training accuracy: {:.2}%", train_accuracy * 100.0);
We're creating the confusion matrix, for example:
                 Predicted
           Class1  Class2  Class3
Actual
  Class1    TP1    FP12    FP13
  Class2    FP21   TP2     FP23
  Class3    FP31   FP32    TP3
Where:
- TP1, TP2, TP3 are True Positives for each class (correct predictions)
- FPxy are False Positives (predicting class y when it's actually class x)
And then calculating accuracy as:
Accuracy = (TP1 + TP2 + TP3) / Total Predictions
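As a quick illustration with hypothetical counts (not the Iris results), accuracy is the trace of the confusion matrix divided by the total:
use ndarray::array;
// Rows = actual class, columns = predicted class (made-up counts)
let cm = array![[16.0, 0.0, 0.0],
                [0.0, 14.0, 1.0],
                [0.0, 2.0, 17.0]];
let accuracy = cm.diag().sum() / cm.sum(); // trace / total
println!("Accuracy = {:.2}", accuracy); // (16 + 14 + 17) / 50 = 0.94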
For the second training experiment:
// Define two models with depths 3 and 2
let dt_params1 = DecisionTree::params().max_depth(Some(3));
let dt_params2 = DecisionTree::params().max_depth(Some(2));
// Create a vector of models
let models = vec![dt_params1, dt_params2];
// Train and cross-validate using the models
let scores = iris.cross_validate_single(
    5,
    &models,
    |prediction, truth|
    Ok(prediction.confusion_matrix(truth.to_owned())?.accuracy()))?;
println!("Cross-validation scores: {:?}", scores);
dt_params1 and dt_params2 are just parameter configurations, not yet trained models.
We'll split the training set into 5 folds:
Initial Split: [Fold1, Fold2, Fold3, Fold4, Fold5]
Then train on 4 of the folds and validate on the 5th:
For Model 1 (max_depth=3):
Iteration 1: Train on [Fold2, Fold3, Fold4, Fold5], Validate on Fold1
Iteration 2: Train on [Fold1, Fold3, Fold4, Fold5], Validate on Fold2
Iteration 3: Train on [Fold1, Fold2, Fold4, Fold5], Validate on Fold3
Iteration 4: Train on [Fold1, Fold2, Fold3, Fold5], Validate on Fold4
Iteration 5: Train on [Fold1, Fold2, Fold3, Fold4], Validate on Fold5
For Model 2 (max_depth=2): Same process repeated with the same folds
And the last training experiment:
// Perform cross-validation using fold
let scores: Vec<_> = iris.fold(5).into_iter().map(|(train, valid)| {
    let model = DecisionTree::params()
        .max_depth(Some(3))
        .fit(&train).unwrap();
    let accuracy = model.predict(&valid).confusion_matrix(&valid).unwrap().accuracy();
    accuracy
}).collect();
This manually implements 5-fold cross-validation.
Step-by-step description courtesy of Cursor Agent.
- Creating the Folds:
iris.fold(5)
- This splits the Iris dataset into 5 folds
- Returns tuples of `(train, valid)` where:
  - `train` contains 4/5 of the data
  - `valid` contains 1/5 of the data
- Each iteration will use a different fold as the validation set
- The Main Loop:
.into_iter().map(|(train, valid)| { ... })
- Converts the folds into an iterator
- For each iteration, we get a training set and validation set
- Model Training:
let model = DecisionTree::params()
    .max_depth(Some(3))
    .fit(&train).unwrap();
- Creates a decision tree model with max depth of 3
- Trains the model on the current training fold
- Uses `unwrap()` to handle potential errors (in production code, you'd want better error handling)
- Model Evaluation:
let accuracy = model.predict(&valid)
    .confusion_matrix(&valid)
    .unwrap()
    .accuracy();
- Makes predictions on the validation set
- Creates a confusion matrix comparing predictions to true labels
- Calculates the accuracy from the confusion matrix
- Collecting Results:
}).collect();
- Collects all 5 accuracy scores into a vector
- Each score represents the model's performance on a different validation fold
The key differences between this and the previous cross_validate_single method are:
- This is a more manual implementation
- It only evaluates one model configuration (max_depth=3)
- It gives you direct control over the training and evaluation process
- The results are collected into a simple vector of accuracy scores
This approach is useful when you want more control over the cross-validation process or when you need to implement custom evaluation metrics.