NDArray and Neural networks

Lecture

Learning Objectives

Simple neural networks can be easily represented by matrix arithmentic.
Let's see how we can build a relatively simple neural network to recognize handwritten digits.

Biological Neuron

An _artificial neuron_is loosely modeled on a biological neuron.

From Stanford's cs231n

The dendrites carry impulses from other neurons of different distances.
Once the collective firing rate of the impulses exceed a certain threshold, the neuron fires its own pulse through the axon to other neurons.

Artificial Neuron

The artificial neuron is loosely based on the biological one.

From Stanford's cs231n.

The artifical neuron:

collects one or more inputs,
each multiplied by a unique weight,
sums the weighted inputs,
adds a bias,
then applies a nonlinear activation function.

Multi-Layer Perceptron (MLP) or Fully Connected Network (FCN)

From Stanford's cs231n.

Multiple artificial neurons can act on the same inputs,. This defines a layer. We can have more than one layer until we produce one or more outputs.
The example above shows a network with 3 inputs, two layers of neurons, each with 4 neurons, followed by one layer that produces a single value output.

This architecture can be used as a binary classifier.

FCN Detailed View

Here's a more detailed view of a fully connected network making biases explicit and with weight matrix and bias vector dimensions

We can collect all the inputs into a column vector.

$x = x_{1} x_{2} x_{3}$

Then we can gather the weights for the first layer:

$h_{1} = a [Ω_{0} x + β_{0}] = a ω_{11} ω_{21} ω_{31} ω_{41} ω_{12} ω_{22} ω_{32} ω_{42} ω_{13} ω_{23} ω_{33} ω_{43} x_{1} x_{2} x_{3} + β_{1} β_{2} β_{3} β_{4}$

where $a [z]$ is a nonlinear activation function, such as ReLU shown below.

The equation for which is.

$a [z] = ReLU [z] = {0 z z < 0 z \geq 0$

or in pseudocode simply:

x = max(0, x)

The nonlinear activation functions are crucial, because without them, the neural network can be simplified down to a single linear system of equations, which is severely limited in its representational power.

The subsequent layers can be defined similarly interms of the activations of the previous activation functions.

$h_{2} h_{3} h_{4} y = a [Ω_{1} h_{1} + β] = a [Ω_{2} h_{2} + β] = Ω_{3} h_{3} + β = softmax [h_{4}]$

where the last function is a softmax, which limits the output to between 0 and 1, and all the outputs sum up to 1.

$softmax (y_{i}) = \frac{e ^{y_{i}}}{\sum _{j = 1}^{K} e ^{y_{j}}}$

Here's a visualization.

From TDS Post.

Much of neural network training and evaluation comes down to vector operations in which ndarray excels.

Training a Neural Network

We won't go into all the details of training a network, but we'll show an example.
The steps involve producing the outpur from some input -- Forward Pass
Then updating the weights and biases of the network, taking the partial derivatives of the entire model with respect to each parameter and then updating each parameter proportional to its partial derivative -- Backward Pass

Simple weight update example

We'll build our understanding of the gradient by considering derivatives of single variable functions.

Let's start with a quadratic function

$f (x) = 3 x^{2} - 4 x + 5.$

The derivative with respect to $x$ of our example is

$f^{'} (x) = 6 x - 4.$

This is the asymptotic slope at any particular value of $x$ .

Question: Which way does $x$ need to move at $x = - 2$ to decrease $f (x)$ ? Positive or negative direction?

Question: What about at $x = 2$ ? Positive or negative direction?

An easy way to update this is:

$x \leftarrow x - α \frac{\partial f}{\partial x}$

where $α$ is some small number (e.g. 0.01) which we call the learning rate.

Minimizing the Network Loss

To train our network we want to minize some function that computes the difference between our target values $t_{i}$ and what the network produces $y_{i}$ for each input $x_{i}$ .

An appropriate loss function for classification models is the categorical cross-entropy loss.

The following image from the Medium Post illustrates the full pipeline:

except in the figure $s$ is the model output and $y$ are the targets.

We won't go into details but a nice output is that

$\frac{\partial ℓ _{i}}{\partial h _{4}} = y - t$

Forward propagation

Weights would be represented by an Array2<f32> with dimensions: (# neurons $\times$ # inputs)
- For example if you have 100 input neurons and 50 2nd layer neurons then you need a $50 \times 100$ matrix of weights
Bias vectors would be represented by an Array2<f32> with dimensions (# neurons $\times$ 1)

Backwards propagation

As mentioned, in practice the calculation of partial derivatives must be calculated starting with the end of the pipeline and then working backwards.
The chain rule for differentiation in calculus gives a modular, scalable way of doing this calculation, again starting from the end and working backwards.

Just a reminder, we have

$f_{0} h_{1} f_{1} h_{2} f_{2} h_{3} h_{4} y = Ω_{0} x + β_{0} = a [f_{0}] = Ω_{1} h_{1} + β = a [f_{1}] = Ω_{2} h_{2} + β = a [f_{2}] = Ω_{3} h_{3} + β = softmax [h_{4}]$

We already did

$\frac{\partial ℓ _{i}}{\partial h _{4}} \frac{\partial ℓ _{i}}{\partial Ω _{3}} = y - t = \frac{\partial ℓ _{i}}{\partial h _{4}} \frac{\partial h _{4}}{\partial Ω _{3}} = \frac{\partial ℓ _{i}}{\partial h _{4}} h_{3}^{T}$

And the process continues.

Example Implementaiton

Let's look at an example implementation.

We'll start with some utility functions.

#![allow(unused)]
fn main() {
:dep ndarray = { version = "^0.15" }
:dep rand = { version = "0.8" }

use ndarray::prelude::*;
use rand::Rng;

// Helper function to populate arrays with random values
// We could use ndarray-rand crate to do this, but it doesn't seem to work in a Jupyter notebook
fn populate_array(arr: &mut Array2<f32>, m: usize, n: usize) {
    let mut rng = rand::thread_rng();
    for i in 0..m {
        for j in 0..n {
            arr[(i, j)] = rng.gen_range(-1.0..1.0);
        }
    }
}

// ReLU activation function
fn relu(x: &Array2<f32>) -> Array2<f32> {
    x.mapv(|x| if x > 0.0 { x } else { 0.0 })
}

// Derivative of ReLU
fn relu_derivative(x: &Array2<f32>) -> Array2<f32> {
    x.mapv(|x| if x > 0.0 { 1.0 } else { 0.0 })
}

// Softmax function
fn softmax(x: &Array2<f32>) -> Array2<f32> {
    let exp_x = x.mapv(|x| x.exp());
    let sum_exp_x = exp_x.sum();
    exp_x / sum_exp_x
}

}

Then we'll define a NeuralNetwork struct and associated methods.

#![allow(unused)]

fn main() {
struct NeuralNetwork {
    input_size: usize,
    output_size: usize,
    weights1: Array2<f32>,
    biases1: Array2<f32>,
    weights2: Array2<f32>,
    biases2: Array2<f32>,
    learning_rate: f32,
}

impl NeuralNetwork {
    /// Create a shallow neural network with one hidden layer
    /// 
    /// # Arguments
    /// 
    /// * `input_size` - The number of input neurons
    /// * `hidden_size` - The number of neurons in the hidden layer
    /// * `output_size` - The number of output neurons
    /// * `learning_rate` - The learning rate for the neural network
    fn new(input_size: usize, hidden_size: usize, output_size: usize, learning_rate: f32) -> Self {
        let mut weights1 = Array2::zeros((hidden_size, input_size));
        let mut weights2 = Array2::zeros((output_size, hidden_size));

        // Initialize weights with random values
        populate_array(&mut weights1, hidden_size, input_size);
        populate_array(&mut weights2, output_size, hidden_size);

        let biases1 = Array2::zeros((hidden_size, 1));
        let biases2 = Array2::zeros((output_size, 1));

        NeuralNetwork {
            input_size,
            output_size,
            weights1,
            biases1,
            weights2,
            biases2,
            learning_rate,
        }
    }

    fn forward(&self, input: &Array2<f32>) -> (Array2<f32>, Array2<f32>, Array2<f32>) {
        // First layer
        let pre_activation1 = self.weights1.dot(input) + &self.biases1;
        let hidden = relu(&pre_activation1);

        // Output layer
        let pre_activation2 = self.weights2.dot(&hidden) + &self.biases2;
        let output = softmax(&pre_activation2);

        (hidden, pre_activation2, output)
    }

    fn backward(
        &mut self,
        input: &Array2<f32>,
        hidden: &Array2<f32>,
        pre_activation2: &Array2<f32>,
        output: &Array2<f32>,
        target: &Array2<f32>,
    ) {
        let batch_size = input.shape()[1] as f32;

        // Calculate gradients for output layer
        let output_error = output - target;
        
        // Gradients for weights2 and biases2
        let d_weights2 = output_error.dot(&hidden.t()) / batch_size;
        let d_biases2 = &output_error.sum_axis(Axis(1)).insert_axis(Axis(1)) / batch_size;

        // Backpropagate error to hidden layer
        let hidden_error = self.weights2.t().dot(&output_error);
        let hidden_gradient = &hidden_error * &relu_derivative(hidden);

        // Gradients for weights1 and biases1
        let d_weights1 = hidden_gradient.dot(&input.t()) / batch_size;
        let d_biases1 = &hidden_gradient.sum_axis(Axis(1)).insert_axis(Axis(1)) / batch_size;

        // Update weights and biases using gradient descent
        self.weights2 = &self.weights2 - &(d_weights2 * self.learning_rate);
        self.biases2 = &self.biases2 - &(d_biases2 * self.learning_rate);
        self.weights1 = &self.weights1 - &(d_weights1 * self.learning_rate);
        self.biases1 = &self.biases1 - &(d_biases1 * self.learning_rate);
    }

    fn train(&mut self, input: &Array2<f32>, target: &Array2<f32>) -> f32 {
        // Forward pass
        let (hidden, pre_activation2, output) = self.forward(input);

        // Calculate loss (cross-entropy)
        let epsilon = 1e-15;
        let loss = -target * &output.mapv(|x| (x + epsilon).ln());
        let loss = loss.sum() / (input.shape()[1] as f32);

        // Backward pass
        self.backward(input, &hidden, &pre_activation2, &output, target);

        loss
    }
}

}


fn main() {
    // Create a neural network
    let mut nn = NeuralNetwork::new(6, 6, 4, 0.01);

    // Create sample input
    let mut input = Array2::zeros((nn.input_size, 1));
    populate_array(&mut input, nn.input_size, 1);

    // Create sample target
    let mut target = Array2::zeros((nn.output_size, 1));
    populate_array(&mut target, nn.output_size, 1);
    
    // Calculate initial cross entropy loss before training
    let (_, _, initial_output) = nn.forward(&input);
    let epsilon = 1e-15;
    let initial_loss = -&target * &initial_output.mapv(|x| (x + epsilon).ln());
    let initial_loss = initial_loss.sum() / (input.shape()[1] as f32);
    println!("Initial loss before training: {}", initial_loss);

    
    // Train for one iteration
    let loss = nn.train(&input, &target);
    //println!("Loss: {}", loss);

    // Calculate loss after training
    let (_, _, final_output) = nn.forward(&input);
    let epsilon = 1e-15;
    let final_loss = -&target * &final_output.mapv(|x| (x + epsilon).ln());
    let final_loss = final_loss.sum() / (input.shape()[1] as f32);
    println!("loss after training: {}", final_loss);

}

We can now run an iteration of a forward pass and backward pass.

You can repeatedly run this cell to see the loss decreasing.

#![allow(unused)]
fn main() {
main()
}

Initial loss before training: 1.5886599
loss after training: 1.5812341





()

Has anyone implemented neural networks in Rust?

See for example candle. Perhaps a pun on "Torch" from "PyTorch".

But the point isn't to use an existing package but learn how to build a simple one from basic linear algebra principles.

All of this and more...

Quick start: https://github.com/rust-ndarray/ndarray/blob/master/README-quick-start.md
Numpy equivalence: https://docs.rs/ndarray/latest/ndarray/doc/ndarray_for_numpy_users/index.html
Linear Algebra: https://github.com/rust-ndarray/ndarray-linalg/blob/master/README.md

DS210 Course Notes

NDArray and Neural networks

About This Module

Prework

Prework Reading

Pre-lecture Reflections