Topics

  1. Numbering Systems
  2. The Von Neumann Architecture
  3. Memory Hierarchy and Memory Concepts
  4. Trends, Sizes and Costs

Numbering Systems

  • Decimal (0-9) e.g. 1724
  • Binary (0-1) e.g. 0b011000 (24 decimal)
  • Octal (0-7) e.g. 0o131 (89 decimal)
  • Hexadecimal (0-9, A-F) e.g. 0x13F (319 decimal)

Converting between numbering systems

For any base b to decimal: assume a number C with digits c[n-1] ... c[1] c[0] (most significant first). Its decimal value is D = c[n-1]*b^(n-1) + ... + c[1]*b^1 + c[0]*b^0.
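
A minimal sketch of this positional expansion in Python (the function name and digit lists are illustrative, reusing the examples from the list above):

def to_decimal(digits, b):
    # digits are given most significant first, e.g. [1, 3, 1] for 0o131
    value = 0
    for d in digits:
        value = value * b + d   # Horner's rule: shift what we have by one base-b position, then add the next digit
    return value

print(to_decimal([1, 3, 1], 8))      # 0o131 -> 89
print(to_decimal([1, 3, 15], 16))    # 0x13F -> 319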

Between octal and binary

Every octal digit corresponds to exactly 3 binary digits and the reverse. For example 0o34 = 0b011_100. Traverse numbers right to left and prepend with 0s if necessary.

Between hexadecimal and binary

Every hexadecimal digit corresponds to exactly 4 binary digits and the reverse. For example 0x3A = 0b0011_1010. Traverse numbers right to left and prepend with 0s if necessary.
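
A quick way to check these digit groupings, here in Python (purely illustrative):

print(format(0o34, '06b'))   # 011100   -> groups of 3: 011 100, one group per octal digit
print(format(0x3A, '08b'))   # 00111010 -> groups of 4: 0011 1010, one group per hex digit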

Between decimal and binary (or any base b)

More complicated. Divide repeatedly by 2 (or the base b) and keep the remainder as the next binary digit, starting from the least significant one. Stop when the division returns 0.

D = 24          # decimal number to convert (0b011000 in the list above)
C = []          # digits, least significant first
while D > 0:
    C.append(D % 2)   # modulo operator -- the remainder is the next digit; substitute any base b for 2
    D = D // 2        # floor division -- or substitute any base b for 2
print(C[::-1])        # [1, 1, 0, 0, 0] -> 0b11000 = 24

What about between decimal and octal/hexadecimal?

You can use the same logic as for binary, or convert to binary first and then use the simple binary-to-octal/hexadecimal conversions.
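
For example, the same repeated-division loop with base 16, shown as a sketch (reusing 319 = 0x13F from the list above):

D = 319                      # decimal number to convert
digits = []
while D > 0:
    digits.append("0123456789ABCDEF"[D % 16])   # remainder selects the next (least significant) hex digit
    D = D // 16
print("0x" + "".join(reversed(digits)))         # 0x13F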

The Von Neumann Architecture

Named after the First Draft of a Report on the EDVAC written by mathematician John von Neumann in 1945.

Most processor architectures are still based on this same model.

von Neumann Architecture

Key Components

  1. Central Processing Unit (CPU):
    • The CPU is the core processing unit responsible for executing instructions and performing computations. It consists of:
    • Control Unit (CU):
      • Directs the operations of the CPU by interpreting instructions and coordinating data flow between the components.
      • Controls the flow of data between the input, memory, and output devices.
    • Arithmetic/Logic Unit (ALU):
      • Performs arithmetic operations (e.g., addition, subtraction) and logical operations (e.g., AND, OR, NOT).
      • Acts as the computational engine of the CPU.
  2. Memory Unit:
    • Stores data and instructions needed for processing.
    • The memory serves as temporary storage for instructions being executed and intermediate data.
    • It communicates with both the CPU and input/output devices.
  3. Input Device:
    • Provides data or instructions to the CPU.
    • Examples include keyboards, mice, and sensors.
    • Data flows from the input device into the CPU for processing.
  4. Output Device:
    • Displays or transmits the results of computations performed by the CPU.
    • Examples include monitors, printers, and actuators.

Also known as the stored program architecture

Both data and program are stored in memory; it is only by convention that some parts of memory contain instructions and others contain variables.

Two very special registers in the processor: Program Counter (PC) and Stack Pointer (SP)

PC: Points to the next instruction. It automatically advances to the following instruction when one is executed, except for branch and jmp instructions, which modify it explicitly. Branch instructions are used in loops and conditional statements; jmp instructions are used in function calls.

SP: Points to the beginning of the state (parameters, local variables, return address, old stack pointer, etc.) for the current function call.

Instruction Decoding

Use the Program Counter to fetch the next instruction. After fetching it, you have to decode it and then execute it.

Decoding an instruction requires splitting the instruction word into the opcode (telling you what to do) and the operands (telling you what data to operate on).

Opcode Format

Example from the MIPS (Microprocessor without Interlocked Pipeline Stages) Instruction Set Architecture (ISA). MIPS is a RISC (Reduced Instruction Set Computer) architecture.
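
As a concrete sketch of that split, the following decodes a MIPS R-type instruction word in Python (the field widths are the standard MIPS R-type layout; the example word encodes add $t0, $t1, $t2):

# Decode a 32-bit MIPS R-type instruction word into its fields.
# 0x012A4020 encodes: add $t0, $t1, $t2
instr  = 0x012A4020
opcode = (instr >> 26) & 0x3F   # bits 31-26: opcode (0 for R-type; funct selects the ALU operation)
rs     = (instr >> 21) & 0x1F   # bits 25-21: first source register   ($t1 = 9)
rt     = (instr >> 16) & 0x1F   # bits 20-16: second source register  ($t2 = 10)
rd     = (instr >> 11) & 0x1F   # bits 15-11: destination register    ($t0 = 8)
shamt  = (instr >>  6) & 0x1F   # bits 10-6:  shift amount
funct  =  instr        & 0x3F   # bits 5-0:   function code (0x20 = add)
print(opcode, rs, rt, rd, shamt, funct)   # 0 9 10 8 0 32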

The time cost of operations

Assume, for example, a processor clocked at 2 GHz, i.e. one clock cycle takes 1/(2x10^9) s = 0.5 ns.

  • Executing an instruction ~ 0.5 ns (1 clock cycle)
  • Getting a value (4 bytes) from L1 cache ~1 ns
  • Branch mispredict ~3 ns
  • Getting a value from L2 cache ~4 ns
  • Send 1Kbyte of data over 1Gbps network (just send not arrive) ~ 16 ns
  • Get a value from main memory ~100 ns
  • Read 1MB from main memory sequentially ~1000 ns
  • Compress 1Kbyte (in L1 cache) with zippy ~2000 ns
  • Read 1MB from SSD ~49,000 ns
  • Send a ping pong packet inside a datacenter ~500,000 ns
  • Read 1Mbyte from HDD ~825,000 ns
  • Do an HDD seek ~2,000,000 ns
  • Send a packet from US to Europe and back ~150,000,000 ns

https://samwho.dev/numbers/

The memory hierarchy and memory concepts

We've talked about different kinds of memory. It's helpful to think of it in terms of a hierarchy.

  • As indicated above, registers are closest to the processor and fastest.
  • As you move farther away, the size gets larger but access gets slower

Storage Hierarchy

The following figure from Hennessy and Patterson is also very informative.

Memory Hierarchy. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach.

When the CPU tries to read from a memory location, it:

  • First checks if that memory location is copied to L1 cache
    • if it is, then the value is returned
    • if it is not...
  • Then checks if the memory location is copied to L2 cache
    • if it is, then the value is copied to L1 cache and returned
    • if it is not...
  • Then checks if the memory location is copied to L3 cache
    • if it is, then the value is copied to L2, then L1 and returned
    • if it is not...
  • Go to main memory
    • fetches a cache line's worth of data, typically 64 bytes (why?)

More on Caches

  • Each cache-line-sized block of memory can be mapped to one of N slots in each cache
  • we say such a cache is N-way set associative
  • if all the slots it can map to are occupied, then we evict the Least Recently Used (LRU) line (sketched below)
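
A toy model of a single set with LRU eviction, as a Python sketch (the tags and access pattern are made up for illustration):

from collections import OrderedDict

class CacheSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()          # tag -> data, ordered least to most recently used

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)     # hit: mark the line as most recently used
            return "hit"
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # set is full: evict the least recently used line
        self.lines[tag] = "data"            # load the new line into the set
        return "miss"

s = CacheSet(ways=2)
print([s.access(t) for t in [0, 1, 0, 2, 1]])   # ['miss', 'miss', 'hit', 'miss', 'miss']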

Cache Mapping

Direct mapped versus 2-way cache mapping. Wikipedia: CPU cache

We can see the cache configuration on a Linux system with the getconf command.

Here's the output from the MOC.

$ getconf -a | grep CACHE
LEVEL1_ICACHE_SIZE                 32768  (32KB)
LEVEL1_ICACHE_ASSOC                8
LEVEL1_ICACHE_LINESIZE             64

LEVEL1_DCACHE_SIZE                 32768  (32KB)
LEVEL1_DCACHE_ASSOC                8
LEVEL1_DCACHE_LINESIZE             64

LEVEL2_CACHE_SIZE                  1048576 (1MB)
LEVEL2_CACHE_ASSOC                 16
LEVEL2_CACHE_LINESIZE              64

LEVEL3_CACHE_SIZE                  23068672 (22MB)
LEVEL3_CACHE_ASSOC                 11
LEVEL3_CACHE_LINESIZE              64

LEVEL4_CACHE_SIZE                  0
LEVEL4_CACHE_ASSOC                 0
LEVEL4_CACHE_LINESIZE              0

How many ways associative are they?

Why is 32KB 32,768 and not 32,000? When does K mean 1,000?

An 8-way associative cache with 32 KB of size and 64-byte blocks divides the cache into 64 sets, each with 8 cache lines. Memory addresses are mapped to specific sets.
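
A sketch of how an address maps to a set under these parameters (the specific address is hypothetical):

# Where does an address land in a 32 KB, 8-way cache with 64-byte lines?
line_size = 64
num_sets  = 64                                 # 32 KB / 64 B per line / 8 ways = 64 sets
addr      = 0x12A47                            # made-up address

offset  = addr % line_size                     # byte within the cache line
set_idx = (addr // line_size) % num_sets       # which of the 64 sets this address maps to
tag     = addr // (line_size * num_sets)       # stored in the set to identify the line

print(offset, set_idx, tag)   # 7 41 18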

Benefits of 8-Way Associativity:

  1. Reduces Conflict Misses:
    • Associativity allows multiple blocks to map to the same set, reducing the likelihood of eviction due to conflicts.
  2. Balances Complexity and Performance:
    • Higher associativity generally improves hit rates but increases lookup complexity. An 8-way cache strikes a good balance for most applications.

Cache Use Examples

Example from this blog post: https://mecha-mind.medium.com/demystifying-cpu-caches-with-examples-810534628d71

Contiguous read loop

Contiguous Reading

// cache1.cpp

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Contiguous access loop
 * 
 * Example from https://mecha-mind.medium.com/demystifying-cpu-caches-with-examples-810534628d71
 *
 * compile with `clang cache1.cpp -o cache1`
 * run with `./cache`
 */

int main(int argc, char* argv[]) {
    const int length = 512 * 1024 * 1024;   // 512M
    const int cache_line_size = 16;  // size in terms of ints (4 bytes) so 16 * 4 = 64 bytes
    const int m = length/cache_line_size;  // 512M / 16 = 32M

    printf("Looping %d M times\n", m/(1024*1024));

    int *arr = (int*)malloc(length * sizeof(int)); // 512M length array

    clock_t start = clock();
    for (int i = 0; i < m; i++)   // loop 32M times with contiguous access
        arr[i]++;
    clock_t stop = clock();
    
    double duration = ((double)(stop - start)) / CLOCKS_PER_SEC * 1000;
    
    printf("Duration: %f ms\n", duration);

    free(arr);
    return 0;
}

When running on Apple M2 Pro.

% clang cache1.cpp -o cache1
% ./cache1
Looping 32 M times
Duration: 54.166000 ms




Now let's modify the loop to jump by intervals of cache_line_size

Noncontiguous Read Loop

Noncontiguous Read

// cache2.cpp

    for (int i = 0; i < m*cache_line_size; i+=cache_line_size) // non-contiguous access
        arr[i]++;
    clock_t stop = clock();
% ./cache2
Looping 32 M times
Duration: 266.715000 ms

About 5X slower. What happened?

Noncontiguous with 2x cache line jump

We loop half the number of times!!

    for (int i = 0; i < m*cache_line_size; i+=2*cache_line_size) {
        arr[i]++;
        arr[i+cache_line_size]++;
    }

When running on Apple M2 Pro.

% ./cache3
Looping 16 M times
Duration: 255.551000 ms

Caches on multi-processor systems

For multi-processor systems (which are now standard), memory hierarchy looks something like this:

Caches on multicore systems

In other words, each core has its own L1 and L2 caches, but the L3 cache and, of course, main memory are shared.

Virtual Memory, Page Tables and TLBs

  • The addressable memory address range is much larger than available physical memory
  • Every program thinks it can access every possible memory address.
    • And there has to be some protection to prevent one program from modifying the memory occupied by another.
  • The mechanism for that is virtual memory, with paging and address translation.

Virtual and physical addresses

Wikipedia: Page table

From University of Illinois CS 241 lecture notes.

Page sizes are typically 4KB, 2MB or 1GB depending on the operating system.

If you access a memory address whose page is not currently in memory, a page fault occurs: a page may be evicted and the required page is loaded from storage into memory.
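
A minimal sketch of the address-split step, assuming 4 KB pages and a made-up one-entry page table (a real MMU checks the TLB before walking the page table):

# Split a virtual address into a page number and an offset, assuming 4 KB pages.
PAGE_SIZE  = 4096                     # 4 KB -> 12 offset bits
page_table = {0x00400: 0x12345}       # toy mapping: virtual page number -> physical frame number

vaddr  = 0x00400ABC
vpn    = vaddr // PAGE_SIZE           # virtual page number (0x00400)
offset = vaddr %  PAGE_SIZE           # byte within the page (0xABC)

if vpn in page_table:
    paddr = page_table[vpn] * PAGE_SIZE + offset
    print(hex(paddr))                 # 0x12345abc
else:
    print("page fault: load the page from storage, possibly evicting another page first")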

We'll finish by looking at some representative costs, sizes and computing "laws."

Costs

  • Server GPU: $500-$1000
  • DRAM: $0.05-$0.01/Gbyte
  • Disk: $0.02-$0.14/Gbyte

Sizes

For a typical server

2 x 2GHz Intel/AMD processors
32-128Gbytes of memory
10-100 Tbytes of storage
10Gbps Network card
1-2 KWatts of power

For a typical datacenter

100K - 1M servers
1+ MWatt of power
1-10 Pbps of internal bandwidth
1-10 Tbps of Internet-facing bandwidth
1-10 Exabytes of storage

Computing capabilities grow quickly, so several rules of thumb have been formulated to describe these trends:

  • Kryder's Law -- Storage density doubles every 12 months
  • Nielsen's Law -- Consumer Bandwidth doubles every 20 months
  • Moore's Law -- CPU capacity doubles every 18 months
  • Metcalfe's Law -- The value of a Network increases with the square of its members
  • Bell's Law -- Every 10 years the computing paradigm changes

In Class Poll