Lecture 19 - HashMap and HashSet

Logistics

  • Exam 1 corrections are done, oral exams are Monday and Tuesday
  • HW4 due in a week - Joey will intro in discussions on Tuesday
  • Reminder to cite sources and be skeptical of AI answers

Learning Objectives

By the end of today, you should be able to:

  • Use HashMap<K, V> to quickly look up data by key
  • Use HashSet<T> to find unique values and check membership
  • Understand what these have in common with Vec and String (besides capital letters)

What are collections?

Collections are types that can hold multiple values of a specified type.

So far

Heap-allocated collections:

  • Vec<T> - Growable array of items of type T (Lecture 15)
  • String - Growable text data (Lecture 18)

Stack-allocated collections:

  • Arrays [T; N] - Fixed number of items, known at compile time

Today we get two more collections

  • HashMap<K, V> - Look up values by key (like a dictionary)
  • HashSet<T> - Store unique values (no duplicates allowed)

Both are heap-allocated and follow the same ownership patterns as Vec!

The Problem: Looking Up Data Quickly

Imagine you're analyzing customer data with a million records.

You need to find the phone number for customer "Alice Smith" quickly.

Option 1: Search through a list

#![allow(unused)]
fn main() {
let customer_data = vec![("John Doe", 12345), 
                        ("Jane Smith", 56789), 
                        ("Alice Smith", 11235),
                         /* ... 999,997 more */];

// This is slow - might check every name!
for customer in customer_data.iter() {
    if customer.0 == "Alice Smith" {
        println!("{}", customer.1);
        break;
    }
}
}

Option 2: Sort and box them?

Instead of searching, what if we kept the information in organized drawers so we could jump directly to Alice's information?

Option 3: The miraculous HashMap

(Yep, it's basically a Python dict)

#![allow(unused)]
fn main() {
use std::collections::HashMap;

// Create a "phone book" for customer_data
let mut customer_data = HashMap::new();
customer_data.insert("Alice Smith".to_string(), 12345);
customer_data.insert("Bob Jones".to_string(), 56789);
customer_data.insert("Carol White".to_string(), 11235);

// Look up Alice's data instantly
match customer_data.get("Alice Smith") {
    Some(x) => println!("Alice's data is {}", x),
    None => println!("Alice not found"),
}
}

NOTE: .get() returns an Option (here Option<&i32>), so you have to handle the None case!
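
Besides match, there are a couple of other common ways to unpack the Option that .get() hands back. The sketch below is only illustrative: the helper name lookup_or_zero and the choice of 0 as a fallback are made up for this example.

```rust
use std::collections::HashMap;

// Hypothetical helper: return the stored number, or 0 if the name is missing.
fn lookup_or_zero(book: &HashMap<String, i32>, name: &str) -> i32 {
    // .get() gives Option<&i32>; .copied() turns that into Option<i32>
    book.get(name).copied().unwrap_or(0)
}

fn main() {
    let mut customer_data = HashMap::new();
    customer_data.insert("Alice Smith".to_string(), 12345);

    // `if let` is handy when you only care about the Some case
    if let Some(x) = customer_data.get("Alice Smith") {
        println!("Alice's data is {}", x);
    }

    println!("{}", lookup_or_zero(&customer_data, "Alice Smith")); // 12345
    println!("{}", lookup_or_zero(&customer_data, "Zed Nobody")); // 0
}
```

Whether a silent fallback like 0 is a good idea depends on the data; match or if let keeps the "not found" case visible.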

Memory Layout: HashMap on Stack and Heap

What does a HashMap look like in memory?

#![allow(unused)]
fn main() {
use std::collections::HashMap;

let mut customer_data = HashMap::new();
customer_data.insert("Alice Smith".to_string(), 12345);
customer_data.insert("Bob Jones".to_string(), 56789);
}
         STACK                                    HEAP

┌──────────────────────┐             ┌──────────────────────────────┐
│ customer_data:       │             │  Bucket Array (simplified)   │
│  HashMap<String,i32> │             │                              │
│   ptr ───────────────┼────────────►│  [0]: (hash) ──────┐         │
│   len: 2             │             │  [1]: empty        │         │
│   capacity: 8        │             │  [2]: (hash) ─┐    │         │
└──────────────────────┘             │  [3]: empty   │    │         │
                                     │  ...          │    │         │
                                     └───────────────┼────┼─────────┘
                                                     │    │
                                     ┌───────────────┘    │
                                     ▼                    ▼
                              ┌──────────────┐    ┌──────────────┐
                              │"Bob Jones"   │    │"Alice Smith" │
                              │  (String)    │    │  (String)    │
                              ├──────────────┤    ├──────────────┤
                              │  56789       │    │  12345       │
                              │  (i32)       │    │  (i32)       │
                              └──────────────┘    └──────────────┘

Key points:

  • The HashMap value itself (pointer + metadata) lives on the stack
  • The bucket array lives on the heap
  • Both keys (Strings) and values are stored on the heap
  • Hash function determines which bucket stores each key-value pair
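
You can poke at the stack-side metadata from code. This is a small sketch, assuming the same two-entry map as in the diagram; build_customer_data is a made-up helper, and the exact capacity and byte size you see are implementation details that can differ between Rust versions.

```rust
use std::collections::HashMap;

// Hypothetical helper: build the two-entry map from the diagram.
fn build_customer_data() -> HashMap<String, i32> {
    let mut customer_data = HashMap::new();
    customer_data.insert("Alice Smith".to_string(), 12345);
    customer_data.insert("Bob Jones".to_string(), 56789);
    customer_data
}

fn main() {
    let customer_data = build_customer_data();

    // len: how many key-value pairs are stored right now
    println!("len = {}", customer_data.len());
    // capacity: how many pairs fit before the heap table must grow
    // (the exact number is an implementation detail)
    println!("capacity = {}", customer_data.capacity());
    // Size of the stack part only; it does not grow as you insert
    println!(
        "stack size = {} bytes",
        std::mem::size_of::<HashMap<String, i32>>()
    );
}
```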

Adding and Updating Data

#![allow(unused)]
fn main() {
use std::collections::HashMap;
// Store product prices
let mut prices = HashMap::new();
prices.insert("laptop".to_string(), 999.99);
prices.insert("mouse".to_string(), 25.50);
prices.insert("keyboard".to_string(), 75.00);

// Update an existing price
prices.insert("laptop".to_string(), 899.99);  // overwrites

// Only add if not already there
if !prices.contains_key("tablet") {
    prices.insert("tablet".to_string(), 199.99);
}

// More concise way to add but avoid overwriting:
prices.entry("tablet".to_string()).or_insert(199.99);
}
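
One detail worth knowing: insert tells you whether it overwrote something. The sketch below wraps it in a made-up helper called set_price just to make the return value easy to see.

```rust
use std::collections::HashMap;

// Hypothetical helper: set a price and report what it replaced, if anything.
fn set_price(prices: &mut HashMap<String, f64>, item: &str, price: f64) -> Option<f64> {
    // insert returns Some(old_value) when the key already existed, None otherwise
    prices.insert(item.to_string(), price)
}

fn main() {
    let mut prices = HashMap::new();
    println!("{:?}", set_price(&mut prices, "laptop", 999.99)); // None
    println!("{:?}", set_price(&mut prices, "laptop", 899.99)); // Some(999.99)

    // remove also hands back the value it took out
    println!("{:?}", prices.remove("laptop")); // Some(899.99)
    println!("{}", prices.is_empty()); // true
}
```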

How does this really work?

It's not quite A-Z rooms with A-Z cabinets with A-Z drawers...

If you did that for our class (of 46 students):

  • 9 of you have A names (~20%)
  • No one has F, I, N, O, P, Q, R, U, X, Z
  • Not a great use of space!

Okay, then we'll just make... two A rooms

This is kind of how libraries do it:

We have the Dewey Decimal System:

  • Looking for "The Rust Programming Language"?
    • Turn it into code "005.133"
    • Find the appropriate shelf: "005.08-005.212"
    • Find the book on that shelf
  • Looking for "The Lord of the Rings"?
    • Turn it into code "823.912"
    • Find the right shelf: "823-824"
    • Find the book on that shelf

And we can have some shelves cover fewer numbers and some shelves cover more...

But we don't know at first what the distribution will be!

The solution - hashes

A hash function takes any input and converts it to a number.

Key properties:

  • Deterministic: Same input always produces same output
  • Fast: Cheap to compute, even for large inputs (far less than a millisecond for typical keys)
  • Uniform: Spreads values evenly across a range
  • Avalanche effect: Small changes in input → big changes in output
    • hash("Alice") -> 42
    • hash("alice") -> 8374 (just lowercasing the 'A' changed everything!)
  • Cryptographic hash functions are also hard to invert and make collisions very hard to find -> useful in security (e.g. passwords!)
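
You can see the first and fourth properties with the standard library's own hasher. This is a sketch, not how HashMap calls it internally: hash_of is a made-up helper, and the actual numbers printed are not something to rely on across Rust versions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical helper: hash a string with the standard library's DefaultHasher.
fn hash_of(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    println!("{}", hash_of("Alice"));
    println!("{}", hash_of("Alice")); // same as the line above (deterministic)
    println!("{}", hash_of("alice")); // a very different number
}
```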

A Toy Hash Function Example

Here's a simplified hash function to show the concept (real ones are much more sophisticated!):

#![allow(unused)]
fn main() {
fn toy_hash(s: &str) -> i32 {
    let mut hash: i32 = 0;
    for ch in s.chars() {
        hash = hash.wrapping_mul(31).wrapping_add(ch as i32);
    }
    hash 
}

// Examples:
println!("{}", toy_hash("Alice")); 
println!("{}", toy_hash("Bob"));    
println!("{}", toy_hash("alice"));  // (lowercase 'a' changes everything!)
}

Real hash functions (like the ones Rust uses) are much more complex and optimized, but they follow the same principle: turn any input into a number that can be used as an array index!

What does Rust use? You can swap in other hashers, but if you must know... HashMap's default is currently SipHash 1-3 (have fun going down that rabbit hole!) - it is slower than some alternatives but more resistant to attacks that deliberately cause collisions

From Hash Value to Bucket Index

Important distinction: The hash value is NOT the same as the bucket index! (i.e., the code isn't the shelf)

Let's say we have a HashMap with 8 buckets:

1. Calculate hash value (can be any i32):
   hash("Alice") = 1,234,567,890

2. Convert to bucket index using modulo:
   bucket_index = 1,234,567,890 % 8 = 2

3. Store in bucket [2]

Why use modulo?

  • Hash values can be HUGE (billions)
  • We only have a limited number of buckets (e.g., 8, 16, 100)
  • Modulo (%) wraps the hash value to fit our bucket array
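
Here is the arithmetic from above as a tiny sketch. One wrinkle if your hash is an i32: it can be negative, and in Rust % keeps the sign, so this made-up helper bucket_index uses rem_euclid instead. It assumes num_buckets is positive.

```rust
// Hypothetical helper: map an i32 hash onto one of `num_buckets` buckets.
// Assumes num_buckets > 0.
fn bucket_index(hash: i32, num_buckets: i32) -> usize {
    // rem_euclid always lands in 0..num_buckets, even for negative hashes;
    // plain % would give a negative result for a negative hash
    hash.rem_euclid(num_buckets) as usize
}

fn main() {
    println!("{}", bucket_index(1_234_567_890, 8)); // 2
    println!("{}", -7 % 8); // -7
    println!("{}", bucket_index(-7, 8)); // 1
}
```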

So here's what HashMap does

  • Turns a key into a hash
  • Turns a hash into a bucket array index
  • Stores the (key, value) pair at the bucket array index
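
To make those three steps concrete, here is a toy map built from scratch. It is a teaching sketch under simplifying assumptions (a fixed number of buckets, each bucket a small Vec so that keys landing in the same bucket can coexist); ToyMap and its methods are invented for this example, and the real HashMap is organized differently and is far more optimized.

```rust
// A toy hash map from &str keys to i32 values, for illustration only.
struct ToyMap {
    buckets: Vec<Vec<(String, i32)>>,
}

impl ToyMap {
    // Assumes num_buckets > 0.
    fn new(num_buckets: usize) -> Self {
        ToyMap { buckets: vec![Vec::new(); num_buckets] }
    }

    // Step 1: key -> hash (same idea as a toy multiply-and-add hash, but unsigned)
    fn hash(key: &str) -> u32 {
        let mut hash: u32 = 0;
        for ch in key.chars() {
            hash = hash.wrapping_mul(31).wrapping_add(ch as u32);
        }
        hash
    }

    // Step 2: hash -> bucket index
    fn index(&self, key: &str) -> usize {
        (Self::hash(key) as usize) % self.buckets.len()
    }

    // Step 3: store the pair in that bucket (overwriting an existing key)
    fn insert(&mut self, key: &str, value: i32) {
        let i = self.index(key);
        for pair in self.buckets[i].iter_mut() {
            if pair.0 == key {
                pair.1 = value;
                return;
            }
        }
        self.buckets[i].push((key.to_string(), value));
    }

    // Lookup repeats steps 1-2, then scans only that one bucket
    fn get(&self, key: &str) -> Option<i32> {
        let i = self.index(key);
        self.buckets[i]
            .iter()
            .find(|pair| pair.0 == key)
            .map(|pair| pair.1)
    }
}

fn main() {
    let mut map = ToyMap::new(8);
    map.insert("Alice Smith", 12345);
    map.insert("Bob Jones", 56789);
    map.insert("Alice Smith", 11235); // overwrites the first value
    println!("{:?}", map.get("Alice Smith")); // Some(11235)
    println!("{:?}", map.get("Carol White")); // None
}
```

Notice that get never looks at any bucket other than the one the hash points to. That is where the speed comes from.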

Iterating on a HashMap

Continuing our example:

#![allow(unused)]
fn main() {
use std::collections::HashMap;

let mut prices = HashMap::new();
prices.insert("laptop".to_string(), 999.99);
prices.insert("mouse".to_string(), 25.50);
prices.insert("keyboard".to_string(), 75.00);

// Look at all products and prices
for (product, price) in prices.iter() { // product and price are both &
    println!("{}: ${:.2}", product, price);
}

// Give everything a 10% discount
for (product, price) in prices.iter_mut() { // product is &, price is &mut
    *price *= 0.9;
}

// And printing them again
for (product, price) in prices.iter() { // product and price are both &
    println!("{}: ${:.2}", product, price);
}
}
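
A HashMap makes no promise about the order in which it visits its entries, so the loops above may print in a different order from run to run. If you want a stable order, one option is to collect the keys and sort them. The helper name sorted_products below is made up for this sketch.

```rust
use std::collections::HashMap;

// Hypothetical helper: product names in alphabetical order.
fn sorted_products(prices: &HashMap<String, f64>) -> Vec<&String> {
    let mut names: Vec<&String> = prices.keys().collect();
    names.sort();
    names
}

fn main() {
    let mut prices = HashMap::new();
    prices.insert("laptop".to_string(), 999.99);
    prices.insert("mouse".to_string(), 25.50);
    prices.insert("keyboard".to_string(), 75.00);

    // Same order on every run: keyboard, laptop, mouse
    for name in sorted_products(&prices) {
        println!("{}: ${:.2}", name, prices[name]);
    }

    // .values() visits just the values (in arbitrary order)
    let total: f64 = prices.values().sum();
    println!("total: ${:.2}", total);
}
```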

Ownership Interlude: What happens here?

#![allow(unused)]
fn main() {
use std::collections::HashMap;
let product = String::from("smartphone");
let mut prices = HashMap::new();
prices.insert(product, 599.99);
println!("Product: {}", product);  // What happens?
}
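
Answer: it doesn't compile. insert takes ownership of product, just like pushing a String into a Vec, so the println! is a use of a moved value (error E0382). Two possible fixes are sketched below; insert_copy is a made-up helper name, and which fix is right depends on whether you still need the String afterwards.

```rust
use std::collections::HashMap;

// Hypothetical helper showing the "copy" fix: the map gets its own String.
fn insert_copy(prices: &mut HashMap<String, f64>, product: &str, price: f64) {
    prices.insert(product.to_string(), price);
}

fn main() {
    let product = String::from("smartphone");
    let mut prices = HashMap::new();

    // Fix 1: give the map its own copy, so `product` is still ours afterwards
    insert_copy(&mut prices, &product, 599.99);
    println!("Product: {}", product);

    // Fix 2: let the map own the key and borrow it back when needed
    let tablet = String::from("tablet");
    prices.insert(tablet, 199.99); // `tablet` is moved here
    for name in prices.keys() {
        println!("Owned by the map: {}", name);
    }
}
```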

Common Pattern: Counting Things

Let's count how many times each word appears in text:

#![allow(unused)]
fn main() {
use std::collections::HashMap;

let text = "the cat sat on the mat";
let mut word_counts = HashMap::new();

for word in text.split_whitespace() {
    let new_count = match word_counts.get(word) {
        Some(x) => x+1,
        None => 1
    };
    word_counts.insert(word.to_string(), new_count);
}

for (word, count) in &word_counts {
    println!("'{}' appears {} times", word, count);
}
}

Alternatively:

#![allow(unused)]
fn main() {
use std::collections::HashMap;

let text = "the cat sat on the mat";
let mut word_counts = HashMap::new();

for word in text.split_whitespace() {
    let count = word_counts.entry(word.to_string()).or_insert(0); // this gives a mutable reference!
    *count += 1;
}

for (word, count) in &word_counts {
    println!("'{}' appears {} times", word, count);
}
}

Take-away: .entry().or_insert() gives you a mutable reference to the value in the key-value pair!
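
The same pattern works when the value is itself a collection. As a sketch, here is grouping words by first letter using or_default(), which inserts the type's default value (an empty Vec here) when the key is new. The function name group_by_first_letter is invented for this example.

```rust
use std::collections::HashMap;

// Hypothetical helper: group words by their first letter.
fn group_by_first_letter(text: &str) -> HashMap<char, Vec<String>> {
    let mut groups: HashMap<char, Vec<String>> = HashMap::new();
    for word in text.split_whitespace() {
        if let Some(first) = word.chars().next() {
            // or_default() inserts an empty Vec if the key is new,
            // then returns a mutable reference we can push onto
            groups.entry(first).or_default().push(word.to_string());
        }
    }
    groups
}

fn main() {
    let groups = group_by_first_letter("the cat sat on the mat");
    println!("{:?}", groups.get(&'t')); // Some(["the", "the"])
    println!("{:?}", groups.get(&'s')); // Some(["sat"])
}
```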

HashSet - the baby sibling of HashMap

The Problem: Duplicate Data

You have customer data but some customers appear multiple times:

#![allow(unused)]
fn main() {
let customers = vec![
    "Alice", "Bob", "Alice", "Carol", "Bob", "Devon", "Alice"
];
}

How many unique customers do we have? (How would you solve this without hashing?)
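
One possible answer to the "without hashing" question is to sort a copy of the list and then drop adjacent repeats. This is a sketch of that idea, with a made-up function name; it works, but sorting does more work than we need just to count.

```rust
// One way to count unique names without hashing: sort, then drop adjacent repeats.
fn count_unique_by_sorting(names: &[&str]) -> usize {
    let mut sorted = names.to_vec();
    sorted.sort(); // equal names end up next to each other
    sorted.dedup(); // removes consecutive duplicates
    sorted.len()
}

fn main() {
    let customers = vec!["Alice", "Bob", "Alice", "Carol", "Bob", "Devon", "Alice"];
    println!("{}", count_unique_by_sorting(&customers)); // 4
}
```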

HashSet: Automatic Uniqueness

(Yep, you've seen this too in a Python set)

#![allow(unused)]
fn main() {
use std::collections::HashSet;

let customers = vec!["Alice", "Bob", "Alice", "Carol", "Bob", "Devon", "Alice"];

// Put all customers in a HashSet - duplicates automatically removed
let unique_customers: HashSet<&str> = customers.iter().cloned().collect();

println!("Original list: {} customers", customers.len());  // 7
println!("Unique customers: {}", unique_customers.len());   // 4

// See who the unique customers are
for customer in &unique_customers {
    println!("Customer: {}", customer);
}
}
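
If you build the set one item at a time, insert returns a bool that tells you whether the value was new. As a sketch, that is enough to find the first customer who shows up twice; first_repeat is a made-up helper name.

```rust
use std::collections::HashSet;

// Hypothetical helper: the first name that shows up a second time, if any.
fn first_repeat<'a>(names: &[&'a str]) -> Option<&'a str> {
    let mut seen = HashSet::new();
    for &name in names {
        // insert returns false when the value was already in the set
        if !seen.insert(name) {
            return Some(name);
        }
    }
    None
}

fn main() {
    let customers = vec!["Alice", "Bob", "Alice", "Carol", "Bob", "Devon", "Alice"];
    println!("{:?}", first_repeat(&customers)); // Some("Alice")
    println!("{:?}", first_repeat(&["Alice", "Bob"])); // None
}
```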

Understanding .iter().cloned().collect()

Let's break down what's happening in that HashSet creation:

#![allow(unused)]
fn main() {
use std::collections::HashSet;

let customers = vec!["Alice", "Bob", "Alice"];
let unique: HashSet<&str> = customers.iter().cloned().collect();
}

Step by step:

  1. .iter() - Creates an iterator over references to the elements

    • Type: Iterator<Item = &&str> (references to string slices)
  2. .cloned() - Clones the value behind each reference

    • Takes each &&str and "clones" it to get &str
    • For Copy types like &str, i32, this is cheap (just copies the pointer/value)
    • Type: Iterator<Item = &str>
  3. .collect() - Gathers all items into a HashSet

    • Looks at the type annotation (: HashSet<&str>)
    • Creates a HashSet and inserts each &str, automatically removing duplicates
    • Type: HashSet<&str>

Creating HashSets from Different Vec Types

From Vec<i32> - Copy types are simple:

use std::collections::HashSet;

fn main(){
    let numbers = vec![1, 2, 3, 2, 4, 1, 5];

    // Option 1: Use .iter().cloned().collect()
    let unique_nums: HashSet<i32> = numbers.iter().cloned().collect();
    println!("{:?}", numbers);  // still valid

    // Option 2: Use .into_iter().collect() (consumes the Vec)
    let unique_nums: HashSet<i32> = numbers.into_iter().collect();
    // println!("{:?}", numbers);  // won't compile

    println!("{:?}", unique_nums);  // {1, 2, 3, 4, 5} (order may vary)
}

Strings work the same way - clone or move ownership

use std::collections::HashSet;

fn main(){
    let names = vec![
        String::from("Alice"),
        String::from("Bob"),
        String::from("Alice")
    ];

    // Option 1: Clone all Strings (original Vec still valid)
    let unique_names: HashSet<String> = names.iter().cloned().collect();
    println!("Original: {:?}", names);  // Still works!
    println!("Unique: {:?}", unique_names);

    // Option 2: Move Strings into HashSet (consumes the Vec)
    let unique_names: HashSet<String> = names.into_iter().collect();
    // println!("{:?}", names);  // ERROR! names was moved
}

Checking if something is in the HashSet

#![allow(unused)]
fn main() {
use std::collections::HashSet;

let mut valid_products = HashSet::new();
valid_products.insert("laptop".to_string());
valid_products.insert("mouse".to_string());
valid_products.insert("keyboard".to_string());

// Check if a product is valid
let product_to_check = "tablet";
if valid_products.contains(product_to_check) {
    println!("{} is a valid product", product_to_check);
} else {
    println!("{} is not in our catalog", product_to_check);
}
}
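
When you have many items to check at once, you can compare two sets directly instead of calling contains in a loop. The sketch below uses the difference and intersection methods; the helper name unknown_items and the sample order are made up for illustration.

```rust
use std::collections::HashSet;

// Hypothetical helper: items in `order` that are not in `catalog`, sorted.
fn unknown_items(catalog: &HashSet<String>, order: &HashSet<String>) -> Vec<String> {
    let mut unknown: Vec<String> = order.difference(catalog).cloned().collect();
    unknown.sort();
    unknown
}

fn main() {
    let catalog: HashSet<String> = ["laptop", "mouse", "keyboard"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let order: HashSet<String> = ["mouse", "tablet", "webcam"]
        .iter()
        .map(|s| s.to_string())
        .collect();

    println!("{:?}", unknown_items(&catalog, &order)); // ["tablet", "webcam"]

    // intersection: items that appear in both sets
    let mut valid: Vec<&String> = order.intersection(&catalog).collect();
    valid.sort();
    println!("{:?}", valid); // ["mouse"]
}
```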

A realistic example

You have 100,000 customer IDs and need to check if 10,000 orders are from valid customers. Which is faster?

#![allow(unused)]
fn main() {
use std::collections::HashSet;

// Stand-in data (made up): 100,000 customer IDs and 10,000 order customer IDs
let customer_ids: Vec<u32> = (0..100_000).collect();
let order_ids: Vec<u32> = (95_000..105_000).collect();

// Option A: Keep customer IDs in a Vec
let customers_vec = customer_ids.clone();
for order_id in &order_ids {
    if customers_vec.contains(order_id) {
        // Process valid order
    }
}

// Option B: Keep customer IDs in a HashSet
let customers_set: HashSet<u32> = customer_ids.into_iter().collect();
for order_id in &order_ids {
    if customers_set.contains(order_id) {
        // Process valid order
    }
}
}

They look the same - the difference is in how they work

  • Vec may have to check ALL 100,000 IDs for each order to find a match! Up to 100k TIMES 10k = 1 billion comparisons
  • HashSet hashes each order ID and looks only in the matching bucket - about 10k lookups in total
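
If you want to see the difference rather than take it on faith, here is a rough timing sketch. The data is made up and smaller than the numbers in the text so that it finishes quickly; the function names are invented, and the timings you get will depend on your machine and on whether you build in release mode.

```rust
use std::collections::HashSet;
use std::time::Instant;

// Count valid orders by scanning a Vec/slice for every order (Option A).
fn count_valid_vec(customers: &[u32], orders: &[u32]) -> usize {
    orders.iter().filter(|&id| customers.contains(id)).count()
}

// Count valid orders with HashSet lookups (Option B).
fn count_valid_set(customers: &HashSet<u32>, orders: &[u32]) -> usize {
    orders.iter().filter(|&id| customers.contains(id)).count()
}

fn main() {
    // Made-up data: 20,000 customers, 2,000 orders (half from unknown customers)
    let customer_ids: Vec<u32> = (0..20_000).collect();
    let order_ids: Vec<u32> = (19_000..21_000).collect();

    let start = Instant::now();
    let valid_a = count_valid_vec(&customer_ids, &order_ids);
    println!("Vec:     {} valid, took {:?}", valid_a, start.elapsed());

    let customers_set: HashSet<u32> = customer_ids.iter().copied().collect();
    let start = Instant::now();
    let valid_b = count_valid_set(&customers_set, &order_ids);
    println!("HashSet: {} valid, took {:?}", valid_b, start.elapsed());
}
```

Both functions return the same count; only the time differs. Note that building the HashSet has its own one-time cost, which this sketch leaves out of the timing.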

Activity 19 - Explain the Anagram Finder

On Gradescope you'll find a complete program for finding anagrams. The code is functional (for once!) - your job is to understand it.

You can discuss in groups but each Gradescope submission has a cap of 2.

  1. Take some time to explain in the in-line comments what each line of code is doing.
  2. In the triple /// doc-string comments before each function, explain what the function does overall and what its role is in the program.
  3. Consider renaming functions and variables (and if you do, replacing them everywhere else they appear!) to make it clearer what's going on

You can paste this into your IDE/VSCode or Rust Playground - whichever's easier.

Regardless of how far you get, paste your edited code into Gradescope by the end of class.