Lecture 19 - HashMap and HashSet
Logistics
- Exam 1 corrections are done, oral exams are Monday and Tuesday
- HW4 due in a week - Joey will intro in discussions on Tuesday
- Reminder to cite sources and be skeptical of AI answers
Learning Objectives
By the end of today, you should be able to:
- Use HashMap<K, V> to quickly look up data by key
- Use HashSet<T> to find unique values and check membership
- Understand what these have in common with Vec and String (besides capital letters)
What are collections?
Collections are types that can hold multiple values of a specified type.
So far
Heap-allocated collections:
- Vec<T> - Growable array of items of type T (Lecture 15)
- String - Growable text data (Lecture 18)
Stack-allocated collections:
- Arrays [T; N] - Fixed number of items, known at compile time
Today we get two more collections
- HashMap<K, V> - Look up values by key (like a dictionary)
- HashSet<T> - Store unique values (no duplicates allowed)
Both are heap-allocated and follow the same ownership patterns as Vec!
The Problem: Looking Up Data Quickly
Imagine you're analyzing customer data with a million records.
You need to find customer "Alice Smith"'s phone number quickly.
Option 1: Search through a list
fn main() {
    let customer_data = vec![
        ("John Doe", 12345),
        ("Jane Smith", 56789),
        ("Alice Smith", 11235),
        /* ... 999,997 more */
    ];

    // This is slow - might check every name!
    for customer in customer_data.iter() {
        if customer.0 == "Alice Smith" {
            println!("{}", customer.1);
            break;
        }
    }
}
Option 2: Sort and box them?
Instead of searching, what if we kept the information in organized drawers so we could jump directly to Alice's information?
Option 3: ???
Option 3: The miraculous HashMap
(Yep, it's basically a Python dict)
use std::collections::HashMap;

fn main() {
    // Create a "phone book" for customer_data
    let mut customer_data = HashMap::new();
    customer_data.insert("Alice Smith".to_string(), 12345);
    customer_data.insert("Bob Jones".to_string(), 56789);
    customer_data.insert("Carol White".to_string(), 11235);

    // Look up Alice's data instantly
    match customer_data.get("Alice Smith") {
        Some(x) => println!("Alice's data is {}", x),
        None => println!("Alice not found"),
    }
}
NOTE: .get() returns an Option (Some(&value) if the key exists, None if it doesn't)!
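Because .get() hands back an Option, match is not the only way to deal with it. Here is a minimal sketch of two other common patterns. The helper name lookup and the "Zed Nobody" key are made up for illustration; they are not part of the lecture code.

```rust
use std::collections::HashMap;

// Hypothetical helper: returns a copy of the stored number, if any.
fn lookup(book: &HashMap<String, i32>, name: &str) -> Option<i32> {
    // .get() yields Option<&i32>; .copied() turns it into Option<i32>
    book.get(name).copied()
}

fn main() {
    let mut book = HashMap::new();
    book.insert("Alice Smith".to_string(), 12345);

    // if let: run code only when the key exists
    if let Some(n) = lookup(&book, "Alice Smith") {
        println!("Alice's data is {}", n);
    }

    // unwrap_or: fall back to a default when the key is missing
    let missing = lookup(&book, "Zed Nobody").unwrap_or(-1);
    println!("Zed's data (or default) is {}", missing);
}
```

Which one you reach for is mostly a matter of taste: match when both cases need real work, if let when only the Some case matters, unwrap_or when a sensible default exists.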
Memory Layout: HashMap on Stack and Heap
What does a HashMap look like in memory?
use std::collections::HashMap;

fn main() {
    let mut customer_data = HashMap::new();
    customer_data.insert("Alice Smith".to_string(), 12345);
    customer_data.insert("Bob Jones".to_string(), 56789);
}
STACK                                HEAP
┌──────────────────────┐             ┌──────────────────────────────┐
│ customer_data:       │             │  Bucket Array (simplified)   │
│ HashMap<String,i32>  │             │                              │
│   ptr ───────────────┼────────────►│  [0]: (hash) ──────┐         │
│   len: 2             │             │  [1]: empty        │         │
│   capacity: 8        │             │  [2]: (hash) ─┐    │         │
└──────────────────────┘             │  [3]: empty   │    │         │
                                     │  ...          │    │         │
                                     └───────────────┼────┼─────────┘
                                                     │    │
                                     ┌───────────────┘    │
                                     ▼                    ▼
                              ┌──────────────┐     ┌──────────────┐
                              │"Bob Jones"   │     │"Alice Smith" │
                              │ (String)     │     │ (String)     │
                              ├──────────────┤     ├──────────────┤
                              │ 56789        │     │ 12345        │
                              │ (i32)        │     │ (i32)        │
                              └──────────────┘     └──────────────┘
Key points:
- HashMap lives on the stack (pointer + metadata)
- The bucket array lives on the heap
- Both keys (Strings) and values are stored on the heap
- Hash function determines which bucket stores each key-value pair
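One way to convince yourself of the stack/heap split is to measure the HashMap value itself before and after inserting. This is only a sketch: handle_size is a hypothetical helper name, and the exact byte count it prints depends on your platform and Rust version.

```rust
use std::collections::HashMap;
use std::mem::size_of_val;

// Hypothetical helper: size in bytes of the HashMap value itself
// (the part that lives on the stack), not of the heap data it points to.
fn handle_size(map: &HashMap<String, i32>) -> usize {
    size_of_val(map)
}

fn main() {
    let mut customer_data: HashMap<String, i32> = HashMap::new();
    let before = handle_size(&customer_data);

    customer_data.insert("Alice Smith".to_string(), 12345);
    customer_data.insert("Bob Jones".to_string(), 56789);
    let after = handle_size(&customer_data);

    // Same number of bytes either way: the entries went to the heap.
    println!("handle size before: {} bytes, after: {} bytes", before, after);
    println!("len = {}", customer_data.len());
}
```

The stack part never grows no matter how many pairs you insert, which is the same behavior you saw with Vec and String.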
Adding and Updating Data
use std::collections::HashMap;

fn main() {
    // Store product prices
    let mut prices = HashMap::new();
    prices.insert("laptop".to_string(), 999.99);
    prices.insert("mouse".to_string(), 25.50);
    prices.insert("keyboard".to_string(), 75.00);

    // Update an existing price
    prices.insert("laptop".to_string(), 899.99); // overwrites

    // Only add if not already there
    if !prices.contains_key("tablet") {
        prices.insert("tablet".to_string(), 199.99);
    }

    // More concise way to add but avoid overwriting:
    prices.entry("tablet".to_string()).or_insert(199.99);
}
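One detail worth knowing about overwriting: insert returns the previous value as an Option, so you can tell whether you replaced something. A small sketch, where set_price is a hypothetical helper name and the prices are the same made-up numbers as above:

```rust
use std::collections::HashMap;

// Hypothetical helper: sets a price and reports what was there before.
fn set_price(prices: &mut HashMap<String, f64>, item: &str, price: f64) -> Option<f64> {
    // insert returns Some(old_value) on overwrite, None for a new key
    prices.insert(item.to_string(), price)
}

fn main() {
    let mut prices = HashMap::new();
    println!("{:?}", set_price(&mut prices, "laptop", 999.99)); // new key
    println!("{:?}", set_price(&mut prices, "laptop", 899.99)); // overwrite

    // entry().or_insert() leaves an existing value alone
    prices.entry("laptop".to_string()).or_insert(1.00);
    println!("{:?}", prices.get("laptop"));
}
```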
How does this really work?
It's not quite A-Z rooms with A-Z cabinets with A-Z drawers...
If you did that for our class (of 46 students):
- 9 of you have A names (~20%)
- No one has F, I, N, O, P, Q, R, U, X, Z
- Not a great use of space!
Okay, then we'll just make... two A rooms
This is kind of how libraries do it:
We have the Dewey Decimal System:
- Looking for "The Rust Programming Language"?
  - Turn it into code "005.133"
  - Find the appropriate shelf: "005.8-005.212"
  - Find the book on that shelf
- Looking for "The Lord of the Rings"?
  - Turn it into code "823.912"
  - Find the right shelf: "823-824"
  - Find the book on that shelf
And we can have some shelves cover fewer numbers and some shelves cover more...
But we don't know at first what the distribution will be!
The solution - hashes
A hash function takes any input and converts it to a number.
Key properties:
- Deterministic: Same input always produces same output
- Fast: Cheap to compute; for short keys like names it takes far less than a millisecond
- Uniform: Spreads values evenly across a range
- Avalanche effect: Small changes in input → big changes in output
  - hash("Alice") -> 42
  - hash("alice") -> 8374 (just lowercasing the 'A' changed everything!)
- Hard to invert, and collisions are rare: this is what makes hashes useful in security (e.g., passwords!)
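You can poke at the first and fourth properties yourself with the standard library's DefaultHasher. This is a sketch, not what HashMap does internally step for step: hash_of is a hypothetical helper name, and the numbers it prints will not be the illustrative 42 and 8374 above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical helper: hash a string with the standard library's default hasher.
fn hash_of(s: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    println!("Alice -> {}", hash_of("Alice"));
    println!("Alice -> {}", hash_of("Alice")); // deterministic: same as the line above
    println!("alice -> {}", hash_of("alice")); // avalanche: a very different number
}
```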
A Toy Hash Function Example
Here's a simplified hash function to show the concept (real ones are much more sophisticated!):
fn toy_hash(s: &str) -> i32 {
    let mut hash: i32 = 0;
    for ch in s.chars() {
        hash = hash.wrapping_mul(31).wrapping_add(ch as i32);
    }
    hash
}

fn main() {
    // Examples:
    println!("{}", toy_hash("Alice"));
    println!("{}", toy_hash("Bob"));
    println!("{}", toy_hash("alice")); // (lowercase 'a' changes everything!)
}
Real hash functions (like the ones Rust uses) are much more complex and optimized, but they follow the same principle: turn any input into a number that can then be mapped to an array index!
What does Rust use? You can swap in a different hasher, but if you must know... the default for HashMap is currently SipHash 1-3 (have fun going down that rabbit hole!). It is slower than simpler hash functions but more resistant to attackers who deliberately try to cause collisions.
From Hash Value to Bucket Index
Important distinction: The hash value is NOT the same as the bucket index! (ie the code isn't the shelf)
Let's say we have a HashMap with 8 buckets:
1. Calculate hash value (can be any i32):
hash("Alice") = 1,234,567,890
2. Convert to bucket index using modulo:
bucket_index = 1,234,567,890 % 8 = 2
3. Store in bucket [2]
Why use modulo?
- Hash values can be HUGE (billions)
- We only have a limited number of buckets (e.g., 8, 16, 100)
- Modulo (%) wraps the hash value to fit our bucket array
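The hash-to-bucket step can be sketched in a few lines. bucket_index is a hypothetical helper name. One wrinkle the worked example hides is that an i32 hash can be negative, so this sketch uses rem_euclid rather than plain %.

```rust
// Hypothetical helper: map a toy i32 hash onto one of `num_buckets` slots.
// Assumes num_buckets > 0.
fn bucket_index(hash: i32, num_buckets: usize) -> usize {
    // rem_euclid always returns a non-negative remainder, so a negative
    // hash still lands on a valid index (plain % could go negative).
    (hash as i64).rem_euclid(num_buckets as i64) as usize
}

fn main() {
    println!("{}", bucket_index(1_234_567_890, 8)); // 2, as in the worked example
    println!("{}", bucket_index(-3, 8));            // 5
}
```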
So here's what HashMap does
- Turns a key into a hash
- Turns a hash into a bucket array index
- Stores the (key, value) pair at the bucket array index
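Those three steps are enough to build a toy map of our own. The sketch below is for intuition only: ToyMap and its methods are made-up names, it reuses the toy hash from earlier, and it handles two keys landing in the same bucket by keeping a small Vec per bucket. The real std HashMap handles collisions differently and is far more optimized.

```rust
// A toy map from String keys to i32 values; NOT how std's HashMap is built.
struct ToyMap {
    buckets: Vec<Vec<(String, i32)>>,
}

impl ToyMap {
    // Assumes num_buckets > 0.
    fn new(num_buckets: usize) -> Self {
        ToyMap { buckets: vec![Vec::new(); num_buckets] }
    }

    // Step 1: key -> hash (same toy hash as earlier in the lecture)
    fn toy_hash(key: &str) -> i32 {
        let mut hash: i32 = 0;
        for ch in key.chars() {
            hash = hash.wrapping_mul(31).wrapping_add(ch as i32);
        }
        hash
    }

    // Step 2: hash -> bucket index
    fn index_for(&self, key: &str) -> usize {
        (Self::toy_hash(key) as i64).rem_euclid(self.buckets.len() as i64) as usize
    }

    // Step 3: store the (key, value) pair in that bucket
    fn insert(&mut self, key: &str, value: i32) {
        let i = self.index_for(key);
        for pair in self.buckets[i].iter_mut() {
            if pair.0 == key {
                pair.1 = value; // key already present: overwrite
                return;
            }
        }
        self.buckets[i].push((key.to_string(), value));
    }

    // Lookup repeats steps 1-2, then scans only that one bucket
    fn get(&self, key: &str) -> Option<i32> {
        let i = self.index_for(key);
        self.buckets[i].iter().find(|pair| pair.0 == key).map(|pair| pair.1)
    }
}

fn main() {
    let mut map = ToyMap::new(8);
    map.insert("Alice Smith", 12345);
    map.insert("Bob Jones", 56789);
    println!("{:?}", map.get("Alice Smith"));
    println!("{:?}", map.get("Nobody"));
}
```

Notice that get never looks at more than one bucket, which is why lookups stay fast as long as the hash spreads keys evenly.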
Iterating on a HashMap
Continuing our example:
use std::collections::HashMap;

fn main() {
    let mut prices = HashMap::new();
    prices.insert("laptop".to_string(), 999.99);
    prices.insert("mouse".to_string(), 25.50);
    prices.insert("keyboard".to_string(), 75.00);

    // Look at all products and prices
    for (product, price) in prices.iter() { // product and price are both &
        println!("{}: ${:.2}", product, price);
    }

    // Give everything a 10% discount
    for (product, price) in prices.iter_mut() { // product is &, price is &mut
        *price = *price * 0.9;
    }

    // And printing them again
    for (product, price) in prices.iter() { // product and price are both &
        println!("{}: ${:.2}", product, price);
    }
}
Ownership Interlude: What happens here?
use std::collections::HashMap;

fn main() {
    let product = String::from("smartphone");
    let mut prices = HashMap::new();
    prices.insert(product, 599.99);
    println!("Product: {}", product); // What happens?
}
Common Pattern: Counting Things
Let's count how many times each word appears in text:
use std::collections::HashMap;

fn main() {
    let text = "the cat sat on the mat";
    let mut word_counts = HashMap::new();

    for word in text.split_whitespace() {
        let new_count = match word_counts.get(word) {
            Some(x) => x + 1,
            None => 1,
        };
        word_counts.insert(word.to_string(), new_count);
    }

    for (word, count) in &word_counts {
        println!("'{}' appears {} times", word, count);
    }
}
Alternatively:
use std::collections::HashMap;

fn main() {
    let text = "the cat sat on the mat";
    let mut word_counts = HashMap::new();

    for word in text.split_whitespace() {
        // this gives a mutable reference!
        let count = word_counts.entry(word.to_string()).or_insert(0);
        *count += 1;
    }

    for (word, count) in &word_counts {
        println!("'{}' appears {} times", word, count);
    }
}
Take-away: .entry().or_insert() gives you a mutable reference to the value in the key-value pair!
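That mutable reference is useful for more than counters. As a sketch of the same pattern with a different value type, here the values are Vecs and we push into them. group_by_first_letter is a hypothetical helper name, and or_insert_with is a sibling of or_insert that builds the default only when the key is new.

```rust
use std::collections::HashMap;

// Hypothetical helper: group words by their first letter.
fn group_by_first_letter(text: &str) -> HashMap<char, Vec<String>> {
    let mut groups: HashMap<char, Vec<String>> = HashMap::new();
    for word in text.split_whitespace() {
        if let Some(first) = word.chars().next() {
            // or_insert_with builds the empty Vec only if the key is new,
            // then hands back a mutable reference to the Vec either way
            groups.entry(first).or_insert_with(Vec::new).push(word.to_string());
        }
    }
    groups
}

fn main() {
    let groups = group_by_first_letter("the cat sat on the mat");
    println!("{:?}", groups.get(&'t'));
    println!("{:?}", groups.get(&'s'));
}
```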
HashSet - the baby sibling of HashMap
The Problem: Duplicate Data
You have customer data but some customers appear multiple times:
fn main() {
    let customers = vec![
        "Alice", "Bob", "Alice", "Carol", "Bob", "Devon", "Alice"
    ];
}
How many unique customers do we have? (How would you solve this without hashing?)
HashSet: Automatic Uniqueness
(Yep, you've seen this too, as a Python set)
use std::collections::HashSet;

fn main() {
    let customers = vec!["Alice", "Bob", "Alice", "Carol", "Bob", "Devon", "Alice"];

    // Put all customers in a HashSet - duplicates automatically removed
    let unique_customers: HashSet<&str> = customers.iter().cloned().collect();

    println!("Original list: {} customers", customers.len()); // 7
    println!("Unique customers: {}", unique_customers.len()); // 4

    // See who the unique customers are
    for customer in &unique_customers {
        println!("Customer: {}", customer);
    }
}
Understanding .iter().cloned().collect()
Let's break down what's happening in that HashSet creation:
use std::collections::HashSet;

fn main() {
    let customers = vec!["Alice", "Bob", "Alice"];
    let unique: HashSet<&str> = customers.iter().cloned().collect();
}
Step by step:
- .iter() - Creates an iterator over references to the elements
  - Type: Iterator<Item = &&str> (references to string slices)
- .cloned() - Makes copies of each reference
  - Takes each &&str and "clones" it to get &str
  - For Copy types like &str, i32, this is cheap (just copies the pointer/value)
  - Type: Iterator<Item = &str>
- .collect() - Gathers all items into a HashSet
  - Looks at the type annotation (: HashSet<&str>)
  - Creates a HashSet and inserts each &str, automatically removing duplicates
  - Type: HashSet<&str>
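If the chained version is hard to read, the same pipeline can be written one step per line. This is a sketch with a hypothetical helper name, unique_names. The comments note what each intermediate iterator yields.

```rust
use std::collections::HashSet;

// Hypothetical helper: the same pipeline, one step per line.
fn unique_names<'a>(customers: &[&'a str]) -> HashSet<&'a str> {
    let refs = customers.iter();   // yields &&str
    let names = refs.cloned();     // yields &str
    let unique: HashSet<&str> = names.collect(); // duplicates dropped here
    unique
}

fn main() {
    let customers = vec!["Alice", "Bob", "Alice"];
    let unique = unique_names(&customers);
    println!("{} unique out of {}", unique.len(), customers.len());
}
```

For Copy types you may also see .copied() in place of .cloned(); it does the same job but only compiles when the copy really is cheap.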
Creating HashSets from Different Vec Types
From Vec<i32>
use std::collections::HashSet;

fn main() {
    let numbers = vec![1, 2, 3, 2, 4, 1, 5];

    // Option 1: Use .iter().cloned().collect()
    let unique_nums: HashSet<i32> = numbers.iter().cloned().collect();
    println!("{:?}", numbers); // still valid

    // Option 2: Use .into_iter().collect() (consumes the Vec)
    let unique_nums: HashSet<i32> = numbers.into_iter().collect();
    // println!("{:?}", numbers); // won't compile
    println!("{:?}", unique_nums); // {1, 2, 3, 4, 5} (order may vary)
}
Strings work the same way - clone or move ownership
use std::collections::HashSet;

fn main() {
    let names = vec![
        String::from("Alice"),
        String::from("Bob"),
        String::from("Alice"),
    ];

    // Option 1: Clone all Strings (original Vec still valid)
    let unique_names: HashSet<String> = names.iter().cloned().collect();
    println!("Original: {:?}", names); // Still works!
    println!("Unique: {:?}", unique_names);

    // Option 2: Move Strings into HashSet (consumes the Vec)
    let unique_names: HashSet<String> = names.into_iter().collect();
    // println!("{:?}", names); // ERROR! names was moved
}
Checking if something is in the HashSet
use std::collections::HashSet;

fn main() {
    let mut valid_products = HashSet::new();
    valid_products.insert("laptop".to_string());
    valid_products.insert("mouse".to_string());
    valid_products.insert("keyboard".to_string());

    // Check if a product is valid
    let product_to_check = "tablet";
    if valid_products.contains(product_to_check) {
        println!("{} is a valid product", product_to_check);
    } else {
        println!("{} is not in our catalog", product_to_check);
    }
}
A realistic example
You have 100,000 customer IDs and need to check if 10,000 orders are from valid customers. Which is faster?
use std::collections::HashSet;

fn main() {
    // Hypothetical stand-in data; imagine 100,000 customer IDs and 10,000 orders
    let customer_ids: Vec<u32> = vec![101, 102, 103];
    let order_ids: Vec<u32> = vec![102, 999];

    // Option A: Keep customer IDs in a Vec
    let customers_vec = customer_ids.clone();
    for order_id in &order_ids {
        if customers_vec.contains(order_id) {
            // Process valid order
        }
    }

    // Option B: Keep customer IDs in a HashSet
    let customers_set: HashSet<_> = customer_ids.into_iter().collect();
    for order_id in &order_ids {
        if customers_set.contains(order_id) {
            // Process valid order
        }
    }
}
They look the same - the difference is in how they work
- Vec may have to check ALL 100,000 IDs for each order to find a match: up to 100k TIMES 10k comparisons
- HashSet just hashes each order ID and checks only the matching bucket: on the order of 10k operations
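To see the gap for yourself, you can time both options. This is a rough sketch rather than a careful benchmark: the helper names and the ID ranges are made up, the sizes are smaller than the 100k/10k above so the slow version finishes quickly, and the timings will vary by machine and build mode.

```rust
use std::collections::HashSet;
use std::time::Instant;

// Hypothetical helper: Option A, linear scan of the Vec for every order.
fn count_valid_vec(customers: &[u32], orders: &[u32]) -> usize {
    orders.iter().filter(|&&id| customers.contains(&id)).count()
}

// Hypothetical helper: Option B, build a HashSet once, then hash each order.
fn count_valid_set(customers: &[u32], orders: &[u32]) -> usize {
    let set: HashSet<u32> = customers.iter().copied().collect();
    orders.iter().filter(|&&id| set.contains(&id)).count()
}

fn main() {
    // Made-up data: half of the orders come from known customers
    let customers: Vec<u32> = (0..2_000).collect();
    let orders: Vec<u32> = (1_000..3_000).collect();

    let start = Instant::now();
    let a = count_valid_vec(&customers, &orders);
    println!("Vec:     {} valid orders in {:?}", a, start.elapsed());

    let start = Instant::now();
    let b = count_valid_set(&customers, &orders);
    println!("HashSet: {} valid orders in {:?}", b, start.elapsed());
}
```

Both functions return the same count; only the amount of work differs. Try growing the ranges and watch how each timing scales.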
Activity 19 - Explain the Anagram Finder
On Gradescope you'll find a complete program for finding anagrams. The code is functional (for once!) - your job is to understand it.
You can discuss in groups but each Gradescope submission has a cap of 2.
- Take some time to explain in the inline comments what each line of code is doing.
- In the triple /// doc-string comments before each function, explain what the function does overall and what its role is in the program.
- Consider renaming functions and variables (and if you do, rename them everywhere they appear!) to make it clearer what's going on
You can paste this into your IDE/VSCode or Rust Playground - whichever's easier.
Regardless of how far you get, paste your edited code into Gradescope by the end of class.