Algorithms for Data Science: Quantiles, Graphs, and Algorithm Design
About This Module
This module covers essential algorithms for data science applications. You'll learn quantile calculations for statistical analysis, graph representation and traversal algorithms (BFS/DFS), and algorithm design patterns including split-apply-combine, greedy algorithms, and divide-and-conquer approaches.
Prework
Prework Reading
Please read the following:
- Wikipedia: Percentile - Focus on calculation methods
- Wikipedia: Breadth-first search
- Wikipedia: Depth-first search
Pre-lecture Reflections
- What's the difference between percentile and quantile?
- Why might BFS find the shortest path in an unweighted graph?
- When would you use DFS vs BFS for graph exploration?
- What is the "greedy" approach to solving problems?
Learning Objectives
By the end of this module, you will be able to:
- Calculate quantiles and percentiles using linear interpolation
- Understand interquartile range (IQR) and its uses
- Implement ranking algorithms (standard and dense rank)
- Represent graphs using adjacency lists
- Implement BFS and DFS traversals
- Apply algorithm design patterns to data problems
Part 1: Quantiles and Statistical Algorithms
What are Quantiles?
Quantiles divide sorted data into equal parts:
- Quartiles (4 parts): Q1 (25%), Q2/median (50%), Q3 (75%)
- Percentiles (100 parts): P50 = median, P95 = 95th percentile
- Deciles (10 parts): D1 (10%), D5 (50%), etc.
```
Sorted data: [1, 2, 3, 4, 5, 6, 7, 8, 9]
                    ↑     ↑     ↑
                    Q1    Q2    Q3
                   (25%) (50%) (75%)
```
Calculating Quantiles: Linear Interpolation
For quantile q (0.0 to 1.0) on sorted data of length n:
```
position  = q * (n - 1)
lower_idx = floor(position)
upper_idx = ceil(position)
fraction  = position - lower_idx

if lower_idx == upper_idx:
    result = data[lower_idx]
else:
    result = data[lower_idx] * (1 - fraction)
           + data[upper_idx] * fraction
```
Quantile Implementation
```rust
fn quantile(data: &[f64], q: f64) -> Option<f64> {
    if data.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }

    // Sort the data
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());

    // Calculate position
    let pos = q * (sorted.len() - 1) as f64;
    let lower = pos.floor() as usize;
    let upper = pos.ceil() as usize;

    if lower == upper {
        Some(sorted[lower])
    } else {
        // Linear interpolation
        let fraction = pos - lower as f64;
        Some(sorted[lower] * (1.0 - fraction) + sorted[upper] * fraction)
    }
}

fn main() {
    let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0];

    println!("Q1 (25%): {:?}", quantile(&data, 0.25)); // Some(3.0)
    println!("Q2 (50%): {:?}", quantile(&data, 0.50)); // Some(5.0)
    println!("Q3 (75%): {:?}", quantile(&data, 0.75)); // Some(7.0)
    println!("P90: {:?}", quantile(&data, 0.90));      // ≈ 8.2
}
```
HW7 Connection: This is the `quantile()` function in Part 3, specifically for `f64` values!
Interquartile Range (IQR)
IQR = Q3 - Q1 measures the spread of the middle 50% of data.
Uses:
- Less sensitive to outliers than using the range (max - min)
- Outlier detection: a value x is flagged as an outlier if x < Q1 − 1.5 × IQR, or x > Q3 + 1.5 × IQR
```rust
#![allow(unused)]
fn main() {
    fn iqr(data: &[f64]) -> Option<f64> {
        let q1 = quantile(data, 0.25)?;
        let q3 = quantile(data, 0.75)?;
        Some(q3 - q1)
    }

    fn quantile(data: &[f64], q: f64) -> Option<f64> {
        if data.is_empty() || !(0.0..=1.0).contains(&q) {
            return None;
        }
        let mut sorted = data.to_vec();
        sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let pos = q * (sorted.len() - 1) as f64;
        let lower = pos.floor() as usize;
        let upper = pos.ceil() as usize;
        if lower == upper {
            Some(sorted[lower])
        } else {
            let fraction = pos - lower as f64;
            Some(sorted[lower] * (1.0 - fraction) + sorted[upper] * fraction)
        }
    }

    let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0];
    let iqr_val = iqr(&data).unwrap();
    let q1 = quantile(&data, 0.25).unwrap();
    let q3 = quantile(&data, 0.75).unwrap();
    println!("Q1: {}, Q3: {}, IQR: {}", q1, q3, iqr_val);

    // Outlier bounds
    let lower_bound = q1 - 1.5 * iqr_val;
    let upper_bound = q3 + 1.5 * iqr_val;
    println!("Outlier bounds: [{:.1}, {:.1}]", lower_bound, upper_bound);
}
```
Reminder: The `?` operator propagates failure up the call stack.
- On a `Result`, it returns early with `Err(e)` if the expression is `Err(e)`.
- On an `Option` (as in `iqr()` above), it returns early with `None` if the expression is `None`.
- Otherwise it unwraps the `Ok(x)` or `Some(x)`, and evaluation continues with `x`.
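As a tiny standalone illustration (the helper name `first_plus_last` is made up for this example), `?` exits the function early with `None` when either lookup fails:

```rust
fn first_plus_last(v: &[i32]) -> Option<i32> {
    // `?` returns None from this function immediately if the slice is empty;
    // otherwise it unwraps the Some(&i32) and we continue
    let first = v.first()?;
    let last = v.last()?;
    Some(first + last)
}

fn main() {
    println!("{:?}", first_plus_last(&[1, 2, 3])); // Some(4)
    println!("{:?}", first_plus_last(&[]));        // None
}
```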
IQR with outliers
Let's use a slightly more interesting dataset:
```rust
#![allow(unused)]
fn main() {
    fn iqr(data: &[f64]) -> Option<f64> {
        let q1 = quantile(data, 0.25)?;
        let q3 = quantile(data, 0.75)?;
        Some(q3 - q1)
    }

    fn quantile(data: &[f64], q: f64) -> Option<f64> {
        if data.is_empty() || !(0.0..=1.0).contains(&q) {
            return None;
        }
        let mut sorted = data.to_vec();
        sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let pos = q * (sorted.len() - 1) as f64;
        let lower = pos.floor() as usize;
        let upper = pos.ceil() as usize;
        if lower == upper {
            Some(sorted[lower])
        } else {
            let fraction = pos - lower as f64;
            Some(sorted[lower] * (1.0 - fraction) + sorted[upper] * fraction)
        }
    }

    let data = vec![-1.0, 2.0, 4.4, 6.7, 11.2, 22.8, 83.1, 124.7];
    let iqr_val = iqr(&data).unwrap();
    let q1 = quantile(&data, 0.25).unwrap();
    let q3 = quantile(&data, 0.75).unwrap();
    println!("Q1: {}, Q3: {}, IQR: {}", q1, q3, iqr_val);

    // Outlier bounds
    let lower_bound = q1 - 1.5 * iqr_val;
    let upper_bound = q3 + 1.5 * iqr_val;
    println!("Outlier bounds: [{:.1}, {:.1}]", lower_bound, upper_bound);
}
```
Ranking Algorithms
Standard Rank: Position in sorted order (ties get same rank, gaps follow)
Dense Rank: Position in sorted order (ties get same rank, no gaps)
```
Values:   [100, 95, 95, 90, 85]
Standard: [  1,  2,  2,  4,  5]   ← gap after ties
Dense:    [  1,  2,  2,  3,  4]   ← no gaps
```
Standard and Dense Ranking in Sports
Out of curiosity, I asked Anthropic Opus 4.5 to find examples of standard and dense ranking in sports.
Standard Competition Ranking (1, 2, 2, 4) — Skips positions after ties
Most individual sports and races use this method:
- Golf ⛳ — The classic example. You'll see "T2" (tied for 2nd) on leaderboards, and the next player is listed as 4th if two players tied for 2nd. This emphasizes that a player finished ahead of X competitors.
- Tennis (ATP/WTA rankings) 🎾 — Points-based rankings, but when ties occur in tournament results, standard ranking applies.
- Olympic events 🏅 — Track & field, swimming, skiing, etc. If two athletes tie for silver, no bronze is awarded (they give two silvers). The next finisher is 4th.
- Marathon / Running races 🏃 — If two runners tie for 2nd, the next finisher is 4th place.
- Horse racing 🐎 — Finish positions follow standard ranking.
- Cycling (race stages) 🚴 — Stage finishes use standard ranking.
Dense Ranking (1, 2, 2, 3) — Consecutive positions, no gaps
Less common in sports, but used in some contexts:
- Soccer/Football league tables ⚽ — While ties on points are typically broken by goal difference (so ties are rare), some leagues display positions using dense-style numbering during the season.
- Some fitness leaderboards — Particularly in CrossFit or gym competitions where continuous ranking is preferred.
- Some esports standings — Varies by organization.
Key Insight
The distinction often comes down to what the rank is meant to communicate:
| Standard Rank | Dense Rank |
|---|---|
| "How many competitors finished ahead of you?" | "What tier/bracket are you in?" |
| Emphasizes individual achievement | Emphasizes grouping/classification |
Golf's use of standard ranking makes intuitive sense: if you tied for 2nd, only one person beat you, so you share 2nd place. The 4th-place finisher, however, was beaten by three people, which is why 3rd is skipped.
Implementing Dense Rank
```rust
#![allow(unused)]
fn main() {
    fn dense_rank(data: &[f64]) -> Vec<usize> {
        if data.is_empty() {
            return vec![];
        }

        // Create (index, value) pairs and sort by value
        let mut indexed: Vec<(usize, f64)> = data.iter()
            .enumerate()           // produces an iterator of (index, &value) pairs
            .map(|(i, &v)| (i, v)) // extract index and dereference value
            .collect();

        // Sort by the values (second element of the tuples)
        indexed.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());

        // Assign dense ranks (0-based: the smallest value gets rank 0)
        let mut ranks = vec![0; data.len()];
        let mut current_rank = 0;
        let mut prev_value: Option<f64> = None;
        for &(original_idx, value) in &indexed {
            if let Some(prev) = prev_value {
                // Compare with a small epsilon to avoid floating-point
                // precision issues (e.g. 1.0 vs 1.0000000000000001)
                if (value - prev).abs() > 1e-10 {
                    current_rank += 1; // Only increment for new values
                }
            }
            ranks[original_idx] = current_rank;
            prev_value = Some(value);
        }
        ranks
    }

    let scores = vec![85.0, 95.0, 90.0, 95.0, 80.0];
    let ranks = dense_rank(&scores);
    for (score, rank) in scores.iter().zip(ranks.iter()) {
        println!("Score: {}, Rank: {}", score, rank);
    }
}
```
HW7 Connection: This is the `dense_rank()` function in Part 3!
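For comparison, here is a sketch of standard competition rank (HW7's `rank()` may differ in details such as ordering direction and tie handling). This version ranks higher scores first, matching the leaderboard example above:

```rust
fn standard_rank(data: &[f64]) -> Vec<usize> {
    // Pair each value with its original index, then sort descending
    // so the highest score gets rank 1
    let mut indexed: Vec<(usize, f64)> = data.iter().cloned().enumerate().collect();
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    let mut ranks = vec![0usize; data.len()];
    let mut prev: Option<f64> = None;
    let mut prev_rank = 0usize;
    for (pos, &(orig, value)) in indexed.iter().enumerate() {
        let rank = match prev {
            // A tie reuses the previous rank...
            Some(p) if (value - p).abs() < 1e-10 => prev_rank,
            // ...otherwise the rank is the 1-based sorted position,
            // which leaves gaps after ties
            _ => pos + 1,
        };
        ranks[orig] = rank;
        prev = Some(value);
        prev_rank = rank;
    }
    ranks
}

fn main() {
    let scores = vec![100.0, 95.0, 95.0, 90.0, 85.0];
    println!("{:?}", standard_rank(&scores)); // [1, 2, 2, 4, 5]
}
```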
Part 2: Graph Representation
What is a Graph?
A graph G = (V, E) consists of:
- Vertices (V): nodes/points
- Edges (E): connections between vertices
```
0 --- 1
|     |
|     |
3 --- 2
```
Vertices: {0, 1, 2, 3}
Edges: {(0,1), (1,2), (2,3), (3,0)}
Adjacency List Representation
Store graph as a list of neighbors for each vertex:
```rust
#![allow(unused)]
fn main() {
    use std::collections::HashMap;

    // Using Vec<Vec<usize>>
    fn create_graph_vec(n: usize, edges: &[(usize, usize)]) -> Vec<Vec<usize>> {
        let mut adj = vec![vec![]; n];
        for &(u, v) in edges {
            adj[u].push(v);
            adj[v].push(u); // For undirected graph
        }
        adj
    }

    // Using HashMap for sparse or labeled graphs
    fn create_graph_map<'a>(edges: &[(&'a str, &'a str)]) -> HashMap<&'a str, Vec<&'a str>> {
        let mut adj: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
        for &(u, v) in edges {
            adj.entry(u).or_default().push(v);
            adj.entry(v).or_default().push(u);
        }
        adj
    }

    // Example: square graph
    let edges = vec![(0, 1), (1, 2), (2, 3), (3, 0)];
    let graph = create_graph_vec(4, &edges);
    for (vertex, neighbors) in graph.iter().enumerate() {
        println!("Vertex {}: neighbors = {:?}", vertex, neighbors);
    }

    let edges_map = vec![("0", "1"), ("1", "2"), ("2", "3"), ("3", "0")];
    let graph_map = create_graph_map(&edges_map);
    for (vertex, neighbors) in graph_map.iter() {
        println!("Vertex {}: neighbors = {:?}", vertex, neighbors);
    }
}
```
When to Use Each Representation
| Representation | Best For | Lookup | Memory |
|---|---|---|---|
| `Vec<Vec<usize>>` | Dense graphs, integer vertices | O(1) | O(V + E) |
| `HashMap<K, Vec<K>>` | Sparse graphs, labeled vertices | O(1) avg | O(V + E) |
Part 3: Graph Traversal with BFS and DFS
Breadth-First Search (BFS)
BFS explores nodes level by level using a queue (VecDeque):
```
Graph:          BFS from vertex 0:
    0           Level 0: [0]
   / \          Level 1: [1, 3]   (neighbors of 0)
  1   3         Level 2: [2]      (unvisited neighbors)
   \ /
    2           Visit order: 0 → 1 → 3 → 2
```
BFS Implementation
This BFS implementation uses a HashSet to track visited nodes and a VecDeque as a FIFO queue. Starting from a given vertex, it repeatedly dequeues the front node, marks it visited, and enqueues all unvisited neighbors. The algorithm returns nodes in the order they were first discovered, which corresponds to visiting vertices level by level outward from the start.
```rust
#![allow(unused)]
fn main() {
    use std::collections::{VecDeque, HashSet};

    fn bfs(graph: &[Vec<usize>], start: usize) -> Vec<usize> {
        let mut visited = HashSet::new();
        let mut queue = VecDeque::new();
        let mut order = Vec::new();

        queue.push_back(start);
        visited.insert(start);

        while let Some(current) = queue.pop_front() {
            order.push(current);
            for &neighbor in &graph[current] {
                if !visited.contains(&neighbor) {
                    visited.insert(neighbor);
                    queue.push_back(neighbor);
                }
            }
        }
        order
    }

    // Square graph (4-cycle)
    // 0 --- 3
    // |     |
    // |     |
    // 1 --- 2
    let graph = vec![
        vec![1, 3], // 0
        vec![0, 2], // 1
        vec![1, 3], // 2
        vec![0, 2], // 3
    ];
    let order = bfs(&graph, 0);
    println!("BFS order from 0: {:?}", order);
}
```
Note: `VecDeque` is essential for O(1) queue operations: `pop_front` is O(1), while using `Vec::remove(0)` as a queue would cost O(n) per dequeue!
BFS for Shortest Path (Unweighted)
Why does BFS find shortest paths? Because BFS explores nodes level by level, the first time we reach any node is guaranteed to be via the shortest path. When we visit a node at distance d from the start, we've already visited all nodes at distances 0, 1, ..., d-1. This means we can't later find a shorter path to that node.
Key insight: In an unweighted graph, "shortest path" means fewest edges. BFS naturally discovers nodes in order of increasing distance from the start.
```rust
#![allow(unused)]
fn main() {
    use std::collections::{VecDeque, HashMap};

    fn bfs_distances(graph: &[Vec<usize>], start: usize) -> HashMap<usize, usize> {
        let mut distances = HashMap::new();
        let mut queue = VecDeque::new();

        queue.push_back(start);
        distances.insert(start, 0);

        while let Some(current) = queue.pop_front() {
            let current_dist = distances[&current];
            for &neighbor in &graph[current] {
                if !distances.contains_key(&neighbor) {
                    distances.insert(neighbor, current_dist + 1);
                    queue.push_back(neighbor);
                }
            }
        }
        distances
    }

    let graph = vec![
        vec![1, 3], // 0
        vec![0, 2], // 1
        vec![1, 3], // 2
        vec![0, 2], // 3
    ];
    let distances = bfs_distances(&graph, 0);
    for (node, dist) in &distances {
        println!("Distance from 0 to {}: {}", node, dist);
    }
}
```
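Because the first visit to a node is along a shortest route, recording each node's predecessor lets us recover the path itself, not just its length. A sketch of this extension (not part of HW7):

```rust
use std::collections::{HashMap, VecDeque};

// BFS that records each node's predecessor so the shortest path
// from `start` to `goal` can be reconstructed afterwards.
fn bfs_shortest_path(graph: &[Vec<usize>], start: usize, goal: usize) -> Option<Vec<usize>> {
    let mut parent: HashMap<usize, usize> = HashMap::new();
    let mut queue = VecDeque::new();
    queue.push_back(start);
    parent.insert(start, start); // start is its own parent

    while let Some(current) = queue.pop_front() {
        if current == goal {
            break;
        }
        for &n in &graph[current] {
            if !parent.contains_key(&n) {
                parent.insert(n, current);
                queue.push_back(n);
            }
        }
    }

    if !parent.contains_key(&goal) {
        return None; // goal unreachable from start
    }

    // Walk back from goal to start via parents, then reverse
    let mut path = vec![goal];
    let mut node = goal;
    while node != start {
        node = parent[&node];
        path.push(node);
    }
    path.reverse();
    Some(path)
}

fn main() {
    let graph = vec![vec![1, 3], vec![0, 2], vec![1, 3], vec![0, 2]];
    println!("{:?}", bfs_shortest_path(&graph, 0, 2)); // Some([0, 1, 2])
}
```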
Depth-First Search (DFS)
DFS explores as deep as possible first using a stack (Vec or recursion):
```
Graph:          DFS from vertex 0:
    0           Step 1: Visit 0, push neighbors [1, 3]
   / \          Step 2: Pop 1, visit it, push neighbor [2]
  1   3         Step 3: Pop 2, visit it, push neighbor [3]
   \ /          Step 4: Pop 3, visit it (no new neighbors)
    2
                Visit order: 0 → 1 → 2 → 3
                (Goes deep before exploring siblings)
```
DFS Implementation (Iterative)
This iterative DFS uses a Vec as a LIFO stack and a HashSet to track visited nodes. Starting from a given vertex, it pops the top node, marks it visited if not already seen, and pushes all unvisited neighbors onto the stack. Neighbors are added in reverse order to maintain consistent left-to-right traversal. The algorithm explores as deep as possible along each branch before backtracking.
```rust
#![allow(unused)]
fn main() {
    use std::collections::HashSet;

    fn dfs(graph: &[Vec<usize>], start: usize) -> Vec<usize> {
        let mut visited = HashSet::new();
        let mut stack = vec![start]; // Use Vec as stack
        let mut order = Vec::new();

        while let Some(current) = stack.pop() {
            if visited.contains(&current) {
                continue;
            }
            visited.insert(current);
            order.push(current);

            // Add neighbors to stack (reverse for consistent ordering)
            for &neighbor in graph[current].iter().rev() {
                if !visited.contains(&neighbor) {
                    stack.push(neighbor);
                }
            }
        }
        order
    }

    let graph = vec![
        vec![1, 3], // 0
        vec![0, 2], // 1
        vec![1, 3], // 2
        vec![0, 2], // 3
    ];
    let order = dfs(&graph, 0);
    println!("DFS order from 0: {:?}", order);
}
```
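The slide above mentions that DFS can also use recursion instead of an explicit stack; the call stack plays the role of the `Vec`. A minimal recursive sketch:

```rust
use std::collections::HashSet;

// Recursive DFS: visit `node`, then recurse into each unvisited neighbor.
// Neighbors are taken in stored order, so no reversal is needed.
fn dfs_recursive(
    graph: &[Vec<usize>],
    node: usize,
    visited: &mut HashSet<usize>,
    order: &mut Vec<usize>,
) {
    visited.insert(node);
    order.push(node);
    for &n in &graph[node] {
        if !visited.contains(&n) {
            dfs_recursive(graph, n, visited, order);
        }
    }
}

fn main() {
    let graph = vec![vec![1, 3], vec![0, 2], vec![1, 3], vec![0, 2]];
    let mut visited = HashSet::new();
    let mut order = Vec::new();
    dfs_recursive(&graph, 0, &mut visited, &mut order);
    println!("Recursive DFS order from 0: {:?}", order); // [0, 1, 2, 3]
}
```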
BFS vs DFS Summary
| Feature | BFS | DFS |
|---|---|---|
| Data Structure | Queue (VecDeque) | Stack (Vec) |
| Order | Level by level | Deep first |
| Shortest path | ✅ (unweighted) | ❌ |
| Memory | O(width) | O(depth) |
| Use case | Shortest path, levels | Cycle detection, components |
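The table lists connected components as a DFS use case. A minimal sketch (assuming the same `Vec<Vec<usize>>` adjacency-list representation used above) counts components by launching a DFS from every not-yet-visited vertex:

```rust
// Count connected components: each DFS launched from an unvisited
// vertex marks exactly one component.
fn count_components(graph: &[Vec<usize>]) -> usize {
    let mut visited = vec![false; graph.len()];
    let mut components = 0;

    for start in 0..graph.len() {
        if visited[start] {
            continue;
        }
        components += 1;
        // Iterative DFS over this component
        let mut stack = vec![start];
        while let Some(node) = stack.pop() {
            if visited[node] {
                continue;
            }
            visited[node] = true;
            for &n in &graph[node] {
                if !visited[n] {
                    stack.push(n);
                }
            }
        }
    }
    components
}

fn main() {
    // Two disjoint edges: {0-1} and {2-3}
    let graph = vec![vec![1], vec![0], vec![3], vec![2]];
    println!("Components: {}", count_components(&graph)); // 2
}
```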
Part 4: Algorithm Design Patterns
We'll cover the following patterns:
- Split-Apply-Combine
- Greedy Algorithms
- Divide and Conquer
Let's start with the first pattern: Split-Apply-Combine.
Pattern 1: Split-Apply-Combine
Already covered in HW7 Part 2 with GroupedSeries:
1. SPLIT: Group data by category
2. APPLY: Calculate aggregate per group
3. COMBINE: Collect results
```
data = [(A, 10), (B, 20), (A, 30), (B, 40)]
         ↓ SPLIT
groups = {A: [10, 30], B: [20, 40]}
         ↓ APPLY (mean)
means = {A: 20.0, B: 30.0}
         ↓ COMBINE
result = HashMap with means
```
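The diagram above can be sketched directly with a `HashMap` (a standalone example; the `group_means` helper is made up here and is not HW7's `GroupedSeries` API):

```rust
use std::collections::HashMap;

// SPLIT data by key, APPLY the mean to each group, COMBINE into one map.
fn group_means<'a>(data: &[(&'a str, f64)]) -> HashMap<&'a str, f64> {
    // SPLIT: group values by key
    let mut groups: HashMap<&'a str, Vec<f64>> = HashMap::new();
    for &(key, value) in data {
        groups.entry(key).or_default().push(value);
    }

    // APPLY (mean) + COMBINE: one aggregate per group
    groups
        .iter()
        .map(|(k, vs)| (*k, vs.iter().sum::<f64>() / vs.len() as f64))
        .collect()
}

fn main() {
    let data = vec![("A", 10.0), ("B", 20.0), ("A", 30.0), ("B", 40.0)];
    let means = group_means(&data);
    println!("{:?}", means); // {"A": 20.0, "B": 30.0} in some order
}
```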
Pattern 2: Greedy Algorithms
Greedy: Make the locally optimal choice at each step.
Example: Coin change (when it works)
```rust
#![allow(unused)]
fn main() {
    fn greedy_coin_change(amount: u32, coins: &[u32]) -> Vec<u32> {
        let mut result = Vec::new();
        let mut remaining = amount;

        // Sort coins in descending order
        let mut sorted_coins = coins.to_vec();
        sorted_coins.sort_by(|a, b| b.cmp(a));

        for &coin in &sorted_coins {
            while remaining >= coin {
                result.push(coin);
                remaining -= coin;
            }
        }
        result
    }

    let coins = vec![25, 10, 5, 1]; // US coins
    let change = greedy_coin_change(67, &coins);
    println!("67 cents: {:?}", change); // [25, 25, 10, 5, 1, 1]
}
```
Greedy Coin Change is Not Always Optimal
Warning: Greedy doesn't always give optimal solutions!
The greedy approach to the coin change problem is not optimal for every set of denominations. A coin system for which greedy always produces the minimum number of coins is called a canonical system; US coinage (25, 10, 5, 1) is canonical, but there is no simple rule (such as each coin being some multiple of the next) that guarantees it, so in general a system must be checked against an optimal method such as dynamic programming.
For example, consider the denominations [25, 15, 1] when making change for 30 cents: greedy takes the 25 first and then five 1-cent coins (6 coins total), while the optimal answer is two 15-cent coins.
```rust
#![allow(unused)]
fn main() {
    fn greedy_coin_change(amount: u32, coins: &[u32]) -> Vec<u32> {
        let mut result = Vec::new();
        let mut remaining = amount;

        // Sort coins in descending order
        let mut sorted_coins = coins.to_vec();
        sorted_coins.sort_by(|a, b| b.cmp(a));

        for &coin in &sorted_coins {
            while remaining >= coin {
                result.push(coin);
                remaining -= coin;
            }
        }
        result
    }

    let coins = vec![25, 15, 1];
    let change = greedy_coin_change(30, &coins);
    println!("30 cents: {:?}", change); // [25, 1, 1, 1, 1, 1] — optimal is [15, 15]
}
```
Pattern 3: Divide and Conquer
Divide and Conquer:
- Divide problem into smaller subproblems
- Conquer subproblems recursively
- Combine solutions
Classic example: Binary Search
```rust
#![allow(unused)]
fn main() {
    fn binary_search(sorted: &[i32], target: i32) -> Option<usize> {
        let mut left = 0;
        let mut right = sorted.len();

        while left < right {
            let mid = left + (right - left) / 2;
            match sorted[mid].cmp(&target) {
                std::cmp::Ordering::Equal => return Some(mid),
                std::cmp::Ordering::Less => left = mid + 1,
                std::cmp::Ordering::Greater => right = mid,
            }
        }
        None
    }

    let data = vec![1, 3, 5, 7, 9, 11, 13, 15];
    println!("Index of 7: {:?}", binary_search(&data, 7)); // Some(3)
    println!("Index of 8: {:?}", binary_search(&data, 8)); // None
}
```
Searching item by item would take O(n) time. Binary search takes O(log n), but only on sorted data: if the data isn't already sorted, sorting it first costs O(n log n). Alternatively, keeping the data in an ordered structure such as a `BTreeMap` gives O(log n) lookups directly.
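To make the divide-and-conquer structure explicit, the same search can be written recursively: split the slice at the midpoint, solve the problem in one half, and discard the other (the `offset` parameter is an addition for this sketch; it tracks the position in the original slice):

```rust
// Recursive binary search over a sorted slice.
// `offset` is the index in the original slice where `sorted` begins.
fn binary_search_rec(sorted: &[i32], target: i32, offset: usize) -> Option<usize> {
    if sorted.is_empty() {
        return None; // CONQUER base case: nothing left to search
    }
    let mid = sorted.len() / 2;
    match sorted[mid].cmp(&target) {
        std::cmp::Ordering::Equal => Some(offset + mid),
        // DIVIDE: recurse into the half that can contain the target
        std::cmp::Ordering::Less => {
            binary_search_rec(&sorted[mid + 1..], target, offset + mid + 1)
        }
        std::cmp::Ordering::Greater => binary_search_rec(&sorted[..mid], target, offset),
    }
}

fn main() {
    let data = vec![1, 3, 5, 7, 9, 11, 13, 15];
    println!("Index of 7: {:?}", binary_search_rec(&data, 7, 0)); // Some(3)
    println!("Index of 8: {:?}", binary_search_rec(&data, 8, 0)); // None
}
```

No COMBINE step is needed here: one recursive call fully answers the question, which is why binary search is sometimes called "decrease and conquer".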
Algorithm Design Summary
| Pattern | Key Idea | When to Use |
|---|---|---|
| Split-Apply-Combine | Group, aggregate, collect | Data aggregation by category |
| Greedy | Best local choice | Optimization with greedy property |
| Divide & Conquer | Split, solve, merge | Problems with optimal substructure |
Summary: HW7 Algorithm Connections
| HW7 Component | Concepts Used |
|---|---|
| FrequencyTable | Counting, Entry API |
| GroupedSeries | Split-apply-combine, closures |
| Histogram | BTreeMap, binning |
| quantile/iqr | Sorting, interpolation |
| RollingBuffer | VecDeque, circular buffer |
| rank/dense_rank | Sorting, index tracking |
Key Takeaways
- Quantiles require sorted data and linear interpolation
- IQR is robust to outliers (Q3 - Q1)
- BFS uses VecDeque, finds shortest paths
- DFS uses Vec as stack, explores deeply
- Algorithm patterns help structure solutions
In-Class Exercise: Outlier Detection
Task: Implement a function that finds all outliers in a dataset using the IQR method covered earlier in this lecture.
Recall: A value is an outlier if it falls outside the bounds:
- Lower bound: Q1 - 1.5 × IQR
- Upper bound: Q3 + 1.5 × IQR
```rust
fn find_outliers(data: &[f64]) -> Vec<f64> {
    // TODO: Return a Vec containing all outlier values
    // Hint: You can use the quantile() function from earlier
    // Step 1: Calculate Q1 and Q3
    // Step 2: Calculate IQR
    // Step 3: Calculate bounds
    // Step 4: Filter and collect outliers
    todo!()
}

// Example:
// data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]
// Q1 = 2.75, Q3 = 6.25, IQR = 3.5
// Lower bound = 2.75 - 5.25 = -2.5
// Upper bound = 6.25 + 5.25 = 11.5
// Output: [100.0] (only 100.0 is outside the bounds)
```
Hints:
- First implement or copy the `quantile()` function from the slides
- Use `.iter().filter().cloned().collect()` to find values outside the bounds
- Remember to handle the empty data case
Solution
```rust
#![allow(unused)]
fn main() {
    fn quantile(data: &[f64], q: f64) -> Option<f64> {
        if data.is_empty() || !(0.0..=1.0).contains(&q) {
            return None;
        }
        let mut sorted = data.to_vec();
        sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let pos = q * (sorted.len() - 1) as f64;
        let lower = pos.floor() as usize;
        let upper = pos.ceil() as usize;
        if lower == upper {
            Some(sorted[lower])
        } else {
            let fraction = pos - lower as f64;
            Some(sorted[lower] * (1.0 - fraction) + sorted[upper] * fraction)
        }
    }

    fn find_outliers(data: &[f64]) -> Vec<f64> {
        if data.is_empty() {
            return vec![];
        }

        // Step 1: Calculate Q1 and Q3
        let q1 = quantile(data, 0.25).unwrap();
        let q3 = quantile(data, 0.75).unwrap();

        // Step 2: Calculate IQR
        let iqr = q3 - q1;

        // Step 3: Calculate bounds
        let lower_bound = q1 - 1.5 * iqr;
        let upper_bound = q3 + 1.5 * iqr;

        // Step 4: Filter and collect outliers
        data.iter()
            .filter(|&&x| x < lower_bound || x > upper_bound)
            .cloned()
            .collect()
    }

    // Test it
    let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0];
    let outliers = find_outliers(&data);

    let q1 = quantile(&data, 0.25).unwrap();
    let q3 = quantile(&data, 0.75).unwrap();
    let iqr = q3 - q1;
    println!("Q1: {}, Q3: {}, IQR: {}", q1, q3, iqr);
    println!("Lower bound: {}", q1 - 1.5 * iqr);
    println!("Upper bound: {}", q3 + 1.5 * iqr);
    println!("Outliers: {:?}", outliers);

    // Test with data that has outliers on both ends
    let data2 = vec![-50.0, 10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 200.0];
    println!("\nData with outliers on both ends:");
    println!("Outliers: {:?}", find_outliers(&data2));
}
```
Key insights:
- Reuses the `quantile()` function from earlier in the lecture
- The filter closure uses `&&x` because we're iterating over `&f64` references
- `.cloned()` converts `&f64` to `f64` before collecting
- This pattern (compute statistics, then filter) is common in data science
HW7 Brings It All Together
HW7 combines everything:
- Generics and traits (Numeric)
- Collections (HashMap, HashSet, BTreeMap, VecDeque)
- Closures (aggregation functions)
- Iterators (data processing)
- Algorithm design (statistics, grouping)
Good luck on HW7!