CSV Files and Basic Data Engineering

About This Module

Prework

Prework Reading

Pre-lecture Reflections

Lecture

Learning Objectives

  • Data Engineering in Rust
    1. Reading CSV Files
    2. Deserializing CSV Files
    3. Cleaning CSV Files
    4. Converting CSV Data to NDArray representation
  • By default, the csv crate yields StringRecord values, which are structs holding each field of a row as a string

  • Missing fields are represented as empty strings

:dep csv = { version = "^1.3" }

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.records() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}

StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
StringRecord(["Richards Crossroads", "AL", "", "31.7369444", "-85.2644444"])
StringRecord(["Sandfort", "AL", "", "32.3380556", "-85.2233333"])





()

What if there are malformed records with mismatched field counts?

:dep csv = { version = "^1.3" }

let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.records() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}

StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])



thread '<unnamed>' panicked at src/lib.rs:164:25:
a CSV record: Error(UnequalLengths { pos: Some(Position { byte: 125, line: 4, record: 3 }), expected_len: 5, len: 8 })
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: std::panic::catch_unwind
   4: _run_user_code_16
   5: evcxr::runtime::Runtime::run_loop
   6: evcxr::runtime::runtime_hook
   7: evcxr_jupyter::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Let's make this safe for malformed records by matching on each result instead of calling expect. Match statements to the rescue.

:dep csv = { version = "^1.3" }

let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.records() {
    // An error may occur, so handle both cases explicitly:
    // print good records, and report bad ones without aborting.
    match result {
        Ok(record) => { 
          if count < 5 {
              println!("{:?}", record);
          }
          count += 1; 
        },
        Err(err) => {
            println!("error reading CSV record {}", err);
        }  
    }
}
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
error reading CSV record CSV error: record 3 (line: 4, byte: 125): found record with 8 fields, but the previous record has 5 fields
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])





()
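
To see why a reader can report one bad record and still continue with the next, here is a minimal std-only sketch of a field-count check. It is an illustration under simplifying assumptions, not the csv crate's implementation: `LineCheck` and `check_field_counts` are hypothetical names, and the naive `split(',')` ignores quoting and escaping.

```rust
/// Result of checking one line against the expected number of fields.
#[derive(Debug, PartialEq)]
enum LineCheck {
    /// The line had the expected number of fields.
    Fields(Vec<String>),
    /// The line had a different number of fields than the first line.
    UnequalLengths { line: usize, expected_len: usize, len: usize },
}

/// Split every line on commas and compare its field count with the first line.
/// Hypothetical helper: no support for quoted fields or embedded commas.
fn check_field_counts(text: &str) -> Vec<LineCheck> {
    let mut expected_len: Option<usize> = None;
    let mut checks = Vec::new();
    for (idx, line) in text.lines().enumerate() {
        let fields: Vec<String> = line.split(',').map(String::from).collect();
        // The first line fixes the expected field count for all later lines.
        let expected = *expected_len.get_or_insert(fields.len());
        if fields.len() == expected {
            checks.push(LineCheck::Fields(fields));
        } else {
            // Record the problem and keep going with the next line.
            checks.push(LineCheck::UnequalLengths {
                line: idx + 1,
                expected_len: expected,
                len: fields.len(),
            });
        }
    }
    checks
}
```

Because each line produces its own value, a caller can `match` on every entry, much like matching on each `Result` in the loop above.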

If your CSV file has headers and you want to access them, you can use the reader's headers method.

By default, the first row is treated as a special header row.

:dep csv = { version = "^1.3" }
{
let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;
// Read the header row first.
let headers = rdr.headers()?;
println!("Headers: {:?}", headers);

// Loop over each record.
for result in rdr.records() {
    // An error may occur, so handle both cases explicitly:
    // print good records, and report bad ones without aborting.
    match result {
        Ok(record) => { 
          if count < 5 {
              println!("{:?}", record);
          }
          count += 1; 
        },
        Err(err) => {
            println!("error reading CSV record {}", err);
        }  
    }
}
}
Headers: StringRecord(["City", "State", "Population", "Latitude", "Longitude"])
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
error reading CSV record CSV error: record 3 (line: 4, byte: 125): found record with 8 fields, but the previous record has 5 fields
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])





()

You can customize your reader in many ways:

let mut rdr = csv::ReaderBuilder::new()
    .has_headers(false)
    .delimiter(b';')
    .double_quote(false)
    .escape(Some(b'\\'))
    .flexible(true)
    .comment(Some(b'#'))
    .from_path("Some path");

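What is the difference between a ReaderBuilder and a Reader? A ReaderBuilder lets you configure options such as the delimiter and header handling before it constructs a Reader, while csv::Reader::from_path is a shortcut that builds a Reader with the default configuration.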

2. Deserializing CSV Files

StringRecords are not particularly useful in computation. They typically have to be converted to floats or integers before we can work with them.

You can deserialize your CSV data either into a:

  • Record with types you define, or

  • a hashmap of key value pairs

Custom Record

:dep csv = { version = "^1.3" }
use std::collections::HashMap;

type StrRecord = (String, String, Option<u64>, f64, f64);

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.deserialize() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record:StrRecord = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}

("Davidsons Landing", "AK", None, 65.2419444, -165.2716667)
("Kenai", "AK", Some(7610), 60.5544444, -151.2583333)
("Oakman", "AL", None, 33.7133333, -87.3886111)
("Richards Crossroads", "AL", None, 31.7369444, -85.2644444)
("Sandfort", "AL", None, 32.3380556, -85.2233333)





()

Note that we use Option<T> on one of the types that we know has some empty values.
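
The idea behind the `Option<u64>` field can be sketched with the standard library alone. This is a hedged illustration of the mapping seen in the output above (empty field to `None`, digits to `Some`), not the csv crate's actual code; `parse_population` is a hypothetical helper name.

```rust
use std::num::ParseIntError;

/// Parse a raw CSV field into an optional population count.
/// An empty field becomes `None`; anything else must be a valid `u64`.
fn parse_population(field: &str) -> Result<Option<u64>, ParseIntError> {
    if field.is_empty() {
        // No value recorded for this city.
        Ok(None)
    } else {
        // Wrap a successfully parsed number in `Some`; propagate parse errors.
        field.parse::<u64>().map(Some)
    }
}
```

With this helper, `parse_population("")` gives `Ok(None)` and `parse_population("7610")` gives `Ok(Some(7610))`, while a non-numeric field such as `"n/a"` is still reported as an error rather than silently dropped.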

HashMap

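Note the order of the keys in the printed output: a HashMap does not preserve the column order of the CSV file, and the order can differ from one record to the next.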

:dep csv = { version = "^1.3" }
use std::collections::HashMap;

type Record = HashMap<String, String>;

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.deserialize() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record:Record = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}
{"State": "AK", "Latitude": "65.2419444", "Population": "", "City": "Davidsons Landing", "Longitude": "-165.2716667"}
{"Population": "7610", "City": "Kenai", "State": "AK", "Longitude": "-151.2583333", "Latitude": "60.5544444"}
{"Latitude": "33.7133333", "City": "Oakman", "Population": "", "Longitude": "-87.3886111", "State": "AL"}
{"Population": "", "Longitude": "-85.2644444", "City": "Richards Crossroads", "State": "AL", "Latitude": "31.7369444"}
{"City": "Sandfort", "Latitude": "32.3380556", "Longitude": "-85.2233333", "State": "AL", "Population": ""}





()
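
If a stable order matters, one option is to impose it yourself. The sketch below uses only the standard library; `sorted_record` and `values_in_order` are hypothetical helper names, and the column list passed to `values_in_order` is assumed to come from the caller (for example, the header row).

```rust
use std::collections::{BTreeMap, HashMap};

/// Copy a record into a BTreeMap so that iteration follows sorted key order.
fn sorted_record(record: &HashMap<String, String>) -> BTreeMap<String, String> {
    record.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
}

/// Return the values in the order given by `columns`.
/// A column that is missing from the record yields `None`.
fn values_in_order<'a>(
    record: &'a HashMap<String, String>,
    columns: &[&str],
) -> Vec<Option<&'a str>> {
    columns
        .iter()
        .map(|c| record.get(*c).map(|s| s.as_str()))
        .collect()
}
```

Sorting by key gives a reproducible printout, while an explicit column list keeps the original CSV layout.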

This works, but every value is still a String, and it is hard to see which type is associated with which CSV field.

You can do better by using serde and structs

:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]  // derive the Deserialize trait
#[serde(rename_all = "PascalCase")]
struct SerRecord {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,  // account for the fact that some records have no population
    city: String,
    state: String,
}

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let mut count = 0;

// Loop over each record.
for result in rdr.deserialize() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record:SerRecord = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}

The type of the variable rdr was redefined, so was lost.


SerRecord { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
SerRecord { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
SerRecord { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
SerRecord { latitude: 31.7369444, longitude: -85.2644444, population: None, city: "Richards Crossroads", state: "AL" }
SerRecord { latitude: 32.3380556, longitude: -85.2233333, population: None, city: "Sandfort", state: "AL" }





()

What about deserializing with invalid data?

:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct FSerRecord {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.deserialize() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record:FSerRecord = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}

FSerRecord { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
FSerRecord { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }



thread '<unnamed>' panicked at src/lib.rs:335:36:
a CSV record: Error(UnequalLengths { pos: Some(Position { byte: 125, line: 4, record: 3 }), expected_len: 5, len: 8 })
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <unknown>
   4: <unknown>
   5: evcxr::runtime::Runtime::run_loop
   6: evcxr::runtime::runtime_hook
   7: evcxr_jupyter::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Deserialization failed so we need to deal with bad records just like before. Match statement to the rescue

:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct GSerRecord {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;

// Loop over each record.
// We need to specify the type we are deserializing to because compiler
// cannot infer the type from the match statement
for result in rdr.deserialize::<GSerRecord>() {
    // An error may occur, so handle both cases explicitly:
    // print good records, and report bad ones without aborting.
    match result {
        Ok(record) => {
            // Print a debug version of the record.
            if count < 5 {
                println!("{:?}", record);
            }
            count += 1;
        },
        Err(err) => {
            println!("{}", err);
        }
    }
}

GSerRecord { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
GSerRecord { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
CSV error: record 3 (line: 4, byte: 125): found record with 8 fields, but the previous record has 5 fields
GSerRecord { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }





()

Some more complex work. Let's filter cities over a population threshold

:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct FilterRecord {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let minimum_pop: u64 = 50_000;
// Loop over each record.
for result in rdr.deserialize::<FilterRecord>() {
    // An error may occur, so handle both cases explicitly
    // instead of aborting on the first bad record.
    match result {
        Ok(record) => {
            // `map_or` is a combinator on `Option`. It takes two parameters:
            // a value to use when the `Option` is `None` (i.e., the record has
            // no population count) and a closure that computes a value of the
            // same type when the `Option` is `Some`. In this case, we test the
            // population against the `minimum_pop` threshold defined above.
            if record.population.map_or(false, |pop| pop >= minimum_pop) {
                println!("{:?}", record);
            }
        },
        Err(err) => {
            println!("{}", err);
        }
    }
}

FilterRecord { latitude: 34.0738889, longitude: -117.3127778, population: Some(52335), city: "Colton", state: "CA" }
FilterRecord { latitude: 34.0922222, longitude: -117.4341667, population: Some(169160), city: "Fontana", state: "CA" }
FilterRecord { latitude: 33.7091667, longitude: -117.9527778, population: Some(56133), city: "Fountain Valley", state: "CA" }
FilterRecord { latitude: 37.4283333, longitude: -121.9055556, population: Some(62636), city: "Milpitas", state: "CA" }
FilterRecord { latitude: 33.4269444, longitude: -117.6111111, population: Some(62272), city: "San Clemente", state: "CA" }
FilterRecord { latitude: 41.1669444, longitude: -73.2052778, population: Some(139090), city: "Bridgeport", state: "CT" }
FilterRecord { latitude: 34.0230556, longitude: -84.3616667, population: Some(77218), city: "Roswell", state: "GA" }
FilterRecord { latitude: 39.7683333, longitude: -86.1580556, population: Some(773283), city: "Indianapolis", state: "IN" }
FilterRecord { latitude: 45.12, longitude: -93.2875, population: Some(62528), city: "Coon Rapids", state: "MN" }
FilterRecord { latitude: 40.6686111, longitude: -74.1147222, population: Some(59878), city: "Bayonne", state: "NJ" }
FilterRecord { latitude: 45.4983333, longitude: -122.4302778, population: Some(98851), city: "Gresham", state: "OR" }
FilterRecord { latitude: 34.9247222, longitude: -81.0252778, population: Some(59766), city: "Rock Hill", state: "SC" }
FilterRecord { latitude: 26.3013889, longitude: -98.1630556, population: Some(60509), city: "Edinburg", state: "TX" }
FilterRecord { latitude: 32.8369444, longitude: -97.0816667, population: Some(53221), city: "Euless", state: "TX" }
FilterRecord { latitude: 26.1944444, longitude: -98.1833333, population: Some(60687), city: "Pharr", state: "TX" }





()
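
The `map_or` call in the filter can be isolated into a small function to make its behavior easier to see. This is a minimal sketch using only the standard library; `meets_minimum` and `meets_minimum_match` are hypothetical names introduced for illustration.

```rust
/// Return true when the population is known and at least `minimum_pop`.
/// `map_or(false, ...)` yields `false` for `None` and applies the closure to `Some`.
fn meets_minimum(population: Option<u64>, minimum_pop: u64) -> bool {
    population.map_or(false, |pop| pop >= minimum_pop)
}

/// The same test written with an explicit `match`, for comparison.
fn meets_minimum_match(population: Option<u64>, minimum_pop: u64) -> bool {
    match population {
        Some(pop) => pop >= minimum_pop,
        None => false,
    }
}
```

Both versions treat a missing population as "does not pass the filter", which is why cities with an empty Population field never appear in the output above.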

3. Cleaning CSV Files

Once you have a record, you can push it to a vector and then iterate over the vector to fix it. Typed deserialization does not work well, however, when the fields themselves are malformed:

:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct DirtyRecord {
    CustomerNumber: Option<u32>,
    CustomerName: String,
    S2016: Option<f64>,
    S2017: Option<f64>,
    PercentGrowth: Option<f64>,
    JanUnits:Option<u64>,
    Month: Option<u8>,
    Day: Option<u8>,
    Year: Option<u16>,
    Active: String,
}

let mut rdr = csv::Reader::from_path("sales_data_types.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.deserialize::<DirtyRecord>() {
    // An error may occur, so handle both cases explicitly:
    // print good records, and report bad ones without aborting.
    match result {
        Ok(record) => {
            // Print a debug version of the record.
            if count < 5 {
                println!("{:?}", record);
            }
            count += 1;
        },
        Err(err) => {
            println!("{}", err);
        }
    }
}

CSV deserialize error: record 1 (line: 2, byte: 85): field 0: invalid digit found in string
CSV deserialize error: record 2 (line: 3, byte: 161): field 2: invalid float literal
CSV deserialize error: record 3 (line: 4, byte: 236): field 2: invalid float literal
CSV deserialize error: record 4 (line: 5, byte: 305): field 2: invalid float literal
CSV deserialize error: record 5 (line: 6, byte: 370): field 2: invalid float literal





()

An alternative is to read everything as Strings and clean them up using String methods.

:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct DirtyRecord {
    CustomerNumber: String,
    CustomerName: String,
    S2016: String,
    S2017: String,
    PercentGrowth: String,
    JanUnits:String,
    Month: String,
    Day: String,
    Year: String,
    Active: String,
}

#[derive(Debug, Default)]
struct CleanRecord {
    CustomerNumber: u64,
    CustomerName: String,
    S2016: f64,
    S2017: f64,
    PercentGrowth: f32,
    JanUnits:u64,
    Month: u8,
    Day: u8,
    Year: u16,
    Active: bool,

}

fn cleanRecord(r: DirtyRecord) -> CleanRecord {
    let mut c = CleanRecord::default();
    c.CustomerNumber = r.CustomerNumber.trim_matches('"').parse::<f64>().unwrap() as u64;
    c.CustomerName = r.CustomerName.clone();
    c.S2016 = r.S2016.replace('$',"").replace(',',"").parse::<f64>().unwrap();
    c.S2017 = r.S2017.replace('$',"").replace(',',"").parse::<f64>().unwrap();
    c.PercentGrowth = r.PercentGrowth.replace('%',"").parse::<f32>().unwrap() / 100.0;
    // Fall back to 0 when the field is not a valid integer (e.g. "Closed").
    c.JanUnits = r.JanUnits.parse::<u64>().unwrap_or(0);
    c.Month = r.Month.parse::<u8>().unwrap();
    c.Day = r.Day.parse::<u8>().unwrap();
    c.Year = r.Year.parse::<u16>().unwrap();
    c.Active = r.Active == "Y";
    c
}

fn process_csv_file() -> Vec<CleanRecord> {
    let mut rdr = csv::Reader::from_path("sales_data_types.csv").unwrap();
    let mut v:Vec<DirtyRecord> = Vec::new();
    // Loop over each record.
    for result in rdr.deserialize::<DirtyRecord>() {
        // An error may occur, so handle both cases explicitly:
        // keep good records, and report bad ones without aborting.
        match result {
            Ok(record) => {
                // Print a debug version of the record.
                println!("{:?}", record);
                v.push(record);
            },
            Err(err) => {
                println!("{}", err);
            }
        }
    }

    println!("");

    let mut cleanv: Vec<CleanRecord> = Vec::new();
    for r in v {
        let cleanrec = cleanRecord(r);
        println!("{:?}", cleanrec);
        cleanv.push(cleanrec);
    }
    return cleanv;
}

process_csv_file();
DirtyRecord { CustomerNumber: "10002.0", CustomerName: "QuestIndustries", S2016: "$125,000.00", S2017: "$162500.00", PercentGrowth: "30.00%", JanUnits: "500", Month: "1", Day: "10", Year: "2015", Active: "Y" }
DirtyRecord { CustomerNumber: "552278", CustomerName: "SmithPlumbing", S2016: "$920,000.00", S2017: "$101,2000.00", PercentGrowth: "10.00%", JanUnits: "700", Month: "6", Day: "15", Year: "2014", Active: "Y" }
DirtyRecord { CustomerNumber: "23477", CustomerName: "ACMEIndustrial", S2016: "$50,000.00", S2017: "$62500.00", PercentGrowth: "25.00%", JanUnits: "125", Month: "3", Day: "29", Year: "2016", Active: "Y" }
DirtyRecord { CustomerNumber: "24900", CustomerName: "BrekkeLTD", S2016: "$350,000.00", S2017: "$490000.00", PercentGrowth: "4.00%", JanUnits: "75", Month: "10", Day: "27", Year: "2015", Active: "Y" }
DirtyRecord { CustomerNumber: "651029", CustomerName: "HarborCo", S2016: "$15,000.00", S2017: "$12750.00", PercentGrowth: "-15.00%", JanUnits: "Closed", Month: "2", Day: "2", Year: "2014", Active: "N" }

CleanRecord { CustomerNumber: 10002, CustomerName: "QuestIndustries", S2016: 125000.0, S2017: 162500.0, PercentGrowth: 0.3, JanUnits: 500, Month: 1, Day: 10, Year: 2015, Active: true }
CleanRecord { CustomerNumber: 552278, CustomerName: "SmithPlumbing", S2016: 920000.0, S2017: 1012000.0, PercentGrowth: 0.1, JanUnits: 700, Month: 6, Day: 15, Year: 2014, Active: true }
CleanRecord { CustomerNumber: 23477, CustomerName: "ACMEIndustrial", S2016: 50000.0, S2017: 62500.0, PercentGrowth: 0.25, JanUnits: 125, Month: 3, Day: 29, Year: 2016, Active: true }
CleanRecord { CustomerNumber: 24900, CustomerName: "BrekkeLTD", S2016: 350000.0, S2017: 490000.0, PercentGrowth: 0.04, JanUnits: 75, Month: 10, Day: 27, Year: 2015, Active: true }
CleanRecord { CustomerNumber: 651029, CustomerName: "HarborCo", S2016: 15000.0, S2017: 12750.0, PercentGrowth: -0.15, JanUnits: 0, Month: 2, Day: 2, Year: 2014, Active: false }
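
The cleaning function above calls `unwrap()` on most fields, so a single unexpected value would abort the whole run. One possible alternative, sketched here with the standard library only, is to return a `Result` from small per-field helpers and let the caller decide what to do with bad values. `parse_currency` and `parse_percent` are hypothetical names, not part of the lecture code.

```rust
use std::num::ParseFloatError;

/// Remove the currency symbol and thousands separators, then parse as f64.
/// Returns an error instead of panicking when the remainder is not a number.
fn parse_currency(raw: &str) -> Result<f64, ParseFloatError> {
    raw.trim().replace('$', "").replace(',', "").parse::<f64>()
}

/// Parse a percentage such as "30.00%" into a fraction such as 0.3.
fn parse_percent(raw: &str) -> Result<f64, ParseFloatError> {
    raw.trim()
        .trim_end_matches('%')
        .parse::<f64>()
        .map(|p| p / 100.0)
}
```

A caller could then use `?` to propagate the error, or `unwrap_or` to substitute a default, in the same spirit as the fallback to 0 used for JanUnits.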

4. Let's convert the Vector of structs to an ndarray that can be fed into other libraries

Remember that ndarrays have to contain uniform data, so make sure the "columns" you pick are of the same type or you convert them appropriately.

:dep ndarray = { version = "^0.15.6" }
use ndarray::Array2;

let mut cleanv = process_csv_file();
let mut flat_values: Vec<f64> = Vec::new();
for s in &cleanv {
    flat_values.push(s.S2016);
    flat_values.push(s.S2017);
    flat_values.push(s.PercentGrowth as f64);
}
let array = Array2::from_shape_vec((cleanv.len(), 3), flat_values).expect("Error creating ndarray");
println!("{:?}", array);

CleanRecord { CustomerNumber: 10002, CustomerName: "QuestIndustries", S2016: 125000.0, S2017: 162500.0, PercentGrowth: 0.3, JanUnits: 500, Month: 1, Day: 10, Year: 2015, Active: true }
CleanRecord { CustomerNumber: 552278, CustomerName: "SmithPlumbing", S2016: 920000.0, S2017: 1012000.0, PercentGrowth: 0.1, JanUnits: 700, Month: 6, Day: 15, Year: 2014, Active: true }
CleanRecord { CustomerNumber: 23477, CustomerName: "ACMEIndustrial", S2016: 50000.0, S2017: 62500.0, PercentGrowth: 0.25, JanUnits: 125, Month: 3, Day: 29, Year: 2016, Active: true }
CleanRecord { CustomerNumber: 24900, CustomerName: "BrekkeLTD", S2016: 350000.0, S2017: 490000.0, PercentGrowth: 0.04, JanUnits: 75, Month: 10, Day: 27, Year: 2015, Active: true }
CleanRecord { CustomerNumber: 651029, CustomerName: "HarborCo", S2016: 15000.0, S2017: 12750.0, PercentGrowth: -0.15, JanUnits: 0, Month: 2, Day: 2, Year: 2014, Active: false }
DirtyRecord { CustomerNumber: "10002.0", CustomerName: "QuestIndustries", S2016: "$125,000.00", S2017: "$162500.00", PercentGrowth: "30.00%", JanUnits: "500", Month: "1", Day: "10", Year: "2015", Active: "Y" }
DirtyRecord { CustomerNumber: "552278", CustomerName: "SmithPlumbing", S2016: "$920,000.00", S2017: "$101,2000.00", PercentGrowth: "10.00%", JanUnits: "700", Month: "6", Day: "15", Year: "2014", Active: "Y" }
DirtyRecord { CustomerNumber: "23477", CustomerName: "ACMEIndustrial", S2016: "$50,000.00", S2017: "$62500.00", PercentGrowth: "25.00%", JanUnits: "125", Month: "3", Day: "29", Year: "2016", Active: "Y" }
DirtyRecord { CustomerNumber: "24900", CustomerName: "BrekkeLTD", S2016: "$350,000.00", S2017: "$490000.00", PercentGrowth: "4.00%", JanUnits: "75", Month: "10", Day: "27", Year: "2015", Active: "Y" }
DirtyRecord { CustomerNumber: "651029", CustomerName: "HarborCo", S2016: "$15,000.00", S2017: "$12750.00", PercentGrowth: "-15.00%", JanUnits: "Closed", Month: "2", Day: "2", Year: "2014", Active: "N" }
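
The conversion above relies on the flat vector being in row-major order: all values of the first record, then all values of the second, and so on. The std-only sketch below illustrates that layout and the index arithmetic behind it. `flatten_rows` and `at` are hypothetical helpers for illustration, not ndarray functions.

```rust
/// Flatten rows into one row-major Vec, checking that every row has `ncols` values.
/// Returns `None` if any row has the wrong length.
fn flatten_rows(rows: &[Vec<f64>], ncols: usize) -> Option<Vec<f64>> {
    let mut flat = Vec::with_capacity(rows.len() * ncols);
    for row in rows {
        if row.len() != ncols {
            return None;
        }
        flat.extend_from_slice(row);
    }
    Some(flat)
}

/// Look up element (i, j) in a row-major buffer with `ncols` columns.
fn at(flat: &[f64], ncols: usize, i: usize, j: usize) -> f64 {
    flat[i * ncols + j]
}
```

The length check mirrors why `from_shape_vec` returns a `Result`: the number of values must match the requested shape exactly.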

If your data does not need cleaning

This is not likely, but sometimes data preprocessing happens in other environments and you are given a clean file to work with. Or you clean the data once and reuse it to train many different models. For these cases, there is a crate that lets you go directly from CSV to ndarray:

https://docs.rs/ndarray-csv/latest/ndarray_csv/

:dep csv = { version = "^1.3.1" }
:dep ndarray = { version = "^0.15.6" }
:dep ndarray-csv = { version = "^0.5.3" }

extern crate ndarray;
extern crate ndarray_csv;

use csv::{ReaderBuilder, WriterBuilder};
use ndarray::{array, Array2};
use ndarray_csv::{Array2Reader, Array2Writer};
use std::error::Error;
use std::fs::File;

fn main() -> Result<(), Box<dyn Error>> {
    // Our 2x3 test array
    let array = array![[1, 2, 3], [4, 5, 6]];

    // Write the array into the file.
    {
        let file = File::create("test.csv")?;
        let mut writer = WriterBuilder::new().has_headers(false).from_writer(file);
        writer.serialize_array2(&array)?;
    }

    // Read an array back from the file
    let file = File::open("test.csv")?;
    let mut reader = ReaderBuilder::new().has_headers(false).from_reader(file);
    let array_read: Array2<u64> = reader.deserialize_array2((2, 3))?;

    // Ensure that we got the original array back
    assert_eq!(array_read, array);
    println!("{:?}", array_read);
    Ok(())
}

main();
[E0308] Error: mismatched types
    ╭─[command_37:1:1]
    │
 22 │         writer.serialize_array2(&array)?;
    │                ────────┬─────── ───┬──  
    │                        ╰──────────────── arguments to this method are incorrect
    │                                    │    
    │                                    ╰──── expected `&ArrayBase<OwnedRepr<_>, Dim<...>>`, found `&ArrayBase<OwnedRepr<...>, ...>`
    │ 
    │ Note: note: method defined here
────╯

[E0308] Error: `?` operator has incompatible types
    ╭─[command_37:1:1]
    │
 28 │     let array_read: Array2<u64> = reader.deserialize_array2((2, 3))?;
    │                                   ─────────────────┬────────────────  
    │                                                    ╰────────────────── expected `ArrayBase<OwnedRepr<u64>, Dim<...>>`, found `ArrayBase<OwnedRepr<_>, Dim<...>>`
────╯

Technical Coding Challenge

Coding Challenge

Coding Challenge Review