I am following a course on Software Security for which one of the assignments is to write some basic programs in Rust. For one of these assignments I need to analyze a text-file and generate several statistics. One of these is a generated list of the ten most used words in the text.
I have written this program that performs all tasks in the assignment except for the word frequency statistic mentioned above, the program compiles and executes the way I expect:
extern crate regex;
use std::error::Error;
use std::fs::File;
use std::io::prelude::*;
use std::path::Path;
use std::io::BufReader;
use std::collections::BTreeMap;
use regex::Regex;
fn main() {
// Create a path to the desired file
let path = Path::new("text.txt");
let display = path.display();
let file = match File::open(&path) {
Err(why) => panic!("couldn't open {}: {}", display,
why.description()),
Ok(file) => file,
};
let mut wordcount = 0;
let mut averagesize = 0;
let mut wordsize = BTreeMap::new();
let mut words = BTreeMap::new();
for line in (BufReader::new(file)).lines() {
let re = Regex::new(r"([A-Za-z]+[-_]*[A-Za-z]+)+").unwrap();
for cap in re.captures_iter(&line.unwrap()) {
let word = cap.at(1).unwrap_or("");
let lower = word.to_lowercase();
let s = lower.len();
wordcount += 1;
averagesize += s;
*words.entry(lower).or_insert(0) += 1;
*wordsize.entry(s).or_insert(0) += 1;
}
}
averagesize = averagesize / wordcount;
println!("This file contains {} words with an average of {} letters per word.", wordcount, averagesize);
println!("\nThe number of times a word of a certain length was found.");
for (size, count) in wordsize.iter() {
println!("There are {} words of size {}.", count, size);
}
println!("\nThe ten most used words.");
let mut popwords = BTreeMap::new();
for (word, count) in words.iter() {
if !popwords.contains_key(count) {
popwords.insert(count, "");
}
let newstring = format!("{} {}", popwords.get(count), word);
let mut e = popwords.get_mut(count);
}
let mut i = 0;
for (count, words) in popwords.iter() {
i += 1;
if i > 10 {
break;
}
println!("{} times: {}", count, words);
}
}
I have a BTreeMap
(that I chose with these instructions), words
, that stores each word as key and its associated frequency in the text as value. This functionality works as I expect, but there I am stuck. I have been trying to find ways to sort the BTreemap
by value or find another data structure in Rust that is natively sorted by value.
I am looking for the correct way to achieve this data structure (a list of words with their frequency, sorted by frequency) in Rust. Any pointers are greatly appreciated!
If you only need to analyze a static dataset, the easiest way is to just convert your
BTreeMap
into aVec<T>
in the end and sort the latter (Playground):The vector contains the
(key, value)
pairs as tuple. To sort the vector, we have to usesort_by()
orsort_by_key()
. To sort the vector in decreasing order, I usedb.cmp(&a)
(as opposed toa.cmp(&b)
, which would be the natural order). But there are other possibilities to reverse the order of a sort.However, if you really need some data structure such that you have a streaming calculation, it's getting more complicated. There are many possibilities in that case, but I guess using some kind of priority queue could work out.