Making a typo detection tool in Rust
If you are not a native English speaker and write a lot, you might need this tool. When I write blog posts I make typos all the time. Google Translate can actually help: paste a word into it and you get a suggestion for the correct spelling.
But writing and then fixing typos by round-tripping through Google Translate is painful! Let's build a command-line application for this instead. I have been learning Rust recently, and I believe this would be a good use case for it.
My writing workflow
Let me tell you a little bit about my writing workflow, especially for blog posts on this website. I write posts in Markdown, using vscode and sometimes vim. Before publishing a post I usually check for words I have misspelled. To speed that up, I want to run a single command after finishing a post that scans the text and prints every misspelled word along with the filename and its position in the document.
Here is what I have in mind:
tpd --suggest ./posts/typo-detector-in-rust.md
# Result
"wrongs" => ./posts/typo-detector-in-rust.md:20:9 wrong
Using this format, in vscode we can just click on the path ./posts/typo-detector-in-rust.md:20:9
and vscode will open the file and move the cursor to that exact position.
And here is a simple demo of the tool.
Algorithm
Before we start working on it, we need to define the features. Here is the list I can think of right now:
- A dictionary of English words
- Suggesting the correct word
- Watching for file changes
Dictionary
In 2006, Google Research released the data from the n-gram frequency analysis they use for statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and more. You can read more details about it here.
Luckily, someone uploaded the data to GitHub here. We will use this data as the starting point for our dictionary.
Suggestion
JetBrains IDEs actually have spelling correction built into the editor, but I don't want to open a JetBrains IDE just to write a blog post; I am very comfortable with vscode and vim. So let's just borrow JetBrains' implementation of spelling correction.
After some research I found a class called EditDistance
in the JetBrains open-source IDE. It uses the Damerau-Levenshtein distance
together with the Wagner-Fischer algorithm
to build the list of suggested words.
But we are using Rust here, not Java! Don't worry, let's write the algorithm in Rust.
Detecting File Changes
It would be nice if the tool could automatically detect file changes while we write, so we don't need to re-run the command every time we want to check for typos.
To detect file changes we will use inotify, a feature of the Linux kernel, and to speed things up we will just use a Rust wrapper around it.
Enough, Show Me the Code
Okay, we now have everything we need to build the typo detector. Let's start by loading the dictionary. We will use a HashSet
so we don't need a loop or a binary search to look up a word. One thing to keep in mind: we also need to be able to add custom words to the dictionary.
use std::collections::HashSet;

pub struct Dictionary {
    data: HashSet<String>,
}

impl Dictionary {
    pub fn new() -> Self {
        // words.txt is embedded into the binary at compile time.
        let words = include_str!("words.txt");
        let mut dictionary: HashSet<String> = HashSet::new();
        for word in words.lines() {
            dictionary.insert(word.to_string());
        }
        Self { data: dictionary }
    }

    pub fn add(&mut self, word: String) {
        self.data.insert(word);
    }

    pub fn to_vec(&self) -> Vec<&String> {
        self.data.iter().collect()
    }

    pub fn get(&self, key: &str) -> Option<&String> {
        self.data.get(key)
    }
}
Now we can read the file and clean up each word by stripping these characters: [' " ! ? . , “ ”].
pub fn sanitize(word: String) -> String {
    if word.len() <= 1 {
        return word;
    }
    word.replace('"', "")
        .replace('\'', "")
        .replace('!', "")
        .replace('?', "")
        .replace('.', "")
        .replace('“', "")
        .replace('”', "")
        .replace(',', "")
}
We will also skip words that contain non-alphabetical characters.
trait CharExt {
    fn accept(self) -> bool;
}

impl CharExt for char {
    fn accept(self) -> bool {
        matches!(self, 'a'..='z' | 'A'..='Z' | '\'')
    }
}
Now we can easily check every word: if it is not in the dictionary, it is a typo.
pub fn scan_words(source: &str, words: &Dictionary, file_path: &str) {
    let mut line_number = 0;
    for line in source.lines() {
        line_number += 1;
        let mut column = 1;
        for child in line.split(' ') {
            // Remember where this word starts before advancing past it.
            let start = column;
            column += child.len() + 1;
            let target = sanitize(child.to_lowercase());
            if !target.chars().all(CharExt::accept) || target.is_empty() {
                continue;
            }
            if words.get(&target).is_none() {
                println!(
                    "\x1b[0;31m{}\x1b[m => {}:{}:{}",
                    child, file_path, line_number, start
                );
            }
        }
    }
}
Suggestion
To suggest a correction for a misspelled word we will use an edit distance algorithm. Put simply, the edit distance between two words is the number of single-character edits needed to turn one into the other.
For example, these two words have an edit distance of 1:
wrong
wrongs
And these two have an edit distance of 2:
wrong
wongs
Here is how it looks in Rust:
use std::cmp::min;
use std::collections::HashMap;

pub fn damerau_levenshtein(word: &str, target: &str) -> usize {
    let (word, target): (Vec<char>, Vec<char>) =
        (word.chars().collect(), target.chars().collect());
    let a_elems = word.as_slice();
    let b_elems = target.as_slice();
    let a_len = a_elems.len();
    let b_len = b_elems.len();

    if a_len == 0 {
        return b_len;
    }
    if b_len == 0 {
        return a_len;
    }

    // The (a_len + 2) x (b_len + 2) table is stored as one flat vector.
    let width = a_len + 2;
    let mut distances = vec![0; (a_len + 2) * (b_len + 2)];
    let max_distance = a_len + b_len;
    distances[0] = max_distance;

    for i in 0..(a_len + 1) {
        distances[flat_index(i + 1, 0, width)] = max_distance;
        distances[flat_index(i + 1, 1, width)] = i;
    }
    for j in 0..(b_len + 1) {
        distances[flat_index(0, j + 1, width)] = max_distance;
        distances[flat_index(1, j + 1, width)] = j;
    }

    // Tracks the last row where each character of `word` appeared,
    // which is what makes transpositions detectable.
    let mut elems: HashMap<char, usize> = HashMap::with_capacity(64);

    for i in 1..(a_len + 1) {
        let mut db = 0;
        for j in 1..(b_len + 1) {
            let k = match elems.get(&b_elems[j - 1]) {
                Some(&value) => value,
                None => 0,
            };
            let insertion_cost = distances[flat_index(i, j + 1, width)] + 1;
            let deletion_cost = distances[flat_index(i + 1, j, width)] + 1;
            let transposition_cost =
                distances[flat_index(k, db, width)] + (i - k - 1) + 1 + (j - db - 1);
            let mut substitution_cost = distances[flat_index(i, j, width)] + 1;
            if a_elems[i - 1] == b_elems[j - 1] {
                db = j;
                substitution_cost -= 1;
            }
            distances[flat_index(i + 1, j + 1, width)] = min(
                substitution_cost,
                min(insertion_cost, min(deletion_cost, transposition_cost)),
            );
        }
        elems.insert(a_elems[i - 1], i);
    }

    distances[flat_index(a_len + 1, b_len + 1, width)]
}
fn flat_index(i: usize, j: usize, width: usize) -> usize {
    j * width + i
}
Now that we have the algorithm, we need to feed it the dictionary data to check edit distances against:
pub fn search_similar(words: &[&String], target: &str, score: usize) -> Option<String> {
    // Length pre-filter: only words close in length to the target can have
    // the requested edit distance, so skip the rest cheaply.
    let margin = if score == 2 { 0 } else { 1 };
    let result: Vec<&str> = words
        .iter()
        .filter(|item| {
            item.len() + margin >= target.len() && item.len() <= target.len() + margin
        })
        .filter(|item| damerau_levenshtein(item, target) == score)
        .map(|item| item.as_str())
        .collect();
    if result.is_empty() {
        None
    } else {
        Some(result.join(","))
    }
}
Now, for every misspelled word, we look for suggestions at edit distance 1, falling back to distance 2:
match search_similar(&words.to_vec(), &target, 1) {
    Some(result) => println!(
        "\x1b[0;31m{}\x1b[m => {}:{}:{} \x1b[0;32m{}\x1b[m",
        child, file_path, line_number, column, result
    ),
    _ => match search_similar(&words.to_vec(), &target, 2) {
        Some(result) => println!(
            "\x1b[0;31m{}\x1b[m => {}:{}:{} \x1b[0;32m{}\x1b[m",
            child, file_path, line_number, column, result
        ),
        _ => println!(
            "\x1b[0;31m{}\x1b[m => {}:{}:{}",
            child, file_path, line_number, column
        ),
    },
}
Custom Dictionary
As programmers we use words that are not in the English dictionary, so we need to support a custom dictionary. We will store it at $HOME/.config/tpd/dictionary.
use std::io::{Read, Write};

match home::home_dir() {
    Some(mut path) => {
        path.push(".config");
        path.push("tpd");
        path.push("dictionary");
        let mut file = std::fs::OpenOptions::new()
            .write(true)
            .read(true)
            .create(true)
            .append(true)
            .open(path)
            .unwrap();
        if *dictionary {
            // --dictionary flag: print the stored custom words.
            let mut data = String::new();
            if file.read_to_string(&mut data).is_ok() {
                println!("{}", data);
            } else {
                println!("Unable to read dictionary");
            }
        } else if !word.is_empty() {
            // Otherwise append the given word to the custom dictionary.
            writeln!(file, "{}", word).unwrap();
            println!("Stored '{}' in the dictionary!", word);
        }
    }
    None => {}
}
And then load it into the dictionary:
match home::home_dir() {
    Some(mut path) => {
        path.push(".config");
        path.push("tpd");
        path.push("dictionary");
        if let Ok(file) = std::fs::read_to_string(path) {
            for word in file.lines() {
                words.add(word.to_string());
            }
            scan_words(&source, &words, show_suggest, &file_path);
            std::process::exit(0);
        }
    }
    None => {}
}
Detect file changes
One last thing: let's make the app detect file changes so we don't need to re-run it every time we want to check for typos. We will use the notify crate, which wraps inotify on Linux.
fn watch(file_path: &str, should_watch: bool, suggest: bool) -> notify::Result<()> {
    // Always run one scan up front.
    detector::detect_typo(Path::new(file_path).to_path_buf(), suggest);
    if !should_watch {
        return Ok(());
    }

    let (tx, rx) = channel();
    // Debounce events so a burst of writes triggers a single rescan.
    let mut watcher: RecommendedWatcher = Watcher::new(tx, Duration::from_millis(300))?;
    watcher.watch(Path::new(&file_path), RecursiveMode::NonRecursive)?;

    loop {
        let event = match rx.recv() {
            Ok(event) => event,
            Err(err) => {
                println!("Watcher channel dropped unexpectedly: {}", err);
                break;
            }
        };
        match event {
            DebouncedEvent::Rename(_, path)
            | DebouncedEvent::Write(path)
            | DebouncedEvent::Create(path)
            | DebouncedEvent::Chmod(path) => {
                println!("Processing file changes!");
                detector::detect_typo(path, suggest)
            }
            _ => (),
        }
    }
    Ok(())
}
So that's it for now. You can check the full source code here.