How to extract text from a PDF file in Rust

Working with PDF files in Rust has become much easier, and there are several libraries that can handle them. In this article, we are going to focus on the lopdf library.

Install the lopdf dependency

To read a PDF file, we are going to use a Rust crate called lopdf. It can modify PDF files as well as read them, but this time we will focus on reading a PDF file and extracting text from it.

Let's create a new project with Cargo.

cargo new read-pdf

lopdf has two parser features: nom_parser and pom_parser. The nom_parser is the faster of the two, but the commands below enable pom and pom_parser, so that is the feature set used for the examples in this article.

Let's install it.

cargo add lopdf -F pom -F pom_parser

Or you can add this line to the [dependencies] section of your Cargo.toml file.

lopdf = { version = "0.30.0", features = ["pom", "pom_parser"] }

Reading PDF files

To get started with lopdf, let's try to get the total number of pages in a PDF file using the get_pages() method:

use lopdf::Document;

fn main() {
    let file = "2303.12712.pdf";
    let doc = Document::load(file);
    match doc {
        Ok(document) => {
            let pages = document.get_pages();
            println!("Total pages: {}", pages.len());
        }
        Err(err) => {
            eprintln!("{err}")
        }
    }
}
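
The value returned by get_pages() is a map from page number to object ID, not a plain list, which is why calling len() on it gives the page count. As a rough illustration of that shape, here is a minimal sketch that uses only the standard library. The ObjectId alias is a stand-in for lopdf's own type (a pair of object number and generation number), and describe_pages is a hypothetical helper, not part of lopdf.

```rust
use std::collections::BTreeMap;

/// Stand-in for lopdf's object ID: (object number, generation number).
type ObjectId = (u32, u16);

/// Hypothetical helper: one human-readable line per page, in page order.
fn describe_pages(pages: &BTreeMap<u32, ObjectId>) -> Vec<String> {
    pages
        .iter()
        .map(|(page_number, (object, generation))| {
            format!("page {page_number} -> object {object} {generation}")
        })
        .collect()
}
```

With a real document, the map returned by document.get_pages() could be passed to a function like this in place of a hand-built one.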

Extracting text from specific pages

To extract text from a specific page, we can use the extract_text() method, which takes a slice of page numbers (page numbers start at one). Here's an example:

use lopdf::Document;

fn main() {
    let file = "2303.12712.pdf";
    let doc = Document::load(file);
    match doc {
        Ok(document) => {
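            // extract_text returns a Result, so handle the error case too
            match document.extract_text(&[1]) {
                Ok(text) => println!("Text on page 1: {text}"),
                Err(err) => eprintln!("{err}"),
            }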
        }
        Err(err) => {
            eprintln!("{err}")
        }
    }
}

This will extract text from the first page of the PDF file and print it to the console.
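
Since extract_text() accepts a slice, we can also ask for several pages in one call. Asking for a page that is not in the document will most likely give us an error, so it can help to filter the list first. Here is a minimal sketch of that idea. The pages_in_range function is a hypothetical helper written for this article, not something lopdf provides.

```rust
/// Hypothetical helper: keep only the page numbers that fall within
/// 1..=total_pages, preserving the order in which they were requested.
fn pages_in_range(requested: &[u32], total_pages: u32) -> Vec<u32> {
    requested
        .iter()
        .copied()
        .filter(|page| (1..=total_pages).contains(page))
        .collect()
}
```

With lopdf, total_pages could come from document.get_pages().len() as u32, and the filtered vector could then be passed along as document.extract_text(&pages).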

Extracting text from all pages

To extract all text from a PDF file, we can use a combination of get_pages() and extract_text() methods. Here's an example:

use lopdf::Document;

fn main() {
    let file = "example.pdf";

    match Document::load(file) {
        Ok(document) => {
            let pages = document.get_pages();
            let mut texts = Vec::new();

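            // get_pages() is keyed by page number, so iterate over the keys
            for &page_number in pages.keys() {
                let text = document.extract_text(&[page_number]);
                texts.push(text.unwrap_or_default());
            }

            // Page 42 lives at index 41; get() avoids a panic on shorter files
            match texts.get(41) {
                Some(text) => println!("Text on page {}: {}", 42, text),
                None => eprintln!("The document has fewer than 42 pages"),
            }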
        }
        Err(err) => eprintln!("Error: {}", err),
    }
}

This will extract the text of every page of the PDF file and store it in a vector. You can access the text of a specific page by getting the element at the page number minus one, since page numbers start at one and vector indices start at zero.
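
If the minus one bookkeeping feels error-prone, another option is to key the extracted text by page number instead of by vector index. Here is a minimal sketch of that approach. The collect_page_texts function is a hypothetical helper, and it takes the extraction step as a closure so that it depends only on the standard library.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: run `extract` for each page number and key the
/// result by page number. Pages whose extraction fails get an empty string.
fn collect_page_texts<E>(
    page_numbers: impl IntoIterator<Item = u32>,
    extract: impl Fn(u32) -> Result<String, E>,
) -> BTreeMap<u32, String> {
    page_numbers
        .into_iter()
        .map(|number| (number, extract(number).unwrap_or_default()))
        .collect()
}
```

With lopdf, the page numbers could come from document.get_pages().keys().copied() and the closure could be |n| document.extract_text(&[n]). Looking up texts.get(&42) then returns None for a missing page instead of panicking.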

Conclusion

In this article, we explored how to extract text from PDF files using the lopdf library in Rust. While there are alternative libraries available, lopdf is a versatile solution that can perform both read and write operations on PDF files.