Optical Character Recognition (OCR) of the Beatrice Moxon Diary

An interactive report on the OCR development approach for the handwritten 1885 diary of Beatrice Moxon.

This project required a hybrid methodology to achieve high accuracy. We combined a pre-trained visual model for initial text recognition with a fine-tuned Large Language Model (LLM) to refine the output, correct errors, and apply specific diplomatic transcription rules. This section breaks down our four-step process.

System Architecture

1. Image Preprocessing

Enhancing scanned diary pages through noise reduction and contrast adjustment for optimal clarity.

2. Initial OCR Processing

Using a pre-trained model to generate a preliminary text transcription from the images.

3. LLM Fine-Tuning

Refining the raw text with a Large Language Model trained on historical context and specific handwriting.

4. Applying Rules

Enforcing diplomatic transcription rules to maintain the integrity of the original manuscript.
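As an illustration of step 1 only, the sketch below shows one simple way noise reduction and contrast adjustment could be implemented. The report does not say which filters were used, so the 3x3 median filter and the linear contrast stretch here are stand-ins chosen for clarity. The function names and the representation of a page as nested lists of 0-255 grey values are assumptions made for this example.

```python
from statistics import median_low


def median_filter(image):
    """Noise reduction: replace each pixel with the median of its 3x3 neighbourhood.

    Border pixels use whichever neighbours exist. `image` is assumed to be a
    non-empty list of equal-length rows of 0-255 grey values.
    """
    height, width = len(image), len(image[0])
    filtered = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns an actual pixel value, never an average.
            row.append(median_low(window))
        filtered.append(row)
    return filtered


def stretch_contrast(image):
    """Contrast adjustment: map the darkest pixel to 0 and the brightest to 255."""
    flat = [pixel for row in image for pixel in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [row[:] for row in image]
    return [[round((pixel - lo) * 255 / (hi - lo)) for pixel in row] for row in image]


def preprocess(image):
    """Denoise first, then stretch, so isolated specks do not set the contrast range."""
    return stretch_contrast(median_filter(image))
```

In practice an image library would replace these loops. The point of the sketch is the order of operations: removing specks before stretching keeps a single dark or bright outlier from defining the contrast range.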

Performance Evaluation

The core of our evaluation was comparing the accuracy of the OCR system before and after fine-tuning the Large Language Model. We measured both Character Error Rate (CER) and Word Error Rate (WER), where lower percentages indicate higher accuracy.

Character Error Rate (CER)

Measures the percentage of single characters incorrectly transcribed by the OCR system. It is a granular measure of accuracy.

$CER = \frac{E_{\text{character}}}{N} \times 100\%$

Word Error Rate (WER)

Measures the percentage of words incorrectly transcribed. This metric often correlates more closely with human-perceived readability.

$WER = \frac{E_{\text{word}}}{N} \times 100\%$

where

$E_{\text{character}}$ is the number of character errors (insertions, deletions, and substitutions),
$E_{\text{word}}$ is the number of word errors (insertions, deletions, and substitutions), and
$N$ is the total number of characters (for CER) or words (for WER) in the ground-truth text.
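The two formulas can be sketched in a few lines of Python. This is a minimal illustration, not the project's evaluation code: it assumes the error count is the Levenshtein (edit) distance between the ground truth and the OCR output, and that words are split on whitespace. The function names and the sample strings are invented for the example.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of insertions, deletions and substitutions turning
    `reference` into `hypothesis`. Works on strings or lists of words."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate in percent; N is the ground-truth character count."""
    if not reference:
        raise ValueError("ground-truth text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference) * 100


def wer(reference, hypothesis):
    """Word Error Rate in percent; N is the ground-truth word count."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("ground-truth text must not be empty")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words) * 100


# Made-up example (not taken from the diary): one wrong character in 14.
truth = "went to church"
ocr_output = "went to chureh"
print(f"CER = {cer(truth, ocr_output):.2f}%")  # 1 error / 14 characters
print(f"WER = {wer(truth, ocr_output):.2f}%")  # 1 error / 3 words
```

The example shows why the two metrics diverge: a single wrong character is a small fraction of the characters but makes one of only three words wrong. Because insertions count as errors, both rates can exceed 100% on very poor output.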

Diplomatic Transcription Rules

Struck-through Words

Kept in the text, enclosed within a `<del>` tag to preserve original edits.

Underlined Words

Maintained and enclosed within a `<u>` tag to indicate emphasis.

Doubtful Letters

A single questionable letter is marked with a bracketed question mark.

mi[?]alie

Illegible Text

If a single word or a longer span of text is entirely unreadable, it is replaced with `[UNSURE]` to indicate the uncertainty while preserving the layout.

[UNSURE]
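The four rules above can be expressed as a small rendering function. The sketch below is hypothetical: the `Token` structure, its field names, and the `render_token` and `render_line` functions are invented for illustration and are not the project's implementation. It also assumes that `[?]` stands in place of the doubtful letter and that tags are closed with `</del>` and `</u>`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """One word of the manuscript plus hypothetical editorial flags."""
    text: str
    struck: bool = False                   # word is struck through
    underlined: bool = False               # word is underlined
    illegible: bool = False                # word or span cannot be read at all
    doubtful_index: Optional[int] = None   # position of a single doubtful letter


def render_token(token):
    """Apply the diplomatic transcription rules to a single token."""
    if token.illegible:
        # Illegible text: the whole word or span becomes [UNSURE].
        return "[UNSURE]"
    text = token.text
    if token.doubtful_index is not None:
        # Doubtful letter: assumed here to be replaced by [?].
        i = token.doubtful_index
        text = text[:i] + "[?]" + text[i + 1:]
    if token.struck:
        # Struck-through words stay in the text inside a <del> tag.
        text = f"<del>{text}</del>"
    if token.underlined:
        # Underlined words are wrapped in a <u> tag.
        text = f"<u>{text}</u>"
    return text


def render_line(tokens):
    """Render a line of tokens, separated by single spaces."""
    return " ".join(render_token(token) for token in tokens)


line = [
    Token("Went"),
    Token("home", struck=True),
    Token("early", underlined=True),
    Token("", illegible=True),
    Token("mixalie", doubtful_index=2),
]
print(render_line(line))
```

Keeping the flags separate from the text means the same tokens could also be rendered without markup, for instance when computing CER or WER against a plain ground truth.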

Limitations & Future Directions

🖼️

Improve Resilience

Enhance image preprocessing to better handle degraded or damaged manuscripts, reducing dependency on high-quality scans.

⚙️

Optimize & Scale

Explore optimization strategies to reduce the computational resources required for LLM fine-tuning, enabling larger-scale processing.

🌐

Expand Capabilities

Broaden the system's training to handle multilingual documents, different historical periods, and a wider variety of handwriting styles.

Contact & Project Repository

For further information or collaboration inquiries, please reach out via email.

Deborah Lee-Talbot

📧 deborah.leetalbot@deakin.edu.au

Khondaker Tasrif Noor

📧 k.noor@research.deakin.edu.au

Scan for the GitHub Repository