Optical Character Recognition (OCR) of the Beatrice Moxon Diary
An interactive report on the OCR development approach for the handwritten 1885 diary of Beatrice Moxon.
This project required a hybrid methodology to achieve high accuracy. We combined a pre-trained visual model for initial text recognition with a fine-tuned Large Language Model (LLM) to refine the output, correct errors, and apply specific diplomatic transcription rules. This section breaks down our four-step process.
System Architecture
1. Image Preprocessing
Enhancing scanned diary pages through noise reduction and contrast adjustment for optimal clarity.
2. Initial OCR Processing
Using a pre-trained model to generate a preliminary text transcription from the images.
3. LLM Fine-Tuning
Refining the raw text with a Large Language Model trained on historical context and specific handwriting.
4. Applying Rules
Enforcing diplomatic transcription rules to maintain the integrity of the original manuscript.
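As a rough illustration of step 1, the sketch below applies a 3x3 median filter (noise reduction) followed by a linear contrast stretch to a grayscale page stored as a list of rows of 0-255 integers. The report does not say which filters the project used, so both techniques, and the function names `median_filter`, `stretch_contrast`, and `preprocess`, are assumptions chosen only to make the idea concrete.

```python
from statistics import median


def median_filter(img):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            # median() may return a float for even-sized windows; truncate to int.
            row.append(int(median(window)))
        out.append(row)
    return out


def stretch_contrast(img):
    """Linearly rescale so the darkest pixel maps to 0 and the brightest to 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in img]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]


def preprocess(img):
    """Hypothetical step-1 pipeline: denoise first, then adjust contrast."""
    return stretch_contrast(median_filter(img))
```

Denoising before stretching keeps a single bright speck from setting the contrast range for the whole page. A real pipeline would more likely rely on an image library than on nested lists.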
Performance Evaluation
The core of our evaluation was comparing the accuracy of the OCR system before and after fine-tuning the Large Language Model. We measured both Character Error Rate (CER) and Word Error Rate (WER), where lower percentages indicate higher accuracy.
Character Error Rate (CER)
Measures the percentage of single characters incorrectly transcribed by the OCR system. It is a granular measure of accuracy.
Word Error Rate (WER)
Measures the percentage of words incorrectly transcribed. This metric often correlates more closely with human-perceived readability.
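$$\text{CER} = \frac{E_{\text{character}}}{N} \times 100\%, \qquad \text{WER} = \frac{E_{\text{word}}}{N} \times 100\%$$
Where:
- $E_{\text{character}}$ is the number of character errors (insertions, deletions, substitutions).
- $E_{\text{word}}$ is the number of word errors (insertions, deletions, substitutions).
- $N$ is the total number of characters (for CER) or words (for WER) in the ground truth text.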
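Both metrics can be computed from the Levenshtein edit distance, which counts the minimum number of insertions, deletions, and substitutions needed to turn the OCR output into the ground truth. The following is a minimal sketch, not the project's evaluation code; the function names are illustrative, and splitting words on whitespace is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate as a percentage of ground-truth characters."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate as a percentage of ground-truth words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    return 100 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For the ground truth "the cat sat" and the OCR output "the cat sit", one substituted character out of 11 gives a CER of about 9.1%, while one wrong word out of 3 gives a WER of about 33.3%. Because insertions count as errors, either rate can exceed 100% on very poor output.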
Diplomatic Transcription Rules
Struck-through Words
Kept in the text, enclosed within a `<del>` tag to preserve original edits.
<del>word</del>
Underlined Words
Maintained and enclosed within a `<u>` tag to indicate emphasis.
<u>word</u>
Doubtful Letters
A single questionable letter is marked with a bracketed question mark.
mi[?]alie
Illegible Text
If a single word or a longer span of text is entirely unreadable, it is replaced with `[UNSURE]` to indicate the uncertainty while preserving the layout.
[UNSURE]
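One way to check that a transcribed line follows these four conventions is to count each kind of marker and confirm that every opening tag has a matching close. The sketch below is an assumed helper, not part of the project's pipeline; the names `PATTERNS`, `summarize_markup`, and `tags_balanced` are hypothetical, and the sample line in the usage note is invented for illustration.

```python
import re

# One pattern per diplomatic transcription rule described above.
PATTERNS = {
    "struck": re.compile(r"<del>(.*?)</del>"),
    "underlined": re.compile(r"<u>(.*?)</u>"),
    "doubtful_letter": re.compile(r"\[\?\]"),
    "illegible": re.compile(r"\[UNSURE\]"),
}


def summarize_markup(line):
    """Count how many times each rule's marker appears in a transcribed line."""
    return {name: len(pattern.findall(line)) for name, pattern in PATTERNS.items()}


def tags_balanced(line):
    """True if every <del> and <u> tag has the same number of opens and closes."""
    return all(
        line.count(f"<{tag}>") == line.count(f"</{tag}>")
        for tag in ("del", "u")
    )
```

For example, `summarize_markup("I saw <del>him</del> <u>today</u> at mi[?]alie [UNSURE]")` reports one occurrence of each marker, and `tags_balanced("<del>word")` returns `False` because the closing tag is missing.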
Limitations & Future Directions
Improve Resilience
Enhance image preprocessing to better handle degraded or damaged manuscripts, reducing dependency on high-quality scans.
Optimize & Scale
Explore optimization strategies to reduce the computational resources required for LLM fine-tuning, enabling larger-scale processing.
Expand Capabilities
Broaden the system's training to handle multilingual documents, different historical periods, and a wider variety of handwriting styles.
Contact & Project Repository
For further information or collaboration inquiries, please reach out via email.
Deborah Lee-Talbot
📧 deborah.leetalbot@deakin.edu.au
Khondaker Tasrif Noor
📧 k.noor@research.deakin.edu.au
Scan for the GitHub Repository