Some official medical certificates contain handwritten notes by physicians. These notes are read and interpreted by people and subsequently entered into digital applications. Automating part of this process increases efficiency, but the automation must not introduce errors.
Can we build a model that accurately predicts what is written in handwritten text containing medical terms?
Business goal: Our goal is to speed up the full process of electronically registering handwritten terms by building a model that accurately predicts handwritten medical terms. This partly automated process should support the manual labour.
Solution and workflow: The flow from scanned document to deciphered handwriting includes the following steps.
Anonymise the document, which contains sensitive personal information
Process the image to remove the background, correct for scan artifacts, improve contrast and normalise pen strokes
Use the Google Vision API to set a detection baseline
Define a training set from labelled data
Train a convolutional neural network
Test the model on over 200,000 labelled handwriting samples
Run it on new data
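To make the image-processing step more concrete, here is a minimal sketch of two of its ingredients: improving contrast and separating ink from background. It is not the project's actual code. The function names, the fixed threshold of 128 and the tiny example scan are assumptions for illustration, and a real pipeline would work on full scans with an image library.

```python
from typing import List

# A grayscale image as rows of pixel values: 0 is black, 255 is white.
Image = List[List[int]]


def stretch_contrast(img: Image) -> Image:
    """Rescale pixels linearly so the darkest becomes 0 and the lightest 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        # A flat image has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def binarise(img: Image, threshold: int = 128) -> Image:
    """Map pixels darker than the threshold to ink (0), the rest to background (255).

    The default threshold is a hypothetical value, not one taken from the project.
    """
    return [[0 if p < threshold else 255 for p in row] for row in img]


# A made-up 3x3 scan: greyish paper with a faint vertical pen stroke.
scan = [
    [200, 190, 205],
    [195, 90, 200],
    [198, 100, 202],
]
clean = binarise(stretch_contrast(scan))
```

In this example the stroke pixels (90 and 100) end up as ink and the uneven grey paper becomes uniform white background.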
This was a typical machine-learning project, which required some exploration of data and algorithms to arrive at the best results. The technologies involved are data anonymisation, image processing, the Google Vision API, deep learning, named-entity resolution and natural-language processing.
Result: The manual labelling of the data turned out to be insufficiently accurate to provide exact statistics. What we do know is that over 60% of the handwritten words are matched exactly to their labels, and that for over 20% of certificates all handwritten words are matched exactly to their labels. These figures are a lower bound on the final statistics, because a wrong label causes a correct prediction to be counted as a rejection instead of an acceptance.
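The two figures can be computed as sketched below. This is an illustration only: the data layout, the function name and the example words are assumptions, not the project's data. The second example certificate shows why the figures are a lower bound: the prediction is plausibly right, but a misspelt label turns it into a miss.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: certificate id -> list of (predicted word, label word).
Results = Dict[str, List[Tuple[str, str]]]


def exact_match_rates(results: Results) -> Tuple[float, float]:
    """Return (word-level rate, certificate-level rate) of exact matches.

    Assumes at least one certificate and at least one word overall.
    """
    all_pairs = [pair for pairs in results.values() for pair in pairs]
    word_hits = sum(pred == label for pred, label in all_pairs)
    # A certificate only counts when every one of its words matches.
    cert_hits = sum(
        all(pred == label for pred, label in pairs)
        for pairs in results.values()
    )
    return word_hits / len(all_pairs), cert_hits / len(results)


example = {
    "cert-1": [("fracture", "fracture"), ("tibia", "tibia")],
    # The label "astma" is a labelling error, so this prediction is rejected.
    "cert-2": [("asthma", "astma"), ("chronic", "chronic")],
}
word_rate, cert_rate = exact_match_rates(example)
```

Here word_rate is 0.75 and cert_rate is 0.5. With a corrected label both would be 1.0.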
For each word and certificate, a confidence level for the accuracy is provided to support the person who is responsible for registering the words.
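One way such confidence levels could support the person registering the words is sketched below. The source does not say how the certificate-level confidence is derived or which cut-off is used, so both the "weakest word" rule and the 0.90 threshold are assumptions made for this example, as are all names in the code.

```python
from typing import List, Tuple

# Hypothetical cut-off below which a word is shown to a person for checking.
REVIEW_THRESHOLD = 0.90


def certificate_confidence(word_confidences: List[float]) -> float:
    """Assumed rule: a certificate is only as reliable as its least certain word."""
    return min(word_confidences)


def words_to_review(
    words: List[Tuple[str, float]], threshold: float = REVIEW_THRESHOLD
) -> List[str]:
    """Return the predicted words whose confidence falls below the threshold."""
    return [word for word, confidence in words if confidence < threshold]


# Made-up predictions for one certificate: (predicted word, confidence).
predictions = [("fracture", 0.98), ("tibia", 0.95), ("astma", 0.62)]

overall = certificate_confidence([c for _, c in predictions])
flagged = words_to_review(predictions)
```

In this example only "astma" is flagged, so the person can accept the other two words quickly and spend their attention on the uncertain one.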