Optimizing ID Card Text Extraction with Deep Learning

In this article, we detail the methods behind a Proof of Concept (POC) workflow for ID card text extraction. The solution was developed using five open source deep learning models implemented in Python.

We will offer insights into the use of instance segmentation, classification and Optical Character Recognition (OCR) techniques.

In a world that's rapidly going digital, the ability to extract information quickly and accurately from physical documents is becoming indispensable. Whether it's for customer onboarding in the banking sector, verifying identity in online services, or streamlining administrative tasks in various industries, ID card information extraction plays a pivotal role.

But as anyone who has manually entered data can attest, manual extraction is prone to errors, tedious, and time-consuming. With advances in machine learning and Computer Vision, we now have the tools to automate this process, making it faster, more accurate, and adaptable to a wide range of ID card formats.

This work was conducted by Ambroise Berthe, an R&D Computer Vision Engineer at Ikomia, in early 2022. The insights shared in this article draw on his comprehensive report.

Overview of the solution

The solution we've designed comprises a series of independent open source algorithms capable of:

Detecting and outlining all identification documents present in an image using an instance segmentation algorithm.
Cropping and straightening each detected object to ensure the text is always horizontal and readable from left to right.
Text detection: Identifying the positions of all words in the identification document.
Text recognition: Recognizing the characters in all the previously detected words.
Classifying these character strings based on their position and content to extract the main information, such as name(s), date, and place of birth.

Building the algorithm

In this section, we delve into the components that make up the identity document reading system. For this POC, our goal is to fine-tune the algorithms for several document variants, including:

French ID card (old and new versions)
French driving license (old and new versions)
Passport
French Resident card

Nevertheless, with the right dataset, this solution can be tailored to accommodate any kind of document.

Building the dataset

In today's world, the most effective methods for creating algorithms to perform complex tasks are based on deep learning. This supervised technique requires a substantial amount of reliable data to operate accurately. Therefore, dataset creation was the first step in this project.

Several datasets were needed, because the solution involves models for several different tasks.

Database for Document Segmentation

We needed a database to train a model capable of segmenting identification documents within an image. We chose segmentation over simple detection in order to extract precisely the image area to process. This decision was crucial because the photographed documents might be surrounded by extra text that could disrupt the subsequent steps of the algorithm:

*Example of the reverse side of a driving license with undesired text in the background.*

For each image a file was produced, detailing the class and the polygon outlining each identification document. This dataset comprises approximately 100 images per document type, totaling nearly 1100 annotated images.

*Example of labeled image for instance segmentation. Top right: recto of old French ID card, top left: verso of old French ID card, bottom: recto of an old French driving license.*

From this initial dataset, we cropped and straightened all the images. This created the image database that would be annotated for OCR and Key Information Extraction (KIE).

This image set was also used to train the model to straighten the images so that the text is horizontal.

Database for OCR & KIE

We decided to annotate text at the word level rather than at the sentence level. While annotating at the word level is more time-consuming, it allows for easier subsequent manipulation of the database. Specifically, it's simpler to merge detections than to split one.

Word-level annotation involves assigning to every word in the image:

A bounding box surrounding the word.
The corresponding character string.
The word's class (e.g., First Name, Last Name, Date of Birth, etc.).
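
As an illustration of why word-level annotations are easy to manipulate, here is a minimal sketch of one annotated word and of a merge operation. The `WordAnnotation` class, its field names, the label strings and the sample values are assumptions made for this example, not the actual annotation format used in the project.

```python
from dataclasses import dataclass


@dataclass
class WordAnnotation:
    """One annotated word (hypothetical schema)."""
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    text: str                       # character string of the word
    label: str                      # word class, e.g. "place_of_birth"


def merge(left: WordAnnotation, right: WordAnnotation) -> WordAnnotation:
    """Merge two words of the same class into a single detection."""
    if left.label != right.label:
        raise ValueError("only words of the same class can be merged")
    box = (
        min(left.box[0], right.box[0]),
        min(left.box[1], right.box[1]),
        max(left.box[2], right.box[2]),
        max(left.box[3], right.box[3]),
    )
    return WordAnnotation(box, f"{left.text} {right.text}", left.label)


# Made-up sample values.
first = WordAnnotation((100, 50, 160, 70), "SAINT", "place_of_birth")
second = WordAnnotation((168, 52, 250, 72), "ETIENNE", "place_of_birth")
merged = merge(first, second)
print(merged)
```

Merging only needs the union of the two boxes and a join of the two strings, whereas splitting a sentence-level box would require guessing where each word starts and ends.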

*Example of labeled image for OCR and KIE. A box is drawn over each word, colored according to its class. Purple: ‘other’, green: ‘surname’, yellow: ‘first name’, blue: ‘date of birth’ and orange: ‘place of birth’.*

Model selection

This project incorporates five open source deep learning models implemented in Python. Choosing the right algorithm for each task is crucial. Beyond its intrinsic performance, each algorithm must also be compatible with the others, both in the inputs and outputs it exchanges with them and in its execution environment.

Since Ikomia's AI team continually monitors scientific advancements in the field of Computer Vision, we keep track of the latest top-performing algorithms for each application domain (classification, object detection, segmentation, OCR, etc.).

Model for document segmentation

For this task, we tested two algorithms, SparseInst and YOLOv7, each offering a balance between execution time and performance. We aimed for algorithms fast enough for real-time use so that the entire process remains efficient.

SparseInst offers a slightly faster model, but its performance is far inferior to that of YOLOv7. We favored the latter since this step forms the foundation of the entire process and requires high precision.

Model for Image Straightening

Here, we employed two variants of ResNet. The first model identifies the four corners of an ID from its segmentation mask. The second determines the document's orientation—whether it's tilted at 0°, 90°, 180°, or 270°.

We favored ResNet over alternatives such as Deskew and OpenCV because it offers a highly adaptable architecture, producing compact models tailored to our specific needs.

*Illustration of extracting document corner coordinates from its segmentation mask. The segmentation mask is shown in yellow, and the quadrilateral formed by the four corners is in blue.*

Model for text detection

Numerous text detection models exist, so we focused on those published at prestigious venues, such as DBNet and DBNet++ (TPAMI'2022), Mask R-CNN (ICCV'2017), PANet (ICCV'2019), PSENet (CVPR'2019), TextSnake (ECCV'2018), DRRG (CVPR'2020) and FCENet (CVPR'2021).

We shortlisted DBNet and DBNet++ for their good performance on datasets with irregular texts. While DBNet++ is an improved version of DBNet and offers a slight performance boost, it was not compatible with ONNX conversion, which we use to optimize models for deployment. We therefore use the DBNet model.

Model for text recognition

As for text detection, we selected and tested two text recognition models, ABINet and SATRN. Even though ABINet was published two years after SATRN, SATRN outperforms it in both speed and accuracy on irregular texts, like those found on ID photos. For this reason, we selected SATRN, as implemented in the OpenMMLab framework, as our text recognition algorithm.

Model for information extraction

We chose SDMG-R for its smooth integration with SATRN. Using spatial relationships and the features of detected text regions, SDMG-R employs a dual-modality graph-based deep learning network for end-to-end KIE. This enables it to classify and link words effectively.

Document segmentation

After selecting and implementing the algorithm, we carried out multiple training sessions using the database created earlier and kept the best-performing model. The training was conducted on a powerful computation server equipped with GPUs, which are essential for completing training within a reasonable time frame.

Server specifications:

Processor: 48 cores
RAM: 188 GB
GPU Card: 2x NVidia Quadro RTX 6000
GPU Memory: 24 GB per card

Segmentation metrics:

Precision: 0.989
Recall: 0.996
mAP: 0.995

These excellent metrics make the model a robust foundation for the subsequent steps. Once trained, the model predicts a binary segmentation mask and the document type for each document it detects.

Image straightening

While the segmentations are accurate, it is still useful to regularize the segmentation masks. Under normal shooting conditions, the photographed documents appear as quadrilaterals. We therefore developed a deep learning model inspired by ResNet that converts any segmentation mask into a quadrilateral.

We trained this model unsupervised, using automatically generated binary masks.
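
One plausible way to generate such masks without any manual labeling is to draw four random corners and rasterize the quadrilateral they form. The sketch below illustrates that idea only: the function names, image size and jitter range are assumptions, and the article does not describe the actual generation procedure.

```python
import random


def quad_mask(corners, width, height):
    """Rasterize a convex quadrilateral into a binary mask (list of rows).

    Corners are (x, y) pixels ordered top-left, top-right, bottom-right,
    bottom-left, with the y axis pointing down.
    """
    edges = list(zip(corners, corners[1:] + corners[:1]))

    def inside(x, y):
        # A pixel is inside when it lies on the inner side of all four edges.
        return all(
            (x1 - x0) * (y - y0) - (y1 - y0) * (x - x0) >= 0
            for (x0, y0), (x1, y1) in edges
        )

    return [[int(inside(x, y)) for x in range(width)] for y in range(height)]


def random_sample(size=32, margin=8, jitter=4, rng=random):
    """Return (mask, corners) for a randomly perturbed square."""
    base = [(margin, margin), (size - margin, margin),
            (size - margin, size - margin), (margin, size - margin)]
    corners = [(x + rng.randint(-jitter, jitter), y + rng.randint(-jitter, jitter))
               for x, y in base]
    return quad_mask(corners, size, size), corners


# Seeded generator so that the example is reproducible.
mask, corners = random_sample(rng=random.Random(0))
```

Keeping the jitter small relative to the square's side keeps the shape convex, which is what the half-plane test in `inside` relies on.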

From document detection to bounding coordinates

At this stage, the program can detect documents in an image and provide the coordinates of each document's four corners. By determining the smallest bounding rectangle around these four points, we can then crop the image to obtain a well-defined snippet for each ID.
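
The cropping step can be sketched as follows, assuming the bounding rectangle is axis-aligned (the article does not say whether a rotated rectangle is used). To stay dependency-free, the image is represented as a plain list of pixel rows, and the function names are illustrative.

```python
def bounding_rectangle(corners):
    """Smallest axis-aligned rectangle (x0, y0, x1, y1) containing the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


def crop(image, rect):
    """Return the sub-image covered by rect, bounds included."""
    x0, y0, x1, y1 = rect
    return [row[x0:x1 + 1] for row in image[y0:y1 + 1]]


# Toy 6x8 "image" whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(8)] for r in range(6)]
corners = [(2, 1), (5, 2), (4, 4), (1, 3)]  # made-up corner coordinates

rect = bounding_rectangle(corners)
snippet = crop(image, rect)
```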

Ensuring correct text orientation

However, these snippets might not always be oriented correctly: the text could be sideways or upside down instead of running horizontally from left to right. The next model we implemented, largely inspired by ResNet, determines the ID's orientation. It predicts a number between 0 and 3, indicating how many 90° rotations are required to straighten the text.
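
Applying the predicted number of quarter turns is then straightforward. The sketch below works on an image stored as a list of rows and assumes the rotations are clockwise; the actual direction convention of the model is not stated in this article.

```python
def rotate_quarter_turns(image, k):
    """Rotate a row-major image clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Reversing the rows and transposing is a 90-degree clockwise turn.
        image = [list(row) for row in zip(*image[::-1])]
    return image


snippet = [[1, 2, 3],
           [4, 5, 6]]
predicted_turns = 1  # hypothetical model output
upright = rotate_quarter_turns(snippet, predicted_turns)
```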

Model training and evaluation outcomes

We trained this model using the cropped image database from the previous algorithms, which we manually straightened. We used 80% of the images for training and the remaining 20% for evaluation. The model achieved a score of 100% on the evaluation set.

Text detection for all ID types

After integrating MMOCR from OpenMMLab into Ikomia, we trained DBNet on a comprehensive dataset covering all document types. The aim was to develop a single text detection model suitable for all identification document types.

The highest score achieved after training, measured by the h-mean (the harmonic mean of precision and recall), was 0.86. This is a commendable result considering the reading challenges presented by some documents and the scores achieved by competitors at the ICDAR2019 competition.
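
For reference, the h-mean combines detection precision and recall into a single number. The helper below is a minimal sketch; the precision and recall values passed to it are illustrative only, since the article reports just the resulting h-mean.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not measurements from this project.
print(round(h_mean(0.90, 0.82), 3))
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values, so a detector cannot reach a high h-mean by favoring precision at the expense of recall or the reverse.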

Universal text recognition approach

Similar to text detection, we decided to reduce the number of models in use by training a single model for all document types.

Achieving high precision in text recognition

Word precision: 0.912
Character precision: 0.942
Character recall: 0.945

Notable observations on recognition accuracy

The obtained scores are quite good, but a word precision of 0.912 means that approximately 1 in 11 words will be misspelled. Furthermore, the longer the word, the higher the likelihood of it containing an error.
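
Both observations can be checked with a little arithmetic. The first part uses the reported word precision. The second part is a simplified model that assumes character errors are independent and uses a hypothetical per-character accuracy, so it shows the trend rather than the project's actual figures.

```python
word_precision = 0.912             # reported above
error_rate = 1 - word_precision
words_per_error = 1 / error_rate   # about 11.4, i.e. roughly 1 word in 11


def word_accuracy(char_accuracy: float, length: int) -> float:
    """Probability that every character is right, assuming independent errors."""
    return char_accuracy ** length


# Hypothetical per-character accuracy, for illustration only.
p = 0.98
for length in (4, 8, 12):
    print(length, round(word_accuracy(p, length), 3))
```

Under that assumption the chance of a fully correct word shrinks geometrically with its length, which matches the lower scores seen on long fields.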

Key Information Extraction

We attempted to consolidate the extraction of primary information into a single model, but the results were inconclusive. The chosen KIE model, SDMG-R, might require training on a broader dataset to handle diverse document types. Unlike in the previous steps, we therefore trained a distinct model for each document type, which yielded better results.

The achieved scores vary depending on the document type. Note that scores are deliberately rounded to the nearest 5%: only 20% of the dataset is used for evaluation, representing at most 20 samples (this can vary based on the document type), so a single sample can shift the score by 5 points or more.

KIE performance by document type

Document type	Score
Front of the new ID card	0.95
Front of the old ID card	(not listed)
Front of the new driving license	0.95
Back of the new driving license	(not listed)
Front of the old driving license	0.75
Passport	0.85
Front of the residence permit	0.9
Back of the residence permit	0.8

Observations on algorithm performance

We see that the algorithm excels with certain documents (e.g., ID cards) but struggles with others (e.g., old driving licenses). The challenges with old driving licenses could be due to:

The larger number of fields to extract: we targeted 8 key pieces of information on them, compared with only 4 for the 'Front of the new ID card'.
Inconsistencies in the positioning of information, with some licenses featuring handwritten text instead of printed text.

Additionally, the evaluation is based on perfect bounding boxes and character strings, not on the predictions made by the text detection and text recognition algorithms. Therefore, an evaluation under real-world conditions would likely yield different results with slightly lower performance, but this would not allow for an assessment of SDMG-R alone.

Post-processing

The SDMG-R model assigns a class to each word. The next step often involves merging boxes when the sought field consists of multiple words. We use a function that groups words based on their class, text, and geometry, inspired by the stitch_boxes_into_lines function from mmocr.
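
The sketch below shows one simple way such a grouping could work. It is neither the project's function nor mmocr's `stitch_boxes_into_lines`: the dictionary keys, the vertical-overlap rule and the `max_gap` threshold are assumptions chosen for illustration.

```python
def same_line(a, b):
    """True if boxes a and b overlap vertically by at least half the smaller height."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    return overlap >= 0.5 * min(a[3] - a[1], b[3] - b[1])


def stitch_words(words, max_gap=15):
    """Merge same-class words that sit next to each other on one line.

    Each word is a dict with "box" (x_min, y_min, x_max, y_max), "text"
    and "label". Returns a new list of merged fields.
    """
    fields = []
    for word in sorted(words, key=lambda w: w["box"][0]):  # left to right
        for field in fields:
            gap = word["box"][0] - field["box"][2]
            if (field["label"] == word["label"]
                    and same_line(field["box"], word["box"])
                    and gap <= max_gap):
                fb, wb = field["box"], word["box"]
                field["box"] = (min(fb[0], wb[0]), min(fb[1], wb[1]),
                                max(fb[2], wb[2]), max(fb[3], wb[3]))
                field["text"] += " " + word["text"]
                break
        else:
            fields.append(dict(word))  # start a new field
    return fields


# Made-up classified words.
words = [
    {"box": (100, 50, 160, 70), "text": "SAINT", "label": "place_of_birth"},
    {"box": (168, 52, 250, 72), "text": "ETIENNE", "label": "place_of_birth"},
    {"box": (100, 10, 180, 30), "text": "MARTIN", "label": "surname"},
]
fields = stitch_words(words)
```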

Handling multiple predictions and errors

It's common for SDMG-R to make mistakes and predict multiple solutions for the same field. To address these scenarios, we established systematic rules that are as natural as possible. This step also standardizes the algorithm's output according to the application case.

Model optimization

At this stage, all algorithms involving Deep Learning are executed using the Python framework PyTorch. Designed for Deep Learning researchers, PyTorch excels in GPU model training but isn't optimized for CPU inference.

Anticipating potential deployment on CPU-only machines, we chose to convert the most resource-intensive models into the ONNX format. This format, coupled with its inference engine, can reduce model size and computation time with minimal loss of accuracy. When feasible, this conversion offers speed gains ranging from 1.5 to 2 times.

For this solution, we chose to optimize only the text detection and recognition models. Combined, these two models account for over 90% of the entire algorithm's computation time.
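
A rough estimate of the resulting end-to-end gain follows from Amdahl's law. The sketch assumes that exactly 90% of the runtime is accelerated, that both models get the same 1.5x to 2x speed-up, and that the remaining steps are unchanged; these are simplifications, not measurements.

```python
def overall_speedup(fraction: float, speedup: float) -> float:
    """Amdahl's law: whole-pipeline speed-up when `fraction` of the runtime
    is accelerated by `speedup` and the rest is left unchanged."""
    return 1 / ((1 - fraction) + fraction / speedup)


for s in (1.5, 2.0):
    print(s, round(overall_speedup(0.9, s), 2))
```

Since the two optimized models dominate the runtime, most of the per-model gain carries over to the whole pipeline, which is why optimizing the other models would bring little extra benefit.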

Algorithm integration

We subsequently combined all the described algorithms into a single program in the form of an Ikomia algorithm. This algorithm can be used with Ikomia STUDIO and Ikomia API.

Output in standardized format

The output generated by our algorithm is provided in JSON format and contains a list with an information dictionary for each detected ID in the image.
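
The exact schema is not given in this article, so the following is only a hypothetical illustration of what such an output could look like. The key names and all values are invented for the example.

```python
import json

# Hypothetical output for an image containing a single ID card.
result = [
    {
        "document_type": "new_id_card_front",
        "surname": "MARTIN",
        "first_names": ["Jean", "Paul"],
        "date_of_birth": "01/01/1990",
        "place_of_birth": "LYON",
    }
]

payload = json.dumps(result, indent=2)
print(payload)
```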

Evaluation of the algorithm

The solution we're proposing was assessed on approximately 30 images for each document type.

Evaluation principle and comparison

The evaluation consists of applying the algorithm to each image in the evaluation dataset, saving the result in a file, and then comparing it with the ground truth. Comparisons are made between character strings. We propose two ways to score a comparison between character strings:

Strict score: Returns 1 if the character strings are perfectly identical and 0 otherwise.

NED score (Normalized Edit Distance): Returns a decimal value between 0 and 1 based on the proportion of characters that must be modified (added, deleted, or replaced) to transition from one character string to the other. The fewer modifications required, the higher the score: if the character strings are identical, the score is 1, whereas if they have no characters in common, the score is 0.
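
Both scores can be sketched in a few lines. The edit distance below is the classic Levenshtein distance, and normalizing by the length of the longer string is an assumption: the article does not specify the exact normalization used.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def strict_score(pred: str, truth: str) -> int:
    """1 if the strings are identical, 0 otherwise."""
    return 1 if pred == truth else 0


def ned_score(pred: str, truth: str) -> float:
    """1 minus the edit distance divided by the longer string's length."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - edit_distance(pred, truth) / longest


# A single misread character: the strict score drops to 0, the NED score stays high.
print(strict_score("MARTlN", "MARTIN"), round(ned_score("MARTlN", "MARTIN"), 2))
```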

Results

In this POC, our solution achieved an overall Strict score of 68% and an NED score of 82%. This suggests that if our algorithm were initially deployed as a pre-fill tool, a human verifier would need to correct, on average, fewer than 20% of the entered characters.

Strict score	NED score	Number of samples
0.68	0.82	1041

Score variability

Examining the outcomes, it becomes apparent that the scores differ significantly based on the document type and the specific field being extracted (not shown).

In general, fields related to locations and first names score lower than others. These fields often contain more characters, increasing the likelihood of text recognition errors. On the other hand, fields like dates or numbers (excluding the front of old licenses) showcase commendable scores.

For instance, the date of birth field on a passport yielded a Strict score of 0.93 and a NED score of 0.97 based on 30 samples. In contrast, the third 'first name' on the new French ID card had a Strict score of 0.23 and a NED score of 0.42 from 13 samples.

Evaluation summary

At the end of this prototyping phase, analyzing the evaluation results leads to the following observations:

The outcomes are highly encouraging for most of the documents, boasting recognition rates exceeding 90%.
Extraction in fields like ‘location’ and ‘first name’ can be enhanced through the inclusion of spaces, database improvements, and retraining of the text detection and recognition models.

Outcomes related to older driving licenses were not satisfactory. There's a need to expand the training database to better accommodate the complexity and variability inherent to these specific documents.

Future directions

Based on these findings, we suggest continuing the algorithm's development with a second R&D sequence to address the current weaknesses of the solution on certain document types. This phase would incorporate the findings of this study to implement improvements addressing the identified challenges.

The primary focus will involve refining the annotated database and making necessary adjustments in different stages of the processing chain.

Develop your own solution with Ikomia dev tools

The solution developed by Ikomia for this project demonstrates the potential of AI in automating repetitive tasks and improving efficiency in various processes.

The ID card information extraction solution was developed using open source algorithms implemented in Python and available on Ikomia HUB. Using the open source training algorithms, we effortlessly trained our custom deep learning models, managing all parameters with just a few lines of code.

Although we employed algorithms from various frameworks like TorchVision, YOLO, and OpenMMLab, the Ikomia API seamlessly chained the inputs and outputs of the inference models used to build the custom solution.

Consult the documentation for a comprehensive understanding of the API's features.
Explore the advanced algorithms available on Ikomia HUB.
Check out Ikomia STUDIO for an intuitive experience with the same capabilities as the API.
