USPTO Image Recognition Challenge Post-Contest Analysis

This competition was sponsored by the US Patent and Trademark Office, together with the NASA Tournament Lab, and was hosted by TopCoder.

The Problem

You will be provided with an image of a patent drawing page.

Your task is to extract the following data from the given image:
for each figure, its location and title;
for each part label, its location and text.

Example: consider IMAGE A of a patent drawing page. It contains 3 figures named "2A", "2B" and "2C". They have 14, 8 and 8 part labels, respectively, a total of 30 part labels for the whole drawing page. The expected solution is shown in IMAGE B, with figures in blue and part labels in green.


IMAGE A

IMAGE B

My Approach

The first thing that came to my mind to solve this problem was Machine Learning. I thought that some logistic regression or neural network algorithm would work just fine. But when I started working on the problem I had only three weeks left and realized that it wasn't enough time for these kinds of algorithms.

So instead of making the computer learn from scratch, I needed to transmit some of my own knowledge to it. To do that, I first had to understand how I myself find the figures and the part labels in each image. In this text I'm going to use five example patent drawings to better illustrate the approach. These are the five drawings with their respective expected solutions.


IMAGE 1A

IMAGE 1B

IMAGE 1C

IMAGE 1D

IMAGE 1E
 

FindFiguresAndParts();

I realized that to find the different figures in an image, I search for black pixels that are connected to each other. So my first approach was to look at every black pixel and find all the other black pixels connected to it. To do this I used a Breadth-First Search (BFS) algorithm; you can see its results in the following images:

Each different color represents a different set of connected pixels found by the BFS algorithm.


IMAGE 2A

IMAGE 2B

IMAGE 2C

IMAGE 2D

IMAGE 2E
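To make the step concrete, here is a minimal sketch in C++ of that connected-pixel search. The image encoding, the black/white threshold and all names are my illustrative assumptions, not the exact contest code:

#include <queue>
#include <utility>
#include <vector>
using namespace std;

// Label each set of connected black pixels with BFS. `image` holds gray
// values (0 = black); pixels darker than `threshold` count as black.
vector<vector<int>> labelComponents(const vector<vector<int>>& image, int threshold) {
    int h = image.size(), w = image[0].size();
    vector<vector<int>> label(h, vector<int>(w, -1));  // -1 = unvisited
    int components = 0;
    for (int sy = 0; sy < h; sy++)
        for (int sx = 0; sx < w; sx++) {
            if (image[sy][sx] >= threshold || label[sy][sx] != -1) continue;
            label[sy][sx] = components;  // start a new connected set here
            queue<pair<int,int>> q;
            q.push({sy, sx});
            while (!q.empty()) {
                auto [y, x] = q.front(); q.pop();
                for (int dy = -1; dy <= 1; dy++)       // visit the 8 neighbors
                    for (int dx = -1; dx <= 1; dx++) {
                        int ny = y + dy, nx = x + dx;
                        if (ny < 0 || ny >= h || nx < 0 || nx >= w) continue;
                        if (image[ny][nx] < threshold && label[ny][nx] == -1) {
                            label[ny][nx] = components;
                            q.push({ny, nx});
                        }
                    }
            }
            components++;
        }
    return label;  // each label is one "color" in the images above
}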
 


As you can see, the algorithm seems to work pretty well. After this I needed to differentiate between figures and numbers, but that was fairly easy because numbers are much smaller than figures, so I used the size of the smallest figure in the training examples as a threshold. This was the result:

Blue are figures, red are numbers.

IMAGE 3A

IMAGE 3B

IMAGE 3C

IMAGE 3D

IMAGE 3E
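In code, the figure-versus-number test above can be as simple as comparing each component's bounding-box area against the threshold taken from training. A tiny sketch, with made-up names:

struct Box { int x1, y1, x2, y2; };

// Components at least as large as the smallest training figure are
// figures (blue); everything smaller is treated as a number (red).
bool isFigure(const Box& b, long long minFigureArea) {
    long long area = (long long)(b.x2 - b.x1 + 1) * (b.y2 - b.y1 + 1);
    return area >= minFigureArea;
}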
 

MergeIntersectingFigures(); AddPartsToClosestFigure(); MergeIntersectingFigures();

As you can see, I wasn't even close to the expected result yet. So the following step was to merge the intersecting figures, add the closest part-label numbers to each figure, and then merge the intersecting figures again. To add the closest numbers (red boxes) to each figure (blue boxes), I searched for the figure and number with the smallest distance between them and added that number to the figure's bounding box, then repeated the process until all numbers were added. The optimal way to do this would have been to write a dynamic programming algorithm that minimizes the sum of all distances from the numbers to the figures they are assigned to, but I don't think that would have improved the performance of the algorithm, and it was harder to implement, so I discarded it.
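A sketch of that greedy loop, assuming the bounding boxes are already known. Box, boxDistance and the function name are illustrative helpers, not the contest code:

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>
using namespace std;

struct Box { int x1, y1, x2, y2; };

// Gap between two axis-aligned boxes (0 if they intersect).
double boxDistance(const Box& a, const Box& b) {
    int dx = max({0, b.x1 - a.x2, a.x1 - b.x2});
    int dy = max({0, b.y1 - a.y2, a.y1 - b.y2});
    return sqrt((double)dx * dx + (double)dy * dy);
}

// Repeatedly take the figure/number pair with the smallest distance and
// grow that figure's bounding box to cover the number.
void addPartsToClosestFigure(vector<Box>& figures, vector<Box>& numbers) {
    while (!numbers.empty() && !figures.empty()) {
        int bestF = 0, bestN = 0;
        double best = numeric_limits<double>::max();
        for (int f = 0; f < (int)figures.size(); f++)
            for (int n = 0; n < (int)numbers.size(); n++) {
                double d = boxDistance(figures[f], numbers[n]);
                if (d < best) { best = d; bestF = f; bestN = n; }
            }
        figures[bestF].x1 = min(figures[bestF].x1, numbers[bestN].x1);
        figures[bestF].y1 = min(figures[bestF].y1, numbers[bestN].y1);
        figures[bestF].x2 = max(figures[bestF].x2, numbers[bestN].x2);
        figures[bestF].y2 = max(figures[bestF].y2, numbers[bestN].y2);
        numbers.erase(numbers.begin() + bestN);
    }
}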


IMAGE 4A

IMAGE 4B

IMAGE 4C

IMAGE 4D

IMAGE 4E
 

As you can see in the images, this change improved the bounding boxes of the majority of the figures, but it made them worse for others. To fix those problems I was going to need a character recognition algorithm.

Character Recognition

Again, neural networks came to my mind, but I didn't have much time, so I went with something simpler: a nearest neighbor algorithm. Even with this simpler algorithm, though, I was going to need a lot of classified examples of characters. I had two ideas for creating a big database of classified characters:
1. Create artificial characters using the different fonts on my machine.
2. Use the numbers detected by my current algorithm, classify a couple by hand and then automatically classify the rest using the nearest neighbor algorithm.
This time I decided to go with the harder option, number two, because that database would be similar to the real data, though I would have liked to have had time to test the other method as well.

The source code size limit was 1 MB, so I did the following to fit as many classified characters into the code as possible without losing too many features:

I reduced each character to 16x16 pixels, and instead of storing 256 gray levels per pixel I used a single bit: one for black and zero for white, using the same threshold as in the BFS algorithm to decide whether a pixel is black or white.

The end result was this:

0000001111111111
0000011111111111
0000011100000000
0000111000000000
0000110000000000
0000110000000000
0001110111100000
0001111111111000
0011100000011110
0000000000001110
0000000000001110
0000000000001110
1111000000011100
1110000000111000
1111000001110000
1111111111110000


This is the character for the digit five. To store these compressed characters I used an array of 16 unsigned short int values, where each element of the array represents one row of the character. For example, the corresponding array for this five is the following:

{1023,2047,1792,3584,3072,3072,7648,8184,14366,14,14,14,61468,57400,61552,65520}
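The packing itself is straightforward. Here is a sketch, with the leftmost pixel stored in the most significant bit, which is what makes the first row 0000001111111111 come out as 1023 (the function and parameter names are my own):

// Pack a 16x16 binary character (1 = black) into 16 unsigned shorts,
// one row per element, leftmost pixel in the most significant bit.
void packCharacter(const int pixel[16][16], unsigned short rows[16]) {
    for (int y = 0; y < 16; y++) {
        rows[y] = 0;
        for (int x = 0; x < 16; x++)
            if (pixel[y][x])
                rows[y] |= (unsigned short)(1u << (15 - x));
    }
}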

The good thing about this representation was that I could perform the nearest neighbor comparisons very quickly using bit operations.
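Concretely, the distance between two packed characters is just the number of differing bits, computed row by row with XOR and a population count. A sketch of that comparison:

// Hamming distance between two packed characters: XOR each pair of rows
// and count the set bits. The nearest neighbor is simply the stored
// character with the smallest distance.
int characterDistance(const unsigned short a[16], const unsigned short b[16]) {
    int dist = 0;
    for (int y = 0; y < 16; y++) {
        unsigned short diff = (unsigned short)(a[y] ^ b[y]);  // bits that disagree
        while (diff) { dist++; diff &= diff - 1; }            // Kernighan's popcount
    }
    return dist;
}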

While this is not a state-of-the-art character recognition algorithm, it works quite well for this specific kind of image. I would say it was intentionally overfitted to the patent images.

GroupParts();

Now that I had a character recognition algorithm, I needed to group the numbers that were close to each other, because those are the part labels, and remove the other small components that were marked with a red bounding box but aren't numbers. To do this I grouped red bounding boxes of similar size that were close to each other, and then removed the ones whose characters didn't match any of the data used for the nearest neighbor algorithm well. I also used the character recognition algorithm to detect the three letters FIG, which mark the title of each figure.
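As a rough sketch of that grouping test, two number boxes can be joined when their heights are similar and the gap between them is small; the thresholds and names below are illustrative assumptions:

#include <algorithm>
using namespace std;

struct Box { int x1, y1, x2, y2; };

// Two number boxes belong to the same part label if their heights are
// similar and the gap between them is small.
bool shouldGroup(const Box& a, const Box& b, int maxGap, double maxHeightRatio) {
    int ha = a.y2 - a.y1 + 1, hb = b.y2 - b.y1 + 1;
    double ratio = ha > hb ? (double)ha / hb : (double)hb / ha;
    if (ratio > maxHeightRatio) return false;        // sizes too different
    int gapX = max({0, b.x1 - a.x2, a.x1 - b.x2});   // horizontal gap
    int gapY = max({0, b.y1 - a.y2, a.y1 - b.y2});   // vertical gap
    return gapX <= maxGap && gapY <= maxGap;
}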

This was the result:


IMAGE 5A

IMAGE 5B

IMAGE 5C

IMAGE 5D

IMAGE 5E
 

CombineFiguresWithoutFIGs(); DivideFiguresWithMultipleFIGs();

Finally, trusting the character recognition algorithm, I combined each figure without a FIG word with the closest figure that had one, and divided the figures that contained more than one FIG word.

This is the final result:


IMAGE 6A

IMAGE 6B

IMAGE 6C

IMAGE 6D

IMAGE 6E
 

Here you can see the full problem statement.
Here you can see the source code, but it was written and modified very quickly just for the contest, so it is somewhat hard to read.

Conclusion and Future Improvements

This approach shows how combining machine learning algorithms with traditional algorithms can produce a good solution under constraints like development time, code size and execution time. The proposed approach is a good start, but it is far from the best solution that can be achieved. The following are some improvements I plan to work on in the future:

Use a better and faster character recognition algorithm
By better and faster I mean a logistic regression or neural network algorithm that takes more features of the characters into account. This would increase the accuracy of the algorithm significantly, not only for detecting the text but also the figures. With a faster algorithm it would also be possible to use a sliding window technique, which would allow recognizing characters that are too close together and connected by black pixels, something I didn't handle in my approach due to the lack of time.

Preprocess bad images
Some images were not of very good quality: black pixels that should have been connected weren't, or pixels from different figures were connected to each other. I tried applying some filters before running the described algorithm, but the filters weren't enough to repair the images so that the algorithm would work correctly.

Recognize tables, charts and measures that aren't labels
Recognizing tables and charts is a hard but doable task: they have distinguishable features. They are rectangular, the numbers are laid out symmetrically, and the ratio of number pixels to figure pixels is high. Measures that aren't labels also have recognizable features in some cases, like a decimal point or the characters "mm" standing for millimeters, but in other cases it's a really hard task.