Text detection and recognition: implementation of FOTS paper

Mar 6, 2021

In this blog post, I will share how I implemented this paper.

I will explain the work in the following order.

Problem statement
Source of data
About data
Losses we use
Preparing data for detection
Preparing data for recognition
Training detection model and recognition model
Making pipeline
Showing results
Conclusion
Profile
References

1. Problem statement

Here we have to detect the text regions in any image that contains text; the image could show anything, against any background. After detecting the text regions, we also have to recognize the text inside them.

FOTS stands for Fast Oriented Text Spotting. This is a case study on detecting and recognizing any text in any scene.

In the above image, FOTS detects the ‘GAP’ text area along with all the other text areas in the image (scene) and recognizes them as ‘GAP’, ‘50’, ‘OFF’, and so on. This is the task we have to do.

This task can be split into two parts: detection and recognition. The detection part finds the text areas in the scene, and the recognition part reads what the text in each area says (see the above image). For detection we use a CNN, and for recognition we use a sequence decoder on each detected region.

2. Source of data.

For this problem, we will use the ICDAR 2015 dataset. We will also use a synthetic dataset of text images.

3. About data.

We will use the ICDAR 2015 dataset, which provides three types of data.

Training set images —

We have 1000 images for training text detection.

Training set localization and transcription ground truth —

We have 1000 text files with the corner coordinates and labels (texts). If a text file has five lines, the corresponding image contains five text polygons. Each line holds eight coordinates (x1, y1, x2, y2, ...) followed by a label.
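As a minimal sketch of how such a file could be read, assuming each line is comma-separated with the eight coordinates first and the label last (the function names and the sample values below are made up for illustration):

```python
import numpy as np

def parse_gt_line(line):
    """Parse one ground-truth line: 8 comma-separated coordinates, then the label."""
    # Drop surrounding whitespace and a possible UTF-8 byte-order mark.
    parts = line.strip().lstrip("\ufeff").split(",")
    coords = [float(v) for v in parts[:8]]
    # The label itself may contain commas, so join everything after the 8th field.
    label = ",".join(parts[8:])
    polygon = np.array(coords, dtype=np.float32).reshape(4, 2)  # four (x, y) corners
    return polygon, label

def parse_gt_text(text):
    """Parse the contents of one ground-truth file into polygons and labels."""
    polygons, labels = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        polygon, label = parse_gt_line(line)
        polygons.append(polygon)
        labels.append(label)
    return np.array(polygons), labels

# Made-up sample: two text polygons, the second label contains a comma.
sample = "10,20,110,20,110,60,10,60,HELLO\n5,5,50,5,50,25,5,25,1,000\n"
polygons, labels = parse_gt_text(sample)
print(polygons.shape, labels)
```

Each polygon ends up as a 4x2 array of corner points, which is a convenient shape for the later geometry steps.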

Training set word images, along with transcription ground truth —

~4468 cut-out word images corresponding to the axis-oriented bounding boxes of the words are provided with a single text file with the relative coordinates of the bounding shape within each word image. Transcription ground truth is provided in a single text file.

We also have test images for both detection and recognition.

To train the recognition model (which is trained to recognize text in cut-out text images), I took almost 150000 text images from a synthetic text dataset. Here is the link for synthetic data:

In the synthetic data, the text written in each image is also part of the image's file name, so we can extract the label from the file name.
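The exact naming scheme depends on the synthetic dataset you download. As a rough sketch, assuming hypothetical file names such as `12_hello_345.jpg`, where the word sits between the first and the last underscore, the label could be extracted like this:

```python
from pathlib import Path

def label_from_filename(path):
    """Return the word stored in a file name like '12_hello_345.jpg' (assumed pattern)."""
    stem = Path(path).stem  # file name without directory and extension
    parts = stem.split("_")
    # Assumption: the first and last fields are ids and the middle part is the word.
    if len(parts) >= 3:
        return "_".join(parts[1:-1])
    # Fallback: treat the whole stem as the label.
    return stem

print(label_from_filename("images/12_hello_345.jpg"))
print(label_from_filename("world.png"))
```

If your file names follow a different pattern, only the splitting logic needs to change.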

4. Losses we use.

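We will use the loss functions suggested in the FOTS paper. For the score map, we use cross-entropy loss, which measures how far the predicted probabilities are from the actual probabilities in the score map:

L_cls = (1/|Ω|) Σ_{x∈Ω} H(p_x, p_x*) = (1/|Ω|) Σ_{x∈Ω} ( -p_x* log(p_x) - (1 - p_x*) log(1 - p_x) )

Here Ω is the set of score-map pixels included in the loss, p_x is the predicted probability at pixel x, and p_x* is the actual value.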

For the bounding box we use an IoU loss, and for the rotation we use the rotation angle loss λ_θ(1 - cos(θ_x, θ_x*)). The bounding box loss is:

L_AABB = -log IoU(R_cap, R*) = -log( |R_cap ∩ R*| / |R_cap ∪ R*| )

Here R_cap is the predicted bounding box and R* is the actual bounding box. The numerator inside the log is the intersected area of the predicted and actual boxes, and the denominator is the union of both areas. To find the intersected area, we use:

w_i = min(d2_cap, d2*) + min(d4_cap, d4*)
h_i = min(d1_cap, d1*) + min(d3_cap, d3*)

where d1, d2, d3, and d4 are the distances from a pixel to the top, right, bottom, and left boundaries. Here w_i and h_i are the width and height of the intersected area, so we get the intersected area by multiplying them.

The union area is then area_real + area_pred - intersected_area.

For the angle, we use:

L_θ = 1 - cos(theta_cap - theta_*)

where theta_cap is the predicted angle and theta_* is the actual angle.

Merging these two, the final loss for the geo-map is:

L_g = L_AABB + λ_θ · L_θ

The whole loss for detection is:

L_detect = L_cls + λ_reg · L_reg

Here L_reg is the same as L_g.
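The detection loss described above can be sketched in NumPy as follows. This is only an illustration under a few assumptions of mine: the geo-map carries five channels (four distances plus the angle), the default weights `lambda_theta` and `lambda_reg` are placeholder values, and a `+1` is added inside the log for numerical stability. It is not the training code from the repository.

```python
import numpy as np

def detection_loss(score_true, score_pred, geo_true, geo_pred, mask,
                   lambda_theta=10.0, lambda_reg=1.0, eps=1e-6):
    """Cross-entropy on the score map plus IoU and angle loss on the geo-map.

    score_*: (H, W) arrays with values in [0, 1].
    geo_*:   (H, W, 5) arrays: top, right, bottom, left distances and angle.
    mask:    (H, W) training mask, 1 where the pixel counts toward the loss.
    """
    # Classification term: pixel-wise binary cross-entropy averaged over the mask.
    p = np.clip(score_pred, eps, 1.0 - eps)
    ce = -(score_true * np.log(p) + (1.0 - score_true) * np.log(1.0 - p))
    l_cls = np.sum(ce * mask) / (np.sum(mask) + eps)

    d1_t, d2_t, d3_t, d4_t, theta_t = np.moveaxis(geo_true, -1, 0)
    d1_p, d2_p, d3_p, d4_p, theta_p = np.moveaxis(geo_pred, -1, 0)

    # Box areas from the four distances: (top + bottom) * (right + left).
    area_true = (d1_t + d3_t) * (d2_t + d4_t)
    area_pred = (d1_p + d3_p) * (d2_p + d4_p)
    # Width and height of the intersected rectangle.
    w_i = np.minimum(d2_t, d2_p) + np.minimum(d4_t, d4_p)
    h_i = np.minimum(d1_t, d1_p) + np.minimum(d3_t, d3_p)
    inter = w_i * h_i
    union = area_true + area_pred - inter

    l_aabb = -np.log((inter + 1.0) / (union + 1.0))  # +1 avoids log(0)
    l_theta = 1.0 - np.cos(theta_p - theta_t)
    l_g = l_aabb + lambda_theta * l_theta

    # Regression term: only unmasked text pixels contribute.
    weight = score_true * mask
    l_reg = np.sum(l_g * weight) / (np.sum(weight) + eps)
    return l_cls + lambda_reg * l_reg

# Tiny example: a perfect prediction should give a loss close to zero.
score = np.zeros((8, 8)); score[2:6, 2:6] = 1.0
geo = np.zeros((8, 8, 5)); geo[..., :4] = 2.0
mask = np.ones((8, 8))
print(detection_loss(score, score, geo, geo, mask))
```

In a real training loop the same arithmetic would be written with the deep learning framework's tensor operations so that gradients can flow through it.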

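For the recognition branch, our loss is the CTC loss:

L_recog = -(1/N) Σ_{n=1..N} log p(y_n* | x)

where N is the number of text regions in the input image and y_n* is the recognition label of the n-th region.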

5. Preparing data for detection.

We have to transform the data so that we can feed it to our model and calculate the loss from the model's output. The input is simply a batch of images. The output is a score map (which marks with 1 and 0 where text is and is not) and a geo-map (which has 5 channels with the same height and width as the image: the first four are the distances to the top, right, bottom, and left, and the fifth is the angle). The model converges based on these two outputs. We also return a training mask so that, when calculating the loss, we ignore text areas that are very small and those for which no label text is given.

For each image, the input has shape (512,512,3) and the output has shape (512,512,6). There are 6 channels: one for the score map, four for the distances from the top, right, bottom, and left, and one for the training mask.

If our batch size is 32, the input shape will be (32,512,512,3) and the output shape will be (32,512,512,6).

What does the geo-map look like? For every pixel that lies on text, its channels hold the pixel's distance to the top, right, bottom, and left sides of the rectangle that encloses the text. You can see this image for more clarity:

The image is taken from the EAST paper.
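To make the idea concrete, here is a simplified sketch that builds a score map and the four distance channels for a single axis-aligned box. It is my own illustration: it ignores rotation, box shrinking, and the training mask, and the function name `make_maps` is hypothetical.

```python
import numpy as np

def make_maps(height, width, box):
    """Build a score map and a 4-channel distance map for one axis-aligned box.

    box = (x_min, y_min, x_max, y_max) in pixel coordinates, max exclusive.
    """
    x_min, y_min, x_max, y_max = box
    score = np.zeros((height, width), dtype=np.float32)
    geo = np.zeros((height, width, 4), dtype=np.float32)

    # Coordinates of every pixel inside the box.
    ys, xs = np.mgrid[y_min:y_max, x_min:x_max]
    score[ys, xs] = 1.0
    geo[ys, xs, 0] = ys - y_min          # distance to the top edge
    geo[ys, xs, 1] = (x_max - 1) - xs    # distance to the right edge
    geo[ys, xs, 2] = (y_max - 1) - ys    # distance to the bottom edge
    geo[ys, xs, 3] = xs - x_min          # distance to the left edge
    return score, geo

score, geo = make_maps(8, 8, (2, 3, 6, 5))
print(score.sum(), geo[3, 2])
```

Pixels outside the box keep zeros in every channel, which is why the loss only looks at text pixels when comparing distances.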

I have also drawn the score map, geo maps, and training masks, which look like this —

6. Preparing data for recognition.

For the recognition task, we have to give the model images of text as input, together with the encoded sequence of the text in each image. Before feeding the images in, we resize them all to the same height and width; in my case, I resized all images to (15,64,3). I encoded the text for each image and converted it into a sequence with the help of the Keras preprocessing library. After encoding, each output has shape (1,15). Where does this 15 come from? I padded every encoded text to a length of 15.

If the batch size is 32, the input shape will be (32,15,64,3) and the output shape will be (32,1,15).
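The encoding and padding step can be illustrated in plain Python. This sketch does not use the Keras preprocessing utilities; the alphabet (lowercase letters and digits) and the use of 0 as the padding value are my assumptions for the example.

```python
import string

# Assumed alphabet: lowercase letters and digits. Index 0 is reserved for padding.
ALPHABET = string.ascii_lowercase + string.digits
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 15

def encode_text(text, max_len=MAX_LEN):
    """Map characters to integer ids and right-pad with zeros to max_len."""
    # Characters outside the alphabet are dropped in this sketch.
    ids = [CHAR_TO_INDEX[ch] for ch in text.lower() if ch in CHAR_TO_INDEX]
    ids = ids[:max_len]                      # truncate overly long words
    return ids + [0] * (max_len - len(ids))  # pad to a fixed length

print(encode_text("fendi"))
```

Every label becomes a list of exactly 15 integers, so a batch of labels can be stacked into one array.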

7. Training detection and recognition model.

a. Detection model —

The first research paper is https://arxiv.org/pdf/1801.01671.pdf, which explains this whole work. In this paper, ‘FOTS’, detection and recognition are done simultaneously. It is an end-to-end system: given a scene with text, it returns the detected text areas together with the recognized text. First, it detects the text areas with a CNN by extracting feature maps, and then it does the recognition with sequence decoding on the detected areas. Here is the overall architecture:

In the above architecture, features are first extracted from the image by shared convolution layers. These features go into the text detection branch (which is again a stack of convolution layers), which predicts the bounding boxes and their orientation. This predicted output goes to ROI rotate, which produces oriented text regions with a fixed height and unchanged aspect ratio. These regions then go to the text recognition branch (which is an RNN) and a CTC decoder, which gives the predicted text.

But I implemented it in two parts: first I trained a detection model, and then I trained a recognition model, since we have data for both tasks.

Our detection part is inspired by the EAST paper, https://arxiv.org/abs/1704.03155, which explains a technique for detecting text in scenes with different backgrounds. The network it uses is built from convolutional layers, pooling layers, and normalization layers.

This network is inspired by the U-shaped network: as you can see, information from the feature extractor stem is passed to the feature-merging branch.

Here we used a pre-trained ResNet50 model, trained on the ImageNet dataset, to extract features and pass them to the feature-merging branch. You can see the epoch loss plot:

b. Recognition model —

For the recognition model, we used a few initial convolution, batch normalization, and max-pooling layers to extract information from the image, followed by a bidirectional LSTM layer. The architecture of our recognition model is here:

How the data for the recognition model is built is explained above, in the section on preparing data for recognition.

You can see the epoch loss plot for the recognition model —

8. Making pipeline —

Now we have to build a pipeline, a Python function that takes an image and returns the image with the text areas highlighted, along with the recognized text.

To implement this function we will use NMS (non-maximum suppression) and an ROI rotate method. What is NMS? It is a technique for picking the best box out of many overlapping candidate boxes: it keeps the box with the highest score and discards the other boxes that overlap it heavily.

After prediction we get an output of shape (512,512,6). From the score map, geo-map, and angle we first build lots of bounding boxes. To understand this, suppose we have some text in an image and give this image to our detector model, which returns a score, geo, and angle map. We take only those pixels, across all 6 channels, that have value one in the predicted score map. What we now have is the positions of the text-area pixels and their predicted distances to the top, right, bottom, and left sides of the rectangle. Each of these pixels therefore defines its own bounding box, so from the score map and the distances we get one bounding box per pixel.

This is where NMS comes in: out of all these overlapping boxes, it chooses the best bounding box for each piece of text. After that, we rotate the region in each bounding box with the ROI rotate technique, crop the text image with the help of the bounding box, and send it to the recognizer model. We decode the recognizer's output with TensorFlow's CTC decoder, after which we can easily get our text. Here is the code for the pipeline:

The functions used in this code can be found on GitHub; the link to the repository is given at the end of this post.
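As a simplified illustration of the post-processing steps described above (not the code from the repository), the sketch below turns text pixels into boxes, applies standard NMS, and does a greedy CTC decode. It assumes axis-aligned boxes with no rotation, and the thresholds, the blank index 0, and all function names are my own choices for the example.

```python
import numpy as np

def boxes_from_maps(score, geo, threshold=0.5):
    """Turn every text pixel into one axis-aligned box (x1, y1, x2, y2, score)."""
    ys, xs = np.where(score > threshold)
    top, right, bottom, left = (geo[ys, xs, k] for k in range(4))
    return np.stack([xs - left, ys - top, xs + right, ys + bottom,
                     score[ys, xs]], axis=1)

def nms(boxes, iou_threshold=0.5):
    """Standard NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = np.argsort(-boxes[:, 4])  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter + 1e-9)
        order = rest[iou <= iou_threshold]  # keep only weakly overlapping boxes
    return boxes[keep]

def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse repeated ids, then remove blanks (best-path CTC decoding)."""
    decoded, previous = [], None
    for idx in frame_ids:
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded

# Tiny example: one text region whose pixels all vote for the same box.
score = np.zeros((8, 8)); score[3:5, 2:6] = 0.9
geo = np.zeros((8, 8, 4))
ys, xs = np.mgrid[3:5, 2:6]
geo[ys, xs, 0], geo[ys, xs, 1] = ys - 3, 5 - xs
geo[ys, xs, 2], geo[ys, xs, 3] = 4 - ys, xs - 2
kept = nms(boxes_from_maps(score, geo))
print(kept)
print(ctc_greedy_decode([0, 6, 6, 0, 5, 5, 5, 0, 5]))
```

In the example all eight text pixels produce the same box, so NMS keeps a single one. The decoded ids would then be mapped back to characters with the same table that was used for encoding.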

9. Showing result —

I have given this image to my pipeline —

and this pipeline gives me this result —

As we can see, it detects ‘fendi’ and some more words, and it recognizes the word ‘fendi’ correctly.

Now we can see some more examples —

So we can see from the above images that the model's detection and recognition capability is not bad.

But there are some images on which the model does not perform well. For example, if an image has big words, or if the words are at an angle, it neither detects nor recognizes them properly. See some examples:

To tackle this problem, we can first use more data. Keep in mind that I trained my detection model on only 1300 images; you can use more data to train the recognition model as well. If we train on more data, the model may predict more accurate angles and geo-maps for each pixel that contains text.

You can see the working demo of the model here.

10. Conclusion —

Here we can see that the model fails in some cases and does well in others. I think this is because of the small amount of data. I used less data because I do not have enough resources, but you can try to train the model with more data, which should improve it. With this little data the model can also overfit, so for training this type of model you should use more data to get better results.

11. Profile —

My LinkedIn Profile — https://www.linkedin.com/in/vishwas-upadhyay-36b94b1b6/

Github link for the code — https://github.com/vishwas-upadhyaya/text_detection_and_recognition