Text detection and recognition: implementation of FOTS paper

In this blog post, I will share how I implemented this paper.

I will explain the work in the following order.

  1. Problem statement
  2. Source of data
  3. About data
  4. Losses we use
  5. Preparing data for detection
  6. Preparing data for recognition
  7. Training detection model and recognition model
  8. Making pipeline
  9. Showing results
  10. Conclusion
  11. Profile
  12. References

1. Problem statement

FOTS stands for Fast Oriented Text Spotting. This case study is about detecting any text in a scene and recognizing it.

In the above image, FOTS detects the ‘GAP’ text area along with all the other text areas in the image (scene) and recognizes them as ‘GAP’, ‘50’, ‘OFF’, and so on. That is the task.

This task can be split into two parts: detection and recognition. The detection part finds the text areas in the scene, and the recognition part reads what the text is (see the above image). For detection we use a CNN, and for recognition we use a sequential decoder on each detected region.

2. Source of data.

3. About data.

Training set images —

Training set localization and transcription ground truth —

Training set word images, along with transcription ground truth —

We also have test images for both detection and recognition.

To train the recognition model (which is trained on cropped text images), I took almost 150,000 text images from a synthetic text dataset. Here is the link for the synthetic data:

In the synthetic data, the text written in each image is also the name of the image file, so we can extract the label from the file name.
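
A minimal sketch of that label extraction, assuming the whole file name (without its extension) is the transcription. The helper name `label_from_filename` is hypothetical, and a dataset that adds prefixes or suffixes to its file names would need extra parsing.

```python
from pathlib import Path


def label_from_filename(path: str) -> str:
    """Return the transcription stored in an image file name.

    Assumes the file stem is exactly the text shown in the image.
    """
    return Path(path).stem


print(label_from_filename("synthetic/hello.jpg"))
```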

4. Losses we use.

The first loss measures how different the predicted probabilities in the score map are from the actual probabilities:
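
In the FOTS paper this term is a pixel-wise cross-entropy over the score map. Transcribed into LaTeX, with Ω the set of score-map pixels used for training, p_x the predicted probability at pixel x, and p_x^* its binary text/non-text label (whether the code in this project uses exactly this form is an assumption):

```latex
L_{\text{cls}} = \frac{1}{|\Omega|} \sum_{x \in \Omega} H(p_x, p_x^*)
             = \frac{1}{|\Omega|} \sum_{x \in \Omega}
               \left( -p_x^* \log p_x - (1 - p_x^*) \log (1 - p_x) \right)
```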

For the bounding box we use an IoU loss, and for rotation we use the rotation angle loss λ_θ(1 − cos(θ_x, θ*_x)). The bounding box loss is:
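
As formulated in the EAST paper, and writing \hat{R} for R_cap, the bounding box loss reads:

```latex
L_{\text{AABB}} = -\log \operatorname{IoU}(\hat{R}, R^*)
               = -\log \frac{|\hat{R} \cap R^*|}{|\hat{R} \cup R^*|}
```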

Here R_cap is the predicted bounding box and R* is the actual bounding box. The numerator inside the log is the area of intersection between the predicted and actual boxes, and the denominator is the area of their union. To find the intersected area we use:
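
Following the EAST paper, with hats marking predicted distances and stars marking ground-truth distances, the intersection can be written as:

```latex
w_i = \min(\hat{d}_2, d_2^*) + \min(\hat{d}_4, d_4^*), \qquad
h_i = \min(\hat{d}_1, d_1^*) + \min(\hat{d}_3, d_3^*), \qquad
|\hat{R} \cap R^*| = w_i \, h_i
```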

where d1, d2, d3, and d4 are the distances from a pixel to the top, right, bottom, and left boundaries of its box. Here w_i and h_i are the width and height of the intersected area, so we get the intersected area by multiplying the two.

The union area is then area_real + area_pred - intersected_area.

For the angle, we use:
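
In the EAST paper's notation, writing \hat{\theta} for theta_cap, the angle loss is:

```latex
L_\theta(\hat{\theta}, \theta^*) = 1 - \cos(\hat{\theta} - \theta^*)
```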

where theta_cap is the predicted angle and theta_* is the actual angle.

Merging these two losses, the final loss for the geo map is:
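
Writing L_AABB for the bounding box (IoU) loss and L_θ for the angle loss, the EAST paper combines them with a weight λ_θ:

```latex
L_g = L_{\text{AABB}} + \lambda_\theta L_\theta
```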

The whole loss for detection is:
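
In the FOTS paper the score-map loss L_cls and the regression loss L_reg are combined with a balancing weight λ_reg:

```latex
L_{\text{detect}} = L_{\text{cls}} + \lambda_{\text{reg}} L_{\text{reg}}
```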

Here L_reg is the same as L_g.
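
The detection loss described in this section can be sketched in NumPy as follows. This is an illustration under assumptions, not the training code of this project: the function name `detection_loss`, the array layout, the placeholder weights `lambda_theta` and `lambda_reg`, and the choice to average the geometry loss over text pixels only are all made up for the example.

```python
import numpy as np


def detection_loss(score_pred, score_true, geo_pred, geo_true,
                   theta_pred, theta_true,
                   lambda_theta=10.0, lambda_reg=1.0, eps=1e-6):
    """Score-map cross-entropy plus IoU and angle losses.

    score_*: (H, W) predicted probabilities / binary labels.
    geo_*:   (H, W, 4) distances to the top, right, bottom, left box sides.
    theta_*: (H, W) rotation angles in radians.
    """
    # Classification term: pixel-wise binary cross-entropy on the score map.
    p = np.clip(score_pred, eps, 1.0 - eps)
    l_cls = np.mean(-score_true * np.log(p)
                    - (1.0 - score_true) * np.log(1.0 - p))

    # Box areas from the four distances: (top + bottom) * (right + left).
    d1p, d2p, d3p, d4p = np.moveaxis(geo_pred, -1, 0)
    d1t, d2t, d3t, d4t = np.moveaxis(geo_true, -1, 0)
    area_pred = (d1p + d3p) * (d2p + d4p)
    area_real = (d1t + d3t) * (d2t + d4t)

    # Intersection width and height, then union = real + pred - intersection.
    w_i = np.minimum(d2p, d2t) + np.minimum(d4p, d4t)
    h_i = np.minimum(d1p, d1t) + np.minimum(d3p, d3t)
    inter = w_i * h_i
    union = area_real + area_pred - inter
    l_aabb = -np.log((inter + eps) / (union + eps))

    # Angle term and combined geometry loss per pixel.
    l_theta = 1.0 - np.cos(theta_pred - theta_true)
    l_g = l_aabb + lambda_theta * l_theta

    # Assumption: geometry loss is averaged over ground-truth text pixels only.
    l_reg = np.sum(l_g * score_true) / (np.sum(score_true) + eps)
    return l_cls + lambda_reg * l_reg
```

With a perfect prediction the loss is close to zero; halving every predicted distance gives an IoU of 0.25 on the text pixels and raises the loss by about log(4).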

For the recognition branch, our loss is the CTC loss:
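
In the FOTS paper the CTC-based recognition loss is written as follows, where π is a frame-level label path, B is the many-to-one map that removes repeated labels and blanks, y^* is the ground-truth label sequence, and N is the number of text regions in the input:

```latex
p(y^* \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y^*)} p(\pi \mid x), \qquad
L_{\text{recog}} = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n^* \mid x)
```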

5. Preparing data for detection.

For each image, the input has shape (512,512,3) and the output has shape (512,512,6). The 6 channels are: one for the score map, four for the distances to the top, right, bottom, and left sides, and one for the training mask.

If our batch size is 32, the input shape will be (32,512,512,3) and the output shape will be (32,512,512,6).

What does the geo map look like? For every pixel that lies inside a text rectangle, its channels hold the distances from that pixel to the top, right, bottom, and left sides of the rectangle. See this image for more clarity:

Image taken from the EAST paper.
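
To make the target layout concrete, here is a minimal NumPy sketch for the simplified case of axis-aligned text boxes. The function name `make_detection_target` and the channel order are assumptions for illustration; real preprocessing also has to handle rotated quadrilaterals, and EAST additionally shrinks the text region before drawing the score map.

```python
import numpy as np


def make_detection_target(boxes, size=512):
    """Build a (size, size, 6) target from axis-aligned text boxes.

    boxes: list of (x_min, y_min, x_max, y_max) in pixel coordinates.
    Channel 0 is the score map, channels 1-4 hold the distances to the
    top, right, bottom and left sides, channel 5 is the training mask.
    """
    target = np.zeros((size, size, 6), dtype=np.float32)
    target[..., 5] = 1.0  # train on every pixel unless a region is masked out
    ys, xs = np.mgrid[0:size, 0:size]
    for x_min, y_min, x_max, y_max in boxes:
        inside = (xs >= x_min) & (xs <= x_max) & (ys >= y_min) & (ys <= y_max)
        target[inside, 0] = 1.0
        target[inside, 1] = (ys - y_min)[inside]  # distance to top side
        target[inside, 2] = (x_max - xs)[inside]  # distance to right side
        target[inside, 3] = (y_max - ys)[inside]  # distance to bottom side
        target[inside, 4] = (xs - x_min)[inside]  # distance to left side
    return target


target = make_detection_target([(100, 200, 300, 250)])
print(target.shape)
```

Stacking 32 such targets with `np.stack` gives the (32,512,512,6) batch shape mentioned above.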

I have also drawn the score map, geo maps, and training mask, which look like this:

6. Preparing data for recognition.

If the batch size is 32, the input shape will be (32,15,64,3) and the output shape will be (32,1,15).
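
One plausible reading of the (1,15) label shape is a word encoded as 15 integer character indices, padded on the right. The sketch below illustrates that reading only; the alphabet `CHARSET`, the padding index `PAD`, and the helper `encode_label` are hypothetical and may differ from what the repository actually does.

```python
import numpy as np

CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"  # hypothetical alphabet
MAX_LEN = 15                                      # label length from the shape above
PAD = len(CHARSET)                                # hypothetical padding index


def encode_label(text):
    """Map a word to a (1, MAX_LEN) array of character indices, right-padded."""
    ids = [CHARSET.index(c) for c in text.lower() if c in CHARSET][:MAX_LEN]
    ids += [PAD] * (MAX_LEN - len(ids))
    return np.array(ids, dtype=np.int32).reshape(1, MAX_LEN)


# Stacking 32 encoded labels gives the (32, 1, 15) batch shape.
batch = np.stack([encode_label("GAP")] * 32)
print(batch.shape)
```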

7. Training detection and recognition model.

The first research paper is https://arxiv.org/pdf/1801.01671.pdf, which explains this whole work. In this paper, ‘FOTS’, detection and recognition are done simultaneously. It is an end-to-end system: if we give it a scene containing text, it returns the detected text areas along with the recognized text. First, a CNN extracts feature maps and detects the text areas; after that, recognition is done by sequence decoding on the detected areas. Here is the overall architecture:

In the above architecture, the shared convolution layers first extract features from the image. These features go to the text detection branch (which is again a stack of convolution layers), which predicts the bounding boxes (b boxes) and their orientation. This predicted output goes to RoIRotate, which turns the oriented text regions into axis-aligned regions with a fixed height and an unchanged aspect ratio. These regions then go to the text recognition branch (which is an RNN) and a CTC decoder, which gives the predicted text.

However, I implemented it in two parts: first I trained a detection model and then I trained a recognition model, since we have data for both tasks.

a. Detection model —

Our detection part is inspired by the EAST paper, https://arxiv.org/abs/1704.03155. This paper explains a technique for detecting text in scenes with different backgrounds. The network is built from convolutional layers, pooling layers, and normalization layers.

This network is inspired by the U-shaped network: information from the feature extractor stem is passed on to the feature-merging branch.

Here we used a ResNet50 model pre-trained on the ImageNet dataset to extract features and pass them to the feature-merging branch. You can see the epoch loss plot:

b. Recognition model —

For the recognition model, we used a few initial convolution, batch normalization, and max-pooling layers to extract information from the image, followed by a bidirectional LSTM layer. The architecture of our recognition model is here:

How we build the data for the recognition model is explained above, in the section on preparing data for recognition.

You can see the epoch loss plot for the recognition model —

8. Making pipeline —

To write this function we use the NMS (Non-Max Suppression) technique and the ROI-rotate method. What is NMS? It is a technique that, among candidate bounding boxes that overlap each other heavily, keeps the highest-scoring one and discards the rest.

After prediction we get an output of shape (512,512,6). From the score map, geo map, and angle we first build a lot of bounding boxes. To understand this, suppose the image contains some text. Giving this image to our detector model yields a score map, a geo map, and an angle map. We take only those pixels, across all 6 channels, that have value one in the predicted score map. What we now have is the positions of the text-area pixels and their predicted distances from the top, right, bottom, and left sides of the rectangle. Each such pixel therefore defines its own bounding box (we know the pixel's position and the distance of each side from it), so we get one bounding box per pixel.

This is where NMS comes in: from these overlapping candidates it keeps the best bounding box for each piece of text. After that, we rotate the region inside each bounding box with the ROI-rotate technique. We then crop the text image using the bounding boxes and send it to the recognizer model, which gives its output. We decode this output with the TensorFlow ctc_decoder method, after which we can read off the text very easily. Here is the code for the pipeline:

The functions used in this pipeline can be found on GitHub. The link to the repository is given at the end of this post.
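
For readers who want to see the post-processing steps in isolation, here is a rough NumPy sketch of box restoration, plain NMS, and greedy CTC decoding. It is not the repository's code: the function names, the thresholds, and the convention that the blank label is the last class are assumptions, rotation is ignored (boxes are treated as axis-aligned), and `ctc_greedy_decode` is a plain-NumPy stand-in for the TensorFlow decoder.

```python
import numpy as np


def restore_boxes(score_map, geo_map, score_thresh=0.8):
    """Turn every text pixel into a box (x1, y1, x2, y2, score)."""
    ys, xs = np.where(score_map >= score_thresh)
    top, right, bottom, left = (geo_map[ys, xs, k] for k in range(4))
    boxes = np.stack([xs - left, ys - top, xs + right, ys + bottom,
                      score_map[ys, xs]], axis=1)
    return boxes.astype(np.float32)


def nms(boxes, iou_thresh=0.3):
    """Keep the best-scoring box, drop boxes that overlap it, repeat."""
    order = np.argsort(-boxes[:, 4])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(i)
        w = np.maximum(0.0, np.minimum(boxes[i, 2], boxes[rest, 2])
                       - np.maximum(boxes[i, 0], boxes[rest, 0]))
        h = np.maximum(0.0, np.minimum(boxes[i, 3], boxes[rest, 3])
                       - np.maximum(boxes[i, 1], boxes[rest, 1]))
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter + 1e-6)
        order = rest[iou <= iou_thresh]
    return boxes[np.array(keep, dtype=int)]


def ctc_greedy_decode(probs, charset):
    """Take the argmax per time step, collapse repeats, then drop blanks.

    probs: (T, len(charset) + 1); the last class is assumed to be the blank.
    """
    blank = len(charset)
    chars, prev = [], -1
    for k in np.argmax(probs, axis=1):
        if k != prev and k != blank:
            chars.append(charset[k])
        prev = k
    return "".join(chars)
```

In a full pipeline the boxes kept by `nms` would be used to crop (and, for rotated text, rectify) the regions that are sent to the recognizer, and the recognizer's per-time-step output would go through the decoder.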

9. Showing result —

This pipeline gives me this result:

As we can see, it detects ‘fendi’ and some more words, and it correctly recognizes the word ‘fendi’.

Now we can see some more examples —



In the above images, we can see that the model's detection and recognition capability is fairly good.

But there are some images on which the model does not perform that well. For example, if an image has large words or if the words are at an angle, the model neither detects nor recognizes them properly. See some examples:

To tackle this problem, we can first use more data. Keep in mind that I trained my detection model on only 1300 images; more data can be used to train the recognition model as well. If we train on more data, the model may predict more accurate angles and geo maps for each pixel that contains text.

You can see a working demo of the model here.

10. Conclusion —

11. Profile —

Github link for the code — https://github.com/vishwas-upadhyaya/text_detection_and_recognition

12. References —

  1. https://arxiv.org/pdf/1801.01671.pdf
  2. https://www.youtube.com/watch?v=c86gfVGcvh4
  3. https://github.com/Pay20Y/FOTS_TF
  4. https://github.com/yu20103983/FOTS/tree/master/FOTS
  5. https://github.com/Masao-Taketani/FOTS_OCR
  6. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
  7. https://machinelearningmastery.com/how-to-use-transfer-learning-when-developing-convolutional-neural-network-models/

Currently working on Data Science. Python Developer (Django Framework).