This is a problem I worked on for a while. The goal was a proof of concept to see how accurately we can recognize a supermarket receipt. The scope covers 2 major supermarkets, Tesco and Sainsbury’s, two of the leading supermarkets in the UK. The data set had over 2000 images and was very challenging: when processed through Tesseract OCR alone, almost 60% of the receipts came out below 20-30% accuracy (bad lighting conditions, bad angles, low resolution, crumpled paper).
1. Okay, now let’s start with a reasonable receipt. This receipt is readable to the human eye. Let’s see how Tesseract reads it. (Hmm, not bad, at least we can get the supermarket name, but that is less than 5% accuracy.)
2. So using Tesseract alone won’t help. Let’s try normalizing and sharpening the image (making dark grey black and light grey white) before sending it to Tesseract. (Oh, now the supermarket name has gone, but we can read some line items. This seems promising.)
3. Okay, we need to get more aggressive. Let’s make the image grayscale, stretch it a little, normalize it and turn it into a binary image. (Wow, that’s like 95% accuracy.)
4. Okay, that went well. Let’s test it with another image which has a less dark background than the previous one. (Both plain Tesseract and pre-processing + Tesseract can read the line items, and the pre-processed image gives us the values too, but it introduces a new problem: noise.)
5. As we can see, there are a lot of garbage values. The background has some kind of pattern, so when we normalize the image and make it binary, parts of the background get detected as text too. This is known as image noise. (Okay, so the output is good, but we have to remove the background in order to make it accurate.)
6. So I removed the background by calculating the average color of the whole image, then starting from a corner and removing pixels. When I encounter a big change in color (that is, when I reach the receipt, which is white) I stop the process. (This increased the accuracy of the item values and also reduced the noise/garbage values.)
7. This is how the receipt looks before and after the background removal, prior to being sent to Tesseract.
8. Okay, that’s great. Let’s try it out with a white/light background. As the color variation between the background and the receipt is small in this image, it starts removing all the pixels in the image, considering everything to be background. (Oh, that was very bad.)
9. Okay, so let’s go back to the previous approach: normalize and sharpen the image. It works well for this kind of image (light background) and gives good recognition even though the image has a shadow and the receipt is crumpled. The last editor shows the normal output from Tesseract, without any pre-processing.
10. So, to conclude: light background images give good results with the sharpen-and-normalize approach, and dark/mixed background images give good results with the background removal process.
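To make the two pre-processing ideas from the steps above a bit more concrete, here is a minimal sketch in plain Java. It is not the original code (the project lists ImageMagick for image handling, and the exact commands are not shown here), so treat it as one possible reading: `ReceiptPreprocess`, `binarize`, `removeBackground`, `threshold` and `jump` are hypothetical names and parameters, and the row-by-row scan from both edges is an assumption about what "start from the corner" could look like.

```java
import java.awt.image.BufferedImage;

class ReceiptPreprocess {

    // Brightness of a packed RGB pixel, from 0 (black) to 255 (white).
    static int luma(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
    }

    // Grayscale -> stretch contrast to the full 0..255 range -> threshold.
    // The result only contains pure black and pure white pixels.
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth(), h = src.getHeight();
        int min = 255, max = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int v = luma(src.getRGB(x, y));
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
        }
        int range = Math.max(1, max - min); // avoid dividing by zero on flat images
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int stretched = (luma(src.getRGB(x, y)) - min) * 255 / range;
                out.setRGB(x, y, stretched >= threshold ? 0xFFFFFF : 0x000000);
            }
        }
        return out;
    }

    // Assumed reading of the background removal: in every row, walk inward
    // from the left and right edges and paint pixels white until a pixel is
    // brighter than the image average by more than `jump` (the white paper).
    static BufferedImage removeBackground(BufferedImage src, int jump) {
        int w = src.getWidth(), h = src.getHeight();
        long sum = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                sum += luma(src.getRGB(x, y));
            }
        }
        int limit = (int) (sum / ((long) w * h)) + jump;

        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out.setRGB(x, y, src.getRGB(x, y));
            }
        }
        for (int y = 0; y < h; y++) {
            int left = 0;
            while (left < w && luma(src.getRGB(left, y)) <= limit) {
                out.setRGB(left++, y, 0xFFFFFF);
            }
            int right = w - 1;
            while (right > left && luma(src.getRGB(right, y)) <= limit) {
                out.setRGB(right--, y, 0xFFFFFF);
            }
        }
        return out;
    }
}
```

In this sketch, when the background is about as bright as the paper there may be no pixel that clears the average-plus-`jump` bar, and the scan then wipes the whole row. That is at least consistent with the light-background failure described above, which is why the sharpen-only variant is kept as a separate option.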
- Okay, so coming back to the point: we have 3 image pre-processing variants (background removal + sharpening, sharpening only, and the plain output with no pre-processing). We are going to run them in parallel on the same image, so we have to know which one gave the best result.
- For that, we need a metric. We define the best algorithm for an image as the one that discovers the highest number of line items, and we’ll use its output for the DB.
- So how are we going to define what a line item is? For that, we’ll use a regular expression.
- One interesting point is that these receipts do not have quantities (units): if someone buys 2 of the same item, the item is printed twice. This makes data extraction less difficult.
- So I’m going to define the line item. A line in a receipt that has a description (some characters), followed by a big gap (more than 3 spaces), followed by a currency symbol and then some digits, is considered a line item.
- So we read the 3 different outputs from pre-processing, count how many valid line items each output contains, and rank them in that order.
- Then we analyze the two lower-scoring outputs and see if they have found any attribute (date, supermarket name, coupons etc.) which was not found by the top-scoring output. If so, I also save those results for the DB.
- Finally, the remaining lines from the top-scoring output which failed the valid input test (the line item rule) are sent for manual checking.
- It is very hard to do full robotic process automation and extract all the data with 100% accuracy. But 100% accuracy has been achieved for some of the receipts which were scanned with a flatbed scanner.
- The supermarket name is identified by keywords in the receipt. Moving to more advanced techniques, such as logo detection with SURF or SIFT feature matching, should make this more robust.
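The line item rule and the ranking described in the list above could be sketched roughly like this. It is only an illustration under a few assumptions: `LineItemScorer` and its method names are hypothetical, the currency symbol is taken to be the pound sign (written as `\u00A3` so the source file encoding does not matter), and "some digits" is read as a number with an optional two-digit decimal part.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class LineItemScorer {

    // description, a gap of more than 3 spaces/tabs, a pound sign, then digits
    static final Pattern LINE_ITEM = Pattern.compile(
            "^\\s*(\\S.*?)[ \\t]{4,}\u00A3\\s?(\\d+(?:\\.\\d{2})?)\\s*$");

    // Number of lines in one OCR output that satisfy the line item rule.
    static int countLineItems(String ocrText) {
        int count = 0;
        for (String line : ocrText.split("\\R")) {
            if (LINE_ITEM.matcher(line).matches()) {
                count++;
            }
        }
        return count;
    }

    // Pre-processing variant names, best first (most line items found).
    // The map goes from a variant name to the OCR text it produced.
    static List<String> rank(Map<String, String> outputs) {
        List<String> names = new ArrayList<>(outputs.keySet());
        names.sort(Comparator
                .comparingInt((String name) -> countLineItems(outputs.get(name)))
                .reversed());
        return names;
    }

    // Non-blank lines that fail the rule. For the top-ranked output these
    // are the candidates for attribute lookup or manual checking.
    static List<String> rejectedLines(String ocrText) {
        List<String> rejected = new ArrayList<>();
        for (String line : ocrText.split("\\R")) {
            if (!line.trim().isEmpty() && !LINE_ITEM.matcher(line).matches()) {
                rejected.add(line);
            }
        }
        return rejected;
    }
}
```

Counting matches is a cheap proxy rather than a real accuracy measure: an output can match the pattern and still contain a misread price, so the manual checking step still matters.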
To see the sample code, please email me at sshniro AT gmail.com
Sample data set link – https://drive.google.com/drive/folders/0B82Dey5Aw8TcSnhEQlFLZUdDb3c?usp=sharing
Technologies Used
- Java
- Bash
- ImageMagick
- MongoDB