OCR for Smart Data Extraction from PDF and Images with NER


Learn data extraction, labelling and model training with spaCy, and build a solution with Python, Pandas, OCR and NER concepts


  • Basic Python Programming knowledge


This is a hands-on course in which we teach you how to build a common pipeline that works across multiple data formats. Through a structured workflow, you will learn data extraction using OCR, data labelling with spaCy, training a model on custom NER data and, towards the end, validating that model through prediction. We then combine all of these learnings to build a Smart Text Extractor application using Python, Pandas, OCR, Tesseract, PyTesseract, OpenCV, spaCy and NER concepts.

Unique Offerings of this course:

a) Code walkthrough of a working pipeline that performs operations on documents such as conversion, extraction and labelling

b) Line-by-line code walkthrough of the operations performed at each step

c) The end product that you build with us towards the end of the course is in working condition, and support is provided within 24 hours for any issues you face

d) Detailed explanation of the steps required to train spaCy for NER

The course has been designed to explain the text data extraction workflow in depth by first explaining the technology concepts and then their implementation through code. A detailed code walkthrough is included for every code implementation, and 12 supporting notebooks containing source code are available for download. In addition, the quiz at the end of the course helps you assess your knowledge and identify areas for improvement. Enroll in this course and develop expertise in building a robust data extraction workflow.

Here is a summary of the key topics we will cover and the projects we will design in the course:

Conversion of Document

When developing an application, we cannot restrict users to submitting documents in one particular format. Sometimes users upload a Word document; sometimes they take a photo with a phone and upload it directly as an image or a scanned PDF; often they prefer to scan the document properly using a scanner; and some tech-savvy users prefer to edit the text with a PDF writer and then upload the result for data extraction and processing. This is the most technically demanding problem, because we need a solution capable of handling extraction from all these different types of data without failure.

We solve this problem by building a technical solution that provides a complete pipeline to which a user can submit any of these document types. Once a user has uploaded the documents, the solution handles all of these varieties in one go, converts them to text and finally performs text labelling on that text using Named Entity Recognition.

To elaborate further, the most common types of input documents that come for data extraction are:

  • Structured PDF documents, which do not contain scanned images. Instead, the document is either converted directly from a Word document to PDF, created with an editable PDF writer, or produced by a similar conversion. Once a structured PDF is converted to text, we can easily extract the data and consume it directly.
  • Scanned PDF documents, which contain images embedded in the PDF. These first need to be converted to an image format, after which Tesseract can be used to extract data from them.
  • Word documents. You might assume these are easy because it is just copy and paste; however, copying data from a Word document programmatically requires some steps and code, which we explain in detail.
  • Images produced by a scanner or taken with a mobile phone, which do not require any first-level document formatting and can be sent directly for text extraction and labelling.

Overall, our aim is to let users submit any of these formats freely; our code converts all the different formats to a common format and sends it for OCR.
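As a rough illustration of how the input types listed above could be told apart before conversion, here is a minimal sketch. The route names, the extension table and the "empty text layer means scanned" heuristic are our own assumptions for illustration and are not taken from the course code.

```python
from pathlib import Path

# Hypothetical route names and extension table (assumptions for illustration).
ROUTES = {
    ".pdf": "pdf",     # structured vs scanned is decided separately, see pdf_route
    ".docx": "word",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tif": "image",
    ".tiff": "image",
}


def classify_document(path: str) -> str:
    """Return the processing route for a file, based only on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in ROUTES:
        raise ValueError(f"Unsupported document type: {path!r}")
    return ROUTES[suffix]


def pdf_route(text_layer: str) -> str:
    """Guess whether a PDF is structured or scanned.

    Assumed heuristic: if a PDF text extractor returned no visible characters,
    the pages are treated as embedded images that need OCR.
    """
    return "structured_pdf" if text_layer.strip() else "scanned_pdf"
```

For example, `classify_document("invoice.PDF")` returns `"pdf"`, and a PDF whose extracted text layer is blank would then be routed to image conversion and Tesseract.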

Extraction of Data from Images

In this course we explain in detail how to extract data from images using Tesseract. We start by reading images with the PIL package and OpenCV functions, then use the Pytesseract package to call Tesseract and convert the data in the images to text. We also explain in detail the PSM (page segmentation mode) and OEM (OCR engine mode) options, which control how the input image is segmented and which recognition engine is used, so that data can be extracted in the most suitable way for each type of image.

As covered previously, all input documents have been converted into a single format; we then use the OCR engine to convert the data to text so that it can be labelled in the following sections.
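To make the PSM and OEM options more concrete, here is a minimal sketch of passing them to Tesseract through Pytesseract. The helper names `build_tesseract_config` and `image_to_text` are hypothetical, and the defaults are our choice rather than the course's; the valid ranges (PSM 0 to 13, OEM 0 to 3) are those of Tesseract 4 and later.

```python
def build_tesseract_config(psm: int = 3, oem: int = 3) -> str:
    """Build the option string that pytesseract forwards to Tesseract."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def image_to_text(image_path: str, psm: int = 3, oem: int = 3) -> str:
    """Run OCR on one image file.

    Needs Pillow, pytesseract and the Tesseract binary to be installed.
    """
    # Imported lazily so the config helper works without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        config = build_tesseract_config(psm, oem)
        return pytesseract.image_to_string(image, config=config)
```

A page that is a single uniform block of text might be read with `image_to_text("page.png", psm=6)`, while the default `psm=3` leaves page segmentation fully automatic.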

Labelling Extracted Data

In the world of recognition, raw extracted data holds no meaning, so to make it useful we explain in detail how to classify the text extracted by OCR into categories such as Date, Name, Country etc. There are multiple ways to perform this operation, and we explain in detail how to use spaCy to recognize text. Some of the pre-defined categories into which spaCy's pre-trained models classify text are Person, Organization, Countries, Buildings, Date, Time, Money and Numerals.
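Here is a minimal sketch of labelling text with a pre-trained spaCy model and grouping the results by category. The model name `en_core_web_sm` and both helper names are assumptions for illustration; in spaCy's English models, categories such as those above typically appear as labels like PERSON, ORG, GPE, FAC, DATE, TIME, MONEY and CARDINAL.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs into {label: [text, ...]}."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


def extract_entities(text, model_name="en_core_web_sm"):
    """Label text with a pre-trained spaCy model.

    Needs spaCy and the named model to be installed; the model name is an
    assumption, not something specified by the course.
    """
    import spacy  # imported lazily so group_entities works without spaCy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)
```

The exact entities returned depend on the model and its version, so the output for a given sentence should be checked rather than assumed.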

Training spaCy for NER

In this course we cover an important topic: how to train spaCy on tag names that you define. To do this, we first label text with your own tag names and then train a spaCy model on this custom data for Named Entity Recognition. Once training is complete, we use the newly trained spaCy model to predict entities in an input text dataset.
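To illustrate what custom labelling and training can look like, here is a minimal sketch. It assumes the spaCy v3 training API (`spacy.blank`, `Example.from_dict`, `nlp.initialize`, `nlp.update`); the course may use a different version or workflow. The helper names and the tag names in the usage note are hypothetical.

```python
def make_training_example(text, annotations):
    """Build (text, {"entities": [(start, end, label), ...]}).

    annotations is a list of (phrase, label) pairs; only the first
    occurrence of each phrase in the text is annotated.
    """
    entities = []
    for phrase, label in annotations:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": sorted(entities)}


def train_ner(examples, n_iter=20):
    """Train a blank English NER model on custom tags (assumes spaCy v3)."""
    import random
    import spacy
    from spacy.training import Example

    examples = list(examples)  # copy so the caller's list is not shuffled
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    for _, annotation in examples:
        for _, _, label in annotation["entities"]:
            ner.add_label(label)

    optimizer = nlp.initialize()
    for _ in range(n_iter):
        random.shuffle(examples)
        for text, annotation in examples:
            example = Example.from_dict(nlp.make_doc(text), annotation)
            nlp.update([example], sgd=optimizer)
    return nlp
```

For instance, `make_training_example("Invoice INV-001 issued to Acme Corp", [("INV-001", "INVOICE_NO"), ("Acme Corp", "VENDOR")])` yields character offsets for the two custom tags, and the returned model from `train_ner` can then be called on new text to predict entities.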

Pandas for CSV Output

If you are working in the Computer Vision domain, it is crucial to understand how to use a Pandas DataFrame in code to produce output as a CSV file. We explain this in detail because, at the end of the day, we need the output in a usable format rather than printed to the console.
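Here is a minimal sketch of turning labelled entities into a CSV file with a Pandas DataFrame. The column names (`document`, `label`, `text`) and the helper name `entities_to_csv` are assumptions for illustration.

```python
import pandas as pd


def entities_to_csv(rows, csv_path=None):
    """Build a DataFrame from entity rows and optionally write it to CSV.

    rows is a list of dicts with the keys "document", "label" and "text"
    (hypothetical column names).
    """
    frame = pd.DataFrame(rows, columns=["document", "label", "text"])
    if csv_path is not None:
        # index=False keeps the row numbers out of the file
        frame.to_csv(csv_path, index=False)
    return frame
```

Calling `entities_to_csv(rows, "entities.csv")` writes one line per entity under a single header row, which is usually easier to consume downstream than console output.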

Project Pipeline

All the previous code walkthroughs are then combined into one running set of code, which first converts all input documents into one common format, then extracts data from the images, then performs labelling using the spaCy model, and finally outputs the data to a CSV file.
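The overall flow can be sketched as a small function that chains the stages. This is only an outline under our own assumptions: `to_text` and `label` are hypothetical callables standing in for the conversion/OCR and spaCy steps, and the row keys are illustrative.

```python
def run_pipeline(paths, to_text, label):
    """Convert, label and flatten a batch of documents into rows.

    to_text: callable(path) -> str, standing in for conversion plus OCR
    label:   callable(text) -> iterable of (entity_text, entity_label)
    """
    rows = []
    for path in paths:
        text = to_text(path)
        for entity_text, entity_label in label(text):
            rows.append(
                {"document": path, "label": entity_label, "text": entity_text}
            )
    return rows
```

The returned list of rows can be loaded into a Pandas DataFrame and written out as a CSV file, and each stage can be swapped or tested on its own because it is passed in as a plain function.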

This complete running application can be downloaded from this course for your project requirements and can be customized as per your needs.

Who this course is for:

  • Python developers who want to learn data extraction using OCR
  • NLP and NER enthusiasts who are keen to explore text labelling
  • Computer Vision professionals
  • OCR engineers
