Doc Reader

Turn your PDF Document into Structured Data in seconds


  • The user prepares the training set for the neural model by annotating a few documents with the relevant information. The model uses these annotations as examples of what its predictions should look like. They can be created directly in a custom UI packaged with the 2OS system.
  • The training dataset is then sent to the training service, which uses it to produce a trained deep learning model. That model is then deployed to make predictions on new, unseen documents.
  • The training pipeline starts from the annotated documents and applies multiple processing steps in order to generate the trained model:
    • Extract the text and its layout from the PDF.
    • Segment the text into a sequence of sentences based on punctuation and layout and then tokenize each sentence into a sequence of words.
    • Align the manual annotation with the tokens in the training set.
    • Build a deep neural network for sequence labelling.
    • Train the neural network on the processed annotated samples.
    • Produce the trained model artefact.
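
The segmentation, tokenization and alignment steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Doc Reader implementation: it assumes the text has already been extracted from the PDF, that annotations are stored as character spans, and that labels are encoded with BIO tags, a common convention for sequence labelling. The function names, the regular expressions and the labels CLIENT and AMOUNT are all hypothetical.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]   # (text, start_char, end_char)
Span = Tuple[int, int, str]    # (start_char, end_char, label)


def split_sentences(text: str) -> List[Tuple[int, int]]:
    """Return character ranges of sentences, split on end punctuation or line breaks."""
    bounds, start = [], 0
    for m in re.finditer(r"(?<=[.!?])\s+|\n+", text):
        if m.start() > start:
            bounds.append((start, m.start()))
        start = m.end()
    if start < len(text):
        bounds.append((start, len(text)))
    return bounds


def tokenize(text: str, start: int, end: int) -> List[Token]:
    """Split one sentence into word and punctuation tokens, keeping document offsets."""
    return [(m.group(), start + m.start(), start + m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text[start:end])]


def align(tokens: List[Token], spans: List[Span]) -> List[str]:
    """Assign a BIO tag to each token from the annotated character spans."""
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        first = True
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start < s_end and t_end > s_start:  # token overlaps the span
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags


def span_of(text: str, phrase: str, label: str) -> Span:
    """Helper for the example: locate a phrase and turn it into an annotation span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def prepare(text: str, spans: List[Span]) -> List[Tuple[List[str], List[str]]]:
    """Produce one (words, tags) pair per sentence, ready for a sequence labeller."""
    samples = []
    for start, end in split_sentences(text):
        tokens = tokenize(text, start, end)
        samples.append(([t[0] for t in tokens], align(tokens, spans)))
    return samples


text = "Invoice 42 issued to Acme Corp.\nTotal due: 1250 EUR"
spans = [span_of(text, "Acme Corp", "CLIENT"), span_of(text, "1250 EUR", "AMOUNT")]
samples = prepare(text, spans)
for words, tags in samples:
    print(list(zip(words, tags)))
```

Each sentence becomes a pair of equal-length sequences, words and tags, which is the input shape a sequence labelling network is typically trained on.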


  • State of the art: The neural architecture uses the latest advances in deep learning research to be fast and data-efficient.
  • Customizable: Users can define their own set of labels and use them to train an extraction pipeline that automatically processes new documents.
  • Synergy with other 2OS modules, especially 2OS Annotation Tool.
  • Adaptability: Doc Reader can learn to extract any type of information expressed in natural language.
  • Ease of use: Our app can be used by anyone, with no need to write code or have any technical knowledge of deep learning.
  • Extract unstructured data quickly and accurately and reduce time-consuming search and extraction tasks for your business.

How it works

To extract structured information from PDF documents, we use supervised machine learning. This means we need to build a dataset that is used to train the machine learning algorithm in a supervised manner. The dataset must contain example input documents together with the expected output for each of them.
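
As an illustration of what such a dataset might look like, the sketch below pairs each input document with its expected output. The class name, field names, file names and label values are hypothetical and are not the format used by Doc Reader.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrainingExample:
    document_path: str               # input: the PDF to read (hypothetical path)
    expected_output: Dict[str, str]  # output: label -> text the model should extract


def labels_in(examples: List[TrainingExample]) -> List[str]:
    """Collect the set of labels the model will be trained to predict."""
    return sorted({label for ex in examples for label in ex.expected_output})


dataset = [
    TrainingExample("invoice_001.pdf", {"client": "Acme Corp", "amount": "1250 EUR"}),
    TrainingExample("invoice_002.pdf", {"client": "Globex", "amount": "980 EUR"}),
]

print(labels_in(dataset))
```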