Configuring the OCR server

The OCR server enables the system to analyze image files sent through network channels, such as email attachments and web posts. The server determines whether an image contains text, and if so, extracts the text and analyzes it for sensitive content. There is no special policy attribute to configure for optical character recognition (OCR). If sensitive text is found, the image is blocked or permitted according to the active policies.

The server can also be used to locate sensitive text in images during network discovery.

This feature does not support handwriting, or images in which the text is skewed by more than 10 degrees.

To use OCR, install a supplemental Forcepoint DLP server; the OCR server is automatically included in supplemental Forcepoint DLP server installations.

To enable OCR analysis in your network:

  1. Navigate to the Settings > Deployment > System Modules page in the Data Security module of the Security Manager and edit the policy engine on each server or agent that will receive traffic that you want analyzed.
  2. In each Edit window, select Enable OCR by and indicate which OCR server (supplemental Forcepoint DLP server) to use to extract text from images.

When OCR is enabled, images of the following types are sent to that OCR server for text extraction:

  • JPEG_2000_JP2_File - JPEG-2000 JP2 File Format Syntax (ISO/IEC 15444-1) (.jp2, .j2k, .pgx)
  • JBIG2 - JBIG2 File Format (.jB2, .jbig2)
  • MacPaint - MacPaint
  • PC_Paintbrush - Paintbrush Graphics (PCX)
  • BMP - Windows Bitmap
  • JPEG_File_Interchange - JPEG Interchange Format
  • PNG - Portable Network Graphics (PNG)
  • GIF_87a - Graphics Interchange Format (GIF87a)
  • GIF_89 - Graphics Interchange Format (GIF89a)
  • TIFF - TIFF
  • Scanned PDF documents - PDF documents containing only scanned text

All other PDF documents, including hybrid files containing both searchable text and scanned text, are sent to the default Forcepoint DLP extractor, not the OCR server. If the default extractor fails to extract text from a PDF, the PDF is forwarded to the OCR server.

Tip: To specify a PDF type that should always be routed to the OCR server, edit the extractor.config.xml file as described in this knowledge base article.

Images embedded in Microsoft Office documents are sent to the OCR server for text extraction.
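
The routing rules described above can be summarized in a short Python sketch. This is an illustration only, not Forcepoint code: the names Route, OCR_IMAGE_FORMATS, and route_file are hypothetical, the format labels are copied from the list above, and the sketch assumes that any file type not mentioned in this topic is handled by the default extractor. It does not model overrides made in extractor.config.xml.

```python
from enum import Enum


class Route(Enum):
    OCR_SERVER = "ocr_server"
    DEFAULT_EXTRACTOR = "default_extractor"


# Image format labels copied from the supported-types list in this topic.
OCR_IMAGE_FORMATS = {
    "JPEG_2000_JP2_File", "JBIG2", "MacPaint", "PC_Paintbrush", "BMP",
    "JPEG_File_Interchange", "PNG", "GIF_87a", "GIF_89", "TIFF",
}


def route_file(file_format, pdf_scanned_only=False, default_extraction_failed=False):
    """Return where a file would be sent for text extraction (hypothetical helper)."""
    # Supported image types, including images embedded in Office documents,
    # go to the OCR server.
    if file_format in OCR_IMAGE_FORMATS:
        return Route.OCR_SERVER
    if file_format == "PDF":
        # Scanned-only PDFs go to the OCR server; other PDFs (including
        # hybrid files) go there only if the default extractor fails.
        if pdf_scanned_only or default_extraction_failed:
            return Route.OCR_SERVER
        return Route.DEFAULT_EXTRACTOR
    # Assumption: anything else stays with the default extractor.
    return Route.DEFAULT_EXTRACTOR
```

For example, route_file("PDF") yields the default extractor for a hybrid PDF, while route_file("PDF", default_extraction_failed=True) yields the OCR server.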

The OCR server can analyze images that meet the following criteria:

  • 32,000 x 32,000 pixels or less
  • 300 DPI resolution for images with large text (10-point font or larger)
  • 400-600 DPI resolution for images with small text (9-point font or smaller)
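
As a rough pre-check before sending images for analysis, the criteria above might be expressed as in the following Python sketch. The function names within_pixel_limit and recommended_dpi are hypothetical and not part of the product. The sketch reads "32,000 x 32,000 pixels or less" as a limit on each dimension, which is an interpretation, and it returns no recommendation for font sizes between 9 and 10 points because this topic does not cover them.

```python
MAX_DIMENSION_PX = 32_000  # limit stated in this topic


def within_pixel_limit(width_px, height_px):
    """True if neither dimension exceeds the stated limit (interpretation)."""
    return width_px <= MAX_DIMENSION_PX and height_px <= MAX_DIMENSION_PX


def recommended_dpi(font_pt):
    """Return the (min, max) DPI stated for a font size, or None if unspecified."""
    if font_pt >= 10:
        return (300, 300)   # large text: 300 DPI
    if font_pt <= 9:
        return (400, 600)   # small text: 400-600 DPI
    return None             # sizes between 9 and 10 pt are not covered here
```

An image scanned at 8-point text would therefore call for a resolution in the 400-600 DPI range.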

Use the System Modules page to configure the languages to analyze and to fine-tune the module’s accuracy profile to optimize performance.

View OCR server status on the Main > Status > System Health page.