Regular expression patterns

Regular expression (regex) patterns are text strings that describe search patterns to detect within content. (Content includes the body of the transaction as well as any attachments.) You define the patterns to look for and set the action to take when a pattern is found.

For example, the string “a\d+” matches all strings that start with the letter “a” and are followed by at least one digit, where “\d” represents any digit and “+” represents “at least one.”

When the extracted text from a transaction is scanned, Forcepoint DLP uses regular expressions to find strings in the text that match patterns for confidential information. For example, this is a very basic regular expression for catching Visa credit card numbers:

\b(4\d{3}[\-\\]\d{4}[\-\\]\d{4}[\-\\]\d{4})\b
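To see how these two patterns behave, here is a minimal Python sketch that uses the standard re module. It is only an illustration and not how Forcepoint DLP evaluates patterns internally; the helper name find_matches and the sample text are assumptions made for this example. Note that, as written, the character class [\-\\] in the Visa pattern accepts a hyphen or a backslash as the separator, so a number separated by spaces does not match.

```python
import re

# "a\d+": the letter "a" followed by one or more digits.
SIMPLE = re.compile(r"a\d+")

# The basic Visa pattern from the text, copied unchanged.
# The class [\-\\] accepts a hyphen or a backslash between digit groups.
VISA = re.compile(r"\b(4\d{3}[\-\\]\d{4}[\-\\]\d{4}[\-\\]\d{4})\b")


def find_matches(pattern, text):
    """Return every non-overlapping match of pattern in text (hypothetical helper)."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "Order a42 was paid with card 4111-1111-1111-1111 on file."
simple_hits = find_matches(SIMPLE, sample)
visa_hits = find_matches(VISA, sample)
print(simple_hits, visa_hits)
```

A pattern this loose also matches any 16-digit string in that layout that begins with 4, which is one reason a basic expression can generate false positives.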

Because a regular expression file contains many internal attributes, an improperly written one can create many false-positive incidents, slow down the system, and impede analysis.

One way of mitigating false positives in a pattern is to exclude certain values that falsely match it. When defining the classifier, define a “Pattern to exclude” listing words or phrases that are exceptions to the pattern rule (for example, search for all Social Security numbers except specific values that look like Social Security numbers but are not).
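The idea of excluding known false matches can be sketched in Python as follows. This is an assumption-laden illustration, not the product's implementation: the simplified SSN-like pattern, the exclusion values, and the function name matches_after_exclusion are all hypothetical.

```python
import re

# Simplified SSN-like pattern (illustrative assumption, not a product pattern).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical "pattern to exclude": placeholder values that look like
# Social Security numbers but should not count as matches.
EXCLUDE = re.compile(r"000-00-0000|123-45-6789")


def matches_after_exclusion(text):
    """Return SSN-like matches that are not covered by the exclusion pattern."""
    return [
        m.group(0)
        for m in SSN.finditer(text)
        if not EXCLUDE.fullmatch(m.group(0))
    ]


text = "Sample form: 123-45-6789. Employee record: 321-54-9876."
kept = matches_after_exclusion(text)
print(kept)
```

Only the values that survive the exclusion step would then count toward the number of matches.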

You can also add a “List of phrases to exclude” with words or phrases that, when found in combination with the pattern, affect whether or not the content is considered suspicious.

Another way to mitigate false positives is to consider the pattern suspicious only when some other pattern or set of words also appears in the analyzed data. To do this, create each content classifier (a pattern, a dictionary, or any other type), then combine them in a rule condition with an AND operator.
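As a rough illustration of an AND condition, the Python sketch below treats content as suspicious only when both a pattern classifier and a dictionary classifier match. The simplified pattern, the keyword list, and all function names are assumptions for this example; the dictionary check is a naive case-insensitive substring test.

```python
import re

# Simplified SSN-like pattern (illustrative assumption).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical dictionary of supporting terms.
KEYWORDS = ("social security", "ssn")


def pattern_found(text):
    """True if the pattern classifier matches anywhere in the text."""
    return SSN.search(text) is not None


def dictionary_found(text):
    """True if any dictionary term appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in KEYWORDS)


def is_suspicious(text):
    """AND condition: both classifiers must match."""
    return pattern_found(text) and dictionary_found(text)


print(is_suspicious("SSN: 321-54-9876"))
print(is_suspicious("Part number 321-54-9876"))
```

A number that merely resembles the pattern, such as a part number, no longer triggers on its own because the supporting term is missing.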

When creating a rule for a policy, specify how many instances (matches) of the pattern must be found before the content is considered suspicious enough for the configured action to be taken (for example, 4 or more Social Security numbers).

For each content transmission, the system tallies the number of instances of the pattern found in the content.

  • If the number of pattern matches is less than the number of matches set, the content is not considered suspicious and there is no further analysis.
  • If the number of pattern matches is equal to or greater than the number of matches set, the content triggers the action specified in the rule.

Example:

The pattern is Social Security numbers and the number of matches is 4.

The body of an email contains 3 Social Security numbers; the subject contains 2 Social Security numbers.

Because there are 5 pattern matches in total, which is greater than the configured number of matches (4), the message triggers the action specified in the rule that uses this pattern.
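The tally in this example can be reproduced with a short Python sketch. It assumes a simplified SSN-like pattern and counts every match across all parts of the message; the names count_matches and triggers_action, and the sample values, are hypothetical.

```python
import re

# Simplified SSN-like pattern (illustrative assumption).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Number of matches configured in the rule for this example.
THRESHOLD = 4


def count_matches(*parts):
    """Tally pattern matches across all parts of the content."""
    return sum(len(SSN.findall(part)) for part in parts)


def triggers_action(*parts, threshold=THRESHOLD):
    """True when the tally is equal to or greater than the threshold."""
    return count_matches(*parts) >= threshold


subject = "Records 111-22-3333 and 222-33-4444"
body = "See 333-44-5555, 444-55-6666, and 555-66-7777."
total = count_matches(subject, body)
print(total, triggers_action(subject, body))
```

Taken alone, the body holds only 3 matches and would fall below the threshold; it is the combined tally of 5 that triggers the action.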