How Forcepoint DLP machine learning works
To learn what to protect, the system is trained on two sets of examples:
- Content that needs to be protected (“positive” examples)
- Counterexamples (“negative” examples)
Counterexamples are documents that are thematically related to the positive set, yet are not meant to be protected. Examples might be public patents versus drafts of patent applications, or non-proprietary source code versus proprietary source code.
Because it can be difficult and labor-intensive to gather enough documents for the negative set (while ensuring that it contains no positive examples), Forcepoint has developed methods that allow the system to use a generic ensemble of documents as counterexamples. (See Negative examples consisting of “All documents” and Positive examples.)
For text-based data, some of the algorithms automatically create an optimal “weighted dictionary” that assigns positive weights to terms and phrases more likely to appear in the positive set, and negative weights to terms and phrases more likely to appear in the negative set. The algorithms also find an optimal threshold: when the weighted sum of the terms found in a given document exceeds that threshold, the algorithm assigns the document to the positive set. The underlying assumption is that positive examples are more likely to share common themes.
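As a rough illustration of this idea, the sketch below implements a toy weighted dictionary in Python. The terms, weights, and threshold are invented for the example; in the product they are derived automatically from the positive and negative training sets, not hand-coded.

```python
# Illustrative sketch only: a toy weighted-dictionary classifier.
# All terms, weights, and the threshold below are hypothetical.

# Positive weights for terms common in the protected ("positive") set,
# negative weights for terms common in the counterexample ("negative") set.
weighted_dictionary = {
    "internal draft": 2.5,
    "confidential": 1.8,
    "patent application": 1.2,
    "public filing": -1.5,
    "open source": -2.0,
}

THRESHOLD = 1.0  # hypothetical threshold learned from the training sets


def classify(document_text: str) -> bool:
    """Return True when the weighted sum of matched terms exceeds the
    threshold, i.e., the document is treated as belonging to the positive set."""
    text = document_text.lower()
    score = sum(weight for term, weight in weighted_dictionary.items() if term in text)
    return score > THRESHOLD


if __name__ == "__main__":
    sample = "Internal draft of the confidential patent application."
    print(classify(sample))  # True: 2.5 + 1.8 + 1.2 = 5.5, which is above 1.0
```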
Most machine learning algorithms are designed to be used with several hundred or several thousand positive and negative examples and require “clean” data, or data that is correctly labeled. Forcepoint DLP machine learning, however, uses different algorithms for different data sizes and attempts to automatically match the type of algorithm to the size of the training data.
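The sketch below shows one way such size-based selection could look in Python. The size cutoffs and the specific scikit-learn models are assumptions chosen for illustration; they are not the algorithms Forcepoint actually uses.

```python
# Hypothetical sketch: pick a model family based on how many labeled
# examples are available. Cutoffs and model choices are illustrative only.
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC


def choose_classifier(n_positive: int, n_negative: int):
    """Return a classifier suited to the amount of training data."""
    n_total = n_positive + n_negative
    if n_total < 100:
        # Very small sets: a simple, high-bias model is less likely to overfit.
        return MultinomialNB()
    elif n_total < 1000:
        # Mid-sized sets: a regularized linear model.
        return LogisticRegression(max_iter=1000)
    else:
        # Larger sets can support a discriminative linear classifier.
        return LinearSVC()
```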
In addition, Forcepoint DLP machine learning algorithms can detect “outliers” among a set of positive examples. These are examples that should probably not be labeled “positive.” Forcepoint algorithms also allow learning to take place even when negative examples are not provided.
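One simple way to picture outlier detection is to flag positive examples whose scores under the learned model fall far below the rest of the positive set. The scoring scheme and z-score cutoff below are assumptions for illustration, not Forcepoint’s actual method.

```python
# Hypothetical sketch: flag positive examples whose scores are unusually low
# compared with the other positive examples (candidates for mislabeling).
import statistics


def find_outliers(positive_scores: dict[str, float], z_cutoff: float = -2.0) -> list[str]:
    """Return names of positive examples scoring far below the set's mean."""
    scores = list(positive_scores.values())
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # assumes at least two examples
    return [
        name
        for name, score in positive_scores.items()
        if stdev > 0 and (score - mean) / stdev < z_cutoff
    ]
```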