Accuracy of machine learning

The ability of the system to accurately classify data depends to a large extent on the examples provided. If Forcepoint DLP machine learning fails to find enough common elements, its results may not be accurate. Should this happen, the system performs another stage of validation to assess the level of false positives (unintended matches) and false negatives (undetected matches) on new data that is not used during the training phase, sometimes referred to as “zero-day documents.”

If the “recall” level of the classifier (the number of true positives divided by the sum of true positives and false negatives in the new data) is below 70 percent, the system returns a FAIL message that includes the likely reason the attempt to classify the data accurately failed.
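The recall check can be illustrated with a minimal sketch. It uses the standard definition of recall, TP / (TP + FN), and the 70 percent cutoff mentioned above. The function and constant names are hypothetical and are not part of any Forcepoint API; the product's internal calculation is not documented here.

```python
# Illustrative only: names below are assumptions, not Forcepoint DLP identifiers.
RECALL_THRESHOLD = 0.70  # the 70 percent cutoff described in the text


def recall(true_positives: int, false_negatives: int) -> float:
    """Standard recall: TP / (TP + FN). Returns 0.0 if there are no positives."""
    total_positives = true_positives + false_negatives
    if total_positives == 0:
        return 0.0
    return true_positives / total_positives


def passes_validation(true_positives: int, false_negatives: int,
                      threshold: float = RECALL_THRESHOLD) -> bool:
    """A recall below the threshold counts as a failure; anything else passes."""
    return recall(true_positives, false_negatives) >= threshold


if __name__ == "__main__":
    # 60 of 100 relevant validation documents detected, 40 missed.
    print(recall(60, 40), passes_validation(60, 40))
```

For example, a classifier that detects 60 of 100 relevant validation documents has a recall of 60 percent and would fall below the cutoff.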

Error messages include:
Error Code Error Message
DSCV_ERR_-420_CODE There are not enough examples in your positive examples folder. X were provided and at least Y are required. Please add more examples then restart the machine learning process.
DSCV_ERR_-421_CODE There are not enough examples in your negative examples folder. X were provided and at least Y are required. Please add more examples then restart the machine learning process.
DSCV_ERR_-422_CODE The files in your positive examples folder do not contain enough text. Of X files provided, only Y have enough text. At least Z are required. Please update the files or point to another folder, then restart the machine learning process.
DSCV_ERR_-423_CODE The files in your negative examples folder do not contain enough text. Of X files provided, only Y have enough text. At least Z are required. Please update the files or point to another folder, then restart the machine learning process.
DSCV_ERR_-424_CODE Your positive and negative examples are too similar. No significant difference in words distribution was found. Please provide new examples.
DSCV_ERR_-425_CODE Your positive and negative examples are too similar, or your positive examples may not be consistent enough to draw conclusions. There were bad error rates on both training X and validation Y. Use different example folders in the classifier.
DSCV_ERR_-426_CODE The examples you provided were not sufficient for accurate training. Though the accuracy of the training set is good X, the machine learning process cannot make accurate conclusions on unseen data Y. Your positive examples may not be homogeneous enough. Please provide more consistent examples then restart the machine learning process.
DSCV_ERR_-427_CODE Your examples do not fit the content type you specified. You provided X positive examples, but only Y of them fit the type.
DSCV_ERR_-428_CODE The files in your example folders don't contain enough meaningful text (only X words). Please add files with more meaningful content or point to other folders, then restart the machine learning process.
DSCV_ERR_-429_CODE More than one file in your examples folders doesn't contain enough text (only X words). Please update the files or point to other folders, then restart the machine learning process.

By adjusting the sensitivity level of the classifier, administrators can reduce the number of false negatives (undetected matches) while accepting a higher level of false positives (unintended matches), accept some false negatives to reduce the rate of false positives, or find an acceptable balance in between.

Factors influencing the choice include:
  • The level of commonality in the positive set of examples (a low level tends to decrease accuracy)
  • The business implications of false positives
  • The resources that are available to deal with false positives
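
The sensitivity trade-off described above can be sketched with a toy example. It assumes, purely for illustration, that the classifier assigns each document a score between 0 and 1 and that a higher sensitivity corresponds to a lower match threshold. How Forcepoint DLP actually maps its sensitivity setting to a decision rule is not stated here, and the scores, labels, and function name are invented.

```python
# Illustrative only: the scoring model and all data below are assumptions.
def count_errors(scores, labels, threshold):
    """Return (false_positives, false_negatives) when scores >= threshold match."""
    false_positives = sum(
        1 for score, relevant in zip(scores, labels)
        if score >= threshold and not relevant
    )
    false_negatives = sum(
        1 for score, relevant in zip(scores, labels)
        if score < threshold and relevant
    )
    return false_positives, false_negatives


# Hypothetical classifier scores and whether each document is truly relevant.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20]
labels = [True, True, False, True, False, True, False]

if __name__ == "__main__":
    # A lower threshold stands in for a higher sensitivity setting.
    for threshold in (0.25, 0.50, 0.75):
        fp, fn = count_errors(scores, labels, threshold)
        print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

On this toy data, lowering the threshold removes the false negatives at the cost of more false positives, and raising it does the reverse, which is the balance an administrator weighs against the factors listed above.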