Machine Learning Wizard - Scanned Folders

Use the Scanned Folders page of the machine learning wizard to identify the documents that will be scanned and used for finding similar documents or parts of documents in the future.

Steps

  1. Under Positive Examples, identify the Path to a folder that contains examples of the type of textual data that you want to protect, so the system can learn from them and identify similar data in traffic.

    For example, to protect proprietary source code written in Java, supply the path to the location of the proprietary source code.

    • The examples in the folder should look similar. In other words, don’t include examples of all sensitive content in the same folder. Instead, create a new classifier for other types of content.
    • For best results, there should be at least 50 examples in this folder.
  2. Use the Content type drop-down list to select a type that best describes the content to protect. This must match the type of content in the positive examples folder.

    For example, select Java and C source code if the examples contain engineering source code written in Java. This tells the system how to interpret your data. Possible types include:

    • Java and C source code
    • Perl source code
    • F# source code
    • Patents
    • Software design documents
    • Movie manuscripts
    • Financial information - investments
    • Other

    If none of the types in the drop-down list applies to your content, select Other.

  3. Under Negative examples, use the check box to indicate whether negative examples are available.
    Note: If you selected “Other” in the Content type field, you must provide either negative or all-documents examples to help the system better understand your needs.

    If negative examples are available, identify the Path to the folder that contains the files. For best results, there should be at least 50 examples in this folder.

    The folder:

    • Should contain examples of textual data that is similar to but does not represent the data you want to protect
    • Must be dedicated to negative examples, and it cannot be a subdirectory of the positive examples folder

    For example, to protect proprietary source code, the negative examples might reside in the location of publicly available source code. After learning, the system will create a classifier that can tell the proprietary source code apart from the non-proprietary.

  4. Under All documents, select the check box if you do not have a dedicated folder of negative examples. Then identify the Path to a folder containing all types of documents found in your network and endpoint traffic. The system determines good negative examples for you.
    • The folder can contain both positive and negative examples.
    • The system compares the positive examples to the documents in this folder and decides which files represent negative examples.
    • You can also select this option in addition to providing negative examples. Supplying both improves the speed and accuracy of the classifier.
  5. Click Next to continue. See the Machine Learning Wizard - Scheduler section.
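
The folder rules in the steps above can be checked before you open the wizard. The following Python script is a minimal sketch, not part of the product: the function names (count_files, is_inside, check_folders) are hypothetical, and it assumes that files in subfolders count as examples, which the wizard may or may not do. It checks three things from the steps: each examples folder holds at least 50 files, the negative examples folder is not inside the positive examples folder, and the content type “Other” is accompanied by negative or all-documents examples.

```python
from pathlib import Path

# Recommended minimum number of examples per folder, taken from the steps above.
MIN_EXAMPLES = 50


def count_files(folder: Path) -> int:
    """Count regular files under folder, including subfolders (an assumption)."""
    return sum(1 for p in folder.rglob("*") if p.is_file())


def is_inside(child: Path, parent: Path) -> bool:
    """Return True if child is the same folder as parent or lies below it."""
    child, parent = child.resolve(), parent.resolve()
    return child == parent or parent in child.parents


def check_folders(positive, negative=None, all_documents=None,
                  content_type="Other"):
    """Return a list of warnings; an empty list means no problem was found."""
    warnings = []
    positive = Path(positive)

    if count_files(positive) < MIN_EXAMPLES:
        warnings.append(
            f"positive folder has fewer than {MIN_EXAMPLES} examples")

    if negative is not None:
        negative = Path(negative)
        # The negative folder must be dedicated and outside the positive folder.
        if is_inside(negative, positive):
            warnings.append(
                "negative folder must not be inside the positive folder")
        if count_files(negative) < MIN_EXAMPLES:
            warnings.append(
                f"negative folder has fewer than {MIN_EXAMPLES} examples")

    # "Other" requires negative or all-documents examples.
    if content_type == "Other" and negative is None and all_documents is None:
        warnings.append(
            'content type "Other" needs negative or all-documents examples')

    return warnings
```

A possible call, using placeholder paths that you would replace with your own folders:

```python
for warning in check_folders("examples/proprietary_java",
                             negative="examples/public_java",
                             content_type="Java and C source code"):
    print("Warning:", warning)
```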