3. First Discovery | Data Profiling and Monitoring

3.1 Activity Summary

Activity Description	This stage may represent the first full-scale discovery of the project, or the first discovery for a specific repository type. It will enable you to understand: the speed with which you can process this type of repository; the size of your data source; the responsiveness of your data to your chosen rule configuration; and how to review and interpret results.
Goals	Begin to understand the characteristics of your data; act on initial findings.
Participants	Application Administrator, Data Analysts
Pre-requisites	System has been deployed, configured, and validated. The inventory of data sources is available with valid credentials.
Outputs	Data has been discovered and locations assigned to Business Units for review.

3.2 Rule Configuration

The ActiveNav Cloud platform is pre-configured with basic Feature Extraction rules and a Scoring model designed to focus on common types of sensitive, personal, and financial data that indicate areas of risk in your repositories.

For a Data Profiling use case, the use of Feature Extraction is optional in the data discovery process. If you wish to target specific attributes of your data for profiling purposes, then you can extend the default Feature Extraction and Scoring configuration. Consult the ActiveNav Support team for further instructions.

The age of data is a common value analyzed in data profiling. Your business may have specific age ranges that are significant to the operation, and these may vary by Business Unit. The application comes with out-of-the-box age ranges, but you may update these to align with any specific data retention requirements that apply to your data.

Creating Date Ranges
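
To illustrate how age ranges partition data, the following Python sketch assigns an object to a range based on its last-modified date. The range boundaries, the labels, and the age_range helper are hypothetical and chosen only for illustration; they are not the ranges shipped with the product and do not represent any ActiveNav API.

```python
from datetime import date

# Hypothetical age ranges: (label, exclusive upper bound in years).
# Illustrative values only, not the product's built-in ranges.
AGE_RANGES = [
    ("0-1 years", 1),
    ("1-3 years", 3),
    ("3-7 years", 7),
]
OLDEST_LABEL = "7+ years"


def age_range(last_modified: date, today: date) -> str:
    """Return the label of the age range that last_modified falls into."""
    age_years = (today - last_modified).days / 365.25
    for label, upper in AGE_RANGES:
        if age_years < upper:
            return label
    return OLDEST_LABEL
```

Aligning the boundaries with your retention periods (for example, a seven-year retention rule) makes it easy to see how much data falls outside each period.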

If you plan to develop your own rules and configuration, you can either do so before you begin the Discovery process or start Discovery in parallel with the customization of rules. If you choose the parallel path, you will be able to re-process data later to apply the updated configuration and rules.

3.3 Select Location

When you are ready to proceed with Discovery, the first step is to choose the location to discover. If this is the very first “real” Data Source, then you should aim to select one of your smaller locations to be able to complete the step in a reasonable amount of time.

3.4 Configure Data Source

The creation of a data source for a complete location is the same as the process used when validating Collector operation during deployment. However, if you are configuring multiple data sources, you may use a CSV file to bulk load the paths. See the KBA below for how to bulk load data sources:

Bulk Load Data Sources
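
As a rough sketch of preparing such a file, the Python snippet below writes one CSV row per share path. The column names (name, path) and the build_bulk_load_csv helper are assumptions made for illustration; the actual column layout expected by the platform is defined in the KBA above and should be used instead.

```python
import csv
import io

# Hypothetical column layout; the real bulk-load format is defined in the KBA.
FIELDNAMES = ["name", "path"]


def build_bulk_load_csv(shares: list[str]) -> str:
    """Return CSV text with a header row and one row per share path."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for path in shares:
        # Use the last path segment as a placeholder display name.
        name = path.rstrip("\\").rsplit("\\", 1)[-1]
        writer.writerow({"name": name, "path": path})
    return buffer.getvalue()


print(build_bulk_load_csv([r"\\KS-NAS-02\IT"]))
```

Generating the file from your data source inventory, rather than typing it by hand, reduces the chance of mistyped paths.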

When the focus is on data profiling, you may choose the “Discovery Only” mode for data sources. This shortens discovery processing time while still enabling data profiling using metadata such as file age and type.

If you are targeting features within object names or content, then you should choose the “Discovery with Feature Extraction” mode to ensure that object content is inspected to apply feature extraction rules.

Your Data Source will be scheduled for Discovery as soon as the relevant Collector Group has a Collector available.

Example

Prest Team set up a Data Source for \\KS-NAS-02\IT as their first Discovery location. As a member of the IT team, Alexander is familiar with the volume and type of data present in this data share, and he will be able to quickly assess the effectiveness of the Discovery process and learn to utilize ActiveNav tools to review the findings. He can then use his experience to outline the standardized procedures for his Data Analyst team.

3.5 Monitor Data Source

The “Discovery with Feature Extraction” option takes longer to process because document content must be accessed. The exact rate at which a Data Source is processed depends on a range of environmental factors:

The speed with which containers and objects can be retrieved.
The specification of the Collector host.
The number of Collectors available to process content.
The average object size and the mix of object formats.

The ActiveNav Cloud UI shows a live indication of data throughput for the Data Source during Discovery activity. This can be used to review progress and estimate the total duration of the Discovery process.

The link below describes the possible statuses of the Data Source during and after the Discovery process.
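
The estimate itself is simple arithmetic, sketched below in Python. This assumes that throughput is read as objects per second and that you have an approximate total object count from your data source inventory; neither the units nor the estimate_remaining_hours helper come from the product.

```python
def estimate_remaining_hours(total_objects: int, processed_objects: int,
                             objects_per_second: float) -> float:
    """Estimate hours left, assuming throughput stays at the observed rate."""
    if objects_per_second <= 0:
        raise ValueError("throughput must be positive")
    # Guard against an inventory estimate that is lower than the real count.
    remaining = max(total_objects - processed_objects, 0)
    return remaining / objects_per_second / 3600


# Illustrative figures: 1,000,000 objects expected, 280,000 done, 50 objects/s.
print(estimate_remaining_hours(1_000_000, 280_000, 50.0))
```

Because throughput varies with the environmental factors listed above, treat the result as a rough guide and re-check it as Discovery progresses.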

Understanding Data Source Status

Normally, a Data Source finishes with a “Completed” or “Completed with Warnings” status. If you see another status, you may need to restart the Discovery process using the “Refresh All” option on the Data Source menu. You may also select the Audit option from the Data Source Actions drop-down menu to download an audit log with more information about discovery failures or issues. This KBA provides more information on data sources:

Data Source Overview

3.6 Set Up Business Units

As data is discovered, you will be able to begin assigning data paths to your Business Units. Data paths must have been discovered before they can be assigned to a business unit. As a result, it is normal for the set of data paths assigned to a Business Unit to grow as the number of data sources in the ActiveNav Cloud platform grows.

You can use your Business Unit mapping created during the preparation phase to guide this process, but it is also normal that the desired mapping evolves as you become more familiar with the discovered data.
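
One way to reason about such a mapping is as a lookup from path prefix to Business Unit, where the most specific prefix wins. The Python sketch below illustrates that idea only; the BU_MAPPING contents (including the Security subfolder) and the assign_business_unit helper are hypothetical and are not part of the ActiveNav platform.

```python
from typing import Optional

# Hypothetical mapping prepared during the preparation phase:
# path prefix -> Business Unit name.
BU_MAPPING = {
    r"\\KS-NAS-02\IT": "IT",
    r"\\KS-NAS-02\IT\Security": "Security",
}


def assign_business_unit(path: str, mapping: dict[str, str]) -> Optional[str]:
    """Return the Business Unit whose prefix matches path most specifically."""
    normalized = path.rstrip("\\").lower()
    best_prefix = ""
    best_unit = None
    for prefix, unit in mapping.items():
        candidate = prefix.rstrip("\\").lower()
        # Match whole path segments so "\IT" does not match "\ITOps".
        if normalized == candidate or normalized.startswith(candidate + "\\"):
            if len(candidate) > len(best_prefix):
                best_prefix, best_unit = candidate, unit
    return best_unit
```

Paths that return no Business Unit are a useful signal that the mapping needs to be extended as new data sources are discovered.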

One efficient way to configure Business Units is to import Data Source locations in bulk and assign the locations to Business Units at the same time. Please refer to the following KBA:

Bulk Load Data Source

Alternatively, if Data Source locations do not map directly to your Business Units, you can use Import Multiple Business Units to import Business Units in bulk. This KBA provides direction on how to create business units:

How to Create Business Units

3.7 Visualize Results

With Business Units configured, the dashboards within ActiveNav Cloud let you review the scored, discovered data.

Review of the findings from Discovery will help you understand the extent of inappropriate data identified. The following section will describe the recommended steps for acting on these findings.

Overview of Results

The Administrator Dashboard provides a summary of the total data within the tenant and the distribution of data by file type and age, along with Data Hotspots by Business Unit, repository, and geo-location.

After the very first discovery, or the first discovery for a specific repository or geo-location, this view can provide simple confirmation of the data found.

The Home Page

Data Profile

The primary view used to understand the details of your data profile is the Profile Dashboard. In the Profile view, you can select Business Units of interest and review statistics on the age, size, and type distribution of the data in each Business Unit. Refer to the following KBA for details:

Profile Dashboard
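
To make the idea of a type and size distribution concrete, the Python sketch below totals object sizes per Business Unit and file type. The record layout (Business Unit, file type, size in bytes), the sample values, and the size_by_type helper are assumptions for illustration; the dashboard computes its own statistics and this is not how the product is implemented.

```python
from collections import defaultdict


def size_by_type(records):
    """Sum object sizes per Business Unit and file type.

    records: iterable of (business_unit, file_type, size_bytes) tuples.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for unit, file_type, size_bytes in records:
        # Normalize case so "PDF" and "pdf" are counted together.
        totals[unit][file_type.lower()] += size_bytes
    # Convert nested defaultdicts to plain dicts for predictable output.
    return {unit: dict(types) for unit, types in totals.items()}


# Hypothetical sample records.
sample = [("IT", "PDF", 100), ("IT", "pdf", 50), ("IT", "docx", 10), ("HR", "pdf", 5)]
print(size_by_type(sample))
```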

Specific Data

If you use a custom scoring and feature extraction configuration to identify specific data, you can assess these findings using the Analyst Dashboard. As with the Profile Dashboard, this view is driven by the selected Business Units.

For each Business Unit, the Scoring hierarchy is presented with the scores indicated by color and value. This allows users to rapidly assess the data profile of the chosen Business Units according to the configuration, and to see which Business Units, or which aspects of the hierarchy, have the least compliant data in a profiling sense. See the following KBA for details:

Analyst Dashboard

Next Step

Profile Review allows an Analyst to dig deeper into the details of each container and object. Explore methods for creating object lists for targets of remediation activities.