1. Preparation | Data Profiling and Monitoring

1.1 Activity Summary

Activity Description

Preparing the key elements of your data discovery project will help ensure that the initial deployment of ActiveNav Cloud runs as smoothly as possible. Taking the time to identify locations, prepare credentials, and engage with key users will allow you to achieve results as quickly as possible.

Goals

A project plan is in place to allow the initial deployment of ActiveNav Cloud to run smoothly.

Participants

Project Sponsor, IT team, Project / Program Manager

Pre-requisites

  • Cloud Services agreement is in place.
  • Project / Program manager identified.

Outputs

  • Top level plan outlines goals and timeline.
  • Initial inventory of repositories to be discovered is ready.
  • Credentials prepared for access to repositories.
  • Deployment architecture for FileShare Collectors is understood.

Activities


As in any project, taking the time to establish the scope of the data profiling activity will ensure that deployment and operation of ActiveNav Cloud can proceed as smoothly as possible. We recommend paying attention to the following preparation activities to orientate the entire project team ahead of deployment.

1.2 Project Plan, Milestones, and Timelines

The first goal should be to establish the scope of your data discovery project so that you can determine:

  • An inventory of the data locations to be addressed, the repository types and their estimated volume.
  • An estimated timescale for discovery based on the data volume.
  • The positive and negative data profile characteristics for your business.
  • The relevant users for assessment of findings.

An inventory should be prepared that itemises the repositories that will be targeted using ActiveNav Cloud, with estimates for the volume of data held, and noting whether key activities such as preparing credentials have been completed. This inventory can then provide a basis for the later activities of defining Data Sources and Business Units.

The volume of data to be discovered will be a fundamental factor in establishing the duration of your initial data discovery phase. If you prepare a list of the repositories that you plan to discover with ActiveNav Cloud, noting their location and approximate size, you will be able to set expectations around the time required for discovery (ActiveNav Cloud support can provide details of expected performance for different data repository types). The size and geographic location of data repositories also provide key insights to guide deployment decisions.
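As a rough illustration of how such an inventory can drive a timescale estimate, the sketch below totals the estimated volume per location and divides it by a throughput figure. The `Repository` record, the example entries, and the `ASSUMED_GB_PER_DAY` values are hypothetical placeholders, not ActiveNav Cloud figures; substitute the expected performance that ActiveNav Cloud support provides for your repository types.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical throughput per repository type, in GB per day per Collector.
# Placeholder values only; ask ActiveNav Cloud support for real figures.
ASSUMED_GB_PER_DAY = {"file_share": 500.0, "sharepoint_online": 200.0}


@dataclass
class Repository:
    name: str
    repo_type: str            # key into ASSUMED_GB_PER_DAY
    location: str             # data center or cloud region
    size_gb: float            # estimated volume
    credentials_ready: bool = False


def estimate_days(repos, collectors_per_location=1):
    """Return {location: estimated discovery days}.

    Assumes locations are processed in parallel and that throughput scales
    linearly with the number of Collectors, which is a simplification.
    """
    days = defaultdict(float)
    for repo in repos:
        rate = ASSUMED_GB_PER_DAY[repo.repo_type] * collectors_per_location
        days[repo.location] += repo.size_gb / rate
    return dict(days)


# Hypothetical inventory entries.
inventory = [
    Repository("fs-01", "file_share", "London", 2000.0, credentials_ready=True),
    Repository("fs-02", "file_share", "Leeds", 1000.0),
    Repository("sales-site", "sharepoint_online", "cloud", 100.0),
]

estimates = estimate_days(inventory)
for location, d in sorted(estimates.items()):
    print(f"{location}: about {d:.1f} days")

# Repositories that still need credentials before discovery can start.
pending = [r.name for r in inventory if not r.credentials_ready]
print("Credentials pending:", ", ".join(pending))
```

Keeping the credentials flag in the same record means the inventory doubles as a readiness checklist for the later preparation steps.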

The default configuration of ActiveNav Cloud implements a Scoring configuration that targets common forms of sensitive data. For a data profiling use case, you may wish to introduce other feature extraction and scoring configurations to support specific data profiling goals.

Finally, you should begin to consider the users within your business that will help you achieve your goals in understanding the profile of the data that is found. You will need to engage with key individuals for different aspects of your business, ensure they understand the reasons for the data discovery process, and set expectations for their involvement in reviewing the results of your discovery project.

Key roles within a data discovery project include the following.

  • Project Sponsor: Executive ownership of the project; responsible for determining overall scope and goals.
  • Project Manager: Overall responsibility for guiding the team to achieve the goals of the project.
  • IT Team: Provides expertise in installing on-premises components, preparing credentials for data access, and configuring cloud data sources.
  • Application Administrator: Responsible for managing the ActiveNav Cloud application, facilitating access for end users, and configuring discovery tasks.
  • Data Analysts: A team of end users trained to review discovery findings. Normally different Business Units will be represented by dedicated Data Analysts who understand the working practices of the Business Unit to enable them to review the data found in context.

Depending on the scale of the project and the type of organization, an individual may be responsible for multiple roles.

Example

Prest Team are deploying ActiveNav Cloud to gain an understanding of all the data they hold, but with a particular emphasis on the new data that has arrived because of the acquisition of RBTSkills.

Alexander Knight will be their Project Manager and Application Administrator. He will delegate roles to other users to review the findings of the discovery project.

Alexander works with the IT team to build out the inventory of key locations that they will target using ActiveNav Cloud, noting their location and estimated volume. This inventory will then assist in the definition of the deployment architecture and configuration of data discoveries.

1.3 Define Deployment Architecture

ActiveNav Cloud uses processes we call Collectors to perform data discovery from your content repositories.

Discovery of cloud-based repositories (e.g. Office 365) is performed by Cloud Collector processes hosted within ActiveNav Cloud, and you simply need to provide repository location and credentials to enable access.

Aside from cloud-hosted repositories, almost every project will involve some processing of on-premises content. Currently, on-premises discovery is supported for Windows file share and iManage Work repository types.

Discovery of your on-premises content requires the deployment of at least one Collector application within your own network for each repository type you wish to target.

The precise number of Collectors that you need will depend on your volume of data, its distribution, and your desired timescale for discovery. The architecture overview below illustrates how different patterns of Collectors may be used for different areas of a business.

The on-premises Collectors are applications installed on a Windows server as close as possible, in network terms, to the data they will be used to discover. ActiveNav Cloud allows you to deploy multiple Collectors to address your content; the following considerations will help you decide on the most appropriate deployment plan:

  • Data Location. If your file shares are physically distributed, e.g. you have data centers in different locations, then you should locate at least one File Share Collector in each network location. Discovery throughput will be significantly reduced if the Collector is used to address data across a wide area network.
  • Data Volume. In any given location, you can increase the number of File Share Collectors deployed to achieve higher discovery throughput. This would normally be a good choice for locations with large data volumes.

Architecture

Collectors are organized into Collector Groups, which define Collectors with geographical affinity. Each distinct physical location will require at least one Collector Group to be defined, and you should ensure that you set the correct Collector Group ID during installation. This allows ActiveNav Cloud to request work to be carried out by the most appropriate available Collector.
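The location and volume considerations above can be sketched as a simple planning calculation: one Collector Group per physical location, at least one Collector per group, and additional Collectors where the volume would otherwise exceed the target discovery window. The function name, the throughput value, and the location names are illustrative assumptions rather than ActiveNav Cloud settings, and the sketch assumes throughput scales linearly with the number of Collectors.

```python
import math


def plan_collector_groups(volume_gb_by_location, gb_per_day_per_collector, target_days):
    """Suggest a Collector count per location (one Collector Group each).

    All inputs are planning estimates; the throughput figure is an assumption
    that should be replaced with measured or vendor-supplied performance.
    """
    if gb_per_day_per_collector <= 0 or target_days <= 0:
        raise ValueError("throughput and target_days must be positive")
    # Volume a single Collector could cover within the target window.
    capacity_gb = gb_per_day_per_collector * target_days
    plan = {}
    for location, volume_gb in volume_gb_by_location.items():
        # Every location needs at least one Collector, even for small volumes.
        plan[location] = max(1, math.ceil(volume_gb / capacity_gb))
    return plan


# Hypothetical volumes for two data centers.
plan = plan_collector_groups(
    {"datacenter-a": 12000.0, "datacenter-b": 800.0},
    gb_per_day_per_collector=500.0,
    target_days=10,
)
for location, count in plan.items():
    print(f"Collector Group '{location}': {count} Collector(s)")
```

A result like this is only a starting point for discussion with your IT team; actual throughput depends on the network and the repository.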

For more details about the way that Collector Groups are used to manage the interaction of Collectors and repositories, see the overview at the link below:

https://support.activenav.com/collector-and-collector-groups-overview

Example

Their repository inventory shows that Prest Team has 2 file servers, which are in different data centers. Because the connection between the data centers depends on a high-latency WAN, they will configure one Collector Group for each data center, with Windows File Share Collectors installed in each data center and assigned to the relevant Collector Group.

This allows each Collector to maximize the speed of data access for best performance, and the independent Collector Groups mean that ActiveNav Cloud will be able to process content from each data center simultaneously. Because the overall data volume is moderately sized, only one Collector is deployed in each Collector Group.

Cloud Collectors are deployed automatically as part of their ActiveNav Cloud tenant and used to discover SharePoint Online, Teams, Exchange and Google Drive data repositories.

[Figure: Prest Team deployment architecture]

1.4 Provision environment and data access

At this stage your ActiveNav tenant will have been provisioned for you and your initial administrator user created. You should ensure that this user can access the cloud platform and that MFA configuration can be achieved in line with any specific requirements that your business may have.

For each of the data repositories identified in your inventory of locations you will need to ensure you have arranged the relevant access credentials. Recording these credentials in your ActiveNav Cloud tenant in advance will simplify the creation of data sources later.

File Share Access

For each file share you intend to discover you should ensure you have credentials that have read access for the file share itself and the underlying file system. It may be possible for your IT department to provide details of an existing account, or they may prefer to configure an account specifically for your project. You can test access for the account by mapping a network drive using the provided credentials.
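As a scripted alternative to mapping a network drive, the sketch below attempts to list a share and read one byte from a file in it. This is a generic check written for illustration, not an ActiveNav Cloud tool; `check_read_access` is a hypothetical helper, and it runs with the permissions of whichever account launches the script, so it should be started under the account being tested.

```python
import os
import tempfile


def check_read_access(root):
    """Try to list a directory and read one byte from the first file found.

    Returns (ok, message). Only the top level of the directory is examined.
    """
    try:
        entries = os.listdir(root)
    except OSError as exc:
        return False, f"cannot list {root}: {exc}"
    for name in entries:
        path = os.path.join(root, name)
        if os.path.isfile(path):
            try:
                with open(path, "rb") as handle:
                    handle.read(1)
            except OSError as exc:
                return False, f"cannot read {path}: {exc}"
            return True, f"listed {len(entries)} entries and read {name}"
    return True, f"listed {len(entries)} entries; no files at top level to read"


# Demonstration against a temporary directory. In practice, pass the UNC path
# of the share, for example r"\\server\share" (a hypothetical name).
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "sample.txt"), "w") as f:
        f.write("hello")
    ok, message = check_read_access(demo_dir)
    print(ok, message)
```

A successful result here only confirms access at the top of the share; permissions on subfolders can differ and may still need review.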

File Share credentials can be entered in the ActiveNav Cloud platform as “Basic Credentials” where you will specify the Windows domain, username and password.

Adding Basic Credentials

iManage Work

For on-premises iManage Work repositories, you must configure a Client ID and associated Secret within iManage to represent the iManage Work Collector, along with a user account that will be used by the Collector application.

These details are then entered in the ActiveNav Cloud platform as a credential of type "iManage Work Ropc".

Adding iManage Work Credentials

M365 Cloud Repositories

For Microsoft M365 data repositories, access for Cloud Collectors is achieved by registering an application in Azure AD. Each repository type requires a different set of API permissions to be granted, as outlined in the following Knowledge Base Articles (KBAs).

Configuring Azure AD for SharePoint Online (📺)

Configuring Azure AD for Exchange (📺)

Configuring Azure AD for Teams

Configuring Azure AD for OneDrive

When setting up your application, make a note of your Tenant ID, App ID and assigned Secret Value. These will be used when configuring credentials within the ActiveNav Cloud platform.
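Because these values are easy to mix up when copied by hand, a quick format check before entering them can help. The sketch below relies on the fact that Azure AD tenant and application IDs are GUIDs, and treats a GUID-shaped secret as a likely sign that the Secret ID was copied instead of the Secret Value. The helper names and sample values are hypothetical, and the check says nothing about whether the values are actually valid in your tenant.

```python
import uuid


def looks_like_guid(value):
    """Return True if value is a GUID in canonical hyphenated form."""
    if not isinstance(value, str):
        return False
    candidate = value.strip().lower()
    try:
        return str(uuid.UUID(candidate)) == candidate
    except ValueError:
        return False


def check_m365_details(tenant_id, app_id, secret_value):
    """Return a list of formatting problems; an empty list means none were found."""
    problems = []
    if not looks_like_guid(tenant_id):
        problems.append("Tenant ID is not a GUID")
    if not looks_like_guid(app_id):
        problems.append("App ID is not a GUID")
    if not secret_value or looks_like_guid(secret_value):
        # A GUID here usually means the Secret ID was copied, not the Secret Value.
        problems.append("Secret Value is empty or looks like a Secret ID")
    return problems


# Placeholder values for illustration only; never hard-code real secrets.
problems = check_m365_details(
    "00000000-0000-0000-0000-000000000000",
    "not-a-guid",
    "example-secret-value",
)
print(problems or "No formatting problems found")
```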

Google Workspace Drive Cloud Repository

Access to Google Workspace Drive is facilitated by configuring a Service Account in the Google Cloud environment. You must also identify or create a Google user account that will be used to discover Google Workspace users and to access Shared Drive data.

The process for preparing the service account and associated user account is outlined in this KBA:

Creating a Google Workspace Service Account & User Account

Using the JSON file downloaded during the Service Account creation, and the identified user account, you can then configure Credentials in the ActiveNav Cloud platform.
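Before uploading the JSON file, it can be worth confirming that it is a complete service account key. The sketch below checks for fields that Google service account key files normally contain; the helper name is hypothetical, the field list is an assumption about which fields matter, and the sample input is deliberately incomplete to show the output.

```python
import json

# Fields normally present in a Google service account key file.
REQUIRED_FIELDS = (
    "type",
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
    "client_id",
    "token_uri",
)


def check_service_account_key(text):
    """Return a list of problems found in service account key JSON text."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not data.get(name)]
    key_type = data.get("type")
    if key_type and key_type != "service_account":
        problems.append(f"unexpected type: {key_type!r}")
    return problems


# Deliberately incomplete sample; a real key file would be read from disk.
sample = json.dumps({"type": "authorized_user", "client_id": "123"})
problems = check_service_account_key(sample)
for problem in problems:
    print(problem)
```

To check a downloaded file, read its contents (for example with `pathlib.Path(...).read_text()`) and pass the text to the function; the private key itself is never printed.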

Creating the GWD Credentials

iManage Cloud Repository

For access to an iManage Cloud repository, you must create a user account that will be entered in the ActiveNav Cloud credentials interface as a credential of type "iManage Cloud Ropc".

You must also request access to the ActiveNav Cloud application within the iManage Cloud instance.

Adding iManage Cloud Credentials

Additional Considerations for Cloud Repositories

Discovery of content within a Cloud repository will entail the transfer of data from the repository host to the Collector that is processing it. In some cases there may be cost implications for the transfer of data from a Cloud repository. You should review the terms and conditions for your repository to ensure you understand any such costs.

1.5 Align Key Staff Resources

ActiveNav Cloud provides a lens to observe the profile of data within your organization, but users must assess whether the data identified is appropriate and relevant to hold, and whether it is properly secured. You should identify key users within your organization who can provide expertise for the Business Units you have defined, and provide them with information about your project, the ActiveNav Cloud platform, and the role you will want them to perform.

Example

At Prest Team, the Project Manager, Alexander, briefs Maxine Steele, Anastasia Romano and Leonardo Rossi about the goal of ensuring that inappropriate data is not found within their data repositories. Leonardo in particular is briefed on the need to ensure that the user areas in Google Workspace Drive from the RBTSkills acquisition are well organized and do not hold stale data or inappropriate data types.

1.6 Identify Business Units

An organization will naturally hold information relating to a range of different business areas (e.g. Finance, IT, Sales). When the data discovered spans several business areas, you will need to enable specific Data Analysts to focus on the content that is relevant to them.

Within ActiveNav Cloud, this is achieved by configuring Business Units. Business Units can be used in a range of ways depending on the way that data is distributed across repositories. Some examples include:

  • In the simplest case a business may have a single file server with file shares for each department. In this case the Application Administrator can create one Business Unit per department and assign a single location – the relevant file share – to each business unit.
  • In a larger organization there will normally be multiple file servers – perhaps for different offices – where data relevant to a particular business area can be found in multiple locations. In this case multiple data paths would be assigned to a business unit, associating relevant locations from different file servers into a logical collection.
  • In some organizations related data may be split across different repository types. For instance it may be appropriate to combine SharePoint Online sites, Microsoft Teams and file shares into a business unit to collect together all relevant data for a particular business area.

Careful creation of Business Units will enable you to direct your users to attend to the data areas that they are responsible for, allowing them to assess the state of their data and the impact of their activities to clean up any issues identified.

While the Business Units that are required for your project and the data paths that are assigned to them will naturally evolve over time, establishing an initial plan for Business Units and associated locations will ensure that you will be ready to work with results as soon as your discovery process is underway.

The initial collection of Business Units will normally be a logical reflection of the structure of your business. Using your inventory of repositories and consulting with IT staff, you can create an initial mapping of data paths to these Business Units. This mapping can then be used to configure Business Units as you progress through the process of discovering your data.

Example

Prest Team review their inventory of locations and identify 3 logical Business Units to create in the first instance: “Sales”, “User Data”, and “Finance & Operational Data”. Data locations are mapped to these groups ready for their creation as their cleanup project begins.

Because the locations within the Prest Team repository inventory are already organized in a functional manner, they are able to use Business Units simply to collate primary repository locations together in logical groups. As noted above, larger organizations will normally have more complex distributions of data, and in such cases it is expected that lower-level locations within repositories will be identified to be mapped into appropriate Business Units.

The proposed Business Units for Prest Team are:

Sales

FileShare Content : \\RBT-543\Sales
Microsoft Teams : “Sales” Team
SharePoint Online : “Sales” site collection

Finance & Operational Data

FileShare Content : \\RBT-543\Operations, \\KS-NAS-02\Operations, \\KS-NAS-01\Finance
SharePoint Online : “Finance” site collection

User Data

Exchange Online : User Mailboxes
Google Workspace : Personal Drives

These are recorded in a spreadsheet to support straightforward import into ActiveNav Cloud as the data locations are discovered.
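One way to hold such a mapping is as a flat CSV with one row per location, which any spreadsheet tool can open. The sketch below builds that file from the Prest Team example; the three-column layout is an assumption made for illustration and is not a documented ActiveNav Cloud import format, so confirm the expected layout before relying on it.

```python
import csv
import io

# Business Unit -> list of (repository type, location), from the Prest Team example.
business_units = {
    "Sales": [
        ("FileShare Content", r"\\RBT-543\Sales"),
        ("Microsoft Teams", '"Sales" Team'),
        ("SharePoint Online", '"Sales" site collection'),
    ],
    "Finance & Operational Data": [
        ("FileShare Content", r"\\RBT-543\Operations"),
        ("FileShare Content", r"\\KS-NAS-02\Operations"),
        ("FileShare Content", r"\\KS-NAS-01\Finance"),
        ("SharePoint Online", '"Finance" site collection'),
    ],
    "User Data": [
        ("Exchange Online", "User Mailboxes"),
        ("Google Workspace", "Personal Drives"),
    ],
}

# Flatten to one row per location so the sheet can be sorted and filtered.
rows = [
    (unit, repo_type, location)
    for unit, locations in business_units.items()
    for repo_type, location in locations
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Business Unit", "Repository Type", "Location"])  # assumed columns
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To save the result to disk instead of printing it, open the target file with `newline=""` and pass the file object to `csv.writer`.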

Maxine Steele will review findings in the Sales Business Unit, ensuring that no inappropriate data types have been retained, e.g. data about old projects. Leonardo Rossi will be tasked with reviewing findings for the User Data Business Unit to ensure that users are not inadvertently replicating data in their personal storage areas. Anastasia Romano will review findings in the Finance & Operational Data Business Unit, ensuring that supplier and customer data is appropriately organized and disposed of when no longer needed.

Data that is not assigned to a specific Business Unit is grouped together under the “No Business Unit” category. Alexander will review the findings within unassigned locations, and this may allow him to add additional data areas to the initial set of Business Units, or to identify new Business Units to create.
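When planning, it can help to check in advance which known locations would fall outside every Business Unit. The sketch below assigns a path to the Business Unit with the longest matching root and falls back to "No Business Unit" otherwise. The matching rule is an assumption made for planning purposes, not a description of how ActiveNav Cloud assigns data internally, and the sample paths are hypothetical.

```python
NO_BUSINESS_UNIT = "No Business Unit"

# File share roots taken from the Prest Team example.
BUSINESS_UNIT_ROOTS = {
    r"\\RBT-543\Sales": "Sales",
    r"\\RBT-543\Operations": "Finance & Operational Data",
    r"\\KS-NAS-02\Operations": "Finance & Operational Data",
    r"\\KS-NAS-01\Finance": "Finance & Operational Data",
}


def business_unit_for(path):
    """Return the Business Unit whose root is the longest prefix of path."""
    normalized = path.rstrip("\\").lower()
    best_root, best_unit = "", NO_BUSINESS_UNIT
    for root, unit in BUSINESS_UNIT_ROOTS.items():
        root_norm = root.rstrip("\\").lower()
        # Match the root itself or anything beneath it, not mere name prefixes
        # (so \\RBT-543\SalesArchive does not match \\RBT-543\Sales).
        if normalized == root_norm or normalized.startswith(root_norm + "\\"):
            if len(root_norm) > len(best_root):
                best_root, best_unit = root_norm, unit
    return best_unit


# Hypothetical discovered paths.
sample_paths = [
    r"\\RBT-543\Sales\2023\q1.xlsx",
    r"\\RBT-543\SalesArchive\old.doc",
    r"\\KS-NAS-01\HR\staff.docx",
]
assignments = {p: business_unit_for(p) for p in sample_paths}
unassigned = [p for p, unit in assignments.items() if unit == NO_BUSINESS_UNIT]
print(f"{len(unassigned)} of {len(sample_paths)} paths have no Business Unit")
```

The unassigned list is a natural input for the review described above, since it highlights candidate locations for new or extended Business Units.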

1.7 Document Policies

In order to implement and attain compliance with a target data profile across your organization it is essential to identify, define and document the policies that will drive your organization to success. When reviewing the profile of your content, it is essential to understand and document what type of data is relevant to your organization, how long it should be retained for, and how it should be managed from a security and compliance perspective, in line with your policies.

You should establish which data types are expected, and those that are not. For acceptable data types you should establish in what form they are expected to be found, in which locations it is acceptable to find them, and the security controls that should be in place. If certain types of data should only be held for a limited time, this should be defined along with “health” factors such as an expected maximum age for data.

ActiveNav Cloud is pre-configured with a feature extraction and scoring configuration that focuses on commonly found items of sensitive data, such as national identifiers, financial identifiers, etc. If, as part of your data profiling activity, you wish to target features within document content or filenames, you may wish to customize the out-of-the-box configuration with additional Impact Areas, Data Element Types or Data Elements. It is also possible to customize the default date ranges used in Profile views to suit specific data retention policies that apply to your data.

https://support.activenav.com/creating-date-ranges

ActiveNav staff can provide guidance on customization of feature extraction and scoring configuration, but this is best carried out in parallel to the initial phases of discovery rather than allowing it to delay deployment.

https://support.activenav.com/scores-overview

Next Step

Service Deployment and Configuration outlines the work that is required to progress your project plan by deploying on-premises Collectors, configuring them, and validating their operation.