Written by Vincent Govaers
Data security is a big deal for modern organizations. As the importance of data analytics grows, more users get access to more data. Unfortunately, most efforts to protect data are giving data scientists and analysts instant headaches. Therefore, a practical approach towards data protection becomes crucial if you want to realize the full potential of your data assets while preventing data breaches, losses, and damages.
Our lives used to be rather simple: data was produced in operational systems and then transferred to large data warehouses. Business intelligence (BI) teams typically had access to one of these siloed databases for analytics. Today, it’s more complicated: data scientists require long histories of huge datasets, combine different data sources, and use advanced data mining or analytics techniques. On top of that, BI teams no longer have exclusivity on data analytics. Data is replicated and transformed by many teams working on distributed systems. This data security mess is becoming increasingly difficult to manage.
How are we supposed to implement a successful data protection strategy?
Access control is key. It involves three main challenges: identity management, authentication, and authorization. To address the first challenge, an identity management system such as Active Directory (AD) is usually implemented. The second challenge can be addressed by setting up an authentication protocol suitable for distributed systems, such as Kerberos. The toughest challenge in any enterprise is authorization. From a technical point of view, an application like Apache Ranger allows for centralization and flexibility. From an organizational point of view, however, authorization can quickly become chaos, especially when the systems support hundreds of users accessing thousands of datasets through scores of applications.
Authorization requires changes in the organization itself. A thorough data classification needs to be performed to categorize data based on its confidentiality, integrity, and availability levels. This classification should be done by the business departments, which can evaluate the importance of each dataset given the business context. Besides the classification of data, business roles need to be created and matched to groups in AD. Data access policies are then defined based on these groups, as well as on applications and computing resources. Finally, full-time functions, such as security specialists dedicated to distributed data systems, need to be introduced in the operational teams. They will implement and maintain the authorization mechanisms and perform day-to-day security tasks such as granting or revoking access and monitoring audit logs.
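To make the link between classification, groups, and policies concrete, here is a minimal sketch in Python. It is not how Apache Ranger or AD expose their data; every dataset name, group name, and function below is a hypothetical placeholder, and only the confidentiality dimension of the classification is modeled. The idea is simply that a user may read a dataset only if one of their groups has a policy covering that dataset at its classified level, and that anything unclassified is denied by default.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidentiality(IntEnum):
    """Hypothetical confidentiality levels assigned by the business."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


@dataclass(frozen=True)
class Policy:
    group: str                  # directory group the policy applies to
    dataset: str                # dataset the policy covers
    max_level: Confidentiality  # highest level the group may read


# Assumed outcome of the data classification exercise.
CLASSIFICATION = {
    "sales_orders": Confidentiality.INTERNAL,
    "customer_pii": Confidentiality.CONFIDENTIAL,
}

# Assumed group memberships, as they might be synced from a directory.
USER_GROUPS = {
    "alice": {"grp-data-science"},
    "bob": {"grp-bi-reporting"},
}

# Access policies are attached to groups, never to individual users.
POLICIES = [
    Policy("grp-data-science", "sales_orders", Confidentiality.INTERNAL),
    Policy("grp-data-science", "customer_pii", Confidentiality.CONFIDENTIAL),
    Policy("grp-bi-reporting", "sales_orders", Confidentiality.INTERNAL),
]


def can_read(user: str, dataset: str) -> bool:
    """Return True if any of the user's groups may read the dataset."""
    level = CLASSIFICATION.get(dataset)
    if level is None:
        return False  # unclassified data is denied by default
    groups = USER_GROUPS.get(user, set())
    return any(
        p.dataset == dataset and p.group in groups and level <= p.max_level
        for p in POLICIES
    )
```

In this sketch, granting or revoking access means changing a group membership or a policy entry, which is what keeps the day-to-day work manageable as the number of users and datasets grows.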
Sharing our experience
At BrightWolves, when implementing data protection in an organization, we start from first principles. We:
review existing data protection policies;
analyze existing organizational structures;
understand the technical setup, i.e., data systems and the current security mechanisms in place.
By mastering these topics, we are able to develop a tailored, step-by-step approach proposing policy updates, new organizational structures, and scalable solutions to facilitate a realistic and robust data protection setup for data analytics across the whole organization.