AI and Data Ethics: Data Governance
In this series on artificial intelligence (AI) and data ethics, we're talking about ADP's ethics principles of accountability and transparency, explainability, privacy, ethical use of data, and data governance.
We pulled together a team of experts to talk about data governance: ADP Sr. Director Data Scientists Xiaojing Wang and Vibhor Shukla; VP Product Development Zaf Babin; and Lead Business Systems Analyst Kristin Hlavinka. These are some of the people who work on ADP's Data Cloud team.
For those of us not that familiar with it, what is data governance?
Babin: Data governance is defining and enforcing who can access data: who can view it, use it, and transfer it. It also involves where the data should reside and in what state. We ask questions like, can it be transferred? Should we encrypt it?
Governance is important to make sure that whoever is handling the data follows the rules about how data is handled. We look at the data source and the rules for that zone, region, country, or state. For example, if we want to transfer certain data, we need to know whether we need to get permission and whether there are other privacy requirements that apply.
Shukla: We also look at the multiple kinds of users who are accessing the data and how they want to use it. Is it for research or analysis? Will they be using it to build a machine learning model? We also consider how the access rights should be handled and distributed. We designate what can be seen by whom and make sure the usage is correct.
Hlavinka: I define data governance as a structured approach to the design, creation, and operation of formal business processes and accountabilities to manage the enterprise's data through the entire data lifecycle. The data lifecycle begins with data acquisition and proceeds through the following stages: storage, synthesis, usage, publication (through analytics and products), archival, and purge.
Can you explain how you organize and handle data to do this?
Shukla: With Data Cloud, we store data in a data lake, which is basically the data we want organized in the way we want it. The data lake contains different zones depending on the type of data stored, with commensurate permissions set based on the data type.
At ADP, we classify data based on what it contains and how it will be used. For Data Cloud, we use aggregate data that has been anonymized to remove data that could identify someone. In other products like payroll, we need that information to make sure that paychecks are correct.
If someone wants access to sensitive data, there is a high bar before permission is granted. It requires multiple approvals, which review the purpose and confirm that the data will stay secure and that privacy is protected.
Desensitized data is a little easier to get access to, but we still make sure that the purpose is appropriate and that security and privacy are maintained at every step.
When we use data for analysis or benchmarks, we also make sure we have adequate sample sizes for privacy and accuracy.
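To make the sample-size point concrete, here is a minimal sketch of suppressing a benchmark when too few records back it. The threshold of 10, the function names, and the record fields are assumptions for illustration only; they are not ADP's actual values or code.

```python
from statistics import median

# Hypothetical threshold: groups smaller than this are not published.
MIN_SAMPLE_SIZE = 10


def benchmark_median(values, min_sample_size=MIN_SAMPLE_SIZE):
    """Return the median of values, or None if the group is too small."""
    if len(values) < min_sample_size:
        return None  # suppressed: too few records for privacy and accuracy
    return median(values)


def benchmarks_by_group(records, group_key, value_key,
                        min_sample_size=MIN_SAMPLE_SIZE):
    """Group records by group_key and compute a benchmark per group."""
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record[value_key])
    return {
        group: benchmark_median(values, min_sample_size)
        for group, values in groups.items()
    }
```

With this approach, a group with only a handful of records returns no benchmark at all, rather than a number that could be inaccurate or could point back to an individual.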
Babin: Accuracy is essential. We want to make sure the data quality is good and that it's cleaned and sorted so the user has the right information for the purpose. Then we ask whether any of the data is out of whack. Is there something we need that's missing? Is all of the data in the right format?
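The three questions above (is anything out of range, is anything missing, is everything in the right format) could be sketched as simple checks like the following. The field names, date format, and numeric range are hypothetical and chosen only to illustrate the idea.

```python
from datetime import datetime


def check_record(record, required_fields, date_fields=(),
                 date_format="%Y-%m-%d"):
    """Return a list of problems found in one record (empty if clean)."""
    problems = []
    # Is there something we need that's missing?
    for name in required_fields:
        if record.get(name) in (None, ""):
            problems.append(f"missing: {name}")
    # Is the data in the right format?
    for name in date_fields:
        value = record.get(name)
        if value in (None, ""):
            continue  # absence is reported by the required-field check
        try:
            datetime.strptime(value, date_format)
        except (TypeError, ValueError):
            problems.append(f"bad format: {name}")
    return problems


def find_outliers(values, low, high):
    """Return the values that fall outside the expected [low, high] range."""
    return [v for v in values if not (low <= v <= high)]
```

For example, a record with an empty ID and a date written as 15/03/2021 would be flagged on both counts, and a weekly-hours value of 400 would be returned by `find_outliers` for an expected range of 0 to 80.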
What goes into the process of giving people access to data?
Babin: We create and use entitlement tools that can grant access to data based on a person's user profile and restrict access to data they don't have rights to.
Shukla: We figure out which zone access is needed for, then isolate within each zone for data, titles, or statistics and give permissions based on use case. Those permissions are embedded in the code base. We can control what people see and can audit history and use.
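As a rough illustration of permissions tied to a zone and a use case, with every request recorded for audit, consider the sketch below. The class names, zone names, and use-case labels are invented for this example and do not describe ADP's entitlement tools.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class UserProfile:
    user_id: str
    # Hypothetical model: a set of (zone, use_case) pairs the user may access.
    entitlements: frozenset


@dataclass
class AccessGate:
    # Every decision, allowed or denied, is appended here for later review.
    audit_log: list = field(default_factory=list)

    def is_allowed(self, user, zone, use_case):
        """Check the user's entitlements and record the decision."""
        allowed = (zone, use_case) in user.entitlements
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user.user_id,
            "zone": zone,
            "use_case": use_case,
            "allowed": allowed,
        })
        return allowed


analyst = UserProfile("u1", frozenset({("aggregated", "benchmarking")}))
gate = AccessGate()
gate.is_allowed(analyst, "aggregated", "benchmarking")  # permitted
gate.is_allowed(analyst, "sensitive", "benchmarking")   # denied, still logged
```

Logging denied requests as well as granted ones is a deliberate choice in this sketch: it keeps a full history of who asked for what, which is what an audit needs.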
Why is data governance an ethics issue?
Babin: Data governance is an ethics issue because we are dealing with people's personal and private information. We have to handle and use personal data correctly in a way that is fair and protects people's privacy.
Shukla: We also have to make sure it's not possible for someone to apply filters and keep zooming down into the datasets to identify any particular person or company.
We regularly review this. We want to make sure the right people are using the data for the right reason.
Babin: There are also ethical issues related to data quality and data handling. We give people insights, predictions, or benchmarks from data. Customers are making decisions using the information we provide. If we're not handling data correctly, whoever uses the information may be making a wrong decision.
Hlavinka: It's really all about securing privacy and ensuring data is used appropriately for each use case and that we are complying with all local privacy laws.
Learn more about ADP's privacy commitment and read our position paper, "ADP: Ethics in Artificial Intelligence," linked from the AI, Data & Ethics blade on the Privacy at ADP page.