What goes into selecting, gathering, cleaning and testing data for machine learning systems.
It's not enough to have a lot of data and some good ideas. The quality, quantity and nature of the data are the foundation for using it effectively.
We asked Marc Rind, VP of Product Development & Chief Data Scientist - ADP Analytics and Big Data, to help us understand what goes into selecting, gathering, cleaning and testing data for machine learning systems.
Q: How do you go from lots of information to usable data in a machine learning system?
Marc: The first thing to figure out is whether you have the information you want to answer the questions or solve the problem you're working on. So, we look at what data we have and figure out what we can do with it. Sometimes, we know right away we need some other data to fill in gaps or provide more context. Other times, we realize that some other data would be useful as we build and test the system. One of the exciting things about machine learning is that it often gives us better questions, which sometimes need new data that we hadn't thought about when we started.
Once you know what data you want to start with, you want it "clean and normalized." This just means that the data is all in a consistent format so it can be combined with other data and analyzed. It's the process where we make sure we have the right data, get rid of irrelevant or corrupt data, confirm that the data is accurate, and check that we can use it with all our other data when the information comes from multiple sources.
A great example is job titles. Every company uses different titles. A "director" could be an entry level position, a senior executive, or something in between. So, we could not compare jobs based on job titles. We had to figure out what each job actually was and where it fit in a standard hierarchy before we could use the data in our system.
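To make the job-title problem concrete, here is a minimal Python sketch of one way such a mapping could work. It is an illustration only, not ADP's method: the company names, titles, level names, and the functions `normalize_key` and `standard_level` are all hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical lookup: (company, raw title) -> level in a shared hierarchy.
# Every company, title, and level name here is invented for illustration.
STANDARD_LEVELS = {
    ("acme", "director"): "entry",
    ("globex", "director"): "senior_executive",
    ("initech", "director"): "middle_management",
}


def normalize_key(company: str, title: str) -> Tuple[str, str]:
    """Lowercase and collapse whitespace so equivalent spellings match."""
    return (" ".join(company.lower().split()), " ".join(title.lower().split()))


def standard_level(company: str, title: str) -> Optional[str]:
    """Return the standard level for a company's title, or None if unmapped."""
    return STANDARD_LEVELS.get(normalize_key(company, title))
```

The point of keying on both company and title is that the same title ("director") resolves to different standard levels depending on where it is used. Unmapped titles return `None` so they can be reviewed rather than guessed.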
Q: This sounds difficult.
Marc: There's a joke that data scientists spend 80 percent of their time cleaning data and the other 20 percent complaining about it.
At ADP, we are fortunate that much of the data we work with is collected in an organized and usable way through our payroll and HR systems, which makes part of the process easier. Every time we change one of our products or build new ones, data compatibility is an important consideration. This allows us to work on the more complex issues, like coming up with a workable taxonomy for jobs with different titles.
But getting the data right is foundational to everything that happens, so it's effort well spent.
Q: If you are working with HR and payroll data, doesn't it have a lot of personal information about people? How do you handle privacy and confidentiality issues?
Marc: We are extremely sensitive to people's privacy and go to great lengths to protect both the security of the data we have and people's personal information.
With machine learning, we are looking for patterns, connections, matches and correlations, so we don't need personally identifying data about individuals. We anonymize the information, then label and organize it by categories such as job, level in hierarchy, location, industry, size of organization, and tenure. This is sometimes called "chunking." For example, instead of keeping track of exact salaries, we combine them into salary ranges. This both makes the information easier to sort and protects people's privacy.
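As a rough illustration of replacing exact salaries with ranges, the Python sketch below buckets a salary into a fixed-width band and keeps only categorical fields. The band width of 25,000, the field names, and the functions `salary_band` and `anonymize` are assumptions made for this example; the interview does not say how the ranges are actually defined.

```python
def salary_band(salary: float, width: int = 25_000) -> str:
    """Map an exact salary to a fixed-width range label (width is an assumption)."""
    if salary < 0:
        raise ValueError("salary must be non-negative")
    low = int(salary // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(record: dict) -> dict:
    """Keep only categorical fields; drop the name and the exact salary."""
    return {
        "job": record["job"],
        "location": record["location"],
        "salary_band": salary_band(record["salary"]),
    }
```

For example, a record with a salary of 61,250 comes out with the band `"50000-74999"` and no name field, so records can be grouped and compared without carrying the identifying details.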
With benchmarking analytics, if any data set is too small to make anonymous, meaning it would be too easy to figure out who the individuals in it are, then we don't include that data in the benchmark analysis.
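The rule of leaving out groups that are too small can be sketched in Python as a minimum-group-size filter. The threshold of 5, the grouping fields, and the function name `suppress_small_groups` are assumptions for illustration; the interview does not state the actual cutoff or how groups are formed.

```python
from collections import Counter
from typing import Dict, List, Sequence


def suppress_small_groups(
    records: List[Dict], keys: Sequence[str], min_size: int = 5
) -> List[Dict]:
    """Drop records whose group (defined by `keys`) has fewer than min_size members."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if counts[tuple(r[k] for k in keys)] >= min_size]
```

With five analysts in one location and two directors in another, grouping by job and location keeps the five analyst records and drops the two director records, since a group of two would be too easy to trace back to individuals.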
Q: Once you have your initial data set, how do you know when you need or want more?
Marc: The essence of machine learning is more data.
We want to be able to see what is happening over time, what is changing, and be able to adjust our systems based on this fresh flow of data. As people use the programs, we are also able to validate or correct information. For example, with our jobs information, users tell us how the positions in their organization fit into our categories. This makes the program useful to them, and makes the overall database more accurate.
As people use machine learning systems, they create new data which the system learns from and adjusts to. It allows us to detect changes, see cycles over time, and come up with new questions and applications. Sometimes we decide we need to add a new category of information or ask the system to process the information a different way.
These are the things that both keep me up at night and make it exciting to show up at work every day.