Why data management is urgent for your organization
Optimal data protection or a high-quality end product: many companies struggle with this dilemma. Access to data during the exploration phase of AI-related projects is becoming a central issue in more and more organizations. On the one hand, full access to data that represent reality is important to develop an optimal product, process or application. On the other hand, privacy regulations are often an obstacle in this process. Data scientist Kilian Toelge and computer science engineer Sudhanya Mallick explain why the right balance is difficult to find, and why optimizing your data management process is a necessary first step.
Are 'real data' necessary for development?
Data are needed to create and test new products and services related to machine learning (ML) that optimally suit the customer. With classic software products, the structure of the data is particularly important, so the most obvious solution is the use of mock data. These mock data do not involve privacy risks and do not require lengthy contracts to ensure security. When developing ML software, however, the content and amount of data are also important. “In principle, this applies to all software development, but it is crucial if you want to bring machine learning into a product or into customer development,” says Toelge. “You then need to train the underlying AI model and make the right decisions for the model architecture based on the available data. You want these data to be representative. They must reflect reality, otherwise the results of your analysis will be useless and you will make wrong choices. If you work with mock data that are incomplete or do not contain trends, you build a model with an architecture that ultimately does not work as well or at all in production.”
Many companies use separate development, testing and production environments. With machine learning, it is important to link these environments with each other, certainly in terms of the data available in them, because during development and testing you need access to production data or at least production-grade data.
Security and privacy
Risk-averse strategies are common within IT departments. The fewer people who have access, the lower the risk of breaches. Ransomware is constantly lurking and phishing methods are getting smarter, so the emphasis on security is unsurprising. “What we see with many customers is that you run into security in every possible way,” Toelge continues. “You are often not allowed to use production data and there is no automated and secure process for creating production-grade data. It doesn’t even have to concern privacy-sensitive personal data. It can also be data that are sensitive to the company and contain certain strategic choices or financial data. So, there is always a trade-off to be made between where data security begins and where a developer's freedom to be able to make the right choices begins for an application you wish to implement.”
“Within companies, we see that developers often use work-arounds to gain access to production data during the development phase,” Mallick adds. “These work-arounds are not desirable from a security perspective. You can avoid them by having a well-considered and established data management strategy that takes into account the needs of data scientists and ML engineers.”
How production or production-grade data can still be used
Working with production and/or production-grade data has many advantages: not only does it produce a higher quality product, it also saves the time spent creating mock data and is therefore cheaper. The security risks are high, however. To mitigate or even avoid these risks, there are two common solutions.
It is fairly standard to give development teams access only to the subsets or tables of company data that they need. By doing so, depending on the use case, you can ensure that teams avoid personal data and company-sensitive data completely.
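As a minimal sketch of this first approach, the example below assumes a hypothetical per-team allow-list of columns. The team names, column names, sample records and the `project_for_team` helper are all illustrative assumptions and do not come from the interview or from any specific product.

```python
# Hypothetical allow-list: which columns each team may see (illustrative names).
ALLOWED_COLUMNS = {
    "churn_model_team": {"customer_id", "tenure_months", "monthly_spend"},
    "logistics_team": {"order_id", "postcode_region", "delivery_days"},
}


def project_for_team(rows, team):
    """Return copies of the rows that keep only the columns the team may see."""
    try:
        allowed = ALLOWED_COLUMNS[team]
    except KeyError:
        # Deny by default: a team without an entry gets no data at all.
        raise PermissionError(f"no data access configured for team {team!r}") from None
    return [{key: value for key, value in row.items() if key in allowed} for row in rows]


# Made-up records standing in for a production table.
production_rows = [
    {"customer_id": 1, "name": "A. Jansen", "iban": "NL00BANK0123456789",
     "tenure_months": 14, "monthly_spend": 42.5},
    {"customer_id": 2, "name": "B. de Vries", "iban": "NL00BANK0987654321",
     "tenure_months": 3, "monthly_spend": 19.0},
]

dev_rows = project_for_team(production_rows, "churn_model_team")
```

In practice this kind of restriction is usually enforced by the database or data platform itself, for example through views or role-based permissions, rather than in application code. The sketch only illustrates the deny-by-default principle.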
For other teams that do need access to sensitive data, it is important to have an automated process that transforms the data into production-grade data. “For starters, you should remove or mask sensitive data, such as financial data or trade secrets, before copying production data to the development environment,” Toelge resumes. “When it comes to personal data, you can, for example, shuffle first and last names, bank account numbers, and address information. You can also use software programs to blur or anonymize sensitive data so that they cannot be traced back to the person in question.” Where ML models are concerned, it is important that the statistical structure of the data is not changed.
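The shuffling that Toelge mentions can be sketched as follows. This is a minimal illustration under stated assumptions, not a complete anonymization method: the `shuffle_columns` helper, the column names and the sample records are hypothetical. Each sensitive column is permuted independently, so every value still occurs exactly as often as before, but it no longer sits next to the person it belonged to.

```python
import random


def shuffle_columns(rows, sensitive_columns, seed=None):
    """Return copies of the rows with each sensitive column permuted independently.

    The set of values in each column is unchanged (same per-column distribution),
    but the link between a value and its original row is broken.
    """
    rng = random.Random(seed)
    masked = [dict(row) for row in rows]  # copy, so the input stays untouched
    for column in sensitive_columns:
        values = [row[column] for row in rows]
        rng.shuffle(values)
        for row, value in zip(masked, values):
            row[column] = value
    return masked


# Made-up records standing in for a production table.
production_rows = [
    {"first_name": "Anna", "last_name": "Jansen", "tenure_months": 14, "monthly_spend": 42.5},
    {"first_name": "Bram", "last_name": "de Vries", "tenure_months": 3, "monthly_spend": 19.0},
    {"first_name": "Chris", "last_name": "Bakker", "tenure_months": 27, "monthly_spend": 63.2},
    {"first_name": "Dewi", "last_name": "Visser", "tenure_months": 8, "monthly_spend": 25.9},
]

dev_rows = shuffle_columns(production_rows, ["first_name", "last_name"], seed=7)
```

The numeric columns a model would train on are left as they are, so their values and the relationships between them are preserved. Shuffling does break any relationship between a shuffled column and the rest of the row, so it suits identifiers rather than model features. In a small table a value can also land back on its original row, which is one reason shuffling alone should not be treated as full anonymization.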
What can I do in my organization?
More and more companies are aware of the challenges and risks involved in machine learning. This brings new difficulties: Machine learning is still so new to many organizations that employees do not yet have sufficient knowledge and the technology is still lagging behind. “Google first used a deep learning model in their Google Translate service in 2016,” Toelge explains. “The translations of whole texts have become much better as a result. The theory behind deep learning has existed for at least fifty years but we have only had the computing power needed for a few years. If a company like Google has only been able to implement this for six years, it is not surprising that medium-sized companies in the Netherlands are not yet at the same level of maturity in terms of knowledge and skills in ML and AI.”
An additional problem is the scarcity of data management expertise. “Organizations need to become aware of the huge role data play throughout the business,” says Mallick. “It is important for companies to scale up data management, approach data in the same way as any other asset in the company, and see how the business can capitalize on this.”
Collaboration between IT and the business
Mallick therefore advocates close collaboration between business and IT. “The two disciplines often do not know enough about what the other is doing and how. This requires a mindset shift. IT needs knowledge of the business to turn raw data into data that can really be used within all environments of a development process. Moreover, departments such as Marketing, Sales and HR use data in a slightly different way or even use different data. That data management process must be streamlined within all departments of the organization and, for multinationals, even across countries. By introducing a standardized data management process and creating awareness and support for it within your organization, you ensure reliable data. Different departments can then make decisions and choices, and develop services based on the same data.”
Streamline the data management process from the inside out
“IT companies make a lot of money from masking data or introducing and scaling up a data management process,” Mallick continues. “And this will continue to be the case for some time because the expertise of data engineers and data consultants is becoming increasingly scarce. Precisely because of this shortage in the labor market, it is wise for companies to use their own employees, for example by upskilling and reskilling them and training them to become data professionals.” Toelge adds: “At Capgemini Academy, we offer training and client-tailored learning programs in the fields of machine learning and data science as well as data privacy and security.”
“Capgemini has been advising organizations at various stages of their data management process for more than 45 years,” Mallick concludes. “We do this by deploying consultants and providing insight into the learning and development needs of their employees. We focus not only on digital 'hard' skills, but also on personal and leadership skills. In this way, we train data specialists who are not only technically proficient, but who are also able to talk to the customer, provide feedback, and present their findings. Together with our clients, we then develop the most appropriate learning and development solutions. These can be training courses, but also other forms of learning, such as serious gaming or hackathons. In this way, you make optimal use of the knowledge and experience of existing employees to improve your data management process. And you don't need to look for new people in this very tight labor market.”