Success metrics must be well defined

Data Engineering & Machine Learning Ops: What is data science without engineering?

By Dr Susanne Beckers, Head of Infused Intelligence, SAP

Imagine that you want to build a house and that you come up with a beautiful design. The design is very functional and pleasing for the people who will live in this house. You have discussed every aspect with them: what they need and expect, and your design faithfully reflects all their needs. Then you want to start building your house: you need a lot of bricks, mortar, doors, windows. You need electricity – wait, you actually need the foundation first. Because you do not know any companies that construct good foundations, you take on the job yourself and pour the concrete, even though you don’t have the necessary expertise. It looks a bit crumbly and imperfect, but you think it will work.

So you carry on with a lot of do-it-yourself work: customized, but not perfect. Nonetheless, you think it’s good enough. Finally, you are wallpapering and painting. This looks awesome and – more importantly – it covers all the higgledy-piggledy construction below. The people moving in are happy with their new home, as it is apparently all that they dreamed of, and it works for them exactly as discussed.

But let us be realistic: a great construction engineer would have simplified and sped up the building process. An engineer would have known how to build a proper foundation, where to get plenty of prefabricated material, and in general how to build a long-lasting house, exactly according to the specifications.

Did you recognize the Data Scientist? Did you recognize the Engineer?

A Data Scientist needs to understand the business problem to design a solution that will work for the business user. The Data Scientist has expertise in mathematics, statistics, algorithms and ML techniques – the design, the functionality, the usability of the house. ‘Making sense out of data’ and creating the famous ‘insights to actions’ are among the many elements that make up a Data Scientist’s job.

But how does the Data Scientist get the data in the first place? Many Data Scientists do the job of a Data Engineer by creating the infrastructure to get the data, followed by warehousing the data and modeling the data, e.g. creating new tables, joins or partitions in SAP HANA. All of these are actually jobs for the Data Engineer.
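To make "creating new tables and joins" a little more concrete, here is a minimal sketch of the kind of modeling step meant above. It is an illustration only: Python's built-in SQLite stands in for SAP HANA, and the tables `customers`, `orders` and `order_features`, their columns, and the sample rows are all hypothetical.

```python
import sqlite3


def build_training_table(conn):
    """Create two source tables and join them into one feature table.

    SQLite is only a stand-in here; table names, column names and the
    sample rows are hypothetical and not taken from any real system.
    """
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount REAL);

        INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APJ');
        INSERT INTO orders VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0);

        -- The join: one row per customer, ready to be used as model input.
        CREATE TABLE order_features AS
            SELECT c.customer_id,
                   c.region,
                   COUNT(o.order_id) AS n_orders,
                   SUM(o.amount)     AS total_amount
            FROM customers AS c
            JOIN orders AS o ON o.customer_id = c.customer_id
            GROUP BY c.customer_id, c.region;
        """
    )
    query = (
        "SELECT customer_id, region, n_orders, total_amount "
        "FROM order_features ORDER BY customer_id"
    )
    return conn.execute(query).fetchall()
```

Calling `build_training_table(sqlite3.connect(":memory:"))` returns one aggregated row per customer. On a real platform the same idea applies, but the connection, the SQL dialect and the partitioning options would differ.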

Especially with SAP’s 4+1 Strategy, relevant data for great machine learning models can be distributed across different platforms, including hyperscalers. Thus, the engineers need skills like SQL, Spark, Hadoop and Kafka, as well as knowledge of cloud technologies on SAP HANA Cloud, AWS, GCP, Azure, and Alibaba Cloud. Tools like SAP Data Intelligence (SAP DI) can help the Data Engineer to design, manage and optimize the data flow.

Once the necessary data tables – the house’s foundation, in our opening analogy – are built, the Data Scientist can begin the job: analyzing the data, establishing correlations and patterns; cleaning the data, replacing or removing empty and unreasonable values; analyzing it again. Next comes a chat with the homeowners – eh, stakeholders – to discuss the analysis and potentially find another relevant data attribute, a potential feature for the ML model. Then, the Data Scientist uses part of the data for model training and part of it for testing. These tests are necessary to improve the model. The model results need to be discussed with the stakeholders, and success metrics need to be defined that determine whether the model results are good enough. Once the stakeholders are satisfied with the model’s results, the model can be deployed.
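The loop described above (clean, split, train, test, compare against a success metric) can be sketched in a few lines. This is a toy under stated assumptions, not a recommended modeling approach: the rows with `size` and `price` fields, the straight-line model, the use of mean absolute error as the metric, and the threshold are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def clean(rows):
    """Remove rows with empty (None) or unreasonable (non-positive) values."""
    return [
        r for r in rows
        if r["size"] is not None and r["price"] is not None
        and r["size"] > 0 and r["price"] > 0
    ]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold back a fraction of the rows for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Ordinary least squares for price = slope * size + intercept."""
    xs = [r["size"] for r in train]
    ys = [r["price"] for r in train]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def mean_absolute_error(model, rows):
    """Average absolute gap between predicted and actual price."""
    slope, intercept = model
    return mean([abs(slope * r["size"] + intercept - r["price"]) for r in rows])


def meets_success_metric(error, threshold):
    """The threshold is what gets agreed with the stakeholders up front."""
    return error <= threshold
```

The point of `meets_success_metric` is that "good enough" becomes an explicit number agreed with the stakeholders, measured on the held-back test rows, before anyone talks about deployment.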

For the deployment, e.g. to SAP’s AI Core, our MLOps Engineers come into play. Instead of deploying tediously in a non-perfect landscape – in our metaphor that would be transporting a lot of bricks manually to the construction site – the MLOps Engineers bring whole (Docker) containers smoothly to the construction site and efficiently build the house. That means they enable model serving, the testing and execution of pipelines, and model lifecycle management.

Specifically, the handling of the model lifecycle is what differentiates ML development and maintenance from standard software development and maintenance. Not only do the ML models have to adjust and scale to the amount of incoming data, as in classic software development, but they must also adjust to the data content and composition in order to perform reliably.

In terms of our house, consider the consumables, e.g. water, electricity and air. If you build the pipe(line)s correctly, they can handle different amounts of water, but the salinity and the pH value of the water might change. This means regular checks on the water quality are necessary. The Data Scientist can do this manually; the MLOps Engineer can help to install automated water quality testing and automatic injection of basic or acidic solutions to neutralize the water’s pH value. In other words, model monitoring and automated retraining pipelines are best built by MLOps Engineers.
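Staying with the water analogy, an automated quality check can be sketched as follows. This is a deliberately simple illustration, assuming that drift in a single numeric feature can be summarized by how far the mean of a new batch has moved from a reference sample; real monitoring setups typically use more robust statistical tests. The function names, the `retrain` callback and the threshold of three standard deviations are hypothetical.

```python
from statistics import mean, stdev


def drift_score(reference, batch):
    """Shift of the batch mean from the reference mean,
    measured in reference standard deviations.

    `reference` needs at least two values so that a spread can be computed.
    """
    spread = stdev(reference)
    shift = abs(mean(batch) - mean(reference))
    if spread == 0:
        # A constant reference: any shift at all counts as maximal drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def monitor(reference, batch, retrain, threshold=3.0):
    """Check one incoming batch and trigger retraining if it has drifted.

    `retrain` is a hypothetical callback, e.g. the entry point of an
    automated retraining pipeline. Returns True if it was triggered.
    """
    if drift_score(reference, batch) > threshold:
        retrain(batch)
        return True
    return False
```

With pH readings around 7 as the reference, a batch around 7 passes quietly, while a batch around 8 triggers the callback. The "injection of basic or acidic solutions" from the analogy corresponds to whatever `retrain` does with the new data.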

Finally, the real power of ML lies in automated retraining and deep learning, where the machine itself decides and continuously improves. The steps to get there are very nicely depicted in the Data Science Pyramid of Needs. The activities ‘Collect’ and ‘Move/Store’ are the foundation, ideally provided by the Data Engineer. The Data Scientist explores and transforms the data, based on the process insights provided by the business stakeholders. During the ‘Learn/Optimize’ phase, the Data Scientist experiments with different ML models to evaluate the performance. Once accepted by the business stakeholders, the model gets deployed via the deployment pipeline by the MLOps Engineer. Ideally, the model is A/B tested with real data to verify the value impact.
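One common way to read out such an A/B test is a two-proportion z-test on a success rate, for example the share of cases handled correctly with the old process (A) versus with the new model (B). The sketch below assumes exactly that setup; the function names and the choice of a two-sided test are illustrative assumptions, not something the pyramid prescribes.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare the success rates of variants A and B.

    Returns (z, p_value) for a two-sided test using the pooled
    standard error. A positive z means B has the higher rate.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if std_error == 0:
        raise ValueError("success rates of 0% or 100% in both groups cannot be tested")
    z = (rate_b - rate_a) / std_error
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value
```

A small p-value suggests the difference between the two variants is unlikely to be chance alone. Whether the difference is large enough to matter for the business is still a question for the stakeholders and their success metrics.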

If regular ML models are not sufficient to solve the problem and huge amounts of data are available, Deep Learning is another approach that can be tried. However, Deep Learning or not, the models need continuous monitoring and potentially (automated) retraining to get to the tip of the performance iceberg. How do we get there? Only if Data Engineers, Data Scientists, and MLOps Engineers work hand in hand to build the house of our wildest ML dreams.
