Data pipelines need love too

Hitachi has aligned its data analytics divisions and fused them with its Pentaho acquisition to create a new entity called Hitachi Vantara. So… Vantara… kind of sounds like ‘advantage’ with a bit of Latin ‘avanti’ in there for good measure, right?

Branding shenanigans aside, Hitachi Vantara (more usually pronounced in an American accent as Hitachi Ven-tera) has continued to roll out products aligned for that role which we can now quite comfortably define as the ‘data developer’.

These data developers (call them data management and data science focused software engineering professionals with an appreciation for the need to apply analytics and machine learning technologies to database and application structures… if you must – but it’s not as catchy) use machine learning (ML) functions, obviously.

Orchestration situation

As ML becomes the order of the day, these same data devs will also (arguably) need an increasing degree of orchestration functions with which to corral and manage the models they seek to build, execute and apply – this is what Hitachi Ven-tera (sorry, Vantara) is now rolling out.

The company now offers machine learning orchestration to help data professionals monitor, test, retrain and redeploy supervised models in production.

Emanating from its Hitachi Vantara Labs, the offering is described as ‘machine learning model management’ and these tools can be used in a data pipeline built in Pentaho.

Once an algorithm-rich ML model is in production, it must be monitored, tested and retrained continually in response to changing conditions, then redeployed. This work involves manual effort and, consequently, is often done infrequently. When this happens, prediction accuracy will deteriorate and impact the profitability of data-driven businesses.
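To make that monitor, test, retrain and redeploy cycle a little more concrete, here is a minimal sketch in plain Python. It is not Hitachi Vantara's or Pentaho's API; every name in it (`train_threshold_model`, `score`, `monitor_and_retrain`, the `min_accuracy` cut-off) is hypothetical, and the 'model' is a toy one-feature threshold classifier chosen only so the example runs without any libraries.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration: a model maps one feature to a 0/1 label,
# and a batch is a list of (feature, true_label) pairs.
Model = Callable[[float], int]
Batch = List[Tuple[float, int]]


def train_threshold_model(batch: Batch) -> Model:
    """Toy 'training': place a threshold midway between the two class means.

    Assumes the batch contains at least one example of each class.
    """
    zeros = [x for x, y in batch if y == 0]
    ones = [x for x, y in batch if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > threshold)


def score(model: Model, batch: Batch) -> float:
    """Test step: fraction of the batch the model labels correctly."""
    return sum(model(x) == y for x, y in batch) / len(batch)


def monitor_and_retrain(
    model: Model, batch: Batch, min_accuracy: float = 0.8
) -> Tuple[Model, bool]:
    """Monitor step: keep the model if it still scores well, else retrain.

    Returns the model to (re)deploy and a flag saying whether it was retrained.
    """
    if score(model, batch) >= min_accuracy:
        return model, False
    return train_threshold_model(batch), True


# Usage: train on early data, then feed in a batch whose values have drifted.
initial = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
drifted = [(11.0, 0), (12.0, 0), (18.0, 1), (19.0, 1)]

model = train_threshold_model(initial)
model, retrained = monitor_and_retrain(model, drifted)
```

The point of automating the check is the one the article makes: if the cycle depends on someone remembering to run it by hand, it tends to happen too rarely.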

Pipeline wear & tear

Hitachi Vantara explains that once a machine learning model is in production, its accuracy typically degrades as new production data runs through it. To avoid this, the company provides a new range of evaluation statistics that helps to identify degraded models.
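The company does not spell out which evaluation statistics it uses, so the following is only an assumed illustration of the general idea: track accuracy over a sliding window of recent predictions and flag the model when it falls too far below the accuracy measured at deployment. The class name `RollingAccuracy` and its `window`, `baseline` and `tolerance` parameters are invented for this sketch.

```python
from collections import deque


class RollingAccuracy:
    """Tracks accuracy over the most recent `window` predictions (hypothetical helper)."""

    def __init__(self, window: int, baseline: float, tolerance: float = 0.05):
        # deque(maxlen=...) silently discards the oldest outcome once full.
        self.outcomes = deque(maxlen=window)
        self.baseline = baseline    # accuracy measured when the model was deployed
        self.tolerance = tolerance  # how far below baseline we accept

    def record(self, predicted: int, actual: int) -> None:
        """Store whether one production prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        """Accuracy over the current window."""
        if not self.outcomes:
            raise ValueError("no predictions recorded yet")
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self) -> bool:
        """True when windowed accuracy is more than `tolerance` below baseline."""
        return self.baseline - self.accuracy() > self.tolerance


# Usage: four correct predictions, then two misses push the window to 50%.
monitor = RollingAccuracy(window=4, baseline=0.9)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
healthy = monitor.degraded()

for predicted, actual in [(1, 0), (0, 1)]:
    monitor.record(predicted, actual)
unhealthy = monitor.degraded()
```

A fixed window keeps the statistic responsive to recent data, which is the data that matters when production conditions are shifting.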

More organisations are demanding visibility into how algorithms make decisions. Lack of transparency often leads to poor collaboration in groups deploying and maintaining models, including operations teams, data scientists, data engineers, developers and application architects.

“These new capabilities from Hitachi Vantara promote collaboration, providing data lineage of model steps, and visibility of data sources and features that feed the model,” said the company, in a press statement.

Love your pipeline

Building out the ML-enriched ‘data pipeline’ appears to be a surprisingly non-sequential process, in that we can build our pipe and lay it down, but we will then need to go back and look for leaks and other areas of weakness where the structure of the pipe itself may have become compromised as a result of the content we put through it.