Confluent: Shifting the paradigm to (real-time) data engineering

As part of the Computer Weekly Developer Network series on data engineering, we have reshaped our story headline format on this occasion.

To put a real-time data streaming platform company in the data engineering category, when it should arguably be lodged in the real-time data engineering category, would be remiss and, well, altogether batch, wouldn’t it?

As such, this is a guest post written by Peter Pugh-Jones, in his capacity as director of strategic accounts at Confluent. Understanding how the cadence and throughput of real-time data impact modern software application development teams is what Pugh-Jones and his division devote themselves to, so how should we consider data (sorry, real-time data) in the new information engineering landscape?

Pugh-Jones writes in full as follows…

It shouldn’t come as a surprise to us that most data infrastructure experts are builders above all else. They focus on design and construction, solving problems to enable complex flows of events and data.

When organisations outsource that infrastructure management, their pressing need for in-house support fades away. With the host of challenges that come with managing a tech stack suddenly solved (or simply off their plate), their focus shifts.

Now, they need to become specialists in interacting with the data itself.

Classic engineering (aka mudwrestling)

For those who still use traditional data structures, this is the realm of the classic data engineer. Traditionally, this role involved a lot of old-fashioned mudwrestling with data lakes, databases and data models – wrangling and translating swathes of data before compiling reports from the output.

In a world where automation and AI are increasingly outpacing that approach, however, many organisations are going through something of an enlightenment in terms of how they see data… and how they treat it as a result.

Let’s shift left

Confluent’s Pugh-Jones: Get out of the (classic) data mudwrestling era and clean your act up for real-time streaming.

As more intelligent and efficient data processing models evolve, a lot of the admin at the end of the funnel can now be automated within the flow of data itself. Rather than waiting for it to stop, you can interact with data in motion, moving your processes further up the analysis stream to save you time and headspace. As we know, the IT industry calls this concept ‘shift left’, literally meaning that we shift tasks further up (leftward) the data processing pipeline.
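To make that concrete, here is a minimal sketch using the open source Kafka Streams API, in which a cleanup step that would traditionally run as a repeated batch job over a data lake instead runs continuously, on data in motion. The topic names (orders-raw, orders-clean) and the trim-and-lowercase rule are illustrative assumptions rather than a prescribed design.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ShiftLeftCleaner {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-cleaner");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read raw events as they arrive, rather than after they land in a lake.
        KStream<String, String> raw = builder.stream("orders-raw");

        raw
            // Quality check applied in motion: drop malformed or empty records.
            .filter((key, value) -> value != null && !value.isBlank())
            // Cleanup a batch job would otherwise have to repeat downstream.
            .mapValues(value -> value.trim().toLowerCase())
            // Publish a cleaned stream that every downstream consumer can reuse.
            .to("orders-clean");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

The cleanup now happens once, upstream; every downstream consumer reads the already cleaned topic instead of repeating the work on data at rest.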

The benefits of this are threefold:

  • Data producers can clean and apply quality checks upstream with data contracts and data quality rules (see the sketch after this list).
  • Data engineers can curate, enrich and transform streams on the fly, creating reusable data products.
  • Businesses can introduce more ‘self-serve’ data discovery across multiple teams to maximise data use, value and adoption.
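As a hedged illustration of the first of those points, the sketch below shows a data contract enforced at produce time. It assumes Confluent’s Schema Registry and Avro serialiser are in play; the Order schema, the orders topic and the registry address are illustrative assumptions, not a prescribed setup.

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ContractedProducer {
    // The data contract: consumers can rely on every event having these fields.
    private static final Schema CONTRACT = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
        + "{\"name\":\"order_id\",\"type\":\"string\"},"
        + "{\"name\":\"amount\",\"type\":\"double\"}]}");

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serialiser checks records against the registry.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // Fail fast instead of silently registering an incompatible schema.
        props.put("auto.register.schemas", false);
        props.put("use.latest.version", true);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            GenericRecord order = new GenericData.Record(CONTRACT);
            order.put("order_id", "o-1001");
            order.put("amount", 42.50);
            // A record that breaks the contract throws here, upstream,
            // long before any downstream batch job has to mop it up.
            producer.send(new ProducerRecord<>("orders", "o-1001", order));
        }
    }
}

With auto-registration of schemas disabled, a producer whose records don’t match the registered contract fails immediately, so bad data is stopped at the leftmost point of the pipeline rather than discovered later in a downstream report.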

If you’re an engineer who works on data at rest in a data lake and you’re performing the same task multiple times to make that data behave itself, it’s very possible that you can automate that task entirely, further up the stream.

This is one of the factors driving an uptick in the adoption of data streaming across almost every sector, from finance to retail to manufacturing. In fact, more than three in four enterprise leaders (77%) say they’re investing in data streaming to aid their decision-making this year, while 96% say data streaming is set to become an important part of their businesses.

But for all its benefits, tackling data in motion also means investing in new skills. As the nature of data evolves, and more processing and analysis is conducted in real-time, the technical skills required to take advantage of it must also evolve.

On the stream

Enter a new flavour of data engineer: the data streaming engineer.

While data streaming engineers won’t necessarily sacrifice the technical skills of their predecessors, they now also need to understand data challenges in real time and within the context of their businesses.

Technical proficiency with the data streaming pipeline needs to be framed by real-time questions:

  • “What does my business need right now?”
  • “What relevant existing data is already in the stream?”
  • “What current processes can I take advantage of to enhance our insights?”

Ultimately this is about a mindset shift. Engineers can’t think or plan in batches anymore. Real-time streaming requires us to adopt more real-time thinking.

As data streaming becomes the norm, these skills will evolve from a novel requirement to part of how we consider data and related infrastructure in the first place.

In fact, we’re already starting to see this demand reflected in job specs and certifications. Take Microsoft Azure’s Data Engineering Associate certification as just one example. Recently, the requirements to build and maintain “compliant data processing pipelines” and to consolidate “data from streaming systems” were included as standard. Being a data streaming engineer is rapidly becoming a requirement.

This is all part of the new paradigm shift. We’ve had the mainframe era and the client-server era. Now we’re reaching the era of continuous data streaming. Just like every other shift in the computing industry, a new job role is emerging in response.

Welcome to the age of the data streaming engineer.