Bridging the Gap: Software vs. Data Engineering
Software vs. data engineering - a chasm that has existed for years. Are we closer to bridging the gap, and is Databricks the catalyst?
If you know me personally, you know I’ve spent the past ~10 years of my career wearing several different hats: DevOps, Software Engineering, Data Engineering, Data Architecture, managing hybrid data + software engineering teams... The point was never to deliberately sample different domains, but as I followed my passions and the organic needs of employers & industry, I found that I had accidentally taken an unusual trajectory, one that lets me love software & data engineering in a unified way that most only dream of.
Today I was moved to write this article about this topic -- software vs. data engineering -- the chasm that has existed between the two for several years, and whether we are closer to bridging the gap.
Brief Background
What makes a “Data” engineer versus a “Software” engineer? Are they different skill-sets and job requirements, and do we really need both?
The real answer is slightly different from company to company, but in general the two do tend to differ and most modern organizations I’ve worked for or with admit you absolutely need both.
Here’s a high-level Venn diagram of concepts (note: your team may have already combined some of these into the middle “overlap” or not):
In their 2019 paper [1] published in the Journal of Systems and Software, Monika Solanki et al. argue that despite the rise of data-intensive apps, team structures remain siloed, stating that “our techniques for building such systems are still fragmented into disparate and un-aligned engineering processes, tasks or teams.”
So in fact it’s clearly not a new topic at all. But IMHO it is something that many organizations either overlook or choose to ignore.
So who cares? What’s the big deal, and why can’t we just let the two disciplines figure themselves out organically, or simply swim in their own lanes in the real world? Here are the consequences of ignoring it:
Application software inevitably needs to access some curated data from DE -- be it reports, analytics to surface to end users, or maybe a ML model’s batch inference.
Likewise, DE needs the data being produced by our Applications -- message buses galore here.
Teams don’t agree on data interchange: formats, storage location, data model -- don’t even get me started on what happens when the app team adds a new column or changes from epoch seconds to epoch milliseconds without telling the DE team!
As a result, pipelines fail, end users see bad quality data (or no data at all), and the business doesn’t have a real darn data-driven clue about how the users/product are actually behaving.
Team silos expand beyond the technical -- they create personal silos and an “us vs. them” mindset instead of “we” -- degrading team morale and slowing down business velocity.
There are two logical conclusions on where to go from here:
1. Align the two practices in a way that allows each to get/give what the other needs in perfect harmony (Monika’s paper outlines ontologies for this approach)
2. Merge the two, either through technology, staffing, or both -- insane, I know, right?
Honestly sometimes we need 1, sometimes we need 2. Ontologies, data contracts, and good leadership are all things that can help drive Option 1 to success. But today I want to talk about Option 2 and technology options that enable it more than ever.
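To make “data contract” a little less abstract, here is a minimal sketch in Python of the kind of check that would catch the surprise-column and changed-type problems described above. Everything in it is hypothetical and invented for illustration: the contract itself, the field names (order_id, user_id, created_at_ms), and the contract_violations helper. A real team would more likely lean on a schema registry or a validation library than on hand-rolled code like this.

```python
# Hypothetical contract for an "order created" event. Field names and types
# are illustrative assumptions, not taken from any real system.
ORDER_CONTRACT = {
    "order_id": str,
    "user_id": int,
    "created_at_ms": int,  # epoch milliseconds, agreed on by both teams
}


def contract_violations(record: dict, contract: dict = ORDER_CONTRACT) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif type(record[field]) is not expected:
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields the consumer was never told about (e.g. a silently added column).
    for field in sorted(record.keys() - contract.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


if __name__ == "__main__":
    good = {"order_id": "o-1", "user_id": 42, "created_at_ms": 1700000000000}
    bad = {"order_id": "o-2", "user_id": "42", "coupon": "SAVE10"}
    print(contract_violations(good))
    print(contract_violations(bad))
```

Running a check like this on the producer side (in the app team’s CI, for instance) is one way to surface a breaking change before it ever reaches the DE team’s pipeline.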
Databases - the great divide
The greatest divide in my experience has always been the database. I’m talking OLTP MySQL and Postgres databases that power the most critical applications in the company – our REST APIs or gRPC microservices that generate all the data.
This tends to be a natural infra divider because the obvious primary purpose of the database is to store data written by the applications owned by our Software Engineers; however, Data Engineers need this data too, so they can make it available for BI reports, machine learning, and sometimes even data products or partner sharing.
SWEs are concerned with connectivity to the database, maybe connection pooling with tools like PgBouncer, and horizontal scaling. They are also concerned with performance, using indexes and caching techniques to keep latency down, as well as avoiding breaking changes to their application, though that’s often where the line is drawn.
DEs are concerned with data quality, schema evolution, and data governance. They care about performance as it relates to data ingestion and whether it’s incremental CDC or snapshot-based JDBC. Sometimes the DEs need to write back as well, publishing enriched analytics and ML outputs to the OLTP database so that the SWEs can consume them with ultra-low latency in more API endpoints.
All of this is usually held together with separate tools, like Debezium and Kafka for CDC, or maybe AWS DMS and S3 at smaller scales, all of which are extremely fragile -- it’s not a matter of if they will break, but when.
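The difference between the two ingestion styles mentioned above is easier to see in code. Below is a minimal sketch in Python, with tables modeled as plain dicts keyed by primary key. The event shape (op, key, row) and both function names are simplified assumptions for illustration; this is not the actual Debezium or AWS DMS event format.

```python
def snapshot_load(source_table: dict) -> dict:
    """Snapshot-style ingestion: re-copy the whole table on every run."""
    return {key: dict(row) for key, row in source_table.items()}


def apply_cdc(replica: dict, events: list[dict]) -> dict:
    """Incremental CDC: apply only the ordered changes since the last run."""
    for event in events:
        if event["op"] in ("insert", "update"):
            replica[event["key"]] = event["row"]
        elif event["op"] == "delete":
            replica.pop(event["key"], None)
        else:
            raise ValueError(f"unknown op: {event['op']}")
    return replica


if __name__ == "__main__":
    source = {1: {"name": "ada"}, 2: {"name": "bob"}}
    replica = snapshot_load(source)  # initial full copy

    # Hypothetical change events captured after the snapshot was taken.
    changes = [
        {"op": "update", "key": 1, "row": {"name": "ada l."}},
        {"op": "delete", "key": 2},
        {"op": "insert", "key": 3, "row": {"name": "cy"}},
    ]
    print(apply_cdc(replica, changes))
```

The snapshot path is simple but re-reads everything; the CDC path moves far less data but depends on every event arriving, in order, in a shape the consumer understands, which is where the fragility tends to come from.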
Ingesting Database changes to the Lakehouse
Last year, Databricks acquired Neon, leading to the Data+AI company offering Lakebase, a managed serverless Postgres database with separated storage & compute.
Full disclosure - I work for Databricks, but was a user beforehand and a long-time practitioner dating back to when no one had even heard of Databricks. My thoughts and opinions in this blog are my own, I assure you.
With Lakebase’s CDF, what I’m seeing more and more is that the database itself handles syncing the data from the OLTP to the OLAP storage layer. Today this may be fairly specific to Lakebase by Databricks, but I believe we’ll see other engines try to offer a similar feature soon.
This means no external CDC tool needed and no pipeline to maintain. It solves the PITA that was database ingestion previously.
Serving Analytical Data
Similarly, Lakebase’s Synced Tables carry data in the opposite direction. Lakehouse tables get synced into Lakebase’s Postgres database for low-latency reads.
For example, next time some batched ML outputs need to be written back and stored in Postgres for the SWEs’ APIs to read and serve, we can sync them without custom pipelines running JDBC or hacky sync scripts that copy data exports.
Point is... OLTP databases have been a notorious divider between SWE & DE personas, and that divide is now (from a technical standpoint) all but gone. Data contracts & schema evolution are still very important of course.
Message Buses - the magic pipe
The second-largest area where I see the SWE & DE gap shrinking is the message bus. Almost every API microservice ever created has eventually had the need to hand off data as quickly as possible to some other service for ingestion or processing.
As SWEs who own an API, we prioritize latency and avoiding synchronous processing as much as possible. So what do we do? We go out and grab a message bus -- Kafka, Kinesis, etc. -- and now we can fire off messages to the bus and carry on with our own application. Someone else’s problem.
Well, that someone is usually the DE persona. Just like with the OLTP database, the DE usually needs to ingest this data, which means maintaining a pipeline reading from the message bus, and that pipeline often stays on 24/7 (even when the API traffic isn’t constant) to maintain decent data freshness in the lakehouse.
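The fire-and-forget handoff described above can be sketched in a few lines of Python. To be clear about the assumptions: InProcessBus is a hypothetical in-process stand-in built on the standard library’s queue and threading modules, not a Kafka or Kinesis client, and the delivered list simply stands in for whatever the DE’s ingestion pipeline would do with each message.

```python
import queue
import threading


class InProcessBus:
    """Toy stand-in for a message bus: publish returns immediately,
    and a background worker plays the role of the always-on consumer."""

    def __init__(self) -> None:
        self._queue: queue.Queue = queue.Queue()
        self.delivered: list[dict] = []  # stand-in for the lakehouse table
        worker = threading.Thread(target=self._consume, daemon=True)
        worker.start()

    def publish(self, message: dict) -> None:
        # The API handler's side: enqueue and move on, no waiting on consumers.
        self._queue.put_nowait(message)

    def _consume(self) -> None:
        # The DE's side: a loop that stays on even when traffic is quiet.
        while True:
            message = self._queue.get()
            self.delivered.append(message)
            self._queue.task_done()

    def flush(self) -> None:
        # Block until everything published so far has been consumed.
        self._queue.join()


if __name__ == "__main__":
    bus = InProcessBus()
    bus.publish({"event": "signup", "user_id": 42})
    bus.flush()
    print(bus.delivered)
```

Note that the consumer loop never exits on its own, which mirrors the 24/7 pipeline cost: someone has to keep that side running whether or not messages are arriving.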
Data contracts are broken here even more often than in databases, since message buses tend to encourage unstructured or semi-structured data representations like JSON. One time I had to hotfix a Kafka ingestion pipeline because the application team decided to change the payload to store event timestamps as epoch millis instead of epoch seconds, then realized a junior dev had accidentally used String instead of Long, requiring an additional payload change. This led to our data ingestion pipeline having a total of 3 different timestamp parsers in it, and they had to stay for a long time, since the application was a mobile app and some devices could still be running one of the old app versions.
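For a sense of what that kind of defensive parsing looks like, here is a minimal sketch in Python. It is not the original pipeline code: the function name parse_event_ts and the magnitude cutoff are my own assumptions for illustration. The heuristic treats any value at or above 10**11 as milliseconds, which only holds for events dated after early 1973 (as millis) and before roughly the year 5138 (as seconds).

```python
from datetime import datetime, timezone

# Assumed cutoff: values this large are taken to be epoch milliseconds.
MILLIS_THRESHOLD = 10**11


def parse_event_ts(raw) -> datetime:
    """Normalize the three payload variants to a UTC datetime:
    epoch seconds (int), epoch millis (int), and epoch millis sent as a string."""
    value = int(raw)  # accepts an int or a numeric string
    if value >= MILLIS_THRESHOLD:
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    return datetime.fromtimestamp(value, tz=timezone.utc)


if __name__ == "__main__":
    for raw in (1700000000, 1700000000000, "1700000000000"):
        print(repr(raw), "->", parse_event_ts(raw).isoformat())
```

All three inputs resolve to the same instant, but the guesswork baked into the threshold is exactly the kind of thing a typed, shared schema would make unnecessary.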
Message buses have a myriad of other challenges in testing, monitoring, and debugging, as discussed at length in this 2024 paper [2].
The game changer in this area for me has been Zerobus, also by Databricks.
I know, another Databricks product... from the dude that works at Databricks... big surprise. I’m definitely biased, but I promise I have more personal experience with these pain points than you can imagine.
Zerobus is a push-based API that writes directly to Unity Catalog Delta tables. At its core, it’s a load-balanced gRPC service that you can call just like you would a message bus, so it doesn’t hold up those precious low-latency APIs, but the data still gets delivered to the Lakehouse in near real-time (P50 <= 5 seconds, P95 <= 30 seconds). It doesn’t replace 100% of the use cases for true message buses, but for many real-world scenarios where message buses are being used as a transport-only mechanism, it certainly fits.
Because it’s gRPC, which uses Protocol Buffers, this addresses 90% of the “data contract” needs in cross-functional data exchange in my experience. I don’t have to guess at what the DE is expecting, or cobble together some AI-slop technical documentation of each field -- just share the .proto and you’re good. Plus it helps that SWEs are often already well-versed in protocol buffers and RPC!
The result ends up being transparency in data exchange, zero expensive message buses, zero painful bus re-sharding for the DevOps team, and zero data ingestion pipeline for the DEs to maintain.
A Modern Stack
This is already a pretty opinionated piece, betting on key Databricks services, but I’ll paint a fuller picture to wrap things up.
In the new world, this is how I envision the “hybrid” data/software stack to be:
In fact, if you want to begin working towards this architecture on your next API or microservice solution, I’ve created a go-api-template repo that serves as my personal recipe for “if I were to build a new microservice tomorrow, here’s how I’d build it.”
Of course, Databricks just announced LTAP (Lakehouse Transactional/Analytical Processing) to combine Lakebase (serverless Postgres on open object storage) with the Lakehouse under a single governance model, source of truth, and storage layer for all operational, analytical, and streaming data. That deserves its own separate post but with it being even newer than the features I just talked about, I’ll give it some time in the wild before covering it.



