Claravine vs ETL, ELT, and Reverse ETL
Data is growing faster than ever.
A laptop in the early 2000s offered 40GB of storage. Now even the most basic smartphone provides 30GB, while a high-end iPhone has more than 500GB.
As data in the world continues to increase, it has to be measured in zettabytes, not gigabytes.
According to IDC, the ‘Global Datasphere’ reached 18 zettabytes in 2018: the total of all data created, captured, or replicated since the beginning of time. By 2025, however, IDC estimates that the world’s data will grow to 175 zettabytes. In other words, in just a few years the world will produce more than nine times the data it had accumulated from the beginning of time through 2018.
To put it into perspective, if 175 zettabytes were stored on DVDs, the stack would be long enough to circle Earth 222 times!
How Is All This Data Created?
Every digital interaction — with computers, cell phones, IoT devices, Google queries, music or video downloads, streaming, social media, marketing campaigns — creates data. From GPS sensors in our cars to contactless debit cards, we’re creating data. The more digital we get, the more data we create.
A good percentage of this data can be used to refine personas or cohorts and improve targeted marketing campaigns, which in turn creates an entirely new round of data. But when the data is corrupted, or scattered across a variety of different sources, it requires downloading, reformatting, checking for errors, and re-uploading — in other words, a lot of wasted time.
What Data Delivers
As data gives us clues about consumers or corporate sentiment, it can provide insight into purchasing trends. Over time, data from machinery, manufacturing, space flight, athletics, oil production, traffic lights, or responses to marketing campaigns can provide the kind of information needed to gain insights and help create more efficient brands and customer experiences.
As with all valuable things, the data needs to be stored. Securely. Access needs to be fast and easy because exporting must be near real-time in order for business analytics to be effective. However, data from an iPhone is far different from data from an IoT device. Scraping data from Facebook traffic is different from LinkedIn or Instagram. Yet it’s all very important for building a link with a prospect in a purchasing cycle.
The relationship between customers and businesses can be built with cognitive empathy – requiring business leaders to make strategic decisions that emphasize experiences, journeys, trust, and satisfaction. By using data and employing technologies that address requirements for contextual awareness, frictionless engagement, active learning, and sentiment measurement, organizations will be better able to customize and personalize experiences.
The graphic above points out how valuable customer experience data comes from a variety of locations, sources, and corporate business functions. Tracking clicks, web visits, content downloads, keywords, merchandising, inventory management, project management systems, creative/content management systems, SQL databases, CRM systems, and sales all generate their own metadata in different formats. That doesn’t even account for the other corporate functions that may need access or collect their own data, including finance, general counsel, and support. Each has its own technology stack, and each stack has its own data configurations that might serve marketing well for understanding the customer experience, but they are literally in different languages. So any effort to extract and store all of this data together, in a consistent format, can consume huge amounts of time, manual human effort, and cost.
The Challenges of Data Quality
An IDC survey found that a large number of companies have suffered negative consequences from poor data quality, including wasted media spending, inaccurate targeting, and even lost customers. While these companies appreciate the importance of high-quality data, several barriers — especially the need to manage a wide variety of data sources, soaring data volumes, integration issues, and regulatory/privacy concerns — slow their progress.
According to a recent Forrester report on data quality, wasted media spend is the most frequently cited repercussion. Companies estimate that 21 cents of every media dollar their organization spent in the last year was wasted due to poor data quality, which translates to an average annual loss of $1.2 million for the midsize organizations and $16.5 million for the enterprise organizations in the Forrester study.
In addition, as much as 32% of their marketing teams’ time is spent managing data quality, and, on average, 26% of their campaigns in the last year were hurt by poor data quality. Decision-makers also identified access to high-quality data as the No. 1 factor driving their marketing performance success.
Some of the top benefits, realized or expected, from marketing/media data quality improvements include:
- Better customer experiences
- Improved customer targeting
- Faster decision-making
- Reduced media spending waste
But how must this data be managed and stored?
Managing the Data Lake Ecosystem for Clarity, not Mud and Algae
Back in the day, when laptops came with a whopping 40GB of storage (one 20-slide PPT presentation today), managing data wasn’t so difficult.
For decades, the leading form of data management has been Extract, Transform, and Load (ETL). The ETL process became popular in the 1970s, when it was used primarily to move data into early data warehouses, the forerunners of today’s data “lakes.”
Properly designed ETL systems extract data from various source systems, enforce data quality and consistency standards, and conform that data so that separate sources can be used together. The goal is to deliver data in a presentation-ready format. The process, however, requires considerable manual work, and when managing terabytes of data it consumes unnecessary time and is prone to error. To compensate, it’s common to execute all three phases (E-T-L) concurrently while in the pipeline. Even so, it’s not a process meant for real-time decision-making.
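The three phases can be sketched in a few lines of Python. This is a minimal illustration, not a real ETL framework: the source records, field names, and validation rules are all invented for the example.

```python
# A minimal sketch of the three ETL phases, using an in-memory list of
# records in place of real source systems and a warehouse. All names
# (extract, transform, load, warehouse) are illustrative, not a real API.

def extract():
    # Pull raw rows from a hypothetical source system.
    return [
        {"name": " Ada ", "spend": "120.50", "region": "emea"},
        {"name": "Grace", "spend": "not-a-number", "region": "AMER"},
    ]

def transform(rows):
    # Enforce quality and consistency standards: trim whitespace,
    # normalize casing, coerce types, and drop rows that fail checks.
    clean = []
    for row in rows:
        try:
            clean.append({
                "name": row["name"].strip(),
                "spend": float(row["spend"]),
                "region": row["region"].upper(),
            })
        except (KeyError, ValueError):
            continue  # discard rows that cannot be conformed
    return clean

def load(rows, warehouse):
    # Append conformed rows to the target store.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # only the row that passed validation remains
```

Even in this toy version, the cost is visible: every quality rule has to be written and maintained by hand, before the data ever lands in the warehouse.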
These systems are commonly forced to integrate data from multiple applications on disparate systems, typically developed and supported by different vendors or hosted on separate hardware. The separate systems containing the original data are frequently managed and operated by different employees, creating silos of information. An entire ecosystem of technology products has grown up to make this possible. You can imagine it’s expensive, and you’d be right.
Dated or improperly designed ETL processes involve considerable complexity and create significant operational issues with today’s data loads. Multiple terabytes of data can only be processed by using powerful (expensive) servers with multiple CPUs, multiple hard drives, multiple gigabit-network connections, and a lot of memory.
In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may perform slowly because they have to take care of concurrency, integrity maintenance, and indices. Still, even using bulk operations, database access is usually the bottleneck in the process.
Another common issue occurs when the data are spread among several databases, and processing is done in those databases sequentially. Database replication, which may be involved as a method of copying data between databases, can significantly slow down the entire load process.
When processing big ETL loads, warehousing procedures will usually subdivide the process into smaller pieces to run them sequentially or in parallel. To do this, each data row and piece of the process must be tagged with an ID to keep track of these data flows in case of a failure. These IDs then help to roll back and rerun the failed piece — wasting yet more time — and preventing other projects from running.
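The batch-and-ID bookkeeping described above can be sketched as follows. The batch size, failure condition, and load function are hypothetical; the point is that tagging each piece with an ID lets a failed piece be rolled back and rerun without repeating the whole load.

```python
# Illustrative sketch of tagging batches with IDs so a failed piece can
# be rerun on its own. The load function and its failure condition are
# made up for the example.

def run_batches(rows, batch_size, load_fn):
    failed = []
    for batch_id, start in enumerate(range(0, len(rows), batch_size)):
        batch = rows[start:start + batch_size]
        try:
            load_fn(batch_id, batch)
        except Exception:
            failed.append(batch_id)  # record the ID so only this piece reruns
    return failed

loaded = {}
def load_fn(batch_id, batch):
    # Pretend a None value is a corrupt row that aborts the batch.
    if any(v is None for v in batch):
        raise ValueError(f"bad row in batch {batch_id}")
    loaded[batch_id] = batch

failed = run_batches([1, 2, None, 4, 5, 6], 2, load_fn)
print(failed)   # [1] — only the batch containing the bad row
print(loaded)   # batches 0 and 2 loaded successfully
```

The failed IDs become the rerun list; everything else stays loaded, but the rerun itself still costs time that other projects spend waiting.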
Earlier in this century, data virtualization began to advance ETL processing. Applying data virtualization to ETL made it possible to solve the most common ETL tasks of data migration and application integration across multiple dispersed data sources. But the process is still about dealing with the horse after it’s out of the barn.
ETL vs. ELT
Extract, Load, Transform (ELT) is a variant of ETL that loads extracted data into the target system first. The architecture of the analytics pipeline can then determine where to cleanse and enrich the data, as well as how to conform dimensions.
Cloud-based data warehouses provide highly scalable computing power. This lets businesses forgo preload transformations and replicate raw data into their data warehouses, where it can be transformed as needed using SQL. After ELT, the data may be processed further and stored in a data mart.
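As a rough sketch of the ELT pattern, the snippet below uses SQLite as a stand-in for a cloud warehouse: raw rows are loaded untouched, then cleaned and aggregated with SQL inside the “warehouse.” Table and column names are invented for the illustration.

```python
# ELT sketch: load raw data first, transform later with SQL.
# SQLite stands in for a scalable cloud warehouse; the schema is invented.
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: replicate raw rows as-is, with no preload transformation.
conn.execute("CREATE TABLE raw_events (region TEXT, spend TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("emea", "120.50"), ("amer", "80.25"), ("emea", "10.00")],
)

# Transform: use SQL inside the warehouse to clean and aggregate on demand.
conn.execute("""
    CREATE TABLE spend_by_region AS
    SELECT UPPER(region) AS region, SUM(CAST(spend AS REAL)) AS total
    FROM raw_events
    GROUP BY UPPER(region)
""")
print(conn.execute("SELECT * FROM spend_by_region ORDER BY region").fetchall())
# [('AMER', 80.25), ('EMEA', 130.5)]
```

The design choice is visible in the ordering: because the raw text lands first, new transformations can be added later without re-extracting anything from the sources.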
There are pros and cons to each approach. Where most data integration tools skew towards ETL, ELT is popular in database and data warehouse appliances.
As if managing data weren’t complex enough, teams are now adopting yet another approach called “reverse ETL”: moving data from a data warehouse back out into third-party systems, using each system’s API to standardize the data for that stack. Reverse ETL moves data through a single pipeline, which simplifies the stack, security, and data governance. The emergence of reverse ETL solutions is a useful development that lets stacks better leverage data across an organization.
Both data analytics and go-to-market teams benefit from reverse ETL. Data teams now only have to maintain a single data pipeline; they no longer have to write scripts; and they have visibility and control over data syncs. Sales, marketing, and analytics teams can analyze and act upon the same, consistent, and reliable data. Data consistency helps create continuity across a business since functional teams are working off the same data even if using different SaaS products or stacks.
For example, sources of data often include events coming from client- or server-side apps, SaaS tools, internal databases, data warehouses, data lakes, and internal event streams. In a customer data stack, these sources can also be destinations, as in:
- an app feeding into a warehouse
- SaaS feeding into a warehouse in traditional ETL/ELT, or
- a warehouse feeding into a SaaS stack in reverse ETL.
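The third case, a reverse-ETL sync, can be sketched as below: rows from a warehouse query are remapped to the field names a downstream SaaS tool expects, then handed to a stubbed API call. The field mapping and `send_to_saas` stub are hypothetical, not any particular vendor’s API.

```python
# Hedged sketch of a reverse-ETL sync: warehouse rows are mapped to the
# field names a downstream SaaS tool expects, then "sent". The mapping,
# column names, and send_to_saas stub are all invented for the example.

FIELD_MAP = {"email_addr": "email", "full_name": "name"}  # warehouse -> SaaS

def to_saas_payload(row):
    # Keep only mapped fields; internal columns never leave the warehouse.
    return {FIELD_MAP[k]: v for k, v in row.items() if k in FIELD_MAP}

sent = []
def send_to_saas(payload):
    sent.append(payload)  # stand-in for an authenticated API call

warehouse_rows = [
    {"email_addr": "ada@example.com", "full_name": "Ada", "internal_id": 7},
]
for row in warehouse_rows:
    send_to_saas(to_saas_payload(row))
print(sent)  # [{'email': 'ada@example.com', 'name': 'Ada'}]
```

Because every destination pulls from the same warehouse through one pipeline, each SaaS tool receives the same underlying data, just reshaped to its own schema.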
Fresh data across all corporate stacks helps improve and accelerate decision-making and reporting.
There is, however, a better way to get data where it needs to be.
The New Data Management – A Centralized Platform
Using standards to achieve data integrity isn’t a new idea. Unfortunately, the approach to realizing this goal has been flawed and clumsy.
Many organizations see data integrity as an engineering problem to fix. Data engineers and data teams bear the burden of applying enough transformation, cleansing, stitching, and third-party augmentation to the data within their systems. This is, at its most basic, a dated process.
Corporate teams may think they have implemented strategies in the service of data integrity. But in practice, existing solutions tend to be manual and decentralized and push the problem of data integrity downstream. This shifting of the problem results in a reactive, time-consuming, and flawed approach to fixing data which ultimately impedes business potential and sells a false narrative around the possibility of reaching data integrity.
An organization’s most vital resource is its data. By bringing together a truly unique set of solutions, including a collaborative blueprint for approaching data standards, each organization within a company can customize and construct its own unique approach to data integrity. With this method, data integrity is created and cultivated over time through the contributions of business owners and data owners within the systems they already use. Through these systems and new technology, they can standardize and manage their data, connect it where it needs to flow to and from, and control these standards with clear visibility and access.
Some organizations may think that an internally built solution leveraging ETL will address their problem. However, while they may be able to architect for aspects of their data integrity challenges, they will lack a centrally accessible, intuitive, no-code UI that enables collaboration between user groups across expanding and evolving use cases within an organization.
What companies need is a modern, centrally maintained cloud solution, built from the ground up for proactively assigning standard definitions through a centralized data platform. Only by managing standards across the technology and data landscape will data creators and owners be alerted to issues before they escalate and have the data quality needed for optimal use in each of their technology stacks.
In the end, brands will be able to develop empathetic relationships with their customers by understanding what the customer wants — and how they want to be treated — through the technology lens of awareness, engagement, learning, and measuring.
The traditional business model that once focused on building products, designing services, advertising and marketing those products, and then waiting for the sales to roll in is over. That industrialized model of the customer journey is fading into antiquity as customers return for purchase experiences.
Brands that want to grow should focus on providing empathy at scale. And that will only happen with rapid and well-informed decisions resulting from good data.