The problem with delta loads from transactional systems
markus-spiske-xekxE_VR0Ec-unsplash.jpg
At Celonis, we use delta loads to make the data available to our customers as quickly as possible. A delta load means that we do not repeatedly extract the entire data of a table, but only the new data that has been added to a table since the last load.
Background and testing environment
We were assessing the possibility of delta loading rows from a so-called change log table we create in SAP systems. The change log table contains a row for each change or update that occurs. The idea was simple: add a column to the change log table that contains the timestamp of the row. When extracting from the change log table, the rows where the timestamp is greater than the maximum already extracted are selected. The cached maximum timestamp is updated after each extraction.
We were testing on SAP servers that use Oracle and HANA as their databases. Their standard isolation level is committed reads (i.e. transactions are solely able to read committed changes). It’s important to note for later that neither allow dirty reads (i.e. reading of changes that have been conducted by another running transaction and not yet committed) since they are able to provide lock-free reads without the side effect of being able to read uncommitted changes from other transactions.
Discovery of possible unextracted rows
When discussing the implementation details we identified that when several database operations are wrapped in a transaction, which itself can be run concurrently with other transactions, it can lead to rows with older timestamps being committed after those with newer ones.
Please follow this chain of events (scripting in HANA) to both reproduce this and see why it is problematic for delta loads:
1. Create a table that contains three columns:
- ID: auto-increment
- Name: string
- Timestamp: insertion timestamp.
CREATE TABLE ZTEST (
ID VARCHAR(5),
NAME VARCHAR(20),
TIMESTAMP VARCHAR(20));
CREATE SEQUENCE seq
START WITH 1
INCREMENT BY 1 ;
INSERT INTO ZTEST VALUES (
seq.NEXTVAL,
'old',
TO_CHAR(CURRENT_TIMESTAMP, 'YYYY-MM-DD HH:MI:SS'))
CREATE PROCEDURE "batch_insert"() AS BEGIN
DECLARE counter INT := 0;
WHILE counter < 1000 DO
counter := counter + 1;
INSERT INTO ZTEST VALUES (
seq.NEXTVAL,
'batch-insert',
TO_CHAR(CURRENT_TIMESTAMP, 'YYYY-MM-DD HH:MI:SS'));
END WHILE;
END;
5. Call the procedure and then move quickly to step 6 as we need the two queries to run in parallel. In other databases you can use a sleep() command in the procedure to make this timing easier.
This is a fairly standard situation in databases, for example in SAP having several queries to update invoice statuses wrapped in a LUW (more information) , several copies of which could be running at once.
CALL "batch_insert"();
INSERT INTO ZTEST VALUES (
seq.NEXTVAL ,
'single-insert',
TO_CHAR(CURRENT_TIMESTAMP, 'YYYY-MM-DD HH:MI:SS'))
7. When the single insert finishes and commits and the batch insert is still running the resulting table will only have the single insert visible. The entries of the batch insert will only be visible after the complete transaction finishes and is committed.
Assume we are extracting at this point in time from the table and load the latest timestamp 2020-09-29 04:04:52 into the cache.
8. When the batch insert then finishes, the resulting table will look like the one below. Clearly visible is that the database calculates the timestamp mid-transaction, not on commit.
Assume we are extracting again, filtering on “timestamp > MAX(timestamp)”, whereby the cached maximum timestamp is 2020-09-29 04:04:52. We will only extract the rows highlighted in purple. The rows with ID 2 to 721 will not be extracted because their timestamp is less than what was received in the first delta load.
Svenja Matthaei
As a software engineer and data pipeline enthusiast, Svenja has been involved in every step of the Celonis ETL process.
At Celonis she first specialized on transforming and loading processes and now focuses on the extraction of valuable data.
She assesses comprehensive data as key to effective decision making.
Thus, Svenja is passionate about providing and combining data from various sources in real-time.
- Svenja Matthaei