The problem with delta loads from transactional systems

markus-spiske-xekxE_VR0Ec-unsplash.jpg

At Celonis, we use delta loads to make the data available to our customers as quickly as possible. A delta load means that we do not repeatedly extract the entire data of a table, but only the new data that has been added to a table since the last load.

To enable delta loads we need to identify in a table which rows have been extracted already and which ones are new or updated. We rely on a timestamp, such as modified date, selecting rows where the timestamp is greater than the maximum already extracted. Lately we have discovered that in certain situations this can lead to rows being left unextracted. Want to know why?

Background and testing environment

We were assessing the possibility of delta loading rows from a so-called change log table we create in SAP systems. The change log table contains a row for each change or update that occurs. The idea was simple: add a column to the change log table that contains the timestamp of the row. When extracting from the change log table, the rows where the timestamp is greater than the maximum already extracted are selected. The cached maximum timestamp is updated after each extraction.

We were testing on SAP servers that use Oracle and HANA as their databases. Their standard isolation level is committed reads (i.e. transactions are solely able to read committed changes). It’s important to note for later that neither allow dirty reads (i.e. reading of changes that have been conducted by another running transaction and not yet committed) since they are able to provide lock-free reads without the side effect of being able to read uncommitted changes from other transactions.

Discovery of possible unextracted rows

When discussing the implementation details we identified that when several database operations are wrapped in a transaction, which itself can be run concurrently with other transactions, it can lead to rows with older timestamps being committed after those with newer ones.

Please follow this chain of events (scripting in HANA) to both reproduce this and see why it is problematic for delta loads:

1. Create a table that contains three columns:

  • ID: auto-increment
  • Name: string
  • Timestamp: insertion timestamp.
    CREATE TABLE ZTEST (
ID VARCHAR(5),
NAME VARCHAR(20),
TIMESTAMP VARCHAR(20));
2. Create a sequence to enable an auto-increment for the ID.
    CREATE SEQUENCE seq
START WITH     1
INCREMENT BY   1 ;
3. Insert a value into your table to check if everything works as expected.
    INSERT INTO ZTEST VALUES (
seq.NEXTVAL,
'old',
TO_CHAR(CURRENT_TIMESTAMP, 'YYYY-MM-DD HH:MI:SS'))
4. Create a procedure to mimic a batch insert.
    CREATE PROCEDURE "batch_insert"() AS BEGIN
DECLARE counter INT := 0;
WHILE counter < 1000 DO
    counter := counter + 1;
    INSERT INTO ZTEST VALUES (
seq.NEXTVAL,
'batch-insert',
TO_CHAR(CURRENT_TIMESTAMP, 'YYYY-MM-DD HH:MI:SS'));
END WHILE;
END;

5. Call the procedure and then move quickly to step 6 as we need the two queries to run in parallel. In other databases you can use a sleep() command in the procedure to make this timing easier.

This is a fairly standard situation in databases, for example in SAP having several queries to update invoice statuses wrapped in a LUW (more information) , several copies of which could be running at once.

    CALL "batch_insert"();
6. Open another session and execute a single insert while the procedure from step 5 is still running.
    INSERT INTO ZTEST VALUES (
seq.NEXTVAL ,
'single-insert',
TO_CHAR(CURRENT_TIMESTAMP, 'YYYY-MM-DD HH:MI:SS'))

7. When the single insert finishes and commits and the batch insert is still running the resulting table will only have the single insert visible. The entries of the batch insert will only be visible after the complete transaction finishes and is committed.

Assume we are extracting at this point in time from the table and load the latest timestamp 2020-09-29 04:04:52 into the cache.

Screenshot_2020-10-26_at_21.55.04.png

8. When the batch insert then finishes, the resulting table will look like the one below. Clearly visible is that the database calculates the timestamp mid-transaction, not on commit.

Assume we are extracting again, filtering on “timestamp > MAX(timestamp)”, whereby the cached maximum timestamp is 2020-09-29 04:04:52. We will only extract the rows highlighted in purple. The rows with ID 2 to 721 will not be extracted because their timestamp is less than what was received in the first delta load.

Screenshot_2020-10-26_at_21.58.21.png

Svenja Matthaei

As a software engineer and data pipeline enthusiast, Svenja has been involved in every step of the Celonis ETL process.
At Celonis she first specialized on transforming and loading processes and now focuses on the extraction of valuable data.
She assesses comprehensive data as key to effective decision making.
Thus, Svenja is passionate about providing and combining data from various sources in real-time.

author
  • Svenja Matthaei