The Birth of SAP HANA or “A Common Database Approach for OLTP and OLAP Using
an In-Memory Column Database” by Hasso Plattner
This article sat in my drafts for a long time because I did not have time to work on it. I recently saw it was partly complete and decided to finish it off.
This is the follow-up to my earlier article, which touched briefly on relational databases and how in-memory technology is being leveraged for Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) workloads. In this article I will elaborate on how in-memory technology is used, mainly in SAP HANA, drawing on the seminal paper written by Hasso Plattner, a co-founder of SAP.
Before I go any further, let me give a brief rundown of the four main requirements (the ACID properties) a database system must meet to ensure reliability:
- Atomicity — A transaction has to be atomic. (For example, if part of a transaction fails, the entire transaction has to fail and leave the database state unchanged.)
- Consistency — Only valid data should be written to a database. (If a transaction is executed that contravenes the database's consistency rules, the transaction should be rolled back and the database restored to a state consistent with those rules. If a transaction executes successfully, it takes the database from one consistent state to another.)
- Isolation — A transaction must be isolated so that no one transaction can interfere with another.
- Durability — After a transaction has been committed to the database it should remain there.
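To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table and its values are hypothetical, invented purely for illustration; the point is that when one statement in the transaction fails, the earlier statement is rolled back with it.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 80 "
                     "WHERE name = 'alice'")
        # Simulate a failure mid-transaction: this violates the PRIMARY KEY.
        conn.execute("INSERT INTO accounts VALUES ('alice', 0)")
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back

# Atomicity: the earlier UPDATE did not survive the failed transaction.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100, unchanged
```

The `with conn:` context manager is what gives the all-or-nothing behaviour here: on an exception it issues a rollback instead of a commit.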
The above constraints are considered important for both OLTP and OLAP systems. While both are based on relational theory, they have traditionally taken different technical approaches. In his 2009 paper titled “A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database”, Hasso Plattner explored the possibility of supporting both OLTP and OLAP workloads on a single RDBMS, which led to the birth of SAP HANA as we know it now.
Over the years, the growth of main memory and parallelism has enabled enterprise systems to grow in both functionality and volume. Even prior to the paper, in-memory column storage had been used successfully for OLAP (for example in Sybase databases; Sybase was subsequently acquired by SAP in 2010).
When comparing column storage to row storage, there were several reasons column storage was initially adopted for OLAP systems. One advantage is that in column-based storage only the relevant columns are read during selection, and any column can serve as an index or key for data retrieval, whereas in row-based storage all data needs to be read during selection. However, for update- or insert-intensive operations, row-based storage was more efficient at the time, since all attributes of a tuple are stored together contiguously, as illustrated in the diagram below.
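The difference between the two layouts can be sketched in a few lines of Python. The tiny table below is hypothetical; the point is that an aggregate over one attribute touches every tuple in a row store but only a single contiguous array in a column store.

```python
# The same three-row table laid out row-wise and column-wise.
rows = [
    (1, "alice", 100),   # (id, name, balance)
    (2, "bob",    50),
    (3, "carol",  75),
]

# Row store: each tuple is contiguous, so summing one attribute
# still walks over every full tuple.
total_row_store = sum(r[2] for r in rows)

# Column store: each attribute is its own contiguous array,
# so an aggregate over 'balance' reads only that one array.
columns = {
    "id":      [1, 2, 3],
    "name":    ["alice", "bob", "carol"],
    "balance": [100, 50, 75],
}
total_col_store = sum(columns["balance"])

print(total_row_store, total_col_store)  # 225 225
```

Conversely, inserting a new record means one append in the row store but one append per column in the column store, which is why row stores suited insert-heavy OLTP workloads at the time.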
Columnar storage also allowed for compression, because of the low cardinality of columns in most practical cases and because of dictionary encoding of values. This meant that the size of a column-store table was significantly smaller than that of the equivalent row-based table, and it also allowed for parallel execution. In the example given in the paper, for a table with 35 million rows and 160 attributes, the row-based table consumed 35 GB of space while the column-store table required only 8 GB.
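A minimal sketch of the dictionary-encoding idea, using a made-up low-cardinality country column: each distinct value is stored once, and the column itself becomes an array of small integer codes, which is what makes the compression possible.

```python
# Dictionary-encode a low-cardinality column: store each distinct value
# once and replace the column with small integer codes.
country = ["DE", "US", "DE", "DE", "US", "FR", "DE", "US"]

dictionary = sorted(set(country))               # ['DE', 'FR', 'US']
code_of = {v: i for i, v in enumerate(dictionary)}
encoded = [code_of[v] for v in country]         # one small int per row

# Decoding is a simple lookup; scans and comparisons can run
# directly on the integer codes instead of the strings.
decoded = [dictionary[c] for c in encoded]
assert decoded == country
print(encoded)  # [0, 2, 0, 0, 2, 1, 0, 2]
```

In a real column store the codes would additionally be bit-packed (here, 2 bits per value would suffice for 3 distinct values), which is where savings like the 35 GB to 8 GB figure above come from.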
Other advantages of column storage included that, with the CPU speeds available at the time, there was no need to even provide a primary key index; a full table scan could be used instead. Since applications were no longer restricted to selecting data only along predefined navigation paths, this led to a better separation of concerns between the database layer and the application layer. In fact, when comparing in-memory and hard disks for persistence, Plattner goes so far as to suggest that hard disks have become yesterday’s tape and should be used only for transaction logging and snapshot recovery.
When considering the suitability of column storage for update-intensive applications, it is interesting to note that updates on column stores are considered expensive: even though the data is in memory, the attribute dictionaries may have to be recalculated. In his paper, Plattner categorized the updates in SAP's financial system into three main types.
- Aggregate Updates — Most of the updates taking place in financial applications apply to total records following the structure of the coding block. The coding block can contain, for example, the account number, legal organization, year, etc. These total records are essentially materialized views on the journal entries, maintained to facilitate fast response times when aggregations are requested. The more often aggregates are requested, the better the relative performance of the column store, which can compute them on the fly.
- Status Updates — Status variables (e.g. unpaid, paid) typically use a predefined set of values and thus pose no problem for an in-place update, since the cardinality of the variable does not change.
- Value Updates — To handle value updates, the concept of a column-store table comprising a main store and a delta store is used. When values are updated, they are first written to the delta store and at a later point in time “merged” into the main store. However, at any point in time, read operations are carried out on both the main store and the delta store to ensure all values in the table are read.
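The main/delta split described above can be sketched as a toy class. This is not SAP HANA's actual implementation, just an illustration of the contract: inserts are cheap appends to the delta, reads scan both stores, and a periodic merge folds the delta into the main store.

```python
# Toy column-store table with a read-optimized main store and a
# write-optimized delta store.
class ColumnTable:
    def __init__(self, values):
        self.main = list(values)   # compressed and read-optimized in a real system
        self.delta = []            # recent inserts/updates land here

    def insert(self, value):
        self.delta.append(value)   # cheap append; no dictionary rebuild

    def scan(self):
        # Reads must see both stores to return all current values.
        return self.main + self.delta

    def merge(self):
        # Periodic merge folds the delta into the main store.
        self.main.extend(self.delta)
        self.delta = []

t = ColumnTable([10, 20])
t.insert(30)
print(t.scan())   # [10, 20, 30] — the new value is visible before any merge
t.merge()
print(t.main)     # [10, 20, 30]
```

The design choice here is the trade-off the paper relies on: expensive dictionary recalculation is deferred to the merge, so individual writes stay fast while reads remain complete.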
There are several consequences of this insert-only approach to value updates, one of which is the locking that happens at the application and database levels. At the application level, many business transactions deal with several relational tables, and with multiple tuples of one table, simultaneously. Applications “think” in objects, a layer established on top of the relational model. For example, in the reconciliation of open items in accounts payable or receivable, multiple open items are marked as paid in one transaction; the lock is not on the accounting line items table but on the creditor or debtor objects.

At the database level, inserts are added to the delta store at the appropriate partition of a table. The timestamp at the start of a query defines which tuples are valid (only tuples with a lower timestamp). If an insert is in progress (one or more), the timestamp of the start of a new query is set to the timestamp of the insert transaction minus one, so the ongoing insert(s) are ignored. This procedure is equivalent to snapshot isolation via timestamps.
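The timestamp-based visibility rule can be illustrated in a few lines. The tuples and timestamps below are made up; the logic follows the description above: a query only sees tuples whose insert timestamp is strictly lower than the query's own timestamp.

```python
# Each tuple carries the timestamp of its insert transaction.
tuples = [
    (1, "paid"),     # (insert_timestamp, value)
    (3, "unpaid"),
    (5, "unpaid"),   # this insert is still in progress at query start
]

def snapshot_read(query_ts):
    # Only tuples inserted strictly before the query's timestamp are visible.
    return [value for ts, value in tuples if ts < query_ts]

# A query starting while the ts=5 insert is in flight is assigned
# query_ts = 5 - 1 = 4, so the in-progress insert is ignored.
print(snapshot_read(4))  # ['paid', 'unpaid']
```

Each query thus sees a consistent snapshot of committed data without taking read locks, which is exactly what snapshot isolation provides.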
In terms of the benefits outlined in the paper, Plattner estimated that the compression of column storage, as opposed to row storage, would yield roughly a factor of 10 saving in memory consumption. In addition, he envisaged that rewriting existing SAP ERP applications using SQL would reduce the amount of code by more than 30% (in more formal applications like financials, by 40–50%). Among the other benefits he foresaw were that there would be no need for indices, and no need for aggregates in the form of materialized views, since views would be created on the fly.
SAP HANA, which is positioned as being able to handle OLTP and OLAP queries simultaneously, grew out of the idea posited in this paper, and it is pretty amazing to see it powering organizations across the globe. Please feel free to post your comments; I would love to hear your thoughts on SAP HANA and whether your organization has also adopted in-memory columnar database technology.