Row Oriented Database vs Column Oriented Database simplified.

Nilesh Khandalkar
4 min readNov 28, 2021

--

Database is either Column oriented database- or row oriented database, choosing either of this will have an impact on your application performance and will solely depend on your use-case.

Let’s suppose we have the EMP table with many employee records and we execute the below SQL’s

Select ename from emp where salary = 100000;

Select * from emp where empid = 1;

Select sum(salary) from emp;

In Row oriented database the above rows of the table will be stored as below, imagine the below section (B1,B2,..) as the blocks, for simplicity I just showed 2 rows per block, it can be more than that depending on the block size of the storage. For a Given block each block takes one IO so when you query for one block it will get everything inside the block.

So, now when we query — Select ename from emp where salary = 100000;

It will do an IO to the first block, does the above query satisfies the condition? yes then it will pull the first block and store in the memory and need no more IO to the disk. Now to select ename or whatever you need from that block will be relatively cheap. Now if we execute Select * from emp where empid = 1; In this case you need no IO as this will pick it from the memory as this block already stored in the memory.

Now, If you execute an aggregate function like — Select sum(salary) from emp; Assuming there are no indexes on the column, this will go on reading all the blocks — B1,B2,B3 and so on, and sum them. This will require an IO for all the blocks. Imagine a table with millions of rows and the amount of IO for reading all the blocks. In this case you will end up reading entire table and only use one column. Yes, there are internal multi-threading, parallel processing and all sort of tricks inside the database which helps optimizing the query result.

Now lets understand Column Oriented Database, the same table will be stored as below —

In this case, rowid is duplicated in all the rows as the reference for each column. Again B1 which is shown to be stored in single block, this can also be split in multiple blocks and depends on block size.

Now, Select ename from emp where salary = 100000; Now the database will know which block to read i.e B4, but we also need ename which is not part of this block but we know the rowid. Now it will directly jump to block B2 and match the rowid 1001 which is Katie. So in this case it has to do 2 IO reads.

If we execute, Select * from emp where empid = 1; so it will pull block B1 and gets the rowid 1001 but in this case we need everything as we said select * then we have to do go on reading all the blocks where rowid = 1001 and read ename, e_last_name and everything. In this case we are reading every block and imagine the IO going in this. This is the worst ever query executed on column oriented database. So as per the best practice, never do select * in column oriented database and only select specific columns which are required.

Now, Select sum(salary) from emp; In this case it will directly jump to B4 and read the block as it will have all the salaries which are required as part of the sum(salary), done just one IO read.

To summarize, it all depends on the use case on which databases to be used as both Row and Column based has their advantages and disadvantages. Column based are mostly used in Datawarehousing or Datalakes where analytics are heavily used, whereas Row based are used in transactional storage.

Hope you find this explanation useful.

--

--

Nilesh Khandalkar

Passionate about Data and Cloud, working as Data Engineering Manager at Capgemini UK. GCP Professional Data Engineering Certified Airflow Fundamentals Certified