How Parallel Database and Query Optimizer work together!
Today Database systems are an inseparable part of every business in any industry. In the space of 20 years, the field of database has become so powerful that it is funded roughly 10USD billion just for research purposes. Database systems started in 1960s with file systems. From then, they started to improve gradually. Navigational databases, Relational database systems and NoSQL databases are some of the systems that have been developed during this period. Improvement of technology, higher speed of internet and a shift towards online systems changed the database systems as well. Today, database systems must deal with massive amount of data every day. Humans create 2.5 exabytes of data daily, to give you a clearer picture of how big that is, you could imagine watching 90 years of HD videos or having 250 thousand congress libraries. It was around 2004, when the term “Big Data” came around and it was an implication of the massive production of data by humans. Now we have to come up with a database system that could store this big data on a daily basis! Database researchers started to come up with different ideas of how to solve this issue, until they came up with a feasible solution in 2004 and that was parallel databases (MPPs). The idea behind parallel computing is, instead of having one large data center, we could have a set of small data centers that run in parallel. Therefore, we could process more data with lower cost. Imagine a scenario where you have to transfer a group people from destination x to y with a certain budget. One way is to take them with an expensive car, but that will take longer as a certain number of people could fit in a car. The other option is to rent 3 cheap cars, it will do the same job, but much faster and easier. Parallel computing goal is to achieve the same thing, lining up a large number of low-end servers instead of deploying a smaller set of high-end servers. Here is a picture of the architecture of Parallel Database systems. Query Optimizer Query optimizer is a function that controls everything in parallel database systems. Many people believe, complex enterprise query optimizer engines kept as one of the most valuable corporate secrets. From joining tables over different nodes, using techniques such as indexing or cluster-indexing and translating the SQL commands over the nodes are performed by the Query Optimizer. You could think of Query Optimizer as having a new Tesla Model X car, where you could set the driving on auto-mode and let the automation takes care of driving, watching out the speed limits or give way to pedestrians! Yes, life is much easier with Query Optimizer and a Tesla car! Query Optimizer works by finding the best possible solution to execute a query by considering query plans. A query plan in database is called an ordered number of steps that enables us to access data in SQL relational language. Let’s go through a scenario together to see exactly how it works. Let’s say you own an international paper company and you want to see the total number of papers that you sold in the last quarter based on each city that you have a branch. You would write a SQL query and run it on the MPP. Then Query Optimizer come up with a number of query plans and chooses the best in regards to various requirements such as if indexing could be used, or if we need to join tables, etc. In this case, Query Optimizer finds that one of the plans is to use the date indexing table. Indexing in databases keeps positions of records in a separate table based on a given attribute or statement. For example, in here, date indexing table is sorted the records based on their date. Therefore, you do not need to read the whole table and you just pick up the records of papers that are sold in the last quarter. Indexing could help you immensely in having faster processing times. Now that we have all the records, query optimizer will divide the aggregation task among the nodes that are available in MPP and run the query. After that, the results of these nodes would be merged and you could see the total number of papers that were sold in the last quarter based on each city. Conclusion In conclusion, we could look at Query Optimizers as the heart of parallel databases. They are the one who are in control of everything and constantly check to see if everything runs smoothly. With emersion of other new systems such as NoSQL databases and Hadoop MapReduce, functions such as query optimizers and the technique it uses to manage complex queries are still make parallel database systems compatible with these new systems. Going forward, if researchers could improve other parts of MPPs such as fault tolerance and data loading, relational database systems could still be the leader in the database community as they have been in the last 30 years.