General

Hadoop Study For Dummies Part-1

Dashboards and reports are key elements in any application for its users to have a clear vision on the status of a system. It is very hard to satisfy the expectations for user experience in this era. Adding technical challenges like distributed computing, hadoop, big data makes it harder and more painful to continue for improvements.

Some time ago, we were asked how we can improve the reporting/dashboard experience on our applications where some difficulties were faced about calculating and presenting data in an effective manner.

As the first step to solve the problem, we identified on which fields we have room to fill. Our main problem was performance related. Once we settled the performance problem, it will affect overall experience from user perspective and we can focus on UX part later. For this study, we formed a team with volunteers who were looking for new opportunities and wanted to invest their time in exploring new horizons. The main intention of the group was to provide a system with a well-designed UI for customers where they can query data sets with efficient results.

For performance, an upper manager from our company mentioned that Hadoop might be a good solution.

This is the story of how we dived into the world of Hadoop and the challenges we faced.

The first step: Exploring Hadoop

Our first step was to understand Hadoop, one of the most popular environments for distributed data processing.

The team consisted of people that had extensive experience on relational databases and some experience with Neo4J and Mongo DB. So we thought that Hadoop is yet another database with a similar structure where we would add some data and run a query to get results.

After spending some days on reading documents and watching videos, it dawned on us that Hadoop is NOT a database but something completely new. I would rather think of it as an environment. It is a distributed data processing ecosystem, which requires a new approach to understand when you are an old-style developer. To get better understanding of Hadoop, we need to take look at the components one by one. Our first Hadoop component and the second topic is HDFS.

The second step: HDFS

HDFS, basically, is a file system. It stands for Hadoop Distributed File System. This is the main component and hadoop executes all applications on top of this system. All the examples that we investigated showed us that we need to copy a file into HDFS. As soon as a file is copied/moved to the system, HDFS starts to manage the file and it is being distributed across nodes with certain chunks.These chunks are set as 128 mb by default, which means that if you a have 1280 mb file, it is going to be chunked by 128 mb parts each and being distributed to as many number of computers that you use.

Everything seemed clear enough to us. Since we have a distributed environment, system handles all the required actions through chunking and distributing them to servers. Next move, we want to process the file, but we have some question marks on our minds. Do we have to work with files? We have a relational database, the expectation is that we will run some sql-like queries. Anyway, we are exploring new horizons and we can answer this question later.

The third step: Map&Reduce

Map reduce is a programming model for processing and generating big data with distributed sources. If you would like to check it out in more detail, you can take a look here. Hadoop mainly uses this approach to handle data. Map&Reduce applications for Hadoop can be developed using scala (preferred one), python or java. For sure you can also use C, but we do not recommend it. Since we are developing pure java/scala code, we can process any kind of data included in CSV, XML, Json files as well as pure text files. It surely is very flexible, but we thought that we did not want to continue with the idea of developing a java application for each report. After all, we still are using a relational database.

We tried to develop a couple of basic applications like frequency analysis using map&reduce. It sounds nice to the ear but not nice enough to be used on this kind of an application

My expectation was very simple; open an editor or command line tool, write a query, get the result. What was the best option for us? While we were walking around in Hadoop realm, we met Hive.

The fourth step: Hive

In Hadoop ecosystem, Hive is highly suggested if you have relational data and would like to develop analytical queries. So we decided to take a look at Hive.

As the first step, we created a Hive table on Hadoop, recorded some data, ran a query on Hive. It was so easy and the results were perfect! 

Our queries were standard style SQL queries. This meant that we would not even need to change our existing report structure completely. There were only little dialect differences between oracle and SQL standards, which is not a big problem. Hive is running on map&reduce as its base model so you don’t need to write codes in any programming language.

Mainly what hive does is to cross-compile the SQL code into java code at runtime. You can imagine how easy it is.

The only issue we faced is the amount of time it took when we tried to execute one single basic SQL query over only a few records (15 records processed in 55 seconds). That’s crazy and we could not accept such a result. We are discussing the result within the team and the only valid explanation is that we are using a standalone mode. The reason is that Hive has a standard initialization process, that takes about 30-45 sec, then your query is being processed. You might not consider this specific time frame as important when looking at the entire picture. For me it is important.

After all, we successfully inserted some data into our hive table and we are able to execute our SQL query.

Our next target is to insert data into Hadoop that resembles production data.

To be continued.

Leave a Reply

Your email address will not be published. Required fields are marked *

Back To Top