After long research activities and trials, we started to draw a picture about our big-data-architectural approach. All team members started to think in an aligned way. We needed to have a database for reporting and we needed to sync the database with a certain time frequency. Our understanding; hadoop is not a database. Hive is not a replacement for any relational database. Finally, Apache Spark is not the answer we were dreaming of (at least, it was not we expected). The question was simple, but the way up to the solution was not.
Wheel reinvented: Lambda Architecture
Lambda architecture is basically an architectural approach for big data processing problems. You can find its general view in the following illustration. Also you can find more information on http://lambda-architecture.net.
We decided not to store our data within Apache Hadoop ecosystem. The final decision was to copy and sync our data with another database. Mongo DB perfectly suited our expectations as it was easy to implement and develop. In lambda architectural terms, it is called the speed layer. The main intention is that the application will generate reports using a different database than the production environment to avoid unnecessary load. These reports do not need highly intensive processing. We will handle syncing the data between data source and mongo db using Apache Kafka with the help of Confluent.
When we need to implement complex and intensive reports, Apache Hadoop/Spark comes onto the stage. It is the layer which will handle the processing part. In lambda architectural terms, this is the batch layer. There are some different approaches here to fetch the data from the source. Here are some examples:
- Export full data from the source and load it to hdfs (very common).
- Connect to the data source from Apache Spark and process it.
- Fetch data from the speed layer using Apache Spark since it has good support for mongo db.
- Develop a message queue or streaming layer which pushes the data to batch layer at the same time of collecting the data.
Finally, generated data from batch layer to be pushed into mongo db again as serving layer. Users will be able to query the processed data from new collections.
After long hours of discussion and research, we came to a conclusion and decided on this structure. Then we invited a colleague from another team, who said that there is an existing presentation on this. It describes an approach to the situation. He told us he could share the document and he eventually did. When we saw the structure, we all started laughing and said we are a good team.
Thank you to each and everyone of you for everything.
And Sancar for being the man within the shadows as editor.