Introduction to Apache Kafka
Hi, I am Malathi Boggavarapu, working at Volvo Group, and I live in Gothenburg, Sweden. I have been working with Java for several years and have broad experience and knowledge across various technologies.
This post introduces you to Apache Kafka, a large-scale streaming platform. So let's get started and get into the details of it.
Who invented Kafka
Kafka was created at LinkedIn. They started the project, which is now a top-level Apache project: an open source project with a large community behind it.
Why use Kafka
ETL is dead. Long live streams. To introduce this topic, let's look at the typical data pipeline used in data warehousing, known as Extract, Transform and Load: the ETL process. So what exactly is the ETL process? Let's take a look at it now.
The process starts with Data Providers: external events, data coming from business systems, and internal applications. The data from these providers is loaded into a staging environment every night, or more frequently. The staging environment could be a database like SQL Server or Oracle, or in more modern systems something like HDFS (the Hadoop Distributed File System). This is your back-office location where the data can be prepared and shipped over to your data warehouse, where your analysts and data scientists will use it. Other applications can also leverage the data for things like marketing efforts, business monitoring, decision making and more. See the picture below.
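To make the three stages concrete, here is a minimal, hypothetical sketch of one nightly batch run in Python. The data providers and the "warehouse" are plain in-memory structures invented for illustration; a real pipeline would read from systems like Oracle, SQL Server or HDFS.

```python
# Toy nightly ETL run: extract raw records, transform (clean) them,
# then load them into a warehouse structure.

def extract(providers):
    """Pull raw records from every data provider (here: lists of dicts)."""
    records = []
    for provider in providers:
        records.extend(provider)
    return records

def transform(records):
    """Normalise names and amounts; drop records with missing amounts."""
    return [
        {"customer": r["customer"].strip().lower(), "amount": float(r["amount"])}
        for r in records
        if r.get("amount") is not None
    ]

def load(warehouse, records):
    """Append the prepared records to the warehouse 'orders' table."""
    warehouse.setdefault("orders", []).extend(records)
    return warehouse

# One batch over two toy providers; the record with no amount is dropped.
providers = [
    [{"customer": " Alice ", "amount": "10.5"}],
    [{"customer": "Bob", "amount": "3"}, {"customer": "Eve", "amount": None}],
]
warehouse = load({}, transform(extract(providers)))
print(warehouse["orders"])
```

The key point is that everything happens in one scheduled pass: nothing downstream sees the data until the whole batch has been loaded, which is exactly the latency that streaming removes.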
So how is this ETL process replaced with Kafka? Well, Kafka uses streaming to do that.
Kafka uses streaming instead of an ETL batch process. The way streaming works is that all the data from the Data Providers, or what we call Producers in Kafka, comes into the streaming platform. Inside the streaming platform, operations are performed on the data as it is being written and ingested. A simple operation could be "total number of orders" or "total number of likes on a Facebook post". All of this happens in real time, or with minimal latency. The output of the streaming operations is sent to other applications, which in Kafka terms are called Consumers. These are the things listening for events that have occurred. So in the Kafka world, we can think of the Data Providers as Producers, the things that write data to the cluster. When they send data in, we have Consumers on the other side which consume the data. One interesting thing about Kafka is that Consumers can also write data to another part of Kafka, so a Consumer can also become a Producer. See the picture below, which depicts the Producers, the Consumers and the Kafka cluster.
Some of the ways we can pull data in are to write the data directly to the Kafka cluster, or to connect to existing apps using a connector. This is nice because in a corporate environment there are most likely software and systems that run the business, have been in place for a long time, and can't easily be updated to work with Kafka. Connectors are used in such cases: they pull data out of the data source whenever there is a change. Relational databases are a good example. Whenever a customer table is updated by a legacy application, the connector tracks the update event, grabs the updated data and pushes it to the Kafka cluster.
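Conceptually, a source connector repeats one small step: poll the table for rows that changed since the last captured position, and push only those rows onward. The sketch below is a hypothetical, in-memory version of that loop, using a per-row version number as the change marker; real connectors (for example a JDBC source in Kafka Connect) do this against actual databases and clusters.

```python
# Change-capture sketch: find rows newer than the last seen position
# and forward them, remembering the new high-water mark for next time.

def poll_changes(table, last_version):
    """Return rows updated after last_version and the new high-water mark."""
    changed = [row for row in table if row["version"] > last_version]
    new_mark = max((row["version"] for row in table), default=last_version)
    return changed, new_mark

customer_table = [
    {"id": 1, "name": "Alice", "version": 3},
    {"id": 2, "name": "Bob", "version": 5},
]
cluster = []  # stands in for the Kafka cluster receiving the records

changes, mark = poll_changes(customer_table, last_version=3)
cluster.extend(changes)  # only Bob's updated row is pushed
print(cluster, mark)
```

Because only rows past the last mark are forwarded, the legacy application never needs to know Kafka exists; the connector does the tracking on its behalf.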
Who is using Kafka and how?
Kafka is used by companies that process large volumes of data. One big company using Kafka today is Walmart. Walmart, the biggest retailer in the United States and possibly in the world, has billions of transactions every single day. All of these transactions need to be streamed into a data platform; some of them need to be processed and handled immediately, while others will be used for analytics later on. Walmart has lots of data to process in real time, and Kafka suits those needs.
Cisco is another large company, doing everything from cameras to networking gear to software, and they also have lots of transactions happening across all the systems that run their business. It is one of the biggest companies where Kafka is being used.
Netflix uses Kafka pretty heavily. Every time you watch a movie on Netflix, or watch only up to a certain point in a movie, all of these events are recorded through Kafka and used to improve the platform. That enables better recommendations and a better user experience, such as remembering where you left off in a movie.
PayPal processes billions of transactions every month, and Kafka ensures that those transactions are consistent, meaning they are accurate and timely when processed, which helps prevent fraud. This is very big for the financial industry.
Spotify is very similar to Netflix: Kafka is used to provide a better user experience.
Uber is an interesting use case. Their product is very real-time, connecting riders and drivers, and Kafka plays an important role for Uber as well.
So that's all for the introduction to Kafka. Please stay tuned to the blog to learn more about Kafka's architecture and its implementation. See you in the upcoming posts on Kafka.