Yahoo developed Pulsar, a pub-sub messaging system, and made it open source. It is now an Apache incubating project. Since Kafka is used for the same purpose, I would like to know the major plus and minus points of Kafka compared with Pulsar.
What are the advantages and disadvantages of Kafka over Apache Pulsar?
Asked by Ajit Dongre. There are 4 answers.
Pulsar, the Apache Software Foundation’s newest project to attain top-level status, is drawing a lot of comparison to Kafka, another ASF project.
Pulsar is a highly scalable, low-latency messaging platform running on commodity hardware. It provides simple pub-sub and queue semantics over topics, a lightweight compute framework, automatic cursor management for subscribers, and cross-data center replication.
Meanwhile, the 2018 Apache Kafka Report, which surveyed more than 600 users, found data pipelines and messaging the top two uses for the technology. It found growing use with the rise of microservices architectures.
“There is a big overlap in the use cases for the two systems, but the original designs were very different,” said Matteo Merli, one of Pulsar's creators, who have since formed Streamlio, a startup offering a fast-data platform.
Yahoo created Pulsar as a single multi-tenant system to solve the problems it had with multiple messaging systems and multiple teams deploying them.
It was released as open source in 2016 and entered the ASF incubator in June 2017. For around four years, it has been used in the Yahoo applications Mail, Finance, Sports, Gemini Ads and Sherpa, Yahoo's distributed key-value service.
In a blog post, Streamlio co-founder Sijie Guo summed up Pulsar vs. Kafka this way:
“Apache Pulsar combines high-performance streaming (which Apache Kafka pursues) and flexible traditional queuing (which RabbitMQ pursues) into a unified messaging model and API. Pulsar gives you one system for both streaming and queuing, with the same high performance, using a unified API.”
Said Merli: “There are differences between streaming and queuing; there are a lot of use cases where you need one or the other, but most people need both for different use cases.”
Two-Layer Architecture
A two-layer design is key to Pulsar, Merli said. There’s a stateless layer of brokers that receive and deliver messages, and a stateful persistence layer, with a set of Apache BookKeeper storage nodes called bookies that provide low-latency durable storage.
Pulsar was built on the idea of having strong data guarantees, Merli said. It was designed for shared consumption, while Kafka was not. And Pulsar enables users to configure a retention period for messages even after all subscriptions consume them.
Its layered architecture and segment-centric storage provide key advantages:
- You can scale the brokers or the storage layer independently.
- Since the brokers are stateless, a topic can be quickly moved to other brokers. That opens up an efficient way to balance traffic across brokers.
- You can have multiple consumers on the same partition, and you can add as many as you want.
- Since no data is stored locally on the brokers, there is no need to copy partition data when expanding capacity, and no rebalancing is required.
- When a partitioned topic is created, Pulsar automatically partitions the data in a way that is transparent to consumers and producers.
The broker sends message data to multiple BookKeeper nodes, which write the data into a write-ahead log and also keep a copy in memory. Before a node sends out an acknowledgment, the log is force-written to stable storage, which ensures the data is retained even if you lose power. Topic partitions can scale up to the total capacity of the whole BookKeeper cluster, and you can scale up a cluster by simply adding nodes.
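The write path described above can be pictured with a toy model. This is only a sketch in plain Python, not BookKeeper code: the `Bookie` class, the `broker_write` function, and the rule that the broker waits for every bookie to acknowledge are all assumptions made for illustration.

```python
class Bookie:
    """Toy storage node: a write-ahead log plus an in-memory copy (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.journal = []   # entries appended but not yet forced to disk
        self.durable = []   # entries that passed the simulated force-write
        self.memory = {}    # in-memory copy of recent entries

    def add_entry(self, entry_id, payload):
        self.journal.append((entry_id, payload))  # 1. append to the write-ahead log
        self.memory[entry_id] = payload           # 2. keep a copy in memory
        self._force_write()                       # 3. force the log to stable storage
        return True                               # 4. only now acknowledge

    def _force_write(self):
        # Stand-in for fsync: pending journal entries become durable.
        self.durable.extend(self.journal)
        self.journal.clear()

    def crash(self):
        # Simulated power loss: memory and unflushed journal are lost.
        self.memory.clear()
        self.journal.clear()


def broker_write(bookies, entry_id, payload):
    """Send one entry to every bookie; succeed only if all acknowledge (assumption)."""
    acks = [bookie.add_entry(entry_id, payload) for bookie in bookies]
    return all(acks)


bookies = [Bookie("bookie-1"), Bookie("bookie-2"), Bookie("bookie-3")]
acked = broker_write(bookies, 0, b"hello")
bookies[0].crash()
survived = (0, b"hello") in bookies[0].durable
```

Because the acknowledgment is returned only after the simulated force-write, an acknowledged entry is still present in `durable` after `crash()`, which is the guarantee the paragraph describes.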
Since entering the incubator, a key focus has been on making it easier to get started with Pulsar.
Version 2.0 of Pulsar was released in June, including a “stream-native” processing capability called Pulsar Functions, which enables users to write processing functions in Java or Python for data as it moves through the pipeline. Version 2.2, which will feature interactive SQL querying, will be released soon.
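To give a feel for what a processing function looks like, here is a minimal sketch in Python. It assumes the language-native form, in which a function is a plain Python function named `process` that takes one message and returns the result; the transformation itself (appending an exclamation mark) is a made-up example, and packaging and deployment are not shown.

```python
def process(input):
    """Hypothetical function body: add an exclamation mark to each message."""
    return "{}!".format(input)
```

Called directly, `process("hello")` returns `"hello!"`; inside Pulsar the runtime would call it once per incoming message and publish the return value to the output topic.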
Pulsar provides multiple language and protocol bindings, including Java, C++, Python, and WebSockets, as well as a Kafka-compatible API.
Further reading: Apache Pulsar: Is it a KAFKA Killer? Written by Bhagwan s. Soni
Why should we choose Apache Pulsar over Kafka?
Apache Pulsar is an enterprise-grade pub-sub system, originally developed by Yahoo and now supported by the Apache Software Foundation. It has been running on production systems for more than 3 years and has proved its stability.
Apache Pulsar covers almost all the features that Kafka offers, perhaps under different names. Pulsar has many features, but I would like to highlight some that help differentiate it from Kafka:
1} Apache Pulsar gives you 3 types of subscription on a topic:
A} Exclusive: only one consumer consumes the data from the topic
B} Shared: multiple consumers consume the data from the topic
C} Failover: more than one consumer is attached, but at any given time only one consumes the data
2} Each namespace can have one or more topics
3} Strong support for multi-tenancy
4} Data replication across multiple clusters
5} Strong message durability guarantees against data loss
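The three subscription types in point 1 can be contrasted with a small simulation. This is a toy model, not the Pulsar client API: the `dispatch` function and the consumer names are hypothetical, and round-robin delivery for the shared case is an assumption made for illustration.

```python
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages]} for a toy subscription of the given mode."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    delivered = {consumer: [] for consumer in consumers}
    if mode == "exclusive":
        # Only a single consumer may attach to the subscription.
        if len(consumers) > 1:
            raise ValueError("exclusive subscription allows only one consumer")
        delivered[consumers[0]] = list(messages)
    elif mode == "shared":
        # Messages are spread across all consumers (round-robin assumed here).
        for message, consumer in zip(messages, cycle(consumers)):
            delivered[consumer].append(message)
    elif mode == "failover":
        # Several consumers attach, but only the first (active) one receives.
        delivered[consumers[0]] = list(messages)
    else:
        raise ValueError("unknown mode: " + mode)
    return delivered


msgs = ["m0", "m1", "m2", "m3"]
shared = dispatch(msgs, ["c1", "c2"], "shared")
failover = dispatch(msgs, ["c1", "c2"], "failover")
# If c1 disconnects, the standby c2 becomes the active consumer.
after_failure = dispatch(msgs, ["c2"], "failover")
```

In the shared case each consumer sees only part of the stream, so ordering across consumers is not preserved; in the failover case the standby consumer receives nothing until the active one goes away.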
We needed a streaming platform with persistent topics, reasonable latency and high throughput. We recently evaluated whether to go with Kafka or Pulsar and, unlike @nha, we are now in favour of Apache Kafka. Here are our findings:
Pulsar - Pros
- feature rich - persistent/non-persistent topics, multi-tenancy, ACLs, multi-DC replication, etc.
- more flexible client API - including CompletableFutures, fluent interfaces, etc.
- Java client components are thread safe - a consumer can acknowledge messages from different threads
Pulsar - Cons
- Java client has little to no Javadoc
- small community - 8 Stack Overflow questions currently
- messageId concept tied to BookKeeper - a consumer cannot easily position itself on the topic, unlike with a Kafka offset, which is a continuous sequence of numbers
- Reader cannot easily read the last message in the topic - it needs to skim through all the messages to the end
- no transactions
- higher operational complexity - ZooKeeper + broker nodes + BookKeeper - all clustered
- latency questionable - there is one extra remote call between the broker node and BookKeeper (compared to Kafka)
Kafka - Pros
- very rich and useful Javadoc
- Kafka Streams
- mature & broad community
- simpler to operate in production - fewer components - the broker node also provides storage
- transactions - atomic reads and writes within the topics
- offsets form a continuous sequence - a consumer can easily seek to the last message
Kafka - Cons
- a consumer cannot acknowledge a message from a different thread
- no multi-tenancy
- no robust multi-DC replication - (offered in Confluent Enterprise)
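The offset versus messageId point in the lists above can be illustrated with a toy model. This is a sketch under stated assumptions, not client code for either system: the function names and the ledger layout `{7: 3, 12: 2}` are made up, and a message ID is simplified to a `(ledger_id, entry_id)` pair.

```python
def kafka_last_offset(end_offset):
    """With contiguous offsets, the last record sits at end_offset - 1."""
    if end_offset <= 0:
        return None  # empty partition
    return end_offset - 1


def pulsar_style_ids(ledgers):
    """Flatten {ledger_id: entry_count} into ordered (ledger_id, entry_id) pairs."""
    return [(ledger_id, entry_id)
            for ledger_id in sorted(ledgers)
            for entry_id in range(ledgers[ledger_id])]


# Hypothetical topic backed by two ledgers whose IDs are not consecutive.
ledgers = {7: 3, 12: 2}
ids = pulsar_style_ids(ledgers)
last_id = ids[-1]
```

With offsets, the position of the last message is plain arithmetic. With ledger-based IDs, finding the last message in this model means knowing the ledger list and the entry count of the final ledger, which is why positioning is less direct.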
I played a bit with both lately, and here is what I gathered.
Neutral:
- I was going to give Kafka the win on community, documentation, etc., but I wasn't able to easily find answers to the questions I had on Kafka; some were old and confusing (targeting the legacy API). Pulsar's documentation is good enough, the developers are very responsive on Slack (hello @Matteo Merli :) ), and the underlying pieces (ZooKeeper, BookKeeper) have decent documentation as well, should you want to dive into the internals.
- Kafka aims for high throughput, Pulsar for low latency. Both provide settings to control it.
- Both are production-ready and battle-tested in several companies
Pro pulsar:
- From my experience, the API is easier to use. In Kafka, the broker is dumb and the consumers do the job of structuring communications as they see fit. This comes at the cost of the Kafka user having to understand how to make the pieces fit together. I guess the intended benefit is increased flexibility, but since Pulsar was able to replicate the Kafka Consumer API (and with fairly little code), I count that as a pro for Pulsar.
- You can do things that are not easily done (or are maybe impossible) in Kafka: multi-tenancy (security, isolation...), resource management (topic throttling, quotas), geo-replication
- It has some features that Kafka currently lacks, like seeking to a particular MessageId
- Pulsar scales to millions of topics, while Kafka is limited by the way it structures data in ZooKeeper
- Easier deployment. A standalone Pulsar will start its own local ZooKeeper, and I personally found the configuration easier to understand
- Written in Java, versus a mix of legacy Scala and Java code. I also found the codebase well organised and much easier to follow, in part because it relies on ZooKeeper and BookKeeper, which are external projects with their own documentation, community, developers, etc. (Please note that those are also Apache projects and also came from Yahoo, so they work well together.)
Pro Kafka:
- Kafka has things built on top like Kafka Streams (never used it so I can't say if there is an equivalent)
Apache Kafka is more mature (it has been around for longer) and has higher-level APIs (e.g. KStreams). Its maturity, however, restricts fluidity and flexibility, e.g. ~500 open PRs on GitHub.
Apache Pulsar has deeply studied the design decisions of Apache Kafka and has incorporated an improved design and a set of exciting capabilities. For example, namespacing topics and allowing ACLs or quotas to be applied at the namespace level seems like a profoundly good idea for better multi-tenancy support. Other exciting features of Pulsar are geo-replication and the unification of queuing and streaming.
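To make the namespacing idea concrete, here is a minimal sketch that splits a topic name of the form `persistent://tenant/namespace/topic` into its parts and looks up a policy by namespace. The parsing follows that naming scheme as I understand it; the `NAMESPACE_QUOTAS` table, the `max_publish_rate` key, and the tenant and namespace names are hypothetical and are not Pulsar configuration.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(name):
    """Split 'persistent://tenant/namespace/topic' into its four parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError("unsupported topic name: " + name)
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected tenant/namespace/topic in: " + name)
    return TopicName(domain, *parts)


# Hypothetical policy table keyed by (tenant, namespace), for illustration only.
NAMESPACE_QUOTAS = {("acme", "billing"): {"max_publish_rate": 1000}}


def quota_for(name):
    """Look up the quota that applies to a topic via its namespace."""
    parsed = parse_topic(name)
    return NAMESPACE_QUOTAS.get((parsed.tenant, parsed.namespace), {})


invoices = parse_topic("persistent://acme/billing/invoices")
```

Every topic under `acme/billing` picks up the same quota without any per-topic setup, which is the appeal of applying policies at the namespace level.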