[From Medium] Distributed Systems II: Leaderless and P2P Systems

@gov.glados.computer

By Govind Mohan and Sajeed Syed Bakht

Note from Gov: this is a 2 part article I wrote with a friend on Medium in 2020

In our previous article on distributed systems, we explored the leader-follower paradigm where write requests are handled by a leader and database state updates are subsequently propagated to followers, which handle reads. We saw a drawback of these systems in that they are more suited to high read throughput and lower write throughput. In this article, we will check out systems that don’t have a leader. Having a leader is typically useful if consistency is a key system requirement. However, it may be more important for some systems to be highly available, which means any node returns a non-error response to a query from our definition in the previous article. In such cases, a leaderless system is better suited.

Leaderless systems, as the name suggests, do not have leaders. All nodes (often referred to as peers) are identical in function. As a result, they do not have failover mechanisms for leader loss as there is no need to replace leaders. Instead, measures need to be put in place to ensure the distributed database state achieves eventual consistency. Thus, these systems typically have a background process to synchronize state between nodes. This process of information exchange between peers is known as gossip. Intuitively, the idea of gossip is a very good fit for describing communications in leaderless systems. Gossip in the human context could be the act of spreading rumors, and as the saying goes, “anyone can start a rumor, but none can stop one”. While this is dubious in the human context, it’s a very desirable property for leaderless systems as there are no guarantees that a node will remain live for an extended period of time. Further, all the peers run the same program but none of them are singularly responsible for orchestrating how reads/writes are propagated across the system. Hence, they need to gossip the information (writes) they receive with their neighbors. We will explore two types of leaderless systems here: Dynamo-based systems, and P2P systems.

Dynamo-based Systems

Amazon’s internal database system is called Dynamo, which follows the leaderless paradigm and has inspired various similar systems. Interestingly enough, Amazon Web Services offers a similarly named DynamoDB as a service, which is actually a single leader system. Our focus here is on the former. A key feature in Dynamo-based leaderless systems is that read and write requests to the system are sent in parallel to multiple peers. This way, if some peers are offline others can receive the request. Offline peers can be updated when they come back online. However, if an offline peer misses a write and comes back online then it may not get a chance to synchronize new writes with its neighbors. Thus, if a peer receives a read at this state it will provide a stale value. Peers can track stale values by maintaining a version number along with each record, which they return in read responses and update with modifications to records.

Synchronizing Mechanisms

If queries are spread across the network in parallel, there needs to be a synchronizing mechanism to ensure the system attains eventual consistency. Dynamo-style leaderless systems have two such mechanisms.

  1. Read repair

Since a read request gets versions from different peers in the network, it can find the most recent version and all the versions that differ from it. Thus, the client can find the latest version among all responses to its query, and send it to all peers that sent an outdated version. These peers will subsequently update their stale values.

  1. Anti-entropy

Dynamo inspired systems can also have a background process running in each peer that looks for discrepancies in its datastore by frequently querying other peers and looking for outdated values.

Quorum

We can notice that eventual consistency is possible using synchronization mechanisms, however the system needs to ensure that a write successfully occurs in enough peers so that it is accessible to future reads. This relationship can be described by the equation w + r > n, where w is the number of writes, r is the number of reads and n is the number of replicas. This means that the number of reads and writes have to be greater than the number of replicas as reads are guaranteed to reach peers where the write was successfully registered. This is known as quorum, as writes and reads must be able to agree on whether a write is confirmed in the network.

P2P (Distributed Hash Table)

Peer-to-peer, or P2P systems are leaderless networks which have a content addressing system. In other words, peers do not maintain full copies of the database, rather, records are distributed and replicated individually as key-value pairs with each peer containing a Distributed Hash Table (DHT) to find content in the network. As a result of records being distributed across the network, it does not require a quorum mechanism like Dynamo networks to ensure records are available. The DHT on a peer stores locations for records so they can be found across the network. This way, if a peer gets a read, it refers to its DHT to fetch the data from the correct peer. It is thus important that writes to the network are propagated far enough that any peer in the network is able to reach it, and also that all the network peers can still respond to reads if some peers have faults. There are various DHT protocols, such as the following, that accomplish this. They all share the approach of defining a virtual structure (like a circle or a tree) as an overlay network on top of the original network in order to make reads/writes efficient.

Chord

In this protocol, all peers are assigned an address that is of size m bits. The addresses are arranged from 0 to 2ᵐ as points in a circle like the diagram below. Records that are stored in the network are also assigned a key that is m bits long. A hashing algorithm such as SHA1 is used to generate these IDs as they need to be collision resistant (the same key or address shouldn’t point to two different values or peers) for the system to work. A record’s key is numerically close (greater than/less than/equal) to a peer’s address, since they both occupy the same number of bits. In fact, write queries work by generating an m-bit key for the write contents and storing the key-value pair in the peer with the ID closest to that key. “Closest” can be defined to mean greater than or less than by convention. Peers are randomly assigned a peer ID and thus the Chord circle can have gaps in it. As a result, a write with a key k is written to the peer with ID defined by successor(k) which returns the ID of an active peer with the closest address to k.

image

This addressing scheme is used in the DHT for finding content. A basic read occurs when a peer receives a query for a key, it passes the query on to the next peer (in the address circle) unless it contains the key in its datastore. However, this takes O(n) time which is quite costly in large networks. There is a faster solution that involves each peer maintaining a “Finger Table”. This table contains up to m ordered entries. To see this in action, if the node with ID N8 receives a query for key with ID K53, it will refer to its finger table for the node with the closest ID to 53 that it is aware of. From the diagram, we can see that this is N42. Thus, N8 will forward the request to N42. This node will see that N51 is the closest ID it is aware of less than 53. Finally, N51 will forward the request to N56 which will return the value for K53 (as there is no node closer to 53 counting upwards than N56). More generally, the finger table entry at row i will contain the peer with the address closest to the next power of 2 offset by the distance between the current peer and the peer with address 0. This can be represented in fancy notation (with n being the current peer’s address) as:

image

This modification makes reads happen in O(log(n)) as the query is passed to the finger table entry for the closest successor to the key specified in the query, which by design is spaced out by 2ⁿ.

Kademlia

Kademlia operates on very similar ideas to Chord with a few key differences. For starters, both peers and records in a Kademlia network maintain the property of being addressed in the same m-bit space. The key difference is that it uses a binary tree as an overlay network rather than the circle topology of Chord.

image

Visualizing this isn’t easy so let’s refer to the above diagram. Nodes and keys are addressed by a binary trie, which is a form of addressing using prefixes of a binary sequence. This means any path from the root node to a leaf will represent a unique address attained by concatenating the edge values. At any node the right forward edge will have the value 0 and the left forward edge will have the value 1. Let’s focus on the leaf node on the right side with all the curved arrows coming out of it. The address of that node will be 0011, which is the path from the root node to it — root, right, right, left, left. This is a pretty elaborate scheme so it would make sense that there is some benefit to it. In fact, this addressing scheme has the massive benefit of being able to determine the distances between addresses (either a peer or a record) using the XOR operation.

image

Two questions arise at this point: where did XOR come from? Why is it even relevant? It turns out that XOR captures the notion of distance implicit in this binary tree structure. As an example, notice that 1001 XOR 1101 is 0100. Let’s tie this back to our tree structure; in a fully populated tree of m-bit IDs, the distance between two IDs is the height of the smallest subtree containing both of them. Referring back to our binary example, 0100 shows that the height of the smallest subtree with both is 4. As to why the XOR metric is important, it maintains the notion of distance very well. In Chord, for example, a node at the first quadrant will think of a node in the second quadrant as ‘close’ since it needs few successive hops. However the node in the second quadrant will deem the node in the first quadrant as ‘far’ since it has to hop all the way across the third and fourth quadrants to get to the first quadrant. XOR thus works much better as a distance metric given that it works both ways (i.e. a XOR b = b XOR a).

Now let’s look at how queries work in Kademlia. Each peer maintains a list of contact information for other nodes, typically IP address, UDP port, Node ID (as Kademlia messages are passed using UDP). Further, this list is divided into several sublists known as k-buckets. The i-th k-bucket stores k nodes with IDs between 2^i and 2^{i+1} from itself. This can be visualized as follows:

image

As you might notice from this structure, k is fixed while the number of possible nodes between 2^i and 2^{i+1} is strictly increasing. Thus the k-bucket structure ensures that a peer knows a lot about its neighbors and less about nodes that are far away. Now let’s tie all this together to see how a query works. A peer receiving a query for a node with ID m will refer to the k-bucket it has that contains IDs closest to m by XORing m with its own ID. As mentioned earlier, this will provide the height of the smallest subtree that contains both IDs. This can be used to determine which bucket m belongs in. Since it only knows at most k peers in this bucket, it will forward the request to some of those nodes (the exact number is set as a system-wide concurrency parameter). This is done simultaneously, as some of those nodes might have failed in the time since their last communication with the node sending the message. Any of those nodes that is aware of the node that contains the value for the key m will forward the query to the corresponding node and send an acknowledgment to the original querier. Since each k-bucket essentially stores contacts from various subtrees as in the above diagram, the query will converge logarithmically to the correct node. In other words, a Kademlia network with 10,000,000 nodes would only require at most 20 hops for any node to match a message ID to its value!

There are several very interesting and complex ideas in the world of leaderless systems. Peer-to-peer networks specifically have several extremely valuable properties because they offer the possibility of fully trustless networks. In future articles, we will go through what trustless networks actually entail by examining concepts such as consensus, byzantine fault tolerance and cybersecurity in p2p systems. They are a waning presence on the internet as we are moving to a more cloud centric approach, however there are still some pioneers in the p2p world such as IPFS and Virgil Systems, who are redefining the internet from the ground up in a trustless, decentralized manner for data/content management.

Citations

  • Petar Maymounkov and David Mazières. 2002. Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In Revised Papers from the First International Workshop on Peer-to-Peer Systems (IPTPS ‘01). Springer-Verlag, Berlin, Heidelberg, 53–65.
  • Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. 2001. Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM Comput. Commun. Rev. 31, 4 (October 2001), 149–160. DOI:https://doi.org/10.1145/964723.383071
  • Kleppmann, M. (2019). Designing data-intensive applications the big ideas behind reliable, scalable, and maintainable systems. Beijing: O’Reilly.
gov.glados.computer
Gov

@gov.glados.computer

p2p enthusiast

still punk af

Post reaction in Bluesky

*To be shown as a reaction, include article link in the post or add link card

Reactions from everyone (0)