Synchronization
In a DS there isn’t a single clock
This chapter explores distributed algorithms for various synchronization tasks such as:
- synchronizing physical clocks
- simulating time using logical clocks
- preserving event ordering
- achieving mutual exclusion
- conducting leader election
- collecting global state and detecting termination
- managing distributed transactions
- detecting distributed deadlocks
In all of this time is a fundamental critical concept. In ds ensuring that all machines perceive the same global time is a critical challenge. Computer clocks are actually timers and to achieve synchronization, several factors need to be considered:
- clock drift rate, which is a constant value determined by the timer. For most quartz crystals, this drift rate is around 1 second per day, meaning that they can drift by approximately 11.6 seconds every 11.6 days.
- Clock skew refers to the difference in drift rates between two clocks. If two clocks are drifting in opposite directions, they will accumulate a skew equal to twice the product of the drift rate and the elapsed time.
To maintain synchronization, a resynchronization process is needed. There are two main approaches to achieving synchronization:
- The first approach is to synchronize all clocks against a single clock, typically one that has external and accurate time information. This ensures accuracy for all clocks, as they are aligned with the reference clock.
- The second approach is to synchronize all clocks among themselves, ensuring that they all agree on the same time. At the very least, time monotonicity needs to be preserved, meaning that time should always move forward and not jump backwards or stall.
Before dive in synchronization algorithms one important issue to note is that if a client recognizes that its own time is ahead of the correct time, it should never switch its clock back in time: it should be obvious .. switching the clock back can cause errors in running applications. Instead, the client should delay its clock until it reaches the synchronization point. For example, if the clock should be 11:59:59 but is at 12 o’clock, the client can delay its clock by going half the speed for two seconds until it reaches perfect synchronization.
Synchronization algorithms
GPS
The GPS algorithm is highly efficient in providing device positioning and clock synchronization. It operates through triangulation from a set of satellites whose positions are known. By measuring signal delay, the distance can be determined accurately. However, a challenge arises from the necessity of synchronizing the clocks between the satellites and the receiver, due to the inevitable clock skew. This limitation hinders the effectiveness of GPS in indoor environments, where GPS signals cannot be received reliably. Although GPS is a viable option, it may not perform optimally under normal circumstances.
In order for GPS to work, the clocks of the satellite and the station need to be perfectly synchronized. The satellites emit signals that are perfectly synchronized among themselves because they have atomic clocks on board. By measuring the flight time of the signal, the receiver can determine its position. To achieve this, at least four satellites are needed to provide four equations to solve for four variables (x, y, z, and time).
The precision of GPS is around 10 meters in distance, which requires highly precise clock synchronization. The main source of error comes from the time it takes for the signal to go from the GPS to the receiver’s computer.
However, it is possible to synchronize the clocks of multiple computers with a precision in the order of nanoseconds using GPS, although this may not be feasible in certain cases where computers are located indoors.
Simple algorithms: Cristian’s (1989)
How do you synchronize if you don’t have a GPS on board of every station? One of the simplest algorithm is the Christian’s algorithm.
In the time synchronization process, clients periodically send requests to the time server. However, there are certain problems:
- the time might run at a different speed on the client machine. To avoid this, a gradual change is introduced.
- the non-zero time it takes for the message to travel to the server and back. To account for this, the round-trip time is measured and adjusted. The adjusted time, denoted as , is calculated as the sum of the current clock time and half of the round-trip time.
Multiple measurements of the round-trip time are taken and then averaged to improve accuracy. However, even with these adjustments, there is still some error in the obtained time due to the delay between the request and response. Indeed, the assumption here is that the network is symmetric in terms of latency: meaning the time of the request is nearly the same as the response time. However, this assumption is an oversimplification that works if the flight time is very short compared to the desired precision.
Berkeley (1989)
The second approach in Berkeley Unix differs from Christian’s algorithm in that the time server is active instead of passive: it collects the time from all clients, averages it, and then retransmits the required adjustment.
This approach synchronizes the machines with each other instead of against a single machine, which is reasonable when there is no assumption that one machine is more correct than the others.
Network Time Protocol (NTP)
This protocol was designed for UTC sync over large-scale networks and it’s actually what is used today for large-scale networks like the internet and uses servers organized in a hierarchy. At the top of the hierarchy are machines with atomic clocks that synchronize other machines, down to the client machines. The synchronization method depends on the network:
- Multicast (over LAN): on LAN broadcast communication is typically used, where the NTP server periodically broadcasts the current time and the receiving machine synchronizes based on that time.
- Procedural-call mode: similar to Christian algorithm
- Symmetric mode: for higher levels that need the highest accuracies
The transmission times of messages and gives this: where is an estimation of the time offset (between the two clocks), and represents the accuracy of this estimation.
Logical time
In some applications, it is not necessary to have accurate absolute time. Instead, what is important is the ordering and causality relationships of events.
Scalar clocks
Lamport invented a simple mechanism by which the happened before ordering can be captured numerically using integers to represent the clock value. Each process keeps a logical scalar clock :
- starts at zero
- is incremented before sends a message
- Each message sent by is timestamped with
- Upon receipt of a message, sets to: timestamp
The idea behind Lamport clocks is that they can serve as an approximation of the “happens before” relationship between events. In Lampard clocks, if one event happens-before another event , then the scalar clock of is less than the scalar clock of . Note that the converse is not always true. Just because the scalar clock of “a” is less than that of ” ” does not mean that . This because scalar clocks capture a partial ordering of events, not the full causal relationship.
Other versions of Lamport Clocks
- Lamport Clocks with Process IDs for Total Order Guarantee:
- Objective: Establish a total order of events across distributed systems.
- Mechanism: Each process maintains the logical clock + each process IDs: IDs are used like tie-breakers for events with identical timestamps, in this way “nothing happens concurrently”.
- Lamport Clocks for Total Ordering Multicast:
- Objective: Ensure total ordering of multicast property, which says “every process receive the message in the same order”.
- Mechanism: Ever
- A message is sent in multicast, with the logical timestamp of the sender
- When a process receives a message, it is put in a local queue, ordered by timestamp
- The receiver multicast an ACK to the other processes
- A message is delivered to the application only when it is at the highest in the queue and all its acks have been received.
Vector clocks
The problem of scalar clocks is that but the reverse does not necessarily hold, e.g., if .The solution are Vector clocks. Basically is the same a scalar clocks but all process has/sends a vector, in which for each cell there is a value associated with the process (so values for processes).
Rules:
- is the number of events that have occurred at , initially for all
- If then knows that events have occurred at
- attaches a timestamp in all messages it sends (incrementing “its value” just before sending the message
- When receives a message containing , it sets for all and then increments (basically it updates its own vector according to the received vector using function).
A vector clock defines a perfect isomorphism with respect to the happens before relationship. Position -th of the vector clock of each process represents the corresponding number of events that occur at process -th.
Vector clocks for causal delivery
We want to order questions and replies, messages in causal order, not in a total order.
Using vector clock, we can order events according to causality, not exactly causality. In order to do so, we need a variation of vector clock. Causal delivery: if two events are causally related, everybody must see the message in the same order. A slight variation of vector clocks can be used to implement causal delivery of messages in a totally distributed way.
We can use vector clocks:
- Variation: increment clock only when sending a message. On receive, just merge, not increment
- Hold a reply until the previous messages are received:
- for all
Mutual exclusion
Mutual exclusion is required to prevent interference between processes in a distributed system.
- Safety property: says that at most one process exits the critical section at a time.
- Liveness property: all requests to enter/exit the critical section eventually succeed (no deadlock, no starvation)
- Optional: if one request happened-before another, then entry is granted in that order
Centralized solution
The simplest solution is to have a server to coordinate access to a resource. This server emulates a centralized solution where it manages the lock using a token, which allows a process to access the resource. Requests and releases for resource access are obtained through messages to the coordinator.
This solution is easy to implement and ensures mutual exclusion and fairness. However, it has some drawbacks:
- the server can become a performance bottleneck
- the server is a single point of failure
Actually the best option is this: just replicate the central server to address its limitations.
Mutual exclusion with Lamport scalar clocks
To request access to a resource, process multicasts a resource request message , with timestamp , to all processes (including itself). A resource request is granted to a process when its request has been acknowledged by all other processes.
Upon receipt of , a process :
- If it does not hold the resource and it is not interested in holding the resource, sends an acknowledgment to
- If it holds the resource, puts the requests into a local queue ordered according to (process ids are used to break ties)
- If it is also interested in holding the resource and has already sent out a requests, compares the timestamp with the timestamp of its own requests
- If the is the lowest one, sends an acknowledgement to
- otherwise it put the request into the local queue above
This protocol satisfies:
- the safety property
- liveness property, since it is guaranteed that each request will eventually be acknowledged.
- optional property as it guarantees access to the resource in Lamport clock order, which respects the happened-before order.
Token ring solution
Processes are logically arranged in a ring, regardless of their physical connectivity. Access to a shared resource is granted through a token that is passed along the ring in a specific direction. When a process does not require access to the resource, it forwards the token to the next process in the ring. To gain access to the resource, a process keeps hold of the token. Once a process has finished using the resource, it releases it by passing the token to the next process in the ring.
Leader election
In many distributed algorithms, a coordinator or special role is required. One example is server-based mutual exclusion. The problem arises when there is a need for a consensus on selecting a new leader when the old leader is no longer available, either due to failure or applicative reasons. The minimal assumptions for this scenario are:
- the nodes are distinguishable, as without this distinction, it is not possible to perform selection.
- the processes have knowledge of each other and their respective IDs. However, they do not have information about which processes are up and running or which ones have failed.
The bully election algorithm
The election algorithm works as follows: when a process notices that the current coordinator is not responding, it initiates an election sending an ELECT
message to all other processes with higher IDs. If no one responds, wins the election and sends a COORD
message to the processes with lower IDs.
If someone with higher IDs responds, doesn’t win and in a recursive way, the other processes with higher IDs perform the algorithm.
A ring-based algorithm
In a ring topology among nodes, when a process detects a leader failure, it sends an ELECT
message containing its ID to the next closest alive neighbor. The process receiving the election message follows these steps:
- If the process is not already in the message, it adds itself and propagates the message to the next alive neighbor.
- If the process is already in the message, the message type is changed to
COORD
, and the modified message is recirculated.
When arrive the COORD
message, it means it has circulated around the entire ring. It then takes the list of IDs from all the processes and selects the greatest, lowest, or desired ID as the new leader (all processes obv must choose the same criteria).
After another round of message propagation, the leader will be elected. Multiple messages may circulate simultaneously but will eventually converge to have the same content.
Capturing global state
Capturing global state is a problem that one application of this case is the problem that we already encountered, the problem of creating a snapshot of a system. Capturing the global state of a distributed system is not as straightforward as it would be with a global clock. Since a global clock is not available, we have to rely on recording the state of each process at different times.
A cut of a system composed of processes can be defined as the union of the histories of all its processes up to a certain event. is consistent iff for any event it includes, it also includes all the events that happened before .
Distributed snapshot (Chandy-Lamport)
The Chandy-Lamport algorithm selects a consistent cut. Any process may initiate a snapshot by:
- Recording its internal state
- Sending a token on all outgoing channel to signal the start of the snapshot
- Start recording local state (messages arriving on every incoming channel)
Upon receiving a token, a process :
- Stop recording incoming message on the channel the token arrived along
- If not already recording local snapshot
- Records its internal state
- Sends a token on all outgoing channels
- Start recording a local snapshot
The end of the snapshot is “natural”: at some point the token have arrived on all its incoming channels. Things to note:
- the application runs continuously during recording. If it receives a token from a channel which is being recorded, it will only pause the recording but continue processing the messages throughout the entire operation.
- Once the snapshot is complete, the collected data can be sent to a single collector that can reconstruct the global state of the system based on the individual process snapshots.
Dijkstra-Scholten distributed termination
In a diffusing computation, initially all processes are idle except the init process, which after the reception of a message/signal starts the distributed computation. The termination condition is that when processing is complete at each node and there are no more messages in the system the computation is terminated.
The Dijkstra-Scholten termination detection algorithm basically consists into creating a spanning tree to ensure that each successor is unique to only one node, without creating dangerous cycles. The core ideas:
- each node keeps track of the nodes it sends messages to: its children.
- If a node was already awake when the message arrived, it is already part of the tree and should not be added as a child of the sender.
- When a node has no more children and is idle means that it’s a leaf node: it tells its parent to remove it as a child.
Distributed transactions
Different transaction types exist:
- Flat: happening on a single database, relatively easy to guarantee ACID
- Nested:
- Each transaction consists in multiple sub-transactions in different (completely independent) DBs.
- If a sub-transaction fails, then the entire original transaction fails
- Distributed:
- Flat transactions on distributed data: the difference between nested transaction is that same the multiple DBs are part of the same database.
- A single transaction should be able to read and write to all the data in all the distributed data stores
- Need distributed locking
The main idea is to use a transaction manager which works as a “frontend” and interact with multiple schedulers which guarantee the legality of the transactions. The schedulers properly schedule conflicting operation using mainly two approaches:
- Locking with 2PL leads to serializability but maybe can cause a lock. 3 types:
- Centralized lock manager
- Primary: multiple lock managers
- Distributed: distributed lock managers, so necessity to synchronize the locking over multiple hosts
- Timestamp ordering locks anything, perform everything with timestamps according to some criteria which are mainly divided into pessimistic ordering and optimistic ordering. The optimistic approach is different from the pessimistic one since it relies on the assumption that the conflicts are rare: maximum parallelism, eventually rollback. Not widely used in DS. Actually, not widely used in general.
Pessimistic timestamp ordering
Serializability without risk of deadlocks using a timestamp to each transaction (e.g., using logical clocks).
Some rules:
- We refer to the write timestamp of the committed version of as (That of the last transaction which )
- Same but with read timestamp (that of the last transaction which )
- When scheduler receives write at time
- If and perform tentative write with timestamp
- else abort since the write request arrived too late
- When scheduler receives at time
- If :
- perform read and set
- else abort since the read request arrived too late
- If :
Optimistic timestamp ordering
Based on the assumption that conflicts are rare, the strategy here is to simply proceed with the transaction and handle conflicts at a later stage:
- data items are marked with the start time of the transaction. When the transaction is ready to commit, a check is performed to determine if any items have been changed since the start of the transaction.
- If changes are detected, the transaction is aborted.
- Otherwise, it is committed.
Things to remember about optimistic timestamp ordering:
- Deadlock-free, allows maximum parallelism.
- Under heavy load, there may be too many rollbacks.
- Not widely used, especially in DS
Detecting distributed deadlocks
The 2PL locking can produce deadlocks (obv the timestamp approach can’t produce deadlocks). Distributed deadlocks can be addressed mainly:
- Ignore the problem: most often employed, actually meaningful in many settings
- Detection and recovery: typically by killing one of the process
- Prevention
- Avoidance: never used in (distributed) systems, as it implies a priori knowledge about resource usage
Not only fix deadlocks is difficult, but even detecting deadlocks in DS is difficult. The approaches:
- Centralized solution consists in a coordinator which collects messages from each machine which maintains the own resource graph for its own resources. Then the coordinator with a “god-view” can “see” if there are deadlocks. The problem with these approach is that the timing of message arrivals can lead to false deadlocks.
- Distributed detection system (Chandy-Misra-Haas) where each process is allowed to request multiple resources simultaneously and send messages containing the tuple (initiator, sender, receiver). If any of the process detects a cycle, a deadlock is found.
A smarter solution to prevent deadlocks is to design the system in a way that makes it impossible for deadlocks to occur using a distributed prevention approach. It’s possible to do this using global timestamps with two possible approaches:
- wait-die algorithm
- wound-wait algorithm