Synchronization

In a DS there isn’t a single clock

This chapter explores distributed algorithms for various synchronization tasks such as:

synchronizing physical clocks
simulating time using logical clocks
preserving event ordering
achieving mutual exclusion
conducting leader election
collecting global state and detecting termination
managing distributed transactions
detecting distributed deadlocks

In all of this time is a fundamental critical concept. In ds ensuring that all machines perceive the same global time is a critical challenge. Computer clocks are actually timers and to achieve synchronization, several factors need to be considered:

clock drift rate, which is a constant value determined by the timer. For most quartz crystals, this drift rate is around 1 second per day, meaning that they can drift by approximately 11.6 seconds every 11.6 days.
Clock skew refers to the difference in drift rates between two clocks. If two clocks are drifting in opposite directions, they will accumulate a skew equal to twice the product of the drift rate and the elapsed time.

To maintain synchronization, a resynchronization process is needed. There are two main approaches to achieving synchronization:

The first approach is to synchronize all clocks against a single clock, typically one that has external and accurate time information. This ensures accuracy for all clocks, as they are aligned with the reference clock.
The second approach is to synchronize all clocks among themselves, ensuring that they all agree on the same time. At the very least, time monotonicity needs to be preserved, meaning that time should always move forward and not jump backwards or stall.

Before dive in synchronization algorithms one important issue to note is that if a client recognizes that its own time is ahead of the correct time, it should never switch its clock back in time: it should be obvious .. switching the clock back can cause errors in running applications. Instead, the client should delay its clock until it reaches the synchronization point. For example, if the clock should be 11:59:59 but is at 12 o’clock, the client can delay its clock by going half the speed for two seconds until it reaches perfect synchronization.

Synchronization algorithms

GPS

The GPS algorithm is highly efficient in providing device positioning and clock synchronization. It operates through triangulation from a set of satellites whose positions are known. By measuring signal delay, the distance can be determined accurately. However, a challenge arises from the necessity of synchronizing the clocks between the satellites and the receiver, due to the inevitable clock skew. This limitation hinders the effectiveness of GPS in indoor environments, where GPS signals cannot be received reliably. Although GPS is a viable option, it may not perform optimally under normal circumstances.

In order for GPS to work, the clocks of the satellite and the station need to be perfectly synchronized. The satellites emit signals that are perfectly synchronized among themselves because they have atomic clocks on board. By measuring the flight time of the signal, the receiver can determine its position. To achieve this, at least four satellites are needed to provide four equations to solve for four variables (x, y, z, and time).

The precision of GPS is around 10 meters in distance, which requires highly precise clock synchronization. The main source of error comes from the time it takes for the signal to go from the GPS to the receiver’s computer.

However, it is possible to synchronize the clocks of multiple computers with a precision in the order of nanoseconds using GPS, although this may not be feasible in certain cases where computers are located indoors.

Simple algorithms: Cristian’s (1989)

How do you synchronize if you don’t have a GPS on board of every station? One of the simplest algorithm is the Christian’s algorithm.

In the time synchronization process, clients periodically send requests to the time server. However, there are certain problems:

the time might run at a different speed on the client machine. To avoid this, a gradual change is introduced.
the non-zero time it takes for the message to travel to the server and back. To account for this, the round-trip time is measured and adjusted. The adjusted time, denoted as , is calculated as the sum of the current clock time and half of the round-trip time.

Multiple measurements of the round-trip time are taken and then averaged to improve accuracy. However, even with these adjustments, there is still some error in the obtained time due to the delay between the request and response. Indeed, the assumption here is that the network is symmetric in terms of latency: meaning the time of the request is nearly the same as the response time. However, this assumption is an oversimplification that works if the flight time is very short compared to the desired precision.

Berkeley (1989)

The second approach in Berkeley Unix differs from Christian’s algorithm in that the time server is active instead of passive: it collects the time from all clients, averages it, and then retransmits the required adjustment.

This approach synchronizes the machines with each other instead of against a single machine, which is reasonable when there is no assumption that one machine is more correct than the others.

Network Time Protocol (NTP)

This protocol was designed for UTC sync over large-scale networks and it’s actually what is used today for large-scale networks like the internet and uses servers organized in a hierarchy. At the top of the hierarchy are machines with atomic clocks that synchronize other machines, down to the client machines. The synchronization method depends on the network:

Multicast (over LAN): on LAN broadcast communication is typically used, where the NTP server periodically broadcasts the current time and the receiving machine synchronizes based on that time.
Procedural-call mode: similar to Christian algorithm
Symmetric mode: for higher levels that need the highest accuracies

The transmission times of messages $m$ and $m^{'}$ gives this: $o = o_{i} + (m^{'} - m) /2$ where $o_{i}$ is an estimation of the time offset (between the two clocks), and $d_{i}$ represents the accuracy of this estimation.

Logical time

In some applications, it is not necessary to have accurate absolute time. Instead, what is important is the ordering and causality relationships of events.

Scalar clocks

Lamport invented a simple mechanism by which the happened before ordering can be captured numerically using integers to represent the clock value. Each process $p_{i}$ keeps a logical scalar clock $L_{i}$ :

$L_{i}$ starts at zero
$L_{i}$ is incremented before $p_{i}$ sends a message
Each message sent by $p_{i}$ is timestamped with $L_{i}$
Upon receipt of a message, $p_{i}$ sets $L_{i}$ to: $M A X ($ timestamp $_{m s g}, L_{i}) + 1$

The idea behind Lamport clocks is that they can serve as an approximation of the “happens before” relationship between events. In Lampard clocks, if one event $a$ happens-before another event $b$ $(a \to b)$ , then the scalar clock of $a$ is less than the scalar clock of $b$ . Note that the converse is not always true. Just because the scalar clock of “a” is less than that of ” $b$ ” does not mean that $a \to b$ . This because scalar clocks capture a partial ordering of events, not the full causal relationship.

Other versions of Lamport Clocks

Lamport Clocks with Process IDs for Total Order Guarantee:
- Objective: Establish a total order of events across distributed systems.
- Mechanism: Each process maintains the logical clock + each process IDs: IDs are used like tie-breakers for events with identical timestamps, in this way “nothing happens concurrently”.
Lamport Clocks for Total Ordering Multicast:
- Objective: Ensure total ordering of multicast property, which says “every process receive the message in the same order”.
- Mechanism: Ever
1. A message is sent in multicast, with the logical timestamp of the sender
2. When a process receives a message, it is put in a local queue, ordered by timestamp
3. The receiver multicast an ACK to the other processes
4. A message is delivered to the application only when it is at the highest in the queue and all its acks have been received.

Vector clocks

The problem of scalar clocks is that $e \to e^{'} \Rightarrow L (e) < L (e^{'})$ but the reverse does not necessarily hold, e.g., if $e ∥ e^{'}$ .The solution are Vector clocks. Basically is the same a scalar clocks but all process has/sends a vector, in which for each cell $V [j]$ there is a value associated with the process $j$ (so $N$ values for $N$ processes).

Rules:

$V_{i} [i]$ is the number of events that have occurred at $P_{i}$ , initially $V_{i} [j] = 0$ for all $i, j$
If $V_{i} [j] = k$ then $P_{i}$ knows that $k$ events have occurred at $P_{j}$
$P_{i}$ attaches a timestamp $t = V_{i}$ in all messages it sends (incrementing “its value” $V_{i} [i]$ just before sending the message
When $P_{i}$ receives a message containing $t$ , it sets $V_{i} [j] = max (V_{i} [j], t [j])$ for all $j \neq = i$ and then increments $V_{i} [i]$ (basically it updates its own vector according to the received vector using $ma x$ function).

A vector clock defines a perfect isomorphism with respect to the happens before relationship. Position $i$ -th of the vector clock of each process represents the corresponding number of events that occur at process $i$ -th.

Vector clocks for causal delivery

We want to order questions and replies, messages in causal order, not in a total order.

Using vector clock, we can order events according to causality, not exactly causality. In order to do so, we need a variation of vector clock. Causal delivery: if two events are causally related, everybody must see the message in the same order. A slight variation of vector clocks can be used to implement causal delivery of messages in a totally distributed way.

We can use vector clocks:

Variation: increment clock only when sending a message. On receive, just merge, not increment
Hold a reply until the previous messages are received:
- $t s (r) [j] = V_{k} [j] + 1$
- $t s (r) [i] \leq V_{k} [i]$ for all $i \neq = j$

Mutual exclusion

Mutual exclusion is required to prevent interference between processes in a distributed system.

Safety property: says that at most one process exits the critical section at a time.
Liveness property: all requests to enter/exit the critical section eventually succeed (no deadlock, no starvation)
Optional: if one request happened-before another, then entry is granted in that order

Centralized solution

The simplest solution is to have a server to coordinate access to a resource. This server emulates a centralized solution where it manages the lock using a token, which allows a process to access the resource. Requests and releases for resource access are obtained through messages to the coordinator.

This solution is easy to implement and ensures mutual exclusion and fairness. However, it has some drawbacks:

the server can become a performance bottleneck
the server is a single point of failure

Actually the best option is this: just replicate the central server to address its limitations.

Mutual exclusion with Lamport scalar clocks

To request access to a resource, process $P_{i}$ multicasts a resource request message $m$ , with timestamp $T_{m}$ , to all processes (including itself). A resource request is granted to a process when its request has been acknowledged by all other processes.

Upon receipt of $m$ , a process $P_{j}$ :

If it does not hold the resource and it is not interested in holding the resource, $P_{j}$ sends an acknowledgment to $P_{i}$
If it holds the resource, $P_{j}$ puts the requests into a local queue ordered according to $T_{m}$ (process ids are used to break ties)
If it is also interested in holding the resource and has already sent out a requests, $P_{j}$ compares the timestamp $T_{m}$ with the timestamp of its own requests
- If the $T m$ is the lowest one, $P j$ sends an acknowledgement to $P i$
- otherwise it put the request into the local queue above

This protocol satisfies:

the safety property
liveness property, since it is guaranteed that each request will eventually be acknowledged.
optional property as it guarantees access to the resource in Lamport clock order, which respects the happened-before order.

Token ring solution

Processes are logically arranged in a ring, regardless of their physical connectivity. Access to a shared resource is granted through a token that is passed along the ring in a specific direction. When a process does not require access to the resource, it forwards the token to the next process in the ring. To gain access to the resource, a process keeps hold of the token. Once a process has finished using the resource, it releases it by passing the token to the next process in the ring.

Leader election

In many distributed algorithms, a coordinator or special role is required. One example is server-based mutual exclusion. The problem arises when there is a need for a consensus on selecting a new leader when the old leader is no longer available, either due to failure or applicative reasons. The minimal assumptions for this scenario are:

the nodes are distinguishable, as without this distinction, it is not possible to perform selection.
the processes have knowledge of each other and their respective IDs. However, they do not have information about which processes are up and running or which ones have failed.

The bully election algorithm

The election algorithm works as follows: when a process $P$ notices that the current coordinator is not responding, it initiates an election sending an ELECT message to all other processes with higher IDs. If no one responds, $P$ wins the election and sends a COORD message to the processes with lower IDs. If someone with higher IDs responds, $P$ doesn’t win and in a recursive way, the other processes with higher IDs perform the algorithm.

A ring-based algorithm

In a ring topology among nodes, when a process detects a leader failure, it sends an ELECT message containing its ID to the next closest alive neighbor. The process receiving the election message follows these steps:

If the process is not already in the message, it adds itself and propagates the message to the next alive neighbor.
If the process is already in the message, the message type is changed to COORD, and the modified message is recirculated.

When arrive the COORD message, it means it has circulated around the entire ring. It then takes the list of IDs from all the processes and selects the greatest, lowest, or desired ID as the new leader (all processes obv must choose the same criteria).

After another round of message propagation, the leader will be elected. Multiple messages may circulate simultaneously but will eventually converge to have the same content.

Capturing global state

Capturing global state is a problem that one application of this case is the problem that we already encountered, the problem of creating a snapshot of a system. Capturing the global state of a distributed system is not as straightforward as it would be with a global clock. Since a global clock is not available, we have to rely on recording the state of each process at different times.

A cut $C$ of a system $S$ composed of $N$ processes $p_{1}, ..., p_{n}$ can be defined as the union of the histories of all its processes up to a certain event. $C$ is consistent iff for any event $e$ it includes, it also includes all the events that happened before $e$ .

Distributed snapshot (Chandy-Lamport)

The Chandy-Lamport algorithm selects a consistent cut. Any process $p$ may initiate a snapshot by:

Recording its internal state
Sending a token on all outgoing channel to signal the start of the snapshot
Start recording local state (messages arriving on every incoming channel)

Upon receiving a token, a process $q$ :

Stop recording incoming message on the channel the token arrived along
If not already recording local snapshot
1. Records its internal state
2. Sends a token on all outgoing channels
3. Start recording a local snapshot

The end of the snapshot is “natural”: at some point the token have arrived on all its incoming channels. Things to note:

the application runs continuously during recording. If it receives a token from a channel which is being recorded, it will only pause the recording but continue processing the messages throughout the entire operation.
Once the snapshot is complete, the collected data can be sent to a single collector that can reconstruct the global state of the system based on the individual process snapshots.

Dijkstra-Scholten distributed termination

In a diffusing computation, initially all processes are idle except the init process, which after the reception of a message/signal starts the distributed computation. The termination condition is that when processing is complete at each node and there are no more messages in the system the computation is terminated.

The Dijkstra-Scholten termination detection algorithm basically consists into creating a spanning tree to ensure that each successor is unique to only one node, without creating dangerous cycles. The core ideas:

each node keeps track of the nodes it sends messages to: its children.
If a node was already awake when the message arrived, it is already part of the tree and should not be added as a child of the sender.
When a node has no more children and is idle means that it’s a leaf node: it tells its parent to remove it as a child.

Distributed transactions

Different transaction types exist:

Flat: happening on a single database, relatively easy to guarantee ACID
Nested:
- Each transaction consists in multiple sub-transactions in different (completely independent) DBs.
- If a sub-transaction fails, then the entire original transaction fails
Distributed:
- Flat transactions on distributed data: the difference between nested transaction is that same the multiple DBs are part of the same database.
- A single transaction should be able to read and write to all the data in all the distributed data stores
- Need distributed locking

The main idea is to use a transaction manager which works as a “frontend” and interact with multiple schedulers which guarantee the legality of the transactions. The schedulers properly schedule conflicting operation using mainly two approaches:

Locking with 2PL leads to serializability but maybe can cause a lock. 3 types:
- Centralized lock manager
- Primary: multiple lock managers
- Distributed: distributed lock managers, so necessity to synchronize the locking over multiple hosts
Timestamp ordering locks anything, perform everything with timestamps according to some criteria which are mainly divided into pessimistic ordering and optimistic ordering. The optimistic approach is different from the pessimistic one since it relies on the assumption that the conflicts are rare: maximum parallelism, eventually rollback. Not widely used in DS. Actually, not widely used in general.

Pessimistic timestamp ordering

Serializability without risk of deadlocks using a timestamp to each transaction (e.g., using logical clocks).

Some rules:

We refer to the write timestamp of the committed version of $x$ as $t s_{w r} (x)$ (That of the last transaction which $write x$ )
Same but with read timestamp $t s_{r d} (x)$ (that of the last transaction which $read x$ )
When scheduler receives write $(T, x)$ at time $= t s$
- If $t s > t s_{r d} (x)$ and $t s > t s_{w r} (x)$ perform tentative write $x_{i}$ with timestamp $t s_{w r} (x_{i})$
- else abort $T$ since the write request arrived too late
When scheduler receives $read (T, x)$ at time $= t s$
- If $t s > t s_{w r} (x)$ :
  - perform read and set $t_{r d} (x) = max (t s, t_{r d} (x))$
- else abort $T$ since the read request arrived too late

Optimistic timestamp ordering

Based on the assumption that conflicts are rare, the strategy here is to simply proceed with the transaction and handle conflicts at a later stage:

data items are marked with the start time of the transaction. When the transaction is ready to commit, a check is performed to determine if any items have been changed since the start of the transaction.
If changes are detected, the transaction is aborted.
Otherwise, it is committed.

Things to remember about optimistic timestamp ordering:

Deadlock-free, allows maximum parallelism.
Under heavy load, there may be too many rollbacks.
Not widely used, especially in DS

Detecting distributed deadlocks

The 2PL locking can produce deadlocks (obv the timestamp approach can’t produce deadlocks). Distributed deadlocks can be addressed mainly:

Ignore the problem: most often employed, actually meaningful in many settings
Detection and recovery: typically by killing one of the process
Prevention
Avoidance: never used in (distributed) systems, as it implies a priori knowledge about resource usage

Not only fix deadlocks is difficult, but even detecting deadlocks in DS is difficult. The approaches:

Centralized solution consists in a coordinator which collects messages from each machine which maintains the own resource graph for its own resources. Then the coordinator with a “god-view” can “see” if there are deadlocks. The problem with these approach is that the timing of message arrivals can lead to false deadlocks.
Distributed detection system (Chandy-Misra-Haas) where each process is allowed to request multiple resources simultaneously and send messages containing the tuple (initiator, sender, receiver). If any of the process detects a cycle, a deadlock is found.

A smarter solution to prevent deadlocks is to design the system in a way that makes it impossible for deadlocks to occur using a distributed prevention approach. It’s possible to do this using global timestamps with two possible approaches:

wait-die algorithm
wound-wait algorithm

blog.martino.im

Explorer

04.Synchronization