Multithreading Architectures

Superscalar architectures

A scalar architecture issues a single instruction per clock cycle, while a superscalar architecture can issue and execute multiple instructions per clock cycle. Remember that ILP is not the same as parallel processing:

  • ILP improves throughput by overlapping (pipelining) the execution of instructions, transparently to the user
  • Parallel processing is a non-user-transparent way of executing programs


Superscalar architectures have limitations. ILP has an upper bound:

  • Dynamic scheduling is expensive and difficult to design and verify
  • Register renaming has its limits
  • Jump and branch predictions are not always accurate
  • Memory latency is a major bottleneck
  • Hazards limit how many instructions can be issued simultaneously
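
The effect of hazards on the ILP upper bound can be illustrated with a minimal sketch. It assumes a heavily simplified machine: in-order issue, every instruction has a one-cycle latency, and only read-after-write (RAW) dependencies are modelled. The function name `cycles_needed` and the `(dest, sources)` instruction encoding are hypothetical, chosen only for this illustration.

```python
def cycles_needed(program, issue_width):
    """Count cycles for an in-order machine issuing up to `issue_width` instructions per cycle.

    program: list of (dest, sources) tuples, one per instruction.
    Assumptions: 1-cycle latency, in-order issue, only RAW hazards are modelled.
    """
    ready_at = {}  # register -> first cycle in which its value can be read
    cycle = 0
    i = 0
    while i < len(program):
        issued = 0
        while i < len(program) and issued < issue_width:
            dest, sources = program[i]
            # RAW hazard: a source is still being produced in this cycle,
            # so this instruction (and everything after it) waits.
            if any(ready_at.get(src, 0) > cycle for src in sources):
                break
            ready_at[dest] = cycle + 1
            issued += 1
            i += 1
        cycle += 1
    return cycle


# Four independent instructions versus a four-instruction dependency chain.
independent = [("r1", []), ("r2", []), ("r3", []), ("r4", [])]
chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]

print(cycles_needed(independent, 4))  # 1
print(cycles_needed(chain, 4))        # 4
```

Under these assumptions, widening the issue stage helps the independent instructions but does nothing for the dependency chain: the available ILP, not the issue width, sets the limit.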

Temporal Multithreading

With temporal multithreading it is possible to further increase parallelism by alternating between tasks. Temporal multithreading has a high cost, because each thread needs its own context (for example, its own program counter and register state), and there are different techniques to achieve it:

  • Fine-grained multithreading involves switching between different threads at a very fine level of granularity, such as after every instruction. This allows for maximum utilization of resources but can also lead to increased overhead due to frequent context switching.
  • Coarse-grained multithreading involves switching between different threads at a coarser level of granularity, such as after completing a group of instructions from one thread before moving on to another thread. This reduces the overhead associated with context switching but may not fully utilize available resources.
  • Simultaneous multithreading (SMT) is similar to fine-grained multithreading but goes further by allowing multiple instructions from different threads to be executed simultaneously within each clock cycle.
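
The three techniques can be compared with a small issue-slot simulation. This is only a sketch under stated assumptions: each thread can supply a fixed number of instructions per cycle, coarse-grained switching happens after a fixed quantum of cycles (real designs often switch on long-latency events such as cache misses instead), and SMT fills slots in a fixed thread order. The names `issue_slots`, `utilization`, `ilp`, and `quantum` are hypothetical.

```python
def issue_slots(policy, ilp, cycles, width, quantum=2):
    """Return, for each cycle, which thread fills each of the `width` issue slots.

    ilp: dict mapping thread name -> instructions that thread can issue per cycle.
    A slot holding None is an empty (wasted) issue slot.
    """
    names = list(ilp)
    schedule = []
    for cycle in range(cycles):
        if policy == "fine":
            # Switch to the next thread every cycle.
            owners = [names[cycle % len(names)]]
        elif policy == "coarse":
            # Stay on one thread for `quantum` cycles before switching.
            owners = [names[(cycle // quantum) % len(names)]]
        elif policy == "smt":
            # All threads may issue in the same cycle.
            owners = names
        else:
            raise ValueError(f"unknown policy: {policy}")
        slots = []
        for name in owners:
            free = width - len(slots)
            slots.extend([name] * min(ilp[name], free))
        slots.extend([None] * (width - len(slots)))
        schedule.append(slots)
    return schedule


def utilization(schedule):
    """Fraction of issue slots that hold an instruction."""
    total = sum(len(slots) for slots in schedule)
    used = sum(slot is not None for slots in schedule for slot in slots)
    return used / total


threads = {"A": 2, "B": 1}  # thread A can issue 2 instructions per cycle, B only 1
for policy in ("fine", "coarse", "smt"):
    print(policy, utilization(issue_slots(policy, threads, cycles=4, width=4)))
# fine 0.375
# coarse 0.375
# smt 0.75
```

In this toy model, fine- and coarse-grained multithreading leave the same number of slots empty, because only one thread issues per cycle, while SMT fills slots from both threads in the same cycle.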


There are drawbacks in each approach:

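  • Fine-grained multithreading can leave issue slots empty, because a single thread rarely has enough ready instructions to fill the whole issue width in a cycle, causing idle time.
  • Coarse-grained multithreading is not commonly used, since switching threads requires flushing the pipeline and it can result in starvation.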
  • SMT requires more complex hardware than either fine- or coarse-grained multithreading but can provide higher levels of parallelism and better resource utilization.


ARM Cortex-A53 pipeline example

What actually happens in the real world? Different kinds of processing units are combined to get the most performance. For example, a CPU alone is not enough for tasks such as playing games in 4K or streaming on Twitch, so GPUs are a common addition to systems. This is called a heterogeneous architecture, since it combines different types of processors.

An example of a heterogeneous architecture is big.LITTLE. The Cortex-A53 (also the CPU of the Raspberry Pi 3) can be used alone as an energy-efficient alternative to the Cortex-A57, or in a big.LITTLE configuration alongside a more powerful microarchitecture. A big.LITTLE configuration is multicore and heterogeneous with a shared ISA, combining battery-efficient (LITTLE) and power-hungry (big) cores. Typically, only one side is active at a time, but all cores can access the same memory.