Category: PUBLICATIONS

  • Power, Performance, And Energy Management of Heterogeneous Architectures

    Power, Performance, And Energy Management of Heterogeneous Architectures

    Many core modern multiprocessor systems-on-chip offers tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. Applications running on these architectures are becoming increasingly complex. As the basic building blocks, which make up the application, change during runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is a daunting task due to a large number of workloads and configurations. Therefore, there is a strong need to evaluate the metrics of interest as a function of the supported configurations. This thesis focuses on two different types of modern multiprocessor systems-on-chip (SoC): Mobile heterogeneous systems and tile based Intel Xeon Phi architecture. For mobile heterogeneous systems, this thesis presents a novel methodology that can accurately instrument different types of applications with specific performance monitoring calls. These calls provide a rich set of performance statistics at a basic block level while the application runs on the target platform. The target architecture used for this work (Odroid XU3) is capable of running at 4940 different frequency and core combinations. With the help of instrumented application vast amount of characterization data is collected that provides details about performance, power and CPU state at every instrumented basic block across 19 different types of applications. The vast amount of data collected has enabled two runtime schemes. The first work provides a methodology to find optimal configurations in heterogeneous architecture using classifiers and demonstrates an average increase of 93%, 81% and 6% in performance per watt compared to the interactive, ondemand and powersave governors, respectively. The second work using same data shows a novel imitation learning framework for dynamically controlling the type, number, and the frequencies of active cores to achieve an average of 109% PPW improvement compared to the default governors. This work also presents how to accurately profile tile based Intel Xeon Phi architecture while training different types of neural networks using open image dataset on deep learning framework. The data collected allows deep exploratory analysis. It also showcases how different hardware parameters affect performance of Xeon Phi.

  • Dynamic Resource Management of Heterogeneous Mobile Platforms via Imitation Learning.

    Dynamic Resource Management of Heterogeneous Mobile Platforms via Imitation Learning.

    The complexity of heterogeneous mobile platforms is growing at a rate faster than our ability to manage them optimally at runtime. For example, state-of-the-art systems-on-chip (SoCs) enable controlling the type (Big/Little), number, and frequency of active cores. Managing these platforms becomes challenging with the increase in the type, number, and supported frequency levels of the cores. However, existing solutions used in mobile platforms still rely on simple heuristics based on the utilization of cores. This paper presents a novel and practical imitation learning (IL) framework for dynamically controlling the type (Big/Little), number, and the frequencies of active cores in heterogeneous mobile processors. We present efficient approaches for constructing an Oracle policy to optimize different objective functions, such as energy and performance per Watt (PPW). The Oracle policies enable us to design low-overhead power management policies that achieve near-optimal performance matching the Oracle. Experiments on a commercial platform with 19 benchmarks show on an average 101% PPW improvement compared to the default interactive governor.

  • Exploration of Memory and Cluster Modes in Directory-Based Many-Core CMPs.

    Exploration of Memory and Cluster Modes in Directory-Based Many-Core CMPs.


    Networks-on-chip have become the standard interconnect solution to address the communication requirements of many-core chip multiprocessors. It is well-known that network performance and power consumption depend critically on the traffic load. The network traffic itself is a function of not only the application, but also the cache coherence protocol, and memory controller/directory locations. Communication between the distributed directory to memory can introduce hotspots, since the number of memory controllers is much smaller than the number of cores. Therefore, it is critical to account for directory memory communication, and model them accurately in architecture simulators. This paper analyzes the impact of directory memory traffic and different memory and cluster modes on the NoC traffic and system performance. We demonstrate that unrealistic models in a widely used multiprocessor simulator produce misleading power and performance predictions. Finally, we evaluate different memory and cluster modes supported by Intel Xeon-Phi processors, and validate our models on four different cache coherence protocols.Subodha Charles,

  • DyPO: Dynamic Pareto-Optimal Configuration Selection for Heterogeneous MpSoCs

    DyPO: Dynamic Pareto-Optimal Configuration Selection for Heterogeneous MpSoCs

    Modern multiprocessor systems-on-chip (MpSoCs) offer tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. As the workload phases change at runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is infeasible due to the large number of workloads and configurations. This paper proposes a novel methodology that can find the Pareto-optimal configurations at runtime as a function of the workload. To achieve this, we perform an extensive offline characterization to find classifiers that map performance counters to optimal configurations. Then, we use these classifiers and performance counters at runtime to choose Pareto-optimal configurations. We evaluate the proposed methodology by maximizing the performance per watt for 18 single- and multi-threaded applications. Our experiments demonstrate an average increase of 93%, 81% and 6% in performance per watt compared to the interactive, ondemand and powersave governors, respectively.