Centre de Conception du Logiciel

Workshop: Code performance, energy, hybrid-compilation, and debugging


The Inria project-team Corse is organizing a workshop on December 13-14, 2016, dedicated to code characterization, performance, energy, hybrid compilation, and debugging, at the Centre de Conception du Logiciel (CCL), Grenoble.

  • Date: 13/12/2016 to 14/12/2016
  • Place: Minatec Campus, 17 Rue des Martyrs, Grenoble - Batiment 50C - Room C203/C206
  • Guest(s): Alexandra Jimborean (Uppsala U.), Louis-Noël Pouchet (Colorado St. U.), Ayal Zaks (Intel), Kim-Anh Tran (Uppsala U.), Fabian Gruber (Inria)
  • Organiser(s): CORSE Project-team (Fabrice Rastello)


December 13


14h - 14h45

Alexandra Jimborean

Automatic Detection of Extended Data-Race-Free Regions

Data-race-free (DRF) parallel programming is becoming a standard, as the newly adopted memory models of mainstream programming languages such as C++ and Java impose data-race freedom as a requirement. We propose compiler techniques that automatically delineate extended data-race-free (xDRF) regions, namely regions of code which provide the same guarantees as synchronization-free regions (in the context of DRF codes). xDRF regions stretch across synchronization boundaries, function calls and loop back-edges while preserving data-race-free semantics, thus increasing the optimization opportunities exposed to the compiler and to the underlying architecture.

Our compiler techniques precisely analyze the threads' memory access behavior and data sharing in shared-memory, general-purpose parallel applications and can therefore infer the limits of xDRF code regions. We evaluate the potential of our technique by employing the xDRF region classification in a state-of-the-art, dual-mode cache coherence protocol. Larger xDRF regions reduce the coherence bookkeeping and enable optimizations for performance (6.1%) and energy efficiency (12.7%) compared to a standard directory-based coherence protocol.

15h15 - 16h

Louis-Noël Pouchet

Source Code Analysis for Kernel Characterization and Categorization

Polyhedral program transformations can perform highly aggressive restructuring of programs with static control flow. However, finding the best transformation to optimize for speed or for energy remains a daunting challenge: to date, the state of practice is to perform auto-tuning on the target device, running many different versions of the input program to observe which one actually performs best.
In this talk we present PolyFeat, a fast static analysis tool which can characterize a program region at compile time in less than one second for affine programs made of possibly thousands of lines of code. It computes numerous approximate metrics from the source code, such as data cache misses, operational intensity, OpenMP scaling potential, etc. As we show, these metrics can then be used to prune a space of transformations, or to implement compile-time CPU frequency selection to optimize energy, for example.
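
As a rough illustration of one of the metrics named above, operational intensity is the ratio of arithmetic operations to bytes moved to and from memory. The sketch below is not PolyFeat's analysis: it hand-counts both quantities for a dense n-by-n matrix multiplication, assuming 8-byte elements and an idealized cache in which each matrix crosses the memory interface exactly once. The function name is hypothetical.

```python
def matmul_operational_intensity(n: int, bytes_per_element: int = 8) -> float:
    """Flops per byte for C = A * B on n-by-n matrices, under ideal caching."""
    # Each of the n*n output elements needs n multiplications and n additions.
    flops = 2 * n ** 3
    # Assumption: A and B are each read once and C is written once.
    bytes_moved = 3 * n ** 2 * bytes_per_element
    return flops / bytes_moved


if __name__ == "__main__":
    for n in (64, 512, 4096):
        print(n, matmul_operational_intensity(n))
```

Under these assumptions the ratio simplifies to n/12 flops per byte for 8-byte elements, so the kernel moves from memory-bound towards compute-bound as n grows. A real static analysis would have to approximate the traffic for a finite cache instead.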

16h30 - 17h15

Ayal Zaks

Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer loop auto-vectorization

Currently, LoopVectorizer in LLVM is specialized in auto-vectorizing innermost loops. The SIMD and DECLARE SIMD constructs introduced in OpenMP4.0 and enhanced in OpenMP4.5 are gaining popularity among performance-hungry programmers, both because they can specify a vectorization region much larger in scope than traditional inner loop auto-vectorization would handle and because several advanced vectorizing compilers deliver impressive performance for such constructs. Hence, there is a growing interest in the LLVM developer community in improving LoopVectorizer to adequately support OpenMP functionalities such as outer loop vectorization and whole function vectorization. In this Technical Talk, we discuss our approaches in achieving that goal through a series of incremental steps and further extending it for outer loop

December 14


10h - 10h40

Kim-Anh Tran

Compiling for energy efficient architectures: Hiding long-latencies on limited, energy-efficient cores

Memory latency becomes a performance bottleneck if long-latency loads cannot be overlapped with useful computation. While aggressive out-of-order processors are able to hide long latencies, limited out-of-order and in-order cores fail to find enough independent instructions to hide the delay. We propose software-only and software-hardware co-designs to overcome the performance degradation caused by long-latency loads on small cores. Equipped with appropriate compile-time support, energy-efficient cores can significantly improve their performance on memory-bound applications. We separate loads from their uses, and overlap their latencies with instructions from different blocks and loop iterations. Our techniques overcome restrictions which rendered conventional compile-time techniques impractical: (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure, and achieve an average run time improvement of 10%, with a peak of 45% on memory-bound applications.

10h45 - 11h30

Fabian Gruber

Extending QEMU to Build a Bottleneck Model based Performance Debugging Tool

QEMU, short for Quick Emulator, is a CPU emulator that is able to run applications compiled for one architecture on another (such as running an ARM binary on an x86 CPU, or vice versa). QEMU is not based on an interpreter, but instead uses binary translation to allow efficient execution of foreign instructions. Performance debugging is the process of, first, finding performance problems, that is, pinpointing code regions with suboptimal resource utilization, and then diagnosing the causes of these problems. This talk presents ongoing work the CORSE team has done in collaboration with STMicroelectronics on extending QEMU to instrument executed programs in order to collect high-level performance metrics. The goal of this presentation is not only to present our work, but also to solicit feedback on our ideas from the audience.

14h - 14h45 

Diogo Sampaio

Profile Guided Hybrid Compilation (PhD defense)

Heat dissipation limitations caused a paradigm change in how the computational capacity of chips is scaled, shifting from increasing the clock frequency to growing parallelism. To exploit this parallelism, computer applications must be made parallel, a hard job left to software developers. To aid in this process, many optimizing compilers and frameworks have been developed, such as polyhedral compilation tools (e.g. Pluto).

This work advocates the use of hybrid analyses when optimizing loops, the regions where the majority of programs spend most of their time.
It proposes a framework that statically applies a sequence of complex loop transformations in a speculative manner. Based on memory access expressions, it generates lightweight run-time tests to ensure that data dependencies are not violated by a given transformation. Using information collected at run time, it discards transformations that would never be used because their validity tests are too constraining. At the heart of this technique is a powerful quantifier elimination scheme over multivariate integer polynomials, which provides a more precise result than any other known tool.
The soundness of the framework is demonstrated on a modified version of the PolyBench benchmark suite, where all data structures have been linearized. Performing the same transformations that a polyhedral optimizer would apply to the original programs, our framework generates tests that correctly validate the use of transformations, either by proving correct ones valid or by blocking invalid ones. To further illustrate the generality of our run-time test generation scheme, we demonstrate the capacity to correctly generate tests for programs with polynomial memory accesses, caused by packed triangular matrix access patterns.
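
To make the last point concrete: a packed lower-triangular matrix stores row i at offset i(i+1)/2, so the linearized index of element (i, j) is a quadratic polynomial in the loop counters, not an affine expression. The sketch below only illustrates the idea and is not the thesis framework. All names are hypothetical, and the run-time test is a hand-written interval-overlap check, not one derived by quantifier elimination. It shows the polynomial access function, plus a loop that is versioned at run time: a reordered variant runs only when the test proves that the read and write ranges cannot conflict.

```python
def packed_index(i: int, j: int) -> int:
    """Offset of element (i, j), with 0 <= j <= i, in a packed lower-triangular array."""
    if not 0 <= j <= i:
        raise ValueError("expected 0 <= j <= i")
    # Row i is preceded by 1 + 2 + ... + i = i*(i+1)/2 elements.
    return i * (i + 1) // 2 + j


def ranges_disjoint(start_a: int, start_b: int, length: int) -> bool:
    """Lightweight run-time test: the two index ranges of equal length do not overlap."""
    return start_a + length <= start_b or start_b + length <= start_a


def shift_add(buf: list, dst: int, src: int, length: int) -> str:
    """Compute buf[dst + k] += buf[src + k] for k in range(length).

    Returns which loop version was selected at run time.
    """
    if ranges_disjoint(dst, src, length):
        # Test passed: no iteration reads a value written by another one,
        # so a reordered (here: bulk) version gives the same result.
        updated = [d + s for d, s in zip(buf[dst:dst + length], buf[src:src + length])]
        buf[dst:dst + length] = updated
        return "transformed"
    # Test failed: fall back to the original sequential order.
    for k in range(length):
        buf[dst + k] += buf[src + k]
    return "original"


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(shift_add(data, 2, 0, 2), data)
    print(packed_index(3, 2))
```

When the ranges overlap, later iterations read values written by earlier ones, so only the original loop order is guaranteed to be correct. This is why the fallback version is kept.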

Keywords: Hybrid-compilation, Performance, Debugging, Energy, CCL