GPUDirect technologies applied to GPU real-time packet processing applications

Elena Agostini


Real-time GPU processing of network packets is a technique useful in several application domains, such as signal processing and network security. These applications typically keep the CPU in the critical path: it coordinates the network card so that packets are received in GPU memory, and it notifies a packet-processing GPU workload that is waiting on the GPU for a new set of packets. It is fundamental to maximize the zero-packet-loss throughput at the lowest possible latency, but sometimes, depending on the application and platform, the CPU may become the real bottleneck. For this reason, it is also possible to remove the CPU from the critical path completely, promoting the GPU to the main player, capable of receiving, processing and sending packets without the CPU.

In this session we'll discuss all of these options, exploring a new solution recently released in NVIDIA DOCA to solve these kinds of problems.

On the Complexity of Data Storage and Data Movement

Gianfranco Bilardi

University of Padova, Italy

The impact of data movement on the performance of computing systems is quite significant and will only increase as technology gets closer to the fundamental limitations on message speed and storage density. Hence, the space and I/O complexity of computations are classical measures that continue to be relevant. In spite of many results achieved in over half a century of research on these measures, a number of issues remain only partially understood. This talk will present the DAG visit framework for the study of space and I/O complexity, which unifies a number of known results as well as enables the derivation of new ones. The approach also exposes interesting connections between I/O and space requirements of computations.

Joint work with Lorenzo De Stefani.

An Introduction to the NVIDIA Datacenter Platform

Jose Castanos

NVIDIA, California, USA

NVIDIA is a global leader in AI and HPC. Its state-of-the-art GPUs power some of the fastest supercomputers, AI engines in major public clouds, exciting games, and autonomous cars. But mastering AI at scale is much more than owning a very fast graphics card. Therefore, NVIDIA has built an integrated hardware and software stack, where hardware is accessed through common programming models, and then used by a large catalog of frameworks and applications for AI and HPC.

This talk will provide an overview of the NVIDIA Datacenter Platform. The NVIDIA GPUs, the BlueField DPUs and the Grace CPUs target workloads with different characteristics. We will describe the interactions between them, and show areas where their synergies enable novel applications and more optimal implementations. We will present the emerging frameworks for programming these new devices (such as DOCA), and we will summarize NVIDIA's AI/HPC application portfolio.

On the intrinsic difficulty of benchmarks

Adam Charane

Free University of Bozen-Bolzano, Italy

A cornerstone of the study of Artificial Intelligence/Machine Learning systems, algorithms, and hardware is benchmarking their performance, both in terms of quality and efficiency. In order to get a representative picture and draw generalizable conclusions, benchmarks should cover the widest possible array of cases, not only in terms of size, but also in terms of difficulty.

In this talk, we will survey several metrics that attempt to capture the intrinsic difficulty of datasets, with examples from classification, similarity search, and time series analysis.

Joint work with Matteo Ceccarello.

MemComputing applications in Machine Learning

Massimiliano Di Ventra

University of California San Diego, USA

MemComputing is a new physics-based approach to computation that employs time non-locality (memory) to both process and store information on the same physical location [*]. After a brief introduction to this paradigm, I will discuss its application in the field of Machine Learning, by showing efficient supervised and unsupervised training of neural networks, demonstrating its advantages over traditional sampling methods. Work supported by DARPA, DOE, NSF, CMRR, and MemComputing, Inc. (http://memcpu.com/).

[*] M. Di Ventra, MemComputing: Fundamentals and Applications (Oxford University Press, 2022).


Anne C. Elster

NTNU: Norwegian University of Science and Technology, Norway


Towards Accelerating AI using Fast and Feasible Matrix Multiplication

Tor Hadas and Noa Vaknin

Hebrew University, Israel

Training deep neural networks increasingly requires large resources. It involves significant time spent on matrix multiplication, typically between 45% and 95%. Most current math libraries (for CPU and GPU) and all state-of-the-art hardware accelerators (such as Google's TPU and Intel's/Habana Labs' Gaudi) are based on the cubic-time classic matrix multiplication algorithm, despite more than five decades of research on sub-cubic time algorithms. Why is that?

Many of the sub-cubic time algorithms are impractical, as they have large hidden constants in the arithmetic complexity, and enormous minimal applicable size. Yet, recent years have seen encouraging studies addressing these obstacles. In this talk I will review several of them:

  • We provide a high performance general matrix-matrix multiplication that combines fast base change method and pebbling game based optimization scheme applied to Strassen's algorithm. We reduce arithmetic and communication costs, as well as memory footprint. Our algorithm outperforms DGEMM of Intel's MKL on feasible matrix dimensions starting at $n = 1024$ and obtains up to nearly ×2 speedup for larger matrix dimensions.

  • Pan's four-decade-old fast matrix multiplication algorithms (based on the trilinear aggregation method) have, to date, the lowest asymptotic complexity of all algorithms applicable to matrices of feasible dimensions. However, the large coefficients in the arithmetic cost of these algorithms make them impractical. We reduce these coefficients by 90% to 98%, in some cases down to 2, the same leading coefficient as the classical cubic time algorithm. We show that our results are optimal or close to optimal.

  • Fast recursive matrix multiplication algorithms call the cubic time classical algorithm on small sub-blocks, as the classical algorithm requires fewer operations on small blocks. We obtain a new algorithm that may outperform the classical one, even on small blocks, by trading multiplications for additions. This algorithm goes against the common belief that the classical algorithm is the fastest for small blocks. Specifically, we obtain an algorithm for multiplying 2 × 2 blocks using only four multiplications. This algorithm seemingly contradicts the lower bound of Winograd (1971) on multiplying 2 × 2 matrices. We provide a new lower bound matching our algorithm for 2 × 2 block multiplication, thus showing that our technique is optimal.

Joint work with Yoav Gross, Oded Schwartz.
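
As a point of reference for the bullets above, the following is a minimal sketch of textbook Strassen recursion that falls back to the classical cubic-time algorithm on small blocks. It is not the authors' algorithm: it uses neither the fast base change method, nor the pebbling-based optimization, nor the four-multiplication 2 × 2 block scheme. The function names and the `cutoff` parameter are illustrative assumptions.

```python
def classical_matmul(a, b):
    """Cubic-time product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def _add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def _sub(x, y):
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def _split(m):
    """Split a matrix of even dimension into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, cutoff=2):
    """Strassen recursion; `cutoff` (hypothetical tuning knob) is the block
    size at or below which the classical algorithm is called instead."""
    n = len(a)
    if n <= cutoff or n % 2:
        return classical_matmul(a, b)
    a11, a12, a21, a22 = _split(a)
    b11, b12, b21, b22 = _split(b)
    # Seven recursive products instead of the eight used classically.
    m1 = strassen(_add(a11, a22), _add(b11, b22), cutoff)
    m2 = strassen(_add(a21, a22), b11, cutoff)
    m3 = strassen(a11, _sub(b12, b22), cutoff)
    m4 = strassen(a22, _sub(b21, b11), cutoff)
    m5 = strassen(_add(a11, a12), b22, cutoff)
    m6 = strassen(_sub(a21, a11), _add(b11, b12), cutoff)
    m7 = strassen(_sub(a12, a22), _add(b21, b22), cutoff)
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom
```

The choice of `cutoff` reflects the trade-off mentioned in the third bullet: below some block size the extra additions of the recursive scheme cost more than the multiplication they save.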

Efficient Predictive Modeling of Loop Transformations for Optimizing CNNs

Mary Hall

University of Utah, USA

The explosive growth of machine learning applications has created a demand for high-performance implementations of convolutional neural networks (CNNs). Optimizing CNNs automatically has proven difficult due to the large number of optimization choices in deeply nested loops, and the variability of ideal optimization strategies for different network parameters. Learned predictive models are sometimes used to guide compiler optimization by deriving a sequence of loop transformations that optimizes memory access performance. However, training models for loop transformation often requires prohibitively expensive training data generation when predicting the combined effects of a transformation sequence. This talk will describe research to compose models for loop transformations to reduce the overhead of training.

High-Performance Computer Architecture Simulation using Deep Learning

Adolfy Hoisie

Brookhaven National Laboratory, USA

While cycle-accurate simulators are essential tools for architecture research, design, and development, their practicality is limited by an extremely long time-to-solution for realistic architectures and applications. We will describe a concerted effort aiming at developing machine learning (ML) techniques for architecture simulation with a spectrum of goals from accelerating discrete-event simulation (DES) using ML to establishing ML as an alternative to DES in the "bag-of-tools" of ModSim. We will strive in this talk to answer all the key questions you always wonder about: Is it doable? Is it practical? Is it fast? Is it accurate? What is the range of uses?

Massivizing High Performance Computing for AI and ML: VU on the Science, Design, and Engineering of AI and ML Ecosystems

Alexandru Iosup

Vrije Universiteit Amsterdam, The Netherlands

Wherever we look, our society is turning digital. Science and engineering, business-critical and economic operations, and online education and gaming rely increasingly on the effective digitalization of their processes. For digitalization to succeed, two key challenges need to be simultaneously addressed: (1) enabling faster, better, and ethical analysis and decision-making through artificial intelligence (AI) and machine learning (ML), and (2) enabling scalable, more available, and more sustainable infrastructure for AI/ML and other Information and Communication Technology (ICT) operations, through large, yet efficient and interoperable, computer ecosystems, largely automated. The latter is the grand challenge of massivizing computer systems.

Inspired by this challenge and by our experience with distributed computer systems for over 15 years, we focus on understanding, deploying, scaling, and evolving such computer ecosystems successfully, that is, satisficing performance, dependability, sustainability, and cost-effectiveness. We posit we can achieve this through an ambitious, comprehensive research program, which starts from the idea that we can address the grand, fundamental challenge by focusing on computer ecosystems rather than merely on (individual, small-scale) computer systems.

In this talk, we define computer ecosystems and differentiate them from mere systems. We formulate eight principles and introduce a reference architecture for computer ecosystems supporting AI/ML and beyond across the computing continuum, as a high-level, universal framework that may guide the science, design, and engineering of such ecosystems. We synthesize a framework of resource management and scheduling (RM&S) techniques, which we argue should be explored systematically in the next decade. We can use such techniques not only to support better AI/ML processes, but also to improve the ICT infrastructure that runs them. We show early results obtained experimentally, both through controlled real-world experiments using the GradeML framework and through what-if analysis using the OpenDC simulator.

This work could lead in particular to better workflow, big data, and graph processing frameworks supporting AI/ML, and the creation of new processes and services that depend on them. This vision aligns with the Manifesto on Computer Systems and Networking Research in the Netherlands [1], which the speaker co-leads. Many of our examples come from real-world prototyping and experimentation, grand experiments in computer systems, and/or benchmarking and performance analysis work conducted with the Cloud group of SPEC RG [2].

[1] Future Computer Systems and Networking Research in the Netherlands: A Manifesto, 2022. [Online] https://arxiv.org/pdf/2206.03259

[2] SPEC RG Cloud https://research.spec.org/working-groups/rg-cloud/

Towards cross-domain domain-specific compiler architecture

Paul Kelly

Imperial College, United Kingdom

Domain-specific languages enable the compiler to understand more about what you are trying to do. If we get it right, DSLs enable us, at the same time, both to boost programmer productivity and also to automate sophisticated domain-specific optimisations that would be hard to do by hand - yet are essential to achieving efficient use of the hardware. DSLs are power tools for performance programming. This talk will offer some of our experience in building DSLs that deliver productivity, performance and performance portability. DSLs enable us to find the right representation for a program so that complex optimisations turn out to be easy. This is compiler architecture. This talk will try to map out how to design domain-specific compiler architecture to expose and build on representations that are common across different DSLs.

Duo: A New Model for Hyperscale Computing

Bill McColl

Huawei, Switzerland

A new era of hyperscale computing is underway, driven by new cloud computing business models, new hardware for general-purpose and accelerated computing, new hardware for infrastructure processing, and new complex massive-scale intelligent applications and services that combine HPC, simulation, AI, big data analytics, and vast knowledge bases. What is now urgently needed are new theories and models that can provide a foundation to guide this new era in the computing industry. With such a foundation in place, we can begin to design the new algorithms, programming tools and software systems that we need. Duo provides a new computing model that addresses these objectives.

Early Experiences of Noise-Sensitivity Performance Analysis of a Distributed Deep Learning Framework

Bernd Mohr

Jülich Supercomputing Centre, Germany

Deep Learning (DL) applications are used to solve complex problems efficiently. These applications require complex neural network models composed of millions of parameters and huge amounts of data for proper training. This is only possible by parallelizing the necessary computations with so-called distributed deep learning (DDL) frameworks over many GPUs distributed over multiple nodes of an HPC cluster. These frameworks mostly utilize the compute power of the GPUs and use only a small portion of the available compute power of the CPUs in the nodes for I/O and inter-process communication, leaving many CPU cores idle and unused. The more powerful the base CPU in the cluster nodes, the more compute resources are wasted. In this presentation, we investigate how much of these unused compute resources could be used for executing other applications without lowering the performance of the DDL frameworks. In our experiments, we executed a noise-generation application, which generates a very high memory, network or I/O load, in parallel with DDL frameworks, and used HPC profiling and tracing techniques to determine whether and how the generated noise affects the performance of the DDL frameworks. Early results indicate that it might be possible to utilize the idle cores for jobs of other users without negatively affecting the performance of the DDL applications.

The Potential and Opportunities of Matrix Processing

Jose Moreira

IBM, T. J. Watson Research Lab., USA

For the last 50 years, vector processing has been the technique of choice for improving performance of important computations. This success is in spite of the fact that vector processing does not offer an inherent advantage in computational intensity over scalar processing. During the past few years we have seen the rise of a truly different form of computation, namely matrix processing. Matrix processing, such as that exemplified by vector outer products, has a fundamentally different computational intensity that scales with the size of the vectors. Recent GPUs and now some CPUs are being augmented with dedicated matrix units that can perform some of these matrix computations directly. Matrix processing will open new ways of performing computations and create new opportunities for high-performance computers to make even deeper impacts on human activities. In this talk, we will revisit some of the fundamentals of matrix processing, describe existing systems with matrix units, such as the IBM POWER10 processor, and compare them with other processors that are expected in the market soon. We will also discuss matrix processing in GPUs, which has become very popular, and explore what the near future holds for this powerful new computing model.
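
The claim about computational intensity can be illustrated with a small counting sketch. It assumes a simplified model that is not taken from the talk: intensity is measured as multiplications per vector element loaded, and the outer-product accumulator is assumed to stay resident in the matrix unit, so it is not counted as loaded data. The function names are illustrative.

```python
def vector_intensity(n):
    """Element-wise product of two length-n vectors:
    n multiplications for 2n loaded elements."""
    return n / (2 * n)


def outer_product_intensity(n):
    """Outer product of two length-n vectors:
    n * n multiplications for the same 2n loaded elements."""
    return (n * n) / (2 * n)


for n in (4, 16, 64):
    print(n, vector_intensity(n), outer_product_intensity(n))
```

Under this model the element-wise intensity stays at 0.5 regardless of the vector length, while the outer-product intensity is n/2 and therefore grows with the vector length.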

The Challenges of AI and HPC! What about Data?

Wolfgang Nagel

Technische Universität Dresden, Germany

Methods and techniques of Artificial Intelligence (AI) and Machine Learning (ML) have been investigated for decades in pursuit of a vision where computers can mimic human intelligence. In recent years, these methods have become more mature and, in some specialized applications, often supported by reinforcement learning, have evolved to super-human abilities, e.g. in image recognition or in games such as Chess and Go. AlphaFold is one major, very productive outcome of these developments. At the same time, the still exponential increase in effective computing power, achieved in particular through specialized hardware well suited to AI applications, highly optimized computer systems and distributed computing concepts, has driven and continues to drive the now widely visible success of AI methods.

In addition, sophisticated AI applications are usually designed to be highly iterative in their analysis of available data, while at the same time involving complex data-driven workflows/pipelines that can no longer be represented by monolithic applications. In the end, however, the mass availability of data of the most diverse provenance and discipline is also decisive for success. This data is brought together with data management technologies, and intelligent processes from data analytics and artificial intelligence will generate new knowledge from it in an efficient manner due to its broad availability, thus having a lasting influence on the value chain of the future.

The talk will summarize some major developments at TU Dresden in recent years, to give an idea of how we will structure science areas and infrastructure to cope with the challenges of digitization.

Pain point of the AI/ML ASIC market

IL Park

SK Hynix, South Korea

The Application-Specific Integrated Circuit (ASIC) market is expected to grow faster than any other semiconductor sector. According to major IT market research firms, AI/ML ASICs are expected to be a major computing engine for diverse AI/ML applications within four years. Example applications are Augmented Reality or Virtual Reality devices, Metaverse hardware systems, Blockchain systems, all IoT edge devices (smart devices), automotive systems (infotainment and autopilot), all AI edge devices, and datacenters. McKinsey forecasted that ASICs in 2025 would account for 40-50% of computation hardware in AI datacenters and 70% of computation hardware at the AI edge. Omdia projects the AI/ML ASIC market will reach 26 billion USD in 2025. This story is the bright side of the ASIC world.

Here comes the shadow side of the ASIC world. It is extremely difficult to buy a commercial-quality AI/ML ASIC product on the market. Commercial ASIC products accounted for only 1% of the AI/ML compute market for datacenters in 2020. Meanwhile ASICs made for in-house applications account for more than 5 market for datacenter in 2020. This statistic means that market demand for AI/ML ASICs is strong, but customers cannot find off-the-shelf products, so they have to build one themselves. I would like to discuss what is happening in the AI/ML ASIC world and why vendors cannot release a decent commercial product to the market.

k-Center Clustering with Outliers in Sliding Windows

Andrea Pietracaprina

University of Padova, Italy

Metric k-center clustering is a fundamental unsupervised learning primitive. Although widely used, this primitive is heavily affected by noise in the data, so a more sensible variant seeks the best solution that disregards a given number z of points of the dataset, which are called outliers. We describe efficient algorithms for this important variant in the streaming model under the sliding window setting, where, at each time step, the dataset to be clustered is the window W of the most recent data items. For general metric spaces, our algorithms achieve O(1) approximation and, remarkably, require a working memory linear in k+z and only logarithmic in |W|. For spaces of bounded doubling dimension, the approximation can be made arbitrarily close to 3. For these latter spaces, we show, as a by-product, how to estimate the effective diameter of the window W, which is a measure of the spread of the window points, disregarding a given fraction of noisy distances. We also provide experimental evidence of the practical viability of the improved clustering and diameter estimation algorithms.

Joint work with Paolo Pellizzoni and Geppino Pucci.
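
To make the objective concrete, the following is a minimal sketch of the k-center radius with z outliers, together with the classical farthest-first greedy heuristic for plain k-center. It is a sequential, in-memory illustration assuming Euclidean points; it is not the sliding-window algorithm of the talk, and the function names are illustrative.

```python
import math


def farthest_first_centers(points, k):
    """Classical greedy heuristic for plain k-center: start from the first
    point, then repeatedly add the point farthest from the current centers."""
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers


def radius_with_outliers(points, centers, z):
    """Largest distance from a point to its closest center after discarding
    the z farthest points (requires z < len(points))."""
    d = sorted(min(math.dist(p, c) for c in centers) for p in points)
    return d[len(points) - z - 1]


points = [(0, 0), (1, 0), (10, 0), (11, 0), (100, 0)]
greedy = farthest_first_centers(points, 2)           # picks the noisy point
print(radius_with_outliers(points, greedy, 0))       # radius without outliers
print(radius_with_outliers(points, [(0, 0), (10, 0)], 1))  # one outlier allowed
```

On this toy input the greedy heuristic spends a center on the isolated point (100, 0), which illustrates the sensitivity to noise; allowing z = 1 outlier lets both centers serve the two dense groups.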

The Katana Graph Intelligence Platform

Keshav Pingali

University of Texas at Austin and Katana Graph, USA

Knowledge Graphs power applications in diverse verticals such as FinTech, Pharma and InfoSec. Graphs with hundreds of billions of edges are not uncommon, and computations on such graphs include querying, analytics, and AI/ML. In many applications, it is necessary to combine these operations seamlessly to extract actionable intelligence as quickly as possible. Katana Graph is a start-up based in Austin and the Bay Area that is building a scale-out platform for seamless, high-performance computing on such graph data. I will describe the Katana Graph Intelligence Platform and the lessons I have learned in doing a startup based on academic research.

Scalable and Space-Efficient Robust Matroid Center Algorithms

Geppino Pucci

University of Padova, Italy

Given a dataset V of points from some metric space, a popular robust formulation of the k-center clustering problem asks to select k points (centers) of V which minimize the maximum distance of any point of V from its closest center, excluding the z most distant points (outliers) from the computation of the maximum. In this talk, we concentrate on an important constrained variant of the robust k-center problem, namely, the Robust Matroid Center (RMC) problem, where the set of returned centers is constrained to be an independent set of a matroid of rank k built on V. Instantiating the problem with the partition matroid yields a formulation of the fair k-center problem, which has attracted the interest of the ML community in recent years. We strive for accurate solutions of the RMC problem under general matroids, when confronted with large inputs. Specifically, we describe a coreset-based algorithm affording efficient sequential, distributed and streaming implementations. For any fixed ε>0, the algorithm returns solutions featuring a (3+ε)-approximation ratio, which is a mere additive term ε away from the 3-approximations achievable by the best known polynomial-time sequential algorithms. Interestingly, the algorithms obliviously adapt to the intrinsic complexity of the dataset, captured by its doubling dimension D. For wide ranges of the input parameters, our distributed/streaming implementations require two rounds/one pass and substantially sublinear local/working memory. We also discuss a set of experiments on real-world datasets, which provide clear evidence of the accuracy and efficiency of our algorithms and of their improved performance with respect to previous solutions.

Joint work with Matteo Ceccarello, Andrea Pietracaprina, and Federico Soldà.
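
The partition-matroid constraint behind fair k-center can be sketched as a simple independence check: every point belongs to one group, and a set of centers is independent when no group contributes more centers than its cap. This is only an illustration of the constraint, not the coreset-based algorithm of the talk; the names `group_of` and `cap` are assumptions introduced here.

```python
from collections import Counter


def is_independent(centers, group_of, cap):
    """Return True if `centers` uses at most cap[g] points from each group g.

    group_of maps each point to its group label; cap maps each group label
    to the maximum number of centers allowed from that group (0 if absent).
    """
    counts = Counter(group_of[c] for c in centers)
    return all(counts[g] <= cap.get(g, 0) for g in counts)


group_of = {"a": "red", "b": "red", "c": "blue"}
cap = {"red": 1, "blue": 1}
print(is_independent(["a", "c"], group_of, cap))
print(is_independent(["a", "b"], group_of, cap))
```

When every group contains at least as many points as its cap, the rank of this matroid is the sum of the caps, which plays the role of k in the problem statement.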

Integration of Vendor-Agnostic Data Collection Interfaces in DCDB

Amir Raoofy

Technical University of Munich, Germany

Monitoring systems in HPC rely on low-level tools and interfaces for the data collection on systems, and in many cases, architecture- and vendor-specific interfaces, registers, and counters are used. Identifying the important registers and counters and how to interpret them is the main challenge, specifically when it comes to the monitoring of heterogeneous systems and components. On the other hand, architecture- and vendor-agnostic interfaces for exposing metrics and counters have the potential to take over the heavy lifting and, through that, simplify the development and maintenance of HPC monitoring systems.

DCDB, a community software package developed and maintained by LRZ for monitoring HPC systems, exposes a plugin-based architecture to collect metrics from system hardware, system software, and infrastructure. While it is fairly easy to implement such a plugin to leverage vendor-specific interfaces, understanding and configuring such interfaces and the available metrics remains demanding. Meanwhile, vendor-agnostic interfaces and tools, such as Variorum (for power measurement) and Likwid (for node-level performance counters), already address this. Therefore, developing a DCDB plugin that integrates interfaces such as Variorum and Likwid helps to collect metrics from various architectures and vendors using a single plugin while delegating most of the heavy lifting. In this talk, I will provide an overview of how to design and integrate such vendor-agnostic interfaces for data collection in DCDB.

Metadata for HPC and HTC for AI, ML, and data pipelines

Larry Rudolph

Two Sigma, New York, USA

Lineage metadata can improve the performance of HPC and HTC computing for ML production pipelines. We present three innovations: (i) D4N, a data caching scheme for a large number of similar ML pipelines in the cloud; (ii) efficient static and dynamic metadata capture for debugging these pipelines; and (iii) synthetic provenance graphs to protect sensitive information in these pipelines.

Metadata-driven systems are gaining popularity as a necessary component of data pipelines in general. Each of the five pillars (Lineage, Observability, Operations, Discovery and Governance) requires accurate, up-to-date, machine-readable metadata and is essential in addressing potentially disruptive, unexpected emergent properties of scalable ML pipelines.

Towards algorithm-architecture co-design for machine learning

Saday Sadayappan

University of Utah, USA

With the end of Moore's Law scaling of VLSI, hardware customization is seen as the primary means of achieving significant further improvements in per-chip performance and energy-efficiency. Several academic designs as well as a number of commercial accelerators have been developed for machine learning. The design space of accelerator parameters (e.g., the number of processors, register/buffer sizes at different levels in the memory hierarchy, bandwidths for on-chip and off-chip data movement) is very large, with significant implications on energy efficiency and performance achievable for key tensor operators used in machine learning. This talk discusses a model-driven co-design approach, where analytical modeling of data movement for distributed 2D convolution on an accelerator array is used to optimize accelerator parameters for energy and/or performance.

Possible Roads to Low Carbon AI ?!

Sven-Bodo Scholz

Radboud University, Nijmegen, The Netherlands

Since the inception of computers, we started creating ever bigger compute systems in order to solve ever bigger problems. Yet, the utilisation of HPC systems for large applications typically is below 10%. The increasing popularity of compute intensive AI applications keeps fuelling this trend: systems grow further and hardware utilisation does not get substantially better.

Currently, about 3% of our greenhouse gas emissions are attributable to computing. If we continue to build increasingly big compute systems and, at the same time, try to achieve our overall emission targets, the contribution of computing will rise to 14% of all emissions as early as 2040.

The only way to counteract this trend without limiting our compute ambitions is to increase the utilisation of our existing compute systems. This requires codes to be better adjusted to the executing hardware, both at compile time and at runtime.

In this talk, we present the opportunities and challenges that we have identified so far when trying to generate high-performance codes from functional array programs in the context of the SaC language project and its compiler ecosystem (www.sac-home.org).

Active Memory Architecture for Dynamic Graph Computing

Thomas Sterling

Indiana University, Bloomington, USA

The leading edge of High Performance Computing is challenged by the end of Moore's Law and new application demands in, among other areas, "AI" as the term is currently employed in the common lexicon. Special Purpose Devices and in some cases entirely new classes of systems have attracted significant investment in recent years with the intent of meeting the rapidly growing demand (at least in appearance) for supervised machine learning platforms. More generally, for both dynamic numeric and informatics problems, the basic data structure may not be sparse matrices but rather time-varying and irregular graphs. AMR and N-body numeric problems and unsupervised machine learning, contextual natural language processing, searches, sorting, hypothesis testing, decision making, and a host of NP-complete problems requiring non-deterministic but convergent solutions make up a wide array of present and future workflows requiring new large-scale solutions. A class of memory-centric architectures is emerging as a research focus to address both dimensions of the design and operation space. One such is the Active Memory Architecture, under development as an example of an innovative memory-centric system incorporating non-von Neumann architecture structures, semantic constructs, graph and runtime related overhead primitive mechanisms, and runtime resource management and task scheduling methods. This innovative technical strategy has been sponsored by NASA and is selected for support by IARPA/ARO. The fundamental principles and planned methods being undertaken with the intent of modeling, simulation, and evaluation will be presented along with a completed FPGA-based graph accelerator prototype.

Addressing Scaling Challenges in a High-level Accelerator Cluster Runtime System

Peter Thoman

University of Innsbruck, Austria

In contemporary large-scale HPC systems, the highest tiers of performance and efficiency are generally achieved by accelerator clusters, which are traditionally programmed with low-level or vendor-specific approaches such as MPI+CUDA. The Celerity runtime system provides a data-flow-centric high-productivity API for implementing HPC applications on such clusters, based on the established SYCL industry standard. It is designed to alleviate the development and maintenance burdens inherent in distributed memory systems as well as those introduced by accelerator programming.

A core feature of Celerity is the declarative specification of resource requirements in compute kernels with so-called "range mappers". Based on only this information, the Celerity system asynchronously builds and distributes a command graph at runtime, transparently splitting kernels across multiple nodes and performing the required data transfers.

In order to implement this automatic data dependency computation and command generation, the runtime system needs to precisely track the state of each distributed data buffer in the system. This imposes challenges on the algorithms and data structures employed, both when scaling deeper -- i.e. more time steps and more complex data access patterns -- and when scaling wider towards more parallelism.

In this talk, we will present several of the techniques developed over the past few years of Celerity development in order to address these challenges. A particular focus will be placed on two unique constructs: Horizons and Epochs. These allow capping the depth complexity axis of data structures with a freely configurable trade-off between structural complexity and the level of asynchronicity possible during execution.

Machine learning techniques for optimally exploiting your compute power: Real benefit or additional overhead?

Carsten Trinitis

Technical University of Munich, Germany

The talk will outline the idea of the SenSE project, which aims at using machine learning techniques to optimally balance the workload between cloud and edge. For analysing huge data sets from thermoacoustic sensors, machine learning and data distillation techniques will be used to reduce the amount of data such that the compute power on the edge side can be (partly) used for analysis, thus eliminating the need to transfer huge amounts of redundant data to the cloud. In addition, the workload on the cloud is reduced, resulting in more efficient use of HPC compute power. However, applying these techniques imposes an overhead on the edge side as well, so the goal is to find the optimal trade-off. In addition, another project on FPGA-based image detection will be briefly discussed.

Improving the efficiency of ML/AI applications

Ana Lucia Varbanescu

University of Twente, Enschede, Netherlands

ML/AI workloads have quickly become major consumers of computing cycles in many systems. However, the efficiency achieved by such workloads on various computing systems can vary significantly. In this talk, we present three workload characterization case studies that demonstrate how AI workloads use existing systems, and argue for better metrics to analyse their efficiency. We further discuss our modeling efforts for these workloads, and suggest system-level improvements for higher efficiency and better system utilization for AI workloads.

Knocking on Post Von Neumann's door: a research perspective on irregular applications

Flavio Vella

University of Trento, Italy

Many emerging workloads in data science and AI require low response time and efficiency to process a high volume of data, possibly with low precision. Typical von Neumann architectures operate on data fetched from memory; despite their flexibility, they are highly inefficient for emerging workloads and, in particular, for memory-bound irregular applications.

Although traditional architectures and accelerators still represent the mainstream computing systems for real-world workloads, post-von Neumann architectures, ranging from simple AI domain-specific accelerators to quantum systems, have recently been springing up.

Thus, during this possible transition, we need to investigate the challenges and opportunities that such emerging systems and their organization offer, in terms of both system programming and computational models, to achieve performance, portability, and productivity.

In this seminar, we will discuss recent advances in parallel irregular applications, such as graph processing, on modern parallel architectures.

Data loading in Digital Pathology image classification

Francesco Versaci

CRS4, Cagliari, Italy

Motivated by automated image annotation of prostate tumor, an important problem in Digital Pathology, we revisit the data loading pipeline for the general image classification problem. We present a data loader that enables images to be securely loaded from an Apache Cassandra NoSQL DB, over high-latency networks, while maintaining a throughput comparable to using local filesystems. We finally discuss the work in progress to integrate our data loader as a plugin for NVIDIA DALI, to support most deep learning frameworks.

Joint work with G. Busonera.

Giving away control for Scalability - Incremental Improvements for legacy HPC Programming Models

Josef Weidendorfer

Technical University of Munich, Germany

In legacy HPC programming models like MPI, the programmer is in full control of communication and workload distribution. This becomes a burden on recent complex heterogeneous HPC systems. There, orchestration and scheduling are better done by sophisticated runtimes, guided by system-side information. In this talk, I will discuss how a programming model which still feels similar to MPI can provide significant improvements without going the full path towards task-based frameworks where runtimes are in full control.