    An energy-efficient virtual channel power-gating mechanism for on-chip networks

    Article Proceedings - Design, Automation and Test in Europe, DATE, 9 March 2015 through 13 March 2015 ; Volume 2015-April , March , 2015 , Pages 1527-1532 ; 15301591 (ISSN) ; 9783981537048 (ISBN) Mirhosseini, A ; Sadrosadati, M ; Fakhrzadehgan, A ; Modarressi, M ; Sarbazi Azad, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2015
    Abstract
    Power-gating is a promising method for reducing the leakage power of digital systems. In this paper, we propose a novel power-gating scheme for virtual channels (VCs) in on-chip networks that uses an adaptive method to dynamically adjust the number of active VCs based on the on-chip traffic characteristics. Since virtual channels are used to provide higher throughput under high traffic loads, our method sets the number of virtual channels at each port selectively based on the workload demand, and therefore does not negatively affect performance. Evaluation results show that by using this scheme, about 40% average reduction in static power consumption can be achieved with negligible performance overhead.

    Quantifying the difference in resource demand among classic and modern NoC workloads

    Article Proceedings of the 34th IEEE International Conference on Computer Design, ICCD 2016, 2 October 2016 through 5 October 2016 ; 2016 , Pages 404-407 ; 9781509051427 (ISBN) Mirhosseini, A ; Sadrosadati, M ; Zare, M ; Sarbazi Azad, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2016
    Abstract
    This paper quantifies the difference in resource demand between modern and classic NoC workloads. In the paper, we show that modern workloads are able to better utilize higher numbers of VCs and smaller C factors in order to attain performance and energy efficiency. This is because of the high throughput and possible local congestion in their traffic pattern. As a result, such workloads are more suitable for concurrency and redundancy energy reduction techniques, where the voltage and frequency are reduced simultaneously and the increased power budget is used for introducing additional resources to the network in order to improve the performance.

    BiNoCHS: bimodal network-on-chip for CPU-GPU heterogeneous systems

    Article 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, 19 October 2017 through 20 October 2017 ; 2017 ; 9781450349840 (ISBN) Mirhosseini, A ; Sadrosadati, M ; Soltani, B ; Sarbazi Azad, H ; Wenisch, T. F ; Sharif University of Technology
    Abstract
    CPU-GPU heterogeneous systems are emerging as architectures of choice for high-performance energy-efficient computing. Designing on-chip interconnects for such systems is challenging; CPUs typically benefit greatly from optimizations that reduce latency, but rarely saturate bandwidth or queueing resources. In contrast, GPUs generate intense traffic that produces local congestion, harming CPU performance. Congestion-optimized interconnects can mitigate this problem through larger virtual and physical channel resources. However, when there is little traffic, such networks become suboptimal due to higher unloaded packet latencies and critical path delays. We argue for a reconfigurable network... 

    Neda: supporting direct inter-core neighbor data exchange in GPUs

    Article IEEE Computer Architecture Letters ; Volume 17, Issue 2 , 2018 , Pages 225-229 ; 15566056 (ISSN) Nematollahi, N ; Sadrosadati, M ; Falahati, H ; Barkhordar, M ; Sarbazi Azad, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2018
    Abstract
    Image processing applications employ various filters for several purposes, such as enhancing the images and extracting the features. Recent studies show that filters in image processing applications take a substantial amount of the execution time, and it is crucial to boost their performance to improve the overall performance of the image processing applications. Image processing filters require a significant amount of data sharing among threads which are in charge of filtering neighbor pixels. Graphics Processing Units (GPUs) attempt to satisfy the demand of data sharing by providing the scratch-pad memory, shuffle instructions, and on-chip caches. However, we observe that these mechanisms... 

    Energy-Efficient permanent fault tolerance in hard real-time systems

    Article IEEE Transactions on Computers ; 2019 ; 00189340 (ISSN) Mireshghallah, F ; Bakhshalipour, M ; Sadrosadati, M ; Sarbazi Azad, H ; Sharif University of Technology
    IEEE Computer Society  2019
    Abstract
    Triple Modular Redundancy (TMR) is a long-established approach for masking various kinds of faults. By employing redundancy and analyzing the results of three separate executions of the same program, TMR is able to attain excellent levels of reliability. While TMR provides a desirable level of reliability, it suffers from the high power consumption of the redundant hardware, a severe detriment to its broad adoption. The energy consumption of TMR can be mitigated if its operations are divided into two stages, and one stage is dropped in the absence of faults. Such an approach, which is evaluated in recent research, however, quickly fails in the presence of permanent faults, as we... 

    Data-Aware compression of neural networks

    Article IEEE Computer Architecture Letters ; Volume 20, Issue 2 , 2021 , Pages 94-97 ; 15566056 (ISSN) Falahati, H ; Peyro, M ; Amini, H ; Taghian, M ; Sadrosadati, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2021
    Abstract
    Deep neural networks (DNNs) are getting deeper and larger, which intensifies the data movement and compute demands. Prior work focuses on reducing data movements and computation through exploiting sparsity and similarity. However, prior approaches focus only on sparsity and weight similarity, and none exploits input similarity. Synergistically analysing the similarity and sparsity of inputs and weights, we show that memory accesses and computations can be reduced by 5.7× and 4.1×, more than what can be decreased by exploiting only sparsity, and 3.9× and 2.1×, more than what can be decreased by exploiting only weight similarity. We propose a new data-aware compression approach, called DANA, to effectively... 

    Efficient nearest-neighbor data sharing in GPUs

    Article ACM Transactions on Architecture and Code Optimization ; Volume 18, Issue 1 , 2021 ; 15443566 (ISSN) Nematollahi, N ; Sadrosadati, M ; Falahati, H ; Barkhordar, M ; Drumond, M. P ; Sarbazi Azad, H ; Falsafi, B ; Sharif University of Technology
    Association for Computing Machinery  2021
    Abstract
    Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data... 

    NURA: A framework for supporting non-uniform resource accesses in GPUs

    Article Proceedings of the ACM on Measurement and Analysis of Computing Systems ; Volume 6, Issue 1 , 2022 ; 24761249 (ISSN) Darabi, S ; Mahani, N ; Baxishi, H ; Yousefzadeh Asl Miandoab, E ; Sadrosadati, M ; Sarbazi Azad, H ; Sharif University of Technology
    Association for Computing Machinery  2022
    Abstract
    Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize GPU resources, is still challenging. Some pieces of prior work (e.g., spatial multitasking) have limited opportunity to improve resource utilization, while other works, e.g., simultaneous multi-kernel, provide fine-grained resource sharing at the price of unfair execution. This paper proposes a new multi-application paradigm for GPUs, called NURA, that provides high potential to improve resource utilization and ensures fairness and Quality-of-Service (QoS). The key idea is that each streaming multiprocessor (SM) executes Cooperative Thread Arrays (CTAs) that belong to only one application (similar to the... 

    NURA: A framework for supporting non-uniform resource accesses in GPUs

    Article Performance Evaluation Review ; Volume 50, Issue 1 , 2022 , Pages 39-40 ; 01635999 (ISSN) Darabi, S ; Mahani, N ; Baxishi, H ; Yousefzadeh, E ; Sadrosadati, M ; Sarbazi Azad, H ; Sharif University of Technology
    Association for Computing Machinery  2022
    Abstract
    Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize GPU resources, is still challenging. Some pieces of prior work (e.g. spatial multitasking) have limited opportunity to improve resource utilization, while others, e.g. simultaneous multi-kernel, provide fine-grained resource sharing at the price of unfair execution. This paper proposes a new multi-application paradigm for GPUs, called NURA, that provides high potential to improve resource utilization and ensure fairness and Quality-of-Service (QoS). The key idea is that each streaming multiprocessor (SM) executes the Cooperative Thread Arrays (CTAs) that belong to only one application (similar to... 

    NURA: A framework for supporting non-uniform resource accesses in GPUs

    Article 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/PERFORMANCE 2022, 6 June 2022 through 10 June 2022 ; 2022 , Pages 39-40 ; 9781450391412 (ISBN) Darabi, S ; Mahani, N ; Baxishi, H ; Yousefzadeh, E ; Sadrosadati, M ; Sarbazi Azad, H ; ACM SIGMETRICS ; Sharif University of Technology
    Association for Computing Machinery, Inc  2022
    Abstract
    Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize GPU resources, is still challenging. Some pieces of prior work (e.g. spatial multitasking) have limited opportunity to improve resource utilization, while others, e.g. simultaneous multi-kernel, provide fine-grained resource sharing at the price of unfair execution. This paper proposes a new multi-application paradigm for GPUs, called NURA, that provides high potential to improve resource utilization and ensure fairness and Quality-of-Service (QoS). The key idea is that each streaming multiprocessor (SM) executes the Cooperative Thread Arrays (CTAs) that belong to only one application (similar to... 

    Focus on What is Needed: Area and Power Efficient FPGAs Using Turn-Restricted Switch Boxes

    Article 18th IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2019, 15 July 2019 through 17 July 2019 ; Volume 2019-July , 2019 , Pages 615-620 ; 21593469 (ISSN) ; 9781538670996 (ISBN) Serajeh Hassani, F ; Sadrosadati, M ; Pointner, S ; Wille, R ; Sarbazi Azad, H ; Technical Committee on VLSI (TCVLSI) of IEEE Computer Society (CS) ; Sharif University of Technology
    IEEE Computer Society  2019
    Abstract
    Field-Programmable Gate Arrays (FPGAs) employ a significant number of SRAM cells in order to provide a flexible routing architecture. While this flexibility allows for a rather easy realization of arbitrary functionality, the cells it requires significantly increase the area and power consumption of the FPGA. At the same time, it can be observed that full routing flexibility is frequently not needed in order to efficiently realize the desired functionality. In this work, we propose an FPGA realization which focuses on what is needed and realizes only a subset of the possible routing options using what we call Turn-Restricted Switch-Boxes. While this may yield a slight...