Portfolio item number 1
Published:
Short description of portfolio item number 1
Published:
Short description of portfolio item number 2
Published in ISPA, 2017
This paper presents an efficient OpenCL implementation on an ARMv8 multi-core CPU.
Recommended citation: Jianbin Fang, Peng Zhang, Tao Tang, Chun Huang, Canqun Yang. "Implementing and Evaluating OpenCL on an ARMv8 Multi-Core CPU." ISPA. 2017. http://jianbinfang.github.io/files/2017-12-12-ocl2ft.pdf
Published in Parallel Computing, 2018
This paper presents a suite of benchmarks to measure the capability of the GPU memory system.
Recommended citation: Minquan Fang, Jianbin Fang, Weimin Zhang, Haifang Zhou, Jianxing Liao, Yuangang Wang. (2018). "Benchmarking the GPU memory at the warp level." Parallel Computing. 71:23-41. http://jianbinfang.github.io/files/2018-01-18-wbench.pdf
Published in FGCS, 2018
This paper presents an efficient and portable ALS solver for sparse matrix factorization in recommender systems.
Recommended citation: Jing Chen, Jianbin Fang, Weifeng Liu, Tao Tang, Canqun Yang. "clMF: A Fine-Grained and Portable Alternating Least Squares Algorithm for Parallel Matrix Factorization." FGCS. 2018. http://jianbinfang.github.io/files/2018-04-24-clmf.pdf
Published in CF, 2018
This paper presents an efficient OpenCL implementation on Matrix-2000.
Recommended citation: Peng Zhang, Jianbin Fang, Canqun Yang, Tao Tang, Chun Huang, Zheng Wang. "MOCL: An Efficient OpenCL Implementation for the Matrix-2000 Architecture." CF. 2018. http://jianbinfang.github.io/files/2018-03-15-mocl.pdf
Published in IPDPS, 2018
This paper uses machine learning to auto-tune the performance of streamed applications on Intel Xeon Phi.
Recommended citation: Peng Zhang, Jianbin Fang, Tao Tang, Canqun Yang, Zheng Wang. "Auto-tuning Streamed Applications on Intel Xeon Phi." IPDPS. 2018. http://jianbinfang.github.io/files/2018-01-22-mlstream.pdf
Published in HPCC, 2018
This paper presents adaptive optimization of Sparse Matrix-Vector Multiplication on two emerging many-core architectures.
Recommended citation: Shizhao Chen, Jianbin Fang, Donglin Chen, Chuanfu Xu, Zheng Wang. "Adaptive Optimization of Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures." HPCC. 2018. http://jianbinfang.github.io/files/2018-06-28-aspmv.pdf
Published in IJPP, 2018
This paper evaluates the performance of SpMV with five sparse storage formats on an ARMv8-based many-core processor.
Recommended citation: Donglin Chen, Jianbin Fang, Shizhao Chen, Chuanfu Xu, Zheng Wang. "Optimizing Sparse Matrix-Vector Multiplications on An ARMv8-based Many-Core Architecture." IJPP. 2018. http://jianbinfang.github.io/files/2018-09-11-ijpp.pdf
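The SpMV kernels these studies evaluate share the same basic access pattern; as an illustration (not code from the paper, which compares five storage formats on real hardware), here is SpMV over CSR (Compressed Sparse Row), the canonical baseline format:

```python
# Minimal pure-Python sketch of sparse matrix-vector multiplication (SpMV)
# in CSR format. Illustrative only; optimized kernels vectorize and tile this.

def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a CSR matrix A with len(row_ptr)-1 rows."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        # Nonzeros of row i live in vals[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 3.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

The irregular, input-dependent accesses to `x` through `col_idx` are what makes SpMV performance so sensitive to the storage format and the memory system, which is the tuning space these papers explore.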
Published in HPCC, 2019
This paper presents an empirical approach to choose and switch MPI communication algorithms at runtime to optimize the application performance.
Recommended citation: Wenxu Zheng, Jianbin Fang, Juan Chen, et. al. "Auto-tuning MPI Collective Operations on Large-Scale Parallel Systems." HPCC. 2019. http://jianbinfang.github.io/files/2019-05-16-hpcc.pdf
Published in TPDS, 2019
This paper proposes GENIE, a QoS-aware dynamic scheduling framework for shared GPU clusters, which achieves users' QoS guarantees and high system utilization.
Recommended citation: Zhaoyun Chen, Wei Quan, Mei Wen, Jianbin Fang, Jie Yu, Chunyuan Zhang, Lei Luo. "Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters." TPDS. 2019. http://jianbinfang.github.io/files/2019-07-29-tpds.pdf
Published in IJPP, 2019
This paper presents a quantitative study for characterizing the scalability of sparse matrix-vector multiplications (SpMV) on Phytium FT-2000+, an ARM-based HPC many-core architecture.
Recommended citation: Donglin Chen, Jianbin Fang, Chuanfu Xu, Shizhao Chen, Zheng Wang. "Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+." IJPP. 2019. http://jianbinfang.github.io/files/2019-11-03-ijpp.pdf
Published in TPDS, 2020
This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures.
Recommended citation: Peng Zhang, Jianbin Fang, Canqun Yang, Chun Huang, Tao Tang, Zheng Wang. "Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures." TPDS. 2020. http://jianbinfang.github.io/files/2020-02-27-tpds.pdf
Published in CCF THPC, 2020
This is a survey article on parallel programming models for heterogeneous many-core architectures.
Recommended citation: Jianbin Fang, Chun Huang, Tao Tang, Zheng Wang. "Parallel Programming Models for Heterogeneous Many-Cores: A Comprehensive Survey." CCF THPC. 2020. http://jianbinfang.github.io/files/2020-04-12-ccf-thpc.pdf
Published in CCF ACA, 2020
This article dissects the memory system of the Phytium 2000+ many-core with microbenchmarks.
Recommended citation: Wanrong Gao, Jianbin Fang, Chuanfu Xu, Chun Huang. "Dissecting the Phytium 2000+ Memory Hierarchy via Microbenchmarking." CCF ACA. 2020. http://jianbinfang.github.io/files/2020-05-11-ccf-aca.pdf
Published in PACT, 2020
This paper presents POEM, a novel framework that automatically learns useful code representations from graph-based program structures. At the core of POEM is a new graph neural network (GNN), which is specially designed for capturing the syntax and semantic information from the program abstract syntax tree and the control and data flow graph.
Recommended citation: Guixin Ye, Zhanyong Tang, Huanting Wang, Jianbin Fang, Songfang Huang, Zheng Wang. "Deep Program Structure Modeling Through Multi-Relational Graph-based Learning." PACT. 2020. http://jianbinfang.github.io/files/2020-07-16-pact.pdf
Published in NPC, 2020
This paper presents a NUMA-aware optimization technique for the SpMV operation on the Phytium 2000+ architecture.
Recommended citation: Xiaosong Yu, Huihui Ma, Zhengyu Qu, Jianbin Fang, Weifeng Liu. "NUMA-Aware Optimization of Sparse Matrix-Vector Multiplication on ARMv8-based Many-Core Architectures." NPC. 2020. http://jianbinfang.github.io/files/2020-08-21-npc.pdf
Published in ICTAI, 2020
Existing DL-based models have to be re-trained whenever the flow condition changes, which incurs significant training overhead for real-life scenarios with a wide range of flow conditions. This paper presents FLOWGAN, a novel conditional generative adversarial network for accurate prediction of flow fields in various conditions. FLOWGAN directly generates flow-field solutions for various conditions from observations, with no re-training required.
Recommended citation: Donglin Chen, Xiang Gao, Chuanfu Xu, Shizhao Chen, Jianbin Fang, Zhenghua Wang, Zheng Wang. "FlowGAN: A Conditional Generative Adversarial Network for Flow Prediction in Various Conditions." ICTAI. 2020. http://jianbinfang.github.io/files/2020-09-03-ictai.pdf
Published in TST, 2020
This article develops a novel resource allocation scheme for memory-bound applications running on High-Performance Computing (HPC) clusters, aiming to improve application performance without breaching peak power constraints and total energy consumption.
Recommended citation: Juan Chen, Xinxin Qi, Feihao Wu, Jianbin Fang, Yong Dong, Yuan Yuan, Zheng Wang, and Keqin Li. "More Bang for Your Buck: Boosting Performance with Capped Power Consumption." TST. 2020. http://jianbinfang.github.io/files/2020-11-01-tst.pdf
Published in JCST, 2020
This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that impact the high-performance computing applications.
Recommended citation: Jianbin Fang, Xiangke Liao, Chun Huang, Dezun Dong. "Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+." JCST. 2020. http://jianbinfang.github.io/files/2020-12-02-jcst.pdf
Published in IPDPS-2021, 2020
General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing. There is a large body of work on evaluating and optimizing large-scale matrix multiplication, but how well small-scale matrix multiplication (SMM) performs is largely unknown, especially on ARMv8-based many-core architectures. In this work, we evaluate and characterize the performance of SMM subroutines on Phytium 2000+, an ARMv8-based 64-core architecture. The evaluation is performed extensively with the mainstream open-source libraries, including OpenBLAS, BLIS, BLASFEO, and Eigen. Across various experimental settings, we observe how well the small-scale GEMM routines perform on Phytium 2000+, and then discuss the factors behind the performance behaviours of SMM.
Recommended citation: Weiling Yang, Jianbin Fang, Dezun Dong. "Characterizing Small-Scale Matrix Multiplications on ARMv8-based Many-Core Architectures." IPDPS. 2021. http://jianbinfang.github.io/files/2020-12-11-ipdps.pdf
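For reference, the operation these SMM routines implement is the plain GEMM product C = A·B; a naive sketch (illustrative only, not from any of the libraries evaluated, which use blocked and vectorized kernels) shows the baseline semantics:

```python
# Naive reference GEMM for row-major matrices stored as lists of lists.
# For small matrices, the packing and setup overhead of library kernels
# can rival this triple loop, which is the effect the paper studies.

def gemm(A, B):
    """Compute C = A @ B; A is m x k, B is k x n."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(gemm(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```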
Published in TPDS, 2021
This article presents an efficient implementation of the alternating least squares (ALS) algorithm, called BALS, built on top of a new sparse matrix format for parallel matrix factorization. Note that the reviewing process took around three years, spanning from April 2, 2018 to March 1, 2021, the longest I have ever seen.
Recommended citation: Jing Chen, Jianbin Fang, Weifeng Liu, Canqun Yang. "BALS: Blocked Alternating Least Squares for Parallel Sparse Matrix Factorization on GPUs." TPDS. 2021. http://jianbinfang.github.io/files/2021-03-01-tpds.pdf
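As a toy illustration of the alternating least squares idea behind BALS and clMF (a drastic simplification: rank 1, dense input, closed-form scalar updates, no blocking or GPU parallelism), one can alternate least-squares fits of the row factor and the column factor:

```python
# Toy rank-1 ALS: approximate a dense matrix R by an outer product u * v^T.
# Each half-step is the closed-form least-squares solution with the other
# factor held fixed. Illustration only, not the paper's algorithm.

def als_rank1(R, iters=20):
    m, n = len(R), len(R[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Fix v, solve for each u[i]: minimizes sum_j (R[i][j] - u[i]*v[j])^2.
        vv = sum(x * x for x in v)
        u = [sum(R[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # Fix u, solve for each v[j] symmetrically.
        uu = sum(x * x for x in u)
        v = [sum(R[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
    return u, v

# A genuinely rank-1 matrix, R[i][j] = (i+1)*(j+1), is recovered exactly.
R = [[(i + 1) * (j + 1) for j in range(3)] for i in range(2)]
u, v = als_rank1(R)
approx = [[u[i] * v[j] for j in range(3)] for i in range(2)]
```

With higher ranks each update becomes a small linear system per row/column, and it is the sparsity pattern of R that drives the storage-format and blocking choices the paper makes.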
Published in FITEE, 2021
In this paper, we propose FlowDNN, a novel deep neural network (DNN) to efficiently learn flow representations from CFD results. FlowDNN saves computational time by directly predicting the expected flow fields based on given flow conditions and geometry shapes. FlowDNN is the first DNN that incorporates the underlying physical conservation laws of fluid dynamics with a carefully designed attention mechanism for steady flow prediction. This approach not only improves the prediction accuracy but also preserves the physical consistency of the predicted flow fields, which is essential for CFD.
Recommended citation: Donglin Chen, Xiang Gao, Chuanfu Xu, Siqi Wang, Shizhao Chen, Jianbin Fang, Zheng Wang. "FlowDNN: a physics-informed deep neural network for fast and accurate flow prediction." FITEE. 2021. http://jianbinfang.github.io/files/2021-05-04-fitee.pdf
Published in SC, 2021
This article presents LibShalom, an open-source library for optimizing small and irregular-shaped GEMMs, explicitly targeting the ARMv8 architecture. LibShalom builds upon the classical Goto algorithm but tailors it to minimize the expensive memory-access overhead of data packing and processing small matrices. It uses analytic methods to determine GEMM kernel optimization parameters, enhancing the computation and parallelization efficiency of the GEMM kernels.
Recommended citation: Weiling Yang, Jianbin Fang, Dezun Dong, Xing Su, Zheng Wang. "LibShalom: Optimizing Small and Irregular-shaped Matrix Multiplications on ARMv8 Multi-Core." SC. 2021. http://jianbinfang.github.io/files/2021-06-22-sc.pdf
Published in Cluster, 2021
This paper presents the first comprehensive performance study on OpenMP barrier implementations on emerging ARMv8-based many-cores. We evaluate seven representative barrier algorithms on three distinct ARMv8 architectures: Phytium 2000+, ThunderX2, and Kunpeng920. We empirically show that the existing synchronization implementations exhibit poor scalability on ARMv8 architectures compared to the x86 counterpart. We then propose various optimization strategies for improving these widely used synchronization algorithms on each platform.
Recommended citation: Wanrong Gao, Jianbin Fang, Chun Huang, Chuanfu Xu, Zheng Wang. "Optimizing Barrier Synchronization on ARMv8 Many-Core Architectures." Cluster. 2021. http://jianbinfang.github.io/files/2021-07-06-cluster.pdf
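As a sketch of the kind of algorithm such barrier studies compare, here is a centralized counter-based barrier, one of the classic designs, written in illustrative Python (real HPC barriers are built from atomics and spin/futex waits, not a condition variable):

```python
import threading

class CentralBarrier:
    """Centralized (counter-based) barrier: the last arriving thread
    releases the whole generation. Sketch only, not the paper's code."""

    def __init__(self, nthreads):
        self.nthreads = nthreads
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.nthreads:
                # Last arrival: reset and wake everyone in this generation.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                while gen == self.generation:
                    self.cond.wait()

# Usage: phases stay in lockstep, so all phase-p appends precede phase-p+1.
results = []
barrier = CentralBarrier(4)

def worker(tid):
    for phase in range(3):
        barrier.wait()
        results.append((phase, tid))

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The single shared counter is exactly what scales poorly as core counts grow, which motivates the tree- and tournament-style alternatives evaluated in work like this.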
Published in JCST, 2021
This paper presents a comprehensive study evaluating cache architecture design on three representative ARMv8 multi-cores: Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop wrBench, a micro-benchmark suite that measures the realized latency and bandwidth of caches at different levels of the memory hierarchy when performing core-to-core communication.
Recommended citation: Wanrong Gao, Jianbin Fang, Chun Huang, Chuanfu Xu, Zheng Wang. "wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems." JCST. 2021. http://jianbinfang.github.io/files/2021-09-02-jcst.pdf
Published in HPCC, 2021
This paper presents a study of OpenMP synchronization implementations on two representative ARMv8 multi-core architectures, Phytium 2000+ and ThunderX2, considering the various OpenMP synchronization mechanisms offered by two mainstream OpenMP compilers, GCC and LLVM.
Recommended citation: Pengyu Wang, Wanrong Gao, Jianbin Fang, Chun Huang, Zheng Wang. "Characterizing OpenMP Synchronization Implementations on ARMv8 Multi-Cores." HPCC. 2021. http://jianbinfang.github.io/files/2021-10-24-hpcc.pdf
Graduate course, National University of Defense Technology, College of Computer Science, 2017
We aim to teach graduate students advanced compiler technologies.
Graduate course, National University of Defense Technology, College of Computer Science, 2018
We aim to teach graduate students advanced compiler technologies.