Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

Blog Post number 4

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 1

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

portfolio

publications

Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters

Published in TPDS, 2019

This paper proposes GENIE, a QoS-aware dynamic scheduling framework for shared GPU clusters, which achieves users' QoS guarantees and high system utilization.

Recommended citation: Zhaoyun Chen, Wei Quan, Mei Wen, Jianbin Fang, Jie Yu, Chunyuan Zhang, Lei Luo. "Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters." TPDS. 2019. http://jianbinfang.github.io/files/2019-07-29-tpds.pdf

Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+

Published in IJPP, 2019

This paper presents a quantitative study for characterizing the scalability of sparse matrix-vector multiplications (SpMV) on Phytium FT-2000+, an ARM-based HPC many-core architecture.

Recommended citation: Donglin Chen, Jianbin Fang, Chuanfu Xu, Shizhao Chen, Zheng Wang. "Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+." IJPP. 2019. http://jianbinfang.github.io/files/2019-11-03-ijpp.pdf
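For readers unfamiliar with the kernel being characterized, here is a minimal pure-Python sketch of sparse matrix-vector multiplication over the Compressed Sparse Row (CSR) format. This is only an illustration of the SpMV computation itself, not the optimized implementations evaluated on Phytium FT-2000+ in the paper.

```python
# Minimal CSR SpMV sketch: y = A @ x with A stored as (values, col_idx, row_ptr).

def spmv_csr(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Non-zeros of row i live in values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

The irregular, input-dependent accesses to x via col_idx are what make SpMV scalability sensitive to the memory subsystem, which is the focus of the study.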

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

Published in TPDS, 2020

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures.

Recommended citation: Peng Zhang, Jianbin Fang, Canqun Yang, Chun Huang, Tao Tang, Zheng Wang. "Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures." TPDS. 2020. http://jianbinfang.github.io/files/2020-02-27-tpds.pdf

Deep Program Structure Modeling Through Multi-Relational Graph-based Learning

Published in PACT, 2020

This paper presents POEM, a novel framework that automatically learns useful code representations from graph-based program structures. At the core of POEM is a new graph neural network (GNN), which is specially designed for capturing the syntax and semantic information from the program abstract syntax tree and the control and data flow graph.

Recommended citation: Guixin Ye, Zhanyong Tang, Huanting Wang, Jianbin Fang, Songfang Huang, Zheng Wang. "Deep Program Structure Modeling Through Multi-Relational Graph-based Learning." PACT. 2020. http://jianbinfang.github.io/files/2020-07-16-pact.pdf

FlowGAN: A Conditional Generative Adversarial Network for Flow Prediction in Various Conditions

Published in ICTAI, 2020

Existing DL-based models have to be re-trained whenever the flow condition changes, which incurs significant training overhead for real-life scenarios with a wide range of flow conditions. This paper presents FLOWGAN, a novel conditional generative adversarial network for accurate prediction of flow fields in various conditions. FLOWGAN is designed to directly obtain the generation of solutions to flow fields in various conditions based on observations rather than re-training.

Recommended citation: Donglin Chen, Xiang Gao, Chuanfu Xu, Shizhao Chen, Jianbin Fang, Zhenghua Wang, Zheng Wang. "FlowGAN: A Conditional Generative Adversarial Network for Flow Prediction in Various Conditions." ICTAI. 2020. http://jianbinfang.github.io/files/2020-09-03-ictai.pdf

More Bang for Your Buck: Boosting Performance with Capped Power Consumption

Published in TST, 2020

This article develops a novel resource allocation scheme for memory-bound applications running on High-Performance Computing (HPC) clusters, aiming to improve application performance without breaching peak power constraints and total energy consumption.

Recommended citation: Juan Chen, Xinxin Qi, Feihao Wu, Jianbin Fang, Yong Dong, Yuan Yuan, Zheng Wang, and Keqin Li. "More Bang for Your Buck: Boosting Performance with Capped Power Consumption." TST. 2020. http://jianbinfang.github.io/files/2020-11-01-tst.pdf

Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+

Published in JCST, 2020

This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-based 64-core architecture. We focus on the cache and memory subsystems, analyzing the characteristics that impact the high-performance computing applications.

Recommended citation: Jianbin Fang, Xiangke Liao, Chun Huang, Dezun Dong. "Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+." JCST. 2020. http://jianbinfang.github.io/files/2020-12-02-jcst.pdf

Characterizing Small-Scale Matrix Multiplications on ARMv8-based Many-Core Architectures

Published in IPDPS, 2021

General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing. There is a large body of work on evaluating and optimizing large-scale matrix multiplication, but how well small-scale matrix multiplication (SMM) performs is largely unknown, especially on ARMv8-based many-core architectures. In this work, we evaluate and characterize the performance of SMM subroutines on Phytium 2000+, an ARMv8-based 64-core architecture. The evaluation is performed extensively with the mainstream open-source libraries, including OpenBLAS, BLIS, BLASFEO, and Eigen. Given various experimental settings, we observe how well the small-scale GEMM routines perform on Phytium 2000+, and then discuss the factors behind the performance behaviours of SMM.

Recommended citation: Weiling Yang, Jianbin Fang, Dezun Dong. "Characterizing Small-Scale Matrix Multiplications on ARMv8-based Many-Core Architectures." IPDPS. 2021. http://jianbinfang.github.io/files/2020-12-11-ipdps.pdf
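To fix notation, here is a textbook triple-loop GEMM (C = A @ B) in pure Python. The small-scale GEMMs (SMM) studied in the paper are calls like this with tiny m, n, k, where per-call library overheads can dominate; the tuned kernels in OpenBLAS, BLIS, BLASFEO, and Eigen are of course far more elaborate.

```python
# Naive dense GEMM sketch: C[m][n] = A[m][k] @ B[k][n].

def gemm(A, B):
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a = A[i][p]  # hoist A[i][p]; inner loop streams over a row of B
            for j in range(n):
                C[i][j] += a * B[p][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(gemm(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```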

BALS: Blocked Alternating Least Squares for Parallel Sparse Matrix Factorization on GPUs

Published in TPDS, 2021

This article presents an efficient implementation of the alternating least squares (ALS) algorithm, called BALS, built on top of a new sparse matrix format for parallel matrix factorization. Note that the reviewing process took around three years, spanning from April 2, 2018 to March 1, 2021, the longest I have ever seen.

Recommended citation: Jing Chen, Jianbin Fang, Weifeng Liu, Canqun Yang. "BALS: Blocked Alternating Least Squares for Parallel Sparse Matrix Factorization on GPUs." TPDS. 2021. http://jianbinfang.github.io/files/2021-03-01-tpds.pdf
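As a reminder of the underlying algorithm, here is a rank-1 alternating least squares sketch in pure Python: fix one factor vector and solve the other in closed form, then alternate. This is only the serial textbook idea at rank 1; BALS handles sparse ratings matrices at higher rank on GPUs with a custom sparse format.

```python
# Rank-1 ALS sketch: approximate R[i][j] by u[i] * v[j].

def als_rank1(R, iters=50):
    m, n = len(R), len(R[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        # Fix v; each u[i] = argmin_u sum_j (R[i][j] - u * v[j])^2, closed form.
        vv = sum(x * x for x in v)
        u = [sum(R[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # Fix u; solve each v[j] symmetrically.
        uu = sum(x * x for x in u)
        v = [sum(R[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
    return u, v

# R is exactly rank-1 (outer product of [1, 2] and [3, 4]), so ALS recovers it.
R = [[3.0, 4.0], [6.0, 8.0]]
u, v = als_rank1(R)
approx = [[u[i] * v[j] for j in range(2)] for i in range(2)]
print(all(abs(approx[i][j] - R[i][j]) < 1e-6 for i in range(2) for j in range(2)))  # True
```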

FlowDNN: a physics-informed deep neural network for fast and accurate flow prediction

Published in FITEE, 2021

In this paper, we propose FlowDNN, a novel deep neural network (DNN) to efficiently learn flow representations from CFD results. FlowDNN saves computational time by directly predicting the expected flow fields based on given flow conditions and geometry shapes. FlowDNN is the first DNN that incorporates the underlying physical conservation laws of fluid dynamics with a carefully designed attention mechanism for steady flow prediction. This approach not only improves the prediction accuracy but also preserves the physical consistency of the predicted flow fields, which is essential for CFD.

Recommended citation: Donglin Chen, Xiang Gao, Chuanfu Xu, Siqi Wang, Shizhao Chen, Jianbin Fang, Zheng Wang. "FlowDNN: a physics-informed deep neural network for fast and accurate flow prediction." FITEE. 2021. http://jianbinfang.github.io/files/2021-05-04-fitee.pdf

LibShalom: Optimizing Small and Irregular-shaped Matrix Multiplications on ARMv8 Multi-Core

Published in SC, 2021

This article presents LibShalom, an open-source library for optimizing small and irregular-shaped GEMMs, explicitly targeting the ARMv8 architecture. LibShalom builds upon the classical Goto algorithm but tailors it to minimize the expensive memory-access overhead of data packing and of processing small matrices. It uses analytic methods to determine GEMM kernel optimization parameters, enhancing the computation and parallelization efficiency of the GEMM kernels.

Recommended citation: Weiling Yang, Jianbin Fang, Dezun Dong, Xing Su, Zheng Wang. "LibShalom: Optimizing Small and Irregular-shaped Matrix Multiplications on ARMv8 Multi-Core." SC. 2021. http://jianbinfang.github.io/files/2021-06-22-sc.pdf

Optimizing Barrier Synchronization on ARMv8 Many-Core Architectures

Published in Cluster, 2021

This paper presents the first comprehensive performance study of OpenMP barrier implementations on emerging ARMv8-based many-cores. We evaluate seven representative barrier algorithms on three distinct ARMv8 architectures: Phytium 2000+, ThunderX2, and Kunpeng920. We empirically show that the existing synchronization implementations exhibit poor scalability on ARMv8 architectures compared to their x86 counterparts. We then propose various optimization strategies for improving these widely used synchronization algorithms on each platform.

Recommended citation: Wanrong Gao, Jianbin Fang, Chun Huang, Chuanfu Xu, Zheng Wang. "Optimizing Barrier Synchronization on ARMv8 Many-Core Architectures." Cluster. 2021. http://jianbinfang.github.io/files/2021-07-06-cluster.pdf
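For context, here is a minimal sense-reversing centralized barrier in Python, sketching one of the classic algorithm families such studies evaluate. The implementations examined in the paper live inside OpenMP runtimes and use architecture-specific atomics and spin-wait primitives, not Python threads.

```python
# Sense-reversing centralized barrier sketch (reusable across phases).
import threading
import time

class SenseBarrier:
    def __init__(self, n):
        self.n = n
        self.count = n
        self.sense = False
        self.lock = threading.Lock()

    def wait(self):
        local_sense = not self.sense  # each arrival expects the flipped sense
        with self.lock:
            self.count -= 1
            last = (self.count == 0)
            if last:
                self.count = self.n        # reset for the next phase
                self.sense = local_sense   # flip the sense: release everyone
        if not last:
            while self.sense != local_sense:
                time.sleep(0)  # yield; real spins back off or use ARM WFE

results = []
barrier = SenseBarrier(4)

def worker(i):
    barrier.wait()     # phase 1: all threads rendezvous before appending
    results.append(i)
    barrier.wait()     # phase 2: the barrier is safely reusable

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 2, 3]
```

The shared counter makes every arrival contend on one cache line, which is exactly the kind of coherence traffic that scales differently on ARMv8 parts than on x86.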

wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems

Published in JCST, 2021

This paper presents a comprehensive study to evaluate cache architecture design on three representative ARMv8 multi-cores, Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop the wrBench, a micro-benchmark suite to measure the realized latency and bandwidth of caches at different memory hierarchies when performing core-to-core communications.

Recommended citation: Wanrong Gao, Jianbin Fang, Chun Huang, Chuanfu Xu, Zheng Wang. "wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems." JCST. 2021. http://jianbinfang.github.io/files/2021-09-02-jcst.pdf

Characterizing OpenMP Synchronization Implementations on ARMv8 Multi-Cores

Published in HPCC, 2021

This paper presents a study of OpenMP synchronization implementations on two representative ARMv8 multi-core architectures, Phytium 2000+ and ThunderX2, considering the various OpenMP synchronization mechanisms offered by two mainstream OpenMP compilers, GCC and LLVM.

Recommended citation: Pengyu Wang, Wanrong Gao, Jianbin Fang, Chun Huang, Zheng Wang. "Characterizing OpenMP Synchronization Implementations on ARMv8 Multi-Cores." HPCC. 2021. http://jianbinfang.github.io/files/2021-10-24-hpcc.pdf

talks

teaching

Advanced Compiler Technology

Graduate course, National University of Defense Technology, College of Computer Science, 2017

We aim to teach graduate students advanced compiler technologies.

Advanced Compiler Technology

Graduate course, National University of Defense Technology, College of Computer Science, 2018

We aim to teach graduate students advanced compiler technologies.