Twistronics Blog

Parallel Computing Experiments (4): A Hybrid MPI + OpenMP Implementation of Jacobi Iteration

April 17, 2014

Starting code

This starts from the experiment code of the earlier OpenMP and MPI posts.

OpenMP

  • x_dim = y_dim = z_dim = 1000, t_steps = 30 (screenshot 2-11)
  • x_dim = y_dim = z_dim = 1500, t_steps = 30 (screenshot 2-12)

MPI

  • x_dim = y_dim = z_dim = 1500, t_steps = 3, 12 cores (screenshot 3-19)
  • x_dim = y_dim = z_dim = 1500, t_steps = 30, 12 cores (screenshot 3-20)

Trying a hybrid of MPI and OpenMP

Combining them directly

#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* memset */
#include <mpi.h>


const int x_dim = 200;
const int y_dim = 200;
const int z_dim = 200;

const int t_steps = 3;


int main(int argc, char *argv[])
{
	/* in/out hold the grid plus a one-cell halo layer on every face */
	double(*in)[y_dim + 2][z_dim + 2] = (double(*)[y_dim + 2][z_dim + 2]) malloc(sizeof(double) * (x_dim + 2) * (y_dim + 2) * (z_dim + 2));
	double(*out)[y_dim + 2][z_dim + 2] = (double(*)[y_dim + 2][z_dim + 2]) malloc(sizeof(double) * (x_dim + 2) * (y_dim + 2) * (z_dim + 2));
	double(*temp)[y_dim + 2][z_dim + 2] = NULL;

	const int x_dim_add_2 = x_dim + 2;
	const int y_dim_add_2 = y_dim + 2;
	const int z_dim_add_2 = z_dim + 2;
	const int x_dim_add_1 = x_dim + 1;
	const int y_dim_add_1 = y_dim + 1;
	const int z_dim_add_1 = z_dim + 1;

	/* total number of grid cells, including the halos */
	long long int n = x_dim_add_2;
	n *= y_dim_add_2;
	n *= z_dim_add_2;
	memset(in, 0, n * sizeof(double));	/* n counts elements, so scale by sizeof(double) */
	memset(out, 0, n * sizeof(double));

    int rank, tot, i;
    MPI_Status state;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &tot);

    /* 1-D block decomposition along x: this rank updates rows [start_x, end_x) */
    int step = x_dim / tot + 1;

	int start_x = 1 + rank * step;
	int end_x = start_x + step;
	end_x = end_x < x_dim_add_1 ? end_x : x_dim_add_1;

    for (int t = 0; t < t_steps; t++) {


    	/* receive halo rows produced by the neighbouring ranks in the previous step */
    	if(t != 0 && rank != 0)	{
			for(int y = 0; y < y_dim_add_2; y++)
    			MPI_Recv(&in[start_x - 1][y][0], sizeof(double) * z_dim_add_2, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD, &state);
    	}

    	if(t != 0 && rank != tot - 1)	{
			for(int y = 0; y < y_dim_add_2; y++)
				MPI_Recv(&in[end_x][y][0], sizeof(double) * z_dim_add_2, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD, &state);	/* upper halo row lives at end_x, not end_x + 1 */
    	}
    	
    	/* each rank updates its own slab; the x loop is split across OpenMP threads */
    	#pragma omp parallel for schedule(dynamic, 1)
		for (int x = start_x; x < end_x; x++) {	
			for (int y = 1; y < y_dim_add_1; y++) {
				#pragma ivdep	/* Intel-specific vectorisation hint; other compilers ignore it */
				for (int z = 1; z < z_dim_add_1; z++) {
					out[x][y][z] = 0.4 * in[x][y][z]
						+ 0.1 * (in[x - 1][y][z] + in[x + 1][y][z]
							+ in[x][y - 1][z] + in[x][y + 1][z]
							+ in[x][y][z - 1] + in[x][y][z + 1]);
				}
			}
		}

		/* send this rank's outermost computed rows to the neighbours */
		if(rank != 0)	{
			for(int y = 0; y < y_dim_add_2; y++)	
    			MPI_Send(&out[start_x][y][0], sizeof(double) * z_dim_add_2, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD);	
    	}

    	if(rank != tot - 1)	{
    		for(int y = 0; y < y_dim_add_2; y++)
    			MPI_Send(&out[end_x - 1][y][0], sizeof(double) * z_dim_add_2, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD);	/* last computed row is end_x - 1 */
     	}
     	

		/* swap the grids for the next time step */
		temp = out;
		out = in;
		in = temp;
	}

	free(in);
	free(out);

	MPI_Finalize();

	return 0;
}
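
The exchange above posts blocking MPI_Send calls at the end of one step and the matching MPI_Recv calls at the start of the next, so it relies on MPI buffering the messages in between. Below is a minimal sketch of the same halo exchange written with MPI_Sendrecv and done once at the top of each time step on the in array, replacing both the receive block and the send block; it assumes the same decomposition and names (in, start_x, end_x, z_dim_add_2) as the code above, introduces two neighbour variables up and down for illustration, and is not the version that was benchmarked.

	/* neighbours along x; MPI_PROC_NULL turns the edge exchanges into no-ops */
	int up   = (rank == tot - 1) ? MPI_PROC_NULL : rank + 1;
	int down = (rank == 0)       ? MPI_PROC_NULL : rank - 1;

	for (int y = 0; y < y_dim_add_2; y++) {
		/* my highest interior row goes up; my lower halo row comes from below */
		MPI_Sendrecv(&in[end_x - 1][y][0], z_dim_add_2, MPI_DOUBLE, up, 0,
		             &in[start_x - 1][y][0], z_dim_add_2, MPI_DOUBLE, down, 0,
		             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
		/* my lowest interior row goes down; my upper halo row comes from above */
		MPI_Sendrecv(&in[start_x][y][0], z_dim_add_2, MPI_DOUBLE, down, 1,
		             &in[end_x][y][0], z_dim_add_2, MPI_DOUBLE, up, 1,
		             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
	}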

Running it with mpiexec -n 12 ./a.out produced the following error:

mpiexec_localhost.localdomain: cannot connect to local mpd (/tmp/mpd2.console_htest); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)

Continuing with the changes, I switched the build commands to

export OMP_NUM_THREADS=2
mpicc -O3 main.cpp -fopenmp

This time it ran successfully (screenshot 4-1).

This is now the 12-process, 2-threads-per-process configuration.
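
As a quick sanity check on this hybrid setup, one thing that can be added (it is not part of the code above) is to initialise MPI with MPI_Init_thread and have every rank report how many OpenMP threads it actually gets; a minimal standalone sketch, assuming OMP_NUM_THREADS is exported as above:

#include <stdio.h>
#include <omp.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
	/* FUNNELED: only the master thread of each rank makes MPI calls,
	   which matches how the Jacobi code above uses MPI */
	int provided, rank;
	MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	#pragma omp parallel
	{
		#pragma omp master
		printf("rank %d: %d threads, thread support level %d\n",
		       rank, omp_get_num_threads(), provided);
	}

	MPI_Finalize();
	return 0;
}

Built the same way (mpicc -fopenmp) and launched with mpiexec -n 12 ./a.out, each rank should report 2 threads as long as OMP_NUM_THREADS=2 is still in the environment that mpiexec passes along.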

Optimising the hybrid OpenMP + MPI version

Testing with a larger problem size

  • x_dim = y_dim = z_dim = 500, t_steps = 3, 12 processes with 2 threads each (screenshot 4-2)
  • x_dim = y_dim = z_dim = 1000, t_steps = 3, 12 processes with 2 threads each (screenshot 4-3)
  • x_dim = y_dim = z_dim = 1000, t_steps = 30, 12 processes with 2 threads each (screenshot 4-4)

Changing the number of processes and threads

  • x_dim = y_dim = z_dim = 1000, t_steps = 30, 6 processes with 4 threads each (screenshot 4-5): hardly any change

Adjusting further

  • x_dim = y_dim = z_dim = 1000, t_steps = 30, 4 processes with 6 threads each (screenshot 4-6): slightly faster
  • x_dim = y_dim = z_dim = 1000, t_steps = 30, 2 processes with 12 threads each (screenshot 4-7): it suddenly looks like fewer processes means faster runs, so as the next step I went to the extreme and tested with a single process (screenshot 4-8). Sure enough, it held up, probably because MPI communication is somewhat slower than shared memory (a rough traffic estimate follows this list).
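
A rough way to see why fewer ranks help at this size: every internal rank boundary exchanges two halo planes of (y_dim + 2) * (z_dim + 2) doubles over MPI per time step, while the OpenMP threads inside a rank share those planes through memory. The sketch below only computes that estimate for the process counts tried above; it is an illustration, not a measurement.

#include <stdio.h>

int main(void)
{
	const int y_dim = 1000, z_dim = 1000;	/* grid size used in the tests above */
	const int procs[] = { 1, 2, 4, 6, 12 };

	/* one halo plane = (y_dim + 2) * (z_dim + 2) doubles */
	double plane_mb = (y_dim + 2.0) * (z_dim + 2.0) * sizeof(double) / (1024.0 * 1024.0);

	for (int i = 0; i < 5; i++) {
		int p = procs[i];
		/* p ranks -> p - 1 internal boundaries, two planes each per step */
		printf("%2d processes: %2d boundaries, about %.1f MB of halo traffic per step\n",
		       p, p - 1, 2.0 * (p - 1) * plane_mb);
	}
	return 0;
}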

Changing the problem size again

  • x_dim = y_dim = z_dim = 1500, t_steps = 30, 12 processes with 2 threads each (screenshot 4-9)
  • x_dim = y_dim = z_dim = 1500, t_steps = 30, 6 processes with 4 threads each (screenshot 4-10)
  • x_dim = y_dim = z_dim = 1500, t_steps = 30, 4 processes with 6 threads each (screenshot 4-11)
  • x_dim = y_dim = z_dim = 1500, t_steps = 30, 2 processes with 12 threads each (screenshot 4-12)
  • x_dim = y_dim = z_dim = 1500, t_steps = 30, 1 process with 24 threads (screenshot 4-13)

Comparing these results, the hybrid OpenMP + MPI implementation above is faster than the earlier pure-MPI version, but slower than the earlier pure-OpenMP version.