An unpaired #pragma pack(1): digging a pit and falling into it myself

xiaoxiao, 2021-03-25

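Saturday's weather was bad and I got called in to work overtime writing documentation, which put me in a foul mood. Sunday was bright and sunny, so I happily strolled in to work overtime debugging.

The problem: main calls rte_eal_remote_launch from the DPDK static library, passing the callback function pointer capture_core together with config, a pointer to the struct that capture_core needs. Once capture_core was invoked, however, the pointer member mutex inside the config that main had set up held the wrong value, so __sync_bool_compare_and_swap (config->mutex, lock, 1) died with a segmentation fault.

The struct: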
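To make the failing call concrete, here is a minimal sketch of a spinlock built on the same GCC builtin. The helper names try_lock, spin_lock and spin_unlock are hypothetical (the post only shows the bare CAS call), and it assumes the post's lock variable holds 0, meaning "unlocked".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers; the original code only shows the CAS call itself. */

/* Try once to take the lock: atomically change *mutex from 0 to 1. */
static bool try_lock(int *mutex)
{
    /* Catches NULL only. A garbage pointer passes this check and then
     * faults inside the builtin, which has to dereference it. */
    assert(mutex != NULL);
    return __sync_bool_compare_and_swap(mutex, 0, 1);
}

/* Spin until the lock is acquired. */
static void spin_lock(int *mutex)
{
    while (!try_lock(mutex))
        ; /* busy-wait */
}

/* Release the lock: writes 0 to *mutex with release semantics. */
static void spin_unlock(int *mutex)
{
    __sync_lock_release(mutex);
}
```

The builtin reads and writes through the pointer, so if config->mutex is read from the wrong bytes of the struct, the crash shows up at the CAS even though the lock logic itself is fine.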

    struct core_capture_config {
        struct rte_ring * ring[RING_MAX];
        bool volatile * stop_condition;
        struct core_capture_stats * stats;
        uint8_t port;
        uint8_t queue;
        unsigned int ring_num;
        hashtable_t *ht;
        int *mutex;
        int i;
        unsigned long bond_ip;
    };
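Since the layout of this struct turns out to matter, a small helper along the following lines can print how a given source file sees it. This is only a sketch: RING_MAX is given an assumed value, the DPDK and hashtable types are replaced by opaque stand-ins, and dump_layout is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define RING_MAX 4                  /* assumed value; not given in the post */
struct rte_ring;                    /* opaque stand-ins for the real types */
struct core_capture_stats;
typedef struct hashtable hashtable_t;

struct core_capture_config {
    struct rte_ring *ring[RING_MAX];
    bool volatile *stop_condition;
    struct core_capture_stats *stats;
    uint8_t port;
    uint8_t queue;
    unsigned int ring_num;
    hashtable_t *ht;
    int *mutex;
    int i;
    unsigned long bond_ip;
};

/* Print the struct layout as seen by whichever file compiles this function.
 * Declared static inline so that, if it lives in the shared header, every
 * .c file gets its own copy built with that file's packing setting. */
static inline void dump_layout(const char *where)
{
    printf("[%s] sizeof=%zu port=%zu ring_num=%zu ht=%zu mutex=%zu\n",
           where,
           sizeof(struct core_capture_config),
           offsetof(struct core_capture_config, port),
           offsetof(struct core_capture_config, ring_num),
           offsetof(struct core_capture_config, ht),
           offsetof(struct core_capture_config, mutex));
}
```

Calling dump_layout("main") in the file that fills in config and dump_layout("capture_core") in the file that consumes it should print identical lines. If the two lines differ, the two files disagree about where mutex lives, even though both hold the same address for config.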

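Line of thinking:

1. Is some thread corrupting memory?

I stopped every other thread, but the problem was exactly the same.

2. Use valgrind to find out where memory is being misused?

With DPDK in the picture, however, valgrind would not run: ERROR: This system does not support "RDRAND". Posts online about valgrind and DPDK say this is a valgrind limitation, but the patch offered as a fix did not work for me either.

On top of that, I found that when inspected from inside main, the member values of config had not changed at all.

3. I was stumped. Could rte_eal_remote_launch be working on a copy of config rather than the config I passed in?

I compared the two config addresses in gdb and they were identical...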

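That basically ruled out the DPDK library. Besides, I figured that when something is wrong you should look at your own code first: you don't expect to stumble on a bug that easily in a library like DPDK that Intel provides. Still, it is open source, so I went and read it.

lib/librte_eal/linuxapp/eal/eal_thread.c:

The main functions: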

    /*
     * Send a message to a slave lcore identified by slave_id to call a
     * function f with argument arg. Once the execution is done, the
     * remote lcore switch in FINISHED state.
     */
    int
    rte_eal_remote_launch(int (*f)(void *), void *arg, unsigned slave_id)
    {
        int n;
        char c = 0;
        int m2s = lcore_config[slave_id].pipe_master2slave[1];
        int s2m = lcore_config[slave_id].pipe_slave2master[0];

        if (lcore_config[slave_id].state != WAIT)
            return -EBUSY;

        lcore_config[slave_id].f = f;
        lcore_config[slave_id].arg = arg;

        /* send message */
        n = 0;
        while (n == 0 || (n < 0 && errno == EINTR))
            n = write(m2s, &c, 1);
        if (n < 0)
            rte_panic("cannot write on configuration pipe\n");

        /* wait ack */
        do {
            n = read(s2m, &c, 1);
        } while (n < 0 && errno == EINTR);

        if (n <= 0)
            rte_panic("cannot read on configuration pipe\n");

        return 0;
    }

    /* main loop of threads */
    __attribute__((noreturn)) void *
    eal_thread_loop(__attribute__((unused)) void *arg)
    {
        char c;
        int n, ret;
        unsigned lcore_id;
        pthread_t thread_id;
        int m2s, s2m;
        char cpuset[RTE_CPU_AFFINITY_STR_LEN];

        thread_id = pthread_self();

        /* retrieve our lcore_id from the configuration structure */
        RTE_LCORE_FOREACH_SLAVE(lcore_id) {
            if (thread_id == lcore_config[lcore_id].thread_id)
                break;
        }
        if (lcore_id == RTE_MAX_LCORE)
            rte_panic("cannot retrieve lcore id\n");

        m2s = lcore_config[lcore_id].pipe_master2slave[0];
        s2m = lcore_config[lcore_id].pipe_slave2master[1];

        /* set the lcore ID in per-lcore memory area */
        RTE_PER_LCORE(_lcore_id) = lcore_id;

        /* set CPU affinity */
        if (eal_thread_set_affinity() < 0)
            rte_panic("cannot set affinity\n");

        ret = eal_thread_dump_affinity(cpuset, RTE_CPU_AFFINITY_STR_LEN);

        RTE_LOG(DEBUG, EAL, "lcore %u is ready (tid=%x;cpuset=[%s%s])\n",
            lcore_id, (int)thread_id, cpuset, ret == 0 ? "" : "...");

        /* read on our pipe to get commands */
        while (1) {
            void *fct_arg;

            /* wait command */
            do {
                n = read(m2s, &c, 1);
            } while (n < 0 && errno == EINTR);

            if (n <= 0)
                rte_panic("cannot read on configuration pipe\n");

            lcore_config[lcore_id].state = RUNNING;

            /* send ack */
            n = 0;
            while (n == 0 || (n < 0 && errno == EINTR))
                n = write(s2m, &c, 1);
            if (n < 0)
                rte_panic("cannot write on configuration pipe\n");

            if (lcore_config[lcore_id].f == NULL)
                rte_panic("NULL function pointer\n");

            /* call the function and store the return value */
            fct_arg = lcore_config[lcore_id].arg;
            ret = lcore_config[lcore_id].f(fct_arg);
            lcore_config[lcore_id].ret = ret;
            rte_wmb();
            lcore_config[lcore_id].state = FINISHED;
        }

        /* never reached */
        /* pthread_exit(NULL); */
        /* return NULL; */
    }

As you can see, it does no extra processing on the arg parameter. Yet when I added print statements here, the value of the config->mutex pointer had already changed the moment execution entered rte_eal_remote_launch. Truly baffling...

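4. Do the static library and main disagree about the address space of malloc'ed memory?

I had run into this before with DLLs on Windows, but I could not find anything comparable for Linux. Besides, this code used to work, and the way it is called is the standard one.

5. Further debugging in gdb showed that the struct member values were jumbled: the values of adjacent members looked as if they had been spliced together.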

Adjacent members bleeding into each other suggested an alignment problem, so I printed the address of each member in gdb:

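Outside the function call: (gdb screenshot missing)

Inside the function call: (gdb screenshot missing)

The sizeof of the struct differed as well, so it had to be an alignment problem. But I could not understand it: how can the same struct in the same program be aligned differently in different places?

Since alignment was the likely culprit, I decided to set it explicitly with pack. Unfortunately my fundamentals are shaky and I wrote pack(32)... which did not help.

So the bug was still unsolved, and what was I going to do at work on Monday... With no other option I asked around. Nobody had a definite answer, but thanks to Lao Duan for calling me specially, which made me more certain it was an alignment problem, and Brother Liu's remark was the real wake-up call: just try pack(1). By then I was already wondering what on earth my pack(32) was supposed to be...

This morning I went in early to keep debugging, and only when I recompiled did I notice that pack(32) produces a compiler warning outright... I quickly switched to pack(1), and sure enough the problem went away...

That more or less solved it, but I still could not see why pack had to be specified at all (my memory of the steps in between is a bit hazy). I searched the whole code base for pack. Almost every hit was a header in which structs used for network transfer are wrapped between pack(1) and pack(). Some of the features I had added, however, were copied and adapted from existing code. In one place there was only a pack() with no matching pack(1) before it, which is mostly harmless. In another there was only a pack(1) and no pack(). As a result, one .c file included both that header and the header defining core_capture_config, while another .c file included only the header defining core_capture_config, so the two files used different alignment for the same struct. After I fixed the newly added code, core_capture_config no longer needed any pack directive.

(And why does the code formatter here only offer C++ and not C?)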
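The root cause can be reproduced in a single file. The sketch below is an illustration rather than the project's real headers: wire_header and the three config_* structs are hypothetical, trimmed-down stand-ins. It first shows a pack(1) that is never closed leaking into a later struct, then the push/pop form, which restores the previous setting and so cannot leak.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* ---- Buggy pattern: pack(1) with no matching pack() ---- */
#pragma pack(1)
struct wire_header {            /* hypothetical on-the-wire struct */
    uint8_t  type;
    uint32_t length;
};
/* BUG: "#pragma pack()" is missing here, so everything declared later
 * in the same translation unit is still packed to 1 byte. */

struct config_after_leak {      /* layout seen by the file that also
                                   included the buggy header */
    uint8_t      port;
    uint8_t      queue;
    unsigned int ring_num;      /* no padding in front: starts at offset 2 */
    int         *mutex;
};

#pragma pack()                  /* back to the compiler default */

struct config_default {         /* layout seen by the other file */
    uint8_t      port;
    uint8_t      queue;
    unsigned int ring_num;      /* padded up to its natural alignment */
    int         *mutex;
};

/* ---- Fix: push/pop restores whatever was in effect before ---- */
#pragma pack(push, 1)
struct wire_header_fixed {
    uint8_t  type;
    uint32_t length;
};
#pragma pack(pop)

struct config_after_fix {
    uint8_t      port;
    uint8_t      queue;
    unsigned int ring_num;
    int         *mutex;
};

/* Compile-time guards: fail the build if packing ever leaks again. */
static_assert(sizeof(struct config_after_fix) == sizeof(struct config_default),
              "packing leaked out of a header");
static_assert(offsetof(struct config_after_fix, mutex) ==
              offsetof(struct config_default, mutex),
              "mutex offset differs between layouts");
```

With the leak, mutex sits directly after ring_num at offset 2 + sizeof(unsigned int), while the default layout pads it up to pointer alignment. Two files that disagree this way read different bytes through the same config address, which matches the spliced-together values seen in gdb.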

When reposting, please credit the original URL: https://ju.6miu.com/read-38202.html
