Weird performance measurement on KNL (A weird performance bug)

5 Oct

After the first run, the performance drops.

In theory, after the first run the cache is hot and the JIT should have kicked in, so later runs should be at least as fast. However, both the cycle counts measured for the macro-kernel/micro-kernel and the GFLOPS figure get worse.

The first run looks normal. Its performance is similar to the reference implementation, which suggests the assembly is being compiled correctly. The only reason I can guess is that A, B, C, packA or packB have not been written back to the real “memory” by the time the second run starts, so there is some extra latency there. However, BLIS shows no such issue.

Looking at the BLIS packm implementation, it actually requests the packing buffer from a memory pool; BLIS has its own memory manager. My implementation, by contrast, just calls malloc at the beginning and free at the end. So on every run it has to malloc again, and then write and read the newly allocated buffer. Maybe that’s why the performance drops after the first run.

As a quick and dirty fix, I allocate the packing buffer on the first run but don’t free it at the end, so the second run simply reuses it. (Never freeing it is the dirty part.)
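A minimal sketch of that reuse idea in C, with a matching release call so the buffer can still be freed once at program exit. The names pack_buf_t, pack_buf_acquire and pack_buf_release are hypothetical, made up for this illustration; they are not BLIS APIs and not my actual code. Plain malloc is used here, as in the original implementation; a real packing buffer would probably also want a specific alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper type: caches one packing buffer across runs. */
typedef struct {
    void  *ptr;    /* cached allocation, NULL until the first acquire */
    size_t bytes;  /* capacity of the cached allocation */
} pack_buf_t;

/* Return a buffer of at least `bytes` bytes. If the cached buffer is
 * already large enough it is reused, so repeated runs touch the same
 * memory instead of a fresh allocation. Returns NULL on failure. */
static void *pack_buf_acquire(pack_buf_t *buf, size_t bytes)
{
    assert(buf != NULL && bytes > 0);

    if (buf->ptr != NULL && buf->bytes >= bytes)
        return buf->ptr;              /* reuse: no malloc on this run */

    /* Cached buffer is missing or too small: replace it. */
    free(buf->ptr);
    buf->ptr   = NULL;
    buf->bytes = 0;

    void *p = malloc(bytes);
    if (p == NULL)
        return NULL;

    buf->ptr   = p;
    buf->bytes = bytes;
    return p;
}

/* Free the cached buffer. Call once, after the last run. */
static void pack_buf_release(pack_buf_t *buf)
{
    assert(buf != NULL);
    free(buf->ptr);
    buf->ptr   = NULL;
    buf->bytes = 0;
}
```

Each run would call pack_buf_acquire on a pack_buf_t that outlives the run (for example a static variable initialized to {NULL, 0}), and the benchmark driver would call pack_buf_release once after the final run. This sketch is not thread-safe; a pool shared between threads would need locking or per-thread buffers.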

 
