Optimal SIMD algorithm to rotate or transpose an array

#1
I am working on a data structure where I have an array of 16 uint64 values. They are laid out like this in memory (each token below represents a single uint64):

A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3

The desired result is to transpose the array into this:

A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3

Rotating the array 90 degrees is also an acceptable solution for the later loop:

D0 C0 B0 A0 D1 C1 B1 A1 D2 C2 B2 A2 D3 C3 B3 A3

I need this in order to operate on the array quickly at a later point (traversing it sequentially with another SIMD pass, 4 elements at a time).

So far, I have tried to "blend" the data: loading a 4 x 64-bit vector of A's, bit-masking and shuffling the elements, OR'ing it with the B's, and then repeating for the C's and D's... Unfortunately, this is 5 SIMD instructions per segment of 4 elements in the array (one load, one mask, one shuffle, one OR with the next element, and finally a store), 20 in total. It seems I should be able to do better.

I have AVX2 available and I am compiling with clang.




#2
I don't have hardware to test this on right now, but something like the following should do what you want. First load the four rows:

uint64_t A[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
__m256i row0 = _mm256_loadu_si256((__m256i*)&A[ 0]); //0 1 2 3
__m256i row1 = _mm256_loadu_si256((__m256i*)&A[ 4]); //4 5 6 7
__m256i row2 = _mm256_loadu_si256((__m256i*)&A[ 8]); //8 9 a b
__m256i row3 = _mm256_loadu_si256((__m256i*)&A[12]); //c d e f

Then interleave the 64-bit elements and swap the 128-bit lanes:

__m256i tmp3, tmp2, tmp1, tmp0;
tmp0 = _mm256_unpacklo_epi64(row0, row1); //0 4 2 6
tmp1 = _mm256_unpackhi_epi64(row0, row1); //1 5 3 7
tmp2 = _mm256_unpacklo_epi64(row2, row3); //8 c a e
tmp3 = _mm256_unpackhi_epi64(row2, row3); //9 d b f
//now select the appropriate 128-bit lanes
row0 = _mm256_permute2x128_si256(tmp0, tmp2, 0x20); //0 4 8 c
row1 = _mm256_permute2x128_si256(tmp1, tmp3, 0x20); //1 5 9 d
row2 = _mm256_permute2x128_si256(tmp0, tmp2, 0x31); //2 6 a e
row3 = _mm256_permute2x128_si256(tmp1, tmp3, 0x31); //3 7 b f

The

__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm)

intrinsic selects 128-bit lanes from two sources. You can read about it in the Intel Intrinsics Guide. There is a version `_mm256_permute2f128_si256` which only needs AVX and acts in the floating-point domain. I used it to check that I had the correct control words.



#3
An alternative is to use the *gather* instructions: you can load the transposed matrix directly. The code below works with gcc on an i7 Haswell.

int32_t stride = 4 * sizeof(A[0]);
__m128i r128_gather_idx = _mm_set_epi32(3 * stride, 2 * stride, 1 * stride, 0 * stride);
__m256i row0 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[0]), r128_gather_idx, 1);
__m256i row1 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[1]), r128_gather_idx, 1);
__m256i row2 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[2]), r128_gather_idx, 1);
__m256i row3 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[3]), r128_gather_idx, 1);


©0Day 2016 - 2023 | All Rights Reserved.