Does the assembly change if we write (b + a) instead of (a + b)?
Let's check.
First we write:
__int128 add1(__int128 a, __int128 b) {
return b + a;
}
and compile it with RISC-V gcc 8.2.0:
add1(__int128, __int128):
.LFB0:
.cfi_startproc
add a0,a2,a0
sltu a2,a0,a2
add a1,a3,a1
add a1,a2,a1
ret
Now write the following:
__int128 add1(__int128 a, __int128 b) {
return a + b;
}
And get:
add1(__int128, __int128):
.LFB0:
.cfi_startproc
mv a5,a0
add a0,a0,a2
sltu a5,a0,a5
add a1,a1,a3
add a1,a5,a1
ret
The difference is obvious.
Now do the same using clang (rv64gc trunk). In both cases we get the same result:
add1(__int128, __int128): # @add1(__int128, __int128)
add a1, a1, a3
add a0, a0, a2
sltu a2, a0, a2
add a1, a1, a2
ret
The result is the same as what we got from gcc in the first case. Compilers are smart now, but not that smart yet.
Let's find out what happened here and why. The arguments of the function __int128 add1(__int128 a, __int128 b) are passed in registers a0-a3 in the following order: a0 holds the low word of operand «a», a1 the high word of «a», a2 the low word of «b», and a3 the high word of «b». The result is returned in the same order, with the low word in a0 and the high word in a1.
Then the high words of the two arguments are added, with the result placed in a1, and the low words are added, with the result placed in a0. Next, the low-word sum is compared (as an unsigned value) against a2, i.e. the low word of operand «b». This is needed to find out whether the addition of the low words overflowed, that is, produced a carry. If it did, the sum is less than either operand. Because the original operand in a0 has been overwritten by the sum, the a2 register is used for the comparison. If a0 < a2, an overflow has happened and a2 is set to «1»; otherwise it is set to «0». This bit is then added to the high word of the result. The final result is now in (a1, a0).
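The sequence described above can be sketched in plain C. This is only an illustration: the u128_words struct and the add128 function are hypothetical names that do not appear in the compiler output, and unsigned 64-bit words are assumed so that wraparound is well defined. The internal assert checks the claim that, after an overflow, the sum is less than either operand, which is why comparing against a2 works as well as comparing against a saved copy of a0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-word view of a 128-bit value (illustrative names). */
typedef struct {
    uint64_t lo;  /* low word:  a0 for the first argument, a2 for the second */
    uint64_t hi;  /* high word: a1 for the first argument, a3 for the second */
} u128_words;

/* Mirrors the four-instruction sequence: add, sltu, add, add. */
static u128_words add128(u128_words a, u128_words b) {
    u128_words r;
    r.lo = a.lo + b.lo;            /* add  a0, a0, a2  (wraps modulo 2^64) */
    uint64_t carry = r.lo < b.lo;  /* sltu a2, a0, a2  (1 on overflow, else 0) */

    /* Comparing against the other operand gives the same carry bit. */
    assert(carry == (uint64_t)(r.lo < a.lo));

    r.hi = a.hi + b.hi + carry;    /* add a1, a1, a3 ; add a1, a1, a2 */
    return r;
}
```

For example, adding the words {lo = UINT64_MAX, hi = 0} and {lo = 1, hi = 0} wraps the low word to 0 and carries 1 into the high word.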
Clang (rv32gc trunk) generates exactly the same code for a 32-bit core when the function takes 64-bit arguments and returns a 64-bit result:
long long add1(long long a, long long b) {
return a + b;
}
The assembly:
add1(long long, long long): # @add1(long long, long long)
add a1, a1, a3
add a0, a0, a2
sltu a2, a0, a2
add a1, a1, a2
ret
The code is exactly the same. Unfortunately, the __int128 type is not supported by compilers for 32-bit architectures.
There is a small opportunity here for optimization in the core microarchitecture. According to the RISC-V architecture standard, a microarchitecture can (but does not have to) detect the instruction pairs (MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2) and (DIV[U] rdq, rs1, rs2; REM[U] rdr, rs1, rs2) and process each pair as a single operation. Similarly, it could detect the pair (add rdl, rs1, rs2; sltu rdh, rdl, rs1/rs2) and set the overflow bit in the rdh register immediately.