For example, here is a loop that converts a YUY2 buffer to a BGR buffer (anywhere from 100 thousand to over 900 thousand iterations):

    OffsetBGR = 0;
    while (OffsetYUY2 < VHDR->dwBufferLength) {
        // Read one 4-byte YUY2 block (two pixels sharing U and V)
        Y1 = VHDR->lpData[OffsetYUY2++] - 16;
        U  = VHDR->lpData[OffsetYUY2++] - 128;
        Y2 = VHDR->lpData[OffsetYUY2++] - 16;
        V  = VHDR->lpData[OffsetYUY2++] - 128;

        // Write one 6-byte BGR block (two pixels)
        FrameData[OffsetBGR++] = GET_B(Y1, U, V);
        FrameData[OffsetBGR++] = GET_G(Y1, U, V);
        FrameData[OffsetBGR++] = GET_R(Y1, U, V);
        FrameData[OffsetBGR++] = GET_B(Y2, U, V);
        FrameData[OffsetBGR++] = GET_G(Y2, U, V);
        FrameData[OffsetBGR++] = GET_R(Y2, U, V);
    }

... and the individual components can be computed either by macros:

    // Clamp to the 0..255 range (note: t is evaluated more than once)
    #define CLAMP(t) (((t) > 255) ? 255 : (((t) < 0) ? 0 : (t)))

    // YUV to RGB
    #define GET_R(Y, U, V) CLAMP((298 * (Y) + 409 * (V) + 128) >> 8)
    #define GET_G(Y, U, V) CLAMP((298 * (Y) - 100 * (U) - 208 * (V) + 128) >> 8)
    #define GET_B(Y, U, V) CLAMP((298 * (Y) + 516 * (U) + 128) >> 8)

... or by inline functions:

    // Clamp to the 0..255 range
    inline unsigned char clamp_byte(int value) {
        return (value > 255) ? 255 : ((value < 0) ? 0 : value);
    }

    // Parameters are int: Y - 16 can reach 239, which does not fit in a signed char
    inline unsigned char R(int Y, int V) {
        return clamp_byte((298 * Y + 409 * V + 128) >> 8);
    }
    inline unsigned char G(int Y, int U, int V) {
        return clamp_byte((298 * Y - 100 * U - 208 * V + 128) >> 8);
    }
    inline unsigned char B(int Y, int U) {
        return clamp_byte((298 * Y + 516 * U + 128) >> 8);
    }

Which is more efficient in long loops in terms of performance: macros or inline functions? And is it fine for one inline function to call another inline function when the functions are small?
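
To make the comparison concrete, here is a minimal self-contained sketch that runs both variants over the same YUY2 data so their outputs can be compared. The helper names `convert_macro` and `convert_inline` and the use of `std::vector` instead of the capture buffer are assumptions made for illustration only; the inline functions take `int` arguments so that `Y - 16` is not truncated.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Macro variant (arguments parenthesized; CLAMP evaluates t more than once).
#define CLAMP(t) (((t) > 255) ? 255 : (((t) < 0) ? 0 : (t)))
#define GET_R(Y, U, V) CLAMP((298 * (Y) + 409 * (V) + 128) >> 8)
#define GET_G(Y, U, V) CLAMP((298 * (Y) - 100 * (U) - 208 * (V) + 128) >> 8)
#define GET_B(Y, U, V) CLAMP((298 * (Y) + 516 * (U) + 128) >> 8)

// Inline-function variant. int parameters are an assumption of this sketch.
inline std::uint8_t clamp_byte(int value) {
    return static_cast<std::uint8_t>(value > 255 ? 255 : (value < 0 ? 0 : value));
}
inline std::uint8_t R(int y, int v) {
    return clamp_byte((298 * y + 409 * v + 128) >> 8);
}
inline std::uint8_t G(int y, int u, int v) {
    return clamp_byte((298 * y - 100 * u - 208 * v + 128) >> 8);
}
inline std::uint8_t B(int y, int u) {
    return clamp_byte((298 * y + 516 * u + 128) >> 8);
}

// Hypothetical helper: YUY2 -> BGR using the macros.
std::vector<std::uint8_t> convert_macro(const std::vector<std::uint8_t>& yuy2) {
    std::vector<std::uint8_t> bgr;
    bgr.reserve(yuy2.size() / 4 * 6);
    for (std::size_t i = 0; i + 3 < yuy2.size(); i += 4) {
        // One 4-byte YUY2 block holds two pixels that share U and V.
        const int y1 = yuy2[i] - 16;
        const int u  = yuy2[i + 1] - 128;
        const int y2 = yuy2[i + 2] - 16;
        const int v  = yuy2[i + 3] - 128;
        bgr.push_back(static_cast<std::uint8_t>(GET_B(y1, u, v)));
        bgr.push_back(static_cast<std::uint8_t>(GET_G(y1, u, v)));
        bgr.push_back(static_cast<std::uint8_t>(GET_R(y1, u, v)));
        bgr.push_back(static_cast<std::uint8_t>(GET_B(y2, u, v)));
        bgr.push_back(static_cast<std::uint8_t>(GET_G(y2, u, v)));
        bgr.push_back(static_cast<std::uint8_t>(GET_R(y2, u, v)));
    }
    return bgr;
}

// Hypothetical helper: YUY2 -> BGR using the inline functions.
std::vector<std::uint8_t> convert_inline(const std::vector<std::uint8_t>& yuy2) {
    std::vector<std::uint8_t> bgr;
    bgr.reserve(yuy2.size() / 4 * 6);
    for (std::size_t i = 0; i + 3 < yuy2.size(); i += 4) {
        const int y1 = yuy2[i] - 16;
        const int u  = yuy2[i + 1] - 128;
        const int y2 = yuy2[i + 2] - 16;
        const int v  = yuy2[i + 3] - 128;
        bgr.push_back(B(y1, u));
        bgr.push_back(G(y1, u, v));
        bgr.push_back(R(y1, v));
        bgr.push_back(B(y2, u));
        bgr.push_back(G(y2, u, v));
        bgr.push_back(R(y2, v));
    }
    return bgr;
}
```

This only checks that the two variants produce the same bytes for the same input. It says nothing about speed; that would have to be measured by timing both helpers on a full frame in an optimized build.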