Bits and pieces

Sometimes you want to work with individual bits inside a value.

Recall the UTF-8 encoding:

0xxxxxxx - ASCII character
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Shift operators in C:

x << n
x >> n

Undefined behaviour if:

- An expression is shifted by a negative number or by an amount greater than or equal to the width of the promoted expression (6.5.7).
- An expression having signed promoted type is left-shifted and either the value of the expression is negative or the result of shifting would not be representable in the promoted type (6.5.7).

For example:

7 << 2           // 28
7 << -1          // UB
7 << 30          // UB on our platform (32-bit int): the result does not fit
-1 >> 1          // implementation defined
0xABCDEFFFu << 4 // 0xBCDEFFF0u
0 << 32          // UB on our platform (32-bit int): the shift count equals the width
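
The two conditions quoted above can be checked before shifting. The following is a minimal sketch rather than a standard facility: `checked_shl` is a hypothetical helper name, and the width computation assumes that `int` has no padding bits.

```c
#include <limits.h>
#include <stdbool.h>

// Hypothetical helper: stores x << n in *out and returns true only when
// the shift is well defined for a signed int.
bool checked_shl(int x, int n, int *out) {
    // Assumption: int has no padding bits, so this is its width.
    const int width = (int)(sizeof(int) * CHAR_BIT);
    if (n < 0 || n >= width) {
        return false; // negative or too large shift count
    }
    if (x < 0) {
        return false; // left shift of a negative value
    }
    if (x > (INT_MAX >> n)) {
        return false; // result would not be representable in int
    }
    *out = x << n;
    return true;
}
```

With a 32-bit `int` this accepts `7 << 2` and rejects `7 << -1`, `7 << 30`, and `0 << 32`, matching the list above.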

Let's try classifying bytes of UTF-8:

_Bool is_ascii(char b) {
    // return (b >> 7) == 0; // nope, could be signed char
    return ((unsigned char) b >> 7) == 0;
}

_Bool is_continuation(unsigned char b) {
    // return (b >> 6) == 0b10; // valid in C++14
    return (b >> 6) == 2;
}

_Bool is_2_byte_start(unsigned char b) {
    return (b >> 5) == 6; // 0b110
}
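
As a small usage sketch of the continuation-byte test: every code point contributes exactly one byte that is not a continuation byte, so counting those bytes counts code points. The name `utf8_count_code_points` is hypothetical, and the sketch assumes the input is valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

// Hypothetical helper: counts code points in a NUL-terminated string that
// is assumed to be valid UTF-8. No validation is performed.
size_t utf8_count_code_points(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        // Cast first: plain char may be signed.
        if (((unsigned char)*s >> 6) != 2) { // not 10xxxxxx
            ++count;
        }
    }
    return count;
}
```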

Instead of shifts, you can use bit masks:

_Bool is_2_byte_start(unsigned char b) {
    return (b & 0xE0) == 0xC0; // 0b1110'0000, 0b1100'0000
}
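
The same masking idea extends to the whole table from the beginning of the section. This is a sketch only: both function names are hypothetical, and neither function rejects overlong or otherwise invalid encodings.

```c
// Total length in bytes of the sequence that lead byte b starts,
// or 0 if b cannot start a sequence from the table.
int utf8_sequence_length(unsigned char b) {
    if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                         // 10xxxxxx or 11111xxx
}

// Code point stored in a 2-byte sequence: five payload bits from the lead
// byte followed by six payload bits from the continuation byte.
unsigned decode_2_byte(unsigned char lead, unsigned char cont) {
    return ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
}
```

For instance, the bytes 0xD0 0xB4 decode to 0x434, which is U+0434.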

Bit-fields:

#include <stdint.h>

struct u8char {
    uint8_t sign_bit : 1;
    uint8_t tail_bits : 7;
};

union char_breaker {
    uint8_t number;
    struct u8char fields;
};

...
union char_breaker cb = {.number = 'x'};
cb.fields.sign_bit; // implementation defined whether this is the most significant bit :-(
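
Because the order in which bit-fields are allocated within a storage unit is implementation defined, a portable way to get at the same pieces is to go back to shifts and masks. A minimal sketch, with hypothetical function names:

```c
#include <stdint.h>

// Hypothetical helpers: split a byte into its most significant bit and its
// low seven bits without relying on the compiler's bit-field layout.
unsigned top_bit(uint8_t b) {
    return b >> 7; // 0 or 1
}

unsigned low_seven_bits(uint8_t b) {
    return b & 0x7Fu;
}
```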

setjmp / longjmp

#include <setjmp.h>
int setjmp(jmp_buf env);
void longjmp(jmp_buf env, int val);
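
A minimal sketch of how the two calls pair up: `setjmp` returns 0 when called directly and a nonzero value when control comes back through `longjmp`. The names `recovery_point`, `fail_if_zero`, and `try_divide` are hypothetical and exist only for illustration.

```c
#include <setjmp.h>

static jmp_buf recovery_point; // hypothetical name

// Jumps back to the setjmp call instead of returning to its caller.
static void fail_if_zero(int divisor) {
    if (divisor == 0) {
        longjmp(recovery_point, 1); // setjmp appears to return 1
    }
}

// Stores a / b in *result and returns 1, or returns 0 if fail_if_zero
// bailed out through longjmp.
int try_divide(int a, int b, int *result) {
    if (setjmp(recovery_point) != 0) {
        return 0; // arrived here via longjmp
    }
    fail_if_zero(b);
    *result = a / b;
    return 1;
}
```

The function that called `setjmp` must still be active when `longjmp` runs; here that holds because `fail_if_zero` is called from inside `try_divide`.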

Inline assembly

GCC documentation

// Basic asm
asm("nop");

// Extended asm
// asm(template : outputs : inputs : clobbers)
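
To make the four parts of extended asm concrete, here is a sketch that adds two integers with the x86 `addl` instruction. It assumes GCC or Clang targeting x86 with the default AT&T syntax, and falls back to plain C elsewhere. The function name `asm_add` is hypothetical. The `__asm__` spelling is used because plain `asm` is not recognised in strict ISO modes such as `-std=c11`.

```c
// Hypothetical example function: a + b computed with inline assembly.
int asm_add(int a, int b) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __asm__("addl %1, %0" // template: %0 += %1 (AT&T operand order)
            : "+r"(a)     // outputs: a is both read and written, in a register
            : "r"(b)      // inputs: b in any general register
            : "cc");      // clobbers: the flags register
    return a;
#else
    return a + b;         // fallback for other compilers and targets
#endif
}
```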

Some nice detailed description

g - general effective address
m - memory effective address
r - register
i - immediate value, 0..0xffffffff
n - immediate value known at compile time.
    ("i" would allow an address known only at link time)

But there are some i386-specific ones described in the processor-specific
part of the manual and in more detail in GCC's i386.h:

q - byte-addressable register (eax, ebx, ecx, edx)
A - eax or edx
a, b, c, d, S, D - eax, ebx, ecx, edx, esi, edi respectively

I - immediate 0..31
J - immediate 0..63
K - immediate 255
L - immediate 65535
M - immediate 0..3 (shifts that can be done with lea)
N - immediate 0..255 (one-byte immediate value)
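
The register-specific constraints matter for instructions with fixed operands. As a sketch, `mull` takes one factor from eax and leaves the 64-bit product in edx:eax, so the `a` and `d` constraints pin the operands accordingly. This assumes GCC or Clang targeting x86 with AT&T syntax and falls back to plain C elsewhere. The function name `mul_32x32` is hypothetical.

```c
#include <stdint.h>

// Hypothetical example function: full 64-bit product of two 32-bit values.
uint64_t mul_32x32(uint32_t x, uint32_t y) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;
    __asm__("mull %3"            // edx:eax = eax * operand 3
            : "=a"(lo), "=d"(hi) // outputs pinned to eax and edx
            : "a"(x), "r"(y)     // x must start in eax, y in any register
            : "cc");             // the flags are modified
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)x * y;      // fallback for other compilers and targets
#endif
}
```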