Introduction to assembly

  • Why it matters for reverse engineering: assembly is the human-readable form of the machine instructions processors actually execute.

Bits and bytes

  • Binary digits (bits): represent on/off electrical states.
  • Byte history: byte sizes originally varied by machine (for example 4 or 6 bits); the 8-bit byte became the de facto standard in the 1960s with the IBM System/360 and EBCDIC, giving 256 possible values (0–255).
  • Numeric ranges for one byte: unsigned 0 to 255; signed two’s-complement −128 to 127.
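
  The two numeric ranges are two readings of the same eight bits. A minimal Python sketch, where the helper name as_signed_byte is made up for illustration:

```python
def as_signed_byte(value: int) -> int:
    """Reinterpret an unsigned byte (0..255) as two's-complement (-128..127)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("expected a value in 0..255")
    # Bit 7 set means the two's-complement reading is negative.
    return value - 256 if value >= 128 else value


# Same bit pattern, two interpretations.
for raw in (0x00, 0x7F, 0x80, 0xFF):
    print(f"{raw:08b}  unsigned={raw:3d}  signed={as_signed_byte(raw):4d}")
```

  The standard library gives the same answer via int.from_bytes(b"\xff", "little", signed=True).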

Character encoding

  • ASCII (7 bits): 128 characters (English letters, digits, control codes).
  • EBCDIC (8 bits): IBM’s set with code pages for other languages.
  • Unicode / UTF-8: UTF-8 is a variable-length encoding of Unicode (1–4 bytes per code point); ASCII text is already valid UTF-8, and Unicode aims to cover all of the world's scripts.
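
  The variable-length property is easy to observe directly. A small Python sketch, with sample characters chosen only for illustration and a hypothetical helper utf8_bytes:

```python
def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")


# One sample per encoded length: 'A', e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# Pure ASCII text encodes to identical bytes in ASCII and in UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))
```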

Machine code vs. assembly

  • Machine code: the raw binary instruction encodings that the CPU fetches, decodes, and executes directly.
  • Assembly language: mnemonic shorthand (e.g. ADD, LDR) for those encodings, plus registers and syntax for operands.
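  • Example (Thumb 16-bit, GNU as syntax where @ starts a comment; unified syntax spells this flag-setting form ADDS):

    ADDS R1, R0, #2    @ R1 = R0 + 2 (also updates the condition flags)
    @ encoding: 0001110 010 000 001 → 0x1C81
    @ fields: opcode 0001110, imm3 = 010, Rn = 000 (R0), Rd = 001 (R1)
    
  • Load/store example:

    LDR R3, [R2]       @ load 32-bit word from address in R2 into R3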
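
  To see where 0x1C81 comes from, the bit fields of the Thumb example above can be packed by hand. A minimal Python sketch, not a real assembler: the function name encode_thumb_add_imm3 is hypothetical and it covers only this one 16-bit encoding.

```python
def encode_thumb_add_imm3(rd: int, rn: int, imm3: int) -> int:
    """Pack the fields of the 16-bit Thumb 'ADD Rd, Rn, #imm3' form."""
    for name, field in (("rd", rd), ("rn", rn), ("imm3", imm3)):
        if not 0 <= field <= 7:
            raise ValueError(f"{name} must fit in 3 bits")
    opcode = 0b0001110  # top 7 bits of this encoding
    # Layout, high to low: opcode(7) | imm3(3) | Rn(3) | Rd(3)
    return (opcode << 9) | (imm3 << 6) | (rn << 3) | rd


halfword = encode_thumb_add_imm3(rd=1, rn=0, imm3=2)
print(f"{halfword:016b} -> 0x{halfword:04X}")
```

  Changing rd, rn, or imm3 changes only the corresponding 3-bit field, which is the kind of pattern to look for when reading raw encodings.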
    

Assembling

  • Assembler role: converts .s files into object files (.o), handling labels and directives (e.g. .section .text, .global).
  • Linking: the linker combines object files into an executable and resolves symbols across them.
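  • Labels & ADR/LDR (GNU as on ARM: @ starts a comment, ; separates statements):

    myvalue: .word 2        @ data label
    ADR   R2, myvalue       @ R2 ← &myvalue (PC-relative, so the label must be nearby)
    LDR   R3, [R2]          @ R3 ← *(&myvalue), i.e. 2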
    
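  • Syscall “Hello” (32-bit ARM Linux, EABI: syscall number in r7, arguments in r0–r2):

    .global _start
    _start:
    mov   r0, #1            @ fd 1 = stdout
    adr   r1, mystring      @ &"Hello\n"
    mov   r2, #6            @ length in bytes
    mov   r7, #4            @ syscall 4 = write
    svc   #0
    mov   r0, #0            @ exit status 0
    mov   r7, #1            @ syscall 1 = exit, so execution does not run into the string
    svc   #0
    mystring: .string "Hello\n"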
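
  For comparison, the same write call can be issued from a high-level language. This is a rough Python sketch, assuming a POSIX system where os.write wraps the write system call; the function name say_hello is made up for illustration.

```python
import os


def say_hello(fd: int = 1) -> int:
    """Write the 6-byte greeting to fd and return the number of bytes written."""
    message = b"Hello\n"  # 6 bytes, matching the length placed in r2
    return os.write(fd, message)  # fd 1 is stdout, as in r0


if __name__ == "__main__":
    say_hello()
```

  The three values the assembly loads into r0, r1, and r2 are the three arguments of the call here: file descriptor, buffer, and length (implicit in the bytes object).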
    

Cross-assemblers

  • Architecture matters: same bytes decode differently on AArch32, AArch64, x86_64, etc.
  • Cross-assembly: run an ARM assembler on x86 to target ARM:

    arm-linux-gnueabihf-as my.s -o my.o   # assemble for 32-bit ARM
    arm-linux-gnueabihf-ld my.o -o my     # link a 32-bit ARM executable
    aarch64-linux-gnu-gcc main.c -o a64   # cross-compile C for 64-bit ARM
    

High-level languages

  • C/C++ vs. assembly:

    • High-level: portable, abstract (if/else, loops, named vars), compiled by GCC/G++.
    • Low-level: exact control over instructions; needed for bootloaders, firmware, performance-critical routines, shellcode, OS internals.

Disassembling

  • Objective: recover assembly from a binary.
  • Tools:

    • objdump -d for quick dumps.
    • Interactive: Ghidra, IDA Pro, Radare2.
  • Use cases: malware analysis, compiler validation, exploit development.
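
  At its core, a disassembler reverses the encoding step: it reads bytes, extracts bit fields, and prints mnemonics. A toy Python sketch that recognizes only the 16-bit Thumb add-immediate form whose encoding is 0x1C81 in the earlier example; the function name decode_thumb_add_imm3 is hypothetical, and little-endian byte order (the usual case) is assumed. It prints ADDS, the unified-syntax spelling of this flag-setting form.

```python
def decode_thumb_add_imm3(halfword: int) -> str:
    """Toy disassembler for one encoding: Thumb 'ADDS Rd, Rn, #imm3'."""
    if (halfword >> 9) != 0b0001110:
        # Anything else is emitted as raw data, as real tools do for unknown bytes.
        return f".hword 0x{halfword:04X}"
    imm3 = (halfword >> 6) & 0b111
    rn = (halfword >> 3) & 0b111
    rd = halfword & 0b111
    return f"ADDS R{rd}, R{rn}, #{imm3}"


# Assumption: little-endian storage, so 0x1C81 appears in the file as 81 1C.
code = bytes.fromhex("811c")
halfword = int.from_bytes(code, "little")
print(decode_thumb_add_imm3(halfword))
```

  Tools such as objdump, Ghidra, and IDA Pro do this for every encoding of an architecture and add control-flow and symbol analysis on top.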

Decompilation

  • Beyond disassembly: generates C-like pseudocode.
  • Pros: higher-level overview, easier to skim.
  • Cons: lossy (comments and local variable names are gone, and symbols too in stripped binaries); the output may mislead.
  • Example: compare IDA and Ghidra output for the same function. Both are readable at a glance, but always verify against the raw assembly.