Introduction to assembly
- Why it matters for reverse engineering: assembly is the human-readable form of the machine instructions processors actually execute.
Bits and bytes
- Binary digits (bits): represent on/off electrical states.
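- Byte history: byte size originally varied by machine (e.g. 4–6 bits); the 8-bit byte became the standard in the 1960s (IBM System/360 with EBCDIC; 7-bit ASCII is stored one character per 8-bit byte), giving 256 possible values (0–255).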
- Numeric ranges: unsigned 0 to 255; signed two's complement -128 to 127.
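The two numeric ranges above come from reading the same eight bits in two ways. A minimal Python sketch, where the helper name `as_signed8` is made up for illustration:

```python
def as_signed8(value: int) -> int:
    """Interpret an 8-bit pattern (0-255) as a two's-complement signed value."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    # If the top bit (0x80) is set, the pattern represents value - 256.
    return value - 256 if value & 0x80 else value


for pattern in (0x00, 0x7F, 0x80, 0xFF):
    print(f"{pattern:08b}  unsigned={pattern:3d}  signed={as_signed8(pattern):4d}")
```

The bit pattern never changes; only the interpretation does, which is why a disassembler cannot tell from the byte alone whether 0xFF means 255 or -1.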
Character encoding
- ASCII (7 bits): 128 characters (English letters, digits, control codes).
- EBCDIC (8 bits): IBM’s set with code pages for other languages.
- Unicode / UTF-8: variable-length (1–4 bytes), backward-compatible with ASCII, covers all world scripts.
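To make the variable-length point concrete, the following Python sketch encodes four sample characters (chosen here only as examples) and reports how many bytes UTF-8 needs for each. The helper name `utf8_bytes` is hypothetical.

```python
def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")


# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji)
for ch in ("\u0041", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# ASCII text is byte-for-byte identical in UTF-8.
print("Hello".encode("ascii") == "Hello".encode("utf-8"))
```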
Machine code vs. assembly
- Machine code: raw binary instruction encodings only CPUs “understand.”
- Assembly language: mnemonic shorthand (e.g. ADD, LDR) for those encodings, plus registers and syntax for operands.
- Example (Thumb 16-bit; this encoding also updates the flags, so unified syntax writes it as ADDS outside an IT block):
      ADD R1, R0, #2    @ R1 = R0 + 2; encoding: 0001110 010 000 001 → 0x1C81
- Load/store example:
      LDR R3, [R2]      @ load 32-bit word from address in R2 into R3
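The bit layout in the Thumb example (0001110, then imm3, Rn, Rd) can be checked by packing the fields by hand. This is a small Python sketch of that one encoding only, not an assembler. The function name `encode_thumb_add_imm3` is invented for illustration.

```python
def encode_thumb_add_imm3(rd: int, rn: int, imm3: int) -> int:
    """Pack the 16-bit Thumb layout 0001110 imm3 Rn Rd into one halfword."""
    for name, field in (("rd", rd), ("rn", rn), ("imm3", imm3)):
        if not 0 <= field <= 7:
            raise ValueError(f"{name} must fit in 3 bits")
    return (0b0001110 << 9) | (imm3 << 6) | (rn << 3) | rd


halfword = encode_thumb_add_imm3(rd=1, rn=0, imm3=2)
print(hex(halfword))                          # 0x1c81
print(halfword.to_bytes(2, "little").hex())   # 811c (byte order in a little-endian binary)
```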
Assembling
- Assembler role: converts .s files into object files (.o), handling labels and directives (e.g. .section .text, .global).
- Linking: combines object files into executables, resolves symbols.
- Labels & ADR/LDR:
      myvalue: .word 2        @ data label
      ADR R2, myvalue         @ R2 ← &myvalue
      LDR R3, [R2]            @ R3 ← *(&myvalue)
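- Syscall “Hello” (32-bit ARM Linux, GNU assembler syntax, where @ starts a comment):
      .section .text
      .global _start
      _start:
          mov r0, #1          @ fd 1 = stdout
          adr r1, mystring    @ r1 = &"Hello\n"
          mov r2, #6          @ length in bytes
          mov r7, #4          @ syscall 4 = write
          svc #0
          mov r0, #0          @ exit status 0
          mov r7, #1          @ syscall 1 = exit
          svc #0
      mystring: .string "Hello\n"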
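For comparison, the same write(fd, buffer, length) system call can be reached from a high-level language. In the Python sketch below, `os.write` is the standard-library wrapper around that call. The function name `say_hello` is made up for illustration.

```python
import os


def say_hello(fd: int = 1) -> int:
    """Write b"Hello\\n" to the given file descriptor (1 = stdout).

    Returns the number of bytes written, as reported by the kernel.
    """
    message = b"Hello\n"   # 6 bytes, matching the length loaded into r2
    return os.write(fd, message)


say_hello()
```

The assembly version has to place each argument in a register and select the syscall number itself; here the runtime does that behind the call.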
Cross-assemblers
- Architecture matters: same bytes decode differently on AArch32, AArch64, x86_64, etc.
- Cross-assembly: run an ARM assembler on x86 to target ARM:
      arm-linux-gnueabihf-as my.s -o my.o
      arm-linux-gnueabihf-ld my.o -o my         # 32-bit ARM
      aarch64-linux-gnu-gcc main.c -o a64       # 64-bit ARM
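One reason the same bytes decode differently is that instruction sets disagree on where an instruction starts and ends. The Python sketch below takes four example bytes (chosen here for illustration) and reads them once as two 16-bit Thumb halfwords and once as a single 32-bit A32 word. It only splits the bytes and does not decode them into mnemonics.

```python
import struct

# Four bytes as they would sit in a little-endian binary (example values).
code = bytes.fromhex("811c1368")

# Read as Thumb: two 16-bit halfwords.
thumb_halfwords = struct.unpack("<2H", code)

# Read as A32: one 32-bit word; its top four bits are the condition field.
(a32_word,) = struct.unpack("<I", code)
a32_cond = a32_word >> 28

print([hex(h) for h in thumb_halfwords])   # ['0x1c81', '0x6813']
print(hex(a32_word), "condition bits:", format(a32_cond, "04b"))
```

In Thumb, 0x1c81 is the ADD example shown earlier and 0x6813 is the 16-bit encoding of LDR R3, [R2]. Read as one A32 word, the same bytes form a single, unrelated instruction.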
High-level languages
- C/C++ vs. assembly:
  - High-level: portable, abstract (if/else, loops, named vars), compiled by GCC/G++.
  - Low-level: exact control over instructions; needed for bootloaders, firmware, performance-critical routines, shellcode, OS internals.
Disassembling
- Objective: recover assembly from a binary.
- Tools:
  - objdump -d for quick dumps.
  - Interactive: Ghidra, IDA Pro, Radare2.
- Use cases: malware analysis, compiler validation, exploit development.
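At its core, a disassembler matches bit patterns and pulls the operand fields back out. The Python sketch below does this for exactly one 16-bit Thumb pattern (0001110 imm3 Rn Rd) and keeps the ADD spelling used in these notes. Real tools cover the whole instruction set, and the function name `decode_thumb_add_imm3` is hypothetical.

```python
def decode_thumb_add_imm3(halfword: int) -> str:
    """Decode one 16-bit Thumb pattern: 0001110 imm3 Rn Rd."""
    if not 0 <= halfword <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    if halfword >> 9 != 0b0001110:
        # Anything else is outside the scope of this sketch: emit raw data.
        return f".hword 0x{halfword:04x}"
    imm3 = (halfword >> 6) & 0b111
    rn = (halfword >> 3) & 0b111
    rd = halfword & 0b111
    return f"ADD R{rd}, R{rn}, #{imm3}"


print(decode_thumb_add_imm3(0x1C81))   # ADD R1, R0, #2
```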
Decompilation
- Beyond disassembly: generates C-like pseudocode.
- Pros: higher-level overview, easier to skim.
- Cons: lossy (symbols, var names, comments gone), may mislead.
- Example: compare IDA vs. Ghidra output for the same function. Both are readable at a glance, but always verify against the raw assembly.