Notes and labs for the course 15-213 Introduction to Computer Systems at CMU
- Data types
- char, short, int, long, float, double, pointer
- Word size equals length of pointer datatype
- Bit-level operations
- Signed / unsigned conversion
- Byte ordering
- Big Endian: Sun, PPC Mac, Internet
- Little Endian: x86, ARM
- Numeric form: (-1)^s*M*(2^E)
- Encoding
- s: sign bit
- exp: E
- frac: M
- Three kinds of values
- Denormalized: exp = 0
- Normalized: 0 < exp < 11..11
- Special: exp = 11...11 (e.g. inf & NaN)
- Roundings
- History
- Registers
- Addressing modes
- Normal:
(R)
->Mem[Reg[R]]
- Displacement:
D(R)
->Mem[Reg[R] + D]
- Complete:
D(Rb,Ri,S)
->Mem[Reg[Rb] + S*Reg[Ri] + D]
(Rb,Ri)
->Mem[Reg[Rb] + Reg[Ri]]
D(Rb,Ri)
->Mem[Reg[Rb] + Reg[Ri] + D]
(Rb,Ri,S)
->Mem[Reg[Rb] + S*Reg[Ri]]
- Normal:
- Some instructions
movq Src, Dst
- Cannot do memory-memory transfer with a single instruction
- Intel docs use
mov Dst, Src
leaq Src, Dst
- Src is address mode expression, set Dst to address denoted by expression
- Similar to
p = &x[i]
- Used for arithmetics for form like
x + k * y
- Does not change condition codes
addq/subq Src, Dst
imulq Src, Dst
salq/sarq/shrq Src, Dst
xorq/andq/orq Src, Dst
pushq src
popq dest
incr dest
- Compiler, Assembler, Linker & Loader
- Compiler
- Translates C files (.c) into assembly files (.s)
- Assembler
- Translates assembly files (.s) into object files (.o)
- Missing linkage between compilation units
- Linker
- Resolve references between object files
- Combine with static libraries (malloc, printf, etc)
- Dynamic linked libraries
- Linking occurs at runtime
- Does not take too much disk space
- Compiler
- Controls
- Procedures
- Passing control
- Procedure call:
call label
- Push return address into stack
- Jump to label
- Procedure return:
ret
- Pop return address from stack
- Jump to this address
- Return address: Address of next instruction after the call statement
- Procedure call:
- Passing data
- First 6 arguments:
%rdi
,%rsi
,%rdx
,%rcx
,%r8
,%r9
- Other arguments passed using stack
- Return value:
%rax
- IA-32 pass all arguments in stack
- Concept of stack frames:
- Marked by
%rbp
(optional) and%rsp
- No additional mechanism for recursion is needed
- Marked by
- Register saving conditions
- Caller saved
%rdi
,%rsi
,%rdx
,%rcx
,%r8
,%r9
,%rax
,%r10
,%r11
- Callee saved
%rbx
,%r12
,%r13
,%r14
,%rbp
%rsp
is also a special form of callee-saved
- Caller saved
- First 6 arguments:
- Memory management
- ABI: Application Binary Interface
- Passing control
- Data
- Address space
- Vulnerablities
- Protection
- Use routines limiting string lengths (user-level)
- Randomized stack offsets
- Nonexecutable code segments
- Stack canaries
- Optimization by programmer
- Optimization blockers
- Procedures: Seen as a "black box"
- Procedures may have side effects
- May not return same result with same argument
- Fix: Use inline functions (GCC with -O1 within single file)
- Memory aliasing: Two memory references specify single location
- The following code does memory load and store every time, because compiler assume possibility of memory aliasing:
- Load and store take multiple clock cycles
- Easily caused by direct access to storage structures
- Fix: Define local variable to tell compiler not to check for aliasing
- Get in habit of introducing local variables accumulating within loops
- The following code does memory load and store every time, because compiler assume possibility of memory aliasing:
- Procedures: Seen as a "black box"
- Optimization (by programmer) limitations
- Most performed within procedures. Newer versions of GCC do interprocedual optimization, but not between codes in different files
- Based on static information
- Conservative: Must not change program behavior
- Instruction-level parallelism
- Superscalar processor: Issue and execute multuple instructions per cycle, and instructions are scheduled dynamically
- Some instruction have >1 clock cycle latency, but can be pipelined:
- Unrolling
- Break sequential dependency to break through latency bound (to approach throughput bound)
can be optimized to:for(int i = 0; i < limit; ++i) x = x + d[i];
but to break sequential dependency:for(int i = 0; i < limit; i += 2) x = (x + d[i]) + d[i + 1];
for(int i = 0; i < limit; i += 2) x = x + (d[i] + d[i + 1]);
- adding separate accumulators
- Break sequential dependency to break through latency bound (to approach throughput bound)
- Branch prediction
- Backward branches are often loops, predict taken
- Forward branches are often if, predict not taken
- Average better than 95% accuracy
- Storage technologies
- RAMs
- Volatile: SRAM & DRAM (caches & main memories)
- Nonvolatile: ROM, PROM, EPROM, EEPROM (firmware, ssd & disk caches)
- Rotating disks
- SSDs
- Page can be written only after its block has been erased
- RAMs
- Locality
- Temporal locality
- Spatial locality
- Hierarchy
- Caches
- Cache memories
- Concept of locality
- General organization
- Metrics
- Miss rate
- Hit time
- Miss penalty
- Write cache-friendly code
- Make the common cases go first
- Minimize the misses in inner loops
- Try to maximize spatial locality by reading objects sequentially with stride 1
- Try to maximize temporal locality by using an object as often as possible once it's read from memory
- Example of matrix multiplication
- In which order to arrange the loops? Do miss rate analysis!
- It turns out: kij/ikj > ijk/jik > jki/kji
- Use blocking: multiplying by sub-matrices
- Concept of locality
- Why linkers?
- Modularity
- Efficiency (separate complilation)
- Two kind of linking
- Static linking
- Dynamic linking
- What does linker do?
- Symbol resolution
- Functions,
global
vars,static
vars - Definitions are stored in symbol table, an array of entries (name, size, location)
- Three kind of symbols:
- Global symbols: non-static functions and non-static vars
- External symbols: defined in other modules
- Local symbols: static functions and static vars
- Note: Do not confuse local symbols with local variables. Local variables are allocated in stack at runtime, and have nothing to do with linker.
- Symbol resolution
- Symbols are strong or weak:
- Strong: functions and initialized globals
- Weak: uninitialized globals
- Multiple strong symbols are not allowed
- Choose the strong symbol over weak symbols
- If there are multiple weak symbols, choose arbitrary one
- May cause undefined behavior over different compilers
- Fix: use
static
and explicitextern
- Symbols are strong or weak:
- Functions,
- Relocation
- Merge text and data segment
- Relative location -> absolute location
- Updates symbol table
- Relocation entries are used to aid symbol resolving:
a: R_X86_64_32 array
- Relocation entries are used to aid symbol resolving:
- Symbol resolution
- Three kinds of object files
- Relocatable object file (.o file)
- Executable object file (a.out file)
- Shared object file (.so file or .dll file)
- ELF format (Executable and Linkable Format)
- All 3 object files use ELF format
- Static libraries (.a archive files)
- Concatenate related relocatable object files into a single file with an index (called an archive)
- During linking, only referenced .o files are linked
- Command line order matters!
- During scan, keep a list of currently unresolved references
- If any entries in the unresolved list at end of scan, then error
- Fix: put libraries at the end of command line
- Commonly used libraries:
libc.a
(the C standard library)limb.a
(the C math library)
- Disadvantages
- Duplication in storage
- Bug fixes require relink
- Fix: shared libraries
- Shared libraries
- Dynamic linking can happen at:
- Load time
- Handled by the dynamic linker
libc.so
usually dynamically linked
- Run time
dlopen()
interface in linux
- Load time
- Dynamic linking can happen at:
- Library interpositioning
- Can happen at:
- Compile time
- Link time
- Load/run time
- Can be used for:
- Detecting memory leaks
- Generating address traces
- Can happen at:
- ECFs exists in all levels:
- Exceptions (low level)
- Processor responses to external events
- Exception tables
- Context switch
- Signals
- Nonlocal jumps
- Exceptions (low level)
- Exceptions (equivalent to user-kernel transition)
- Asynchronous (Interrupts)
- Indicated by INT pin
- Control flow returns to next instruction
- Synchronous
- Traps
- Intentional (syscall, breakpoints)
- Control flow returns to next instruction
- Faults
- Unintentional but possibly recoverable
- Control flow returns to current instruction or aborts
- Aborts
- Unintentional and unrecoverable
- Traps
- Asynchronous (Interrupts)
- Context switches
- From a programmer's perspective, a process can be:
- Running: Executing or will be scheduled
- Stopped: Suspended and will not be scheduled until further notice
- Terminated: Stopped permanently (zombie)
- Process terminates when:
SIGTERM
received- Return from
main()
- Called
exit()
- Process terminates when:
- Creating process:
fork()
- Reaping child processes:
wait()
- Terminated processes become zombies, because its parent may use its exit status or OS tables
wait()
andwaitpid()
reap zombie child processes- If parent don't reap:
- If parent doesn't terminate: Never diminishes (a kind of memory leak)
- If parent does terminate: Reaped by
init
process (pid == 1)
- So only need to explicitly reap long-running processes
- Loading and running processes:
execve()
int execve(char *filename, char *argv[], char *envp[])
- Loads and runs in the current process
- Overwrites code, data and stack
- Retains PID, open files (e.g.
stdout
), and signal context - Called once and never return (except error)
- Process groups
- Can be get and set by
getpgrp()
andsetpgid()
- Kill all process in a group with
kill -n -<pid>
- Can be get and set by
- Unix shell: An application that runs program on behalf of the user
- Shell contains a basic loop and a
eval()
function - Two cases in
eval()
:- Shell built-in command
- Not build-in, use
fork()
andexecve()
- Motivation: How to reap both foreground and background jobs?
- Basic loop: Only reaps foreground jobs
- Fix: Signals
- Shell contains a basic loop and a
- Signals
- Akin to exceptions and interrupts
- Sent from signal (sometimes at the request of another process via
kill
) - Identified by an integer
- Controlled by per-process
pending
andblocked
bit vectorspending
vector set and cleared by kernel when signals is sent or receivedblocked
vector can be manipulated bysigprocmask()
function- So, signals cannot be queued
- Send:
pending
bit set - Receive: process reacts to the signal, clears
pending
bit- Ignore
- Terminate
- Catch (using user-level function called signal handler)
- Kernels checks for
pnb = pending & ~blocked
at beginning of a time-slice- If
pnb == 0
:- Pass control to next instruction in the process logical flow
- Else
- Choose lease non-zero bit in
pnb
and forces the process to receive the signal - The receipt of the signal triggers some action by the process (clears
pending
bit) - Repeat for all remaining nonzero bits
- Pass control to next instruction in the process logical flow
- Choose lease non-zero bit in
- If
- Default action can be one of:
- Termination
- Stop until restarted by
SIGCONT
- Ignore
- Override default action by installing
signal handlers
:handler_t *signal(int signum, handler_t *handler)
handler
can be one of:SIG_IGN
: IgnoreSIG_DFL
: Revert to default- Function pointer to a user-level signal handler
- Signal handlers are a form of concurrency
- Signal handlers can be nested
- So we need blocking
- Implicit blocking: blocks pendings signals of same type
- Explicit blocking:
sigprogmask()
with supporting functions of:sigemptyset()
sigfillset()
sigaddset()
sigdelset()
- So we need blocking
- How to write safe handlers?
- Keep handlers as simple as possible
- Call only
async-signal-safe
function in handlersasync-signal-safe
functions are reentrent (access only local variables on stack), or cannot be interrupted by another signal handlerprintf()
,malloc()
andexit()
are not safewrite()
is the only signal-safe output function
- Save and restore
errno
on entry and exit - Protect accesses to shared data structures by temporarily blocking all signals in both handler and
main()
- Declare global variables to be
volatile
, to prevent from being optimized into registers - Declare global flags as
volatile sig_atomic_t
- Flag: variable only read or written (not
flag++
orflag+=10
) volatile sig_atomic_t
are ints on most systems
- Flag: variable only read or written (not
- Avoid race conditions
- Cannot make
any
assumption regarding execution order - However, we can control when handlers run by blocking
- Cannot make
- Explicitly waiting for signals: suppose handler sets global variable
pid
:- Spin wait:
while(!pid) {}
- Wasteful
- Pause:
while(!pid) pause()
- Race condition
- Sleep:
while(!pid) sleep(1)
- Too slow
- Solution:
sigsuspend
int sigsuspend(const sigset_t *mask)
- Equivalent to atomic:
sigprocmask(SIG_BLOCK, &mask, &prev); pause() sigprocmask(SIG_BLOCK, &prev, NULL);
- Spin wait:
- Portable signal handling
- Problem: Different versions of unix have different signal handling semantics
- Solution: Use
sigaction
- Physical Addressing: Used in microcontrollers, embedded systems, etc.
- Mentality: Main memory is a fully-associative cache for disk
- Load doesn't necessarily happen with
execve()
. It only allocates virtual address space with valid bit of 0 - Loading is a result of a page fault (demand paging)
- Load doesn't necessarily happen with
- Kernel memory invisible to application program. Kernel's address space starts with 1.
- Every memory access go through cache memory:
- Address translation: Multi-level page tables
- TLB: Small set-associative hardware cache in MMU
- Works only because of locality
- Unix I/O
- Opening and closing files:
open()
,close()
- Reading and writing files:
read()
,write()
- Changing file position:
lseek()
- View file metadata:
stat()
stat()
are both a syscall and a linux program- Syscalls are in second section of man:
man 2 stat
- Always check return codes for these syscalls
- Opening and closing files:
- File types: Regular, directory, socket, named pipes, symlinks, character and block devices
- Short counts: (
nbytes < sizeof(buf)
) are possible - Wrapper: RIO (robust I/O) package
- Unbuffered I/O of binary data:
rio_readn()
andrio_writen()
- Buffered I/O of text or binary:
rio_readlineb()
andrio_readnb()
- RIO package is better for input and output on network sockets
- Unbuffered I/O of binary data:
- Standard I/O
- Opening and closing:
fopen()
andfclose()
- Reading and writing bytes:
fread()
andfwrite()
- Reading and writing text lines:
fgets()
andfputs()
- Formatted reading and writing:
fscanf()
andfprintf()
- C program begin with 3 open files:
stdin
(descriptor 0)stdout
(descriptor 1)stderr
(descriptor 2)
- Opening and closing:
- Trace syscalls with the Linux
strace
program - Choosing I/O functions
- General: Use highest-level functions
- When to use Unix I/O: Signal handlers because unix I/O functions are
async-signal-safe
- When to use standard I/O: Disks, terminals
- When to use RIO: Network sockets
- How kernel represents open files
- Open file table: An instance of opening file
- If a process opens a file twice, there are two open file tables pointing to the same v-node table
- V-node table: File metadata (regardless of whether file is open)
- After
fork()
,refcnt
is incremented:
- Two processes share a same instance of opened file (including file position)
dup2(int oldfd, int newfd)
: Used for I/O redirection
- Open file table: An instance of opening file
- Recommended references:
- W. Richard Stevens & Stephen A. Rago, Advanced Programming in the Unix Environment, 2 nd Edition, Addison Wesley, 2005
- End-to-end Core i7 Address Translation
- L1 d-cache
index
andoffset
have 12 bits is NOT a coincidence: Speed up address translation
- Linux organizes VM as collections of areas:
- Fault handling: Traverse the
vm_area_struct
s to check if page is allocated
- Fault handling: Traverse the
- Private Copy-on-write (COW)
- Memory Mapping:
void *mmap(void *start, int len, int prot, int flags, int fd, int offset)
start
: A hint addressprot
:PROT_READ
,PROT_WRITE
,PROT_EXEC
flags
:MAP_ANON
,MAP_PRIVATE
,MAP_SHARED
- Returns a pointer to the start of mapped area (may not be start)
- Allocators: Maintain the heap as a collection of variable sized blocks, which are either allocated or free
- Explicit allocator: Application allocates and frees
- Implicit allocator: Application allocates but not frees
- The
malloc
packagevoid *malloc(size_t size)
void free(void *p)
calloc
: Initializes allocated blocks to 0realloc
: Changes size of previously allocated blocksbrk
: Used internally by allocators to grow and shrink the heap
- Constraints:
- Applications have few constraints
- Allocators have many constraints:
- Can't assume allocation patterns
- Must respond immediately to
malloc
(can't defer allocation) - Can't relocate allocated memory
- Performance goal (2 conflicting goals)
- Throughput: Number of completed requests per unit time
- Peak memory utilization: How to efficiently use memory
- Fragmentation
- Internal fragmentation:
- Payload smaller than block size
- Easy to measure
- External fragmentation:
- Enough aggregate heap memory, but no single free block large enough
- Difficult to measure
- Internal fragmentation:
- Keeping track of free blocks
- How to find a free block
- First fit
- Next fit
- Best fit
- Know how much to free: Header
- Implicit list
- Allocating a free block: May need to split the block
- Freeing a block: Have to coalesce free blocks (4 cases):
- Singly-linked list cannot free previous block in constant time
- Fix: Doubly-linked list (head and footer)
- Optimization:
- Allocated blocks doesn't need coalescing
- We have extra bits to encode whether previous block is allocated
- So, allocated blocks doesn't need footer
- Implicit lists are not commonly used because of linear time. However, the concepts of splitting and coalescing are general to all allcators
- Explicit free list
- Maintain list of free blocks using payload area
- Blocks can be in any order (depending on insertion policy)
- Unordered: LIFO, FIFO
- Address-ordered
- Much faster than implicit lists when memory is full
- Segregated list
- Garbage collection
- Mark and sweep collecting
- Allocate using
malloc
until run out of space - Use extra mark bit for each block
- Root nodes: Pointers in stack/data section
- Does not distinguish between pointers/non-pointers, thus "safe"
- Mark: Start at root nodes and do DFS
- Sweep: Start at beginning of VM, and free unmarked blocks
- How to find beginning of block? -- Use a balanced tree
- Allocate using
- Mark and sweep collecting
- Client-Server Architecture
- Network Architecture
- Socket
- Host and Service Conversion:
getaddrinfo
- Client/Server interface
telnet
: Testing servers- Creates TCP connection with a server (starts a session)
- Since the encoding of HTTP is ascii, we can hard-code http requests
- HTTP
- Content: A sequence of bytes in MIME (Multipurpose Internet Mail Extension) type
- The contents can be either static or dynamic
- Dynamic content:
- Produced by server-side program
- If URI containts
cgi-bin
then serve dynamic content - Use
fork()
andexec()
to execute new program - Use env-var
QUERY_STRING
to pass parameters
- Iterative servers have serious flaws.
- Easily get blocked by single misbehaving client
- Note: Blocking does not happen upon client calling
connect()
orwrite()
, but uponread()
. This is because server's kernel provides buffering
- Note: Blocking does not happen upon client calling
- So we need concurrent servers
- Easily get blocked by single misbehaving client
- Process-based servers
- Parent must close connected socket (parent doesn't get reaped)
- Child should close listening socket (child gets reaped)
- Reap child with
SIGCHLD
handler
- Event-based servers
- Manage multiple connections in user space
- Determine events using
select()
orepoll()
- Design of choice for high-performance web servers
- However, hard to provide find-grained concurrency
- Cannot take advantage of multi-core
- Thread-based servers
- Can run threads in
detached
mode. It will run independently, and get reaped automatically - Possible race conditions when passing parameters to new thread in
pthread_create()
- Can run threads in