Homework 1 - solutions

Use the SUN Cereal machines in Hill 248 to run all the programs. Note that the descriptions here are minimal - you are expected to provide more details. :) Since i wrote up most of these in a day, they may not be the best solutions, but they are illustrative. Please point out bugs/typos (from cut&paste) etc. if any.


(A) Cache measurements


(1) Write a C program to measure the size of L1 and L2 caches - using capacity misses.


To determine the cache size, the approach here is to try to access blocks of data of increasing size until the size becomes greater than the L1/L2 cache size. We notice a jump in the access times when we cross these boundaries(L1 and L2 cache sizes).

Accesses must be done in terms of cache lines. The cache line size also can be determined by accessing strides of various lengths. When one crosses the size of the cache line, a difference in the access time can be noticed. I'm assuming a cache line size of 32 bytes for the L1 cache and 64 bytes for L2.

The following program determines the L1 & L2 cache sizes using capacity misses. Run the following program multiple times, each time with a different guess for the number of cache lines. Figure out the cache sizes by noticing a jump in the access times. For example, if you increase the guess by a factor of two each time, you'll see a jump(by more than a factor of two) when you cross the cache sizes. The program assumes that the cache is split (i.e. there are separate data and instruction caches). Click here to see the output and results.

      #include <sys/time.h>
#define PAGE_SIZE 8192
#define TOTAL_MEM_SIZE (4096 * 1024)
#define L1_CACHE_LINE_SIZE 32 

      main(int argc, char *argv[]) {
    float total_time;
    register int *a, *b, i, num_cache_lines_guess, num_iter, k, l;
    struct timeval t1, t2;

    if(argc < 2) {
        printf("Usage: %s guess_num_cache_lines\n", argv[0]);
        exit(1);
    }

    num_cache_lines_guess = atoi(argv[1]);
    b = a = (void *) sbrk(TOTAL_MEM_SIZE);

    /* Touch all pages */
    for(k = 0; k < TOTAL_MEM_SIZE/PAGE_SIZE; k++, a += 2048)
        *a = 1;

    num_iter = TOTAL_MEM_SIZE / (num_cache_lines_guess * L1_CACHE_LINE_SIZE); 

          gettimeofday(&t1,0);

    for(a = b,k = 0; k < num_iter; k++, a = b) {

        /* Each iteration in the loop accesses one cache line */
        for(l = 0; l < num_cache_lines_guess; l++,a += 8)
            i = *a;
    }

    gettimeofday(&t2,0);

    total_time = ((t2.tv_sec-t1.tv_sec)*1000000 + (t2.tv_usec-t1.tv_usec));
    printf("Access time for block of size %d : %f microseconds\n",
    num_cache_lines_guess * L1_CACHE_LINE_SIZE, total_time/num_iter);

    }


(2) Write a C program to measure the size of L1 cache - using conflict misses. What are the assumptions under which your program finds the correct size?

    Here, the approach is to generate accesses so that the same instruction generates accesses to addresses that map to the same location in cache. Addresses that are "cache size" apart will have this property. Two important assumptions for this program to work are that the cache is direct mapped and virtually addressed. The output and result follows the program.

      #include <unistd.h>

#define NUM_PAGES 1024
#define MAX_GUESS 128*1024

main() {
    volatile char *a;
    register int pagesize, i, incr;
    hrtime_t start, end;
    long long int calc;

    pagesize = sysconf(_SC_PAGESIZE);

    printf("PAGESIZE = %d\n", pagesize);

    a = (char *) memalign(pagesize, pagesize*NUM_PAGES);

    for (i=0; i < NUM_PAGES; i++) a[pagesize*i] = 0;

    for(incr=1; incr < MAX_GUESS; incr*=2) {

        start = gethrtime();
        for(i=pagesize*NUM_PAGES/4; i<pagesize*(NUM_PAGES/4 + 1); i++)
            a[i] += a[i+incr];
        end = gethrtime();

        calc = (end - start) / (pagesize);

        printf("Avg time (%d accesses, incr = %d) = %lld nsec\n", pagesize, incr, calc);
        usleep(1000);
    } 

}

Results for trix.rutgers.edu
L1 cache size = 16K
Actual data
    Avg time (8192 accesses, incr = 1) = 89 nsec
    Avg time (8192 accesses, incr = 2) = 86 nsec
    Avg time (8192 accesses, incr = 4) = 86 nsec
    Avg time (8192 accesses, incr = 8) = 86 nsec
    Avg time (8192 accesses, incr = 16) = 86 nsec
    Avg time (8192 accesses, incr = 32) = 86 nsec
    Avg time (8192 accesses, incr = 64) = 86 nsec
    Avg time (8192 accesses, incr = 128) = 86 nsec
    Avg time (8192 accesses, incr = 256) = 86 nsec
    Avg time (8192 accesses, incr = 512) = 86 nsec
    Avg time (8192 accesses, incr = 1024) = 86 nsec
    Avg time (8192 accesses, incr = 2048) = 87 nsec
    Avg time (8192 accesses, incr = 4096) = 87 nsec
    Avg time (8192 accesses, incr = 8192) = 88 nsec
    ==> L1 size = 16384 bytes
    Avg time (8192 accesses, incr = 16384) = 151 nsec
    Avg time (8192 accesses, incr = 32768) = 151 nsec
    Avg time (8192 accesses, incr = 65536) = 151 nsec

(3) Determine the L1 and L2 cache miss penalties (by modifying one of the above programs).  

The approach here is to try to access the caches in terms of the respective cache lines, and make sure that each access is a miss. (These programs actually measure the miss latency. Miss penalty will be latency - hittime. Measuring hittime is trivial.)

      #include <sys/time.h>
#define PAGE_SIZE 8192
#define L1_CACHE_LINE_SIZE 32
#define L1_CACHE_SIZE 16384

main(int argc, char *argv[]) {
    float total_time;
    int num_accesses;
    register int *a, *b, i, num_iter, k, l;
    struct timeval t1, t2;

    num_iter = 1024;

          b = a = (void *) sbrk(2 * L1_CACHE_SIZE);

    /* Touch all pages */
    for(k = 0; k < 2 * L1_CACHE_SIZE/PAGE_SIZE; k++, a += 2048)
        i = *a;

    gettimeofday(&t1,0);

    for(a = b,k = 0; k < num_iter; k++, a = b) {

        /* Alternate accesses to blocks of data, each of size = L1_CACHE_SIZE. These blocks
            mutually knock each other out of the L1 cache, resulting in L1 cache misses for every access */

        /* Each iteration in the loop accesses one cache line */
        for(l = 0; l < L1_CACHE_SIZE/L1_CACHE_LINE_SIZE; l++,a += 8)
            i = *a;

        for(l = 0; l < L1_CACHE_SIZE/L1_CACHE_LINE_SIZE; l++,a += 8)
            i = *a;
    } 

          gettimeofday(&t2,0);

    total_time = ((t2.tv_sec-t1.tv_sec)*1000000 + (t2.tv_usec-t1.tv_usec));
    num_accesses = num_iter * (L1_CACHE_SIZE/L1_CACHE_LINE_SIZE) * 2.0;
    printf("L1 cache miss latency: %f microseconds\n", total_time/num_accesses);

L1 cache miss latency = ~50 nanoseconds
Actual data (from trix)
> repeat 10 ./miss1
L1 cache miss latency: 0.050288 microseconds
L1 cache miss latency: 0.050339 microseconds
L1 cache miss latency: 0.050270 microseconds
L1 cache miss latency: 0.049543 microseconds
L1 cache miss latency: 0.050138 microseconds
L1 cache miss latency: 0.049687 microseconds
L1 cache miss latency: 0.050120 microseconds
L1 cache miss latency: 0.049520 microseconds
L1 cache miss latency: 0.050299 microseconds
L1 cache miss latency: 0.061378 microseconds 
The above program can be modified to measure L2 miss latency by changing 
the L1_CACHE_SIZE and L1_CACHE_LINE_SIZE to the L2 cache and line sizes
respectively (512K and 64 in this case).
L2 cache miss latency = ~291 nanoseconds
> repeat 10 ./miss2
L2 cache miss latency: 0.291649 microseconds
L2 cache miss latency: 0.291595 microseconds
L2 cache miss latency: 0.291650 microseconds
L2 cache miss latency: 0.291899 microseconds
L2 cache miss latency: 0.291164 microseconds
L2 cache miss latency: 0.295575 microseconds
L2 cache miss latency: 0.292015 microseconds
L2 cache miss latency: 0.293448 microseconds
L2 cache miss latency: 0.291755 microseconds
L2 cache miss latency: 0.291537 microseconds

(B) Implementing a shell


Write a simple Unix shell program that 

(It should also handle white space between components in a single command line correctly. i.e. "cmd | more", "cmd|  more", "   cmd |more", "cmd|more   " should all work.)

Here is the shell program.


(C) Page Fault Measurements


Write C programs to measure the time taken to service a page fault (segmentation violation). This should be measured in two ways:

(1) Protect a page using mprotect, and then try to access it. In the handler for segmentation violation, unprotect it.

      #include <sys/time.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>
#include <signal.h>

#define ITER 9999

char *a;
int pagesize;
int counter = ITER;
struct timeval begin, end;

void handler() {
    if(counter--)
        return;
    gettimeofday(&end, NULL);
    if(mprotect(a, pagesize, PROT_WRITE) <0) { 
        perror("mprotect:"); exit();
    }
}

main() {
    struct sigaction sa;
    long diff;

    pagesize = sysconf(_SC_PAGESIZE);

    a = (char *)valloc(pagesize);

    sa.sa_handler = handler;
    sigaction(SIGSEGV, &sa, NULL);

    if(mprotect(a, pagesize, PROT_NONE) <0) { 
        perror("mprotect:"); exit();
    }

    gettimeofday(&begin, NULL);

    *a = 1;

    diff = (end.tv_sec - begin.tv_sec)*1000000 + (end.tv_usec - begin.tv_usec);
    printf("fault time = %f microsecs\n", (float)diff/(ITER+1));
}

Results for trix.rutgers.edu

Protection fault handling time in kernel = ~96 microseconds

Actual data

      > repeat 10 ./flt1
fault time = 99.181396 microsecs
fault time = 99.248001 microsecs
fault time = 96.362000 microsecs
fault time = 97.499802 microsecs
fault time = 100.148201 microsecs
fault time = 96.206497 microsecs
fault time = 96.246597 microsecs
fault time = 96.569298 microsecs
fault time = 97.553101 microsecs
fault time = 96.353699 microsecs


(2) Use a null pointer and try to access it. In the signal handler, skip the faulting instruction by incrementing the program counter, which you can find on the stack. The following code fragment gets the program counter from the stack.

signal_handler(int t) {
    int *p = &t + 42; /* program counter available at *(&t + 42) */
}

      #include <sys/time.h>
#include <unistd.h>
#include <signal.h>

#define ITER 9999

int counter = ITER;
struct timeval begin, end;

void handler(int t) {
    int *p = &t + 42;

    if(counter--)
        return;
    gettimeofday(&end, NULL);

    (*p)+=4;
}

main() {
    char *a;
    struct sigaction sa;
    long diff;

    a = NULL;

    sa.sa_handler = handler;
    sigaction(SIGSEGV, &sa, NULL);

    gettimeofday(&begin, NULL);

    *a = 1;

    diff = (end.tv_sec - begin.tv_sec)*1000000 + (end.tv_usec - begin.tv_usec);
    printf("fault time = %f microsecs\n", (float)diff/(ITER+1));
}

Results from trix.rutgers.edu

Segmentation fault handling time in kernel = ~85 microseconds

Actual data

      > repeat 10 ./flt2
fault time = 85.142899 microsecs
fault time = 85.683899 microsecs
fault time = 85.086304 microsecs
fault time = 86.544403 microsecs
fault time = 87.665703 microsecs
fault time = 87.082603 microsecs
fault time = 85.144501 microsecs
fault time = 85.115402 microsecs
fault time = 87.924004 microsecs
fault time = 86.466599 microsecs