CDMP Research Project
Optimal Scheduling for Multicore Processors

home

blog

project

about me

Week 16 [August 17, August 23]

This week was spent winding up loose ends and writing my final report. We didn't make the August deadline for the paper, but we still plan to publish our findings in a journal in the next few months. Overall, this has been an excellent experience. Thanks to Professor Fedorova and all the other students I worked with on this project!

Week 15 [August 10, August 16]

This week I compiled data for another part of the project. Whereas my research has focused on characterizing a program's sensitivity to L2 cache size, last semester another student, Daniel Shelepov, researched and wrote a paper on characterization of a programs sensitivity to clock speed. His model uses a program's speed's elasticity to change in clock frequency. To validate this model, I calculated the elasticity of the speed of various benchmarks to change in clock frequency using data from a number of different machines.

Week 14 [August 3, August 9]

This week I tested some strategies for making my model more accurate. The data used to estimate program sensitivity to L2 cache size are collected using a toolkit for the Pin framework called MICA. The data collected is the number of memory accesses with different reuse distances. This data is returned as a series of numbers representing the number of memory accesses with reuse distance in various ranges. The miss estimate model requires the number of memory accesses at each reuse distance, but since we only have the number of memory accesses in various ranges, we estimate by assuming that each memory access with reuse distance in a given range had a reuse distance of 4/3 the lower bound of the range. I came up with several ways to estimate the number of memory accesses at several reuse distances in the range. In order to test these new methods, Daniel and I rewrote the MICA program for reuse distances to collect data on smaller ranges. However, comparing my new methods to and the old method to the finer data did not show any improvement from the old method to any of the new methods, and so they will not be used.

Week 13 [July 27, August 2]

It looks like we will not be able to make the August deadline for the conference we wanted to present our findings at. There have been some unforeseen complications with writing the scheduler. We will aim for a later deadline, probably closer to November, so that we have lots of time to test everything.

This week I finished up my model by examining how misses might translate to performance. I compared the output of my Matlab program that estimates misses per instruction to the misses per instruction from the simulations. The results looked good. It's fairly simple to estimate how the misses will affect the total number of cycles - multiply the number of misses by some latency such as 200 cycles and that gives the total number of cycles added by L2 misses. However, there are also a certain number of cycles that the program would require anyway, even with na infinite L2 cache. This baseline is dependent on the micro-architecture and so is impossible to calculate at compile time. I decided to go with the total number of instructions as the baseline number of cycles, since this is the minimum number of cycles a program requires. However, this is a very rough estimate, since there are a number of factors besides L2 cache size which would add cycles to the total.

Week 12 [July 20, July 26]

This week I finished rerunning the simulations and compiled all the IPC data for each benchmark with each L2 cache size.

Based on my data, it won't be possible to represent a program's sensitivity to L2 cache size with a single number. The change in IPC over cache size of the benchmarks varies between increasing rate of improvement and decreasing rate of improvement at unpredictable points, making the various IPC vs. L2 cache size functions impossible to characterize with a single parameter. However, since there are only a small number of reasonable L2 cache sizes - powers of 2 between 16KB and 16MB - the L2 cache signature will simply be composed of 10 numbers, one for each of 32KB to 16MB, somehow showing the improvement in performance for each cache size from the one before it (or from 16KB).

In order to estimate the improvement from one cache size to another, I use Daniel Shelepov's method for estimating the number of misses probabilistically based on the reuse distances of each memory access. At Daniel's request, I went through the model myself to validate each step and I found it quite sound. I wrote a Matlab script to estimate the misses per instruction from the memory reuse distance data using this model.

Week 11 [July 13, July 19]

This week I finished collecting data on the memory reuse distances of the benchmarks. This is data will be used to come up with L2 cache signatures for each of the benchmarks. To collect this data, I used a tool called MICA.

After looking at the data I'd collected from Simics simulations, it seemed there were some discrepancies between the number of misses and the total number of cycles. I did some research and found that some of the misses in Simics don't result in additional cycles, and are instead recorded as "lost stall cycles", a value which needs to be collected before and after each simulation. Since I hadn't been collecting data before the simulations, I will have to run them all over again. I began re-running the simulations once again and I wrote a program to calculate the number total number of cycles from the simulation data by adding the lost stall cycles from after the simulation, and subtracting the lost stall cycles from before the simulation.

Week 10 [July 6, July 12]

Since some of the benchmarks are written in Fortran, I had to figure out how to call a C function (the function to insert a magic breakpoint) from Fortran. This wasn't particularly difficult, but compiling it was a bit more tricky. For a description I wrote of how to call a C function from Fortran, click here.

I recompiled the Fortran benchmarks with the magic instruction, so now all the benchmarks have been recompiled. I continued to run simulations with these new "magic" benchmarks.

Week 9 [June 29, July 5]

This week was pretty busy, since we want to finish our research in time to submit a paper for a conference and the deadline is in early August.

In order to ensure that my simulations are as accurate as possible, I had to insert a breakpoint into each benchmark that would stop the simulation momentarily after the program has been loaded and just before it has begun running so that I could begin my measurements at that point. To do this, I had to insert some code at the beginning of the main method of each benchmark and then recompile the benchmarks. At this point I've only recompiled the C and C++ benchmarks. I'll recompile the Fortran ones next week.

Once I'd recompiled the benchmarks, I began running the simulations again with the new benchmarks. These simulations took much longer since the simulated cache was now being used. As my simulations finished, I graphed the data.

As my benchmark simulations were running, I also ran a program called mica with each benchmarks to measure the memory reuse distances for each program. The memory reuse distance of a memory access is the number of unique memory addresses accessed between the current access and the previous access to the same address. mica returns the number of memory accesses with reuse distances in various ranges. This statistics is independent of the hardware. In our framework, we will use this information about a program to predict how well it will respond to a larger l2 cache.

Week 8 [June 22, June 28]

This week I began analyzing my data with some other students working on the project and it became clear that something was wrong with the simulation. Sasha figured out that my simulations were not really simulating cache accesses. To force the simulator to do this I had to disable the simulators default caches so that it would recognize the ones I had configured. Sasha also suggested that to make the simulations more accurate I could start all my measurements after the program has loaded but just before it begins running. The way I was doing it was to begin my measurement, call the program, and then measure again at the end. In order to skip the loading portion of the process, I will have to recompile the benchmarks with a special no op instruction that the simulator recognizes. When this special instruction is read, control is returned from the simulated machine to the host machine and at that point I can begin the measurement before returning control to the simulated machine.

Week 7 [June 15, June 21]

I spent most of this week running more simulations. I also began to generate graphs of the data in Matlab.

Week 6 [June 8, June 14]

This week I set up a simulated machine with x86 architecture and installed linux. I then wrote scripts to run the benchmarks with l2 caches from 16-4096KB and measure the IPC. I began running the simulations.

Week 5 [June 1, June 7]

This week, I began running simulations of several different benchmarks on machines with varying l2 cache sizes. I'll be spending the next couple of weeks collecting similar data that I will use to develop a measure of how much a program benefits from a larger l2 cache.

Week 4 [May 25, May 31]

This week, I worked with a program called Simics which is used to simulate different types of processors. I ran benchmarks on a simulated machine with varying l1 and l2 cache sizes as well as varying cache line size.

Week 3 [May 18, May 24]

This week, I continued to read about heterogeneous multicore processors. I read about a proposed technique for optimal scheduling of heterogeneous multicore processors that involves gathering information about each program as it runs and then reassigning the programs to appropriate cores based on its performance. This is related to the research I'll be doing since it relates to optimal scheduling. However I will research ways to identify characteristics of the program's performance at compile time so that the operating system can schedule it based on this information.

I also measured some cache usage statistics of a particular processor using cputrack and running several benchmarks from SPEC CPU2000. This exercise was to familiarize myself with some of the tools available for measuring performance.

Week 2 [May 11, May 17]

This week I learned about a research tool called Pin, which injects binary code into a running program in order to collect statistics about the program execution, such as types of machine instructions executed, locations of memory reads, etc. This information can be used to evaluate the efficiency of a processor, and diagnose where the major inefficiencies lie. For example, certain statistics can indicate how many cache misses occur in the program's execution. Others may indicate how well-suited a program is to instruction level parallelism.

In addition, I began reading some papers about heterogeneous multicore processors, since this is the type of processor I'll be researching this summer.

Week 1 [May 4, May 10]

This week I spent most of my time reading and reviewing the basics of computer architecture. I also read about pipelining, multi-threading, and some other types of parallelism.