Saturday, May 22, 2021

Cache Controller Design Verilog Code

Cache is a small piece of memory present in CPUs used to improve memory access times. Let us see how to design a cache controller in Verilog to control such a cache.

The reader is assumed to have a basic knowledge of memory organization and cache. This cache controller design has been done in this book Computer Organization and Design - by David A. Patterson and John L. Henessy. In the 5th edition of this book, section 5.9 describes a Finite State Machine to control a simple cache.
We will go ahead with a similar design model but write a unique Verilog code. 

Specifications:
  • Direct Mapped Cache
  • Block size = 4 words (1 word = 4 bytes)
  • Cache size = 1024 blocks
  • Write-back scheme
Structure:

Image Source: Computer Organization and Design by Patterson and Henessy - 5th edition

    
Basically, we have two interfaces for the cache controller: CPU interface and memory interface.
(Note: We will use the term memory to refer to main memory and not cache memory.)
The cache controller has to get requests from the CPU. The controller will perform the required write/read from the cache.

Write operation:
For a write operation, the dirty bit has to be set during writes due to write-back scheme. There are 2 cases to be checked during write operations.
1. If the dirty bit is not set for the tag in cache, then the write data can be directly written to the cache location.
2. If the dirty bit is set for the tag in cache, it means the data present in this tag location in cache, has to be first written to memory. Then we can replace the write data with the current tag location in cache.

Read operation:
If the cache was able to retrieve the required read data, it is termed a cache hit and the request will be processed.
If not, it is termed a cache miss and the cache controller now communicates with the main memory (memory interface) to perform the read operation.
Here again we have 2 cases:
1. If the dirty bit is not set for the tag in cache, then memory can read data from the requested address and replace the data in the corresponding tag location in cache.
2. If the dirty bit is set for the tag in cache, it means the data present in this tag location in cache, has to be first written to memory. Then only we can replace it with the newly requested data from memory. 

FSM Logic:

We have 4 states for the FSM:
  • IDLE: Wait in this state until a valid CPU request is received.
  • COMPARE_TAG: Explore the following possibilities
    Read hit? Then cache is ready with read data. Set tag and valid bit.
    Write? Check for dirty bit and perform action. Set tag, valid bit and dirty bit at the end.
    Read miss? Now check for dirty bit.
                        Clean? Goto ALLOCATE
                        Dirty? Goto WRITE_BACK
  • ALLOCATE: Here, we wait for the memory to provide us with read data and update the cache.
  • WRITE_BACK: Here, we perform the write operation first to memory for the old data that had set the dirty bit. After that, obtain the desired read data for current address and update the cache.
Verilog Code:

module cache_controller (
	clk,
	rst_n,
	cpu_req_addr,  //CPU signals
	cpu_req_datain,
	cpu_req_dataout,
	cpu_req_rw,
	cpu_req_valid,
	cache_ready,   //Cache ready
	mem_req_addr,  //Main memory signals	
	mem_req_datain,
	mem_req_dataout,
	mem_req_rw,
	mem_req_valid,
	mem_req_ready
);

parameter IDLE        = 2'b00;
parameter COMPARE_TAG = 2'b01;
parameter ALLOCATE    = 2'b10;
parameter WRITE_BACK  = 2'b11;

input clk;
input rst_n;

//CPU request to cache controller
input [31:0] cpu_req_addr;
input [127:0] cpu_req_datain;
output [31:0] cpu_req_dataout;
input cpu_req_rw; //1=write, 0=read
input cpu_req_valid;

//Main memory request from cache controller
output [31:0] mem_req_addr;
input [127:0] mem_req_datain;
output [31:0] mem_req_dataout;
output mem_req_rw;
output mem_req_valid;
input mem_req_ready;

//Cache ready
output cache_ready;

//Cache consists of tag memory and data memory
//Tag memory = valid bit + dirty bit + tag
reg [23:0] tag_mem [1023:0];
//Data memory holds the actual data in cache
reg [127:0] data_mem [1023:0];

reg [1:0] present_state, next_state;
reg [31:0] cpu_req_dataout, next_cpu_req_dataout;
reg [31:0] cache_read_data;
reg cache_ready, next_cache_ready;
reg [31:0] mem_req_addr, next_mem_req_addr;
reg mem_req_rw, next_mem_req_rw;
reg mem_req_valid, next_mem_req_valid;
reg [127:0] mem_req_dataout, next_mem_req_dataout;
reg write_datamem_mem; //write operation from memory
reg write_datamem_cpu; //write operation from cpu
reg tagmem_enable;
reg valid_bit, dirty_bit;

reg [31:0] cpu_req_addr_reg, next_cpu_req_addr_reg;
reg [127:0] cpu_req_datain_reg, next_cpu_req_datain_reg;
reg cpu_req_rw_reg, next_cpu_req_rw_reg;

wire [17:0] cpu_addr_tag;
wire [9:0] cpu_addr_index;
wire [1:0] cpu_addr_blk_offset;
wire [1:0] cpu_addr_byte_offset;
wire [19:0] tag_mem_entry;
wire [127:0] data_mem_entry;
wire hit;

//CPU Address = tag + index + block offset + byte offset
assign cpu_addr_tag         = cpu_req_addr_reg[31:14];
assign cpu_addr_index       = cpu_req_addr_reg[13:4];
assign cpu_addr_blk_offset  = cpu_req_addr_reg[3:2];
assign cpu_addr_byte_offset = cpu_req_addr_reg[1:0];

assign tag_mem_entry  = tag_mem[cpu_addr_index];
assign data_mem_entry = data_mem[cpu_addr_index];
assign hit = tag_mem_entry[19] && (cpu_addr_tag == tag_mem_entry[17:0]);

initial begin
$readmemh("tag_memory.mem", tag_mem);	//load initial values for tag memory
end

initial begin
$readmemh("data_memory.mem", data_mem);	//load initial values for data memory
end

always @ (posedge clk or negedge rst_n)
begin
  if(!rst_n)
  begin
	tag_mem[cpu_addr_index]  <= tag_mem[cpu_addr_index];
	data_mem[cpu_addr_index] <= data_mem[cpu_addr_index];
	present_state   	<= IDLE;
	cpu_req_dataout 	<= 32'd0;
	cache_ready     	<= 1'b0;
	mem_req_addr    	<= 32'd0;
	mem_req_rw      	<= 1'b0;
	mem_req_valid   	<= 1'b0;
	mem_req_dataout 	<= 128'd0;
	cpu_req_addr_reg	<= 1'b0;
	cpu_req_datain_reg  	<= 128'd0;
	cpu_req_rw_reg  	<= 1'b0;
  end
  else
  begin
    	tag_mem[cpu_addr_index]  <= tagmem_enable ? {4'd0,valid_bit,dirty_bit,cpu_addr_tag} : tag_mem[cpu_addr_index];
   	data_mem[cpu_addr_index] <= write_datamem_mem ? mem_req_datain : write_datamem_cpu ? cpu_req_datain_reg : data_mem[cpu_addr_index];
	present_state   	<= next_state;
	cpu_req_dataout 	<= next_cpu_req_dataout;
	cache_ready     	<= next_cache_ready;
	mem_req_addr    	<= next_mem_req_addr;
	mem_req_rw      	<= next_mem_req_rw;
	mem_req_valid   	<= next_mem_req_valid;
	mem_req_dataout 	<= next_mem_req_dataout;
	cpu_req_addr_reg	<= next_cpu_req_addr_reg;
	cpu_req_datain_reg  	<= next_cpu_req_datain_reg;
	cpu_req_rw_reg  	<= next_cpu_req_rw_reg;
  end
 end
 
always @ (*)
begin
write_datamem_mem    = 1'b0;
write_datamem_cpu    = 1'b0;
valid_bit            = 1'b0;
dirty_bit            = 1'b0;
tagmem_enable        = 1'b0;
next_state           = present_state;
next_cpu_req_dataout = cpu_req_dataout;
next_cache_ready     = 1'b1;
next_mem_req_addr    = mem_req_addr;
next_mem_req_rw      = mem_req_rw;
next_mem_req_valid   = mem_req_valid;
next_mem_req_dataout = mem_req_dataout;
next_cpu_req_addr_reg  = cpu_req_addr_reg;
next_cpu_req_datain_reg  = cpu_req_datain_reg;
next_cpu_req_rw_reg  = cpu_req_rw_reg;

case (cpu_addr_blk_offset)
	2'b00: cache_read_data   = data_mem_entry[31:0];
	2'b01: cache_read_data   = data_mem_entry[63:32];
	2'b10: cache_read_data   = data_mem_entry[95:64];
	2'b11: cache_read_data   = data_mem_entry[127:96];
	default: cache_read_data = 32'd0;
endcase

case(present_state)
  IDLE:
  begin
    if (cpu_req_valid)
    begin
    next_cpu_req_addr_reg  = cpu_req_addr;
    next_cpu_req_datain_reg  = cpu_req_datain;
    next_cpu_req_rw_reg  = cpu_req_rw;
    next_cache_ready = 1'b0;  
    next_state = COMPARE_TAG;
    end
    else
    next_state = present_state;
  end
  
  COMPARE_TAG:
  begin
    if (hit & !cpu_req_rw_reg) //read hit
    begin
    next_cpu_req_dataout = cache_read_data;
    next_state = IDLE;
    end
    else if (!cpu_req_rw_reg) //read miss
    begin
    next_cache_ready = 1'b0;  
	  if (!tag_mem_entry[18]) //clean, read new block from memory
	  begin
	  next_mem_req_addr = cpu_req_addr_reg;
	  next_mem_req_rw = 1'b0;
	  next_mem_req_valid = 1'b1;
	  next_state = ALLOCATE;
	  end
	  else 					  //dirty, write cache block to old memory address, then read this block with curr addr
	  begin
	  next_mem_req_addr = {tag_mem_entry[17:0],cpu_addr_index,4'd0}; //old tag, current index, offset 00
	  next_mem_req_dataout = data_mem_entry;
	  next_mem_req_rw = 1'b1;
	  next_mem_req_valid = 1'b1;
	  next_state = WRITE_BACK;
	  end
    end
    else //write operation
    begin
	  if (tag_mem_entry[18]) //dirty, write cache block to old memory address and then write new cache entry
	  begin
          next_cache_ready = 1'b0;  
	  next_mem_req_addr = {tag_mem_entry[17:0],cpu_addr_index,4'd0}; 
	  next_mem_req_dataout = data_mem_entry;
	  next_mem_req_rw = 1'b1;
	  next_mem_req_valid = 1'b1;
	  next_state = WRITE_BACK;
	  end
          else
	  begin
          valid_bit = 1'b1;
          dirty_bit = 1'b1;
          tagmem_enable = 1'b1;
          write_datamem_cpu = 1'b1;
          next_state = IDLE;
	  end
end end ALLOCATE: begin next_mem_req_valid = 1'b0; next_cache_ready = 1'b0; if(!mem_req_valid && mem_req_ready) //wait for memory to be ready with read data begin write_datamem_mem = 1'b1; //write to data mem valid_bit = 1'b1; //make the tag mem entry valid dirty_bit = 1'b0; tagmem_enable = 1'b1; next_state = COMPARE_TAG; end else begin next_state = present_state; end end WRITE_BACK: begin next_cache_ready = 1'b0; next_mem_req_valid = 1'b0; if(!mem_req_valid && mem_req_ready) //write is done, now read begin valid_bit = 1'b1; dirty_bit = 1'b0; tagmem_enable = 1'b1; next_mem_req_addr = cpu_req_addr_reg; next_mem_req_rw = 1'b0; next_mem_req_valid = !cpu_req_rw_reg;
        next_state = cpu_req_rw_reg ? COMPARE_TAG : ALLOCATE;
end else begin next_state = present_state; end end endcase end endmodule

Testbench:

module cache_controller_tb;

reg clk;
reg rst_n;
reg [31:0] cpu_req_addr;
reg [127:0] cpu_req_datain;
wire [31:0] cpu_req_dataout;
reg cpu_req_rw; //1=write, 0=read
reg cpu_req_valid;

wire [31:0] mem_req_addr;
reg [127:0] mem_req_datain;
wire [127:0] mem_req_dataout;
wire mem_req_rw;
wire mem_req_valid;
reg mem_req_ready;

cache_controller dut (
    .clk	       (clk),
    .rst_n             (rst_n),
    .cpu_req_addr      (cpu_req_addr),  
    .cpu_req_datain    (cpu_req_datain),
    .cpu_req_dataout   (cpu_req_dataout),
    .cpu_req_rw        (cpu_req_rw),
    .cpu_req_valid     (cpu_req_valid),
    .cache_ready       (cache_ready),   
    .mem_req_addr      (mem_req_addr),  
    .mem_req_datain    (mem_req_datain),
    .mem_req_dataout   (mem_req_dataout),
    .mem_req_rw        (mem_req_rw),
    .mem_req_valid     (mem_req_valid),
    .mem_req_ready     (mem_req_ready)
	);
	
always #5 clk = ~clk;

task reset_values ();
begin
  clk = 1'b1; rst_n = 1'b0;
  cpu_req_addr = 32'd0; cpu_req_datain = 128'd0; cpu_req_rw = 1'b0; cpu_req_valid = 1'b0; 
  mem_req_datain = 128'd0; mem_req_ready = 1'b1;
  #20 rst_n = 1'b1;
end
endtask

task write_cpu (input [31:0]addr, input [127:0]data);
begin
  #20 cpu_req_addr = addr; cpu_req_rw = 1'b1; cpu_req_valid = 1'b1; cpu_req_datain = data; //write to cache (AB)
  #10 cpu_req_addr = 32'd0; cpu_req_rw = 1'b0; cpu_req_valid = 1'b0; cpu_req_datain = 128'd0;
end
endtask

task read_cpu (input [31:0]addr);
begin
  #20 cpu_req_addr = addr; cpu_req_rw = 1'b0; cpu_req_valid = 1'b1;  
  #10 cpu_req_addr = 32'd0; cpu_req_rw = 1'b0; cpu_req_valid = 1'b0; 
end 
endtask

initial begin
  reset_values();
  write_cpu (332'hAB00,128'h1122);  //write 
  read_cpu  (32'hAB00);		    //read hit (same tag, same index)
  read_cpu  (32'hBB00);		    //read miss, clean (same tag, diff index)
  @(negedge mem_req_valid) mem_req_ready = 1'b0;
  #20 mem_req_datain = 128'h3344; mem_req_ready = 1'b1;  

  read_cpu  (32'hEB00);		   //read miss, dirty (diff tag, same index)
  @(negedge mem_req_valid) mem_req_ready = 1'b0;
  #20 mem_req_ready = 1'b1; 
  @(negedge mem_req_valid) mem_req_ready = 1'b0;
  #30 mem_req_datain = 128'h5566; mem_req_ready = 1'b1; 
end
endmodule

Simulation Waveform:



The sequence of events taking place in the waveform is described below:
  1. A valid CPU write request arrives with address 'hAB00. The write data is written to the cache location specified by the tag and index. But it is not yet written to main memory since it is a write back scheme (dirty bit is set to 1)
  2. Next, a CPU read request arrives with the same address. Since the data is already present in cache, it is a cache hit and the previously written data is read out.
  3. Next, a CPU read request arrives with the same tag but a different index from the first request. So it is a read miss and clean, due to which data is read from memory and sent out. (Memory has been loaded with some initial values)
  4. Next, a CPU read request arrives with a different tag but the same index as the first request. This is a case of read miss and dirty. So the cache data is first written to memory at the old address, then the required data is read out from memory at current address and sent out.

Conclusion:

Thus we have designed a simple cache controller model for a cache memory in Verilog.

14 comments:

  1. why tag memory is not
    reg [19:0] tag_mem [1023:0]; because 18 bits for tag, 1 bit for dirty, 1 bit for valid bit?

    ReplyDelete
    Replies
    1. You're right. 20-bits width is enough for tag_mem, but I had added 4 spare bits. It's not required though.

      Delete
    2. please can you send tag_mem and data_mem files on nishitnathwani97@gmail.com

      Delete
    3. I have added a hyperlink in the code, you can check now. It is initialized to all 0s.

      Delete
  2. Replies
    1. I haven't used gtkwave, see if this site helps:
      https://iverilog.fandom.com/wiki/GTKWave

      Delete
  3. how to view the simulation waveform?

    ReplyDelete
    Replies
    1. It depends on the simulator tool which you are using, check the user guide on how to dump the waveform and open the simulator.

      Delete
  4. When there is two consecutive write operation (different tag , same index). first write will set dirty bit, but without even considering a valid data in the cache block the second write will replace it. I think this is a bug in the write operation part

    ReplyDelete
    Replies
    1. Hello, you are right. There was a bug during consecutive write operation (different tag , same index).
      The design code and description has been updated. Please check and let me know if the issue is solved. Thank you once again for detecting the issue.

      Delete
  5. Whats the license of this implementation? Am I allowed to use this for a open source project? Want to implement this on the Mistery FPGA Core (Atari STe Implementation) as the cache controller is one of the last missing bits.

    ReplyDelete
    Replies
    1. Hello, you are free to use this work as long as you point to this website as your reference source. I also suggest you to do a thorough verification and feel free to let me know in case of any bugs, thanks!

      Delete