M65C02 - Microprogrammed Synthesizable IP Core
This project provides a microprogrammed synthesizable IP core compatible with the WDC and Rockwell 65C02 microprocessors.
This project demonstrates the integration of the core, M65C02_Core, with several components, usually supplied by the core's integrator, so that a complete soft-processor is available. The core itself expects several external components to be supplied by the integrator: (1) interrupt controller, (2) memory, and (3) I/O interface buffers. This project integrates examples of those external components with the core logic into a soft-microprocessor in a Xilinx Spartan 3A FPGA: XC3S50A-4VQ100I.
For this project, a rudimentary interrupt controller has been implemented that provides a functioning interrupt system compatible with a standard 65C02. A Block RAM is used for implementation of a Boot ROM, and a four cycle external interface is provided for easy interfacing to asynchronous SRAM and Flash EPROM. With the exception of the internal clock generator, which uses a Xilinx Digital Clock Manager (DCM), the design files use inference for all logic. Any other FPGA family which supports synchronous Block RAMs should be able to support the core and implement the soft-microprocessor demonstrated by this project. (The core of the project has been used to synthesize a similar 65C02 microprocessor for an Altera Cyclone II/III FPGA in the DE0/DE1 FPGA development boards. This particular project is being used in a custom designed board with the Spartan 3A FPGA previously defined.)
The core of the processor is implemented using a microprogram controller (MPC). The MPC, a reimplementation of the Fairchild 9408 microprogram sequencer, provides all of the microprogram control logic used to implement the instruction sequencer and instruction decoding. The instruction sequencing control microprogram is implemented using a single Block RAM organized as a 512 x 32 ROM. Of this ROM, the upper 256 words are part of the instruction decoder. These ROM locations are used to provide the first microword of an instruction. A second, 256 x 32 Block RAM provides the control signals for the core which are fixed for each specific instruction. This second ROM essentially functions as an instruction decoder/ALU control word ROM.
The complex addressing modes of the processor require instruction decoding to determine two components for the execution engine: (1) the addressing mode, and (2) the ALU control signals. The ROMs used in the implementation of this core provide these two functions. Essentially, when the microprogram sequence provided by the first ROM has fetched any operands required by an instruction, then the ALU is commanded to perform the necessary operations as determined by the control word provided by the second ROM. The sequence control ROM is accessed every memory cycle, and the instruction decode/ALU control word ROM is accessed once per instruction cycle. The opcode is applied to both ROMs simultaneously at the completion of the instruction fetch cycle. (Generally speaking, the concept of an instruction decoder is that of a static decoder driven by the contents of an instruction register (IR). That concept is not employed in the M65C02. The instruction opcode is simultaneously applied to both microprogram ROMs on the same cycle it is loaded into the IR. In one ROM, the instruction initiates the fetch of the next instruction, or an operand, or an address. In the other, it looks up the ALU and register control signals. The IR holds the opcode for the instruction, but it is otherwise unused in the M65C02.)
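The dual ROM lookup described above can be sketched in a few lines of Python. This is only an illustration: the function name and the exact address mapping (the opcode used directly as the index into the upper half of the sequence ROM) are assumptions based on the description, not taken from the Verilog source.

```python
# Illustrative sketch only; names and the exact mapping are assumptions.
SEQ_ROM_WORDS = 512       # sequence control ROM: 512 x 32
DEC_ROM_WORDS = 256       # instruction decode/ALU control ROM: 256 x 32
UPPER_HALF_BASE = 0x100   # assumed start of the upper 256 words


def rom_addresses(opcode):
    """Return (sequence ROM address, decode ROM address) for an opcode.

    Both ROMs are addressed by the same opcode on the same cycle:
    the sequence ROM supplies the first microword of the instruction,
    the decode ROM supplies the fixed ALU/register control word.
    """
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must be a single byte")
    seq_addr = UPPER_HALF_BASE | opcode   # assumed: opcode indexes the upper half
    dec_addr = opcode                     # one control word per opcode
    return seq_addr, dec_addr


if __name__ == "__main__":
    seq, dec = rom_addresses(0xA9)        # LDA #imm
    print(hex(seq), hex(dec))
```

Under this assumed mapping, the lower 256 words of the sequence ROM remain available for the rest of each instruction's microprogram.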
The instruction fetch cycle of the next instruction is generally overlapped with the execution of the previous instruction. Read-modify-write instructions break this overlapped execution cycle. Instructions which alter program flow (JMP, Bxx, JSR, RTS, RTI, etc.) also break the overlapped instruction fetch/execution model. A unique feature of this core is that branch instructions execute in 2 cycles regardless of the condition. This feature of the M65C02 core has the potential to save a significant number of clock cycles in any program that requires a lot of conditional branching. (The M65C02 microprogram is pipelined, and instruction execution is similarly pipelined whenever possible.)
The core is divided into four modules: (1) core, (2) MPC, (3) address generator, and (4) ALU. The core module instantiates the other three modules, provides the MPC next address/branch address logic, instantiates the microprogram and instruction decoder ROMs, and implements the output data bus multiplexer and the temporary operand registers.
The MPC incorporates the microprogram sequencer control logic, a micro-subroutine stack (not actually used in the implementation of the M65C02), and a micro-cycle length controller. The micro-cycle length controller implements a fixed length microcycle of four clocks. It also implements wait state logic which inserts memory cycle extensions of four clocks. The inclusion of the micro-cycle length controller significantly reduces the issues encountered when attaching standard asynchronous RAMs and EPROMs to the M65C02 soft-microprocessor. The core logic is able to execute all instructions in a single cycle, but implementing single cycle external memory is simply not feasible at the speeds attainable with the core itself. Since the target is the smallest of the Spartan 3A family, there simply is not enough internal block RAM memory to provide a reasonable soft-microprocessor implementation. With other design constraints, the micro-cycle length could be reduced or eliminated in order to extract additional performance from the core.
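As a rough illustration of the timing implied by the micro-cycle length controller, the following Python sketch computes the clocks consumed by one memory access. The function names and the `wait_states` parameter are hypothetical, and the sketch assumes each wait state adds exactly one four-clock extension.

```python
# Illustrative sketch only; names are hypothetical.
CLOCKS_PER_MICROCYCLE = 4   # fixed micro-cycle length stated above


def clocks_for_access(wait_states=0):
    """Clocks used by one memory cycle with the given number of wait states.

    Assumes each wait state extends the cycle by one full micro-cycle.
    """
    if wait_states < 0:
        raise ValueError("wait_states must be non-negative")
    return CLOCKS_PER_MICROCYCLE * (1 + wait_states)


def memory_cycle_rate_hz(clk_hz, wait_states=0):
    """Memory cycles per second for a given input clock frequency."""
    return clk_hz / clocks_for_access(wait_states)


if __name__ == "__main__":
    print(clocks_for_access(0))   # no wait states
    print(clocks_for_access(1))   # one wait state
```

The clock frequency passed to `memory_cycle_rate_hz` is whatever the integrator's clock generator produces; no particular value is implied here.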
The address generator incorporates the memory address register, and the program counter. Separate address generators are used for the memory address and the program counter. The focus is on overall performance, and the additional logic increases the number of slices/LUTs in the implementation. However, the additional resources allow all dead cycles to be removed. This results in many instructions having a reduced number of memory cycles compared to the W65C02 or R65C02. In fact, there is a reduction of at least 1 memory cycle in approximately 40% of the instructions. A special feature of this core is that all branch instructions require only 2 cycles rather than 2 (branch not taken) or 3 (branch taken) as is the case for a standard 65C02. Since the majority of the branches in a loop are of the branch taken variety, this optimization alone can provide a substantial improvement to a program's execution time.
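To put a number on the branch optimization, here is a small Python sketch that counts the memory cycles spent on the loop-closing branch of a counted loop. It is a simplified model with hypothetical names: it assumes the branch is taken on every pass except the last, and it ignores any page-crossing penalty on a standard 65C02.

```python
# Illustrative sketch only; a simplified cycle count, not a timing model.
def loop_branch_cycles(iterations, taken_cost, not_taken_cost):
    """Cycles spent on the loop-closing branch over a whole loop.

    The branch is taken on every pass except the final one.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    taken = iterations - 1
    return taken * taken_cost + not_taken_cost


if __name__ == "__main__":
    n = 100
    standard = loop_branch_cycles(n, taken_cost=3, not_taken_cost=2)
    m65c02 = loop_branch_cycles(n, taken_cost=2, not_taken_cost=2)
    print(standard, m65c02, standard - m65c02)
```

In this model the saving is one memory cycle per taken branch, so the benefit grows linearly with the iteration count.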
The ALU contains all of the logic for the A, X, Y, P, and S registers. The ALU supports both binary and BCD modes. With the micro-cycle length controller setting the basic memory cycle as four clock periods in length, the BCD mode ADC/SBC execute in a single memory cycle. (If configured to operate as a single cycle core, the decimal mode instructions automatically insert a single wait state. With the four cycle micro-cycle implementation provided, the extra cycle of the BCD instructions is absorbed into the address output phase of the following memory cycle. Thus, there's no penalty for the decimal mode ADC/SBC instructions.) In the ALU, the stack pointer is implemented as a loadable up/down counter, but it is also augmented with its own dedicated incrementer so that both push and pop operations are only 2 memory cycles in length.
As implemented in this project, the external memory interface attempts to replicate operational characteristics of the 6502/65C02 memory interface. Due to the fact that all external signals are registered in the IOBs of the FPGA, the address and data are not output until the rising edge of Phi2. This is different than a standard 6502/65C02 where the address and output data are enabled during Phi1, but not considered stable until the rising edge of Phi2. The integral micro-cycle controller acts as the Phi1/Phi2 clock generator. The basic machine/memory cycle is shifted from the micro-cycle to account for the register in the data input path of the IOB. The standard memory cycle is two clock periods for Phi1 and two clock periods for Phi2. On the falling edge of Phi2, the data from the memory is registered into the FPGA in the IOB.
Several integral address decoders are included in the design. Although the standard 6502/65C02 R/W signal is provided, the M65C02 also provides separate read (nOE) and write (nWR) strobe signals for direct attachment to SRAMs and EPROMs. These signals do not assert except during Phi2. Thus, they should be used instead of external combinatorial logic to control the nOE and nWR signals on standard SRAMs and EPROMs.
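The relationship between the four-clock memory cycle, Phi2, and the two strobes can be sketched as a per-clock table in Python. This is a simplified model, not derived from the Verilog source: it assumes two clocks of Phi1 followed by two clocks of Phi2, no wait states, and strobes that are held active (low) for the whole of Phi2, whereas the text above only guarantees that they never assert outside Phi2.

```python
# Illustrative sketch only; strobe widths within Phi2 are an assumption.
def memory_cycle(is_read):
    """Return a list of (phi2, nOE, nWR) tuples, one per clock.

    Clocks 0-1 are Phi1 (phi2 = 0), clocks 2-3 are Phi2 (phi2 = 1).
    nOE and nWR are active low and only assert while phi2 is high.
    """
    cycle = []
    for clk in range(4):
        phi2 = 1 if clk >= 2 else 0
        n_oe = 0 if (phi2 and is_read) else 1       # read strobe
        n_wr = 0 if (phi2 and not is_read) else 1   # write strobe
        cycle.append((phi2, n_oe, n_wr))
    return cycle


if __name__ == "__main__":
    for row in memory_cycle(is_read=True):
        print(row)
```

Because neither strobe can be low while Phi2 is low, an attached SRAM sees a write pulse only after the registered address has been driven.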
Synthesis/PAR Results - XC3S50A-4VQ100I FPGA
| Resource | Used | Available | Utilization |
|---|---|---|---|
| Number of Slice Flip Flops | 248 | 1408 | 17% |
| Number of 4 input LUTs | 647 | 1408 | 45% |
| Number of occupied Slices | 400 | 704 | 56% |
| Number of Slices related logic | 400 | 400 | 100% |
| Number of Slices unrelated logic | 0 | 400 | 0% |
| Total Number of 4 input LUTs | 661 | 1408 | 46% |
| Number used as logic | 646 | | |
| Number used as a route-thru | 14 | | |
| Number used as Shift registers | 1 | | |
| Number of bonded IOBs | | | |
| Number of bonded pads | 54 | 68 | 79% |
| IOB Flip Flops | 79 | | |
| Number of BUFGMUXs | 1 | 24 | 16% |
| Number of DCMs | 1 | 2 | 50% |
| Number of RAMB16BWEs | 3 | 3 | 100% |
The core has undergone significant testing and verification. A self-checking testbench is provided for the ALU. Self-checking programs, some written by me and some written by third parties, are also provided. The core and the resulting soft-microprocessor have passed all of these tests, and this provides good confidence that the processor core, as provided in this project, is very stable.
Klaus Dormann's extensive 6502 functional test program has been successfully executed in both simulation and on the target HW, i.e. the XC3S50A/XC3S200A Development Board used in this project. Full source and the memory initialization files for the FPGA for this functional test program set are included in the repository. Klaus Dormann maintains this functional test program set on GitHub.
In some applications, certain implementation decisions used in the M65C02 may produce undesired behavior, but otherwise, the core may be considered error free.
First, the core does not attempt to replicate in a cycle accurate way the behavior of the original 6502/65C02 microprocessors. The core, as provided here, removes as many dummy memory cycles as possible and overlaps instruction fetch and execution as much as possible. This means that many instructions execute in fewer cycles compared to the 6502/65C02 processors. Furthermore, additional address generation logic has been included so that branch instructions execute in 2 memory cycles rather than the usual 2 (condition false) or 3 (condition true) cycles required by the 6502/65C02 processors.
Second, the core implements all undefined instructions as single cycle NOPs. This implementation decision will give different results for the M65C02 than either of the 6502 or the 65C02. In the 6502, undefined instructions have side effects, some of which have proven useful to some programmers, and generally result in variable execution times. The 65C02, on the other hand, eliminated the side effects of the undefined instructions, but allowed multiple memory cycles for some undefined instructions. If existing code relies on the behavior of the 6502/65C02 to undefined opcodes, then the M65C02 core is not a potential replacement. The behavior of the 65C02, with respect to undefined opcodes, can be incorporated into the M65C02 with some simple microprogram ROM changes, which can be inserted into the final configuration bit stream using the Data2MEM utility.
Third, the behavior of the M65C02 to a BRK is consistent with the intent. However, existing 6502/65C02 processors push, as the return address, the address of the second byte after the BRK instruction. This particular characteristic, coupled with the fact that the BRK and IRQ traps share a single service routine, means that debuggers and other such utilities must adjust the return address on the processor stack in order to return instruction processing to the instruction after BRK. In contrast, the M65C02 pushes the address of the BRK instruction. This means that no manipulation of the return address on the stack is required to continue with the instruction following BRK. (Note: if the BRK instruction is inserted by a debugger, then it may require adjustment of the return address on the stack in order to restore the program to the state before the break point was inserted. If the debugger or monitor requires 6502/65C02 BRK behavior and can't be adjusted, then the M65C02 core is not a viable candidate.)
Fourth, on the M65C02, BRK, IRQ/NMI, and JSR all push the address of the last byte in the instruction. For BRK this is the address of the BRK opcode itself, for JSR it's the address of the high byte of the target address, and for any interruptable instruction, it's the address of the last byte of the instruction. For two byte instructions, it's the address of the second byte, and for three byte instructions it's the address of the third byte. This allows the implementation of RTS/RTI in a consistent manner: the return address pulled from the stack is always incremented by one regardless of whether an RTS or RTI instruction is being executed. If specific expectations regarding the return address on the stack are not required, then the M65C02's behavior is transparent and should not be an issue.
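The return address rule described above can be worked through with a short Python sketch. The function names and example addresses are hypothetical; the M65C02 rule follows the text (push the address of the last byte of the instruction, add one on return), and the standard 6502/65C02 BRK value (address of the second byte after BRK) is taken from the third point.

```python
# Illustrative sketch only; function names and addresses are hypothetical.
def m65c02_pushed_address(instr_addr, instr_len):
    """Address pushed by the M65C02 for BRK, IRQ/NMI, and JSR:
    the address of the last byte of the instruction."""
    return (instr_addr + instr_len - 1) & 0xFFFF


def m65c02_return_target(pushed):
    """RTS and RTI both resume at the pulled address plus one."""
    return (pushed + 1) & 0xFFFF


def standard_brk_pushed_address(brk_addr):
    """Standard 6502/65C02: BRK pushes the address of the second byte
    after the BRK opcode."""
    return (brk_addr + 2) & 0xFFFF


if __name__ == "__main__":
    # JSR (3 bytes) at $8000: pushes $8002, returns to $8003
    jsr_push = m65c02_pushed_address(0x8000, 3)
    print(hex(jsr_push), hex(m65c02_return_target(jsr_push)))
    # BRK (1 byte) at $9000: pushes $9000, returns to $9001
    brk_push = m65c02_pushed_address(0x9000, 1)
    print(hex(brk_push), hex(m65c02_return_target(brk_push)))
    # Standard part pushes $9002 for the same BRK
    print(hex(standard_brk_pushed_address(0x9000)))
```

A monitor written for the standard part that subtracts from the stacked BRK address would therefore need a different adjustment on the M65C02.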
The fifth and final limitation is that not all M65C02 instructions are interruptable. Instructions such as CLI and SEI are not interruptable. This implementation was chosen so that these instructions would not have to be implemented as two cycle instructions to account for the pipelined fetch/execute nature of the M65C02. The M65C02 also does not allow the interruption of any program flow control instruction. Thus, all jumps, branches, calls, and returns are not interruptable. Interrupts (NMI or IRQ) will be delayed until after the completion of the first instruction after a jump, branch, call, or return instruction. This means that using a self-referencing loop, e.g. Here: bra here, to wait for an interrupt is not allowed with the M65C02. Any such loop must include at least one interruptable instruction, or use the WAI instruction which is expressly intended for this situation.
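The effect of non-interruptable instructions on interrupt latency can be modelled with a small Python sketch. It is a behavioral illustration only: the mnemonic set below is an assumed list covering CLI, SEI, and common flow control instructions rather than the core's actual table, and the function name is hypothetical.

```python
# Illustrative sketch only; the mnemonic set is an assumed, partial list.
NON_INTERRUPTIBLE = {
    "CLI", "SEI",                                  # named in the text above
    "JMP", "JSR", "RTS", "RTI",                    # jumps, calls, returns
    "BRA", "BCC", "BCS", "BEQ", "BNE",             # branches
    "BMI", "BPL", "BVC", "BVS",
}


def interrupt_taken_after(trace, pending_from):
    """Index of the instruction after which a pending interrupt is taken.

    `trace` is the sequence of executed mnemonics; the interrupt becomes
    pending while instruction `pending_from` executes. Returns None if no
    interruptable instruction completes within the trace.
    """
    for i in range(pending_from, len(trace)):
        if trace[i].upper() not in NON_INTERRUPTIBLE:
            return i
    return None


if __name__ == "__main__":
    # "Here: bra here" never reaches an interruptable instruction
    print(interrupt_taken_after(["BRA"] * 10, 0))
    # Adding a NOP to the loop gives the interrupt a place to land
    print(interrupt_taken_after(["NOP", "BRA"] * 5, 1))
```

In the second trace the interrupt is recognized at the NOP that follows the branch, which matches the rule that an interrupt is delayed until the first instruction after a flow control instruction completes.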