Subject: Re: Z80 emulator to learn assembly?
From: Paul Urbanus
Date: 1997/01/31
Message-ID: <32F28691.2ADC@onramp.net>
References: <32edd486.3490767@news.airmail.net> <5cgr4r$nqb@hecate.umd.edu> <5cipc6$ra9@dinkel.civ.utwente.nl>
Content-Type: text/plain; charset=us-Ascii
Organization: OnRamp Technologies; ISP; Dallas/Ft Worth/Houston, TX USA
Mime-Version: 1.0
Newsgroups: comp.emulators.misc,comp.os.cpm
X-Mailer: Mozilla 3.01 (Win16; I)

Marcel de Kogel wrote:
>
> On 26 Jan 1997 23:59:23 GMT, marat@Glue.umd.edu (Marat Fayzullin)
> wrote:
>
> >Rogers Cadenhead (rcade@airmail.net) wrote:
> >: I am learning Z80 assembly language programming so that I can write
> >: some new Colecovision games and figure out how some of my old
> >: favorites were written.
> >
> >: What's the best Z80 emulator I can find for DOS or Win95 that I can
> >: use to run the programs I'm writing? I've read that CPM emulators are
> >: the best choice.
> >As you are going to write Colecovision programs, a crossassembler+ColEm
> >combination will probably be the best. Also, check AdamEm, the Coleco
> >Adam emulator by Marcel de Kogel.
> >
> >Marat
>
> I'm working on some as well, and while writing them I found some very
> interesting features in the VDP design (e.g. there's no such thing as
> truly seperate read and write addresses) I didn't find described
> anywhere. While I've implemented most in ADAMEm, I didn't find out how
> some of this really works (e.g. I get mixed results when reading VRAM
> after setting a new write address). This is why I prefer using Mission
> and an MSX for testing purposes; In fact, it's why I wrote Mission in
> the first place. Of course, final testing is done on the CV itself,
> and you'll need an MSX to run Mission natively
>
> Marcel

Marcel,

The VDP (Video Display Processor) chip used in the Colecovision and in
the early (MSX-1?) systems was the Texas Instruments TMS9918A, which was
also used in the TI99/4A Home Computer.
This machine came onto the market in 1980, in the midst of the video
game/home computer boom. During that time, I worked for TI as a student,
and in 1982 I co-authored (along with Jim Dramis) a game for the 99/4A
called PARSEC, among other things. All of us game programmers always
lamented the fact that the VDP memory was 'indirectly' mapped instead of
directly mapped, which of course limited the amount of raw bit pushing
we could do.

Anyway, I think the following will (hopefully) clear up your confusion
regarding accessing VDP memory. Note that later versions of the MSX
systems (MSX-2?) used a superset of the TMS9918A, the V9938, which was
made by Yamaha. The following discussion applies only to the TMS9918A
VDP chip.

You are correct in stating that there is only ONE memory address
register in the VDP, and that it is used for both reading and writing
data. Thus, there must be a way to tell the VDP whether the address
which has been written is to be used for reading or for writing data.
This is done by using one of the upper address bits in the 16 bit
address. Since the 9918A can only address 16k bytes of memory, the upper
two bits in the address (A14-A15) would otherwise always be zero. While
bit 15 (the most significant) is always set to zero, bit 14 is used to
distinguish between a read and a write address. The following shows how
this bit affects subsequent VRAM data accesses.

VRAM address bit 14 | VRAM data access function
------------------------------------------------------------------------
         0          | VRAM address specifies location to read
                    | (initiate the read/increment the address counter)
------------------------------------------------------------------------
         1          | VRAM address specifies location to write
                    | (wait for data write to VDP before actual write
                    | to VRAM, then increment address counter)

If you are simply calculating the address for writing data and then
using it as the write address without setting bit 14=1, you may see
some unexpected behavior.
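As a minimal sketch of the address format just described, the two bytes sent to the VDP can be built as below. The function name `vdp_address_bytes` and its `write` parameter are made up for illustration. The byte order (LSByte first, MSByte second) follows the data manual's description of the MSByte as the second address byte.

```python
VRAM_SIZE = 0x4000    # the 9918A addresses 16K bytes (14 address bits)
WRITE_FLAG = 0x4000   # bit 14 set marks the address as a write address


def vdp_address_bytes(address, write):
    """Return the (LSByte, MSByte) pair for a VRAM address (hypothetical helper)."""
    if not 0 <= address < VRAM_SIZE:
        raise ValueError("VRAM address must fit in 14 bits")
    # Bit 15 is always left at 0; bit 14 selects read (0) or write (1).
    word = address | (WRITE_FLAG if write else 0)
    return word & 0xFF, word >> 8


low, high = vdp_address_bytes(0x1234, write=True)
print(hex(low), hex(high))
low, high = vdp_address_bytes(0x1234, write=False)
print(hex(low), hex(high))
```

Forgetting `write=True` here corresponds to the case discussed in this part of the post: the VDP treats the address as a read address.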
If bit 14=0, this will cause the VDP to initiate a read cycle and then
increment the address counter, thus giving the impression that the
"write address" has been set to (address+1). As I stated before, there
really is only one address register, so when you perform data
reads/writes you are affecting the same register. I'm sure that the VDP
designers (at TI, anyway) didn't expect people to interleave data reads
and writes without resetting the address, so any undocumented operation
may or may not be supported on all revisions of the chip. As with any
'undocumented bugs/features', I'd be concerned about the implementation
of these 'features' in 9918A clones, such as the Yamaha 9938.

Another thing you should be aware of is the timing constraints placed on
address and data accesses to the VDP RAM. Actual reading/writing of the
VDP RAM (VRAM) by the CPU can only occur when the VDP is not reading the
memory for the purpose of generating the screen image. In some display
modes, most of the memory bandwidth is utilized for generating the
image, leaving little time (unfortunately) for the CPU to access memory.
The worst case scenario is in graphics modes I,II, where the VDP uses
almost all of the memory bandwidth to generate the screen image. In
these modes, only 1 memory access out of 16 is designated for the CPU -
the rest are allocated for screen refresh.

According to the 9918A (VDP) Data Manual, there are two timing
constraints to be followed when accessing VRAM.

1. After the second address byte (MSByte) has been written to the VDP,
   there must be a 2 microsecond wait before any data read/write
   accesses can occur. This constraint ALWAYS applies, no matter which
   display mode is in effect or which part of the screen (active video,
   vertical sync/blanking) is being displayed. In the table below, this
   is referred to as 'VDP Delay'.

2.
The second timing constraint depends on which display mode is active in
the VDP, and which part of the screen (active video, vertical
sync/blanking) is being displayed. The following table shows these
timing constraints. In the table, this second delay constraint is
referred to as 'Time waiting for an access window'.

                    |            | VDP   | Time waiting for | Total
Condition           | Mode       | Delay | an access window | time
------------------------------------------------------------------------
Active Display Area | Text       | 2 us  | 0 - 1.1 us       | 2 - 3.1 us
------------------------------------------------------------------------
Active Display Area | Graphics   | 2 us  | 0 - 5.95 us      | 2 - 8 us
                    | I,II       |       |                  |
------------------------------------------------------------------------
4300 us after       | All        | 2 us  | 0 us             | 2 us
Vertical Interrupt  |            |       |                  |
------------------------------------------------------------------------
Register 1, bit 1=0 | All        | 2 us  | 0 us             | 2 us
(display is blanked)|            |       |                  |
------------------------------------------------------------------------
Active Display Area | Multicolor | 2 us  | 0 - 1.5 us       | 2 - 3.5 us
------------------------------------------------------------------------

Examination of the above access window table yields the following
observations.

1. Always try to do massive VRAM moves during the vertical retrace
   period, since that is when max memory bandwidth is available to the
   CPU, theoretically 500 Kbytes/sec. This is especially important in
   Graphics modes I & II, which will be used for almost ALL games.
   Theoretically, one can move (4300 us/2 us) 2150 bytes to/from the
   VRAM in one vertical blanking time.

2. If you need to move lots of data, such as completely changing
   screens, set the blanking bit in VDP register 1 to 0, then read/write
   the data.

WHY DOES THE VDP NEED SO MUCH BANDWIDTH TO REFRESH THE SCREEN, AND OTHER STUFF YOU REALLY DON'T NEED TO KNOW ABOUT THE 9918A?
--------------------------------------------------------------

The following is provided as additional background information, and may
be considered excess, but I give it for those who might want to
understand how the bandwidth is used in Graphics modes I & II.

First, consider the overriding considerations for the guys who did the
9918A chip design. Of course, the part must function, but more
importantly, the die size must be as small as possible to keep the cost
down. After all, this chip was targeted toward a consumer market.

Now, a little background on VDP memory and pixel timing. The master
clock for the VDP is the color burst frequency X 3. All subsequent
calculations are for the NTSC version of the part, although the PAL
numbers will be similar. The color burst frequency is 3.579545 MHz.
While I don't know pi to this many digits, the color burst frequency is
very handy to know when working with NTSC video. So, the master clock
frequency is given by

   Fmaster = Fcolorburst * 3
           = 3.579545 MHz * 3
           = 10.7386 MHz

The period of the master clock is given by

   Tmaster = 1/Fmaster
           = 1/10.7386 MHz
           = 93.12 ns (nanoseconds)

Each memory access takes four master clock times, so the memory access
time is given by

   Tmem = Tmaster * 4
        = 93.12 ns * 4
        = 372.5 ns
        = 0.3725 us

The horizontal line time, or the amount of time from the start of one
horizontal display line to the start of the next, is specified in the
data sheet as Thorz = 63.695 us. So, the total number of times which VDP
memory can be accessed in a single horizontal scan line is given by

   Mhorz = Max number of memory accesses in a horizontal line
         = Thorz/Tmem
         = 63.695 us/0.3725 us
         = 171 memory accesses per horizontal line, max

We now know how many memory accesses are available to be allocated for
display refresh and CPU accesses combined. Next, let's find out how many
memory accesses are required to build up a single horizontal scan line
in Graphics modes I or II.
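For anyone who wants to check the timing arithmetic so far, here is a short sketch that repeats it step by step. The variable names are my own; the constants are the NTSC figures quoted in this post.

```python
F_COLORBURST_HZ = 3_579_545   # NTSC color burst frequency
T_HORZ_US = 63.695            # horizontal line time quoted from the data sheet

f_master_hz = F_COLORBURST_HZ * 3    # master clock is 3x color burst
t_master_ns = 1e9 / f_master_hz      # master clock period in nanoseconds
t_mem_us = 4 * t_master_ns / 1000    # one memory access = 4 master clocks
m_horz = round(T_HORZ_US / t_mem_us) # access slots in one horizontal line

print(f"Fmaster = {f_master_hz / 1e6:.4f} MHz")
print(f"Tmaster = {t_master_ns:.2f} ns")
print(f"Tmem    = {t_mem_us:.4f} us")
print(f"Mhorz   = {m_horz} accesses per line")
```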
Any unused accesses can theoretically be allocated to the CPU.

Any one active scan line is composed of up to six layers of graphic data
(listed in back to front hierarchy):

1. Background color (from VDP register #7)
2. Character pattern/color info
3. Sprites (min number=0, max number=4)
   NOTE: there may never be more than 4 sprites on a horizontal scan
   line

Let's see how many memory accesses are required to get the data for the
three different 'planes' described above. First, the background color
requires zero memory accesses, as it is held in the lower 4 bits of VDP
register 7.

Next is the character data. There are 32 characters per scan line, and
each character in the scan line requires the following memory accesses
to retrieve the data required to generate the pixel data for that
character.

1. Read character number from Pattern Name Table (PNT)
2. Read character bitmap data from Pattern Generator Table (PGT)
3. Read character color info from Pattern Color Table (PCT)

As you can see, it takes three memory accesses for each character, and
so the total number of memory accesses required per scan line to build
up the character display plane is given by

   Mchar = 32 characters/scan line X 3 mem accesses/character
         = 96 memory accesses per scan line for character plane

Finally, the sprite planes must be processed. The 9918A allows up to
four sprites (out of 32) to be displayed on a scan line, and sprite #0
has the highest priority - that is, it will be the frontmost. To
determine which sprites will be visible on any given scan line, the
Y-position of all 32 sprites must be read from the Sprite Attribute
Table (SAT) in VRAM and compared against the current scan line number.
When doing the compare, the Mag bit from VDP register 1 must be taken
into account, since the magnification is in both the x and y directions.
If the Y-location of the sprite is such that it is to be displayed on
this scan line, then the sprite number (0-31) is placed in one of four
temporary holding registers (SR0-SR3), if all four registers are not
already filled. SR0 fills first, SR3 fills last, and SR0 specifies the
frontmost sprite plane and SR3 specifies the rearmost sprite plane.

While the Y-locations of these active sprites may be saved inside the
VDP, I suspect they are not. Keeping these Y-locations would require 4
extra holding registers, which can be eliminated by refetching the
Y-locations later, albeit at the 'cost' of more memory accesses. Four
registers affect the chip die size, while it is not clear that the VDP
user even knows about the 'cost' of these extra memory cycles.

In the worst case, the first 28 sprites, 0-27, are not displayed on a
given scan line, but sprites 28-31 will be displayed. In this case, the
Y-location of all 32 sprites may have to be read. For the purposes of
memory access calculations, we must assume that all 32 sprite
Y-locations will have to be read. Therefore, we define the number of
memory accesses required to test which sprites should be displayed on a
given scan line as

   Msprite_test = 32 memory cycles (1 Y-location per sprite)

After it is determined which sprites need to be displayed, the data for
the four sprites (again, worst case) to be displayed must be fetched.
For each sprite, there are 4 bytes (Y-location, X-location, pattern
number, color/early clock) which need to be fetched from the Sprite
Attribute Table. When the Size bit in VDP Register is set to 1,
indicating double size sprites, two bytes of sprite pattern data must be
read from the Sprite Pattern Generator Table. Again, this is the worst
case. Therefore, six memory cycles are required for each sprite which is
to be displayed.
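The sprite test described above can be sketched in a few lines. This is an illustration only: the function name `sprites_on_line`, its parameters, and the simple "Y <= line < Y + height" visibility test are my own assumptions and not the chip's exact rule. The list `holding` stands in for SR0-SR3.

```python
MAX_SPRITES_PER_LINE = 4   # SR0..SR3


def sprites_on_line(y_positions, line, height=8, mag=False):
    """Return the sprite numbers that would fill SR0..SR3 on one scan line.

    Hypothetical model: a sprite counts as visible when
    y <= line < y + rows, with rows doubled when the Mag bit is set.
    """
    rows = height * (2 if mag else 1)
    holding = []                                  # SR0 first, SR3 last
    for number, y in enumerate(y_positions):      # one Y read per sprite
        if y <= line < y + rows and len(holding) < MAX_SPRITES_PER_LINE:
            holding.append(number)
    return holding


# Worst case from the text: sprites 0-27 are off the line, 28-31 are on it.
ys = [200] * 28 + [50] * 4
print(sprites_on_line(ys, line=52))
```

Because the loop always walks all 32 entries, it matches the worst-case assumption of 32 Y-location reads per scan line.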
So, we now define the maximum number of memory cycles required to fetch
the data needed to display four sprites on a scan line as

   Msprite_data = 4 sprites/line x 6 memory cycles/sprite
                = 24 memory cycles

Now, let's summarize the maximum total number of accesses required for
displaying four sprites on a scan line as

   Msprite = total number of memory accesses per scan line for
             sprite display
           = test which sprites are on this line
             + sprite display data access
           = Msprite_test + Msprite_data
           = 32 + 24
           = 56 memory cycles

Whew! Finally, we can calculate the number of memory cycles used to
refresh one active scan line in the display. This is given by

   Mdisplay = Mchar + Msprite
            = 96 + 56
            = 152

For those of you who I have not totally confused, the end is now in
sight. We are ready to compute the number of memory cycles available to
the CPU. Drum roll, please!

   Mcpu = Mem accesses in one horizontal scan line - display mem accesses
        = Mhorz - Mdisplay
        = 171 - 152
        = 19 memory accesses available for the CPU

If the CPU can access the memory every 5.95 us, then the total number of
CPU accesses allowed in a horizontal line time is given by

   Mcpu_horz = horizontal line time/time between CPU accesses
             = 63.695 us/5.95 us
             = 10.7 CPU memory accesses per horizontal scan line

If one rounds 10.7 up to 11, that would seem to indicate that there are
8 memory cycles (19-11=8) which are unused. Perhaps those extra 8 cycles
could have been used to allow a fifth sprite on a line, since each
sprite costs a total of seven memory cycles (1 for y-test, then 6 more
if displayed). However, that would only leave one memory cycle to spare.
Also, there are scheduling and synchronization issues involved regarding
the sprites, and it would have probably required too much chip area to
squeeze in that one extra sprite.

Or, maybe those 19 cycles could have all been allocated to CPU accesses.
However, remember the earlier statement that every 16th memory access is
allocated for the CPU.
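The budget above and the every-16th-access rule can be checked with a short sketch. The slot counter model and the function name `cpu_slots_per_line` are assumptions of mine: the real chip decodes its horizontal counter, and which of the 16 slot positions goes to the CPU is not stated here, so the sketch tries all of them.

```python
M_HORZ = 171        # access slots per horizontal line (from the text)
T_MEM_US = 0.3725   # time for one memory access (from the text)

m_char = 32 * 3                 # name, pattern and color read per character
m_sprite = 32 + 4 * 6           # 32 Y tests plus 6 reads for each of 4 sprites
m_display = m_char + m_sprite   # slots needed to refresh one scan line
m_cpu = M_HORZ - m_display      # slots the display does not need


def cpu_slots_per_line(phase):
    """Count slots whose low four bits equal `phase` (hypothetical decode)."""
    return sum(1 for slot in range(M_HORZ) if slot & 0x0F == phase)


counts = {cpu_slots_per_line(phase) for phase in range(16)}
print("display slots:", m_display, "spare slots:", m_cpu)
print("CPU slots per line, depending on phase:", sorted(counts))
print("time between CPU slots (us):", 16 * T_MEM_US)
```

Whichever slot position is chosen, a 4-bit decode yields either 10 or 11 CPU slots per line, which is in line with the 10.7 figure computed from the 5.95 us access window.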
It is relatively simple (cheap) to decode this CPU access slot from the
horizontal counter inside of the VDP which is used for overall
horizontal timing. If, instead, we take the 19 cycles and divide them
into the horizontal line time, we get

   Taccess_best = 63.695 us/19 memory cycles
                = 3.35 us between CPU data accesses

If memory cycles take 372.5 ns, then the CPU could have every ninth
memory cycle (3.35/0.3725). Since nine is not an integral power of two,
a separate CPU access counter would be required and would take more chip
area (cost) than a simple decode of the lower four bits of the
horizontal counter.

To summarize, the sprites take up slightly more than 1/3 of the display
bandwidth. Unfortunately, the chip designers did not include a way to
turn off the sprites and thus allow 1 of four memory accesses to be
allocated to the CPU.

I hope this information will be useful or educational to someone out
there. Maybe you now have a better understanding of how the video
hardware in the Colecovision works.

Paul Urbanus
urb@urbonix.com

P.S. In anticipation of doing some work for the Colecovision, I built a
single-board computer (SBC) that used the TI processor. This SBC
attached to the expansion port of the Colecovision and used DMA (Direct
Memory Access) to access the hardware. Since we already had a debugger
written for the TI99/4A, I modified it for the SBC so we could learn how
to access the Colecovision hardware. HERE'S THE PERVERSE PART - since we
didn't know any Z-80 assembly, we wanted to examine some of the code in
Coleco games. So we wrote a symbolic Z80 disassembler IN TI9900 ASSEMBLY
LANGUAGE. What were we thinking???

***                                                                 ***
*  Paul Urbanus                                    urb@urbonix.com    *
*                                                                     *
*  Never wrestle with a hog - you get dirty and the hog likes it.     *
***                                                                 ***