Subject: Re: Z80 emulator to learn assembly?
From: Paul Urbanus <urb@onramp.net>
Date: 1997/01/31
Message-ID: <32F28691.2ADC@onramp.net>
References: <32edd486.3490767@news.airmail.net> <5cgr4r$nqb@hecate.umd.edu> <5cipc6$ra9@dinkel.civ.utwente.nl>
Content-Type: text/plain; charset=us-Ascii
Organization: OnRamp Technologies; ISP;  Dallas/Ft Worth/Houston, TX USA
Mime-Version: 1.0
Newsgroups: comp.emulators.misc,comp.os.cpm
X-Mailer: Mozilla 3.01 (Win16; I)


Marcel de Kogel wrote:
>
> On 26 Jan 1997 23:59:23 GMT, marat@Glue.umd.edu (Marat Fayzullin)
> wrote:
>
> >Rogers Cadenhead (rcade@airmail.net) wrote:
> >: I am learning Z80 assembly language programming so that I can write
> >: some new Colecovision games and figure out how some of my old
> >: favorites were written.
> >
> >: What's the best Z80 emulator I can find for DOS or Win95 that I can
> >: use to run the programs I'm writing? I've read that CPM emulators are
> >: the best choice.
> >As you are going to write Colecovision programs, a crossassembler+ColEm
> >combination will probably be the best. Also, check AdamEm, the Coleco
> >Adam emulator by Marcel de Kogel.
> >
> >Marat
>
> I'm working on some as well, and while writing them I found some very
> interesting features in the VDP design (e.g. there's no such thing as
> truly seperate read and write addresses) I didn't find described
> anywhere. While I've implemented most in ADAMEm, I didn't find out how
> some of this really works (e.g. I get mixed results when reading VRAM
> after setting a new write address). This is why I prefer using Mission
> and an MSX for testing purposes; In fact, it's why I wrote Mission in
> the first place. Of course, final testing is done on the CV itself,
> and you'll need an MSX to run Mission natively
>
> Marcel

Marcel,

The VDP (Video Display Processor) chip used in the Colecovision and in
the early (MSX-1?) systems was the Texas Instruments TMS9918A, which was
also used in the TI99/4A Home Computer. This machine came onto the
market in 1980, in the midst of the video game/home computer boom.
During that time, I worked for TI as a student, and in 1982 I
co-authored (along with Jim Dramis) a game for the 99/4A called PARSEC,
among other things. All of us game programmers always lamented the fact
that the VDP memory was 'indirectly' mapped instead of direct, which of
course limited the amount of raw bit pushing we could do. Anyway, I
think the following will (hopefully) clear up your confusion regarding
accessing VDP memory. Note that later versions of the MSX systems
(MSX-2?) used a superset of the TMS9918A, the V9938, which was made by
Yamaha. The following discussion applies only to the TMS9918A VDP chip.

You were correct when you stated that there is only ONE memory address
register in the VDP, and this is used for both reading and writing data.
Thus, there must be a way to indicate to the VDP whether the address
which has been written is to be used for reading or writing data. This
is done by using one of the upper address bits in the 16 bit address.

Since the 9918A can only address 16k bytes of memory, only 14 address
bits are needed, which leaves the upper two bits of the 16 bit address
(A14-A15) free for other uses. Bit 15 (the most significant) is always
set to zero when setting an address, while bit 14 is used to distinguish
between a read and a write address. The following shows how this bit
affects subsequent VRAM data accesses.

VRAM address bit 14  |         VRAM data access function
------------------------------------------------------------------------
         0           |   VRAM address specifies location to read
                     | (initiate the read/increment the address counter)
------------------------------------------------------------------------
         1           |  VRAM address specifies location to write
                     | (wait for data write to VDP before actual write
                     |  to VRAM, then increment address counter)

If you simply calculate the address for writing data and then use it as
the write address without setting bit 14=1, you may see some unexpected
behavior. With bit 14=0, the VDP will initiate a read cycle and then
increment the address counter, thus giving the impression that the
"write address" has been set to (address+1). As I stated before, there
really is only one address register, so when you perform data
reads/writes you are affecting the same register.
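
As a rough sketch of the address setup described above, here is a small
Python helper that builds the two bytes written to the VDP when setting
an address. The function name and constants are my own for illustration,
and the low-byte-first ordering is an assumption (it matches the "second
address byte (MSByte)" wording used further down), so treat this as a
model of the bit layout rather than as driver code.

```python
READ_WRITE_BIT = 0x4000  # bit 14 of the 16 bit address word: 1 = write, 0 = read
ADDRESS_MASK = 0x3FFF    # the 9918A addresses 16k bytes, so 14 address bits


def vdp_address_bytes(vram_address, for_write):
    """Return (first_byte, second_byte) for a VRAM address setup.

    Hypothetical helper: the low byte is assumed to go first, then the
    high byte with bit 14 set for a write address and clear for a read
    address. Bit 15 is always left at zero.
    """
    if not 0 <= vram_address <= ADDRESS_MASK:
        raise ValueError("VRAM address must fit in 14 bits")
    word = vram_address | (READ_WRITE_BIT if for_write else 0)
    return word & 0xFF, (word >> 8) & 0xFF


if __name__ == "__main__":
    # Same VRAM location, once as a write address and once as a read address.
    print([hex(b) for b in vdp_address_bytes(0x1800, for_write=True)])
    print([hex(b) for b in vdp_address_bytes(0x1800, for_write=False)])
```

The only difference between the two calls is bit 14 in the second byte,
which is exactly the bit that is easy to forget.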

I'm sure that the VDP designers (at TI, anyway) didn't expect people to
interleave data reads and writes without resetting the address, so any
undocumented operation may or may not be supported on all revisions of
the chip. As with any 'undocumented bugs/features', I'd be concerned
about the implementation of these 'features' in 9918A clones, such as
the Yamaha 9938.

Another thing you should be aware of is the timing constraints placed on
address and data accesses to the VDP RAM. Actual reading/writing of the
VDP RAM (VRAM) by the CPU can only occur when the VDP is not reading the
memory for the purpose of generating the screen image. In some display
modes, most of the memory bandwidth is utilized for generating the
image, leaving little time (unfortunately) for the CPU to access memory.
The worst case scenario is in Graphics modes I and II, where the VDP uses
almost all of the memory bandwidth to generate the screen image. In these
modes, only 1 memory access out of 16 is designated for the CPU - the
rest are allocated for screen refresh.

According to the 9918A (VDP) Data Manual, there are two timing
constraints to be followed when accessing VRAM.

1. After the second address byte (MSByte) has been written to the VDP,
there must be a 2 microsecond wait before any data read/write accesses
can occur. This constraint ALWAYS applies, no matter which display mode
is in effect or which part of the screen (active video, vertical
sync/blanking) is being displayed. In the table below, this is referred
to as 'VDP Delay'.

2. The second timing constraint depends on which display mode is active
in the VDP, and which part of the screen (active video, vertical
sync/blanking) is being displayed. The following table shows these
timing constraints. In the table, this second delay constraint is
referred to as 'Time waiting for an access window'.

                    |            |  VDP  | Time waiting for |  Total
   Condition        |    Mode    | Delay | an access window |  time
------------------------------------------------------------------------
Active Display Area |   Text     | 2 us  |   0  -  1.1  us  | 2 - 3.1 us
------------------------------------------------------------------------
Active Display Area |  Graphics  | 2 us  |   0  -  5.95 us  | 2 - 8   us
                    |    I,II    |       |                  |
------------------------------------------------------------------------
4300 us after       |    All     | 2 us  |      0       us  |   2     us
Vertical Interrupt  |            |       |                  |
------------------------------------------------------------------------
Register 1, bit 1=0 |    All     | 2 us  |      0       us  |   2     us
(display is blanked)|            |       |                  |
------------------------------------------------------------------------
Active Display Area | Multicolor | 2 us  |    0  -  1.5 us  | 2 - 3.5 us
------------------------------------------------------------------------

Examination of the above access window table yields the following
observations.

1. Always try to do massive VRAM moves during the vertical retrace
period, since that is when max memory bandwidth is available to the CPU,
theoretically 500 Kbytes/sec. This is especially important in Graphics
modes I & II, which will be used for almost ALL games. Theoretically,
one can move (4300 us/2 us) 2150 bytes to/from the VRAM in one vertical
blanking time.

2. If you need to move lots of data, such as completely changing
screens, set the blanking bit in VDP register 1 to 0, then read/write
the data.
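
The arithmetic behind observation 1 can be sketched in a few lines of
Python. The constants come straight from the table above; the function
names are mine, and the figures are theoretical ceilings that ignore
whatever time the CPU itself needs per byte.

```python
VDP_DELAY_US = 2.0         # minimum spacing between accesses, from the table
VBLANK_WINDOW_US = 4300.0  # free-access window after the vertical interrupt


def bytes_per_window(window_us, per_access_us=VDP_DELAY_US):
    """Whole bytes that fit in a window at one access per per_access_us."""
    return int(window_us // per_access_us)


def peak_bytes_per_second(per_access_us=VDP_DELAY_US):
    """Theoretical transfer rate at one byte per per_access_us."""
    return 1_000_000.0 / per_access_us


if __name__ == "__main__":
    print(bytes_per_window(VBLANK_WINDOW_US))  # bytes per vertical blanking
    print(peak_bytes_per_second())             # bytes per second, best case
    print(peak_bytes_per_second(8.0))          # Graphics I,II worst case
```

Passing 8.0 us as the per-access time gives the worst-case rate for the
active display area in Graphics modes I,II, which shows why the blanking
period matters so much.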


   WHY DOES THE VDP NEED SO MUCH BANDWIDTH TO REFRESH THE SCREEN,
   AND OTHER STUFF YOU REALLY DON'T NEED TO KNOW ABOUT THE 9918A?
   --------------------------------------------------------------

The following is provided as additional background information, and may
be considered excess, but I give it for those who might want to
understand how the bandwidth is used in Graphics modes I & II.

First, consider the overriding considerations for the guys who did the
9918A chip design. Of course, the part must function, but more
importantly, the die size must be as small as possible to keep the cost
down. After all, this chip was targeted toward a consumer market.

Now, a little background on VDP memory and pixel timing. The master
clock for the VDP is the color burst frequency X 3. All subsequent
calculations are for the NTSC version of the part, although the PAL
numbers will be similar. The color burst frequency is 3.579545 MHz.
While I don't know pi to this many digits, the color burst frequency is
very handy to know when working with NTSC video. So, the master clock
frequency is given by

Fmaster =  Fcolorburst * 3
        =  3.579545 MHz * 3
        = 10.7386 MHz

The period of the master clock is given by

Tmaster = 1/Fmaster
        = 1/10.7386 MHz
        = 93.12 ns (nanoseconds)

Each memory access takes four master clock times, so the memory access
time is given by

Tmem    = Tmaster * 4
        = 93.12 ns * 4
        = 372.5 ns = 0.3725 us

The horizontal line time, or the amount of time from the start of one
horizontal display line to the next horizontal display line is specified
in the data sheet as Thorz = 63.695 us. So, the total number of times
which VDP memory can be accessed in a single horizontal scan line is
given by

Mhorz  = Max number of memory accesses in a horizontal line
       = Thorz/Tmem
       = 63.695 us/0.3725 us
       = 171 memory accesses per horizontal line, max
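
For anyone who wants to check the numbers, the chain of calculations
above can be reproduced with this short Python sketch. The constants are
the ones quoted in the text; the function names are just mine.

```python
F_COLORBURST_HZ = 3_579_545.0  # NTSC color burst frequency
T_HORZ_US = 63.695             # horizontal line time from the data sheet
CLOCKS_PER_ACCESS = 4          # master clocks per memory access


def master_clock_hz():
    return F_COLORBURST_HZ * 3


def master_period_ns():
    return 1e9 / master_clock_hz()


def memory_access_us():
    return CLOCKS_PER_ACCESS * master_period_ns() / 1000.0


def accesses_per_line():
    # Rounded to the nearest whole access, as in the hand calculation.
    return round(T_HORZ_US / memory_access_us())


if __name__ == "__main__":
    print(f"Fmaster = {master_clock_hz() / 1e6:.4f} MHz")
    print(f"Tmaster = {master_period_ns():.2f} ns")
    print(f"Tmem    = {memory_access_us():.4f} us")
    print(f"Mhorz   = {accesses_per_line()} accesses per line")
```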

We now know how many memory accesses are available to be allocated for
display refresh and CPU accesses combined.

Next, let's find out how many memory accesses are required to build up a
single horizontal scan line in Graphics modes I or II. Any unused
accesses can theoretically be allocated to the CPU.

Any one active scan line is composed of up to six layers of graphic data
(the background, the character plane, and up to four sprites), which
fall into three kinds of 'planes' (listed in back to front hierarchy):

1. Background color (from VDP register #7)
2. Character pattern/color info
3. Sprites (min number=0, max number=4)
   NOTE: there may never be more than 4 sprites on a horizontal scan line

Let's see how many memory accesses are required to get the data for the
three different 'planes' described above.

First, the background color requires zero memory accesses, as it is held
in the lower 4 bits of VDP register 7.

Next is the character data. There are 32 characters per scan line, and
each character in the scan line requires the following memory accesses
to retrieve the data required to generate the pixel data for that
character.

1. Read character number from Pattern Name Table (PNT)
2. Read character bitmap data from Pattern Generator Table (PGT)
3. Read character color info from Pattern Color Table (PCT)

As you can see, it takes three memory accesses for each character, and
so the total number of memory accesses required per scan line to build
up the character display plane is given by

Mchar = 32 characters/scan line X 3 mem accesses/character
      = 96 memory accesses per scan line for character plane

Finally, the sprite planes must be processed. The 9918A allows up to
four sprites (out of 32) to be displayed on a scan line, and sprite #0
has the highest priority - that is, it will be the frontmost.

To determine which sprite will be visible on any given scan line, the
Y-position of all 32 sprites must be read from the Sprite Attribute
Table (SAT) in VRAM and compared against the current scan line number.
When doing the compare, the Mag bit from VDP register 1 must be taken
into account, since the magnification is in both the x and y directions.

If the Y-location of the sprite is such that it is to be displayed on
this scan line, then the sprite number (0-31) is placed in one of four
temporary holding registers (SR0-SR3), if all four registers are not
already filled. SR0 fills first, SR3 fills last, and SR0 specifies the
frontmost sprite plane and SR3 specifies the rearmost sprite plane.
While the Y-locations of these active sprites may be saved inside the
VDP, I suspect they are not. Keeping these Y-locations would require 4
extra holding registers, which can be eliminated by refetching the
Y-locations later, albeit at the 'cost' of more memory accesses. Four
extra registers would increase the chip die size, whereas it is not
clear that the VDP user would even notice the 'cost' of these extra
memory cycles.
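
The sprite test described above can be modeled roughly as follows. This
is only my simplified sketch in Python, not the chip's actual logic: the
function name is made up, the `height` parameter is assumed to already
reflect the Size and Mag bits, a sprite is assumed to cover lines
y through y+height-1, and any chip-specific details not covered in this
post are ignored.

```python
MAX_SPRITES_PER_LINE = 4  # number of holding registers, SR0..SR3


def sprites_on_line(sprite_y, scan_line, height=8):
    """Return the sprite numbers latched for one scan line.

    sprite_y: list of 32 Y-locations, index = sprite number.
    height:   on-screen sprite height in lines (assumption: already
              adjusted for the Size and Mag bits).
    The returned list stands in for SR0..SR3: first entry is frontmost.
    """
    holding = []
    for number, y in enumerate(sprite_y):
        if y <= scan_line < y + height:
            holding.append(number)
            if len(holding) == MAX_SPRITES_PER_LINE:
                break  # all four holding registers are filled
    return holding


if __name__ == "__main__":
    # Worst case from the text: only sprites 28-31 touch this line,
    # so every one of the 32 Y-locations has to be examined.
    ys = [200] * 28 + [50] * 4
    print(sprites_on_line(ys, 52))
```

Lower-numbered sprites win the holding registers, which is why sprite #0
ends up frontmost.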

In the worst case, the first 28 sprites, 0-27, are not displayed on a
given scan line, but sprites 28-31 will be displayed. In this case, the
Y-locations of all 32 sprites may have to be read. For the purposes of
memory access calculations, we must assume that all 32 sprite
Y-locations will have to be read. Therefore, we define the number of
memory accesses required to test which sprites should be displayed on a
given scan line as

Msprite_test = 32 memory cycles (1 Y-location per sprite)

After it is determined which sprites need to be displayed, the data for
the four sprites (again, worst case) to be displayed must be fetched.
For each sprite, there are 4 bytes (Y-location, X-location, pattern
number, color/early clock) which need to be fetched from the Sprite
Attribute Table. When the Size bit in VDP register 1 is set to 1,
indicating double size sprites, two bytes of sprite pattern data must be
read from the Sprite Pattern Generator Table. Again, this is the worst
case. Therefore, six memory cycles (4 attribute bytes + 2 pattern
bytes) are required for each sprite which is to be displayed. So, we
now define the maximum number of memory cycles required to fetch the
data needed to display four sprites on a scan line as

Msprite_data = 4 sprites/scan line X 6 memory cycles/sprite
             = 24 memory cycles

Now, let's summarize the maximum total number of accesses required for
displaying four sprites on a scan line as

Msprite = total number of memory accesses per scan line for sprite display
        = test which sprites are on this line + sprite display data access
        = Msprite_test + Msprite_data
        =     32       +     24
        = 56 memory cycles

Whew! Finally, we can calculate the number of memory cycles used to
refresh one active scan line in the display. This is given by

Mdisplay = Mchar + Msprite
         =  96   +   56
         = 152

For those of you who I have not totally confused, the end is now in
sight. We are ready to compute the number of memory cycles available to
the CPU.

Drum roll, please!

Mcpu  = Mem accesses in one horizontal scan line - display mem accesses
      = Mhorz - Mdisplay
      =  171  -   152
      = 19 memory accesses available for the CPU
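
The whole per-line budget can be tallied in a few lines of Python. The
counts are the worst-case figures worked out above; the function and
parameter names are my own.

```python
M_HORZ = 171  # memory accesses available in one horizontal line


def display_accesses(chars_per_line=32, accesses_per_char=3,
                     sprites_in_table=32, sprites_per_line=4,
                     accesses_per_sprite=6):
    """Return (Mchar, Msprite) for one active scan line, worst case."""
    m_char = chars_per_line * accesses_per_char
    m_sprite_test = sprites_in_table  # one Y-location read per sprite
    m_sprite_data = sprites_per_line * accesses_per_sprite
    return m_char, m_sprite_test + m_sprite_data


def cpu_accesses(m_horz=M_HORZ):
    """Accesses left over for the CPU after display refresh."""
    m_char, m_sprite = display_accesses()
    return m_horz - (m_char + m_sprite)


if __name__ == "__main__":
    m_char, m_sprite = display_accesses()
    print("Mchar    =", m_char)
    print("Msprite  =", m_sprite)
    print("Mdisplay =", m_char + m_sprite)
    print("Mcpu     =", cpu_accesses())
```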

If the CPU can access the memory every 5.95 us, then the total number of
CPU accesses allowed in a horizontal line time is given by

Mcpu_horz = horizontal line time/memory access time
          = 63.695 us/5.95 us
          = 10.7 CPU memory accesses per horizontal scan line

If one rounds 10.7 up to 11, that would seem to indicate that there are
8 memory cycles (19-11=8) which are unused.

Perhaps those extra 8 cycles could have been used to allow a fifth
sprite on a line, since each sprite costs a total of seven memory cycles
(1 for y-test, then 6 more if displayed). However, that would only leave
one memory cycle to spare. Also, there are scheduling and synchronization
issues involved regarding the sprites, and it would have probably
required too much chip area to squeeze in that one extra sprite.

Or, maybe those 19 cycles could have been all allocated to CPU accesses.
However, remember the earlier statement that every 16th memory access
is allocated for the CPU. It is relatively simple (cheap) to decode this
CPU access slot from the horizontal counter inside of the VDP which is
used for overall horizontal timing. If, instead, we take the 19 cycles
and divide them into the horizontal line time, we get

Taccess_best = 63.695 us/19 memory cycles
             = 3.35 us between CPU data accesses

If memory cycles take 372.5 ns, then the CPU could have every ninth
memory cycle (3.35/0.3725). Since this is not an integral power of two,
a separate CPU access counter would be required and would take more chip
area (cost) than a simple decode of the lower four bits of the
horizontal counter.
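
To make the trade-off concrete, here is a small Python comparison of the
two slot schemes. It is only my illustration of the argument above (the
names are made up), and it says nothing about how the counter is
actually wired inside the chip.

```python
M_HORZ = 171      # memory accesses per horizontal line
M_CPU_SPARE = 19  # accesses not needed for display refresh


def is_power_of_two(n):
    """True if a period of n can be decoded from the low counter bits alone."""
    return n > 0 and (n & (n - 1)) == 0


def slots_per_line(period, m_horz=M_HORZ):
    """CPU slots per line if every period-th access goes to the CPU."""
    return m_horz // period


if __name__ == "__main__":
    ideal_period = M_HORZ // M_CPU_SPARE
    for period in (ideal_period, 16):
        print(period, is_power_of_two(period), slots_per_line(period))
```

A period of 9 would hand all the spare accesses to the CPU but needs its
own counter, while a period of 16 falls out of the lower four bits of
the horizontal counter and yields roughly the 10.7 accesses per line
computed earlier.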

To summarize, the sprites take up slightly more than 1/3 of the display
bandwidth. Unfortunately, the chip designers did not include a way to
turn off the sprites and thus allow 1 of four memory accesses to be
allocated to the CPU.

I hope this information will be useful or educational to someone out
there. Maybe you now have a better understanding of how the video
hardware in the Colecovision works.

Paul Urbanus
urb@urbonix.com


P.S. In anticipation of doing some work for the Colecovision, I built a
single-board computer (SBC) that used the TI processor. This SBC
attached to the expansion port of the Colecovision and used DMA (Direct
Memory Access) to access the hardware. Since we already had a debugger
written for the TI99/4A, I modified it for the SBC so we could learn how
to access the Colecovision hardware. HERE'S THE PERVERSE PART - since we
didn't know any Z-80 assembly, we wanted to examine some of the code in
Coleco games. So we wrote a symbolic Z80 disassembler IN TI9900
ASSEMBLY LANGUAGE. What were we thinking???


***                                                                ***
*                 Paul Urbanus    urb@urbonix.com                    *
*                                                                    *
*   Never wrestle with a hog - you get dirty and the hog likes it.   *
***                                                                ***