GPU Display Processing


An unfinished implementation of this is now in the Git repo.

Custom code can be run on the spare VPU core in the GPU. It could read the vector pipeline data prepared by the CPU and display the contents on the Vectrex screen, free from the interrupts and multi-core RAM access locks that affect the CPU. The VPU could also redraw a previous frame if the CPU hasn't been able to prepare a new one in time for the next display update.

CPU-VPU Communication

The problem of exchanging data between the CPU and VPU is the same one faced by the official GPU driver for HDMI/composite video. That uses the mailbox interface for exchanging messages, and a configurable block of RAM for storing frame data. Apparently MAILBOX1 and MAILBOX2 are unused, so these could be used without conflicts. The RGBtoHDMI code suggests that the mailbox interface uses the register space described in the unofficial register map as the "Multicore Sync Block" (VPU address range 0x7e000000 - 0x7e000fff) for data exchange with the CPU.
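As a rough sketch of how the CPU side might reach those registers from Linux user space: VPU bus addresses in the 0x7exxxxxx peripheral range appear at ARM physical base 0x20000000 on the BCM2835-based Pi Zero, so the block can be mapped through /dev/mem. The base addresses and the mapping approach here are assumptions for illustration, not tested PiTrex code.

```c
/* Sketch: mapping the "Multicore Sync Block" registers from Linux user
 * space via /dev/mem.  Assumes a BCM2835-based Pi Zero, where VPU bus
 * addresses 0x7exxxxxx appear at ARM physical base 0x20000000. */
#include <stdint.h>
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define ARM_PERIPH_BASE 0x20000000u  /* BCM2835 (Pi Zero); differs on later Pis */
#define VPU_BUS_BASE    0x7e000000u

/* Translate a VPU bus address in the peripheral range to its ARM physical address. */
uint32_t bus_to_arm(uint32_t bus_addr)
{
    return (bus_addr - VPU_BUS_BASE) + ARM_PERIPH_BASE;
}

/* Map the 4KB Multicore Sync Block; returns NULL on failure (needs root). */
volatile uint32_t *map_sync_block(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, (off_t)bus_to_arm(VPU_BUS_BASE));
    close(fd);
    return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
}
```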

The standard GPU firmware uses a configurable area of RAM to store framebuffer data, and something similar is required for the vector pipeline data. Although the VPU can access all the RAM visible to the CPU, under Linux the virtual address range spanned by the pipeline data is unlikely to be allocated contiguously in physical memory, so even if the physical address of the starting point can be looked up, the rest of the pipeline data doesn't necessarily follow that address in physical memory. Note that this shouldn't apply in bare-metal, where the CPU code can use physical address mapping directly.
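The look-up mentioned above can be done through /proc/self/pagemap, which exposes one 64-bit entry per virtual page (PFN in bits 0-54, "page present" in bit 63; unprivileged readers see the PFN zeroed but the flags intact). A minimal sketch, not part of the PiTrex code:

```c
/* Sketch: inspecting a virtual page's pagemap entry on Linux.  With
 * root, page_pfn() gives the physical frame number; without root it
 * reads as zero, but page_present() still works. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Returns the raw /proc/self/pagemap entry for vaddr, or 0 on failure. */
uint64_t pagemap_entry(const void *vaddr)
{
    FILE *f = fopen("/proc/self/pagemap", "rb");
    if (!f)
        return 0;
    long page = sysconf(_SC_PAGESIZE);
    uint64_t entry = 0;
    if (fseek(f, (long)(((uintptr_t)vaddr / page) * 8), SEEK_SET) == 0)
        fread(&entry, sizeof entry, 1, f);
    fclose(f);
    return entry;
}

int page_present(const void *vaddr) { return (int)((pagemap_entry(vaddr) >> 63) & 1); }
uint64_t page_pfn(const void *vaddr) { return pagemap_entry(vaddr) & ((1ULL << 55) - 1); }
```

Even with the physical frame in hand, the following pages can land anywhere, which is exactly why the data has to be copied somewhere physically contiguous.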

Under Linux, this means the pipeline data must be copied to a separate memory location. The obvious option is the framebuffer RAM reserved by the GPU, although then we need to make sure that no display drivers write to it, and that the GPU firmware (still running on the other VPU core) doesn't read it. There are some notes and links to examples in this discussion about writing to the framebuffer in bare-metal, although I'm not entirely sure how best to do it in Linux (can the data just be written to /dev/fb0 as if it were a bitmap, with the video output disabled? Or would the framebuffer driver interfere somewhere?).
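If plain writes to /dev/fb0 do turn out to work, the mechanics would be nothing more than seek-and-write; the open question flagged above (whether the framebuffer driver interferes) remains. The device path is a parameter here so the same routine can be exercised against an ordinary file:

```c
/* Sketch: pushing a block of pipeline data into the framebuffer device
 * with plain lseek()/write(), treating /dev/fb0 as a flat byte array.
 * Untested against the real fb driver; purely illustrative. */
#include <stddef.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes of data to dev_path at byte offset; returns 0 on success. */
int fb_write(const char *dev_path, off_t offset, const void *data, size_t len)
{
    int fd = open(dev_path, O_WRONLY);
    if (fd < 0)
        return -1;
    int rc = -1;
    if (lseek(fd, offset, SEEK_SET) == offset &&
        write(fd, data, len) == (ssize_t)len)
        rc = 0;
    close(fd);
    return rc;
}
```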

Besides the RAM reserved for the framebuffer, it turns out that the GPU firmware includes a mailbox instruction for allocating other contiguous memory in the GPU's memory area, using the Allocate Memory mailbox property tag. There's an example of using this to store data from the CPU that's read out by the DMA peripheral, so the same approach should work here, just with the VPU reading the data instead of the DMA controller. The base address of the allocated memory region can be written to a register in the "Multicore Sync Block", where the other VPU core running the PiTrex code can read it. The code from that discussion was used in the fbcp-ili9341 project (a high-framerate driver for SPI LCD displays).
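The Allocate Memory tag (0x0003000c) takes a size, alignment, and flags, and returns a handle, which then has to be locked with the Lock Memory tag (0x0003000d) to get a usable bus address. Below is a sketch of just the property message layout; on Linux it would be sent to the firmware with an ioctl on /dev/vcio, as in the fbcp-ili9341 code. The MEM_FLAG_DIRECT value is the uncached-alias flag from the userland headers:

```c
/* Sketch: building a mailbox property message for the "Allocate Memory"
 * tag.  Only the message layout is shown; sending it via /dev/vcio and
 * locking the returned handle are separate steps. */
#include <stdint.h>

#define MBOX_TAG_ALLOCATE_MEMORY 0x0003000cu
#define MEM_FLAG_DIRECT          0x00000004u  /* uncached 0xCxxxxxxx alias */

/* Fill msg (9 words) with an Allocate Memory request; returns word count. */
unsigned mbox_alloc_msg(uint32_t msg[9], uint32_t size, uint32_t align, uint32_t flags)
{
    msg[0] = 9 * sizeof(uint32_t);          /* total message size in bytes */
    msg[1] = 0;                             /* process-request code */
    msg[2] = MBOX_TAG_ALLOCATE_MEMORY;
    msg[3] = 12;                            /* value buffer size in bytes */
    msg[4] = 0;                             /* tag request code */
    msg[5] = size;                          /* replaced by the handle in the response */
    msg[6] = align;
    msg[7] = flags;
    msg[8] = 0;                             /* end tag */
    return 9;
}
```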

The total memory reserved by the GPU is set in config.txt with the gpu_mem option, which defaults to 64MB on the Pi Zeros. The amount of this memory used by the framebuffer (if video output is enabled) depends on the display resolution.
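For reference, the relevant config.txt fragment is a single key=value line (64MB shown here as it is the stated Pi Zero default):

```
# /boot/config.txt
# Total RAM reserved for the GPU, in megabytes (Pi Zero default)
gpu_mem=64
```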

Comms Initialisation

Initial loading of the VPU executable, and allocation of the GPU memory for the vector display pipeline (and associated parameters), is implemented in a separate loader program, which should be run after a cold boot. The starting addresses of these memory areas are stored in /tmp/pitrex_gpu_mem, which also serves as a flag showing that the VPU executable was loaded successfully. The Vectrex Interface Library reads this file to find where it should copy the pipeline data to be read by the VPU. This avoids the need to allocate/free the GPU memory and reload the VPU executable with each PiTrex program start/end.
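The library side of that hand-over could look something like the sketch below. The file format (two hex addresses on one line) is a guess for illustration; the actual /tmp/pitrex_gpu_mem format may differ:

```c
/* Sketch: reading the GPU memory base addresses left by the loader.
 * The two-hex-words file format here is hypothetical. */
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* Returns 1 and fills both addresses if the file exists and parses;
 * returns 0 if not, meaning the loader hasn't run since cold boot. */
int read_gpu_mem_info(const char *path, uint32_t *pipeline_base, uint32_t *param_base)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;
    int ok = fscanf(f, "%" SCNx32 " %" SCNx32, pipeline_base, param_base) == 2;
    fclose(f);
    return ok;
}
```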

Non-Display Data

Ideally the VPU should handle all GPIO access, at least after initialisation; otherwise, to avoid bus conflicts, the CPU has to wait for it to finish processing pipeline data before it can do things like read the controller inputs or play sounds (via the PSG or DAC). Failing that, decent performance should still be achievable by concentrating all CPU I/O operations in the window between the CPU being notified that the VPU has finished displaying the last frame of display data, and the start of the next frame's display.
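That scheduling could be sketched around a shared "frame done" word, which in practice would be a register in the Multicore Sync Block set by the VPU; the flag name and the split into two helpers are illustrative only:

```c
/* Sketch: the CPU batches its GPIO work into the window between the
 * VPU's frame-done signal and the start of the next frame. */
#include <stdint.h>

/* Spin until the VPU marks the current frame complete. */
void wait_frame_done(volatile uint32_t *frame_done)
{
    while (!*frame_done)
        ;  /* the VPU would set this after drawing the last vector */
}

/* Clear the flag, handing the GPIO bus back to the VPU for the next frame. */
void ack_frame(volatile uint32_t *frame_done)
{
    *frame_done = 0;
}

/* The CPU's main loop would then be:
 *     wait_frame_done(&flag);
 *     ...read controller inputs, write PSG/DAC sound data...
 *     ack_frame(&flag);
 */
```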

CPU Implementation

For testing in a more friendly debugging environment, a program performing the same functions as the VPU display processor binary can be made to run on the CPU. This should share as much code as possible with the VPU program, and be built with a similar compiler.

If other single-board-computer platforms allow individual CPU cores to run without cache wait states, then this might also be useful for optimising performance on them without needing to use their GPU.