GPU


The Raspberry Pi Zero (and Zero 2) use the VideoCore IV GPU. This consists of dual Vector Processor Units (VPUs) and 12(!) Quad Processor Units (QPUs). All of these can be programmed using open-source toolchains, though documentation and compiler features are often limited. On both Pi Zero models the VPUs run at 400MHz and the QPUs at 300MHz (other Pis, except for the Pi 3B+, run them all at 250MHz; this is set by the "core_freq" setting - see the defaults table in the overclocking outlink section of the config.txt docs). The Pi 4 uses the VideoCore VI GPU, which, although similar, has some breaking changes outlink.

Code can be run on the GPU either by replacing the firmware image that is loaded at boot, or by using the poorly-documented mailbox.c functions, which interact with the official firmware running on the first VPU after the system has booted. The former option requires implementing an open-source bootloader in order to start the CPU and load an OS, and options here are limited. The latter option can be used from within Linux after boot, and allows running software on the QPUs and on the one VPU not used by the official firmware.

Details on running code on the VPU from Linux using the mailbox execute_code function are described in this forum thread outlink.
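As a rough illustration of that method, the sequence from Linux user space looks something like the sketch below. This assumes the mailbox.c/mailbox.h API from Broadcom's hello_fft example (mbox_open, mem_alloc, mem_lock, mapmem, execute_code, etc.); the 0xC allocation flag is the uncached "direct" bus alias used on the BCM2835, and error handling is omitted:

    #include <string.h>
    #include "mailbox.h"   /* mailbox API from Broadcom's hello_fft example (assumed) */

    #define MEM_FLG 0xC    /* uncached "direct" alias on BCM2835; later models differ */

    /* Copy an assembled VPU binary into GPU memory and run it on the spare VPU. */
    int run_vpu_binary(const unsigned char *bin, unsigned size)
    {
        int mb = mbox_open();                          /* opens the /dev/vcio mailbox */
        unsigned handle = mem_alloc(mb, size, 4096, MEM_FLG);
        unsigned bus = mem_lock(mb, handle);           /* GPU "bus" address of the buffer */
        void *virt = mapmem(bus & ~0xC0000000, size);  /* map the physical address into user space */

        memcpy(virt, bin, size);

        /* Blocks until the VPU program returns; the six values are passed in r0-r5. */
        unsigned ret = execute_code(mb, bus, 0, 0, 0, 0, 0, 0);

        unmapmem(virt, size);
        mem_unlock(mb, handle);
        mem_free(mb, handle);
        mbox_close(mb);
        return (int)ret;
    }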

Broadcom released documentation outlink for a similar VideoCore IV GPU design. However, unlike the one used in the Raspberry Pis, it doesn't have a VPU, so the information there only covers programming the QPUs, not the VPU outlink. There is an unofficial manual outlink covering both the QPUs and VPUs.

Examples of GPU software for the Raspberry Pi VideoCore IV:


  • blinker01 outlink is the simplest example from the reverse-engineering project by Herman Hermitage. It uses the project's tinyasm outlink assembler to generate binaries for the VPUs. It requires replacing the official GPU firmware and does not boot the CPU.
  • VC4ASM outlink is an assembler for the QPUs. The same author also wrote the vcio2 outlink kernel driver, which allows user-space communication with the QPUs without root privileges.
  • There are three open-source compiler toolchains for the VPUs: one based on GNU tools outlink (gcc, etc.), one based on LLVM outlink, and the vbcc C compiler outlink, which supports VideoCore as a target.
  • There are also some rough examples of compiling stuff for the VPUs here outlink.
  • GPU_FFT outlink is an early example of running data-processing code on the QPUs; it is run after boot using Broadcom's mailbox outlink routines.
  • vcpoke outlink is a very simple demonstration of running code on the 'spare' VPU processor, using Broadcom's mailbox.c routines.
  • VPU-Example outlink is based on vcpoke and uses the VASM outlink assembler to generate a VPU binary that toggles a GPIO pin. It includes a program (vcrun) to run a separate binary on the VPU, so the VPU code doesn't have to be integrated into the same executable as it is in vcpoke. vcrun limits the VPU code size to 1024 bytes, so this is presumably the binary size limit of the mailbox execute_code function.
  • vpu_cpuid outlink prints the ID of the VPU by running code on it using the mailbox execute_code function, via a separate mailbox outlink library. There is related software by the same author outlink.
  • RGBtoHDMI uses a VPU driver written in assembly outlink, which is assembled using VASM outlink.
  • This guide outlink describes using the QPUs to optimise data processing tasks.


There are some more related projects and examples listed under the videocore-iv tag outlink on GitHub.

Bare-minimum GPU usage - GPU Timer

For the PiTrex the main issue is the need for accurate timing during vector draw operations, in order to ensure correct positioning on the screen. When running Linux, this requires disabling the system 'tick' interrupts, which prevents operations such as talking to USB/Bluetooth audio/controller peripherals during draw operations. In both Linux and bare-metal, using more than one CPU core of the Pi Zero 2 has also proven to cause occasional interruptions to I/O operations.

One particularly time-critical operation is turning off the beam at a delayed time after the T1 timer ends, allowing for electrical delays that cause the beam to keep moving after #RAMP is brought low, and thereby ensuring that consecutively drawn lines appear joined on the screen. This T1 end delay must not be extended by Linux system interrupts, which can cause bright spots at the ends of lines.
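In outline the job is just a wait followed by a fixed settling delay, as in this illustration (the helper functions and the settling constant are hypothetical stand-ins for the real PiTrex VIA accesses, not the actual Vectrex Interface library API):

    #define RAMP_SETTLE_US 10              /* hypothetical settling time */

    extern int  vectrex_t1_expired(void);  /* hypothetical: has the T1 draw timer run out? */
    extern void delay_us(unsigned us);     /* hypothetical: fixed busy-wait */
    extern void vectrex_beam_off(void);    /* hypothetical: blank the beam */

    void t1_end_delay(void)
    {
        while (!vectrex_t1_expired())
            ;                              /* #RAMP is brought low when T1 expires */
        delay_us(RAMP_SETTLE_US);          /* the beam keeps moving briefly after #RAMP goes low */
        vectrex_beam_off();                /* if this runs late, a bright spot appears at the line end */
    }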

This single job is conveniently simple enough that it could be implemented in assembly for one of the VPUs, very similar to blinker01 outlink, except using the newer method of loading onto the spare VPU after boot with the execute_code function from mailbox.c, as in vcpoke, VPU-Example, and vpu_cpuid.

Detecting Linux system interrupts

See also Interrupt_Detection_Detail.

Beyond the T1 end delay, all write operations that occur between beam-zeroing operations must take place before electrical losses cause the beam to drift from its previous position. If execution between zeroing operations is delayed by Linux system interrupts, and optimisation is enabled so that lines are not all drawn from zero, the drift can cause line segments to be drawn in the wrong location until the next zeroing operation is encountered in the vector pipeline.

Previously this issue was prevented by disabling Linux system interrupts for the whole drawing operation. But with the VPU timer it's now possible to compare delay results between the VPU, the CPU, and the ARM System Timer, and therefore detect when the CPU's delay has been extended by a Linux system interrupt. When that is detected, the beam is zeroed before the next line drawing operation, instead of the line being drawn relative to the previous one.
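A simplified sketch of the detection idea, using just the ARM System Timer (a free-running 1MHz counter) to check whether a fixed CPU busy-wait took longer than its calibrated duration; the register mapping (via /dev/mem, not shown) and the calibrated constants are assumptions:

    #include <stdint.h>

    extern volatile uint32_t *st_clo;  /* mapped System Timer CLO register (1MHz) */

    #define CALIBRATED_US 50           /* loop duration measured on an idle system (hypothetical) */
    #define SLACK_US       5

    /* Returns non-zero if the fixed busy-wait was stretched by a Linux
     * interrupt, meaning the beam should be zeroed before the next line. */
    int delay_was_interrupted(void)
    {
        uint32_t start = *st_clo;
        for (volatile int i = 0; i < 10000; i++)
            ;                          /* fixed-length CPU delay */
        return (*st_clo - start) > CALIBRATED_US + SLACK_US;
    }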

This means that Linux system interrupts needn't be disabled while line drawing operations are in progress, which frees up the CPU for other tasks (running MAME emulation, talking to Bluetooth devices, etc.).

Further GPU usage

See also GPU_Display_Processing.

Wait states during read and write operations also chew up CPU time and could be avoided by offloading these tasks entirely to the VPU, so that the CPU doesn't talk to the PiTrex directly at all.

Furthermore, the entire vector pipeline could potentially be transferred to, and processed by, the VPU instead of the CPU. This requires that the Vectrex Interface library be adapted for building with one of the VPU compilers, provided that all the required capabilities exist.

This can be done either by loading code onto the second VPU core after booting, or by modifying an open-source Linux bootloader VPU binary (probably this outlink, or this more active repo outlink). The difficulty with the latter is that, besides booting Linux, the closed-source GPU firmware also initialises the peripheral devices, and little of this has been implemented in the open-source firmware, so WiFi and Bluetooth probably wouldn't be usable.

VPU development notes


The VPU clock speed varies with CPU frequency scaling, so to achieve accurate timing, CPU frequency throttling either needs to be disabled (the "performance" CPU governor in Linux) or core_freq_min needs to be set to 400 in config.txt.
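Concretely, that's either of the following (the governor is set per core; cpu0 shown):

    # config.txt: pin the core clock at the Pi Zero's default 400MHz
    core_freq_min=400

    # or, from Linux, disable CPU frequency throttling:
    echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor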

The latest version of the VASM outlink assembler (1.9a) doesn't seem to work, or has changed the instruction mnemonics that it recognises (without documenting the change). The Raspberry Pi binary of the assembler supplied with the VPU-Example outlink code (VASM version 1.8k) works fine.

If a binary is run on the VPU using the mailbox.c execute_code function, a new one can't be loaded until the previous program returns. If the VPU program runs in an infinite loop, the Raspberry Pi has to be physically powered off to clear the VPU and allow a new binary to be loaded; the VPU program continues running even after a "sudo reboot" or "sudo poweroff".

Memory is mapped differently on the CPU and the GPU, which have separate MMUs. The GPU addresses are sometimes called "bus addresses", and the CPU addresses "physical addresses". The top bits of a GPU address determine how caching is used, if at all; see the address map in the BCM2835 ARM Peripherals document, PDF page 5, for details. The VPU mailbox interface uses GPU "bus address" values, so these must be converted to/from physical addresses for use in software running on the CPU. Both the GPU and the CPU can access all RAM and registers, however the GPU reserves a configurable memory area at start-up which isn't allocated to virtual address space by the CPU (if the latter is running Linux).
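A sketch of the conversion, using the four bus-address aliases from that address map (the top two bits of a GPU address select the cache behaviour):

    #include <stdint.h>

    #define GPU_ALIAS_L1L2   0x00000000u  /* cached via L1 and L2 */
    #define GPU_ALIAS_L2COH  0x40000000u  /* L2 coherent (non-allocating) */
    #define GPU_ALIAS_L2     0x80000000u  /* L2 cached only */
    #define GPU_ALIAS_DIRECT 0xC0000000u  /* uncached, direct to SDRAM */

    /* Mask off the alias bits to get the ARM physical address, or set them
     * to form a bus address for the mailbox interface. */
    static inline uint32_t bus_to_phys(uint32_t bus)  { return bus & ~0xC0000000u; }
    static inline uint32_t phys_to_bus(uint32_t phys) { return phys | GPU_ALIAS_DIRECT; }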

Register documentation is patchy at best for anything not covered by the BCM2835 ARM Peripherals document. Herman Hermitage has two outlink lists outlink. There's also this outlink platform.h file from one of the bare-metal environments (others probably have similar), and this outlink collection of header files from the rpi-open-firmware project. The most comprehensive might be rpi-internal-registers-online outlink, which is auto-generated from public Broadcom documents.

Initialisation

There seems to be some initialisation required for running code generated by the C compiler toolchains on the VPU. Unlike binaries assembled by Herman Hermitage's TinyAsm assembler or by the VASM assembler, the binaries produced by the vc4-toolchain don't run on their own. They don't work as a replacement "bootcode.bin", nor when loaded via the mailbox routines from Linux (in fact there they tend to cause a kernel panic - or is that because they're too big?). Instead you have to run the open-source GPU firmware outlink and upload code over a serial link. As noted above, this alternative firmware isn't suitably full-featured for the PiTrex software.

The vc4-toolchain TODO file mentions "Add SDRAM/PLL configuration for running on hardware" (for its Newlib library). Is this what is needed to make the code run directly?

The LLVM-VideoCore4 readme has an (incomplete?) initialisation routine example in VASM assembly at the end, which it claims is needed to generate a bootcode.bin.

VASM produces working binaries, so presumably VBCC does too? Or is there more to this? These outlink are some sort-of examples using VBCC (but with a custom assembler?).

Maybe this murkiness about how to actually run the binaries is why everyone seems to prefer writing VPU code in assembly, even though there are three C compilers for it? Or is there something more fundamentally wrong with the C compilers? It's all a bit mysterious.

QPUs

The QPUs are designed for particular data processing tasks. Code can be run on them via the Mailbox interface, as with the VPU. Like the VPU, they run slower than the CPU, but they are designed to be used in parallel for batch operations, which are further optimised by using their vector instructions.
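Again assuming the mailbox.c API from hello_fft, launching QPU programs looks roughly like this; 'control' is the bus address of a list of (uniforms address, code address) pairs, one pair per QPU:

    #include "mailbox.h"   /* mailbox API from Broadcom's hello_fft example (assumed) */

    unsigned run_qpu_jobs(int mb, unsigned num_qpus, unsigned control_bus)
    {
        qpu_enable(mb, 1);  /* power up the QPUs */
        /* Blocks until all QPUs finish or the timeout (in ms) expires;
         * noflush=1 skips flushing the GPU's L2 cache first. */
        unsigned ret = execute_qpu(mb, num_qpus, control_bus, 1, 5000);
        qpu_enable(mb, 0);
        return ret;
    }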

One application for them could be to sort the vector pipeline prior to display, to minimise the travel distance between points (a TSP outlink solver). An uneducated look into this suggests that they are not that well suited to the task, but there may be some clever maths that allows them to be used better than is immediately apparent.

Another application could be for "framebuffer vectorisation", where they read the actual video framebuffer and vectorise the image, generating a vector pipeline themselves to be displayed by the VPU. This would eliminate the CPU performance penalty from a framebuffer vectorisation approach, if the QPUs are in fact able to perform the task efficiently enough that they can keep up with the Vectrex frame rate.

For now these QPU applications have not been explored.