Legend
Symbol Description
orange box Core: an execution unit that has up to 8 associated lanes for executing program code against data. It has a physical ID like 0, 1, 2, 3... etc., and each lane associated with it also has a physical ID relative to the core itself (so there will be as many lane 0s as there are cores). This toy example GPU has one core with four lanes.[1]
outlined box with the "flag" on top (in SVG vis view); > (in textarea) Program Counter (PC): a pointer into the program text associated with a core that indicates the next operation that will progress. For the purposes of our model, each core only has one PC (per program[2]). That choice is more or less what defines the SIMT model of computation: a core will dispatch as many operations as it has (active) lanes at the same time, but those operations will complete independently[3] as the computation proceeds (a toy sketch of this dispatch model appears below, after the memory-controller row).
Not Pictured a sense of overall progress against the entire logical space (i.e. all {group id, work id, ...} coordinates); this leaves that part of the "work mapping" entirely up to the reader, unless their problem space is sized 1:1 with a single core.
%x an operation that will produce results with ID %x; these results will have physical {core, lane} coordinates as well as logical {group id, work id, ...} coordinates.
Not Pictured the "type" of the operation (more specifically: the set of architectural hazards that may delay completion relative to dispatch), like "memory" or "not-memory"
st ret an operation that will not produce directly identified results, such as `OpStore` (which stores to memory) or `OpReturn` (which signals the exit of a program invocation).
dashed outlined box a dispatched operation that will produce a single result. The operation "belongs" to a {core, lane} pair.
solid outlined box a completed operation that has produced a result.
a result that is ready to be produced; i.e. all of its dependencies are available. When enough results are ready, the core will execute LANE_WIDTH operations to complete them in parallel.
•• ... (in textarea); •• (in SVG vis view) two results that were computed simultaneously.
•• two results that were computed sequentially (here, one tick apart).
Not Pictured This toy GPU's memory controller that only lets one operation through per tick (per core, probably?) but has infinite bandwidth per operation
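Pulling the rows above together: a minimal, deliberately simplified sketch of this dispatch model as plain host-side code in a CUDA source file. Every name here (`Instr`, `Core`, `LANE_WIDTH`, `tick`) is invented for illustration and is not Talvos's own; in particular, the real model tracks per-result readiness rather than a single done flag per lane.

```cuda
// Toy sketch only; none of this is Talvos code.
#define LANE_WIDTH 4
#define PROG_LEN   8

struct Instr {
  bool is_memory;             // the op's "type": memory or not-memory
};

struct Core {
  int  pc;                    // ONE program counter per core (SIMT)
  bool active[LANE_WIDTH];    // which lanes participate at this PC
  bool done[LANE_WIDTH];      // per-lane completion is independent
};

// One tick: dispatch the instruction at the shared PC on every active
// lane, but let the memory controller admit only one memory op per tick.
void tick(Core *c, const Instr prog[PROG_LEN]) {
  const Instr *op = &prog[c->pc];
  bool mem_token = true;      // the one-op-per-tick memory controller
  for (int lane = 0; lane < LANE_WIDTH; lane++) {
    if (!c->active[lane] || c->done[lane]) continue;
    if (op->is_memory) {
      if (!mem_token) continue;   // this lane stalls until a later tick
      mem_token = false;
    }
    c->done[lane] = true;         // this lane's result is visible now
  }
  bool all_done = true;
  for (int lane = 0; lane < LANE_WIDTH; lane++)
    if (c->active[lane] && !c->done[lane]) all_done = false;
  if (all_done && c->pc + 1 < PROG_LEN) {
    c->pc++;                      // advance the shared PC only once the
    for (int lane = 0; lane < LANE_WIDTH; lane++)  // whole wavefront
      c->done[lane] = false;                       // has completed
  }
}
```

Note how a non-memory op completes on all active lanes in one tick, while a memory op drains one lane per tick: the shared PC stalls, but each lane's result lands independently.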
Buffer 'a' (16 bytes):
  .[0] = 3
  .[1] = 4
  .[2] = 5
  .[3] = 6
A view into the GPU's memory for the 16 bytes associated with buffer "a" by the `OpBufferTALVOS` metadata opcode, formatted as an array of 4 elements (ideally: with the help of the associated SPIR-V type; currently: that's just what you get). Each element is identified by its index (offset), such as `.[1] = ...`, which indicates the value of the four bytes at element offset 1 (i.e. four bytes into the memory range) interpreted as an unsigned 32-bit integer; here, the element with index 1 has value 4. The view also tracks the most recently held value, and displays that previous value when the memory changed in the most recent interaction (either a tick or a step), as seen here in elements 2 and 3. (A sketch of this formatting appears below.)
Not Pictured The type metadata, also associated by way of the `OpBufferTALVOS` opcode, that indicates the layout of each element of a Buffer view
Not Pictured Memory safety, and especially how it and parallel scalability constrain each other (i.e. the `if (i < N)` bit in most CUDA examples; see the kernel sketch below).
Not Pictured Tracking uninitialized memory, visualizing incorrect access bounds.
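To make the `.[index] = value` formatting concrete, here is a hedged sketch of rendering those 16 bytes as four unsigned 32-bit elements. `print_buffer_view` and the `(was N)` suffix are invented stand-ins for however the real view marks an element whose memory changed in the last tick/step.

```cuda
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: render a 16-byte buffer as four u32 elements,
// flagging any element that changed since the previous tick/step.
void print_buffer_view(const char *name,
                       const uint8_t now[16], const uint8_t before[16]) {
  printf("Buffer '%s' (16 bytes):\n", name);
  for (int i = 0; i < 4; i++) {
    uint32_t cur, old;
    memcpy(&cur, now + 4 * i, 4);      // element i = bytes [4i, 4i+4)
    memcpy(&old, before + 4 * i, 4);
    if (cur != old)
      printf("  .[%d] = %u (was %u)\n", i, (unsigned)cur, (unsigned)old);
    else
      printf("  .[%d] = %u\n", i, (unsigned)cur);
  }
}
```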
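And the bounds guard the memory-safety row alludes to, in its usual CUDA form (ordinary CUDA here, not Talvos code): the grid is rounded up to whole blocks, so the threads mapped past the end of the data must check their global index before touching memory.

```cuda
__global__ void scale(float *a, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // logical work id
  if (i < N)            // lanes mapped past the end of 'a' do nothing
    a[i] *= 2.0f;
}
// launch: scale<<<(N + 255) / 256, 256>>>(a, N);
```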
(the whole textarea) a program in SPIR-V (with Talvos-specific extensions) that will be interpreted by the virtual GPU
%x = OpXyz %1 %2 %bar
A SPIR-V operation (in text form) that will produce a result with ID %x by `Xyz`ing its arguments %1, %2, and %bar.
Not Pictured the whole SPIR-V spec[4] (plus associated references such as the Vulkan API[5]) which describes what an OpXyz does, or why it needs to take arguments.
Footnotes
  1. Strictly speaking, the GPU model has four cores with eight lanes each, but the view only currently supports a single core with four lanes.
  2. The many-live-programs-per-core idea will come up again in "parallelism" as a way the GPU implements something akin to "hyper-hyperthreading" to hide memory latencies.
  3. Compare with the SIMD model found in almost every modern CPU: a SIMT operation may have some of its effects visible sooner than others, even if the overall core's pipeline is stalled waiting on a portion of the result set. That, together with explicit static memory tiering (as opposed to implicit dynamic cache coherence) and very deep pipelining (not yet in scope for this project), makes three key details to keep in mind while thinking about adapting problems to fit a GPU computational model.
  4. https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html
  5. e.g. https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/GlobalInvocationId.html