Legend
Symbol Description
orange box Core: an execution unit that has up to 8 associated lanes for executing program code against data. It has a physical ID like 0, 1, 2, 3... etc., and each lane associated with it also has a physical ID relative to the core itself (so there will be as many lane 0s as there are cores). This toy example GPU has one core with four lanes.[1]
outlined box with the "flag" on top (in SVG vis view); > (in textarea) Program Counter (PC): a pointer into the program text associated with a core that indicates the next operation that will progress. For the purposes of our model, each core only has one PC (per program[2]). That choice is more or less what defines the SIMT model of computation: a core will dispatch as many operations as it has (active) lanes at the same time, but those operations will complete independently[3] as the computation proceeds (a toy sketch of this dispatch model appears below, after the memory-controller row).
Not Pictured a sense of overall progress against the entire logical space (i.e. all {group id, work id, ...} coordinates); this leaves that part of the "work mapping" entirely up to the reader, unless their problem space is sized 1:1 with a single core.
%x an operation that will produce results with ID %x; these results will have physical {core, lane} coordinates as well as logical {group id, work id, ...} coordinates.
Not Pictured the "type" of the operation (more specifically: the set of architectural hazards that may delay completion relative to dispatch), like "memory" or "not-memory"
st ret an operation that will not produce directly identified results, such as `OpStore` (which stores to memory) or `OpReturn` (which signals the exit of a program invocation).
dashed outlined box a dispatched operation that will produce a single result. The operation "belongs" to a {core, lane} pair.
solid outlined box a completed operation that has produced a result.
a result that is ready to be produced; i.e. all of its dependencies are available. When enough results are ready, the core will execute LANE_WIDTH operations to complete them in parallel.
•• ... (in textarea); •• (in SVG vis view) two results that were computed simultaneously.
•• two results that were computed sequentially (here, one tick apart).
Not Pictured This toy GPU's memory controller that only lets one operation through per tick (per core, probably?) but has infinite bandwidth per operation
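Pulling the rows above together: a minimal, deliberately simplified sketch of this dispatch model as plain host-side code in a CUDA source file. Every name here (`Instr`, `Core`, `LANE_WIDTH`, `tick`) is invented for illustration and is not Talvos's own; in particular, the real model tracks per-result readiness rather than a single done flag per lane.

```cuda
// Toy sketch only; none of this is Talvos code.
#define LANE_WIDTH 4
#define PROG_LEN   8

struct Instr {
  bool is_memory;             // the op's "type": memory or not-memory
};

struct Core {
  int  pc;                    // ONE program counter per core (SIMT)
  bool active[LANE_WIDTH];    // which lanes participate at this PC
  bool done[LANE_WIDTH];      // per-lane completion is independent
};

// One tick: dispatch the instruction at the shared PC on every active
// lane, but let the memory controller admit only one memory op per tick.
void tick(Core *c, const Instr prog[PROG_LEN]) {
  const Instr *op = &prog[c->pc];
  bool mem_token = true;      // the one-op-per-tick memory controller
  for (int lane = 0; lane < LANE_WIDTH; lane++) {
    if (!c->active[lane] || c->done[lane]) continue;
    if (op->is_memory) {
      if (!mem_token) continue;   // this lane stalls until a later tick
      mem_token = false;
    }
    c->done[lane] = true;         // this lane's result is visible now
  }
  bool all_done = true;
  for (int lane = 0; lane < LANE_WIDTH; lane++)
    if (c->active[lane] && !c->done[lane]) all_done = false;
  if (all_done && c->pc + 1 < PROG_LEN) {
    c->pc++;                      // advance the shared PC only once the
    for (int lane = 0; lane < LANE_WIDTH; lane++)  // whole wavefront
      c->done[lane] = false;                       // has completed
  }
}
```

Note how a non-memory op completes on all active lanes in one tick, while a memory op drains one lane per tick: the shared PC stalls, but each lane's result lands independently.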
Buffer 'a' (16 bytes):
  .[0] = 3
  .[1] = 4
  .[2] = 5
  .[3] = 6
A view into the GPU's memory for the 16 bytes associated with buffer "a" by the `OpBufferTALVOS` metadata opcode, formatted as an array of 4 elements (ideally: with the help of the associated SPIR-V type; currently: that's just what you get). Each element is identified by its index (offset), such as `.[1] = ...`, which indicates the value of the four bytes at element offset 1 (i.e. four bytes into the memory range) interpreted as an unsigned 32-bit integer; here, the element with index 1 has value 4. The view also tracks the most recently held value, and displays that previous value when the memory changed in the most recent interaction (either a tick or a step), as seen here in elements 2 and 3. (A sketch of this formatting appears below.)
Not Pictured The type metadata, also associated by way of the `OpBufferTALVOS` opcode, that indicates the layout of each element of a Buffer view
Not Pictured Memory safety, and especially how it and parallel scalability constrain each other (i.e. the `if (i < N)` bit in most CUDA examples; see the kernel sketch below).
Not Pictured Tracking uninitialized memory, visualizing incorrect access bounds.
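To make the `.[index] = value` formatting concrete, here is a hedged sketch of rendering those 16 bytes as four unsigned 32-bit elements. `print_buffer_view` and the `(was N)` suffix are invented stand-ins for however the real view marks an element whose memory changed in the last tick/step.

```cuda
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: render a 16-byte buffer as four u32 elements,
// flagging any element that changed since the previous tick/step.
void print_buffer_view(const char *name,
                       const uint8_t now[16], const uint8_t before[16]) {
  printf("Buffer '%s' (16 bytes):\n", name);
  for (int i = 0; i < 4; i++) {
    uint32_t cur, old;
    memcpy(&cur, now + 4 * i, 4);      // element i = bytes [4i, 4i+4)
    memcpy(&old, before + 4 * i, 4);
    if (cur != old)
      printf("  .[%d] = %u (was %u)\n", i, (unsigned)cur, (unsigned)old);
    else
      printf("  .[%d] = %u\n", i, (unsigned)cur);
  }
}
```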
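And the bounds guard the memory-safety row alludes to, in its usual CUDA form (ordinary CUDA here, not Talvos code): the grid is rounded up to whole blocks, so the threads mapped past the end of the data must check their global index before touching memory.

```cuda
__global__ void scale(float *a, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // logical work id
  if (i < N)            // lanes mapped past the end of 'a' do nothing
    a[i] *= 2.0f;
}
// launch: scale<<<(N + 255) / 256, 256>>>(a, N);
```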
(the whole textarea) a program in SPIR-V (with Talvos-specific extensions) that will be interpreted by the virtual GPU
%x = OpXyz %1 %2 %bar
A SPIR-V operation (in text form) that will produce a result with ID %x by `Xyz`ing its arguments %1, %2, and %bar.
Not Pictured the whole SPIR-V spec[4] (plus associated references such as the Vulkan API[5]) which describes what an OpXyz does, or why it needs to take arguments.
Footnotes
  1. Strictly speaking, the GPU model has four cores with eight lanes each, but the view only currently supports a single core with four lanes.
  2. The many-live-programs-per-core idea will come up again in "parallelism" as a way the GPU implements something akin to "hyper-hyperthreading" to hide memory latencies.
  3. Compare with the SIMD model found in almost every modern CPU: a SIMT operation may have some of its effects visible sooner than others, even if the overall core's pipeline is stalled waiting on a portion of the result set. That, together with explicit static memory tiering (as opposed to implicit dynamic cache coherence) and very deep pipelining (not yet in scope for this project), makes three key details to keep in mind while thinking about adapting problems to fit a GPU computational model.
  4. https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html
  5. e.g. https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/GlobalInvocationId.html