orca/vector_renderer_notes.txt

Various notes/thoughts about the 2D vector graphics renderer

Triangle Rasterization
----------------------
https://fgiesen.wordpress.com/2013/02/08/triangle-rasterization-in-practice/

https://github.com/rygorous/trirast/blob/master/main.cpp

https://joshbeam.com/articles/triangle_rasterization/

https://nlguillemot.wordpress.com/2016/07/10/rasterizer-notes/

https://web.archive.org/web/20120625103536/http://devmaster.net/forums/topic/1145-advanced-rasterization/

Bindless textures
-----------------
It's tempting to use bindless textures to be able to draw individual images using only one draw call. This would avoid much of the complexity of either managing a texture atlas under the hood or breaking the draw list into batches...
But it's only an extension and seems not to be supported everywhere. Moreover, there might be a problem where the texture handle used by the shader cannot differ between batches (it must be "dynamically uniform"), which defeats the purpose in our case -> it requires OES_gpu_shader5 or GLES 3.2

Ideally, the atlas should be built on top of lower-level image features of the renderer, e.g. mg_image_upload_sub_region(), mg_image_render_sub_region(), etc.

This would mean individual textures can be set and used in a frame. So without bindless textures, we would need to break the draw list into batches, starting a new batch each time the texture attribute changes. This also means we need to blend each batch result onto the previous one.
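
A minimal sketch of that batching, assuming a hypothetical draw_cmd record that carries a texture id. None of these names come from the renderer; it only shows where the draw list would be cut.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical draw command: some vertices plus the texture they sample.
typedef struct draw_cmd
{
	unsigned int texture; // hypothetical texture id
	size_t       vertexCount;
} draw_cmd;

// Hypothetical batch: a run of consecutive commands sharing one texture.
typedef struct draw_batch
{
	unsigned int texture;
	size_t       firstCmd;
	size_t       cmdCount;
} draw_batch;

// Splits cmds into runs of equal texture, keeping submission order.
// Writes at most maxBatches entries to out and returns the number of
// batches needed (which may exceed maxBatches).
size_t split_batches(const draw_cmd* cmds, size_t cmdCount, draw_batch* out, size_t maxBatches)
{
	assert(cmds != NULL || cmdCount == 0);
	size_t batchCount = 0;
	for(size_t i = 0; i < cmdCount; i++)
	{
		if(i == 0 || cmds[i].texture != cmds[i - 1].texture)
		{
			// texture attribute changed: start a new batch
			if(batchCount < maxBatches)
			{
				out[batchCount] = (draw_batch){cmds[i].texture, i, 0};
			}
			batchCount++;
		}
		if(batchCount <= maxBatches)
		{
			out[batchCount - 1].cmdCount++;
		}
	}
	return batchCount;
}
```

Each batch would then be one draw call with its texture bound. Batches can't be sorted by texture to merge them, since blending depends on submission order.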

 - It seems possible to implement bindless texture in metal using argument buffers
 - We could investigate if angle/our targets likely support OES_gpu_shader5
 - But, this means the canvas renderer relies on the backend to provide this kind of feature
 - It also assumes the upper bound for indexable bindless textures is enough on every backend
 - We'll likely need a batching fallback anyway?


-> Angle doesn't seem to support GL_IMG_bindless_textures for now.

Workaround: we could use a desktop GL 4.3+ context for the canvas renderer on Windows, _BUT_ the functions would conflict with the GLES canvas, unless we use function pointers that are loaded differently for each context (which we probably should, but I'd rather keep that for later).
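
A rough sketch of what per-context function pointers could look like. The gl_api table, gl_api_load() and the two stub loaders are hypothetical and only stand in for a desktop GL and a GLES context; a real version would wrap the platform loader and list every entry point the renderer uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Generic entry-point type and loader callback. A real loader (for instance
// eglGetProcAddress or wglGetProcAddress) would be wrapped to match this.
typedef void (*gl_proc)(void);
typedef gl_proc (*gl_load_proc)(const char* name);

// Hypothetical per-context table; only one entry point is shown.
typedef struct gl_api
{
	void (*Clear)(unsigned int mask);
} gl_api;

// Fills the table through the loader of one specific context.
// Returns 0 on success, -1 if an entry point is missing.
int gl_api_load(gl_api* api, gl_load_proc load)
{
	assert(api != NULL && load != NULL);
	api->Clear = (void (*)(unsigned int))load("glClear");
	return api->Clear ? 0 : -1;
}

// Stand-ins for two contexts, for illustration only: each loader hands out
// its own implementation, which records which context ran.
static int lastContext = 0;
static void desktop_clear(unsigned int mask) { (void)mask; lastContext = 1; }
static void gles_clear(unsigned int mask) { (void)mask; lastContext = 2; }

static gl_proc desktop_loader(const char* name)
{
	if(strcmp(name, "glClear") == 0)
	{
		return (gl_proc)desktop_clear;
	}
	return NULL;
}

static gl_proc gles_loader(const char* name)
{
	if(strcmp(name, "glClear") == 0)
	{
		return (gl_proc)gles_clear;
	}
	return NULL;
}
```

With one table per context, the canvas renderer would call desktop.Clear(...) and the GLES canvas gles.Clear(...), so the two sets of symbols never clash at link time.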

-> We'll probably want to do that, or make 1 draw call per changing texture.

Anyway, for now, is it possible to have an _under the hood_ atlas, and reserve a way to change the API later so that we can make the atlas explicit / allow using single textures for big images etc?

* We could decide that we can set an atlas, and all mg_images get allocated from that atlas. If no atlas is set a default one is used.

* or, we can expose mg_texture and related APIs for now, as if they were individual, but back them by a hidden atlas. And a bit later expose mg_image/atlas -> maybe better.

* Or just implement breaking the triangle stream into batches now...
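
To make the "individual textures backed by a hidden atlas" option concrete, here is a small sketch. texture_region and texture_to_atlas_uv() are hypothetical names, and it ignores texel bleeding at region borders.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical texture handle: to the user it looks like an individual
// texture, but it only records a region of a hidden atlas (in pixels).
typedef struct texture_region
{
	int x, y, w, h;
} texture_region;

// Maps texture-local coordinates (u, v in [0, 1]) to atlas coordinates.
void texture_to_atlas_uv(const texture_region* tex, int atlasWidth, int atlasHeight,
                         float u, float v, float* atlasU, float* atlasV)
{
	assert(tex != NULL && atlasU != NULL && atlasV != NULL);
	assert(atlasWidth > 0 && atlasHeight > 0);
	*atlasU = ((float)tex->x + u * (float)tex->w) / (float)atlasWidth;
	*atlasV = ((float)tex->y + v * (float)tex->h) / (float)atlasHeight;
}
```

Since user code only ever sees the handle and local uv, the backing store could later become an explicit atlas or a standalone texture without changing call sites.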

Perf issue of binding large vertex buffer
-----------------------------------------

Binding big buffers has a high cost. We should send updates in smaller batches, either
	- use a temporary storage to build vertex buffer, then send with glBufferData just before rendering
	- stream data to large buffer using glMapBufferRange (instead of mapping the whole buffer)
	- stream data to large buffer using glBufferSubData

We have to account for these edge cases:
	- how we handle overflowing the sub range (ie space allocated or mapped to build vertices)
	- how we handle overflowing buffer capacity (if using a pre-allocated buffer and glMapBufferRange/glBufferSubData)

* Using a temporary store and glBufferData forces a draw call when exceeding the temporary buffer limits. But the two cases of overflow are handled at once.
* Using a temporary store and glBufferSubData distributes data transfer asynchronously, and doesn't force a draw call when exceeding the temporary buffer. We'd still need to force a draw when exceeding the gl buffer size.
* Using glMapBufferRange also distributes data transfer asynchronously. We need to force a draw when exceeding the gl buffer size.

The first solution (temporary building buffer + glBufferData) is simpler and probably ok for low number of vertices. We can even build the vertices in an arena and virtually never care about exceeding building buffer capacity. But if we have many vertices, maybe we care about distributing transfer across asynchronous calls.
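
A sketch of the first solution's build buffer, assuming a hypothetical vertex layout and hypothetical build_buffer_* names. The GL upload itself is left to the caller, so the only logic shown is the overflow handling.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical vertex layout, for illustration only.
typedef struct vertex
{
	float pos[2];
	float uv[2];
} vertex;

typedef struct build_buffer
{
	vertex* data;
	size_t  capacity; // in vertices
	size_t  count;    // vertices written since the last reset
} build_buffer;

// Reserves room for n vertices and returns where to write them, or NULL if
// they do not fit. On NULL the caller is expected to upload data[0..count)
// (e.g. with glBufferData), issue the draw call, call build_buffer_reset(),
// and try again.
vertex* build_buffer_reserve(build_buffer* b, size_t n)
{
	assert(b != NULL && n <= b->capacity);
	if(b->capacity - b->count < n)
	{
		return NULL;
	}
	vertex* result = b->data + b->count;
	b->count += n;
	return result;
}

// Marks the buffer as empty once its content has been uploaded and drawn.
void build_buffer_reset(build_buffer* b)
{
	assert(b != NULL);
	b->count = 0;
}
```

Because the gl buffer is re-specified from the build buffer each time, there is only one capacity to overflow, which is why the two overflow cases collapse into one here.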

What happens if we exceed the gl buffer size? -> We need to make a draw call to use the vertices, and maybe then grow the buffer to a bigger size. But this implies breaking the batch, probably in the middle of a shape. This isn't really possible, because we'd need the previous candidate color and flip count transferred between batches. We could use a texture for that, but it complicates things quite a bit...

Notes:
	* Mapping/Unmapping smaller ranges of a big buffer doesn't seem to lower the cost of binding that buffer. Does the driver send the full buffer regardless of the range that was changed?
	* Orphaning the buffer before mapping doesn't seem to do any good
	* Doing glBufferData from a small build buffer is surprisingly slow...
		* Angle seems to take into account only the first call to glBufferData to allocate size, and then send the full buffer???
		* Not pre-allocating in creation procedure "solves" the problem with glBufferData...

Only way seems to be to reduce buffer size:
	* pack vec2 together to avoid padding
	* store color, clip and uv out of band?
		* eg have a shape buffer containing uv for bounding box, clip, and color?
		* could also still store color as an int "for free" next to shape index
		* could send pos and uv as half-floats?
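
To get a feel for the savings, here is a sketch comparing a "fat" vertex with a slim vertex plus an out-of-band shape record. All three layouts are hypothetical (the renderer's actual vertex format isn't described here), and the GPU-side layout rules (std140/std430) may add padding that C does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical "fat" vertex: every attribute is repeated per vertex.
typedef struct fat_vertex
{
	float pos[2];
	float uv[2];
	float color[4];
	float clip[4];
} fat_vertex;

// Hypothetical slim vertex: position plus an index into the shape buffer.
typedef struct slim_vertex
{
	float    pos[2];
	float    uv[2];
	uint32_t shapeIndex;
} slim_vertex;

// Hypothetical per-shape record holding what is constant across a shape.
typedef struct shape_data
{
	float    uvBox[4]; // uv for the bounding box
	float    clip[4];
	uint32_t color;    // RGBA8, see pack_rgba8()
} shape_data;

static uint32_t to_byte(float c)
{
	if(c < 0.0f) { c = 0.0f; }
	if(c > 1.0f) { c = 1.0f; }
	return (uint32_t)(c * 255.0f + 0.5f);
}

// Packs a color into one 32-bit int, red in the lowest byte.
uint32_t pack_rgba8(float r, float g, float b, float a)
{
	return to_byte(r) | (to_byte(g) << 8) | (to_byte(b) << 16) | (to_byte(a) << 24);
}

// Bytes sent per frame with each scheme.
size_t fat_bytes(size_t vertexCount)
{
	return vertexCount * sizeof(fat_vertex);
}

size_t split_bytes(size_t vertexCount, size_t shapeCount)
{
	return vertexCount * sizeof(slim_vertex) + shapeCount * sizeof(shape_data);
}
```

With these made-up layouts and 4-byte floats, a vertex goes from 48 to 20 bytes, and the per-shape cost is paid once per shape instead of once per vertex.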

Ideally, we would do the vertex computation on the GPU (opportunity to parallelize AND avoid sending vertex buffer...)

-> Problem profiling canvas renderer with angle (not really well supported in renderdoc)
	-> Maybe use an OpenGL backend instead?

Quick measurement on perf_text.exe
----------------------------------
* Re-allocate and copy each time with glBufferData		--> ~23ms
* Allocate big buffer and update with glBufferSubData	--> ~23ms
* Map whole buffer										--> ~44ms
* Map whole buffer with GL_MAP_INVALIDATE_BUFFER_BIT	--> ~19ms (but with some startup hiccups...)
* Map whole buffer with GL_MAP_INVALIDATE_RANGE_BIT		--> ~44ms


-> Stutter with GL_MAP_INVALIDATE_BUFFER_BIT isn't reassuring. Stick to glBufferData for now.
-> May be worth it to try persistently mapped buffers later.

* Splitting vertex data and shape data (using glBufferData) --> ~10ms


Backend Selection
-----------------

* We need to define macros to select which backends are compiled into milepost.lib
	-> define default backends and allow override with user compile defines

* We also need to define macros to select which backend-specific functions and/or graphics APIs to include when building an app with milepost (e.g. include OpenGL headers or not, etc.).

* For now, we intend to statically link milepost.lib (or dynamically load it into an embedded dll), so at application compile time the supported backends are known, and calling a backend-specific function for a backend that has not been compiled into milepost results in a link error. (No need for defines for that.)

* However, if we want to provide a nice abstraction for the creation function, this means it is possible for the user to pass a backend that does not exist in the lib (resulting in a runtime error). It would be nice to have some way to check at runtime which backends are available. That can be a function compiled into the lib.

* We also might want to access backend-specific functions in the app, so we need to include those conditionally.

* We also might want to include graphics API, eg OpenGL, in a uniform way. -> See later how we manage GLES/GL co-existence
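
A sketch of how the compile-time selection and the runtime check could fit together. The macro names, the mg_backend_id enum and mg_is_backend_available() are hypothetical, chosen only to show the shape of it.

```c
#include <stdbool.h>

// Hypothetical selection macros: defaults apply unless the user overrides
// them on the compiler command line (e.g. -DMG_COMPILE_BACKEND_METAL=0).
#ifndef MG_COMPILE_BACKEND_GLES
	#define MG_COMPILE_BACKEND_GLES 1
#endif

#ifndef MG_COMPILE_BACKEND_METAL
	#if defined(__APPLE__)
		#define MG_COMPILE_BACKEND_METAL 1
	#else
		#define MG_COMPILE_BACKEND_METAL 0
	#endif
#endif

// Hypothetical backend identifiers passed to the creation function.
typedef enum mg_backend_id
{
	MG_BACKEND_NONE,
	MG_BACKEND_GLES,
	MG_BACKEND_METAL
} mg_backend_id;

// Compiled into the lib, so it reports what the lib was built with,
// regardless of the defines used when compiling the app.
bool mg_is_backend_available(mg_backend_id backend)
{
	switch(backend)
	{
		case MG_BACKEND_GLES:
			return MG_COMPILE_BACKEND_GLES != 0;
		case MG_BACKEND_METAL:
			return MG_COMPILE_BACKEND_METAL != 0;
		default:
			return false;
	}
}
```

The creation function could call this first and fail cleanly instead of hitting a missing backend at runtime. The same macros could also guard backend-specific declarations and graphics API includes in the public headers.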