Win32 rendering is slow on Intel GPU #13

Open
opened 2023-07-05 22:05:37 +00:00 by ilidemi · 13 comments
Collaborator

Pong runs at about 5fps on the Intel GPU, uses about 1.5 CPU cores, and consumes 3-4GB of RAM.

The Visual Studio profiler says all the CPU time is spent in GL calls - wglMakeCurrent, glBufferData, glGetIntegeri_v, glGetError.
Profile: https://cdn.discordapp.com/attachments/1126271320364167299/1126271664372600952/win32-intel-opengl.diagsession
(to see symbol names, open the orca.exe linked below as a VS project, then open the session)

GLIntercept reports wglMakeCurrent calls take 100-200ms (log attached)

When I force the GPU to NVidia (RTX 3050 Ti), it runs at 100fps and consumes 100MB of RAM. Looks like an Intel-specific driver issue or a bad interaction with it.

Build: orca superpong (4578c8d767) + milepost new_gl_canvas (1e34c3406f) https://cdn.discordapp.com/attachments/1126271320364167299/1126271640658006066/Pong.7z
Hardware: Dell XPS 15 9510
GPU: Intel UHD Graphics Xe 32EUs (Tiger Lake-H)
Driver version: 30.0.101.1404 (2/18/2022)
OS: Windows 10 Pro 21H2

Collaborator

I can't repro this one on my machine. Only initial loading is slow for me.
Can you check what the timings of wglMakeCurrent etc are on a minimal WGL example?
Does it still happen when you draw basically nothing?

You could also try changing the size of GL buffers in gl_canvas.c:

MG_GL_PATH_BUFFER_SIZE
MG_GL_ELEMENT_BUFFER_SIZE
MG_GL_SEGMENT_BUFFER_SIZE
MG_GL_PATH_QUEUE_BUFFER_SIZE
MG_GL_TILE_QUEUE_BUFFER_SIZE
MG_GL_TILE_OP_BUFFER_SIZE

These are throwaway sizes I put there to get the new renderer up and running. I should process paths in smaller batches (the infra already exists for this because different source images must be processed in separate batches). This should also take care of not overflowing those buffers (which it will totally do on the current version!!). If these are the culprits it will be good to know while I'm finishing the new renderer. I bet it could be because the Intel driver seems to back GPU buffers by VRAM?
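Concretely, the quickest experiment is just shrinking the count factor in those defines, something like the sketch below (the struct names here are placeholders, not the real types in gl_canvas.c; only the count is the thing to experiment with):

```c
// Placeholder sketch of the kind of change to try in gl_canvas.c.
// The real defines multiply a count by sizeof() of the corresponding GPU-side
// struct; the point is to drop the count from the current throwaway value.
#define MG_GL_PATH_BUFFER_SIZE     ((4<<10)*sizeof(mg_gl_path_placeholder))
#define MG_GL_ELEMENT_BUFFER_SIZE  ((4<<10)*sizeof(mg_gl_element_placeholder))
```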

Author
Collaborator

I haven't programmed Windows OpenGL before, but a random small example found online (https://github.com/emoon/minifb/blob/master/tests/noise.c) runs at 120FPS without pegging the CPU. wglMakeCurrent takes 0.05-0.3ms.

With buffer sizes reduced to (4<<10)*sizeof(...) pong runs at 65 FPS with one core pegged and 173MB RAM used, so it seems related to memory.

~~Random guess - Intel System Analyzer shows 15 buffer creations per frame and there are 15 calls to glBufferData in glInterceptLog, could it be that allocating data every frame causes churn?~~
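(For anyone reproducing the per-call numbers: a QueryPerformanceCounter wrapper along these lines is enough. This is only a sketch; window/context setup is omitted.)

```c
// Sketch: time a single wglMakeCurrent call with QueryPerformanceCounter.
// Assumes hdc/hglrc were already created the usual way (ChoosePixelFormat,
// wglCreateContext, ...); link against opengl32.lib.
#include <windows.h>
#include <stdio.h>

static void time_make_current(HDC hdc, HGLRC hglrc)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    wglMakeCurrent(hdc, hglrc);
    QueryPerformanceCounter(&t1);

    double ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    printf("wglMakeCurrent: %.3f ms\n", ms);
}
```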

Author
Collaborator

Intel Graphics Frame Analyzer has useful info - it's 3 compute shader calls taking 4ms each (that's with 4<<10). Frame attached; had to add the .txt extension to please Gitea.

ilidemi was assigned by MartinFouilleul 2023-07-07 08:51:25 +00:00
MartinFouilleul self-assigned this 2023-07-07 08:51:25 +00:00
Collaborator

Huh, I can't extract your 7z archive (tried both with 7-zip and 9-Zip). It produces a 0-byte gpa_frame file...
Wasn't the time spent in wglMakeCurrent()? If the compute shaders are what's taking the time, I'd expect it to show up in wglSwapBuffers() somehow?
Btw the shader that seems to take up all the time is raster.glsl (it's called three times because we do a different pass per source texture). It is a bit surprising that it would depend on the size of the buffers, because it should only use the first few elements in the scene we have...

Author
Collaborator

My bad, here's a working one. I was using a build of 7-zip with zstd support, and it had a bug where it says it's using LZMA2 but actually defaults to zstd. Looks like Gitea likes zips too.

On wglMakeCurrent() timing, GLIntercept has this note:

//////////////////////////////////////////////////////////////
//
//  Function time logging
//
//////////////////////////////////////////////////////////////
//
//  NOTE: It is important to not mis-use the results of this logger. OpenGL is a very pipelined
//        API and you can not optimize your code based on how long is spent in each function call.
//        This logger is only intended for advanced users to determine where pipeline stalls "MAY"
//        have occured and determine speeds of operations such as glReadPixels etc.

My interpretation is that it shows there's _a_ slowdown, but on the specifics we should trust Intel GPA more.
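If we want GPU-side numbers straight from the app, rather than from wherever the driver happens to stall on the CPU, a GL timer query around the suspect work should be less pipeline-sensitive. Rough sketch, assuming a GL 3.3+ context with the query functions loaded:

```c
// Sketch: measure actual GPU time for a block of GL work with GL_TIME_ELAPSED,
// instead of inferring it from CPU time spent inside wglMakeCurrent/glGetError.
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
// ... the draws / compute dispatches to measure ...
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs); // blocks until the GPU is done
printf("GPU time: %.3f ms\n", (double)elapsedNs / 1.0e6);
glDeleteQueries(1, &query);
```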

Author
Collaborator

Btw to be completely clear, this capture is with (4<<10) buffer sizes. So the initial 3fps problem definitely seems memory-related, and after changing that, the next question is why 65fps and not the full 120.

My (possibly mistaken) interpretation of the capture is that each invocation of raster.glsl does a full window's worth of processing to draw a ball or a paddle, which is similar to overdrawing multiple times. Integrated graphics is underpowered in the face of overdraw at high resolutions.
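As a rough back-of-the-envelope check (assuming a ~1920×1080 window, which is just a round number): three full-window compute passes are on the order of 3 × 2.07M ≈ 6.2M per-pixel invocations per frame, so at 65fps that's already ~400M invocations per second before any real scene complexity.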

Collaborator

Your interpretation is correct. Normally all solid shapes can be rendered with one draw call, but I'm breaking the processing into batches for each source image. The possible plans to avoid doing too many draw calls are:

  • Use an array of texture samplers in the shader, so we can do e.g. 10 images per batch.
  • Use bindless textures, but it seems Intel support isn't great for this (?)
  • Alternatively users can upload their images to an atlas using mg_image_atlas_alloc_from_data() etc. With this it should be possible to only have one draw call.

However, 4ms for one invocation of raster.glsl still seems slower than expected. There are a couple of things I can think of to do less work.

  • Anti-aliasing is super dumb right now (it's basically full super-sampling), so I could sample source colors only once and only compute a coverage from sub-samples.
  • Instead of dispatching an instance of raster.glsl for every pixel I could have an intermediate buffer that stores the indices of tiles that are covered, and only dispatch those tiles. This could cost a bit more when drawing large shapes, but save a substantial amount of work when doing batches of smaller shapes.

Nevertheless, I'd really like to understand the memory problem to inform how we send input data to the GPU. 4<<10 path elements is kinda small, and the smaller our buffers are, the more batches we have to do. Have you tried an even smaller size to see if it still runs faster?
Admittedly, I'm not really sure what the best way to send these buffers is in OpenGL; for now I'm just orphaning them every frame using glBufferData(). I tried mapping/unmapping them at some point but it somehow took more time. I've not looked into persistent mapping though.
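For the record, my understanding of what persistent mapping would look like (untested sketch, needs GL 4.4 or ARB_buffer_storage; reusing the existing size define):

```c
// Untested sketch: allocate an immutable buffer once and keep it mapped,
// instead of orphaning with glBufferData() every frame.
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

GLuint pathBuffer;
glGenBuffers(1, &pathBuffer);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, pathBuffer);
glBufferStorage(GL_SHADER_STORAGE_BUFFER, MG_GL_PATH_BUFFER_SIZE, NULL, flags);
char* mapped = (char*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0,
                                       MG_GL_PATH_BUFFER_SIZE, flags);

// Per frame: write path data into `mapped` at a rotating offset, then bind that
// range for the dispatch. A glFenceSync/glClientWaitSync per region is needed so
// we don't overwrite data the GPU is still reading (typically 2-3 regions in flight).
```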

bvisness added this to the Jam MVP milestone 2023-07-09 14:46:58 +00:00
Author
Collaborator

Played around with buffer sizes. 4<<9 results in about the same FPS as 4<<10, and 4<<8 produces text artifacts and/or crashes.

From looking at larger sizes, increasing up to 4<<14 seems to be just as good:

Size    FPS   CPU        Mem          GPU
4<<10   68    14%        172MB        92%
4<<12   68    14%        202MB        90%
4<<14   68    13%        298MB        91%
4<<16   66    13%        647MB        91%
4<<18   29    16%+13%K   1500MB       39%
4<<20   3     18%+16%K   2800-4000MB  12%

Looking in Intel Graphics Trace Analyzer, frames in 4<<10 to 4<<14 diligently wait on DxgkPresent. 4<<16 (at least with trace capture attached) starts having stalls in DxgkDestroyAllocation2. 4<<20 is dominated by DxgkDestroyAllocation2, DxgkCreateAllocation, DxgkLock2 and DxgkUnlock2.

Traces: https://discord.com/channels/239737791225790464/1126271320364167299/1128217927863259177

Also tried a 4<<14 build on a 2013 ThinkPad X240 with Intel graphics and haven't seen any memory issues (although only 18fps, even at a lower resolution). So if this is solved with a lower threshold, 4<<14 seems appropriate.

Author
Collaborator

I am very much a novice on GPU memory management, but advice I've seen from Jasper on the GP Discord (although in #webgpu) is to allocate a big enough buffer, dole out individual sub-buffers in draw calls, and not reuse sections within a frame. If the frame is bigger than expected, chain another big buffer. Idk if that is applicable to OpenGL; might be worth asking in #help or heading over to GP.
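If it does translate to OpenGL, I imagine it looks roughly like one big buffer with a bump cursor and glBindBufferRange per dispatch. Very rough sketch, all names made up:

```c
// Very rough sketch of "one big buffer, bump-allocate sub-ranges per draw call".
// The struct/function names are made up; glBindBufferRange is the real GL call.
typedef struct {
    GLuint buffer;       // one large buffer allocated up front
    GLsizeiptr capacity;
    GLsizeiptr cursor;   // reset to 0 at the start of each frame
} gl_bump_buffer;

// Reserve `size` bytes and bind them at `bindingIndex` for the next dispatch.
// Returns the offset, or -1 if the frame overflowed and another big buffer is needed.
static GLintptr gl_bump_alloc(gl_bump_buffer* b, GLuint bindingIndex, GLsizeiptr size)
{
    // NOTE: offsets should be rounded up to GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT
    if(b->cursor + size > b->capacity) { return -1; }
    GLintptr offset = (GLintptr)b->cursor;
    glBindBufferRange(GL_SHADER_STORAGE_BUFFER, bindingIndex, b->buffer, offset, size);
    b->cursor += size;
    return offset;
}
```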

For the amount of work per pixel, I'd (again, novice) lean towards picking the low-hanging fruit like the supersampling, and seeing how much time the bigger refactors would take compared to the rest of the MVP work needed. Possible questions to consider: will the UI be drawn in many draw calls? Are we expecting people to build games with multiple textures? Do apps need UI icons? Do they need to be raster, or is asking people to stick to canvas ops fine for starters?

Collaborator

Thanks! This is super useful. I'm curious if the memory problem only comes from MG_GL_PATH_BUFFER_SIZE and MG_GL_ELEMENT_BUFFER_SIZE (since these are the only buffers that get re-specified each frame), or from all buffer sizes?

Author
Collaborator

Keeping these two at 4<<14 and the rest at 4<<20 produces 60+ FPS but does use 1750MB of RAM.

Collaborator

Perf and memory usage should be substantially improved in 835097f8b5.

I have some trouble confirming this on your remote machine though, because VNC lags and skips frames, but at least RenderDoc tells me I'm hitting 60FPS.

Author
Collaborator

Can confirm low CPU and memory usage with Pong at 120fps. However, on my main machine it only renders the glClear-colored background on the Intel GPU (NVidia works fine). Frame captures show that the rendering pipeline runs, but per RenderDoc the glDrawArrays input and output are the game (even though the window is solid blue), while Intel GFA shows the input as the game and the output as solid blue.

Also, it takes about 6 seconds from command launch until the OnFrameResize() log entry appears, on either GPU.

bvisness added the windows label 2023-09-17 14:59:38 +00:00