Win32 rendering is slow on Intel GPU #13
Pong runs at about 5fps on Intel GPU and consumes 1.5 cores of CPU and 3-4GB of RAM.
Visual Studio profiler says all the CPU is spent in GL calls - wglMakeCurrent, glBufferData, glGetIntegeri_v, glGetError.
Profile: https://cdn.discordapp.com/attachments/1126271320364167299/1126271664372600952/win32-intel-opengl.diagsession
(to see symbol names, open the orca.exe linked below as a VS project, then open the session)
GLIntercept reports wglMakeCurrent calls take 100-200ms (log attached)
When I force the GPU to NVidia (RTX 3050 Ti), it runs at 100fps and consumes 100MB of RAM. Looks like an Intel-specific driver issue, or an interaction between the driver and our GL usage.
Build: orca superpong (4578c8d767) + milepost new_gl_canvas (1e34c3406f)
https://cdn.discordapp.com/attachments/1126271320364167299/1126271640658006066/Pong.7z
Hardware: Dell XPS 15 9510
GPU: Intel UHD Graphics Xe 32EUs (Tiger Lake-H)
Driver version: 30.0.101.1404 (2/18/2022)
OS: Windows 10 Pro 21H2
I can't repro this one on my machine. Only initial loading is slow for me.
Can you check what the timings of `wglMakeCurrent` etc. are on a minimal WGL example? Does it still happen when you draw basically nothing?
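For reference, one minimal way to get those timings on Windows, sketched below. This is not taken from the Pong/milepost code; the WGL setup and the `hdc`/`hglrc` handles are assumed to already exist.

```c
// Sketch: wrap wglMakeCurrent in QueryPerformanceCounter to see per-call cost.
// hdc and hglrc come from a minimal WGL setup (omitted here).
#include <windows.h>
#include <stdio.h>

static double now_ms(void)
{
    static LARGE_INTEGER freq = {0};
    LARGE_INTEGER t;
    if(!freq.QuadPart) { QueryPerformanceFrequency(&freq); }
    QueryPerformanceCounter(&t);
    return(1000.0 * (double)t.QuadPart / (double)freq.QuadPart);
}

static void timed_make_current(HDC hdc, HGLRC hglrc)
{
    double t0 = now_ms();
    wglMakeCurrent(hdc, hglrc);
    double t1 = now_ms();
    printf("wglMakeCurrent: %.3f ms\n", t1 - t0);
}
```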
You could also try changing the size of the GL buffers in `gl_canvas.c`: these are throwaway sizes I put there to get the new renderer up and running. I should process paths in smaller batches (the infra already exists for this because different source images must be processed in separate batches). This should also take care of not overflowing those buffers (which it will totally do in the current version!!). If these are the culprits it will be good to know while I'm finishing the new renderer. I bet it could be because the Intel driver seems to back GPU buffers with VRAM?
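For context, the experiment being suggested is just shrinking the constants used when the canvas buffers are (re)specified each frame. A hypothetical sketch follows; the define names come up later in this thread, but the element type and buffer handle are placeholders, not the actual gl_canvas.c source.

```c
// Hypothetical sketch of the experiment, not the real gl_canvas.c code:
// shrink the throwaway sizes used to (re)specify GPU storage each frame.
// 'path_elt' and 'pathBuffer' are placeholder names.
#define MG_GL_PATH_BUFFER_SIZE    ((4<<20)*sizeof(path_elt))   // throwaway; try e.g. (4<<10)
#define MG_GL_ELEMENT_BUFFER_SIZE ((4<<20)*sizeof(path_elt))

glBindBuffer(GL_SHADER_STORAGE_BUFFER, pathBuffer);
glBufferData(GL_SHADER_STORAGE_BUFFER, MG_GL_PATH_BUFFER_SIZE, 0, GL_STREAM_DRAW);
```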
I haven't programmed Windows OpenGL before, but a random small example found online runs at 120FPS without pegging the CPU. `wglMakeCurrent` takes 0.05-0.3ms.

With buffer sizes reduced to `(4<<10)*sizeof(...)`, Pong runs at 65 FPS with one core pegged and 173MB RAM used, so it seems related to memory.

Random guess: Intel System Analyzer shows 15 buffer creations per frame and there are 15 calls to glBufferData in the GLIntercept log. Could it be that allocating data every frame causes churn?

Intel Graphics Frame Analyzer has useful info: it's 3 compute shader calls taking 4ms each (that's with 4<<10). Frame attached, had to add the .txt to please Gitea.
Huh, I can't extract your 7z archive (tried both with 7-Zip and 9-Zip). It produces a 0-byte gpa_frame file...
Wasn't the time spent in `wglMakeCurrent()`? If the compute shaders are what's taking the time, I'd expect it to show in `wglSwapBuffers()` somehow?

Btw, the shader that seems to take up all the time is `raster.glsl` (it's called three times because we do a different pass per source texture). It is a bit surprising that it would depend on the size of the buffers, because it should only use the first few elements in the scene we have...
My bad, here's a working one. I was using a build of 7-Zip with zstd support and it had a bug where it says it's using lzma2 but actually defaults to zstd. Looks like Gitea likes zips too.

On `wglMakeCurrent()` timing, GLIntercept has this note:

My interpretation is that it shows there's a slowdown, but on the specifics we should trust Intel GPA more.

Btw, to be completely clear, this capture is with (4<<10) buffer sizes. So the initial 3fps problem definitely seems memory-related, and after changing that the next question is: why 65fps and not the full 120?
My (possibly mistaken) interpretation of the capture is that each invocation of `raster.glsl` does a full window's worth of processing to draw a ball or a paddle, which is similar to overdrawing multiple times. Integrated graphics is underpowered in the face of overdraw at high resolutions.

Your interpretation is correct. Normally all solid shapes can be rendered with one draw call, but I'm breaking the processing into batches for each source image. The plan to avoid doing too many draw calls is to put source images in a texture atlas (`mg_image_atlas_alloc_from_data()` etc.). With this it should be possible to only have one draw call.

However, 4ms for one invocation of `raster.glsl` still seems slower than expected. There's a couple of things I can think of to try and do less work: instead of running `raster.glsl` for every pixel, I could have an intermediate buffer that stores the indices of tiles that are covered, and only dispatch those tiles. This could cost a bit more when drawing large shapes, but save a substantial amount of work when doing batches of smaller shapes.
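For what it's worth, the "only dispatch covered tiles" idea maps onto GL 4.3 indirect dispatch: a coarse pass appends covered tile indices to a queue and writes the group count into an indirect-args buffer, then the raster pass launches one workgroup per covered tile. A very rough host-side sketch; all names are placeholders and none of this is implemented anywhere:

```c
// Sketch of a "dispatch only covered tiles" flow using indirect compute dispatch.
// binProgram/rasterProgram/dispatchArgsBuffer/tileQueueBuffer are placeholders.
typedef struct { GLuint num_groups_x, num_groups_y, num_groups_z; } dispatch_args;

// 1) Coarse binning pass: for each path, mark covered tiles, append their indices
//    to tileQueueBuffer, and atomically bump dispatch_args.num_groups_x.
glUseProgram(binProgram);
glDispatchCompute((pathCount + 255) / 256, 1, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_COMMAND_BARRIER_BIT);

// 2) Raster pass: one workgroup per covered tile; the shader reads its tile
//    index from tileQueueBuffer instead of deriving it from gl_WorkGroupID.
glUseProgram(rasterProgram);
glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, dispatchArgsBuffer);
glDispatchComputeIndirect(0);
```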
Nevertheless, I'd really like to understand the memory problem, to inform how we send input data to the GPU. `4<<10` path elements is kinda small, and the smaller our buffers are, the more batches we have to do. Have you tried an even smaller size to see if it still runs faster?

Admittedly, I'm not really sure what the best way to send these buffers in OpenGL is; for now I'm just orphaning them every frame using `glBufferData()`. I tried mapping/unmapping them at some point but it somehow took more time. I've not looked into persistent mapping though.
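To make the two options concrete, here is a hedged sketch of both patterns in plain GL; buffer handles and sizes are placeholders, and persistent mapping needs GL 4.4 or ARB_buffer_storage:

```c
// Option A: orphaning, as currently done. Re-specifying the store with
// glBufferData lets the driver hand out fresh memory while the previous
// frame may still be reading the old store.
glBindBuffer(GL_SHADER_STORAGE_BUFFER, pathBuffer);
glBufferData(GL_SHADER_STORAGE_BUFFER, bufferSize, 0, GL_STREAM_DRAW);   // orphan
glBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, dataSize, pathData);        // upload

// Option B: persistent mapping. Allocate immutable storage once, keep it mapped,
// write into it each frame, and use fences so the CPU never overwrites data the
// GPU is still reading (fencing omitted here).
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBindBuffer(GL_SHADER_STORAGE_BUFFER, pathBuffer);
glBufferStorage(GL_SHADER_STORAGE_BUFFER, bufferSize, 0, flags);
void* mapped = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, bufferSize, flags);
```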
Played around with buffer sizes. `4<<9` results in about the same FPS as `4<<10`, and `4<<8` produces text artifacts and/or crashes. From looking at larger sizes, increasing up to `4<<14` seems to be just as good:

Looking in Intel Graphics Trace Analyzer, frames in `4<<10` to `4<<14` diligently wait on DxgkPresent. `4<<16` (at least with the trace capture attached) starts having stalls in DxgkDestroyAllocation2. `4<<20` is dominated by DxgkDestroyAllocation2, DxgkCreateAllocation, DxgkLock2 and DxgkUnlock2.

Traces: https://discord.com/channels/239737791225790464/1126271320364167299/1128217927863259177
Also tried a `4<<14` build on a 2013 ThinkPad X240 with Intel graphics and haven't seen any memory issues (although only 18fps, even at a lower resolution). So if this is solved with a lower threshold, this one seems appropriate.

I am very much a novice on GPU memory management, but advice I've seen from Jasper on the GP discord (although in #webgpu) is to allocate a big enough buffer, dole out individual sub-buffers for draw calls, and not reuse sections within a frame. If the frame is bigger than expected, chain another big buffer. Idk if that is applicable to OpenGL; might be worth asking in #help or heading over to GP.
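A rough translation of that advice into plain GL, just to pin down what "dole out sub-buffers and don't reuse within a frame" could look like; all names are hypothetical, and whether this actually helps the Intel driver is exactly the open question:

```c
// Sketch: one big GPU buffer, bump-allocate a fresh sub-range per batch,
// reset the cursor once per frame, never reuse a range within a frame.
typedef struct
{
    GLuint buffer;     // created once, large enough for a typical frame
    size_t capacity;
    size_t cursor;     // reset to 0 at the start of each frame
} gl_arena;

// Returns the byte offset of a fresh region inside the big buffer.
static size_t gl_arena_push(gl_arena* arena, const void* data, size_t size)
{
    if(arena->cursor + size > arena->capacity)
    {
        // frame bigger than expected: chain another big buffer here
    }
    size_t offset = arena->cursor;
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, arena->buffer);
    glBufferSubData(GL_SHADER_STORAGE_BUFFER, (GLintptr)offset, (GLsizeiptr)size, data);
    arena->cursor += size;
    return(offset);
}
// Each batch then binds its own range with glBindBufferRange(GL_SHADER_STORAGE_BUFFER,
// binding, arena->buffer, offset, size);
```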
For the amount of work per pixel, I'd (again, as a novice) incline towards picking the low-hanging fruit like supersampling, and seeing how much time the bigger refactors would take compared to the rest of the MVP work needed. Possible questions to consider: will the UI be drawn in many draw calls? Are we expecting people to build games with multiple textures? Do apps need UI icons? Do they need to be raster, or is asking people to stick to canvas ops fine for starters?
Thanks! This is super useful. I'm curious if the memory problem only comes from `MG_GL_PATH_BUFFER_SIZE` and `MG_GL_ELEMENT_BUFFER_SIZE` (since these are the only buffers that get re-specified each frame), or from all buffer sizes?

Keeping these two at `4<<14` and the rest at `4<<20` produces 60+FPS but does use 1750MB of RAM.

Perf and memory usage should be substantially improved in 835097f8b5. I have some trouble confirming this on your remote machine though, because VNC lags and skips frames, but at least RenderDoc tells me I'm hitting 60FPS.
Can confirm low CPU and memory usage, with Pong at 120fps. However, on my main machine it only renders the glClear-colored background on the Intel GPU (NVidia works fine). Frame captures show that the rendering pipeline runs, but per RenderDoc both the input and output of glDrawArrays show the game (even though the window is solid blue), while Intel GFA shows the game as input and solid blue as output.

Also, it takes about 6 seconds from launching the command until the OnFrameResize() log entry appears, on either GPU.