|
|
|
[
Permlink
| « Hide
]
Lex Neva added a comment - 05/Mar/08 09:50 AM
An armchair question, without having actually looked at the code: if the end output is a 16-bit integer, I wonder if it's possible to make wind generation use only integer math?
Possibly, but these filters are really touchy from my playing with it. This stuff is subtle in ways you wouldn't expect. For example, biasing the RNG by just one causes bad clipping pops. One would probably have to find a whole new filter designed to be integer from the beginning.
New patch that fixes some subtle clicking when gain interpolation is turned off. Profile on the new patch is similar to the last one, still about 2% CPU usage.
Here is Aimee's patch with a few enhancements from me. Aimee's patch profiles at 1.7%, better than my patch-2. This patch-3 also turns off wind synth completely when it's muted.
This -4 patch combines Aimee's technique with my earlier one, except it does do interpolation if the frequency is changing quickly. This prevents the very subtle pop when you stop moving that Aimee observed.
This latest patch profiles at 1.1% standing still, 1.3% moving, and near 0% if you turn wind all the way down or mute it. I'm fairly sure there's a fair bit more mileage in this still, filtering white noise is probably not the most efficient way to generate pink, some variant of the Voss-McCartney algorithm could be used with pure integers. I haven't started thinking about the resonant filter yet, but that can probably move the same way too, along with some tweaking to make it sound a little less mournful, which is a common complaint (though will no doubt elicit a barrage of JIRAs saying "Give us back our emo wind!").
PS. My patch accidentally snuck a (double) back in ...
and that cast to S16 should probably be
( ... and yes, I know I said I was going to bed 3 hours ago, I'm going now, honest PPS. I think "LightWind" would be a good name for this project, though it sounds kinda familiar from somewhere ... ( ... ok, I really am going now
You don't want to change that unless you also change next_sample to be F32. I thought there was a reason you made next_sample F64, so I left it. That last cast is probably more correct to be MIXBUFFERFORMAT, but the code assumes it's S16 in other places (like multiplying by 32768), so MIXBUFFERFORMAT is kinda irrelevant, since it will never work right unless it's S16. Nah, it was laziness, missed that one too
Two wrongs don't make a right I'm working on some simple floatification of the doubles based on faster_wind-4, will post the result if it helps.
assorted double->float cleanup. not benched, but it saves a few hundred bytes of generated code which can't be a bad sign.
(And... reading llmath.h, it's full of C-centric cruft. I'll clean that up as another issue.)
Here comes V5. Based on Tofu's patch, got rid of some divisions and increased the subsampling to 4X.
Should give an even greater saving, but brings some aliasing in at 11K. At normal volumes most people probably won't hear it; it will need some further work to remove it if we want to be fussy (I am anyway Tweaked the muting threshold to act on the next to lowest slider position when stationary, as it was cutting in at the volume I normally have wind (no pun intended Do we get a prize for most patch iterations attached to one issue in the shortest time frame?
Running around 1%-1.2% on patch-5, no significant gain over previous patches.
The implication here is that we are hitting strong diminishing returns above 2 subsamples. 4 subsamples doesn't sound any worse than two and it's probably saving a few cycles though, so patch-5 is still good to go. I think we'll probably hit diminishing returns from here on out. We'd appreciate it if you profilers would please describe your profiling tools, configuration, and basically describe how you generate the performance deltas that you're reporting. Please also note your machine configurations (CPU's, OS, etc.). A savings of several percent is definitely interesting but we need to be able to reproduce those results and compare them across various systems. Thank you.
Sure. I use oprofile as recommended by.... Sardonyx I think it was.
The report is: The oprofile daemon is monitoring CPU_CLK_UNHALTED, no vmlinux image loaded. OS is Ubuntu Linux. Exclude dependent is scaling our numbers up a little now that I look at it, so I guess it's a little more realistic to say we went from 3.75% to 0.75%. Still a significant gain. Results including all libraries in the numbers: 2.18% (+0.5% ll_frand) Pre-patch, 0.58% post-patch
Created a separate issue for the 16-bit clipping assumption.
If 4x sub-sampling isn't giving much of a gain I'd say stick with 2x, as that will keep the aliasing out of the audible range, and will work better with some other ideas I'm playing with at the back of my mind for future new features.
In case anyone's really bored and has nothing better to do ...
http://www.fmod.org/docs/HTML/FSOUND_GetCPUUsage.html Would probably have been more useful if I remembered it existed earlier, sorry =.= I don't know how useful that is to us right now. Fmod itself doesn't seem to be using very much CPU; I haven't noticed it anywhere near the top when I include all libs. This wind synthesis is outside of fmod.
It includes time spent in DSP callbacks, it's intended for this purpose.
Hopefully the final version faster_wind-6.patch reverts subsampling to 2X while retaining the other changes. Also rolls in a fix for the unnecessary clipping when using a 32-bit mix buffer from VWR-5354, as it wasn't easy to separate.
OK, so it wasn't QUITE the final version
More optimizations and cleanup. Some stuff was being calculated every time when it only needs to be done once. Got rid of llfloor() from the inner loop. I'm using one of the later versions of this patch in my recent viewer build. A user told me that the wind sounds seems to be unaffected by master volume. I have not checked if is supposed to be that way, but I thought I'd mention it here since I did not hear about this from users before. Yeah, that's an existing bug, fixed by the patch on
Thanx Aimee! Much appreciated.
I tested this patch and for me, it does not work properly: after a couple of hours spent in a skybox (lot's of wind up there), the wind sound becomes corrupted, turning into static and crackings.
Also, as it is, the patch fails to compile under Linux with gcc 4.1 unless casts to S32 are added for sample_left and sample_right, like this: sample_right = (S32)(last_sample * current_pan_gain_r); (in two different places of audioengine_fmod.cpp). Finally, the resulting theoretical speed up brought by this patch is impossible to actually measure (no significant change in the FPS) on my system (Linux 2.6.23.17, Althlon XP 3200+, Nvidia 7600GT, nForce2 AC97 Audio Controler with OSS 4.1 drivers and sounds played via OSS' vmix). Can anyone repro Henri's problem? Any thoughts?
Note that I could reproduce this problem with both v1.19.0.5 and v1.19.1.4 (v1.20 untested so far), but the sound static, distortions and crackings sometimes take 4 or 5 hours to appear in a skybox (I'm scripting and building up there, that's why I'm spending so much time in a raw in it), so you need a stable enough viewer so that it won't crash by itself before the bug occurs...
As for the origin of this bug, I did not investigate further as it is not my patch, but I'd say that there is an awful lot of floating point casting from F32 to S32 and float, and stuff.... and there are some static floating point variables in use too, so this could be an overflow occurring at some point... The code should be amended so that it uses only one type of floating point variables, and tested again from there. Also, about profiling as given in the description: > Profiling suggests that the current wind noise generation is using 6% or so of my CPU So indeed we have around 5.5%. But then, Gigs says: > ll_frand is no longer called, in favor of using stdlib rand(), which seems to be much faster. And in his post-patch profiling he gets: > 18572 2.2307 windCallback(void*, void*, int, void*) And he deduces that he gained 2.5% or so of CPU time... But to keep things truly comparable, rand() should be profiled too and its overhead added. Note also the figures in clock cycles: 70374 against 18572: Which is a 3.8 ratio and not 2 as the percentage figure would suggest. Conclusion: during the post-patch profiling, the CPU was much more busy doing something else out of the viewer code (or the profiling was not done in the same conditions: avatar in another sim, for example)... Could part of it be in rand() ?.... You would then end up with a much slower windCallback(), and this would explain why I did not notice any change in the frame rate with the patch applied (and this in spite of the fact that I got only a single core CPU which is 100% loaded whenever the viewer runs: such an optimization should therefore have a larger impact on my system)... Profiling is an art... so beware of quick conclusions... I'm not telling the patch brings nothing, but so far, it is still unclear if it brings anything but troubles... First off, it is very arrogant and petty that you attack my and Aimee's patch, merely because I did not think one of your patches was an appropriate change.
rand() is negligible compared to the very inefficient ll_frand. I did look at it, it wasn't worth mentioning. As was pointed out above, there was an error in methodology, it is actually more like the wind noise taking about 2.5%, reduced to 0.5%. The exaggeration in percentage came because I was excluding time in external libs during the first profile. This error was later corrected and I said that it's more like 2% CPU saved. It is still a worthy percentage of CPU time saved, 2% is nothing to sneeze at. I can't comment about your bug that takes 5 hours to appear, except to say that it is awfully convenient that such a hard-to-reproduce bug should arise, since it is obvious that your intentions here exceed mere technical problems with the patch. BTW- as for the raw counts being different, that's merely because I profiled for longer or shorter times. Also oprofile is stochastic. You can't compare exact numbers of samples, as it relies on long term average sample hits which are more or less random. Read up on how it works if you are curious. @Gigs
> First off, it is very arrogant and petty that you attack my and Aimee's patch, merely because Huh ??? Which patch ??? I don't even know what you are speaking about !!! My only concern is to avoid having new bugs introduced in the viewer. I DID use your patch in the Cool SL Viewer for some time, but I HAD to remove it after I found out that it did induce troubles. I ALWAYS have been enthusiastic when it deals with including more bug fixes and speed optimizations in the Cool SL Viewer, and I DO use some of your patches in it. I NEVER exclude someone's work on the pretext this someone would have argued against my own work at one point or another. So, please, stop assuming things and being paranoid, and judge exclusively on FACTS. > I can't comment about your bug that takes 5 hours to appear, I said that it SOMETIMES takes as much time to appear, because it happened once to me (after I made a test release of the Cool SL Viewer, trying this patch for the fourth time in a new release). But most of the time, it will trigger after just one or two hours at most (I did use two different test releases for a full week with your patch, and I ALWAYS could reproduce the bug). So, please stop denying FACTS. > except to say that it is awfully convenient that such a hard-to-reproduce bug should arise, And AGAIN you are assuming things (and calling me a liar !!!). See above for my only > As was pointed out above, there was an error in methodology, it is actually more like the wind Sorry, I did not re-read the whole thread. I was merely replying to Tofu's comment which I received > It is still a worthy percentage of CPU time saved, 2% is nothing to sneeze at. It is when it introduces a new bug... or at least this is how I see things. This said, with some more Henri. /me gives this a nudge. Anyone still workin on it?
Nobody's working on it internally - it's sitting around but at lower priority than a lot of bug fixes.
If you're using it yourself and have good results with it (quality of sound and CPU performance wise), it would be helpful to hear. Not forgotten, but needs reworking now OpenAL has gone in, it's on my list of 'Stuff'
There's pretty substantial debate at Linden Lab about whether or not this represents a significant performance improvement. A number of people have done profiling and didn't see the current wind generation taking up significant CPU at all. What was the profiling methodology that led to the conclusion that this was a significant source of CPU usage?
Whatever the case integer math is alot faster then floating point math on most cpu, and SL can use all the performance improvements it can get. (abiet if its less than 1% or so, it can probably wait however, compared to bigger more immediate problems).
Christ, wrap the function in a test harness and benchmark old vs new. If it's faster, it's faster. Then merge it.
Seg's suggestion is really what it would take to motivate this patch into the system.
Alternatively, pushing this into snowglobe for peer review would work just as well. It's certainly had no shortage of discussion. I agree, wrap new and old versions in nice timers, please let us know.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||