• All submissions to this site are governed by Second Life Project Contribution Agreement. By submitting patches and other information using this site, you acknowledge that you have read, understood, and agreed to those terms.
Issue Details (XML | Word | Printable)

Key: VWR-5308
Type: Bug Bug
Status: Open Open
Priority: Normal Normal
Assignee: Unassigned
Reporter: Gigs Taggart
Votes: 16
Watchers: 9
Operations

If you were logged in you would be able to see more operations.
1. Second Life Viewer - VWR

Current wind noise generation is CPU intensive.

Created: 05/Mar/08 01:12 AM   Updated: 10/Sep/09 10:30 AM
Return to search
Component/s: Sound
Affects Version/s: None
Fix Version/s: Source code

File Attachments: 1. Text File faster_wind-2.patch (5 kB)
2. Text File faster_wind-3.patch (6 kB)
3. Text File faster_wind-4.patch (7 kB)
4. Text File faster_wind-4a.patch (8 kB)
5. Text File faster_wind-5.patch (8 kB)
6. Text File faster_wind-6.patch (8 kB)
7. Text File faster_wind-7.patch (9 kB)

Issue Links:
Relates
 

Last Triaged: 10/Sep/09 06:43 PM
Source Version: 1.19.0.0
Linden Lab Issue ID: DEV-11516
Patch attached: Patch attached


 Description  « Hide
Profiling suggests that the current wind noise generation is using 6% or so of my CPU (Quad Core2 Intel 2.33Ghz).

Profiling results:
-------------------------
Here is the profile with the stock wind noise generation:
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask
of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
98301 7.0780 LLViewerObject::updateDrawable(int)
70374 5.0672 windCallback(void*, void*, int, void*)
<snip>
7730 0.5566 ll_frand(float)

The wind noise generation is making most of the calls to ll_frand here.

With the attached patch:
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
43506 5.2256 LLVOSky::calcSkyColorInDir(LLColor3&, LLColor3&, LLVector3 const&) const
22452 2.6968 _Rb_tree<wchar_t, pair<wchar_t const, LLFontGlyphInfo*>, wchar_t const>::find(wchar_t const&)
20856 2.5051 _Rb_tree<wchar_t, pair<wchar_t const, LLFontGlyphInfo*>, wchar_t const>::find(wchar_t const&) const
20493 2.4615 Get_Kerning
18572 2.2307 windCallback(void*, void*, int, void*)
<snip>
42 0.0050 ll_frand()

As you can see this patch reduces the wind noise generation from 5.5% usage to about 2.25%.

------------------------------

Technical Notes:
1. This patch removes the fast damping of frequency and pan-gain changes. Removal of the fast damping of frequency change allows for much of the resonant filter calculations to be moved out of the inner loop. The first patch turned off gain damping too, but that caused some subtle clicking. The new patch works better and profiles basically the same.

2. All of the math has been converted to F32. The final precision here is 16 bit integers, I don't think we gained much from the extra precision of doubles in the filter calculations.

3. ll_frand is no longer called, in favor of using stdlib rand(), which seems to be much faster.

4. Most of the nasty scoping problems that existed in the original implementation were fixed, but not all

One potential improvement on top of this patch is to disable the wind noise synthesis completely if the final gain for it is going to be low (i.e. the user has it turned down or muted).



 All   Comments   Change History      Sort Order: Ascending order - Click to sort in descending order
Lex Neva added a comment - 05/Mar/08 09:50 AM
An armchair question, without having actually looked at the code: if the end output is a 16-bit integer, I wonder if it's possible to make wind generation use only integer math?

Gigs Taggart added a comment - 05/Mar/08 01:21 PM
Possibly, but these filters are really touchy from my playing with it. This stuff is subtle in ways you wouldn't expect. For example, biasing the RNG by just one causes bad clipping pops. One would probably have to find a whole new filter designed to be integer from the beginning.

Gigs Taggart added a comment - 05/Mar/08 02:03 PM
New patch that fixes some subtle clicking when gain interpolation is turned off. Profile on the new patch is similar to the last one, still about 2% CPU usage.

Gigs Taggart added a comment - 05/Mar/08 05:18 PM
Here is Aimee's patch with a few enhancements from me. Aimee's patch profiles at 1.7%, better than my patch-2. This patch-3 also turns off wind synth completely when it's muted.

Gigs Taggart added a comment - 05/Mar/08 07:52 PM
This -4 patch combines Aimee's technique with my earlier one, except it does do interpolation if the frequency is changing quickly. This prevents the very subtle pop when you stop moving that Aimee observed.

This latest patch profiles at 1.1% standing still, 1.3% moving, and near 0% if you turn wind all the way down or mute it.


Aimee Trescothick added a comment - 05/Mar/08 08:28 PM - edited
I'm fairly sure there's a fair bit more mileage in this still, filtering white noise is probably not the most efficient way to generate pink, some variant of the Voss-McCartney algorithm could be used with pure integers. I haven't started thinking about the resonant filter yet, but that can probably move the same way too, along with some tweaking to make it sound a little less mournful, which is a common complaint (though will no doubt elicit a barrage of JIRAs saying "Give us back our emo wind!").

PS. My patch accidentally snuck a (double) back in ...

  • next_sample = (double)( a0 * next_sample - b1 * gY0 - b2 * gY1 );
    + next_sample = (F32)( a0 * next_sample - b1 * gY0 - b2 * gY1 );

and that cast to S16 should probably be

  • sample_left = (S16)(last_sample - sample_right);
    + sample_left = (MIXBUFFERFORMAT)(last_sample - sample_right);

( ... and yes, I know I said I was going to bed 3 hours ago, I'm going now, honest

PPS. I think "LightWind" would be a good name for this project, though it sounds kinda familiar from somewhere ...

( ... ok, I really am going now


Gigs Taggart added a comment - 05/Mar/08 08:54 PM
  • next_sample = (double)( a0 * next_sample - b1 * gY0 - b2 * gY1 );

You don't want to change that unless you also change next_sample to be F32. I thought there was a reason you made next_sample F64, so I left it.

That last cast is probably more correct to be MIXBUFFERFORMAT, but the code assumes it's S16 in other places (like multiplying by 32768), so MIXBUFFERFORMAT is kinda irrelevant, since it will never work right unless it's S16.


Aimee Trescothick added a comment - 06/Mar/08 03:41 AM - edited
Nah, it was laziness, missed that one too

Two wrongs don't make a right the 32768s are also on my hit list. MIXBUFFERFORMAT is S32 on Darwin, so they're just plain wrong.


Tofu Linden added a comment - 06/Mar/08 05:33 AM
I'm working on some simple floatification of the doubles based on faster_wind-4, will post the result if it helps.

Tofu Linden added a comment - 06/Mar/08 06:11 AM
assorted double->float cleanup. not benched, but it saves a few hundred bytes of generated code which can't be a bad sign.

Tofu Linden added a comment - 06/Mar/08 06:54 AM
(And... reading llmath.h, it's full of C-centric cruft. I'll clean that up as another issue.)

Lex Neva added a comment - 06/Mar/08 09:47 AM
Wow, this is cool 5% CPU down to ~1% is an impressive optimization, and it's awesome to see a Linden get in on this. Keep up the good work, folks.

Aimee Trescothick added a comment - 06/Mar/08 10:58 AM - edited
Here comes V5. Based on Tofu's patch, got rid of some divisions and increased the subsampling to 4X.

Should give an even greater saving, but brings some aliasing in at 11K. At normal volumes most people probably won't hear it; it will need some further work to remove it if we want to be fussy (I am anyway

Tweaked the muting threshold to act on the next to lowest slider position when stationary, as it was cutting in at the volume I normally have wind (no pun intended Not experienced a problem, but would be a good idea to add some hysteresis on this to prevent it oscillating in and out.


Aimee Trescothick added a comment - 06/Mar/08 11:01 AM
Do we get a prize for most patch iterations attached to one issue in the shortest time frame?

Gigs Taggart added a comment - 06/Mar/08 04:36 PM - edited
Running around 1%-1.2% on patch-5, no significant gain over previous patches.

The implication here is that we are hitting strong diminishing returns above 2 subsamples. 4 subsamples doesn't sound any worse than two and it's probably saving a few cycles though, so patch-5 is still good to go.

I think we'll probably hit diminishing returns from here on out.


Coco Linden added a comment - 06/Mar/08 06:57 PM
We'd appreciate it if you profilers would please describe your profiling tools, configuration, and basically describe how you generate the performance deltas that you're reporting. Please also note your machine configurations (CPU's, OS, etc.). A savings of several percent is definitely interesting but we need to be able to reproduce those results and compare them across various systems. Thank you.

Gigs Taggart added a comment - 06/Mar/08 07:52 PM
Sure. I use oprofile as recommended by.... Sardonyx I think it was.

The report is:
opreport --exclude-dependent --demangle=smart --symbols /home/jgiglio/opensl/SecondLife_i686_1_19_0_0/bin/do-not-directly-run-secondlife-bin

The oprofile daemon is monitoring CPU_CLK_UNHALTED, no vmlinux image loaded.

OS is Ubuntu Linux.

Exclude dependent is scaling our numbers up a little now that I look at it, so I guess it's a little more realistic to say we went from 3.75% to 0.75%. Still a significant gain.


Gigs Taggart added a comment - 06/Mar/08 08:36 PM - edited
Results including all libraries in the numbers: 2.18% (+0.5% ll_frand) Pre-patch, 0.58% post-patch

Aimee Trescothick added a comment - 06/Mar/08 09:38 PM
Created a separate issue for the 16-bit clipping assumption.

Aimee Trescothick added a comment - 07/Mar/08 08:42 AM - edited
If 4x sub-sampling isn't giving much of a gain I'd say stick with 2x, as that will keep the aliasing out of the audible range, and will work better with some other ideas I'm playing with at the back of my mind for future new features.

Aimee Trescothick added a comment - 07/Mar/08 10:07 AM
In case anyone's really bored and has nothing better to do ...

http://www.fmod.org/docs/HTML/FSOUND_GetCPUUsage.html

Would probably have been more useful if I remembered it existed earlier, sorry =.=


Gigs Taggart added a comment - 07/Mar/08 12:26 PM
I don't know how useful that is to us right now. Fmod itself doesn't seem to be using very much CPU; I haven't noticed it anywhere near the top when I include all libs. This wind synthesis is outside of fmod.

Aimee Trescothick added a comment - 07/Mar/08 04:36 PM - edited
It includes time spent in DSP callbacks, it's intended for this purpose.

Aimee Trescothick added a comment - 07/Mar/08 09:09 PM - edited
Hopefully the final version faster_wind-6.patch reverts subsampling to 2X while retaining the other changes. Also rolls in a fix for the unnecessary clipping when using a 32-bit mix buffer from VWR-5354, as it wasn't easy to separate.

Aimee Trescothick added a comment - 15/Mar/08 09:57 AM
OK, so it wasn't QUITE the final version

More optimizations and cleanup. Some stuff was being calculated every time when it only needs to be done once. Got rid of llfloor() from the inner loop.


Nicholaz Beresford added a comment - 05/May/08 06:12 AM

I'm using one of the later versions of this patch in my recent viewer build. A user told me that the wind sounds seems to be unaffected by master volume. I have not checked if is supposed to be that way, but I thought I'd mention it here since I did not hear about this from users before.

Aimee Trescothick added a comment - 05/May/08 06:26 AM
Yeah, that's an existing bug, fixed by the patch on VWR-4371.

Nicholaz Beresford added a comment - 05/May/08 08:21 AM
Thanx Aimee! Much appreciated.

Henri Beauchamp added a comment - 09/May/08 10:16 AM - edited
I tested this patch and for me, it does not work properly: after a couple of hours spent in a skybox (lot's of wind up there), the wind sound becomes corrupted, turning into static and crackings.

Also, as it is, the patch fails to compile under Linux with gcc 4.1 unless casts to S32 are added for sample_left and sample_right, like this:

sample_right = (S32)(last_sample * current_pan_gain_r);
sample_left = (S32)last_sample - sample_right;

(in two different places of audioengine_fmod.cpp).

Finally, the resulting theoretical speed up brought by this patch is impossible to actually measure (no significant change in the FPS) on my system (Linux 2.6.23.17, Althlon XP 3200+, Nvidia 7600GT, nForce2 AC97 Audio Controler with OSS 4.1 drivers and sounds played via OSS' vmix).


Tofu Linden added a comment - 19/Jun/08 04:45 PM
Can anyone repro Henri's problem? Any thoughts?

Henri Beauchamp added a comment - 20/Jun/08 01:46 AM - edited
Note that I could reproduce this problem with both v1.19.0.5 and v1.19.1.4 (v1.20 untested so far), but the sound static, distortions and crackings sometimes take 4 or 5 hours to appear in a skybox (I'm scripting and building up there, that's why I'm spending so much time in a raw in it), so you need a stable enough viewer so that it won't crash by itself before the bug occurs...

As for the origin of this bug, I did not investigate further as it is not my patch, but I'd say that there is an awful lot of floating point casting from F32 to S32 and float, and stuff.... and there are some static floating point variables in use too, so this could be an overflow occurring at some point... The code should be amended so that it uses only one type of floating point variables, and tested again from there.

Also, about profiling as given in the description:

> Profiling suggests that the current wind noise generation is using 6% or so of my CPU
>.../...
> 70374 5.0672 windCallback(void*, void*, int, void*)
> 7730 0.5566 ll_frand(float)

So indeed we have around 5.5%. But then, Gigs says:

> ll_frand is no longer called, in favor of using stdlib rand(), which seems to be much faster.

And in his post-patch profiling he gets:

> 18572 2.2307 windCallback(void*, void*, int, void*)
> 42 0.0050 ll_frand()

And he deduces that he gained 2.5% or so of CPU time...

But to keep things truly comparable, rand() should be profiled too and its overhead added.

Note also the figures in clock cycles: 70374 against 18572: Which is a 3.8 ratio and not 2 as the percentage figure would suggest. Conclusion: during the post-patch profiling, the CPU was much more busy doing something else out of the viewer code (or the profiling was not done in the same conditions: avatar in another sim, for example)... Could part of it be in rand() ?.... You would then end up with a much slower windCallback(), and this would explain why I did not notice any change in the frame rate with the patch applied (and this in spite of the fact that I got only a single core CPU which is 100% loaded whenever the viewer runs: such an optimization should therefore have a larger impact on my system)...

Profiling is an art... so beware of quick conclusions...

I'm not telling the patch brings nothing, but so far, it is still unclear if it brings anything but troubles...


Gigs Taggart added a comment - 20/Jun/08 07:32 AM - edited
First off, it is very arrogant and petty that you attack my and Aimee's patch, merely because I did not think one of your patches was an appropriate change.

rand() is negligible compared to the very inefficient ll_frand. I did look at it, it wasn't worth mentioning.

As was pointed out above, there was an error in methodology, it is actually more like the wind noise taking about 2.5%, reduced to 0.5%. The exaggeration in percentage came because I was excluding time in external libs during the first profile. This error was later corrected and I said that it's more like 2% CPU saved.

It is still a worthy percentage of CPU time saved, 2% is nothing to sneeze at.

I can't comment about your bug that takes 5 hours to appear, except to say that it is awfully convenient that such a hard-to-reproduce bug should arise, since it is obvious that your intentions here exceed mere technical problems with the patch.

BTW- as for the raw counts being different, that's merely because I profiled for longer or shorter times. Also oprofile is stochastic. You can't compare exact numbers of samples, as it relies on long term average sample hits which are more or less random. Read up on how it works if you are curious.


Henri Beauchamp added a comment - 20/Jun/08 10:29 AM
@Gigs

> First off, it is very arrogant and petty that you attack my and Aimee's patch, merely because
> I did not think one of your patches was an appropriate change.

Huh ??? Which patch ??? I don't even know what you are speaking about !!!

My only concern is to avoid having new bugs introduced in the viewer. I DID use your patch in the Cool SL Viewer for some time, but I HAD to remove it after I found out that it did induce troubles.

I ALWAYS have been enthusiastic when it deals with including more bug fixes and speed optimizations in the Cool SL Viewer, and I DO use some of your patches in it.

I NEVER exclude someone's work on the pretext this someone would have argued against my own work at one point or another.

So, please, stop assuming things and being paranoid, and judge exclusively on FACTS.

> I can't comment about your bug that takes 5 hours to appear,

I said that it SOMETIMES takes as much time to appear, because it happened once to me (after I made a test release of the Cool SL Viewer, trying this patch for the fourth time in a new release). But most of the time, it will trigger after just one or two hours at most (I did use two different test releases for a full week with your patch, and I ALWAYS could reproduce the bug).

So, please stop denying FACTS.

> except to say that it is awfully convenient that such a hard-to-reproduce bug should arise,
> since it is obvious that your intentions here exceed mere technical problems with the patch.

And AGAIN you are assuming things (and calling me a liar !!!). See above for my only
concern (avoiding bugs !).

> As was pointed out above, there was an error in methodology, it is actually more like the wind
> noise taking about 2.5%, reduced to 0.5%. The exaggeration in percentage came because
> I was excluding time in external libs during the first profile. This error was later corrected and
> I said that it's more like 2% CPU saved.

Sorry, I did not re-read the whole thread. I was merely replying to Tofu's comment which I received
as an email. So yes, my fault, but next time, please consider editing the description of the bug if it
contains errors, as this is what people will base their comments on in the first place.

> It is still a worthy percentage of CPU time saved, 2% is nothing to sneeze at.

It is when it introduces a new bug... or at least this is how I see things. This said, with some more
work on the patch (you could, for example, consider doing all the arithmetic with integers, which
would speed up the whole thing and avoid possible overflows in the existing casts), I'm pretty
sure you could squash that nasty bug and make the patch very worth an inclusion in the viewer.

Henri.


Khyota Wulluf added a comment - 18/Dec/08 02:58 PM
/me gives this a nudge. Anyone still workin on it?

Soft Linden added a comment - 18/Dec/08 03:28 PM
Nobody's working on it internally - it's sitting around but at lower priority than a lot of bug fixes.

If you're using it yourself and have good results with it (quality of sound and CPU performance wise), it would be helpful to hear.


Aimee Trescothick added a comment - 18/Dec/08 04:32 PM
Not forgotten, but needs reworking now OpenAL has gone in, it's on my list of 'Stuff'

Rob Linden added a comment - 05/Mar/09 10:22 AM
There's pretty substantial debate at Linden Lab about whether or not this represents a significant performance improvement. A number of people have done profiling and didn't see the current wind generation taking up significant CPU at all. What was the profiling methodology that led to the conclusion that this was a significant source of CPU usage?

Aeron Kohime added a comment - 07/Mar/09 05:01 AM
Whatever the case integer math is alot faster then floating point math on most cpu, and SL can use all the performance improvements it can get. (abiet if its less than 1% or so, it can probably wait however, compared to bigger more immediate problems).

Seg Baphomet added a comment - 07/Mar/09 04:07 PM
Christ, wrap the function in a test harness and benchmark old vs new. If it's faster, it's faster. Then merge it.

Soft Linden added a comment - 28/May/09 10:57 AM
Seg's suggestion is really what it would take to motivate this patch into the system.

Alternatively, pushing this into snowglobe for peer review would work just as well. It's certainly had no shortage of discussion.


Tofu Linden added a comment - 10/Sep/09 10:30 AM
I agree, wrap new and old versions in nice timers, please let us know.