This repository has been archived by the owner on Mar 28, 2024. It is now read-only.

[BUG-230116] Sorting group notices by date disconnects the viewer. #7853

Closed
3 tasks
sl-service-account opened this issue Jan 26, 2021 · 2 comments

Comments


sl-service-account commented Jan 26, 2021

Steps To Reproduce

  • Open the stats floater: CTRL+SHIFT+1 & note FPS & packet loss.

  • Join the group named SPAM: secondlife:///app/group/28438cdd-138b-8754-4a5f-8c9d23b6c2a7/about

  • This group is ideal to reproduce this bug quickly because it has a large number of group notices.

  • Open the SPAM group profile & go to the notices tab.

  • Sort the notices by date - click on the date column once so notices are sorted as most recent at the top.

Observed Behaviour

  • Everything seems fine for roughly the first 30 seconds to 1 minute.

  • Then viewer FPS will start to rapidly drop & packet loss will increase to 15-20%, which you can see in the statistics floater.

  • The viewer will become extremely stuttery, so much so that the UI is barely responsive to clicks.

  • The viewer never recovers performance & the viewer will disconnect after approx 2 minutes.

  • This repros 100% of the time on multiple systems.

  • See Fig 1 attached showing the 16% packet loss shortly after sorting the notices by date.

  • See Fig 2 attached showing the viewer disconnection message.

  • See Secondlife.log attached after a session that reproduced this disconnect.

Expected Behaviour

  • The viewer should not disconnect when loading a large number of group notices.

Other Information

  • This bug also reproduces on Firestorm Viewer.
    It's actually worse on Firestorm because we sort group notices by date by default.

  • I believe this is technically a viewer bug but it is drastically exacerbated by the server bug BUG-229936, which causes group notices not to be culled after 14 days.
    So as time goes on, this disconnect will happen in more & more groups as the number of group notices grows.

  • Beq Janus from Firestorm will comment with more technical information about what causes this bug and also she will contribute a fix.

Attachments


Original Jira Fields
Field Value
Issue BUG-230116
Summary Sorting group notices by date disconnects the viewer.
Type Bug
Priority Unset
Status Closed
Resolution Triaged
Reporter Whirly Fizzle (whirly.fizzle)
Created at 2021-01-26T18:38:46Z
Updated at 2022-02-28T18:25:22Z
{
  'Build Id': 'unset',
  'Business Unit': ['Platform'],
  'Date of First Response': '2021-01-26T16:28:35.798-0600',
  'ReOpened Count': 0.0,
  'Severity': 'Unset',
  'System': 'SL Viewer',
  'Target Viewer Version': 'viewer-development',
  'What just happened?': 'Filling in',
  'What were you doing when it happened?': '...',
  'What were you expecting to happen instead?': '...',
  'Where': 'Any region',
}

Beq Janus commented at 2021-01-26T22:28:36Z

Ohhh, where to start? This is mostly FAO Vir Linden I would guess.

The main message is: please fix the server side ASAP, because the longer this goes on, the larger groups grow and the more people will be affected.

The remainder is for viewer fixes.

This bug uncovers a veritable pit of vipers. There are, I think, five viewer "features" highlighted by it.

1 & 2 are the main causes of this issue and of the general "pause" when opening a group under normal circumstances.

Essentially, the problem is caused by two significant inefficiencies in the viewer code. These are normally hidden by the constraint on the number of notices, but are now significant and growing.

When the group profile panel/floater is opened, it triggers a refresh of the notices list via GroupNoticesListRequest, which results in a series of GroupNoticesListReply messages in batches of 16.

Prior to making the request, the llScrollList UI is cleared of all data.

For every message received, a check is made against the NotificationsList (the scroll-list UI element), which is backed by a std::deque. We search the list for the UUID, which requires a linear search with O(N) complexity, and since we expect the UUID not to be found 99.9% of the time, we hit near-worst-case performance on every search. With every newly added notice the list gets longer and the search gets slower.
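To illustrate how the per-message linear scan compounds, here is a toy model (not the viewer's actual code; names are illustrative) that counts comparisons: processing N notices this way costs O(N²) comparisons in total.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Tiny model of the hot path: for each incoming notice we linearly scan
// the existing rows for its UUID before appending. Since the id is
// (almost) never already present, every scan runs to the end of the list.
size_t comparisonsToInsert(size_t n) {
    std::deque<std::string> rows;       // stands in for the scroll list
    size_t comparisons = 0;
    for (size_t i = 0; i < n; ++i) {
        std::string id = "notice-" + std::to_string(i);
        for (const auto& r : rows) {    // linear dupe scan, worst case
            ++comparisons;
            if (r == id) break;         // never hit in this model
        }
        rows.push_back(id);
    }
    return comparisons;                 // n*(n-1)/2 total comparisons
}
```

At 10 notices that is 45 comparisons; at 1,000 notices it is already 499,500, which is why the cost only becomes visible once the server stops culling notices.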

After every batch of 16 we sort the entire list for good measure (if sorting is enabled). The sort, of course, gets progressively slower as the list grows, and given the multiple batches per frame across contiguous frames, it achieves nothing useful.

All the following screenshots show telemetry captures that correspond largely with Fast Timers (the lower levels are additional capture points added for fault tracking).

The first video clip here shows the cost of the "uniqueness" check, https://i.gyazo.com/08aff228f7419493a88dfe578d9caa81.mp4

This issue affects all viewers based on the LL code base, irrespective of sorting. On its own, it will eventually disconnect a user but more typically it will cause a large drop in FPS, and a loss of responsiveness.

In the case of a large set of notices these two combined are fatal to the viewer session, in other cases they result in significant Jitter/freezing and UDP message loss.

Problem number 3 is more general in nature and relates to how the viewer is designed to deal with a flood of data.

By default the viewer allows 0.2 seconds per frame to handle all network messages (this is itself a generous slice, and it creates a dependency on viewer performance tuning that is outside the user's control).

When the viewer cannot handle the message load in a given frame, that value is increased by a small margin, with the stated intention of growing to accommodate the data flood. It is, unfortunately, uncapped; when combined with a bug such as the two features above, it grows indefinitely, starving the viewer further by allowing the idle processing to run the message-handling code multiple times.

image 1) we have already grown to the point where we are processing multiple batches (2 and then 3 the frame after) of messages per frame. https://i.gyazo.com/f63f81d9de35198c1c6fe0a8cae38215.png

image 2) 40 frames later and the time has grown further still (four batches and the frame time in excess of half a second) https://i.gyazo.com/f6521bc05711d8887c74ab5c511500fe.png

By plotting the time allowed for network processing we can see how it grows. In this case, we manage to process the data once it reaches ~1 second dedicated network processing time. Of course, the viewer is effectively hung now.


Issue #4 - A new notice doesn't add itself to the list; instead it deletes the list (yes, the one we just tried so hard to build) and requests the entire set again.

I have no idea why this is done. It is possible that this is FS-only, but my investigation so far suggests that it is not. The only possible benefit I can see is that it forces a refresh and thus "hopefully" recovers from earlier dropped packets. In our current extreme circumstances, the extra load is in fact causing more data loss. If a notice arrives while we are already processing another (the odds of which grow with the time spent), everything gets trampled and the duplicate checking earns its living for once.

Issue #5 - A minor one, but worth considering. At the end of every batch, just after the sorting (if sorting is required), if the panel is open (which in our examples it is), the top of the list is selected. This selection causes a request from the viewer to the simulator, which then replies on the IM channel with an IM_GROUP_NOTICE update containing the full details of that notice. Of course, by the time it returns, the next batch (or batches) have been processed and that data is of no use.

 

Proposed fixes

  1. The O(N) dupe check.

There is only one place in the viewer code base where a scroll list is searched for a UUID only to prove that it is not there; all other uses are to locate a known-present UUID. As such, the change I propose is applied to the NoticesList processing and not to the underlying scroll list. An additional std::set<LLUUID> is added, and notice IDs are inserted into the set (as well as into the scroll list); this gives O(log N) lookup complexity. It is debatable whether this is worth doing once the limit is restored to 200; however, the code is simple, the size overhead is small, and anything that reduces jitter is worth considering.
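A minimal sketch of this side-set approach, using std::string in place of LLUUID and a plain deque standing in for the scroll list (all names here are hypothetical, not the actual patch):

```cpp
#include <deque>
#include <set>
#include <string>

// Hypothetical stand-ins: the real viewer uses LLUUID and an LLScrollList.
struct NoticeRow { std::string id; std::string subject; };

class NoticeList {
public:
    // Returns false (and skips insertion) when the notice is a duplicate.
    bool addNotice(const NoticeRow& row) {
        // O(log N) membership test against the side set, instead of the
        // O(N) linear scan over the scroll list's std::deque.
        if (!mSeenIds.insert(row.id).second)
            return false;              // already present
        mRows.push_back(row);          // mirrors adding to the scroll list
        return true;
    }
    size_t size() const { return mRows.size(); }
private:
    std::deque<NoticeRow> mRows;       // the UI-backing container
    std::set<std::string> mSeenIds;    // extra index for fast dupe checks
};
```

The set duplicates only the 16-byte IDs, not the row data, so the memory overhead stays small relative to the scroll-list entries themselves.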

  2. The sorting per batch

I have implemented a deferred sort that is triggered only when there has not been an update to the data within a preset number of frames. This assumes that anything being flooded would be consumed over sequential frames and effectively defers the sort until the end.
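The deferred-sort idea can be sketched with a simple frame-counting trigger (structure and names are illustrative, not the actual Firestorm change):

```cpp
#include <algorithm>
#include <vector>

// Instead of re-sorting after every 16-notice batch, mark the list dirty
// and sort only once no new data has arrived for kQuietFrames consecutive
// frames. A flood of batches over sequential frames thus defers the sort
// until the stream goes quiet.
class DeferredSorter {
public:
    explicit DeferredSorter(int quietFrames) : mQuietFrames(quietFrames) {}

    void onDataAdded() { mDirty = true; mFramesSinceUpdate = 0; }

    // Called once per frame; returns true if a sort was performed.
    bool onFrame(std::vector<int>& data) {
        if (!mDirty) return false;                    // nothing pending
        if (++mFramesSinceUpdate < mQuietFrames) return false;
        std::sort(data.begin(), data.end());          // the one real sort
        mDirty = false;
        return true;
    }
private:
    int mQuietFrames;
    int mFramesSinceUpdate = 0;
    bool mDirty = false;
};
```

However long the flood lasts, the list is sorted exactly once, shortly after the last batch arrives, instead of once per batch.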

  3. The uncapped network read.

Placing an upper cap on the network time allowed trades the ability to handle large floods of data for comparative responsiveness in the viewer. I believe this is the better option since, uncapped, the result is a failure in any case.
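The trade-off can be sketched as follows; the constants here are illustrative, not the viewer's actual values:

```cpp
#include <algorithm>

// Model of the per-frame network budget: whenever a frame fails to drain
// the message queue, the budget grows by a small margin. Without a cap it
// grows without bound under a sustained flood; with one it stays bounded.
double growBudget(double budget, bool backlog, double growth = 0.05,
                  double cap = -1.0 /* negative = uncapped */) {
    if (backlog) {
        budget += growth;
        if (cap > 0.0) budget = std::min(budget, cap);
    }
    return budget;
}
```

Simulating 100 consecutively backlogged frames starting from the 0.2 s default, the uncapped budget grows past five seconds per frame (the viewer is effectively hung), while a capped budget stays at its ceiling and keeps the frame loop responsive at the cost of dropped UDP data.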

  4. The pointless wipe/refresh

No fix for this yet, I'd like to understand why this is done and whether there is a sound design behind it (and whether it is FS specific) 

  5. The unnecessary detail requests

This can be fixed by moving the selection to a point after the list update completes; however, unlike the deferred sorting, there is no immediately obvious place to do this without scheduling callbacks, and the problem does not warrant more complexity. For now, I intend to leave this be.

 

I have #1, #2 and #3 implemented in FS and can contribute; however:

#1 is probably of less interest once the server-side is fixed

#2 is strictly speaking a problem for FS as we have sorting enabled by default. However, unsorted notices are pretty useless, so it is probably worth using.

#3 depends on what the design philosophy is: FPS versus packet loss. If you would like patches for any of these, please let me know.



Dan Linden commented at 2021-01-26T23:02:15Z

Thank you for the report and investigation!
