[BUG-230116] Sorting group notices by date disconnects the viewer. #7853
Comments
Beq Janus commented at 2021-01-26T22:28:36Z Ohhh, where to start? This is mostly FAO Vir Linden, I would guess. The main message is: please fix the server side ASAP, because the longer this goes on, the more the groups grow and the more people will be affected. The remainder is for viewer fixes.

This bug uncovers a veritable pit of vipers. There are, I think, 5 viewer "features" highlighted by this. 1 & 2 are the main causes of this issue and of the general "pause" when opening a group under normal circumstances. Essentially, the problem is caused by two significant inefficiencies in the viewer code; these are normally hidden by the constraint on notices, but are significant and growing now.

When the group profile panel/floater is opened, it causes a refresh of the notices list via GroupNoticesListRequest, which results in a series of GroupNoticesListReply messages in batches of 16. Prior to making the request, the llScrollList UI is cleared of all data. For every message that is received, a uniqueness check is made against the NotificationsList (the scrolllist UI element). After every batch (of 16), we sort the entire list for good measure (if sorting is enabled). The sort, of course, gets progressively slower as the list grows and, given the multiple batches per frame and across contiguous frames, achieves nothing useful.

All the following screenshots show telemetry captures that correspond largely with Fasttimers (the lower levels are further capture points added for fault tracking). The first video clip here shows the cost of the "uniqueness" check: https://i.gyazo.com/08aff228f7419493a88dfe578d9caa81.mp4

This issue affects all viewers based on the LL code base, irrespective of sorting. On its own, it will eventually disconnect a user, but more typically it will cause a large drop in FPS and a loss of responsiveness.
In the case of a large set of notices, these two combined are fatal to the viewer session; in other cases they result in significant jitter/freezing and UDP message loss.

Problem number 3 is more general in nature and relates to how the viewer is designed to deal with a flood of data. By default, the viewer allows 0.2 seconds per frame to handle all the network messages (this in itself is a generous slice and creates a dependency on viewer performance tuning that is external to the user). When the viewer cannot handle the message load in a given frame, that value is increased by a small margin, with the stated intention of growing to accommodate the data flood. It is, unfortunately, uncapped and, when linked to a bug such as the two features above, will grow indefinitely, starving the viewer further by allowing the idle processing to run the message handling code multiple times.

Image 1) We have already grown to the point where we are processing multiple batches of messages per frame (2, and then 3 the frame after). https://i.gyazo.com/f63f81d9de35198c1c6fe0a8cae38215.png

Image 2) 40 frames later, and the time has grown further still (four batches and the frame time in excess of half a second). https://i.gyazo.com/f6521bc05711d8887c74ab5c511500fe.png

By plotting the time allowed for network processing, we can see how it grows. In this case, we manage to process the data once it reaches ~1 second of dedicated network processing time. Of course, the viewer is effectively hung by then.
Issue #4 - A new notice doesn't add itself to the list; it deletes the list (yes, the one we just tried so hard to build) and requests the entire set again. I have no idea why this is done. It is possible that this is only in FS, but my investigation so far suggests that is not the case. The only possible benefit I can see is that it forces a refresh and thus "hopefully" recovers from earlier dropped packets. In our extreme circumstances at the present time, the extra load is in fact causing more data loss. If a notice arrives while we are already processing another (the odds of which grow with the more time spent), it all just gets trampled, and the duplicate checking earns its living for once.

Issue #5 - A minor one, but worth considering. At the end of every batch, just after the sorting (if sorting is required), if the panel is open (which in our examples it is) then the top of the list is selected. This selection causes a request from the viewer to the simulator, which then replies on the IM channel with an IM_GROUP_NOTICE update containing the full details of that notice. Though of course, by the time it returns, the next batch (or batches) have been processed and that data is of no use.
Proposed fixes
#1 - There is only one place in the viewer code base where a scrolllist is searched for a UUID only to prove that it is not there; all other uses are to locate a UUID that is known to be present. As such, the change I propose is applied to the NoticesList processing and not to the underlying scrolllist.
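To illustrate the shape of this fix (names and types here are hypothetical, not the actual viewer API): searching the scroll list itself is a linear scan, so checking every incoming notice against it costs O(n) per notice and O(n²) per refresh, whereas a side set of seen UUIDs in the notice-processing layer makes each check O(1) on average.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative sketch only; NoticeRow/NoticeList are hypothetical stand-ins
// for the viewer's scrolllist rows and the notices panel.
struct NoticeRow {
    std::string id;       // notice UUID, as a string for simplicity
    std::string subject;
};

class NoticeList {
public:
    // The approach the report describes: a linear scan of the list per
    // incoming notice, O(n^2) over a full refresh.
    bool addUniqueLinear(const NoticeRow& row) {
        for (const auto& r : rows_) {
            if (r.id == row.id) return false;  // duplicate, skip
        }
        rows_.push_back(row);
        return true;
    }

    // Proposed fix: track seen UUIDs in a hash set alongside the list,
    // so the uniqueness check is O(1) on average.
    bool addUniqueHashed(const NoticeRow& row) {
        if (!seen_.insert(row.id).second) return false;  // already present
        rows_.push_back(row);
        return true;
    }

    std::size_t size() const { return rows_.size(); }

private:
    std::vector<NoticeRow> rows_;
    std::unordered_set<std::string> seen_;
};
```

The point of applying this at the NoticesList level rather than inside the scrolllist is that no other caller pays for the extra set.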
#2 - I have implemented a deferred sort that is triggered only when there has been no update to the data within a preset number of frames. This assumes that anything being flooded will be consumed over sequential frames, and it effectively defers the sort until the end.
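The deferred sort amounts to a small piece of per-frame bookkeeping; a minimal sketch (class and method names assumed, not taken from the viewer source):

```cpp
#include <cassert>

// Sketch of a deferred sort trigger: instead of sorting after every batch,
// mark the list dirty on each update and only fire the sort once no new
// data has arrived for a preset number of frames. A flood arrives on
// sequential frames, so this collapses many sorts into one.
class DeferredSorter {
public:
    explicit DeferredSorter(int quietFrames) : quietFrames_(quietFrames) {}

    // Call whenever a batch of notices is appended to the list.
    void onDataUpdated() {
        dirty_ = true;
        framesSinceUpdate_ = 0;   // restart the quiet-period countdown
    }

    // Call once per frame; returns true when the sort should run now.
    bool onFrame() {
        if (!dirty_) return false;
        if (++framesSinceUpdate_ >= quietFrames_) {
            dirty_ = false;       // sort exactly once per flood
            return true;
        }
        return false;
    }

private:
    int quietFrames_;
    int framesSinceUpdate_ = 0;
    bool dirty_ = false;
};
```

Each new batch resets the countdown, so the sort fires only after the flood has gone quiet for the configured number of frames.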
#3 - Placing an upper cap on the amount of network time allowed trades the ability to handle large floods of data for comparative responsiveness in the viewer. I believe this is the better option, as when uncapped the result is a failure in any case.
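In other words, the growth of the per-frame network budget stays, but it is clamped to a ceiling. A sketch, with the growth step and cap values chosen purely for illustration (only the 0.2 s default comes from the report):

```cpp
#include <cassert>

// Hypothetical model of the per-frame network time budget. The viewer
// starts at 0.2 s and grows the budget by a small margin whenever a frame
// cannot drain the message queue; the proposed fix clamps that growth.
class NetBudget {
public:
    // Called when a frame fails to process the full message load.
    double grow() {
        budget_ += kGrowthStep;              // existing behaviour: grow
        if (budget_ > kMaxBudget)            // proposed fix: hard ceiling
            budget_ = kMaxBudget;
        return budget_;
    }

    double current() const { return budget_; }

private:
    static constexpr double kGrowthStep = 0.05;  // illustrative margin
    static constexpr double kMaxBudget  = 0.5;   // illustrative cap, seconds
    double budget_ = 0.2;                        // default slice per frame
};
```

With the cap in place, a pathological flood costs bounded frame time instead of growing toward the ~1 second hang described above.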
#4 - No fix for this yet; I'd like to understand why this is done and whether there is a sound design behind it (and whether it is FS-specific).
#5 - This can be fixed by moving the selection to a point after the list update completes. However, unlike the deferred sorting, there is no immediately obvious place to do this without scheduling callbacks, and the problem does not warrant the added complexity. For now, I intend to leave this be.
I have #1, #2 and #3 implemented in FS and can contribute them. However, #1 is probably of less interest once the server side is fixed. #2 is, strictly speaking, a problem for FS, as we have sorting enabled by default; however, unsorted notices are pretty useless, so it is probably worth taking anyway. #3 depends on the design philosophy: FPS versus packet loss. If you would like patches for any of these, please let me know.
Dan Linden commented at 2021-01-26T23:02:15Z Thank you for the report and investigation!
Steps To Reproduce
Open the stats floater: CTRL+SHIFT+1 & note FPS & packet loss.
Join the group named SPAM: secondlife:///app/group/28438cdd-138b-8754-4a5f-8c9d23b6c2a7/about
This group is ideal to reproduce this bug quickly because it has a large number of group notices.
Open the SPAM group profile & go to the notices tab.
Sort the notices by date - click on the date column once so notices are sorted as most recent at the top.
Observed Behaviour
Everything will seem fine for the first 30 seconds to 1 minute.
Then viewer FPS will start to rapidly drop & packet loss will increase to 15-20%, which you can see in the statistics floater.
The viewer will become extremely stuttery, so much so that the UI is barely responsive to clicks.
The viewer never recovers performance & will disconnect after approx 2 minutes.
This repros 100% of the time on multiple systems.
See Fig 1 attached showing the 16% packet loss shortly after sorting the notices by date.
See Fig 2 attached showing the viewer disconnection message.
See Secondlife.log attached after a session that reproduced this disconnect.
Expected Behaviour
The viewer should not disconnect when loading a large number of group notices.
Other Information
This bug also reproduces on Firestorm Viewer.
It's actually worse on Firestorm because we sort group notices by date by default.
I believe this is technically a viewer bug but it is drastically exacerbated by the server bug BUG-229936, which causes group notices not to be culled after 14 days.
So as time goes on, this disconnect will happen in more & more groups as the number of group notices grows.
Beq Janus from Firestorm will comment with more technical information about what causes this bug, and she will also contribute a fix.
Attachments