
Key: SVC-29
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Critical
Assignee: kelly linden
Reporter: Apotheus Silverman
Votes: 120
Watchers: 21
2. Second Life Service - SVC

XMLRPC sporadically (but usually) very slow

Created: 22/Feb/07 07:16 AM   Updated: 18/Nov/08 10:27 AM
Component/s: XML-RPC
Affects Version/s: None
Fix Version/s: 1.18.3


Linden Lab Issue ID: DEV-5497


Description
One week ago (2/15/2007) XML-RPC calls from external servers suddenly began taking much longer than usual. I originally decided to wait, but this problem still persists after yesterday's update.

The average time for an XML-RPC call to return something to the calling server went from 5-10 seconds to over 1 minute. I am seeing many calls take up to 3 minutes before the server connection times out and assumes the call failed. I experimented with removing the hard server timeout altogether, but found that on some calls the HTTP session hangs open indefinitely.
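
For readers unfamiliar with the call being described, here is a minimal sketch of an external request with a hard client-side timeout. It is written in Python rather than the PHP most posters here use. The endpoint path, the llRemoteData method name and the Channel/IntValue/StringValue members are assumptions not spelled out in this ticket, and the channel key and 30-second timeout are placeholders.

```python
import urllib.request
import xmlrpc.client

# Assumed endpoint; this ticket only names the host xmlrpc.secondlife.com.
ENDPOINT = "http://xmlrpc.secondlife.com/cgi-bin/xmlrpc.cgi"


def build_request(channel, int_value, string_value):
    """Serialize one llRemoteData call (assumed method name) to a request body."""
    params = {"Channel": channel, "IntValue": int_value, "StringValue": string_value}
    return xmlrpc.client.dumps((params,), methodname="llRemoteData").encode("utf-8")


def send(body, timeout_seconds=30.0):
    """POST the body; raises an OSError subclass on connect failure or timeout."""
    req = urllib.request.Request(ENDPOINT, data=body, headers={"Content-Type": "text/xml"})
    with urllib.request.urlopen(req, timeout=timeout_seconds) as resp:
        return resp.status, resp.read()


# Placeholder channel key; a real one comes from the in-world script.
body = build_request("00000000-0000-0000-0000-000000000000", 1, "ping")
```

Calling send(body) would perform the actual request. Keeping the timeout explicit avoids the indefinitely hanging HTTP session mentioned above.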

From my tests, it also appears that the amount of time from the beginning of the XML-RPC call to when scripts in-world receive the data also increased accordingly. My tests involved SL Exchange terminals and the problem has been witnessed grid-wide. Health of the region where the LSL scripts reside doesn't seem to play a role here - the problem occurs in both empty and heavily loaded sims.



Apotheus Silverman added a comment - 22/Feb/07 07:19 AM
I am not sure if this is related to issue SVC-1 or not, as I did not experience the problem widespread until 2/15.

Lex Neva added a comment - 22/Feb/07 08:36 AM
I can confirm this issue. It's really annoying, and I'm not even trying to run a commercial system using XML-RPC.

Rob Linden added a comment - 23/Feb/07 09:57 AM
Treating as duplicate of SVC-1. We're working on getting the performance up for the entire system.

Nynthan Folsom added a comment - 26/May/07 03:51 AM
If this was closed because it was a duplicate, I can't find the original.

And no, this issue is not resolved. XML-RPC still has a round-trip time of 30 seconds to over a minute.


aric linden added a comment - 17/Jul/07 09:27 AM
Fixed by an internal update to XML-RPC.

Phidian Krasner added a comment - 09/Sep/07 09:10 PM
XML-RPC calls from an external source (PHP) to an SL object appear to be going unheard for periods of time, then functioning for other periods of time.

The issue does not appear to be occasional; it's actually quite frequent.

Date: 9/9/07


Apotheus Silverman added a comment - 13/Sep/07 10:02 AM
I've also witnessed a reversion to this behavior beginning a few days ago. Sometimes the connection to xmlrpc.secondlife.com times out, sometimes the messages take longer to be delivered to the scripts than the server is willing to wait for, and sometimes they outright fail.

I don't have hard numbers, but it looks like the error rate across the grid is optimistically somewhere around 5%.


Apotheus Silverman added a comment - 19/Sep/07 08:21 AM
Just wanted to post an update that I am still seeing a consistent error rate around 5%.

In most cases our servers return "unable to connect to xmlrpc.secondlife.com:80 (Connection timed out)". The error happens when attempting to open a connection before any data is sent.


Impalah Shenzhou added a comment - 09/Oct/07 02:49 PM
Another confirmation for this issue.
I have a commercial product that makes intensive use of xml-rpc calls (string buffer filled with 250 chars) and it works 10% of the time... if it works at all.
Simple xml-rpc calls (e.g. 1 integer and 10 chars in the string) take more than 30 seconds to process.
The messages vary, but 2 are frequent: timeout and "unrecognized address".

Asha Eerie added a comment - 10/Oct/07 12:10 AM
I am having this issue too; it is impossible to make the simplest xml-rpc calls; I always get a timeout error for a nonexistent address.

Koray Akina added a comment - 10/Oct/07 01:14 AM
I am having this issue too; it is impossible to make the simplest xml-rpc calls; I always get a timeout error for a nonexistent address.

Melnik Balogh added a comment - 10/Oct/07 12:48 PM
Me too - all of the above. FIX IT ALREADY..!!!

Apotheus Silverman added a comment - 14/Oct/07 02:39 PM
Beginning in the last 1 or 2 days the error rate has increased to approximately 10%.

Adam Kaupas added a comment - 14/Oct/07 03:22 PM
I've also noticed a large increase in both the error rate and the average time it takes to process the request; so much so that when an XML-RPC request is initiated by an HTTP request from LSL, the HTTP request is timed out by SL before there is any XML-RPC response.

About 60% of my XML-RPC calls are failing in the end, although only 10 to 20% of this is due to the "Unable to connect to xmlrpc.secondlife.com:80 (Connection timed out)" error.

With this kind of success rate, XML-RPC is practically unusable.


Melnik Balogh added a comment - 15/Oct/07 01:37 PM
Question: is there another way to connect other than XML-RPC, or is this the only way?

Apotheus Silverman added a comment - 20/Oct/07 01:09 PM
Starting yesterday afternoon (after the bulk of the XMLRPC problem announced on the SL blog was fixed) this problem is even worse. The error rate is now close to 20% and some messages are returning errors even with HTTP 200 codes. Here's an example:

expat reports error code 4
description: Empty document
line: 1
column: 1
byte index: 0
total bytes: 0

data beginning 0 before byte index: /1.1 200 OK
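
One way a calling server could guard against the case shown above is to treat an empty body as a failure before handing it to the XML parser. This is a minimal Python sketch (not the PHP code that produced the expat error); the function name and the wording of the detail strings are hypothetical.

```python
import xmlrpc.client
from xml.parsers.expat import ExpatError


def parse_reply(status, body):
    """Return (ok, detail); an empty or unparsable body counts as a failure."""
    if status != 200:
        return False, "HTTP status %d" % status
    if not body.strip():
        # HTTP 200 with nothing in it: we cannot tell whether the message was delivered.
        return False, "empty body with HTTP 200; delivery unknown"
    try:
        params, _ = xmlrpc.client.loads(body)
    except xmlrpc.client.Fault as fault:
        return False, "XML-RPC fault: %s" % fault.faultString
    except ExpatError as err:
        return False, "malformed XML: %s" % err
    return True, params
```

Counting the "empty body" case separately from timeouts makes it easier to report how many requests have an unknown delivery status.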


Beware Hax added a comment - 24/Oct/07 02:12 AM
i have a vendor network, and i have seen xml-rpc failure rates of 80% during some periods, and other periods during which it works fast. during the last few days i think half of my sales failed because of this

Beware Hax added a comment - 24/Oct/07 05:58 AM
i don't know if this makes any sense, but as the XML-RPC problems seem to be caused by overload of a single server, how about using multiple XML-RPC servers, and using DNS round robin without TTL to distribute the calls over multiple servers?

Apotheus Silverman added a comment - 30/Oct/07 09:46 AM
The failure rate and errors I previously mentioned are still the same as they have been. I don't know what percentage of the overall xmlrpc request volume SL Exchange generates, but I have a feeling it's been somewhat high (our servers attempt about 100,000 requests per day currently on behalf of customers and merchants). I've optimized our polling process somewhat to reduce our request volume by about 1/3 - 1/2 and will be looking for ways to reduce it further without affecting our quality of service. Hopefully this helps.

Apotheus Silverman added a comment - 02/Nov/07 09:09 AM
Could someone at Linden Lab with knowledge of the internal workings of the XML-RPC service please contact me in-world or via email? I have some specific questions outside the scope of this ticket which may go a long way toward improving things.

Apotheus Silverman added a comment - 03/Nov/07 11:26 AM
In the past 2 or 3 days the error rate we're experiencing has increased to about 30% even after scaling back the number of requests we send by about 2/3 (I've cut our requests back by at least 60,000 per day). Furthermore the occurrence of errors and blank responses with HTTP 200 codes has increased significantly and there is no indication of whether those requests are actually delivered or not. These failures are in addition to the fact that each successful request takes a minimum of 30 seconds to be delivered.

Please either correct these problems or give us a reliable alternative. As it stands right now, XML-RPC is for all intents and purposes a failed service.


Ann Otoole added a comment - 03/Nov/07 12:32 PM
they just close anything they don't like.

Virtouse Lilienthal added a comment - 04/Nov/07 04:49 AM
I have noticed that the default settings of my vserver caused my Apache to "die" when using xmlrpc. Now my server forces disconnects after a specified timeout.
Well, for me it is not a big issue, since I don't need those commands to be reliable. But Apotheus needs that for SLX, and I was affected by this delivery error because of communication problems too!

Please fix it.

Regards


FlipperPA Peregrine added a comment - 05/Nov/07 06:17 AM
Seriously, please fix this. I have a very important demonstration that relies on XML-RPC tomorrow, and a failure rate of over 50% isn't helping.

This is probably the highest failure rate of any system in Second Life, and some of your bigger businesses and most interactive, innovative products rely on XML-RPC.


Jade Lily added a comment - 05/Nov/07 08:39 AM
OnRez has also been suffering from this problem. Would appreciate some attention to it. Thanks!

Beware Hax added a comment - 05/Nov/07 03:11 PM
if you currently rely on xml-rpc, consider using http request polling instead. i moved to this. it has also been recommended by a linden. i think the new slex magicbox also uses it.

Gibson Willis added a comment - 05/Nov/07 07:51 PM
Beware Hax: We use HTTP requests to do the heavy lifting, but use XML-RPC to let the in-world objects know when to phone home. HTTP polling simply isn't a scalable solution. Having thousands of in-world objects poll every minute would result in millions of unnecessary hits on our backend infrastructure every day. Let's say hypothetically that we had 20,000 in-world dropboxes that were capable of delivering goods. If each of those dropboxes checks in once a minute, that's 28,800,000 hits to our web service every day, instead of 1 hit or fewer per sale when the dropboxes only check in on demand.

I'm not suggesting necessarily that XMLRPC is the ultimate answer, but we really need to have some reliable and scalable mechanism of communicating from out-of-world to in-world. And since another alternative doesn't yet exist, XMLRPC is it.
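
The arithmetic behind the 28,800,000 figure can be checked with a few lines of Python. The function names are only illustrative.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440


def polling_hits_per_day(objects, polls_per_minute=1):
    """Hits per day if every object polls on a fixed schedule."""
    return objects * polls_per_minute * MINUTES_PER_DAY


def on_demand_hits_per_day(sales_per_day, hits_per_sale=1):
    """Hits per day if objects only check in when notified of a sale."""
    return sales_per_day * hits_per_sale


print(polling_hits_per_day(20_000))  # 28800000, the figure quoted above
```

With polling, the load scales with the number of deployed objects. With on-demand check-ins, it scales with the number of sales.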


Virtouse Lilienthal added a comment - 06/Nov/07 02:57 AM
@Gibson Willis

Well, I haven't tested it, so I can't tell how well it works, but how about maintaining a kind of job queue for the object?
The object is triggered by email when there is a job, and then it starts processing the jobs. The mail could even be empty; it just needs to be sent to the box. A possible use case:

  • Buy on website started
  • website sends mail to the box containing the item which should be purchased
  • box receives an email and requests the job queue from the website
  • box delivers and tells the website that the job has been processed
  • website removes the job from the queue
  • box requests further jobs if there are any

Additionally the box says hello to the website once a day and checks for updates, jobs etc
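
The website side of the steps above could look roughly like the following. This is a minimal in-memory sketch in Python, untested against Second Life; the class, its method names and the box identifiers are hypothetical, and a real site would keep the queue in a database.

```python
from collections import deque


class JobQueue:
    """Per-box queue kept on the website; the email is only a wake-up signal."""

    def __init__(self):
        self._jobs = {}  # box_id -> deque of (job_id, item)
        self._next_id = 1

    def add_job(self, box_id, item):
        """Website side: record a purchase; the caller then emails the box."""
        job_id = self._next_id
        self._next_id += 1
        self._jobs.setdefault(box_id, deque()).append((job_id, item))
        return job_id

    def next_job(self, box_id):
        """Box side: fetch the oldest pending job without removing it."""
        pending = self._jobs.get(box_id)
        return pending[0] if pending else None

    def complete(self, box_id, job_id):
        """Box side: report delivery so the website can drop the job."""
        pending = self._jobs.get(box_id)
        if pending and pending[0][0] == job_id:
            pending.popleft()
            return True
        return False


queue = JobQueue()
first = queue.add_job("box-1", "red hat")
queue.add_job("box-1", "blue hat")
```

A job is removed only after the box confirms delivery, so a lost email or a failed delivery leaves it in the queue for the daily check-in to pick up.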

Regards,
Virtouse


FlipperPA Peregrine added a comment - 06/Nov/07 05:43 AM
The long term solution is here - incoming HTTP across simulators (see various discussions):

http://wiki.secondlife.com/wiki/LlHTTPServer
https://wiki.secondlife.com/wiki/User:Zero_Linden/Office_Hours/Discussion#Incoming_HTTP
http://jira.secondlife.com/browse/SVC-913

The one thing I'd suggest differently, for security concerns, is to use a channel like XML-RPC rather than the object's UUID for the URL. llOpenHTTPServer() could get the object ready to receive, with an event http_server with types HTTP_CHANNEL and HTTP_INCOMING_DATA (or something to that effect) to identify whether this is the channel UUID or an incoming data transmission.

Regards,

-Flip


Torley Linden added a comment - 06/Nov/07 08:40 AM
Thanks for emailing Jeska about this, Flip – I was looped in on the discussion, and I'm going to ping our devs to find out more.

Torley Linden added a comment - 06/Nov/07 09:29 AM
I heard back from Kelly Linden that he investigated XML-RPC issues yesterday, and that he'll work with our Operations Team on this... stay tuned for more news...

Melnik Balogh added a comment - 07/Nov/07 01:55 PM
I have a build tool (Henshin 3) using xml-rpc and it's working 100% again..!!! .....YESSSSSSSSSS.... Thank you Lindens..! Only connecting to the xml-rpc source is still a problem, but once the connection is established, it's working super fast..!!! This will make a lot of residents happy...

Melnik Balogh


Virtouse Lilienthal added a comment - 07/Nov/07 02:16 PM
Well that's good news. I hope it will stay that way.

Since we weren't able to rely on it in the past, I am still reserved about that topic. But I hope we can rely on it.

Regards,
Virtouse


AnnMarie Otoole added a comment - 07/Nov/07 02:39 PM
I sincerely hope you are correct but I suspect your enthusiasm is premature, Melnik. I just ran two tests and the delay was so long on both that the server timed out before getting a response from SL.

Melnik Balogh added a comment - 07/Nov/07 09:41 PM
I ran two different tests, and both were successful... so that's a good thing, I think.. I will run some more tests to make sure I wasn't just lucky....

Melnik


Melnik Balogh added a comment - 08/Nov/07 06:09 AM
Ran a couple of tests again today, with perfect..!!! results.... Sending / receiving xml-rpc data was superfast..!! Only MAKING the connection to the xml-rpc source is still a small problem; I hope Linden can take a look at that....

melnik


Roger Tripp added a comment - 08/Nov/07 06:33 AM
I just ran 10 tests. Here are the results:

1:15
0:20
0:25
0:25
1:20
> 5:00 (failed or timed out)
0:15
0:25
> 3:00 (failed or timed out)
1:20

I can deal with under 30 seconds, but it very often takes over a minute or just never works at all.


Roger Tripp added a comment - 08/Nov/07 06:36 AM
Transmitting from SL to my web server works perfectly, but not vice versa.
All 10 tests were attempts to send XML RPC from my web server into SL.

FlipperPA Peregrine added a comment - 08/Nov/07 06:43 AM
I'm still having problems as well. Here's a test from 5-10 minutes ago in the simulator Hawthorne:

Testing Region: Hawthorne

Test 1: 47 seconds
Test 2: 20 seconds
Test 3: 21 seconds
Test 4: FAIL (after 67 seconds; "failure to open socket to xmlrpc.secondlife.com:80" message reported in web browser)
Test 5: 19 seconds


Virtouse Lilienthal added a comment - 08/Nov/07 06:48 AM
Well, the status of this bug hasn't changed, nor has any comment been made that the bug is fixed. So I guess it's still the same.
I would hold off on further testing until a Linden says that it was fixed or the bug status has changed.

Regards,
Virtouse


hewitt huet added a comment - 08/Nov/07 03:27 PM
Apotheus, I tried to get you on IM and PM about this. I guess your posts here and on SLX have answered my question however XD

Anyway, my whole project, which with my luck started just as XMLRPC decided to s**t the bed, is hosed. I can't get more than one in 20 calls to my objects to not get timed out or just drift off towards the Planet Bork. I've asked around and people are getting wildly varying results, as reported here in other posts. I had some scripted vendors that had to be taken offline, no biggie because I only had two which were sort of 'beta' anyway so that was not major - but when you are trying to test a program you are writing to access objects in world (C++ in this case) when you are off-world, it's a maddening experience because you don't know if your code is crap if you can't test it properly.

Please fix this or tell us it's not fixable so I can stop coding something that won't work anyway.


Melnik Balogh added a comment - 09/Nov/07 01:13 PM
BLUSH BLUSH everything screwed up again, I guess my enthusiasm was premature.
To Linden(s): please restore the xml-rpc configuration as it was BEFORE the last rolling restart.... (two days ago, 11-07-07)

maybe this will help.... i hope....

Mel..


Drew Dwi added a comment - 11/Nov/07 06:53 AM
I am also having this problem; it is disrupting a ban system in place. If this is not high on the todo list, please say so, so I can start coding an HTTP-only solution.

hewitt huet added a comment - 11/Nov/07 07:28 AM
Umm, this issue was reopened and bumped to critical.... any word at all? I have given up for now and will just use email for my communications in/off world. It's reliable enough, and the delay SEEMS to be about on the order of xmlrpc before it drifted off... but I'll leave the xmlrpc framework in my project 'just in case' lol. Seriously, any updates on progress?

AnnMarie Otoole added a comment - 11/Nov/07 11:55 AM
There is related discussion here, hewitt http://jira.secondlife.com/browse/SVC-913
There seems to be some consensus and discussions with Linden Lab on how to solve this problem but it also includes a note that no one has been assigned to it and no ETA for a solution.

It appears that the possible solution is to abandon the XML-RPC approach and provide a method where any prim can have a random URL assigned to it so the HTTP process could be made bi-directional.


Apotheus Silverman added a comment - 11/Nov/07 12:06 PM
A reliable stop-gap solution could be to put the xmlrpc CGI on each sim and then have the "central" xmlrpc server simply look up which server the request should go to and act as a proxy for it. This is very likely how incoming HTTP will have to be set up anyway, so this work would easily port over to that project.
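
To make the idea concrete, the lookup step on the "central" server might be sketched as below (Python). Everything here is hypothetical, including the class, the host names and the per-sim URL layout; nothing in this ticket describes how Linden Lab's servers are actually organised.

```python
class ChannelRouter:
    """Maps an XML-RPC channel to the simulator host currently serving it."""

    def __init__(self):
        self._host_by_channel = {}

    def register(self, channel, sim_host):
        # Called when a script opens a channel or its object moves regions.
        self._host_by_channel[channel] = sim_host

    def unregister(self, channel):
        self._host_by_channel.pop(channel, None)

    def route(self, channel):
        """Return the per-sim URL to forward to, or None for an unknown channel."""
        host = self._host_by_channel.get(channel)
        if host is None:
            return None
        # Hypothetical URL layout for a per-sim copy of the CGI.
        return "http://%s/cgi-bin/xmlrpc.cgi" % host


router = ChannelRouter()
router.register("channel-a", "sim1234.example.net")
```

The central server then does only a dictionary lookup and a forward per request, and the work of delivering to the script moves to the sim that hosts it.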

Darius Lehane added a comment - 11/Nov/07 08:01 PM
I am receiving up to a dozen or more complaints per day that scripted XML-RPC products are either failing to connect or lagging horribly. This issue is most definitely not better. The same products a year ago were buttery smooth with no issues. Soon I may have to pull these products out of SL unless these are fixed.

Lex Neva added a comment - 12/Nov/07 10:05 AM
I have a system in Suffugium that uses XML-RPC to update a ticker that displays what's currently playing on our audio stream. Suffice it to say that it's usually off by a few minutes... that is, when the XML-RPC requests don't fail outright. I guess I'm going to have to switch to llHTTPRequest polling. Ridiculous.

FlipperPA Peregrine added a comment - 15/Nov/07 01:01 PM
This is failing horribly again. Can we PLEASE get this up and running until we get a decent scalable solution (llOpenHTTPRequest / http_server event?) in place?! Polling every second isn't an option for those of us that have tens of thousands of items out there that get sent an XML-RPC request once a day or so, and it is a horrible excuse for a solution. Please, please, please get this working until we can send incoming http requests directly to simulators!

Torley Linden added a comment - 15/Nov/07 02:22 PM
I'm sorry this is such a frustration, and I'm asking our developers and operations right now for an update on latest progress.

kelly linden added a comment - 15/Nov/07 02:41 PM
We are working on getting a second server into rotation, but this won't go live until next week at the earliest.

To be clear, these issues appear to be entirely a case of the server just being overloaded; I can't find any other bug, and in particular no newly introduced ones. We attempted increasing the maximum connections (since it always sits at just under the max) and that just resulted in the rest of the server trying to melt, so we reset it back to the original limit. The second server should alleviate some pain, but I have no doubt that it may not be long until we reach capacity on that server as well.


FlipperPA Peregrine added a comment - 15/Nov/07 02:58 PM
Thank you, Torley and Kelly, for an update, and thanks for the news on a second server! I think we all realize XML-RPC isn't the long term goal here - too centralized - but anything we can do to keep it on life support until we have a more scalable, simulator level solution is greatly appreciated! Thanks again for all your hard work. I know that I, for one, really appreciate it.

AnnMarie Otoole added a comment - 15/Nov/07 08:12 PM
I'm relieved to know there is some action here. I did get my project running using HTTP requests and it is working fine. I am somewhat handicapped by the 5 second minimum average time between queries but I can live with the degraded performance. My objects spend less than 0.5% of the time on line but there can be a burst of activity lasting a few minutes.

It does, however, go against the grain to have every object poll once every 6 seconds, 24/7, just to see if there is data to be retrieved. So far my host has not complained about the rate of MySQL queries, but traffic will build with the project and I suspect I will run into trouble.

I'm no expert on the inner workings of these but would it be possible to have a function that establishes a sub domain URL for any prim? The root DNS would be a second life port that has a look up by UUID to send an HTTP request to the prim. I don't know if this makes sense but it would solve the problem without the XML-RPC headaches.


Impalah Shenzhou added a comment - 16/Nov/07 11:59 AM
Hi, I'm the creator of Henshin III. Our "online" version, which uses xml-rpc calls intensively, has not been working for more than a month; too long for a bug, I think.

This kind of thing is making me, and many others, lose clients and money, and it is simply not acceptable.

The solution is not to go around or jump over the wall (the "http" solutions posted before). It is as simple as this: if Linden Lab intends to maintain its business, it needs residents doing business on the grid, and having us wait for a miracle is not the way (Torley, I know you're trying to help, but maybe someone needs a slap on the face to wake up, and not only words).

In the last few days I have done a couple of tests, not with my product but with a simple web page sending only 4 or 5 chars, and also locally, from my machine.

What I have detected is:

  • From web pages hosted in the USA the calls seem to work: slow to connect, but working. The time to answer varies from 20 seconds to 1 minute.
  • From web pages hosted in Europe (Spain, where I live) the time increases to between 45 seconds and 2 minutes.
  • Locally at home, from a test program, on a 3MB DSL connection... impossible, timeout always (I set timeouts from 5 to 10 minutes).

Greetingzzz


Darius Lehane added a comment - 16/Nov/07 08:24 PM
Similar to comments above:

My business is being decimated with lots of customers screaming at me. I may suspend sales soon, with a corresponding reduction in tier. What seemed an intermittent problem is now a flood.

Also note that polling http is not merely a brutal abuse of bandwidth furthering lag, it is not even possible for certain types of communication.

Please please please solve this problem.


hewitt huet added a comment - 17/Nov/07 06:35 AM
Actually, I am shocked at the lack of action on this to date. It always works out that when BIG clients (195 usd tier+) complain, things get done quickly, and if Apotheus Silverman isn't 'big' enough in running the ENTIRE SL Exchange, then I guess the only person who gets action is "you know who", and she'd even be miffed.

Fanciful talk of HTTP to a prim etc etc is lovely, but to take a system that's worked well and is currently BBB (borked beyond belief) and tell us 'the server's overloaded, we'll add another, and it will be just as bad in a few weeks' is pathetic!!! I didn't think for a second that xmlrpc was bugged; it was just timing out repeatedly or sooo slloooww as to make it unusable. This indicates a huge overload or bottleneck.

Has anyone looked to see where all these calls are originating that are clogging the server all of a sudden? Is someone flooding them, to grief? How is it that in the space of two weeks the server gets THAT overloaded? If you're not sure adding another server will help, that means there will be a bigger flood of calls to the script. Can't you add two or three? Or even shudder four?!?!?

I'd be happy to build you a few servers if you need them. No really, nice ones too. Of course if there's ever a problem with them, you'll just have to post on my blog, or put in a "ticket" on my site, and wait for a reply. After all we have more important things to do at my company, like new wallpaper, installing the cappuccino machine, and making sure the sky looks nice outside the window, but we'll get to it. Eventually. Maybe next week, at the earliest. If the plumbers don't come, that is.

Kelly, Torley, please... PLEASE.... do something!!!!!!


Kamilion Schnook added a comment - 17/Nov/07 07:46 PM
The way I solved this problem for myself:

Previously I was using llHTTPRequest polling every 5 seconds to a queue on my webserver. If there was anything in the queue, it would get executed.

This ate a lot of unnecessary bandwidth on my webserver.

My new solution:

Inworld-Server-Object listens for email:

// gServerKey (a global) and checkHTTPQueue() are not shown in this excerpt.
state_entry() {
    gServerKey = llGetKey();
    llSetObjectDesc((string)gServerKey);
    llSetText(llGetObjectName() + "\n Key: " + (string)gServerKey, <0,1,0>, 1.0);
    // Not a good idea to drop below 2.5 seconds; it's laggy enough as it is,
    // and I'll already process the entire queue each event.
    llSetTimerEvent(2.5);
}
timer() {
    llGetNextEmail("", ""); // check for email with any subject/sender
}
email(string time, string address, string subj, string message, integer num_left) {
    checkHTTPQueue();
    if (num_left > 0) { llGetNextEmail("", ""); } // continue processing the queue if any remain
}
link_message(integer from, integer num, string str, key id) {
    if (str == "TIMER.HTTPQUEUE") { checkHTTPQueue(); }
}

Plus a second script:
default {
    state_entry() { llSetTimerEvent(1800); }
    timer() { llMessageLinked(LINK_SET, 0, "TIMER.HTTPQUEUE", NULL_KEY); }
}

When it receives an email (Any email), it will go and retrieve & execute the queue over HTTP.
The second script forces a poll every 30 minutes just as a backup.
Now the load is only on the timer/getnextemail on the simulator.

When I have larger server farms, I use the second script, replacing llMessageLinked with llRegionSay/llShout depending on the farm's size, and adding a listen event to the servers so only a single 30m timer script is needed for the whole farm.

Currently, I'm using internal-only object-to-object emails which never actually hit postfix, and are normally "instant".

External email servers sending mail in-world can be subject to delays when LL's postfix server gets its queues filled up, but whenever that happens, I just poke LL on the sldev mailing list to empty the queues.
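
For the external case, the wake-up mail itself can be tiny, since the script above reacts to any email. A minimal Python sketch of building such a message follows. The `<object key>@lsl.secondlife.com` address format is an assumption not stated in this comment, and the sender address, subject and body are placeholders.

```python
from email.message import EmailMessage


def build_wakeup_email(object_key, sender):
    """Build a small wake-up message addressed to an in-world object."""
    msg = EmailMessage()
    msg["From"] = sender
    # Assumed address format for in-world objects: <object key>@lsl.secondlife.com
    msg["To"] = "%s@lsl.secondlife.com" % object_key
    msg["Subject"] = "wake"
    msg.set_content("wake\n")  # content is irrelevant; any email triggers the queue check
    return msg


msg = build_wakeup_email("00000000-0000-0000-0000-000000000000", "queue@example.com")
```

The message could then be handed to a local mail server, for example with smtplib.SMTP("localhost").send_message(msg). The actual work still travels over HTTP when the object fetches its queue.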

Harold Brown had this to say the last time it came up on sldev: "For a nice history of the reliability of external -> internal E-Mail from May 2 -> current date, you can view my test log at this URL: http://www.rpgstats.com/SL/maillog.txt "

As of this comment posting (Nov-17-07 - 7:45PM PST), the average delay of external -> internal emails is roughly 4-8 seconds.

I should also note, Sean Linden proposed adding HTTP server support to LSL – PLEASE Vote for it here: https://jira.secondlife.com/browse/SVC-913 "Allow LSL scripts to act as HTTP servers in order to replace XMLRPC with something scalable"


Alexa Linden added a comment - 19/Nov/07 03:20 PM
Last Triaged 11/19/2007

alina aida added a comment - 20/Nov/07 04:53 PM
I have been trying for three days to post messages on my message boards. After four or five error messages I get a message that the information has finally gone through successfully, but yet it never shows up in world on the board itself. This is not a matter of minutes or hours now as it used to be. This has gone on for three days now. Please do something to fix this.

hewitt huet added a comment - 20/Nov/07 06:42 PM
Triage Report
From : Dr. Huet
To: Alexa Linden
11/19/07

The operation was successful, but the patient is dead.


kelly linden added a comment - 20/Nov/07 07:36 PM
A progress update:
We have the server configured and ready to go, we just need to test it and make the necessary firewall changes. However, given the recent grid issues from this last week and the imminent US holiday there is a good chance it won't get into rotation until next Monday. I really am sorry, I will push for an exception to the 'no changes on the last day of the week' rule tomorrow, but I can't guarantee anything. In fact the recent issues are likely to make our system engineers even more reluctant to change anything.

FlipperPA Peregrine added a comment - 21/Nov/07 06:06 AM
Thank you for the update and your continued efforts, Kelly, this sounds like a promising solution for the time being!

hewitt huet added a comment - 22/Nov/07 02:53 PM
Kelly, ditto that and I apologize for the sarcasm... but I had to suspend all my activities, ie building and scripting, to attend to my vendors that I thought were not working perhaps for some other reason. I appreciate your continued efforts, and also thank you for the update. =D

Haravikk Mistral added a comment - 23/Nov/07 03:52 AM
I'm glad I haven't had time to switch to my new networked vendor, as almost 100% of my requests fail.

Polling with llHTTPRequest() is an impossible solution, as I'm using PHP which is incapable of proper asynchronous behaviour. Trying to offload asynchronous behaviour to vendors is a nightmare solution.

We NEED that new machine soon, and need to know an alternative is in development that will prevent this just happening again when the new machine hits capacity as well.


Virtouse Lilienthal added a comment - 23/Nov/07 04:09 AM
Well as a workaround you guys can do this:

Create multiple communication objects in multiple sims which are polling a notification queue on the webserver. (Multiple Objects in multiple regions to increase that the objects are polling the webserver at all.

If the notification queue has an entry, the communication object sends an email to the object which needs to be notified, including a sequence id so that the entry can be removed from the webserver's queue.
When the object receives an email, it checks whether the notification id is the same as the last one; if not, it contacts the server and asks for a job, passing its sequence id (if it was triggered by an email) so the server knows which job can be removed from the notification queue.

NOTE: The jobs are not passed via email. The jobs are fetched from the webserver directly, to guard against someone sending you faked emails.

However, this architecture should work pretty stably for the moment and is at least more reliable than XMLRPC.
Plus, the server is not bombarded with requests from thousands of objects.

Just a few communication objects poll the webserver, say every 20 seconds, for the next 10 notification entries.
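
A rough model of the bookkeeping this workaround needs, written in Python as a sketch only. The two classes and their method names are hypothetical; in practice the first would live on the webserver and the second would be LSL inside the notified object.

```python
class NotificationQueue:
    """Webserver side: pending notifications, each with a sequence id."""

    def __init__(self):
        self._pending = {}  # sequence id -> key of the object to notify
        self._next_seq = 1

    def notify(self, target_key):
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = target_key
        return seq

    def next_entries(self, limit=10):
        """Polled by the few communication objects; oldest entries first."""
        return sorted(self._pending.items())[:limit]

    def acknowledge(self, seq):
        """Called when the target fetches its job and passes the id back."""
        return self._pending.pop(seq, None) is not None


class TargetObject:
    """Notified object: ignores a repeated email for the same sequence id."""

    def __init__(self):
        self.last_seq = None

    def on_email(self, seq):
        if seq == self.last_seq:
            return False  # duplicate from a redundant communication object
        self.last_seq = seq
        return True  # caller should now fetch the job over HTTP
```

Because several communication objects may relay the same entry, the duplicate check on the receiving side is what keeps redundancy from turning into repeated deliveries.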

Regards,
Virtouse


Virtouse Lilienthal added a comment - 23/Nov/07 04:11 AM
Correction:

"(Multiple Objects in multiple regions to increase that the objects are polling the webserver at all."

=>

"(Multiple Objects in multiple regions to increase the redundancy so that the objects are polling the webserver at all.)"


Darius Lehane added a comment - 23/Nov/07 08:59 AM
There is NO WORKAROUND for the types of objects I build. I am a professional programmer, I can think of a hundred ways to make stuff "sort of work", but many of the very popular objects I build rely on the "RPC", i.e., "realtime" of the XML-RPC.

Let's not distract from what is important: fixing the bug. I hope one server is enough, add two or three or more if needed, please.

CONCRETE USEFUL SUGGESTION: Although there should be no time limiter (keep it RPC), perhaps LL can investigate if parties are abusing the system with spamming objects, bots etc. If one object is sending hundreds of these per hour, maybe cut that object off!
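
A check of that kind could be as simple as a per-object sliding window, sketched here in Python. The class name and the threshold are hypothetical, and nothing here reflects how Linden Lab's server actually works.

```python
from collections import defaultdict, deque


class HourlyLimiter:
    """Sliding one-hour window of request timestamps per object."""

    def __init__(self, max_per_hour):
        self.max_per_hour = max_per_hour
        self._stamps = defaultdict(deque)

    def allow(self, object_key, now):
        """Return True if the request may proceed; 'now' is in seconds."""
        stamps = self._stamps[object_key]
        while stamps and now - stamps[0] >= 3600:
            stamps.popleft()  # forget requests older than one hour
        if len(stamps) >= self.max_per_hour:
            return False  # this object is over its hourly budget
        stamps.append(now)
        return True


limiter = HourlyLimiter(max_per_hour=2)
```

Requests under the threshold are not delayed at all, so well-behaved objects keep the "realtime" behaviour while a flooding object gets cut off.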


Haravikk Mistral added a comment - 23/Nov/07 10:51 AM
Virtouse, there are several problems with that workaround:
  • Polling! If you have multiple objects per region polling your servers then your poor server has to constantly accept and close new connections, not to mention process them, probably with some database look-up. This is incredibly costly and would increase the work the server has to do by many orders of magnitude. Instead of simply receiving maybe 100 sales requests a day, and routing these through to delivery boxes, it would have to do the 100 requests a day, plus process THOUSANDS of polling requests just looking for something to do.
  • Asynchronous. In my case I'm using PHP as a back-end, what if my script requires some data in the XML-RPC reply? PHP doesn't do proper asynchronous behaviour, meaning I would have to offload all of that to the vendors, meaning THEY would have to poll for data as well, whenever they're waiting for a reply!

A working XML-RPC, or an incoming HTTP service, is NECESSARY for making these applications behave correctly, and with a reasonable degree of performance.
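
The polling overhead can be put into rough numbers. The sketch below assumes the 20-second interval mentioned in the workaround and, purely for illustration, five polling objects; the figure of 100 sales requests a day is the one used in the comment above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polls_per_day(num_objects, interval_seconds):
    """Polling requests the web server receives per day."""
    return num_objects * SECONDS_PER_DAY // interval_seconds


sales_requests = 100                  # real work per day, from the comment above
one_object = polls_per_day(1, 20)     # 4320 polls from a single object
five_objects = polls_per_day(5, 20)   # 21600 polls (object count is an assumption)

# Polling requests per real sales request.
overhead_ratio = five_objects / sales_requests
```

Even a single object polling every 20 seconds produces thousands of requests a day against roughly 100 real ones.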


Keiko Rau added a comment - 29/Nov/07 06:28 PM
Wondering if that new server has been implemented yet.

I began doing some one-off data cleaning which would run over a couple of months, but have turned that off temporarily after reading of the problems that real businesses are already experiencing with XML-RPC, so as not to impact them (I hate it when my SLX purchases aren't delivered). It will need to be turned back on eventually as my application gets closer to completion. The application itself will also rely on this, with objects scattered all over the world, once it's completed.


kelly linden added a comment - 30/Nov/07 10:01 AM
Sorry for the delay. We attempted to put the new server into rotation this week but ran into a snag in the configuration. We need to verify we have the kinks worked out and will try putting it into rotation again next week.

Apotheus Silverman added a comment - 04/Dec/07 07:35 AM
I noticed XML-RPC started working reliably again as of last night. SL Exchange staff, customers, and I all thank you!

FlipperPA Peregrine added a comment - 04/Dec/07 09:10 AM
I've noticed that it has been working pretty well today as well. Thanks for your hard work in resolving this issue! We really appreciate it.

AnnMarie Otoole added a comment - 04/Dec/07 09:22 AM
Greatly appreciated but I have a question.

How permanent is this solution? Is it just an increase in capacity that will gradually fill again and degrade performance in the future? And if so, do you plan to add capacity as necessary to maintain performance?

I had to reprogram my application to HTTP polling and it is working reasonably well. Switching back to XML-RPC is a big task so I would like some indication of reliability before commencing. Application performance and bandwidth load on the web site would be considerably improved but I only want to do it once.


kelly linden added a comment - 04/Dec/07 09:24 AM
We successfully got the new server in place last night. There are still some spikes on the graphs I'm looking at (Thanks Thraxis), but no actual failures (on this graph), and the usual response time outside the spikes is in the 0 - 2 second range.

I will be looking into the spikes, and we have some plans to further reduce load in the hopes that we can extend the time until this becomes a problem again. We will also be working on monitoring tools for this service.

I am closing this as fixed.

If you are still having problems let me know - kelly@lindenlab.com - before re-opening this. I am going to guess that any problems still happening are a different issue and not this same umbrella issue. We can discuss whether to make a new jira etc.

If you are reading this down the line and think you should re-open it because things have gone to hell again please don't! Instead please create a NEW issue. I ask this because it will be much easier to track and diagnose a new issue (or even a recurrence of this one) if it isn't tied to the large history of votes, watchers and comments that this issue now has.


FlipperPA Peregrine added a comment - 04/Dec/07 09:30 AM
I've switched many of my applications back to receiving email on a 5 second timer, as much as I dislike the solution. Kelly, would you recommend using email or xml-rpc until we have the ability to send requests into the sim level? I know XML-RPC is centralized, but I'm honestly not sure how incoming email is processed.

kelly linden added a comment - 04/Dec/07 09:32 AM
AnnMarie:
We added another server, and we are going to make some other changes soon to increase capacity even more. However, this is still a central service bottleneck, even if it is a load balanced bottleneck now. From a reliability standpoint llHTTPRequest will always be better on the SL side. It is distributed with no central bottlenecks at all. If you have already switched to llHTTPRequest and it is working for you I recommend sticking with that, unless you absolutely must switch back. XMLRPC will hit the cap again, llHTTPRequest will not.

The primary goal here is to give us enough head room to implement, test, deploy and iron out kinks of a replacement lsl feature for xmlrpc, plus a reasonable phase-out period, at which point we would really love to officially phase out xmlrpc. The current theory for such a replacement is a super lightweight http server inside lsl (SVC-913). I have a more thorough design for this feature that I hope to share this week, the general idea of which is in my comments in that jira.


kelly linden added a comment - 04/Dec/07 09:58 AM
My order of preference is:
llHTTPRequest
llEmail
xml-rpc

Email has latency issues sometimes, and can be difficult to program a transport layer onto. Well, at the least it isn't a usual choice.

If you are communicating between objects llEmail is THE way to go. It isn't really email on the back end, and is much more reliable than any other option.

One fear I have is that residents will decide xml-rpc is 'fixed' and slam it, and break it again. Even with the round robin in place now I do not think we stand a chance of keeping up with the potential load of people believing it is a reliable stable system to use. If at all possible I still recommend llHTTPRequest at the top (the only option with no central server on our end!!) then llEmail second (more lightweight, scaling and routing email is a more known and solved problem) and lastly xmlrpc (a very hacky system with a round robin that I honestly don't know how high it can scale, does incredibly bad and hacky things including DB writes on the central database with every connection etc).
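
For readers unfamiliar with the term, round robin here means incoming connections are handed to the servers in turn. A minimal sketch, assuming two servers with made-up names (xmlrpc-a and xmlrpc-b are hypothetical, as is the RoundRobin class):

```python
from itertools import cycle


class RoundRobin:
    """Hypothetical balancer that hands each new connection to the next server in turn."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def pick(self):
        return next(self._servers)


balancer = RoundRobin(["xmlrpc-a", "xmlrpc-b"])

# Five incoming connections alternate between the two servers.
assignments = [balancer.pick() for _ in range(5)]
```

This spreads the load across the servers, but every request still passes through the same central pool, so total capacity remains bounded by the number of servers in the rotation.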


AnnMarie Otoole added a comment - 04/Dec/07 10:22 AM
As a noob in web communications, it seems to me that the HTTP could be set up like the website hosts do using a "sub-domain" so the SL DNS routing is done inside SL. A prim could then have a function call to enroll it as an SL URL using the SLUUID as an identifier and to install the HTTP processing functions.

Does this make any sense and is this what is intended for the HTTP solution?


kelly linden added a comment - 04/Dec/07 10:36 AM
Ideas and comments about the http_server and xmlrpc replacement should happen in SVC-913, not here. In short, that isn't my plan. I will be updating that jira to align with my design later this week hopefully, but my comments there are pretty good I think.

Melnik Balogh added a comment - 06/Dec/07 03:07 AM
Hello,

I have tried xml-rpc again today, and everything looked great and fast (at my end anyway), so if it stays this way I'm a happy camper.

thanks


hewitt huet added a comment - 14/Dec/07 05:42 AM
Yep, same here. All seems well for the time being. Kelly, I am still curious - did anybody look into the possibility that the server was being hammered by griefers?

I hope it stays this way. Thanks for your efforts!


Alexa Linden added a comment - 18/Nov/08 10:27 AM
reopening to update internal link