Problems and Technical Issues with Rosetta@home

PFLIEGER Guy
Joined: 20 Dec 15
Posts: 3
Credit: 1,230,645
RAC: 0
Message 81421 - Posted: 13 Apr 2017, 5:21:57 UTC

Today there is a problem with the server: most functions are not running.
Guy PFLIEGER MASEVAUX ALSACE France
phone: 0033973514697
ID: 81421 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81424 - Posted: 13 Apr 2017, 19:34:18 UTC

Looks like all servers are now active. Are you still seeing upload problems?
Rosetta Moderator: Mod.Sense
ID: 81424 · Rating: 0
ncoded.com
Joined: 16 Aug 16
Posts: 4
Credit: 40,012,804
RAC: 99,563
Message 81429 - Posted: 14 Apr 2017, 9:42:22 UTC

Hi,

We are still getting upload problems. Status: Project back off.

Thanks.
ID: 81429 · Rating: 0
Brian Priebe
Joined: 27 Nov 09
Posts: 16
Credit: 33,020,247
RAC: 0
Message 81432 - Posted: 14 Apr 2017, 14:58:20 UTC

I have two machines here with the same problem. 10 WU's stuck on uploading. They transmit between 48KB and 55KB then stop dead before going into the retry loop.
ID: 81432 · Rating: 0
Omega
Joined: 25 Jan 11
Posts: 3
Credit: 1,012,997
RAC: 0
Message 81433 - Posted: 14 Apr 2017, 18:13:32 UTC

Upload problems remain.
ID: 81433 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81434 - Posted: 14 Apr 2017, 18:30:07 UTC
Last modified: 14 Apr 2017, 18:30:18 UTC

I'll take a look. Sorry for being late on this.
ID: 81434 · Rating: 0
amgthis
Joined: 25 Mar 06
Posts: 81
Credit: 203,879,282
RAC: 0
Message 81435 - Posted: 15 Apr 2017, 0:43:05 UTC - in response to Message 81434.  

I'll take a look. Sorry for being late on this.



At 17:40 Pacific time I still have 16 queued up waiting in line...

Happy Easter everyone!

Cheers,

/M
ID: 81435 · Rating: 0
Omega
Joined: 25 Jan 11
Posts: 3
Credit: 1,012,997
RAC: 0
Message 81436 - Posted: 15 Apr 2017, 7:27:46 UTC

More WU's getting stuck on uploading.
ID: 81436 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81437 - Posted: 15 Apr 2017, 7:34:35 UTC

We've been trying to troubleshoot this issue but still do not know what is causing it. Our sys admin said that UW-IT has been contacted to determine if it's a UW network issue. Sorry for any inconvenience.
ID: 81437 · Rating: 0
ncoded.com
Joined: 16 Aug 16
Posts: 4
Credit: 40,012,804
RAC: 99,563
Message 81442 - Posted: 15 Apr 2017, 12:00:25 UTC

Just to update: some WUs are actually uploading, but some are not.

I am not sure if this helps with troubleshooting, but I thought it might be worth mentioning.

Happy Easter, and Happy Holidays to everyone else.
ID: 81442 · Rating: 0
Keith E. Laidig
Volunteer moderator
Project developer
Joined: 1 Jul 05
Posts: 154
Credit: 117,189,961
RAC: 0
Message 81445 - Posted: 15 Apr 2017, 12:32:19 UTC
Last modified: 15 Apr 2017, 12:49:53 UTC

Hey folks - Happy Easter to all.

Yesterday, in response to contributors' concerns about our aged SSL cipher handling on the R@H services, we made changes to the webserver configurations. For reasons I don't yet understand, these changes resulted in upload timeouts and the slow collapse of the project backend.

Regrettably, I had 'gone offline' yesterday evening to celebrate the season with family and was unaware of the problems until this AM [ADT]. We've reverted to previous configurations and restarted the project.

We apologize for the 'outage' and will keep monitoring the situation closely until convinced things are working.

FYI - we are in the final stages of building out a new, shiny, high-powered BOINC system for R@H. More info coming.... -KEL

[Update] It looks like there are still problems....

ID: 81445 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,402
Message 81448 - Posted: 15 Apr 2017, 15:45:29 UTC

Could you modify your Server Status web page to show which of the server programs handle uploads and downloads?
ID: 81448 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81450 - Posted: 15 Apr 2017, 17:47:28 UTC - in response to Message 81448.  

Could you modify your Server Status web page to show which of the server programs handle uploads and downloads?


I just added some text:

Web servers: boinc, srv1, srv2, srv3, srv4, srv5 (upload and download servers)


boinc is load balanced among the srv web servers. The srv servers handle uploads and downloads.
ID: 81450 · Rating: 0
TPCBF
Joined: 29 Nov 10
Posts: 111
Credit: 5,084,721
RAC: 1,523
Message 81467 - Posted: 17 Apr 2017, 5:27:08 UTC
Last modified: 17 Apr 2017, 5:32:22 UTC

So far I have seen one WU that is stuck, though I can't remotely check all the hosts that are running R@H.
On the one host where I have noticed this since Friday, other WUs are uploading fine.
The one that gets stuck is trying to upload, but in the forced upload retries I watched, it craps out at varying amounts of data, between 3KB and 32KB, out of 739.36KB.

The WU in question is https://boinc.bakerlab.org/rosetta/workunit.php?wuid=820755819.

I now see that there are actually a few more WUs sent on the same date that should all have been returned by now, on other hosts as well...
EDIT: Actually, I just checked and there are exactly 3 more WUs from the same date (one is from early morning the next day, 3/12) on one other host, a laptop that I can't check remotely. However, that same laptop has successfully received and returned WUs sent after 3/11, 3/12...

As other WUs are uploading just fine, even on the same machine, I cannot think of a reason why a networking issue at UW should be causing this... :?

Ralf
ID: 81467 · Rating: 0
amgthis
Joined: 25 Mar 06
Posts: 81
Credit: 203,879,282
RAC: 0
Message 81476 - Posted: 17 Apr 2017, 18:48:49 UTC

I've got 12 queued that are now past the due date.
And another 14 waiting that haven't gone past the due date yet, but
more are dropping off all the time.

These are spread out on 15 different clients.
ID: 81476 · Rating: 0
BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 81478 - Posted: 17 Apr 2017, 21:22:23 UTC - in response to Message 81476.  

Pretty similar here. Realizing that the issue is back at the Rosetta site, since it is not happening with other projects on the same systems, I've elected to suspend processing on Rosetta units, which pushes my processing over to WorldGrid.

I periodically try to push the uploads but so far (since last week) no joy here.

Ideally the folks on the project side will figure out the problem they have at their end and resolve it.



I've got 12 queued that are now past the due date.
And another 14 waiting that haven't gone past the due date yet, but
more are dropping off all the time.

These are spread out on 15 different clients.


ID: 81478 · Rating: 0
Brian Priebe
Joined: 27 Nov 09
Posts: 16
Credit: 33,020,247
RAC: 0
Message 81480 - Posted: 17 Apr 2017, 22:10:31 UTC

Still have WU's stuck uploading. However, a different error sometimes shows up now:

17-Apr-2017 18:06:53 | rosetta@home | [error] Error reported by file upload server: [des_DS_160_fragments_fold_SAVE_ALL_OUT_470271_1422_0_0] locked by file_upload_handler PID=4156
17-Apr-2017 18:06:53 | rosetta@home | [error] Error reported by file upload server: [tj_3_6_junc_X_DHR55_DHR55_l3_t2_t3_4_v5c_fragments_abinitio_SAVE_ALL_OUT_474351_416_0_0] locked by file_upload_handler PID=23998
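For anyone collecting these across several hosts, here is a minimal Python sketch that pulls the file name and PID out of "locked by file_upload_handler" lines. The line layout is inferred only from the two lines quoted above, and the names `LOCKED_RE`, `LockedUpload` and `parse_locked_upload` are made up for this example; they are not part of BOINC. Since the client prints these as "Error reported by file upload server", the PID presumably belongs to a process on the server side, not on the host.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the two quoted log lines only:
#   <timestamp> | <project> | [error] Error reported by file upload server:
#   [<file>] locked by file_upload_handler PID=<pid>
LOCKED_RE = re.compile(
    r"^(?P<timestamp>[^|]+?)\s*\|\s*(?P<project>[^|]*?)\s*\|\s*"
    r"\[error\] Error reported by file upload server: "
    r"\[(?P<file>[^\]]+)\] locked by file_upload_handler PID=(?P<pid>\d+)\s*$"
)


class LockedUpload(NamedTuple):
    """One 'locked by file_upload_handler' report (hypothetical helper type)."""
    timestamp: str  # kept as text; the date format differs between clients
    project: str
    file: str
    pid: int


def parse_locked_upload(line: str) -> Optional[LockedUpload]:
    """Return the parsed report, or None if the line is some other message."""
    m = LOCKED_RE.match(line.strip())
    if m is None:
        return None
    return LockedUpload(m["timestamp"], m["project"], m["file"], int(m["pid"]))
```

Feeding each line of a saved event log through `parse_locked_upload` and keeping the non-None results gives a list of which result files were reported as locked and by which PID.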
ID: 81480 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 8,784
Message 81485 - Posted: 18 Apr 2017, 2:42:20 UTC - in response to Message 81480.  

Still have WU's stuck uploading. However, a different error sometimes shows up now:

17-Apr-2017 18:06:53 | rosetta@home | [error] Error reported by file upload server: [des_DS_160_fragments_fold_SAVE_ALL_OUT_470271_1422_0_0] locked by file_upload_handler PID=4156
17-Apr-2017 18:06:53 | rosetta@home | [error] Error reported by file upload server: [tj_3_6_junc_X_DHR55_DHR55_l3_t2_t3_4_v5c_fragments_abinitio_SAVE_ALL_OUT_474351_416_0_0] locked by file_upload_handler PID=23998

So where does file_upload_handler come from? Who is locking the file: the host machine or the server?

My error messages are
18/04/2017 03:38:57 | rosetta@home | Started upload of rb_04_14_73903_117148__t000__ab_robetta_IGNORE_THE_REST_480361_204_0_0
18/04/2017 03:39:19 | rosetta@home | Temporarily failed upload of rb_04_14_73903_117148__t000__ab_robetta_IGNORE_THE_REST_480361_204_0_0: transient HTTP error
18/04/2017 03:39:19 | rosetta@home | Backing off 05:44:21 on upload of rb_04_14_73903_117148__t000__ab_robetta_IGNORE_THE_REST_480361_204_0_0
18/04/2017 03:39:31 | | Project communication failed: attempting access to reference site
18/04/2017 03:39:33 | | Internet access OK - project servers may be temporarily down.

Again, the server...
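To see when a backed-off upload will be retried, the "Backing off" line can be turned into a retry time. This is a rough Python sketch that assumes the day/month/year timestamp layout of the log lines quoted above (other clients print dates differently) and treats the timestamp as local client time. `BACKOFF_RE` and `parse_backoff` are names invented for this example.

```python
import re
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Assumed layout, taken from the quoted log only:
#   DD/MM/YYYY HH:MM:SS | <project> | Backing off HH:MM:SS on upload of <file>
BACKOFF_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| [^|]* \| "
    r"Backing off (?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2}) on upload of (?P<file>\S+)$"
)


def parse_backoff(line: str) -> Optional[Tuple[str, datetime, timedelta]]:
    """Return (file name, time logged, backoff delay), or None for other lines."""
    match = BACKOFF_RE.match(line.strip())
    if match is None:
        return None
    logged_at = datetime.strptime(match["stamp"], "%d/%m/%Y %H:%M:%S")
    delay = timedelta(
        hours=int(match["h"]), minutes=int(match["m"]), seconds=int(match["s"])
    )
    return match["file"], logged_at, delay


if __name__ == "__main__":
    sample = (
        "18/04/2017 03:39:19 | rosetta@home | Backing off 05:44:21 on upload of "
        "rb_04_14_73903_117148__t000__ab_robetta_IGNORE_THE_REST_480361_204_0_0"
    )
    parsed = parse_backoff(sample)
    if parsed is not None:
        name, logged_at, delay = parsed
        print(name, "next automatic retry around", logged_at + delay)
```

Adding the delay to the logged time gives the earliest automatic retry; a manual retry from the client can of course happen sooner.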
ID: 81485 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,402
Message 81486 - Posted: 18 Apr 2017, 3:21:03 UTC - in response to Message 81450.  

Could you modify your Server Status web page to show which of the server programs handle uploads and downloads?


I just added some text:

Web servers: boinc, srv1, srv2, srv3, srv4, srv5 (upload and download servers)


boinc is load balanced among the srv web servers. The srv servers handle uploads and downloads.

Looks good so far, but could you also add the status of all those servers to help us tell when to expect upload and download problems?
ID: 81486 · Rating: 0
Omega
Joined: 25 Jan 11
Posts: 3
Credit: 1,012,997
RAC: 0
Message 81491 - Posted: 18 Apr 2017, 18:06:52 UTC

It seems to me that the upload problem has been solved. At least, all my stuck WU's have been uploaded now.
ID: 81491 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org