Chaos in Rosetta@Home???

Message boards : Number crunching : Chaos in Rosetta@Home???

Emigdio Lopez Laburu
Joined: 25 Feb 06
Posts: 61
Credit: 40,240,061
RAC: 0
Message 62617 - Posted: 30 Jul 2009, 9:27:19 UTC

Good morning.

After all the issues this July, my impression is that Rosetta@home is currently in real chaos. I hope that this chaos is only in the "IT" part and not in the "science" part.

Perhaps I'm wrong, but this is my personal impression.

Hopefully all the issues will be solved soon.

Regards.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 62640 - Posted: 30 Jul 2009, 17:28:29 UTC

Here is what caused our current issues. I personally wouldn't call it chaos, but rather a simple mistake combined with issues that arise in most large-scale software development projects.

A developer/scientist in the lab accidentally updated the R@h application using the wrong signature file for the database, which unfortunately is our largest input file. The update happened during the weekend and no one was around to fix the problem (I personally was on a backpacking trip with my family; otherwise I would have dealt with the problem immediately). This caused all jobs to fail and hammered our servers. Our servers are still struggling to keep up with scheduler requests and downloads/uploads.

Coincidentally, a very large code check-in was made to introduce symmetric folding to our minirosetta application, and unfortunately it contained a bug that caused a 10-fold slowdown. The R@h app was updated before we caught this bug, so we had to revert to the previous application version as a quick fix. To make sure this doesn't happen again, we are planning to implement a quick benchmark test on Ralph for every application update that will test various protocols for performance and speed.
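For readers wondering what a per-update benchmark check might look like, here is a rough sketch in Python. It is only an illustration, not the project's actual test harness: the function names, the protocol names, the timings, and the 1.5x threshold are all assumptions made up for the example.

```python
import time
from typing import Callable, Dict


def time_protocol(run: Callable[[], None], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_slowdown: float = 1.5,  # hypothetical tolerance, not a project setting
) -> Dict[str, float]:
    """Map each protocol that got too slow to its slowdown ratio.

    A protocol is flagged when candidate_time / baseline_time exceeds
    max_slowdown. Protocols missing from the candidate run are skipped.
    """
    regressions = {}
    for name, base_time in baseline.items():
        new_time = candidate.get(name)
        if new_time is None:
            continue
        ratio = new_time / base_time
        if ratio > max_slowdown:
            regressions[name] = ratio
    return regressions
```

With made-up timings where a hypothetical "symmetric_fold" protocol goes from 12 s to 120 s, `find_regressions` would flag it with a ratio of 10, and a release script could refuse to promote the build while the returned dictionary is non-empty.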

We are still in debug mode for our minirosetta application. There is a small memory leak and a 2-fold slowdown in performance. The slowdown was caused by a recent refactoring of the hydrogen bond energy code.


googloo
Joined: 15 Sep 06
Posts: 133
Credit: 21,716,992
RAC: 5,890
Message 62642 - Posted: 30 Jul 2009, 17:48:46 UTC

Nobody's updating the Rosetta Application Version Release Log either.

Emigdio Lopez Laburu
Joined: 25 Feb 06
Posts: 61
Credit: 40,240,061
RAC: 0
Message 62643 - Posted: 30 Jul 2009, 17:51:48 UTC

Hi, David.

First of all, I would like to thank you for your explanation. I appreciate it.

As has been discussed before in other threads, it would be a good idea to communicate this information to all the volunteers, perhaps on the main page of R@H. Not everybody visits the forums, and not everybody will read this thread, I suppose.

I do not understand the science behind this project; I work as an IT professional in a field unrelated to protein folding. But let me give you a couple of pieces of advice (without understanding your "business"!):

- Never, ever make a change to the software/hardware just before a weekend. If something fails, nobody will be around to fix it.
- You must build a pre-production environment to test the changes.
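
The first rule can even be enforced by a release script rather than by memory. A minimal sketch in Python, assuming Friday through Sunday count as the freeze window (that choice and the function name are my own illustration, not anything the project uses):

```python
from datetime import datetime

# Assumption: releases are blocked on Friday, Saturday, and Sunday.
# datetime.weekday() numbers Monday as 0 and Sunday as 6.
FREEZE_WEEKDAYS = {4, 5, 6}


def release_allowed(when: datetime) -> bool:
    """Return True if a release may go out at the given time."""
    return when.weekday() not in FREEZE_WEEKDAYS
```

A deployment script could call `release_allowed(datetime.now())` and stop with an error when it returns False.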

As I said, I don't understand your environment, software, and so on.

I offer this advice with total humility, but I think that someone on your team should take action. The price of these errors is high if you want to keep thousands of volunteers working for you.

Thanks again.

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 62645 - Posted: 30 Jul 2009, 18:20:20 UTC
Last modified: 30 Jul 2009, 18:23:02 UTC

Glad to hear, David.
I still have plenty of WUs in cache to keep my house warm :D (it's winter here :P)
Posts like yours are what most people ask to see on the home page; that would make new users feel more "connected" to the science team.

My $0.02.

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,662,550
RAC: 720
Message 62647 - Posted: 30 Jul 2009, 19:30:15 UTC

I have one WU stuck in upload, which presumably will resolve itself. When I finally got a new WU down, it crashed after 6 seconds, Mini 1.88.

30/07/2009 21:01:21 rosetta@home [sched_op_debug] Reason: Unrecoverable error for result lr13_seq_score12_F_rlbd_1a68_IGNORE_THE_REST_DECOY_14592_2633_0 (Incorrect function. (0x1) - exit code 1 (0x1))

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,709,409
RAC: 1,933
Message 62649 - Posted: 30 Jul 2009, 19:43:27 UTC - in response to Message 62647.  

Be sure to post the error message part of this over in the 1.88 thread so they know.

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 62654 - Posted: 30 Jul 2009, 20:33:19 UTC - in response to Message 62647.  

These get solved in v1.90.

Apparently one tiny mistake had a snowball effect on the whole project.

Yifan Song
Volunteer moderator
Project developer
Project scientist
Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 62655 - Posted: 30 Jul 2009, 20:42:12 UTC

Well, for about two days, it was a little chaotic here. A bunch of us have been running around trying to figure out where the bug is and how to fix it.
But rest assured that the chaos is not in the science part and is hopefully just temporary. I'll post something more detailed on what we've tried in order to solve this problem.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 62656 - Posted: 30 Jul 2009, 20:53:43 UTC - in response to Message 62643.  

Your advice is great and I agree completely with both points.

1. We will make it a point never to do an update during the weekend or at the end of the week.
2. We do have a pre-production environment: Ralph@home. But this problem was caused by user error. The signature file was accidentally copied over from Ralph, whereas the standard protocol should automatically create the correct signature file. The 10x slowdown wasn't caught by our internal unit tests and benchmark tests, but we are going to modify the tests to make sure it gets caught in the future.
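
To illustrate the kind of pre-release check that could catch a mismatched signature file, here is a small Python sketch. It is a stand-in only: it compares a plain SHA-256 digest stored in a side file against the input file, which is not how the project's real signing works, and the function and file names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_digest(data_file: Path, digest_file: Path) -> bool:
    """Return True if the digest stored in digest_file matches data_file.

    Assumption: digest_file holds a single hex SHA-256 string.
    """
    recorded = digest_file.read_text().strip().lower()
    return sha256_of(data_file) == recorded
```

Run against the files about to be released, a check like this would fail if the side file had been copied from a different environment whose input file differs.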

The rate of code development has recently been very high.


David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 62657 - Posted: 30 Jul 2009, 20:55:03 UTC - in response to Message 62654.  

Unfortunately, the "tiny" mistake was a very, very, very detrimental one.

Henry Huff
Joined: 31 May 06
Posts: 6
Credit: 2,298,502
RAC: 0
Message 62658 - Posted: 30 Jul 2009, 21:24:10 UTC
Last modified: 30 Jul 2009, 21:28:05 UTC

Yes, something is wrong. I have a total of 4 computers running Rosetta@home, and all have been having trouble uploading and downloading as well as getting new tasks. A frequent message is "internet access ok - project servers may be down". This has been going on for over a week.

Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 62661 - Posted: 30 Jul 2009, 21:56:32 UTC

Yes, something is wrong. I have a total of 4 computers running Rosetta@home, and all have been having trouble uploading and downloading as well as getting new tasks. A frequent message is "internet access ok - project servers may be down".


Perhaps the answer lies here

Message 62660 -
all services were temporarily shut down to add more web servers.

Gen_X_Accord
Joined: 5 Jun 06
Posts: 154
Credit: 279,018
RAC: 0
Message 62663 - Posted: 30 Jul 2009, 22:24:02 UTC

In regard to David E's explanation (thank you for that, by the way), I have added Ralph@home too. Maybe helping with early development can help prevent screw-ups like this in the future.

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 62665 - Posted: 31 Jul 2009, 0:11:19 UTC - in response to Message 62663.  

Ditto

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,709,409
RAC: 1,933
Message 62666 - Posted: 31 Jul 2009, 1:18:54 UTC

There should be a link on the Rosetta home page pointing to Ralph so that we can get more volunteers to help with pre-production testing. I run Ralph to help find any bugs in the program.

Little is known about Ralph apart from the occasional mention on the boards here.

joseps
Joined: 25 Jun 06
Posts: 72
Credit: 8,173,820
RAC: 0
Message 62683 - Posted: 31 Jul 2009, 18:01:43 UTC - in response to Message 62640.  

I turned off my 5 computers when I went on vacation. When I returned today, I could not upload work. I need work units to run the computers.
joseps

joseps
Joined: 25 Jun 06
Posts: 72
Credit: 8,173,820
RAC: 0
Message 62698 - Posted: 1 Aug 2009, 13:54:37 UTC

Hi,
I do not know servers, research operations, or their management. Rosetta@home has become very big and draws a large number of volunteer crunchers worldwide. Some kind of preventive maintenance should be implemented to make sure that distributed computing is not interrupted. If possible, a backup server or something similar should be available. And no one person should be doing work or checking alone. There should be at least two people working together, cross-checking and discussing each move before it is carried out. This is done to prevent any break in operation. I used to run a large production plant with a 3-shift operation, and I made sure that 2-3 engineers discussed an action before implementing it. No one person is fail-proof. I love Rosetta. I just want to volunteer my 2 cents' worth of ideas. If I am barging in or out of line, I am very, very sorry. I'll just shut my big mouth.
joseps

I turned off my 5 computers when I went on vacation. When I returned today, I could not upload work. I need work units to run the computers.
joseps

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,709,409
RAC: 1,933
Message 62699 - Posted: 1 Aug 2009, 15:01:22 UTC

I have to agree: this coding/signature problem should have been avoided in the first place by double-checking the code and signatures. Projects should always be alpha/beta tested on Ralph before coming over here to Rosetta. Once the major errors have been worked out, bring the tasks here for running. Then only very odd errors will show up.

The number of users should be higher, but the technology problems are driving some of the big users away. Perhaps Rosetta is now too big for just the group that is currently running it.

j2satx
Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 62704 - Posted: 1 Aug 2009, 21:15:45 UTC - in response to Message 62643.  

They have ralph@home to test on, but they don't use it.
©2024 University of Washington
https://www.bakerlab.org