Problems and Technical Issues with Rosetta@home

BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75413 - Posted: 21 Apr 2013, 19:34:32 UTC - in response to Message 75408.  

Mikey -- yup -- I am considering something similar. Perhaps focusing on Malaria and SETI a bit more for me -- I'm sort of unhappy with POEM -- they have had a few outages, and their inability to feed the GPU side, given the low numbers they generate from the CPU side, sort of pushes me away a bit. I figure I should 'reward' SETI a bit for the changes they have made to improve their performance.

Rosetta, which runs solidly as a project, has (as you noted) something of a disinclination to acknowledge issues and even more of a disinclination to act to resolve the relatively rare issues they encounter. In this case, I would think it relatively simple to 1) acknowledge the problems with the cryo units and 2) stop generating them. So I don't believe it is a case of a technical problem here, but rather one of those keyboard/cerebellum issues...



Indeed -- the problem of course is that while I periodically ferret out ALL Cryo units I have, not only does that not stop me from getting new ones, but also the ones I abort simply go back into the queue for future downloads.

I realize that some of the cryo workunits are OK -- but it seems to me that it is incumbent on the project folks to simply stop these from going out at the project level and debug them there.

There are no doubt plenty of folks running Rosetta in a 'no attention mode' - and they are really wasting CPU cycles here.


And I think THAT is a major part of what could be Rosetta's downfall: they simply don't manage their project well and just keep on keeping on despite the problems they are having. I found TEN cryo units this morning on ONE machine; one had errored out and I aborted the rest!! I have already turned a different machine off, as in NO NEW TASKS, and will move on to Poem with it. Two other machines are now on Eon with Malaria as backups! I am getting VERY tired of aborting cryo units or being asleep and having them error out!!! I only have Windows machines, so it seems they will NEVER work for me!!
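For anyone who wants to script the manual clean-up described in this post, here is a minimal sketch, not something the project provides. It assumes that `boinccmd --get_tasks` prints a `name:` line followed by a `project URL:` line for each task, and that the troublesome tasks have "cryo" somewhere in their name; both are assumptions to check against your own client. The helper names `find_tasks` and `abort_commands` and the sample output are made up for illustration. The script only builds `boinccmd --task <url> <name> abort` command lines, so they can be reviewed before anything is run.

```python
import shlex

# Made-up sample of what `boinccmd --get_tasks` is assumed to print.
# The task names and the project URL are placeholders, not real Rosetta data.
SAMPLE_OUTPUT = """\
======== Tasks ========
1) -----------
   name: cryo_example_task_001_0
   WU name: cryo_example_task_001
   project URL: http://example.org/project/
2) -----------
   name: other_example_task_002_0
   WU name: other_example_task_002
   project URL: http://example.org/project/
"""


def find_tasks(get_tasks_output, needle="cryo"):
    """Return (project_url, task_name) pairs whose task name contains `needle`."""
    matches = []
    name = None
    for raw in get_tasks_output.splitlines():
        line = raw.strip()
        if line.startswith("name: "):
            # "WU name: ..." does not start with "name: ", so it is skipped.
            name = line[len("name: "):]
        elif line.startswith("project URL: ") and name is not None:
            url = line[len("project URL: "):]
            if needle.lower() in name.lower():
                matches.append((url, name))
            name = None
    return matches


def abort_commands(matches):
    """Build one reviewable `boinccmd --task URL NAME abort` line per match."""
    return [
        " ".join(shlex.quote(part) for part in ("boinccmd", "--task", url, name, "abort"))
        for url, name in matches
    ]


for command in abort_commands(find_tasks(SAMPLE_OUTPUT)):
    print(command)
```

As noted in the post, aborted units simply go back into the queue for other hosts, so this only saves local CPU time; it does not fix anything on the project side.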

ID: 75413 · Rating: 0

robertmiles
Joined: 16 Jun 08
Posts: 1226
Credit: 14,034,809
RAC: 2,884
Message 75414 - Posted: 22 Apr 2013, 1:46:37 UTC - in response to Message 75413.  

BarryAZ, are you aware of the bandwidth problems SETI is currently having? For example, it takes several days to download the input files for one of their Astropulse workunits, but only about an hour to run it once all the input files are downloaded.

I've given them a few suggestions on how to make more efficient use of their bandwidth; they don't seem to be using them.

As for malaria, there are two sources of workunits:

malariacontrol.net
http://www.malariacontrol.net/

World Community Grid
http://www.worldcommunitygrid.org/
GO Fight Against Malaria project only
ID: 75414 · Rating: 0

BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75415 - Posted: 22 Apr 2013, 5:03:56 UTC - in response to Message 75414.  

Robert -- for SETI and Astropulse -- I understand -- I don't do GPU work units for SETI - a constraint that is easy enough to configure. As to CPU work units for SETI, since the relocation, downloads (and uploads) have been much improved for me.

Regarding Malaria -- actually I run both the Malaria project (and have for years) and the World Community Grid project (and have for years). For World Community Grid, I've even shifted three GPUs to support it (AMD 6x and 7x GPUs).

One thing I've not seen much of with their projects though (albeit some with Malaria on a single workstation, which leads me to believe it's my problem) is work units that run for more than a few minutes and yield computational errors.

I have seen that problem occasionally with Einstein though.



ID: 75415 · Rating: 0

TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 75419 - Posted: 22 Apr 2013, 17:47:41 UTC - in response to Message 75406.  

There are no doubt plenty of folks running Rosetta in a 'no attention mode' - and they are really wasting CPU cycles here.


That's correct BarryAZ. I aborted all the cryo's, but I have to work, sleep and study from time to time, so a lot slips through my hands...

The only one responding is Mod.Sense, but he actually has no influence on the project team. I wonder if they update him on everything that is going on.

The problem is that the cause of this project is GOOD, it can really help in the future. I have seen brain diseases and cancer from close by so I will contribute and stick to the project. But the team could learn a lot from other projects, like Einstein@home (they are the best).

And one more off-topic note: Fightmalaria@home is another project for malaria. I use it as a back-up project. It is a good cause as well, as malaria is a nasty disease.
Greetings,
TJ.
ID: 75419 · Rating: 0

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 75423 - Posted: 22 Apr 2013, 22:35:55 UTC

Hi.

It doesn't look like the validator is running, as all the tasks that I returned this morning are still pending after hours of waiting.
(The server status is showing green.)

They usually go valid after a few minutes.

ID: 75423 · Rating: 0

TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 75424 - Posted: 22 Apr 2013, 22:42:44 UTC - in response to Message 75423.  


I just sent 6 tasks home manually, 5 have credit almost immediately, one is still pending.
Greetings,
TJ.
ID: 75424 · Rating: 0

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 75426 - Posted: 22 Apr 2013, 23:24:53 UTC

Well my returned tasks from earlier are still stuck!

I'll have another look later on today.

ID: 75426 · Rating: 0

Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 75427 - Posted: 23 Apr 2013, 2:12:31 UTC - in response to Message 75426.  



Yeah I have several in pending still as well.
ID: 75427 · Rating: 0

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 75428 - Posted: 23 Apr 2013, 3:56:56 UTC

I believe there is a problem: the TeraFLOPS estimate has dropped from 120+ to what it is now, and it is still falling.

TeraFLOPS estimate: 89.449
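To put that drop in proportion, here is a rough back-of-the-envelope calculation. It treats "120+" as exactly 120, which is an assumption (the real starting figure was somewhat higher), so the result is a lower bound on the drop. The function name `percent_drop` is just for illustration.

```python
def percent_drop(before, after):
    """Relative drop from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100.0


# 120.0 is an assumed stand-in for "120+"; 89.449 is the figure quoted above.
drop = percent_drop(120.0, 89.449)
print(f"TeraFLOPS estimate is down by at least {drop:.1f}%")
```

So roughly a quarter of the project's estimated throughput was gone at the time of the post.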


ID: 75428 · Rating: 0

BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75429 - Posted: 23 Apr 2013, 5:07:25 UTC - in response to Message 75428.  

Indeed -- looks like the validator is in trouble. Maybe the validator has been dipped in the Cryo tank <rueful smile>




ID: 75429 · Rating: 0

mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,856,873
RAC: 2,278
Message 75430 - Posted: 23 Apr 2013, 11:12:27 UTC - in response to Message 75429.  



That would make sense: every unit that gets aborted or errors out still has to be handled by the validator, each and every one. Maybe NOW they will finally do something about them!! I set all my PCs to NNT yesterday and 99% are now out of work; I have re-enabled two and no cryo tasks came through.
ID: 75430 · Rating: 0

BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75433 - Posted: 23 Apr 2013, 17:05:32 UTC - in response to Message 75430.  

I haven't seen any project notice of issues -- then again, the form here is that the volunteers handling the message boards collect feedback, pass it to the project folks and after careful rumination, the folks actually in the project staff consider the possibility that something needs to be done. Then, often, they do act. After they have done whatever it is they have determined needs to be done, then they pass that information on to the volunteers here. So, assuming a normal process, we should see manifestations of the project folks doing something (unannounced) about the reported problems (and may be seeing them now) over the coming day or two or three and then by the weekend, we will get a message or two regarding the project understanding of the problem and what they have done.

Rosetta is typically a more reliable BOINC project than most, but perhaps no more and maybe less communicative (from the project side) than average.



ID: 75433 · Rating: 0

TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 75434 - Posted: 23 Apr 2013, 18:23:49 UTC

The waiting begins... there are more problems. Not uploading, a lot more pending, and new errors while computing (and they did run out of cryo ;-) (I borrowed that from BarryAZ)); it's a CASPx_ task this time.

Seems though that I get new work when requesting manually.
Greetings,
TJ.
ID: 75434 · Rating: 0

Cutchet Salvador
Joined: 1 Feb 10
Posts: 17
Credit: 10,690,439
RAC: 0
Message 75435 - Posted: 23 Apr 2013, 19:06:44 UTC - in response to Message 75434.  

No news is good news??
Servers all showing green!
Congratulations to the department of communication and public relations.
In the 21st century they keep on trusting drums and pigeons to communicate, thank you.
Greetings,
Salvador
ID: 75435 · Rating: 0

BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75436 - Posted: 23 Apr 2013, 20:05:36 UTC

Thanks for the follow up message indicating closer monitoring of the boards.

To facilitate matters, here is a summary of the current issues I've seen and seen reported by others.

1) Cryo work units -- spinning out computation errors, not all the time but quite often and a fair proportion of the time these errors are after hours of processing.

2) Validation issues -- normally Rosetta has a very low proportion of pendings. Currently that number (since Monday morning it seems) has been rising.

3) Some of the CASP work units -- spinning out computation errors -- these seem to happen early on (after a few minutes).

4) Uploading issues -- this started late last night.

All in all, things are rather unwell in Rosetta-land at the moment.

For me, my approach is to temporarily suspend processing of Rosetta pending resolution and ideally reports of resolution.

BarryAZ
ID: 75436 · Rating: 0

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 75437 - Posted: 23 Apr 2013, 21:55:49 UTC
Last modified: 23 Apr 2013, 22:00:54 UTC

I wonder if the Rosetta staff even knows there's a problem.

Also, I have these HUGE WUs that, after about 6 hours or so, are still on Model 1, Step 0, and in the graphics they show just one big "sinusoidal" line in the "Searching..." graph while the rest of the graphs are blank.

EDIT: After 6-7 hours of running, they DO NOT checkpoint. I'm aborting all tasks and running SETI in the meantime.
ID: 75437 · Rating: 0

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 75439 - Posted: 24 Apr 2013, 2:00:41 UTC

Hi.

There seems to be a problem with downloads now, on top of the other problems: they are slow and are timing out like before.

Wed 24 Apr 2013 11:53:10 EST Project communication failed: attempting access to reference site
Wed 24 Apr 2013 11:53:10 EST rosetta@home Temporarily failed download of rb_04_23_37754_72868_h003__sirtd_h003_.psipred_ss2: connect() failed
Wed 24 Apr 2013 11:53:10 EST rosetta@home Temporarily failed download of rb_04_23_37754_72868_h003__sirtd_h003_.fasta: connect() failed
Wed 24 Apr 2013 11:53:10 EST rosetta@home Started download of rb_04_23_37754_72868_h003__sirtd_h003_.nobuformat.psipred_ss2
Wed 24 Apr 2013 11:53:10 EST rosetta@home Started download of rb_04_23_37754_72868_h003__sirtd_aah003_03_05.200_v1_3.gz
Wed 24 Apr 2013 11:53:12 EST Internet access OK - project servers may be temporarily down.
Wed 24 Apr 2013 11:53:33 EST Project communication failed: attempting access to reference site
Wed 24 Apr 2013 11:53:33 EST rosetta@home Temporarily failed download of rb_04_23_37754_72868_h003__sirtd_h003_.nobuformat.psipred_ss2: connect() failed
Wed 24 Apr 2013 11:53:33 EST rosetta@home Temporarily failed download of rb_04_23_37754_72868_h003__sirtd_aah003_03_05.200_v1_3.gz: connect() failed
Wed 24 Apr 2013 11:53:33 EST rosetta@home Started download of rb_04_23_37754_72868_h003__sirtd_aah003_17_05.200_v1_3.gz
Wed 24 Apr 2013 11:53:34 EST Internet access OK - project servers may be temporarily down.
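A log like the one above can be summarized with a short script. This is only a sketch: the regular expression is written against the line layout shown in this post (timestamp, project name, then "Temporarily failed download of <file>: <reason>") and other client versions may format their messages differently. The function name `summarize_failed_downloads` is made up for illustration, and the sample is the first four lines of the log above.

```python
import re
from collections import Counter

# First four lines of the log quoted in this post.
SAMPLE_LOG = """\
Wed 24 Apr 2013 11:53:10 EST Project communication failed: attempting access to reference site
Wed 24 Apr 2013 11:53:10 EST rosetta@home Temporarily failed download of rb_04_23_37754_72868_h003__sirtd_h003_.psipred_ss2: connect() failed
Wed 24 Apr 2013 11:53:10 EST rosetta@home Temporarily failed download of rb_04_23_37754_72868_h003__sirtd_h003_.fasta: connect() failed
Wed 24 Apr 2013 11:53:10 EST rosetta@home Started download of rb_04_23_37754_72868_h003__sirtd_h003_.nobuformat.psipred_ss2
"""

# Assumed layout: "<weekday> <day> <month> <year> <time> <tz> <project> Temporarily failed download of <file>: <reason>"
FAILED_RE = re.compile(
    r"^(?P<stamp>\w{3} \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} \w+) "
    r"(?P<project>\S+) Temporarily failed download of (?P<file>\S+): (?P<reason>.+)$"
)


def summarize_failed_downloads(log_text):
    """Return (failed_files, reason_counts) for 'Temporarily failed download' lines."""
    failed_files = []
    reasons = Counter()
    for line in log_text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            failed_files.append(match.group("file"))
            reasons[match.group("reason")] += 1
    return failed_files, reasons


files, reasons = summarize_failed_downloads(SAMPLE_LOG)
print(f"{len(files)} failed downloads")
for reason, count in reasons.items():
    print(f"  {reason}: {count}")
```

Every failure in the quoted log has the same reason, `connect() failed`, which fits the client's own note that internet access is OK and the project servers may be temporarily down.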

ID: 75439 · Rating: 0

robertmiles
Joined: 16 Jun 08
Posts: 1226
Credit: 14,034,809
RAC: 2,884
Message 75441 - Posted: 24 Apr 2013, 4:36:30 UTC - in response to Message 75436.  

Thanks for the follow up message indicating closer monitoring of the boards.

To facilitate matters, here is a summary of the current issues I've seen and seen reported by others.

1) Cryo work units -- spinning out computation errors, not all the time but quite often and a fair proportion of the time these errors are after hours of processing.


In another forum, I've seen a statement that the cryo work units run properly on Macs, but not on whatever other type of computer that poster used.

For the last few cryo workunits on my Windows 7 computer, they failed for me and for all wingmates using Windows 7. One succeeded for a wingmate using Windows 8, though.

ID: 75441 · Rating: 0

BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 75442 - Posted: 24 Apr 2013, 4:50:50 UTC - in response to Message 75441.  

But since, as has been noted before, this project doesn't support application-specific choices at either the project level or the workstation level, that places the onus on the project not to generate work units that are OS-specific.

At this point though, the project has a batch of problems so it makes sense for all of us to 'help out' by suspending processing until the project addresses the problems and confirms that the various fixes are in place and tested out.




ID: 75442 · Rating: 0

morgan
Joined: 30 Jun 06
Posts: 3
Credit: 387,964
RAC: 0
Message 75443 - Posted: 24 Apr 2013, 8:57:35 UTC - in response to Message 75436.  
Last modified: 24 Apr 2013, 8:58:48 UTC



BarryAZ, you took the words right out of my mouth, yes! Hihi.
In other words: I have the SAME PROBLEMS HERE.
ID: 75443 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org