Problems and Technical Issues with Rosetta@home

Andrii Muliar
Joined: 10 Nov 05
Posts: 12
Credit: 7,655,243
RAC: 0
Message 73919 - Posted: 28 Sep 2012, 16:43:43 UTC - in response to Message 73914.  

Is anyone having problems uploading results? Not one of my computers will upload. My Internet access is ok. I noticed that the server status page says that everything is ok. Still nothing will upload.


I have this problem today on two of my computers with different internet providers. Other projects are working fine except Rosetta@home.
ID: 73919 · Rating: 0

rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 73920 - Posted: 28 Sep 2012, 17:05:26 UTC - in response to Message 73919.  

Is anyone having problems uploading results? Not one of my computers will upload. My Internet access is ok. I noticed that the server status page says that everything is ok. Still nothing will upload.


I have this problem today on two of my computers with different internet providers. Other projects are working fine except Rosetta@home.


Plus the TFLOPS estimate went from over 100 down to 30.
ID: 73920 · Rating: 0

TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 73922 - Posted: 28 Sep 2012, 17:13:14 UTC - in response to Message 73918.  

While, on occasion, this project has encountered connectivity problems (perhaps a few times a year), it is relatively rare for this project in the distributed processing world.

Somewhat more troublesome is -- and this is VERY rare for Rosetta -- an informational blackout. As we move toward a full day of the outage, we've not seen any information from the project folks, not even an acknowledgement of what folks are reporting here.

It may well be that they are aware of the problem and are working on it, but at this juncture, for the community here, it is all speculation. I am very much looking forward to at least an acknowledgement that the folks back at the lab are aware there is a problem,


My experience with communication from the admins of the Rosie project is not good, BarryAZ. Even private messages get no answer, or at times only after a week or so. However, I find this project very useful and will stick with it, but communication with the crunching community could be a lot better.
Greetings,
TJ.
ID: 73922 · Rating: 0

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 118,374,337
RAC: 23,504
Message 73926 - Posted: 28 Sep 2012, 18:52:22 UTC

As Polian has just posted in another thread:

Network Outages: As part of the UW's continuing datacenter consolidation, the network topology upon which Rosetta@home is run was changed yesterday. Since that time we've been shaking out the various hiccups that result from changing things in such a busy system. We, the IT crew, apologize for the troubles and will try to get them ironed out as soon as we can. We appreciate your patience and your continued contributions to our research efforts. -KEL




From the front page
ID: 73926 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 73932 - Posted: 28 Sep 2012, 23:02:38 UTC - in response to Message 73910.  

Maybe you should try to play a little bit with settings? Go to Preferences > Disk and memory usage. Then check/uncheck "Leave applications in memory while suspended" and try to decrease time for "Tasks checkpoints to disk every:" (I put here 600 seconds). (...)

My default there is 60 seconds. The application is supposed to stay in memory while suspended (box is checked).

I often also suspend the application manually, so it can finish writing to disk before I suspend/hibernate the machine.

Is there a place where I can read more about the "reset conditions" of a task - under which circumstance can this happen?

Oh, by the way, yes, I tried turning it off and on again. I also uninstalled and reinstalled. ;-)


...and so it sounds like you are of the understanding that suspending a work unit at any random point in time will force a checkpoint to be preserved on disk. It doesn't work that way. Sorta like trying to force a pregnant woman to give birth, best to wait for when the baby is ready. BOINC applications have to include specific and complex logic in their code to take checkpoints and to be able to properly reestablish themselves from them. Some types of Rosetta work units checkpoint more frequently than others.

No settings can force a checkpoint, only prevent them from occurring too frequently (based on your definition provided in the setting for how frequently you want to permit writes to disk). So if an application were trying to checkpoint every 30 seconds, you might set that to 10 minutes or something to not take all of those checkpoints and help your machine run smoother for the way you use it.
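To illustrate the point, here is a minimal sketch (not Rosetta's or BOINC's actual code). The application only offers a checkpoint when it reaches a safe point in its own logic, and the disk-write preference merely sets a minimum time between checkpoints. The class and function names below are hypothetical; they loosely mirror the pattern of the BOINC API calls boinc_time_to_checkpoint() and boinc_checkpoint_completed(), and time is simulated with plain numbers of seconds.

```python
class CheckpointThrottle:
    """Hypothetical helper: decides whether a checkpoint is permitted right now."""

    def __init__(self, min_interval_s, start_time=0.0):
        # min_interval_s stands in for the user's disk-write preference.
        self.min_interval_s = min_interval_s
        self.last_checkpoint = start_time

    def time_to_checkpoint(self, now):
        # Permit a checkpoint only if enough time has passed since the last one.
        return now - self.last_checkpoint >= self.min_interval_s

    def checkpoint_completed(self, now):
        # Record that a checkpoint was written at time `now`.
        self.last_checkpoint = now


def simulate(safe_points, min_interval_s):
    """Return the times (in seconds) at which checkpoints are actually written.

    safe_points: times at which the application itself is ready to checkpoint.
    """
    throttle = CheckpointThrottle(min_interval_s)
    written = []
    for now in safe_points:
        if throttle.time_to_checkpoint(now):
            written.append(now)
            throttle.checkpoint_completed(now)
    return written


# Application ready every 30 s, user permits a write every 600 s.
print(simulate(list(range(30, 1801, 30)), 600))  # [600, 1200, 1800]

# Application ready only every 15 minutes, user permits a write every 60 s.
print(simulate([900, 1800], 60))  # [900, 1800]
```

In the first call the application is ready to checkpoint every 30 seconds, but the 600-second preference lets only every twentieth one through. In the second call the application only reaches a safe point every 15 minutes, so lowering the preference to 60 seconds produces no additional checkpoints.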
Rosetta Moderator: Mod.Sense
ID: 73932 · Rating: 0

robertmiles
Joined: 16 Jun 08
Posts: 1226
Credit: 14,046,097
RAC: 2,240
Message 73934 - Posted: 29 Sep 2012, 0:51:00 UTC - in response to Message 73911.  

Maybe you should try to play a little bit with settings? Go to Preferences > Disk and memory usage. Then check/uncheck "Leave applications in memory while suspended" and try to decrease time for "Tasks checkpoints to disk every:" (I put here 600 seconds). (...)

My default there is 60 seconds. The application is supposed to stay in memory while suspended (box is checked).

I often also suspend the application manually, so it can finish writing to disk before I suspend/hibernate the machine.

Is there a place where I can read more about the "reset conditions" of a task - under which circumstance can this happen?

Oh, by the way, yes, I tried turning it off and on again. I also uninstalled and reinstalled. ;-)


This won't help you, but I have read somewhere, for another project, that hibernating a PC (under Windows) will result in strange behavior of BOINC and that tasks error out eventually.


I believe that's dependent on which model of computer you are using, and whether the workunits are set up to recover from long delays during timeout checks. For my computers, hibernate/sleep while BOINC is suspended usually allows the workunits to resume properly when I'm ready to resume. I'm not sure I've tried it with Rosetta@Home workunits, though.

Turning the computer off, by any means other than sleep/hibernate, removes the possibility of recovering from such a sleep/hibernate. So does anything that removes BOINC and the workunits from memory.
ID: 73934 · Rating: 0

robertmiles
Joined: 16 Jun 08
Posts: 1226
Credit: 14,046,097
RAC: 2,240
Message 73935 - Posted: 29 Sep 2012, 0:55:41 UTC - in response to Message 73915.  

Is anyone having problems uploading results? Not one of my computers will upload. My Internet access is ok. I noticed that the server status page says that everything is ok. Still nothing will upload.


Yes, see my thread "upload problem".
No new work either.


Same here, on both my desktops. At least I am connected to several other BOINC projects as well, so I can get enough workunits from them.

Do the people running the Rosetta@Home servers WANT any more workunits run during the next few days? It looks like they don't.
ID: 73935 · Rating: 0

Cutchet Salvador
Joined: 1 Feb 10
Posts: 17
Credit: 10,690,439
RAC: 0
Message 73936 - Posted: 29 Sep 2012, 10:52:27 UTC - in response to Message 73892.  
Last modified: 29 Sep 2012, 11:12:52 UTC

J.S., I do not believe it is a fault of your N150. For some time there have been WUs that do not generate checkpoints, for example hyb_xx, ebolanator_xx, rb_xx and others.
Therefore, whenever you start your N150 again the work will begin from 0, not because it was not kept in memory but because no checkpoint was saved, and without checkpoints we are losing hours of completed work.
Today in Barcelona (Catalunya, Catalonia) it is raining heavily and there have been several brief power cuts, and as a result several WUs have restarted from 0 each time. I have lost 23 hours of work so far.
Greetings and patience,
Salvador
ID: 73936 · Rating: 0

robertmiles
Joined: 16 Jun 08
Posts: 1226
Credit: 14,046,097
RAC: 2,240
Message 73937 - Posted: 29 Sep 2012, 12:49:19 UTC - in response to Message 73936.  

J.S., I do not believe it is a fault of your N150. For some time there have been WUs that do not generate checkpoints, for example hyb_xx, ebolanator_xx, rb_xx and others.
Therefore, whenever you start your N150 again the work will begin from 0, not because it was not kept in memory but because no checkpoint was saved, and without checkpoints we are losing hours of completed work.
Today in Barcelona (Catalunya, Catalonia) it is raining heavily and there have been several brief power cuts, and as a result several WUs have restarted from 0 each time. I have lost 23 hours of work so far.
Greetings and patience,
Salvador


You might want to check if your computer has the sleep/hibernate feature. Mine do, and I have added a UPS (uninterruptible power supply) for each, so that they can run from a battery for a short time and then copy the entire contents of the memory to the hard drive. Later, the computer can resume from the saved copy of the memory, instead of a normal reboot. If the workunits were designed properly, they can resume from the point of interruption rather than from the beginning.
ID: 73937 · Rating: 0

Link
Joined: 4 May 07
Posts: 355
Credit: 382,349
RAC: 0
Message 73941 - Posted: 30 Sep 2012, 11:31:00 UTC - in response to Message 73934.  

This won't help you, but I have read somewhere, for another project, that hibernating a PC (under Windows) will result in strange behavior of BOINC and that tasks error out eventually.
(...)

Turning the computer off, by any means other than sleep/hibernate, removes the possibility of recovering from such a sleep/hibernate. So does anything that removes BOINC and the workunits from memory.

Unless something requires a reboot, I only hibernate both my computers (the two that are not always running 24/7), and I have had no issues with any of the projects I crunch for.

The only thing I learned recently, from one timed-out WU, is that you have to make sure BOINC is not requesting work at the very moment you hibernate the computer. Rosetta apparently does not have the "resend lost tasks" feature active.
ID: 73941 · Rating: 0

dima
Joined: 26 Nov 09
Posts: 2
Credit: 2,689,921
RAC: 0
Message 74093 - Posted: 24 Oct 2012, 7:50:51 UTC

hyb_al_09_bench_3rj8A_SAVE_ALL_OUT_IGNORE_THE_REST_61028_1254

It reaches 100% and doesn't complete. The CPU is not doing any work.

If I restart the boinc-client service, it runs from 0% to 100% again, and nothing else happens. LA - null. I aborted this task.

ID: 74093 · Rating: 0

dima
Joined: 26 Nov 09
Posts: 2
Credit: 2,689,921
RAC: 0
Message 74124 - Posted: 29 Oct 2012, 10:15:08 UTC - in response to Message 74093.  

hyb_al_09_bench_3rj8A_SAVE_ALL_OUT_IGNORE_THE_REST_61028_1254

It reaches 100% and doesn't complete. The CPU is not doing any work.

If I restart the boinc-client service, it runs from 0% to 100% again, and nothing else happens. LA - null. I aborted this task.


The problem persists:
hyb_al_02_bench_2yeqB_SAVE_ALL_OUT_IGNORE_THE_REST_60648_3065_0
ID: 74124 · Rating: 0

Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,869,008
RAC: 714
Message 74125 - Posted: 29 Oct 2012, 12:19:59 UTC

hyb_ai_bench_4adyB_SAVE_ALL_OUT_IGNORE_THE_REST_58035_47

My Mac (BOINC 6.12.33) ended with Outcome: Success; Client state: Done; Exit status: 0(0x0) but with the following in the stderr output:

BOINC:: CPU time: 36269.7s, 14400s + 21600s[2012-10-29 7:18:59:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001

The watchdog ended it and I received the default one model/20 credits.

On my wingman's Windows machine the workunit ended with a client error within a few seconds of starting, though it should be noted that all 40 of his most recent tasks have failed, so his failure might not be related to the workunit.

Best,
Snags
ID: 74125 · Rating: 0

Mad_Max
Joined: 31 Dec 09
Posts: 208
Credit: 24,285,827
RAC: 15,419
Message 74126 - Posted: 29 Oct 2012, 14:11:00 UTC
Last modified: 29 Oct 2012, 14:12:43 UTC

I have tons of errors with WUs from the hyb_.._bench_ series.
Same as described above or in my message from 1.5 months ago: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6055&nowrap=true#73741

I abort all WUs from this series in my queue.
ID: 74126 · Rating: 0

Andrea [E.R.]
Joined: 4 Jul 11
Posts: 3
Credit: 180,074
RAC: 0
Message 74129 - Posted: 30 Oct 2012, 10:03:14 UTC - in response to Message 74126.  
Last modified: 30 Oct 2012, 10:08:24 UTC

Hi!!!

A member of the Boinc.Italy Team reported this error:

boinc.bakerlab.org/rosetta/workunit.php?wuid=484405926 (too old, but with the same problem as the following one)
boinc.bakerlab.org/rosetta/workunit.php?wuid=489955188

The WU was re-sent near another cruncher's deadline, but wasn't validated after my teammate completed it.

Is it a bug???

Thanks!!!
ID: 74129 · Rating: 0

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 118,374,337
RAC: 23,504
Message 74133 - Posted: 30 Oct 2012, 17:12:16 UTC - in response to Message 74129.  

Hi!!!

A member of the Boinc.Italy Team reported this error:

boinc.bakerlab.org/rosetta/workunit.php?wuid=484405926 (too old, but with the same problem as the following one)
boinc.bakerlab.org/rosetta/workunit.php?wuid=489955188

The WU was re-sent near another cruncher's deadline, but wasn't validated after my teammate completed it.

Is it a bug???

Thanks!!!

I think there's a script that's run daily to pick up these tasks that were handed out close to the deadline and assign credit to them.
ID: 74133 · Rating: 0

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,792,601
RAC: 2,014
Message 74161 - Posted: 3 Nov 2012, 23:50:24 UTC - in response to Message 74125.  

hyb_ai_bench_4adyB_SAVE_ALL_OUT_IGNORE_THE_REST_58035_47

My Mac (BOINC 6.12.33) ended with Outcome: Success; Client state: Done; Exit status: 0(0x0) but with the following in the stderr output:

BOINC:: CPU time: 36269.7s, 14400s + 21600s[2012-10-29 7:18:59:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001

The watchdog ended it and I received the default one model/20 credits.

On my wingman's Windows machine the workunit ended with a client error within a few seconds of starting, though it should be noted that all 40 of his most recent tasks have failed, so his failure might not be related to the workunit.

Best,
Snags


I've been getting that crap on and off in my tasks as well. They come out to just 20 credits when the claim is for 197 or so credits.
ID: 74161 · Rating: 0

Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,792,601
RAC: 2,014
Message 74162 - Posted: 3 Nov 2012, 23:54:07 UTC

I'm starting to get a little fed up with this project's stupid errors and other odds and ends of incomplete data files and gzip errors, and with getting only 20 credits for something that should be 150+ credits just because they don't want to upgrade their code to fit the new BOINC Manager program. I also don't understand the lack of communication from this team. They must be hibernating under their desks somewhere or don't know how to write.

ID: 74162 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 74163 - Posted: 4 Nov 2012, 0:38:53 UTC

I would take the existence of this thread to be contrary to your assertions. Please don't attempt to characterize people you've never met.
Rosetta Moderator: Mod.Sense
ID: 74163 · Rating: 0

mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,866,740
RAC: 1,851
Message 74166 - Posted: 4 Nov 2012, 12:19:25 UTC - in response to Message 74163.  

I would take the existence of this thread to be contrary to your assertions. Please don't attempt to characterize people you've never met.


True, but you must agree that these errors have been happening on a regular basis for some people for a LONG time! It is getting frustrating, and even when the Project Admins say they will look into it they say "in a couple of weeks when I have more time". We put our time and energy and MONEY into running our PCs FOR Rosetta and get little to NO help in return when we have problems! You and I had a conversation a while back about how Rosetta is happy with the way things are; things haven't changed, and yet we users are STILL hoping for a change. Some of us BELIEVE in the idea of Rosetta, some are here for the credits and some are here for other reasons, but whatever the reason, when the software 'just works' everywhere else yet works SOOOOO badly here, it is VERY FRUSTRATING for some of us!!!
ID: 74166 · Rating: 0


©2024 University of Washington
https://www.bakerlab.org