Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Andrew Sanchez
Joined: 25 Nov 14
Posts: 2
Credit: 4,008
RAC: 0
Message 77697 - Posted: 28 Nov 2014, 6:24:41 UTC - in response to Message 77695.  

I have 3 tasks right now that are kinda weird.

rb_11_07_51583_97466_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_224683_29047_1

rb_10_28_50725_96287_ab_stage0_h003___robetta_IGNORE_THE_REST_04_09_223752_4_2

rb_11_24_51954_97708__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_226953_21_1

In BOINC Manager they have a deadline of 12/11/14, but on my "tasks" page here in Rosetta they show a deadline of 12/12/14. Maybe it's just some UTC conversion thing I've never noticed before. But there's also this coincidence: these tasks seem to have a history of errors and "no reply" from their previous crunchers. I'm new to Rosetta, so maybe this is just something that happens all the time, I don't know.
Just thought I'd post this info here. We'll see what happens when I start to crunch them.


Well, they look like they ran successfully. I guess the date thing is just normal and the tasks were resends or something.
ID: 77697 · Rating: 0
TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 77698 - Posted: 28 Nov 2014, 13:13:23 UTC - in response to Message 77691.  

I've noticed a couple of my boxes have gone idle, apparently at random, recently... it appears this is the cause:

765: 25-Nov-2014 13:57:26 (low) [rosetta@home] Sending scheduler request: To report completed tasks.
766: 25-Nov-2014 13:57:26 (low) [rosetta@home] Reporting 2 completed tasks
767: 25-Nov-2014 13:57:26 (low) [rosetta@home] Requesting new tasks for CPU
768: 25-Nov-2014 13:57:29 (low) [rosetta@home] Scheduler request completed: got 0 new tasks
769: 25-Nov-2014 13:57:29 (low) [rosetta@home] No work sent
770: 25-Nov-2014 13:57:29 (low) [rosetta@home] Rosetta Mini for Android is not available for your type of computer.


This then puts the time for the next scheduler request off until at least the next day. I forgot to look for a specific time before I manually did a 'boinccmd --project update' but it's at least 24 hours.


I have noticed this too, since last week. Once I entered the room and there was no noise of running fans. Checking showed all Rosetta WUs ready to report and the project backed off for more than 16 hours. After manually updating, everything ran smoothly again. This has now happened 3 times on two different rigs.
Greetings,
TJ.
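
The pattern in the log excerpt above (a work request answered with "got 0 new tasks", after which the client backs off) can be spotted automatically. What follows is only a rough sketch: the function name and the sample list are made up for illustration, and the regular expression simply matches the message text quoted in this thread, which is not guaranteed to be a stable log format.

```python
import re

# Message text copied from the log lines quoted in this thread; treating it as
# a fixed format is an assumption.
NO_WORK = re.compile(
    r"\[(?P<project>[^\]]+)\] Scheduler request completed: got 0 new tasks"
)


def projects_without_work(log_lines):
    """Return the project names whose scheduler request returned no tasks."""
    found = []
    for line in log_lines:
        match = NO_WORK.search(line)
        if match and match.group("project") not in found:
            found.append(match.group("project"))
    return found


sample = [
    "767: 25-Nov-2014 13:57:26 (low) [rosetta@home] Requesting new tasks for CPU",
    "768: 25-Nov-2014 13:57:29 (low) [rosetta@home] Scheduler request completed: got 0 new tasks",
    "769: 25-Nov-2014 13:57:29 (low) [rosetta@home] No work sent",
]
print(projects_without_work(sample))
```

If a project shows up here while the queue is empty, a manual update clears the back-off, as both posters found. With the command-line tool that is `boinccmd --project <project URL> update`.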
ID: 77698 · Rating: 0
Mod.Sense
Volunteer moderator
Project administrator
Joined: 22 Aug 06
Posts: 3520
Credit: 0
RAC: 0
Message 77699 - Posted: 28 Nov 2014, 15:17:50 UTC - in response to Message 77697.  

Well, they look like they ran successfully. I guess the date thing is just normal and the tasks were resends or something.


Yes, the website shows all pending tasks with date and time in UTC. Your local BOINC Manager shows them in the local time zone of your machine.
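
To see how the same deadline can land on two different calendar dates, here is a small sketch. The time of day and the UTC-8 offset are both hypothetical, since the thread gives only the two dates; substitute your own values.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical deadline: the thread gives only the dates, not the exact time.
deadline_utc = datetime(2014, 12, 12, 3, 0, tzinfo=timezone.utc)

# Hypothetical local zone at UTC-8; substitute your own offset.
local_zone = timezone(timedelta(hours=-8))
deadline_local = deadline_utc.astimezone(local_zone)

print(deadline_utc.strftime("%m/%d/%y %H:%M UTC"))      # as shown on the website
print(deadline_local.strftime("%m/%d/%y %H:%M local"))  # as shown in BOINC Manager
```

With these assumed values the website date is 12/12/14 and the local date is 12/11/14, which matches the one-day difference reported above.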
Rosetta Moderator: Mod.Sense
ID: 77699 · Rating: 0
Killersocke@rosetta
Joined: 13 Nov 06
Posts: 26
Credit: 916,358
RAC: 3,561
Message 77703 - Posted: 30 Nov 2014, 12:38:31 UTC
Last modified: 30 Nov 2014, 12:39:03 UTC

Task ID 703335720
Created 29 Nov 2014 21:35:01 UTC

Outcome Client error
Client state Compute error
Exit status -1073740940 (0xffffffffc0000374)

Workunit: 636829405


stderr out
<core_client_version>7.4.27</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073740940 (0xc0000374)
</message>
<stderr_txt>
[2014-11-30 10:31:59:] :: BOINC:: Initializing ... ok.
[2014-11-30 10:31:59:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: projects/boinc.bakerlab.org_rosetta/minirosetta_3.52_windows_x86_64.exe @141112.2.5L_8H_C4_5_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_141112.2.5L_8H_C4_5_data.zip -nstruct 10000 -cpu_run_time 21600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1325984
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_3d2618f.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/fold_and_dock_141112.2.5L_8H_C4_5_data.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.

</stderr_txt>
]]>
Validate state Invalid
ID: 77703 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 970
Credit: 19,898,209
RAC: 14,257
Message 77713 - Posted: 2 Dec 2014, 16:11:23 UTC

rb_11_26_50854_96408__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_227069_1754_0
[quote]ERROR: Fullatom mismatch in checkpointer.
ERROR:: Exit from: ..\..\..\src\protocols\checkpoint\CheckPointer.cc line: 379
[/quote]
I shouldn't really complain. I was granted 50% more than claimed credit, but obviously something's gone wrong there...
ID: 77713 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 862
Credit: 3,140,018
RAC: 1,764
Message 77715 - Posted: 3 Dec 2014, 6:56:04 UTC

On my AMD FX 6300 (Win7), today:
03/12/2014 07:54:02 | rosetta@home | Requesting new tasks for CPU and AMD/ATI GPU
03/12/2014 07:54:05 | rosetta@home | Scheduler request completed: got 0 new tasks
03/12/2014 07:54:05 | rosetta@home | No work sent
03/12/2014 07:54:05 | rosetta@home | Rosetta Mini for Android is not available for your type of computer.

ID: 77715 · Rating: 0
Killersocke@rosetta
Joined: 13 Nov 06
Posts: 26
Credit: 916,358
RAC: 3,561
Message 77716 - Posted: 3 Dec 2014, 13:05:01 UTC

Task ID: 704019018

Name: 1L-14H-3L-6E-3L-14H-2L-6E-1L_1-2.P.0_0003_fold_SAVE_ALL_OUT_226560_148_0

Workunit: 637464884

Outcome Client error
Client state Compute error
Exit status -1073741819 (0xffffffffc0000005)

Details:

Details 637464884
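
The negative exit statuses in these two reports are Windows NTSTATUS codes printed as signed 32-bit integers. A minimal sketch of the conversion follows. The lookup table holds only the two values seen in this thread, and the names describe the kind of crash, not its underlying cause.

```python
# Well-known Windows NTSTATUS values; only the two seen in this thread are listed.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000374: "STATUS_HEAP_CORRUPTION",
}


def decode_exit_status(status):
    """Convert a signed 32-bit exit status to its unsigned hex form and name."""
    unsigned = status & 0xFFFFFFFF  # reinterpret as unsigned 32-bit
    return f"0x{unsigned:08X}", KNOWN_NTSTATUS.get(unsigned, "unknown")


print(decode_exit_status(-1073740940))  # first report (Task ID 703335720)
print(decode_exit_status(-1073741819))  # second report (Task ID 704019018)
```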


ID: 77716 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 970
Credit: 19,898,209
RAC: 14,257
Message 77724 - Posted: 9 Dec 2014, 15:28:10 UTC

Two errors within tasks - one on behalf of someone else

Error 1:
141111.4.3L_6H_C2_4_fold_and_dock_SAVE_ALL_OUT_224959_4892_0
ERROR: total_residue() != 0
ERROR:: Exit from: ..\..\..\src\core\pose\Pose.cc line: 1194

ERROR: Energies:: operation NOT permitted during scoring.
ERROR:: Exit from: ..\..\..\src\core\scoring\Energies.cc line: 372
std::cerr: Exception was thrown:


[ERROR] EXCN_utility_exit has been thrown from: ..\..\..\src\core\scoring\Energies.cc line: 372
ERROR: Energies:: operation NOT permitted during scoring.


Error 2
11.19.14.11.BL_tetramer_tyr_net3_fold_and_dock_SAVE_ALL_OUT_225832_4188_0
ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 889
...
No heartbeat from core client for 30 sec - exiting
[2014-12- 9 1:48: 6:] :: BOINC:: Initializing ... ok.
[2014-12- 9 1:48: 6:] :: BOINC :: boinc_init()
...
ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 889
BOINC:: CPU time: 43269.3s, 14400s + 28800s[2014-12- 9 7:30:48:] :: BOINC
InternalDecoyCount: 79
======================================================
DONE :: 2 starting structures 43269.3 cpu seconds
This process generated 79 decoys from 79 attempts
======================================================
called boinc_finish

That 2nd one is mine. It ran until the watchdog cut in after running without checkpointing for around 6 hours. It only seemed to give credit for 6 hours work instead of 8 (usual runtime) or 12 (when watchdog cut in).
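
For reference, the seconds in the "CPU time" log line convert to hours as sketched below. Reading 28800 s as the 8-hour runtime and 14400 s as the watchdog's extra allowance is an assumption based on the 8 h and 12 h figures mentioned in this post. The log itself does not label the two numbers.

```python
# Figures from the log line "CPU time: 43269.3s, 14400s + 28800s".
# Which number is the runtime preference and which is the watchdog allowance
# is an assumption; the log does not say.
cpu_seconds = 43269.3
limit_seconds = 14400 + 28800

print(limit_seconds / 3600)          # limit in hours
print(round(cpu_seconds / 3600, 2))  # CPU time actually used, in hours
print(cpu_seconds > limit_seconds)   # whether the task ran past the limit
```

On that reading the task ran just past a 12-hour limit, which fits the watchdog ending it.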
ID: 77724 · Rating: 0
Killersocke@rosetta
Joined: 13 Nov 06
Posts: 26
Credit: 916,358
RAC: 3,561
Message 77755 - Posted: 23 Dec 2014, 18:49:19 UTC

Today I've got 16 files in a row with issues.
ID: 77755 · Rating: 0
Usuario1_S
Joined: 24 Mar 14
Posts: 92
Credit: 3,059,705
RAC: 0
Message 77775 - Posted: 29 Dec 2014, 14:59:49 UTC - in response to Message 77755.  
Last modified: 29 Dec 2014, 15:07:01 UTC

12/29/14 00:24:07 | rosetta@home | [error] Signature verification failed for minirosetta_database_3d2618f.zip

I get this after a few hours or a few days. Then all the WUs that were crunching stop, along with the rest of the WUs waiting in the queue, with a 'Computation Error' message. It's pretty random. I've gone back to BOINC 6.13.12 and the errors continue; everything else works fine on my PC.

I have checked my PC:
I have checked my RAM with the Win7 Memory Diagnostic (Extended test, which includes no CPU cache testing): 2 passes, and my 8GB of DDR3-1333 RAM is OK. I checked the Event Log and found no hard drive failures. I have pagefile.sys on another partition; I moved the file to a different partition temporarily, filled its previous partition with zeros, did a full format, and returned it to that partition. I ran chkdsk on all my partitions twice, and I even defragmented C: with PerfectDisk to check the read/write, with no errors. I have no overclock on the CPU, only on the GPU, but I only use the CPU for BOINC. So I'm pretty sure it is not hardware; otherwise the SMART check at boot time and the Windows event log would show an HDD or some other hardware-related error.

So it has to be your WUs. I'm getting tired of this. I never had this when I was on World Community Grid. If this keeps happening I'm going back to it, where at least I'll be helping do useful research.
ID: 77775 · Rating: 0
Mod.Sense
Volunteer moderator
Project administrator
Joined: 22 Aug 06
Posts: 3520
Credit: 0
RAC: 0
Message 77787 - Posted: 31 Dec 2014, 16:11:08 UTC - in response to Message 77775.  

So it has to be your WUs.


That is possible, but I've not seen reports of such problems from others. Since the downloads are the same for everyone, any task requiring that file would fail the same way for everyone.

It sounds like the symptoms that can result from an anti-virus application modifying the file, or a firewall modifying the download. Which AV do you use? Have you whitelisted R@h?

Rosetta Moderator: Mod.Sense
ID: 77787 · Rating: 0
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 77788 - Posted: 1 Jan 2015, 4:29:04 UTC

Hash failures on a downloaded file? Does your internet service compress the data (used to be typical on cellular internet)? Are you having any internet or router problems that would cause a relatively large file to fail with a connection reset therefore truncating the file? Any other atypical internet connection possibilities, like perhaps a proxy server? Just brainstorming here.
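
One way to test the "file was altered or truncated in transit" theory is to fingerprint the downloaded file and compare it with a copy from a machine that runs the tasks without errors. The sketch below is an illustration only. It does not reproduce BOINC's own signature verification, the function names are made up, and MD5 is used merely as a quick fingerprint (any hash would serve).

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when two files have the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)
```

For example, `same_content` could be pointed at minirosetta_database_3d2618f.zip on the failing host and at the same file copied from a working host. A mismatch would suggest the file was changed after or during download.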
ID: 77788 · Rating: 0
Ron
Joined: 25 Dec 14
Posts: 1
Credit: 632,941
RAC: 485
Message 77789 - Posted: 1 Jan 2015, 15:17:50 UTC

I'm a newbie here. The last few days I've gotten a notice that says, "Message from account manager: Can't resolve host name."
Yet the program seems to be running normally. What does this mean, and what should I do?
ID: 77789 · Rating: 0
Timo
Joined: 9 Jan 12
Posts: 178
Credit: 30,826,092
RAC: 26,440
Message 77791 - Posted: 1 Jan 2015, 15:23:25 UTC

You may also want to check that your date, time, and timezone are set correctly on your PC and even on any routers you may be using, as this can cause issues with certain security protocols, SSL, etc. Also just brainstorming!
ID: 77791 · Rating: 0
Timo
Joined: 9 Jan 12
Posts: 178
Credit: 30,826,092
RAC: 26,440
Message 77792 - Posted: 1 Jan 2015, 15:25:49 UTC - in response to Message 77789.  

I'm a newbie here. The last few days I've gotten a notice that says, "Message from account manager: Can't resolve host name."
Yet the program seems to be running normally. What does this mean, and what should I do?

Sounds like a DNS issue. You may want to flush your DNS cache.
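
A quick way to check whether a host name resolves from the affected machine is sketched below. The function name and the `resolver` parameter are made up for this example, and the post does not say which account manager host is failing, so the name in the comment is a placeholder.

```python
import socket


def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if the host name resolves; `resolver` can be swapped for testing."""
    try:
        resolver(host, None)
        return True
    except socket.gaierror:
        return False


# Example call (placeholder host name, replace with your account manager's):
# can_resolve("account-manager.example")
```

If this returns False while other sites resolve fine, flushing the cache is worth a try. On Windows that is `ipconfig /flushdns` from a command prompt.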
ID: 77792 · Rating: 0
Mod.Sense
Volunteer moderator
Project administrator
Joined: 22 Aug 06
Posts: 3520
Credit: 0
RAC: 0
Message 77793 - Posted: 1 Jan 2015, 17:03:06 UTC - in response to Message 77792.  

I'm a newbie here. The last few days I've gotten a notice that says, "Message from account manager: Can't resolve host name."
Yet the program seems to be running normally. What does this mean, and what should I do?

Sounds like a DNS issue. You may want to flush your DNS cache.


It can also mean that the BOINC Manager attempted to contact the project at a time when you were disconnected from the internet. It automatically retries until it gets through. You can set preferences that define the hours of the day when the BOINC Manager should perform its internet access, if there are times when access is more expensive or more likely to slow other things you would like to do with your internet connection. You can also set up bandwidth limits and caps so BOINC is less likely to noticeably slow your other usage of the internet.
Rosetta Moderator: Mod.Sense
ID: 77793 · Rating: 0
Oldman
Joined: 17 Oct 06
Posts: 4
Credit: 1,069,527
RAC: 347
Message 77910 - Posted: 10 Feb 2015, 15:22:22 UTC

2/10/2015 10:03:51 AM | | [error] Can't create HTTP response output file projects/boinc.bakerlab.org_rosetta/minirosetta_3.52_windows_x86_64.exe

I'm having download problems with Rosetta only.
ID: 77910 · Rating: 0
Questor
Joined: 3 Mar 06
Posts: 5
Credit: 470,067
RAC: 419
Message 77918 - Posted: 12 Feb 2015, 9:52:07 UTC


I have been running Rosetta@home since '06, but I think I am done with this project for the foreseeable future.

I just bought a new PC 3 months ago, and loaded it up with all the programs I usually run, including BOINC.

It started out with the occasional BSOD, but I thought "It's Windows, it happens". But it got worse, and worse, and worse. It got to the point where the PC would not run for more than 15 minutes, if I could even get into Windows.

I spent a lot of time trying to diagnose the problem: testing hardware, changing drivers, and installing, upgrading and trying different firewall/antivirus programs to find out what was killing my PC. I even once wiped everything and started over from scratch.

Errors kept pointing to hard drive failure, but repeated tests kept showing the drives were perfectly fine.

Finally, at a loss, I asked for help and someone pointed out to me that while I had tested the drives for physical/mechanical failure, I had not tested the actual data integrity.

So I did, and what an eye-opener it was.

Hundreds and hundreds of file sorting errors, incorrect index entries and orphaned file fragments showed up. And every bloody single one of them was part of a Rosetta@home work unit.

Every Rosetta work unit I processed was like a round of grapeshot fired through my hard drive's data. No wonder I was having so many problems.

I have never had a moment's problem with Seti@home, so I am not certain why Rosetta@home is such a continual source of user aggravation.

So, until I can be reasonably certain that processing future Rosetta@home work units is not the equivalent of intentionally giving oneself a virus to slowly corrupt one's own hard drive data, I think I am unlinking the project.

I am running Windows 7 Home edition 64 bit, on an Athlon II X4 630 Socket AM3, Foxconn motherboard with an AMD 760G chipset and 6 gigs of memory.

BOINC is set to 'Run Always' and 'Suspend GPU' but is disabled from auto-starting. Version 7.4.36 (x64). Projects are Rosetta@home and Seti@home, 50/50 split.

Hopefully this will help someone in diagnosing the problem, but I would not be holding my breath for someone from Rosetta@home to take the time to actually look into this either.
ID: 77918 · Rating: 0
Mod.Sense
Volunteer moderator
Project administrator
Joined: 22 Aug 06
Posts: 3520
Credit: 0
RAC: 0
Message 77920 - Posted: 12 Feb 2015, 15:50:30 UTC
Last modified: 12 Feb 2015, 15:50:41 UTC

I had tested the drives for physical/mechanical failure, I had not tested the actual data integrity.


What did you do to test the data integrity?
Rosetta Moderator: Mod.Sense
ID: 77920 · Rating: 0
Questor
Joined: 3 Mar 06
Posts: 5
Credit: 470,067
RAC: 419
Message 77942 - Posted: 14 Feb 2015, 17:07:10 UTC
Last modified: 14 Feb 2015, 17:21:46 UTC

I had used the Windows 7 built-in test, Western Digital's Data Lifeguard and Seagate's SeaTools to check the drives for physical defects/failure. I used multiple tests because most of my BSOD errors kept pointing to the drives as the reason for the system crashing.

I had also run 'SFC /scannow' under CMD to test the Windows files for damage.

It was only when someone suggested running 'CHKDSK /R' under CMD (which examines the drive for physical errors as well as checking the data contents) that the indexing errors, orphaned file fragments, et al. showed up.

Since the drive's data integrity was cleaned up/repaired by CHKDSK, the system has been stable.

There were, as I mentioned in the previous post, quite literally hundreds of entries needing to be repaired/removed, and all were part of Rosetta@home work units. There were multiple entries for each work unit, as they seemed to be unzipped/expanded into multiple similarly named files with different extensions such as .SYM .WTS .CSV .PSSM .PARAM .LIN .OUT .PAP .TAB and many others I was unable to catch as the error data was scrolling constantly across the screen.
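
A file-level check can complement a file system scan here. The work-unit and database downloads are zip archives (the stderr output earlier in this thread shows them being unpacked), and every zip member carries a stored CRC that can be re-verified. The sketch below is only an illustration with a made-up function name. It checks the archive as downloaded, not the files already unpacked from it, and it says nothing about the file system's own index.

```python
import zipfile


def check_zip(zip_path):
    """Return 'ok', or a short description of what failed."""
    try:
        with zipfile.ZipFile(zip_path) as archive:
            bad = archive.testzip()  # reads every member and verifies its CRC
    except zipfile.BadZipFile as error:
        return f"not a valid zip file: {error}"
    return "ok" if bad is None else f"CRC mismatch in {bad}"
```

It could be run against, for example, projects/boinc.bakerlab.org_rosetta/minirosetta_database_3d2618f.zip inside the BOINC data directory. An "ok" result alongside CHKDSK errors would suggest the damage happened on disk after unpacking rather than in the download.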
ID: 77942 · Rating: 0



©2019 University of Washington
http://www.bakerlab.org