Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89830 - Posted: 3 Nov 2018, 23:45:08 UTC - in response to Message 89812.  

Thanks, but after looking at the affected tasks, it looks like the result was discarded & no credit granted.

That said, it's looking more & more like this was a problem with my Dell PERC H710P RAID card. The machine was sluggish as hell with the disk cache write-back enabled, & everything really went south (the machine became unbootable) after I tried a backup. I fiddled with it for days, then finally pulled the backup battery off the card, which disabled the cache, & let it sit overnight. The next morning I reinstalled the card, and it was good to go. I jacked my "use at most" CPUs setting back up to 100% & the machine is still snappy. Back to Munching & Crunching ;)

Thanks for looking this up for me. If I run into problems again, I will try increasing this timeout.

Franko


The error message is displayed by the BOINC Client.
I think it is just a BOINC Client timing issue that they have declared "fixed" several times.
I don't think it is ever a problem, just annoying.

client/app_control.cpp

// Check for finish files every 10 sec.
// If we already found a finish file, abort the app;
// it must be hung somewhere in boinc_finish();
//
static double last_finish_check_time = 0;
if (gstate.clock_change || gstate.now - last_finish_check_time > 10) {
    last_finish_check_time = gstate.now;
    for (i=0; i<active_tasks.size(); i++) {
        ACTIVE_TASK* atp = active_tasks[i];
        if (atp->task_state() == PROCESS_UNINITIALIZED) continue;
        if (atp->finish_file_time) {
            // process is still there 10 sec after it wrote finish file.
            // abort the job
            atp->abort_task(EXIT_ABORTED_BY_CLIENT, "finish file present too long");  // <-- line 140
        } else if (atp->finish_file_present()) {
            atp->finish_file_time = gstate.now;
        }
    }
}
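
To make the timing in that excerpt concrete, here is a small stand-alone C++ sketch of the same two-pass idea. It is not BOINC code: `Task`, `make_task`, `poll_finish_files`, `simulate`, and the `exited_at` field are made-up stand-ins, and the 10-second interval is taken only from the comments in the excerpt.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for BOINC's ACTIVE_TASK; not the real class.
struct Task {
    std::string name;
    double finish_file_written_at;  // when the app wrote its finish file
    double exited_at;               // when the process actually went away
    double finish_file_time;        // first poll that saw the finish file (0 = not seen)
    bool aborted;
};

Task make_task(const std::string& name, double written_at, double exited_at) {
    return Task{name, written_at, exited_at, 0.0, false};
}

// One polling pass at time `now`, mirroring the excerpt: a task whose finish
// file was already seen on an earlier pass and that is still running is aborted.
void poll_finish_files(std::vector<Task>& tasks, double now) {
    for (Task& t : tasks) {
        if (t.aborted || now >= t.exited_at) continue;  // already gone
        if (t.finish_file_time != 0) {
            t.aborted = true;  // "finish file present too long"
        } else if (now >= t.finish_file_written_at) {
            t.finish_file_time = now;  // remember when we first saw the file
        }
    }
}

// Poll every `interval` seconds up to `end` and return the tasks' final state.
std::vector<Task> simulate(std::vector<Task> tasks, double interval, double end) {
    assert(interval > 0);
    for (double now = interval; now <= end; now += interval) {
        poll_finish_files(tasks, now);
    }
    return tasks;
}
```

With a 10-second interval, `simulate({make_task("quick", 5, 8), make_task("slow", 5, 25)}, 10, 60)` leaves "quick" alone and marks "slow" as aborted at the 20-second poll. Under these assumptions, a process that finished its work but takes too long to exit (for example on a hung or overloaded machine) would be the one that gets aborted.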

fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89834 - Posted: 4 Nov 2018, 9:12:58 UTC - in response to Message 89812.  

Dang it, I'm still getting the same error.

I tried to find the file app_control.cpp, but couldn't find it - is this a file I can edit?

Thanks!!

Franko


robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,014
Message 89837 - Posted: 5 Nov 2018, 2:19:58 UTC - in response to Message 89834.  

Dang it, I'm still getting the same error.

I tried to find the file app_control.cpp, but couldn't find it - is this a file I can edit?

Thanks!!

Franko

[snip]

Files with the .cpp extension are usually C++ source files, which can be edited. However, doing so is not useful unless:

1. You have a copy of the file. Most BOINC downloads do not include the source files - you have to know where to find the source files and download the entire package of source files.

2. You know enough C++ to make useful edits.

3. You have all of the compilers installed to compile the entire program for your operating system.

4. You have the instructions to compile all source files needed, and then link them into a new version of the program.

5. You know how to substitute the new version of the program for the old version.

fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89845 - Posted: 6 Nov 2018, 17:48:05 UTC - in response to Message 89837.  

Got it, thanks!!

I spent some more time with this machine running at 100% (32 Rosetta tasks + 1 SETI task on the GPU) & it DID hang occasionally, which would explain this error.

As this is also my daily driver, I backed the "Use at most CPU's" option down to 93.75% (30 of 32 threads) & I haven't seen the problem since.
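
The arithmetic behind that percentage can be checked with a short C++ sketch. It assumes the client applies the percentage to the logical CPU count and truncates, with a floor of one thread; the function names are made up for illustration and are not BOINC's.

```cpp
#include <cassert>
#include <cmath>

// Threads usable under a "use at most N% of the CPUs" setting.
// Assumption: percentage of the logical CPU count, truncated, minimum 1.
int threads_for_pct(int logical_cpus, double pct) {
    assert(logical_cpus > 0 && pct >= 0.0 && pct <= 100.0);
    int n = static_cast<int>(std::floor(logical_cpus * pct / 100.0));
    return n < 1 ? 1 : n;
}

// Percentage to enter in order to leave `free_threads` threads idle.
double pct_to_leave_free(int logical_cpus, int free_threads) {
    assert(logical_cpus > 0 && free_threads >= 0 && free_threads < logical_cpus);
    return 100.0 * (logical_cpus - free_threads) / logical_cpus;
}
```

For a 32-thread machine, `pct_to_leave_free(32, 2)` gives 93.75 and `threads_for_pct(32, 93.75)` gives 30, matching the figures in the post.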

Problem resolved.

Thanks!!

Franko



anklab
Joined: 1 Jun 10
Posts: 1
Credit: 9,599,886
RAC: 127
Message 89926 - Posted: 24 Nov 2018, 14:05:41 UTC

Hi!
Recently, I have noticed that WUs that run for a long time are granted about the same credit as WUs that run for a short time.
For example, my computers:
Intel Core2Duo E8500 and Intel Core i5-2500.

The E8500 gets WUs with 4 hours of crunching, the i5-2500 with 24 hours. It is strange that tasks with such different amounts of work are granted nearly equal credit.

Core i5-2500 // 24 hours // granted 160.33
======================================================
DONE :: 1 starting structures 86255.3 cpu seconds
This process generated 174 decoys from 174 attempts
======================================================

E8500 // 4 hours // granted 152.93
======================================================
DONE :: 1 starting structures 13805.2 cpu seconds
This process generated 22 decoys from 22 attempts
======================================================
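
To put the two results on a common scale, the granted credit can be divided by CPU time and by decoy count. The C++ sketch below does only that arithmetic on the numbers quoted above; the `Result` struct and function names are made up, and it makes no claim about how the project actually computes credit.

```cpp
#include <cassert>

// One finished task, using the figures reported in the post.
struct Result {
    double credit;       // credit granted
    double cpu_seconds;  // "cpu seconds" from the DONE line
    int decoys;          // decoys generated
};

// Credit granted per hour of CPU time.
double credit_per_cpu_hour(const Result& r) {
    assert(r.cpu_seconds > 0);
    return r.credit / (r.cpu_seconds / 3600.0);
}

// Credit granted per decoy produced.
double credit_per_decoy(const Result& r) {
    assert(r.decoys > 0);
    return r.credit / r.decoys;
}
```

With `Result{160.33, 86255.3, 174}` for the i5-2500 and `Result{152.93, 13805.2, 22}` for the E8500, this works out to roughly 6.7 versus 39.9 credits per CPU-hour, and roughly 0.92 versus 6.95 credits per decoy.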


Previously, the i5-2500 received approximately 800~850 credits for each completed WU.
What can I do?

LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 89930 - Posted: 25 Nov 2018, 19:52:03 UTC - in response to Message 89926.  

Previously, the i5-2500 received approximately 800~850 credits for each completed WU.
What can I do?


I'd do nothing for a few days. It appears to have been the recent WUs/scoring that caused a big drop. Mine started to look more typical in the past 24 hours.

[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 44
Credit: 1,258,039
RAC: 0
Message 89949 - Posted: 2 Dec 2018, 12:32:51 UTC - in response to Message 89930.  

Hi

I have tasks erroring after 10 hours of calculation

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
finish file present too long</message>
<stderr_txt>
command: rosetta_4.09_x86_64-apple-darwin -run:protocol jd2_scripting @flags_rb_12_01_955_1018__t000__0_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_12_01_955_1018__t000__0_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3814287
Starting watchdog...
Watchdog active.
======================================================
DONE :: 43 starting structures 28348 cpu seconds
This process generated 43 decoys from 43 attempts
======================================================
BOINC :: WS_max 5.21523e+08

BOINC :: Watchdog shutting down...
12:42:37 (98417): called boinc_finish(0)

</stderr_txt>
]]>


A few did succeed from the same lot after the same amount of calculation time

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
command: rosetta_4.09_x86_64-apple-darwin -run:protocol jd2_scripting @flags_rb_12_01_948_1013__t000__1_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_12_01_948_1013__t000__1_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3810154
Starting watchdog...
Watchdog active.
======================================================
DONE :: 7 starting structures 28105.7 cpu seconds
This process generated 7 decoys from 7 attempts
======================================================
BOINC :: WS_max 9.90781e+08

BOINC :: Watchdog shutting down...
12:37:56 (98460): called boinc_finish(0)

</stderr_txt>
]]>
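
For comparing runs like these two, the summary line in stderr can be parsed and set against the `-cpu_run_time` value from the command line. The following C++ sketch assumes the line always has the shape seen above; `RunSummary`, `parse_done_line`, and `fraction_of_target` are illustrative names, not part of Rosetta or BOINC.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

struct RunSummary {
    int structures;      // number of starting structures
    double cpu_seconds;  // CPU time reported by the app
};

// Parses a line such as "DONE :: 43 starting structures 28348 cpu seconds".
// Returns false if the line does not match that shape.
bool parse_done_line(const std::string& line, RunSummary& out) {
    int n = 0;
    double secs = 0.0;
    if (std::sscanf(line.c_str(), "DONE :: %d starting structures %lf cpu seconds",
                    &n, &secs) != 2) {
        return false;
    }
    out.structures = n;
    out.cpu_seconds = secs;
    return true;
}

// Fraction of the requested -cpu_run_time that was actually used.
double fraction_of_target(const RunSummary& s, double cpu_run_time) {
    assert(cpu_run_time > 0);
    return s.cpu_seconds / cpu_run_time;
}
```

Both tasks above used about 98% of the 28800-second target (28348 and 28105.7 CPU seconds) and both called boinc_finish(0), so on these numbers the failed task looks like it completed its computation before the client reported "finish file present too long".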


robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,014
Message 89956 - Posted: 3 Dec 2018, 22:52:34 UTC - in response to Message 89955.  

I am getting a message of "Abandoned by Project" on too many workunits. With 8 hour workunits this is unacceptable and since I compute in the Gridcoin pool I cannot change my settings.


Could this mean that your computer is so slow that two other computers have finished the workunit before yours does?

Does your computer finish workunits before their deadlines?

Arnav Sood
Joined: 20 Aug 18
Posts: 2
Credit: 11,782,086
RAC: 0
Message 89984 - Posted: 11 Dec 2018, 17:25:27 UTC

Have been unable to upload work units since yesterday (two have timed out). Keeps telling me "project backoff."

I'm on an iMac Pro 2017 running macOS 10.14 Mojave and BOINC 7.12

fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89985 - Posted: 11 Dec 2018, 17:55:31 UTC - in response to Message 89984.  

I just checked my logs back to 12/10 15:00 CST & it looks like I've been uploading continuously, uninterrupted. Win64 BOINC 7.12.1.

Have been unable to upload work units since yesterday (two have timed out). Keeps telling me "project backoff."

I'm on an iMac Pro 2017 running macOS 10.14 Mojave and BOINC 7.12


Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 90000 - Posted: 14 Dec 2018, 19:35:41 UTC
Last modified: 14 Dec 2018, 20:10:20 UTC

I was away from home (of course), and Rosetta took out my i7-4770. Everything was frozen up. I have never seen that before for Rosetta.

Apparently it was this work unit:
https://boinc.bakerlab.org/result.php?resultid=1046921926

<core_client_version>7.12.0</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu @foldit_2006238_0004_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_foldit_2006238_0004_data.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2498717
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.

ERROR: Unable to open database file for dun10 rotamer library: minirosetta_database/rotamer/shapovalov/StpDwn_0-0-0/cys.bbdep.rotamers.lib
ERROR:: Exit from: src/core/pack/dunbrack/RotamerLibrary.cc line: 1085
BACKTRACE:
[0xe8ca514]
[0xca17443]
[0xca178ce]
[0xca92145]
[0xc90133c]
[0xc9ef641]
[0xd019a4b]
[0xd3e6e18]
[0xd3eb9ce]
[0xc96b2d1]
[0xc963eb2]
[0xb7fef3f]
[0xac8f844]
[0x9404246]
[0x9299a6c]
[0xc232777]
[0xc234a84]
[0xc2f46c0]
[0xc2f323b]
[0x929e531]
[0x8054670]
[0xedcf791]
[0xedcf98d]
[0x8266087]
BOINC:: Error reading and gzipping output datafile: default.out
14:21:38 (2187): called boinc_finish(1)

</stderr_txt>

Rosetta is the only project I have running on that machine (limited to six cores, with two cores free); I don't even have a GPU installed.
It probably won't happen again, but once is enough.

EDIT: I updated Ubuntu 16.04, and upon reboot, picked up this in my BOINC log. I have never seen it before, and have no idea what it means.

6	Rosetta@home	12/14/2018 2:51:39 PM	[error] App version has unsupported platform i686-pc-linux-gnu; changing to x86_64-pc-linux-gnu	
7	Rosetta@home	12/14/2018 2:51:39 PM	[error] State file error: duplicate app version: minirosetta x86_64-pc-linux-gnu 378 	
8	Rosetta@home	12/14/2018 2:51:39 PM	[error] App version has unsupported platform i686-pc-linux-gnu; changing to x86_64-pc-linux-gnu	


But everything appears to be back to normal, and Rosetta is running OK now.

Killersocke@rosetta
Joined: 13 Nov 06
Posts: 29
Credit: 2,579,125
RAC: 0
Message 90001 - Posted: 14 Dec 2018, 20:37:32 UTC

To my surprise, my profile shows 24 tasks sent to my PC.
In reality I have 10 in my BOINC Manager.
What's going on there?

Application Rosetta 4.07
Name foldit_2006238_0005_fold_and_dock_SAVE_ALL_OUT_707998_5433
Status Suspended by user
received

Application Rosetta 4.07
Name foldit_2006238_0002_fold_and_dock_SAVE_ALL_OUT_707992_5434
Status Suspended by user
received

Application Rosetta 4.07
Name foldit_2006254_0004_fold_and_dock_SAVE_ALL_OUT_708044_5432
Status Suspended by user
received
slots/2

Application Rosetta 4.07
Name foldit_2006238_0003_fold_and_dock_SAVE_ALL_OUT_707994_5434
Status Suspended by user received
slots/7

Application Rosetta 4.07
Name foldit_2006238_1059_fold_and_dock_SAVE_ALL_OUT_708020_5431
Status Suspended by user received
slots/5

Application Rosetta 4.07
Name foldit_2006238_1059_fold_and_dock_SAVE_ALL_OUT_708020_4988
Status Suspended by user received
slots/4

Application Rosetta 4.07
Name foldit_2006254_0002_fold_and_dock_SAVE_ALL_OUT_708040_5432
Status Suspended by user received
slots/3

Application Rosetta 4.07
Name foldit_2006254_0003_fold_and_dock_SAVE_ALL_OUT_708042_5432
Status Active received
slots/6

Application Rosetta 4.07
Name foldit_2006238_0004_fold_and_dock_SAVE_ALL_OUT_707996_5434
Status Active received
slots/11

Application Rosetta 4.07
Name foldit_2006238_0005_fold_and_dock_SAVE_ALL_OUT_707998_5434
Status Active received
slots/13

jjch
Joined: 10 Nov 13
Posts: 14
Credit: 441,128,699
RAC: 19,444
Message 90006 - Posted: 16 Dec 2018, 18:38:59 UTC - in response to Message 90001.  
Last modified: 16 Dec 2018, 18:41:29 UTC

I think I may be experiencing a similar issue.

Recently I noted the work in progress value appeared to be approximately double the normal amount of work units I have running at a time.

To troubleshoot this, I set Rosetta to no new tasks and let the existing ones run out. Checking Boincstats, I no longer have any work left on any host.

According to Rosetta I currently have a total of 1709 tasks in progress. For example, host 1770544 is not running any Rosetta tasks, yet its In progress count is 216.

https://boinc.bakerlab.org/rosetta/results.php?hostid=1770544&offset=0&show_names=0&state=1&appid=

I did try resetting the project on that host but it didn't make any difference. My impression is that there is a problem on the Rosetta server side and it isn't updating the task status properly.

I think we need the Rosetta programming team to look into this further.

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 90007 - Posted: 16 Dec 2018, 19:50:52 UTC - in response to Message 90006.  

According to Rosetta I currently have a total of 1709 tasks in progress. For example, host 1770544 is not running any Rosetta tasks, yet its In progress count is 216.

This is interesting. I have 8 in progress, but Rosetta "In progress" shows 11.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3510039&offset=0&show_names=0&state=1&appid=
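
A mismatch like this can be narrowed down by taking the set difference between the task names on the server's "In progress" page and the names shown locally. The C++ sketch below assumes both lists have been copied out by hand; `missing_locally` is a made-up helper and no BOINC API is involved.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Returns the task names the server lists as in progress that the local
// client does not have. The inputs do not need to be sorted.
std::vector<std::string> missing_locally(std::vector<std::string> server,
                                         std::vector<std::string> local) {
    std::sort(server.begin(), server.end());
    std::sort(local.begin(), local.end());
    std::vector<std::string> out;
    std::set_difference(server.begin(), server.end(),
                        local.begin(), local.end(),
                        std::back_inserter(out));
    assert(out.size() <= server.size());  // can never exceed the server list
    return out;
}
```

Feeding it 11 server-side names and 8 local names would return the 3 that exist only on the server, which can then be searched for in the BOINC log.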

It is the oldest three that are missing. That isn't a big difference, so I thought I would take a look in the BOINC log. I see the following curious entry for the oldest one (but it is the only one I see):

43	Rosetta@home	12/14/2018 3:52:05 PM	[error] Can't parse file info in scheduler reply: file name is empty or has '..'	
44	Rosetta@home	12/14/2018 3:52:05 PM	[error] Can't parse file info in scheduler reply: file name is empty or has '..'	
46	Rosetta@home	12/14/2018 3:52:05 PM	[error] State file error: missing file r1_r1_ems_3hC_984_0002_000000007_0001_0001_0001_23_41_H_.._EHEE_10482_0001_0001_0001_0001_15_38_H_.._DHR70_DHR15_l2_t3_t2_D20_D25_ct21_nTerm_3x_r8_0001_0001_0001_0001_0002_0001_0001_0001_0001_fragments_data.zip	
47	Rosetta@home	12/14/2018 3:52:05 PM	[error] State file error: missing input file r1_r1_ems_3hC_984_0002_000000007_0001_0001_0001_23_41_H_.._EHEE_10482_0001_0001_0001_0001_15_38_H_.._DHR70_DHR15_l2_t3_t2_D20_D25_ct21_nTerm_3x_r8_0001_0001_0001_0001_0002_0001_0001_0001_0001_fragments_data.zip	
48	Rosetta@home	12/14/2018 3:52:05 PM	[error] Can't handle task r1_r1_ems_3hC_984_0002_000000007_0001_0001_0001_23_41_H_.._EHEE_10482_0001_0001_0001_0001_15_38_H_.._DHR70_DHR15_l2_t3_t2_D20_D25_ct21_nTerm_3x_r8_0001_0001_0001_0001_0002_0001_0001_0001_0001_fragment_706193_213 in scheduler repl	
49	Rosetta@home	12/14/2018 3:52:05 PM	[error] State file error: missing task r1_r1_ems_3hC_984_0002_000000007_0001_0001_0001_23_41_H_.._EHEE_10482_0001_0001_0001_0001_15_38_H_.._DHR70_DHR15_l2_t3_t2_D20_D25_ct21_nTerm_3x_r8_0001_0001_0001_0001_0002_0001_0001_0001_0001_fragment_706193_213	
50	Rosetta@home	12/14/2018 3:52:05 PM	[error] Can't handle task r1_r1_ems_3hC_984_0002_000000007_0001_0001_0001_23_41_H_.._EHEE_10482_0001_0001_0001_0001_15_38_H_.._DHR70_DHR15_l2_t3_t2_D20_D25_ct21_nTerm_3x_r8_0001_0001_0001_0001_0002_0001_0001_0001_0001_fragment_706193_213_1 in scheduler re	
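
Going only by the wording of the first error ("file name is empty or has '..'"), the rule being applied can be sketched in a few lines of C++. This is a guess at the check, not code taken from the BOINC client, and both function names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Guess at the rule behind the error text: reject names that are empty or
// contain two consecutive dots (a common guard against path traversal).
bool file_name_looks_acceptable(const std::string& name) {
    return !name.empty() && name.find("..") != 0 &&
           name.find("..") == std::string::npos;
}

// Position of the first ".." in a name that contains one, e.g. for logging.
std::size_t first_dotdot(const std::string& name) {
    std::size_t pos = name.find("..");
    assert(pos != std::string::npos);  // caller must pass a name containing ".."
    return pos;
}
```

Under this reading, the fragments_data.zip name in the log would be rejected because of the "_H_.._EHEE" part, while a name such as input_rb_12_01_955_1018__t000__0_C1_robetta.zip would pass. That would be consistent with the follow-on "missing input file" and "Can't handle task" errors for the same workunit.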


Maybe someone can figure it out.

Killersocke@rosetta
Joined: 13 Nov 06
Posts: 29
Credit: 2,579,125
RAC: 0
Message 90008 - Posted: 16 Dec 2018, 22:44:45 UTC

I'm scared
I see 27 tasks with Status given up
They are all from December 14th

Killersocke@rosetta
Joined: 13 Nov 06
Posts: 29
Credit: 2,579,125
RAC: 0
Message 90009 - Posted: 17 Dec 2018, 0:01:36 UTC

Sorry Guys
these are my time, my money and my costs
So i will stop Rosetta now

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 90010 - Posted: 17 Dec 2018, 2:01:19 UTC - in response to Message 90009.  

I don't see a problem with your completion rate. Everything looks pretty good.
You may just see a status problem.

Killersocke@rosetta
Joined: 13 Nov 06
Posts: 29
Credit: 2,579,125
RAC: 0
Message 90011 - Posted: 17 Dec 2018, 8:20:28 UTC - in response to Message 90010.  

I don't see a problem with your completion rate. Everything looks pretty good.
You may just see a status problem.


Sorry, but this is not my problem.

Sid Celery
Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 10,612
Message 90016 - Posted: 17 Dec 2018, 16:20:32 UTC - in response to Message 90009.  

Sorry Guys
these are my time, my money and my costs
So i will stop Rosetta now

I've got a similar problem - just posted somewhere else.
Having evaluated what's happened, no time was involved, no download took place and no costs were incurred.
Maybe 7 seconds of processing time were affected per download - once every few hours - but I'm not sure it was in place of anything else.
The only problem for users seems to be a mismatch between the online list of your tasks and what shows in your offline task list.
I suspect you wasted more energy clicking reply, typing 17 words and clicking Post reply.

jjch
Joined: 10 Nov 13
Posts: 14
Credit: 441,128,699
RAC: 19,444
Message 90025 - Posted: 18 Dec 2018, 19:10:02 UTC
Last modified: 18 Dec 2018, 19:11:04 UTC

From what I can tell these work units were cancelled but the status remained In progress.

If you check the Workunit under errors you will see WU cancelled.

For example: https://boinc.bakerlab.org/workunit.php?wuid=942284714

I don't think there is anything major to worry about; it's just an annoyance. It's not likely you lost any compute cycles either.

The Rosetta programming team should clean this up if possible; however, I think these entries will disappear after the deadline expires.

For now I have stopped all Rosetta computing until after Dec 23rd to see if this is true. FYI, I am giving WCG cycles in the meantime.

©2024 University of Washington
https://www.bakerlab.org