Problems and Technical Issues with Rosetta@home

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92274 - Posted: 25 Mar 2020, 15:16:04 UTC

rlpm, your host profile shows 256MB of memory. And the "mini" tasks require just as much memory as any others. They seem to have moved the documentation on minimum host requirements on the R@h website, so I'm not finding it at the moment. But the basic guideline is 1GB of memory per active CPU core.
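As a rough illustration of the 1GB-per-core guideline, here is a minimal Python sketch. The function name and the sample figures are hypothetical and not taken from the Rosetta@home documentation; it only shows the arithmetic, and it ignores memory used by the operating system and other programs.

```python
def max_concurrent_tasks(total_ram_gb: float, cpu_cores: int,
                         gb_per_task: float = 1.0) -> int:
    """Estimate how many tasks fit in memory at once.

    Hypothetical helper: the 1 GB default mirrors the guideline quoted
    above. Real tasks may need more or less than this.
    """
    if gb_per_task <= 0:
        raise ValueError("gb_per_task must be positive")
    # Number of tasks that fit by memory alone, rounded down.
    by_memory = int(total_ram_gb // gb_per_task)
    # Never run more tasks than there are cores.
    return max(0, min(cpu_cores, by_memory))


# A host with 256MB of memory cannot fit even one task under this guideline.
print(max_concurrent_tasks(0.25, 1))  # 0
# A 2-core host with 4GB has room for a task on each core.
print(max_concurrent_tasks(4, 2))     # 2
```

A host that comes out at zero here is probably better suited to a project with a smaller memory footprint.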

I might suggest that you attach the machine to World Community Grid. They have a number of bioscience projects running there, and generally can run in a smaller memory footprint.
Rosetta Moderator: Mod.Sense
ID: 92274 · Rating: 0
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92279 - Posted: 25 Mar 2020, 16:40:34 UTC - in response to Message 92274.  

Thanks Mod.Sense.
It would be nice if BOINC automatically failed early, perhaps even at project attachment, if the host doesn't meet the minimum requirements for any app (RAM, disk, instruction set, OS).
I already have my old 1st gen RasPis crunching on TN-Grid (gene sequencing) via BOINC, so I'll do the same with this AppleTV.
ID: 92279 · Rating: 0
bormolino
Joined: 16 May 13
Posts: 4
Credit: 160,977
RAC: 0
Message 92292 - Posted: 25 Mar 2020, 20:11:24 UTC

The graphics for the Rosetta 4.07 COVID-19 WU do not work. The graphics window shows "Stage unknown" and "No shared mem".

The graphics of the other WUs are working without any problems.
ID: 92292 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 6,141
Message 92355 - Posted: 26 Mar 2020, 19:29:06 UTC

I've seen the Rosetta stats for the number of new users who've come on board recently - basically quadrupled with massive throughput, which is great.
The number of in-progress tasks is similarly huge - well over a million - more than I can ever remember seeing.

A little earlier this afternoon I saw my buffers were smaller than usual and noticed that a few calls for new tasks had brought none down. This is hardly surprising.

Before I finally got to this page to mention the task shortage, more had come on stream, which is great.

I guess all I'm saying is, especially with all the new users around, if there's an interruption in task supply in the coming days/weeks, we (more accurately, I) need to have a little patience and understanding. It's going to happen and it's surprising it hasn't happened already.

Great job on keeping the tasks coming through - thanks.
ID: 92355 · Rating: 0
Shaky Jake

Joined: 26 Mar 07
Posts: 2
Credit: 55,684
RAC: 0
Message 92455 - Posted: 28 Mar 2020, 13:58:41 UTC - in response to Message 80621.  

I have an older desktop computer with a Pentium Duo CPU that is having a problem with the COVID-19 workunits. They are erroring out at about 2 min.

EXAMPLE:

Task 1134452442
Name 0ef4jx8h_jhr_design1_COVID-19_SAVE_ALL_OUT_903439_1_0

Workunit 1021756085
Created 27 Mar 2020, 9:12:21 UTC
Sent 27 Mar 2020, 9:38:35 UTC
Report deadline 4 Apr 2020, 9:38:35 UTC
Received 28 Mar 2020, 12:10:42 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 11 (0x0000000B) Unknown error code
Computer ID 3794680
Run time 2 min 15 sec
CPU time 1 min 59 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 1.87 GFLOPS
Application version Rosetta v4.08 x86_64-pc-linux-gnu
Stderr output

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0ef4jx8h_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0ef4jx8h_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3902678
Starting watchdog...
Watchdog active.

</stderr_txt>
]]>

I have seen a couple that did complete and were validated.

EXAMPLE:

Task 1133949909
Name 0gr1iv8s_jhr_design1_COVID-19_SAVE_ALL_OUT_903456_1_0
Workunit 1021309240
Created 26 Mar 2020, 20:05:44 UTC
Sent 26 Mar 2020, 20:22:20 UTC
Report deadline 3 Apr 2020, 20:22:20 UTC
Received 27 Mar 2020, 23:58:09 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x00000000)
Computer ID 3794680
Run time 13 hours 53 min 23 sec
CPU time 10 hours 30 min 46 sec
Validate state Valid
Credit 222.11
Device peak FLOPS 1.87 GFLOPS
Application version Rosetta v4.07 i686-pc-linux-gnu
Stderr output

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0gr1iv8s_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0gr1iv8s_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3546964
Starting watchdog...
Watchdog active.
======================================================
DONE :: 3 starting structures 37846.6 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================
BOINC :: WS_max 9.36336e-97

BOINC :: Watchdog shutting down...
18:53:10 (26863): called boinc_finish(0)

</stderr_txt>
]]>


Should I stop using this computer for this project or let it continue? All of the other workunits appear to process with no problems.
ID: 92455 · Rating: 0
IBM01902

Joined: 23 Mar 20
Posts: 3
Credit: 43,044
RAC: 0
Message 92460 - Posted: 28 Mar 2020, 14:40:07 UTC - in response to Message 92455.  

I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me.
ID: 92460 · Rating: 0
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92464 - Posted: 28 Mar 2020, 15:16:30 UTC - in response to Message 92460.  

<message>
process got signal 11
</message>

The process is crashing. More info:
 SIGSEGV      11       Core    Invalid memory reference

The people with access to the code will have to look into it. I don't know whether there are any crash reports (stack traces, etc.) that you can pull to provide more information to them.
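The lookup from signal number to name described above can be reproduced with Python's standard `signal` module. This is a minimal sketch: the helper name `crash_signal` is hypothetical, and the pattern it searches for is simply the `process got signal N` line as it appears in the stderr output pasted earlier in this thread.

```python
import re
import signal
from typing import Optional


def crash_signal(stderr_text: str) -> Optional[str]:
    """Return the name of the signal reported in a task's stderr, if any.

    Hypothetical helper: it only recognises the 'process got signal N'
    wording seen in the stderr output quoted in this thread.
    """
    match = re.search(r"process got signal (\d+)", stderr_text)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        # signal.Signals maps a number to the platform's signal name.
        return signal.Signals(number).name
    except ValueError:
        # Number not defined on this platform; report it unchanged.
        return f"signal {number}"


print(crash_signal("<message>\nprocess got signal 11\n</message>"))  # SIGSEGV on Linux
```

Knowing the signal name narrows the search, but it does not say where in the program the invalid memory reference happened; that still needs a stack trace or core dump from the failing host.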
ID: 92464 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 3,317
Message 92468 - Posted: 28 Mar 2020, 16:21:24 UTC - in response to Message 92460.  

I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me.


Working ok for me on all my computers. My oldest is an Intel Q8400 (about 10 years old).

It's a pity you can't select which sub projects to run in the Rosetta preferences. Most projects allow you to pick which ones, so you can block the ones that don't work on your machines.

I guess as long as some of them work, you should keep going. Sending one back with an error just means the server will try someone else.
ID: 92468 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92474 - Posted: 28 Mar 2020, 17:18:13 UTC - in response to Message 92455.  

@Shaky Jake. I see you have two machines. It appears the one with 2 CPUs and 2GB of memory is where most of the errors are occurring (the other machine has 2 CPUs and 4GB). This is consistent with what I have gleaned from others as well. I believe the Project Team will be tagging the COVID tasks as requiring more memory in the coming days. This should help things run smoother going forward.
Rosetta Moderator: Mod.Sense
ID: 92474 · Rating: 0
Shaky Jake

Joined: 26 Mar 07
Posts: 2
Credit: 55,684
RAC: 0
Message 92489 - Posted: 28 Mar 2020, 21:01:16 UTC - in response to Message 92455.  
Last modified: 28 Mar 2020, 21:10:21 UTC

I found the problem. I am short 0.1 GB of memory, so when 2 COVID-19 WUs try to run, one of them fails due to lack of memory. I have ordered additional memory. Until it arrives, I have set the computer to run only 1 WU at a time.


Thanks Mod.Sense

Everything seems to be running OK using only 1 core. I am going to upgrade to 4GB of memory. I think that will solve the problem. My other computer is a laptop with 2 cores and 4GB of memory, and it has had no problems.

Shaky Jake
ID: 92489 · Rating: 0
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92490 - Posted: 28 Mar 2020, 21:22:44 UTC - in response to Message 92489.  
Last modified: 28 Mar 2020, 21:28:47 UTC

The binaries should check that there's enough memory for the WU, both at process start and at run time by checking the results of malloc, etc. Since the process on your computer hit a segfault, it may have been caused by a memory allocation failing without the software checking the result of the allocation. There must be some checking in the 32-bit (for Linux) version of the Rosetta & Rosetta Mini binaries, since I've encountered this error message on an older box with only 256MB of memory:
working set size > client RAM limit: 180.00MB > 179.51MB

(But it would be nice to have the check happen ahead of time -- before sending the WU to the computer.)
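The two kinds of check mentioned above, one before the work starts and one at each allocation, might look roughly like the following Python sketch. Both function names are hypothetical and this is not how the Rosetta binaries or the BOINC client implement it; the message format is copied from the error line quoted above. In Python a failed allocation raises `MemoryError` rather than returning a null pointer as `malloc` does in C, so the run-time check is a `try`/`except`.

```python
from typing import Optional


def check_working_set(working_set_mb: float, ram_limit_mb: float) -> Optional[str]:
    """Pre-flight check: return an error message if the task will not fit."""
    if working_set_mb > ram_limit_mb:
        return (f"working set size > client RAM limit: "
                f"{working_set_mb:.2f}MB > {ram_limit_mb:.2f}MB")
    return None


def allocate_or_report(n_bytes: int) -> Optional[bytearray]:
    """Run-time check: handle a failed allocation instead of crashing."""
    try:
        return bytearray(n_bytes)
    except MemoryError:
        # A real application would log the failure and exit cleanly here.
        return None


print(check_working_set(180.0, 179.51))
# working set size > client RAM limit: 180.00MB > 179.51MB
print(check_working_set(100.0, 179.51))  # None
```

Failing with a message like this is far easier to diagnose than a bare signal 11.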
ID: 92490 · Rating: 0
bormolino
Joined: 16 May 13
Posts: 4
Credit: 160,977
RAC: 0
Message 92491 - Posted: 28 Mar 2020, 21:24:50 UTC

The graphics for the Rosetta 4.07 COVID-19 WU do not work. The graphics window shows "Stage unknown" and "No shared mem".

The graphics of the other WUs are working without any problems.
ID: 92491 · Rating: 0
EHM-1
Joined: 21 Mar 20
Posts: 23
Credit: 183,782
RAC: 0
Message 92534 - Posted: 29 Mar 2020, 15:37:52 UTC
Last modified: 29 Mar 2020, 15:41:28 UTC

Hello all- Longtime SETI@Home user here, new to Rosetta. Hope I'm posting in the right place; please advise me if not.
I attached several days ago, and the screensaver was displaying what I would expect for processing until a couple days ago. Since at least yesterday morning (midday Mar 28 UT), the processing screen displays what I would call a blank template, with no indication that anything is being processed. See image below.
Any ideas? Anyone else encountering this? I could find no mention of anything similar in the forums.
Thanks in advance for any help.
Eric
PS- Just after posting, I now see that bormolino might be reporting the same issue just above my post.

ID: 92534 · Rating: 0
bormolino
Joined: 16 May 13
Posts: 4
Credit: 160,977
RAC: 0
Message 92558 - Posted: 29 Mar 2020, 18:34:00 UTC - in response to Message 92534.  

PS- Just after posting, I now see that bormolino might be reporting the same issue just above my post.


Yes :D

Same on my machines.
ID: 92558 · Rating: 0
EHM-1
Joined: 21 Mar 20
Posts: 23
Credit: 183,782
RAC: 0
Message 92572 - Posted: 30 Mar 2020, 0:17:14 UTC - in response to Message 92558.  

Follow-up to my earlier post: At the most recent screensaver invocation, the normal behavior resumed.
Note: Though subscribed to this thread, I received no notification of bormolino's post.
Eric
ID: 92572 · Rating: 0
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92573 - Posted: 30 Mar 2020, 0:23:19 UTC - in response to Message 92572.  

Note: Though subscribed to this thread, I received no notification of bormolino's post.


Check your community prefs from your main account page.
ID: 92573 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 6,141
Message 92581 - Posted: 30 Mar 2020, 2:21:05 UTC

Not sure what this means atm
30/03/2020 3:17:00 | Rosetta@home | Scheduler request completed: got 0 new tasks
30/03/2020 3:17:00 | Rosetta@home | Server can't open database

Also, entering this thread I initially got a message saying the site was down. Came back on a refresh
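For anyone who wants to scan a saved event log for lines like the two above, here is a minimal Python sketch. It assumes the `timestamp | project | message` layout seen in the pasted lines; the names `LogEntry` and `parse_event_line` are hypothetical and not part of BOINC.

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    project: str
    message: str


def parse_event_line(line: str) -> Optional[LogEntry]:
    """Split one event-log line into its three '|'-separated fields."""
    # Split at most twice so a '|' inside the message text is preserved.
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return None
    return LogEntry(*parts)


lines = [
    "30/03/2020 3:17:00 | Rosetta@home | Scheduler request completed: got 0 new tasks",
    "30/03/2020 3:17:00 | Rosetta@home | Server can't open database",
]

for line in lines:
    entry = parse_event_line(line)
    # Flag server-side database trouble, which the client cannot fix itself.
    if entry and "Server can't open database" in entry.message:
        print(f"{entry.timestamp}: {entry.project} reported a server-side database problem")
```

A message like this originates on the project's server, so there is usually nothing to change on the client; the request is retried later.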
ID: 92581 · Rating: 0
amgthis

Joined: 25 Mar 06
Posts: 81
Credit: 203,879,282
RAC: 0
Message 92582 - Posted: 30 Mar 2020, 2:43:32 UTC

Getting a 'temporarily failed upload of (w/u name here xxx) transient http error' message on upload failure and time out.

I'm guessing it's just some new message I've never seen and the project is just getting updated, etc.
ID: 92582 · Rating: 0
HPE Belgium

Joined: 27 Mar 20
Posts: 16
Credit: 367,648,439
RAC: 0
Message 92589 - Posted: 30 Mar 2020, 9:22:12 UTC

Hello,

I have some servers that I want to use for R@H. Most of the servers use the full CPU and all cores/logical CPUs; however, I have 2 servers that use only half of the available logical processors.
Both servers are ProLiant Gen9 servers.

One server is a BL660c Gen9 with 32 logical CPUs, but only half of them are working while I still have tasks "ready to start".
The other server is a DL380 Gen9, which runs at 67% CPU load instead of 100%.
My other servers are Gen8 servers which take full load.


Is there something I can do to fix this? Somebody that can help me troubleshoot? All my preferences are set to 100% load in my global preferences and this setting works fine on most of my servers.
ID: 92589 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1735
Credit: 18,532,940
RAC: 14,716
Message 92591 - Posted: 30 Mar 2020, 9:37:39 UTC - in response to Message 92589.  

Is there something I can do to fix this? Somebody that can help me troubleshoot? All my preferences are set to 100% load in my global preferences and this setting works fine on most of my servers.
Are they "Ready to start" or "Waiting on memory"? Do they have enough RAM to support all of those cores & threads? And you haven't changed any settings in the BOINC Manager on those systems (local settings override web-based ones)?
Grant
Darwin NT
ID: 92591 · Rating: 0

©2025 University of Washington
https://www.bakerlab.org