Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104047 - Posted: 6 Jan 2022, 18:53:19 UTC - in response to Message 104045.  

I could do, but I've decided to stick to Gridcoin as I've worked out it pays for the hardware in 1.5 years of crunching, so not to be sneezed at.


I've had a glance at it, but never really dug into it at all.
After 1.5 years. Slow but pretty good investment.

I killed a CPU after a year of hard, flat-rate OC. So I don't do that anymore; that was an expensive lesson that even Gridcoin could not pay off in 1.5 years. 2 places with a bunch of tech time. New MOBO and new CPU (that hurt). Since I want to run a bunch of projects, I just dropped $150 (approx.) on 2 sticks of 16GB memory, since Python is such a memory hog and RAH does not let us control the number of cores used. 2 sticks of 4 have been with me since they were some of the largest memory on the market many moons ago. I know expense. Now if GC could pay for my electric, then that might be something to look at.
ID: 104047 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104048 - Posted: 6 Jan 2022, 19:06:07 UTC - in response to Message 104047.  
Last modified: 6 Jan 2022, 19:37:23 UTC

I don't OC because of crashes and data corruption. Never killed a CPU though, I thought they lasted forever. But then I never experimented too harshly with overvoltage, which sounds a very nasty thing to do to a chip. I don't actually see the point: Intel/AMD test these chips thoroughly to see what speed they can run at reliably. I'm sure they know what they're doing.

There are ways of getting cheap electricity, some of which are legal :-) Discounts for direct debit, paperless billing, dual fuel, reading your own meter, etc. And choosing a supplier that's 30% cheaper. Or using night time rates. Or installing solar panels and taking absurd government subsidies.

Just connected two dual xeons (I needed a proprietary cable to make the stupid things boot up), then fixed the same duplicated ID cloning problem I had before. They total 48 cores, and they're taking pythons :-)
ID: 104048 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104049 - Posted: 6 Jan 2022, 21:00:41 UTC - in response to Message 104048.  

It's just a pity you can't use LHC with gridcoin. The reason being the stupid creditnew system screws up with multicore tasks, and is very easy to cheat, so people were managing 10x the coins they were due and taking money from the rest of us. LHC refuse to fix it and say it's Boinc's fault, which I agree with.
ID: 104049 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104050 - Posted: 6 Jan 2022, 23:35:44 UTC - in response to Message 104049.  
Last modified: 6 Jan 2022, 23:36:28 UTC

If it's not my Vbox version, the only other difference is AMD vs Intel. Do AMDs run the Rosetta Python ok? They have different virtualization technology. And by ok I mean check if they are validated on the server, since they appeared alright on my end.
ID: 104050 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104051 - Posted: 6 Jan 2022, 23:53:01 UTC - in response to Message 104050.  

Shouldn't be.
Nobody else has complained about Intel.
I'm an AMD user, so really can't be of any help.

Again, via our view all the tasks on your computer have disappeared.

If I can find some time tomorrow [Friday] (EU time) I will try to dig into validate errors.
Everything from my system is chugging along just fine.

You know there is one other thing you can check...look at the task itself and see if it was sent to another computer and what that computer got out of it. Valid or invalid.
If you both got invalid, then there is something wrong with the data.
If you're #1 and invalid, and then you look at it again and #2 is valid, then there is something wrong with your data.
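
The rule of thumb above can be written down as a small decision function. This is only a sketch: the name `diagnose` and the outcome strings are made up for illustration and are not part of BOINC or Rosetta@home; you would still read the two outcomes off the task page by hand.

```python
def diagnose(my_outcome: str, other_outcome: str) -> str:
    """Apply the rule of thumb to two results for the same task.

    Both arguments are "valid" or "invalid" (hypothetical labels,
    read by hand off the task page).
    """
    if my_outcome == "invalid" and other_outcome == "invalid":
        # Two different hosts failed on the same task.
        return "suspect the task data"
    if my_outcome == "invalid" and other_outcome == "valid":
        # Another host handled the same task fine.
        return "suspect your own machine"
    return "nothing points at your machine"


print(diagnose("invalid", "valid"))
```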

You are completing the tasks, but get validation inconclusive?
Can you copy the readout from Stderr output on the task page if it is anything other than something like this:
<core_client_version>7.16.20</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pdblite_boinc_998_10_tfirst--fuse--predictor_v11_boinc_fix--fuse--tslp_design_v1_boinc_fix_tyr.xml @tau_site_altern_row2_V_gggraft_bcov_flags -in:file:silent tau_site_altern_row2_V_gggraft_bcov_v1_xaa_SAVE_ALL_OUT_IGNORE_THE_REST_2oa4rj8j.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip tau_site_altern_row2_V_gggraft_bcov_v1_xaa_SAVE_ALL_OUT_IGNORE_THE_REST_2oa4rj8j.zip @tau_site_altern_row2_V_gggraft_bcov_v1_xaa_SAVE_ALL_OUT_IGNORE_THE_REST_2oa4rj8j.flags -nstruct 100 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2406449
Using database: database_357d5d93529_n_methylminirosetta_database
======================================================
DONE :: 100 starting structures 11805.6 cpu seconds
This process generated 100 decoys from 100 attempts
======================================================
BOINC :: WS_max 6.34278e+08
12:22:14 (16996): called boinc_finish(0)

</stderr_txt>
]]>

This was a valid task....I haven't had any invalids in so long....
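
If you want to compare a pasted Stderr readout against the one above without eyeballing it, something like the following would do. It is a sketch only: `summarize_stderr` is a made-up helper, and the two patterns are taken purely from the rosetta_4.20 output shown here, so output from the Python (VirtualBox) tasks may well look different.

```python
import re


def summarize_stderr(stderr_text: str) -> dict:
    """Pull the exit code and decoy counts out of a pasted stderr readout.

    Returns None for any field whose line is not present in the text.
    """
    finish = re.search(r"called boinc_finish\((-?\d+)\)", stderr_text)
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", stderr_text)
    return {
        "exit_code": int(finish.group(1)) if finish else None,
        "decoys": int(decoys.group(1)) if decoys else None,
        "attempts": int(decoys.group(2)) if decoys else None,
    }


# Lines copied from the valid task quoted above.
sample = """DONE :: 100 starting structures 11805.6 cpu seconds
This process generated 100 decoys from 100 attempts
12:22:14 (16996): called boinc_finish(0)"""

print(summarize_stderr(sample))
```

A readout with no `boinc_finish` line, or with a non-zero code in it, would be the kind worth posting here.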
ID: 104051 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104052 - Posted: 7 Jan 2022, 0:36:49 UTC - in response to Message 104051.  

Shouldn't be.
Nobody else has complained about Intel.
I'm an AMD user, so really can't be of any help.
Have you misread my post? I'm having problems with AMD, and not Intel. If your AMDs are working fine, this shows AMD is ok. Rosetta recommend the latest Vbox. So if it isn't either, I can't think why my Ryzen has a problem. They all completed successfully here, but failed to validate on the server.

Again, via our view all the tasks on your computer have disappeared.
Don't know why that happened. You can see the totals of consecutive validations, but not individual tasks, see https://boinc.bakerlab.org/rosetta/host_app_versions.php?hostid=6167614 under rosetta python projects - "Number of tasks completed 89", "Consecutive valid tasks 0"

If I can find some time tomorrow [Friday] (EU time) I will try to dig into validate errors.
Everything from my system is chugging along just fine.

You know there is one other thing you can check...look at the task itself and see if it was sent to another computer and what that computer got out of it. Valid or invalid.
If you both got invalid, then there is something wrong with the data.
If you're #1 and invalid, and then you look at it again and #2 is valid, then there is something wrong with your data.
I would assume if there was something wrong with the data, I was very unlucky. Assuming the grcpool admin resets my computer, I should be lucky next time.

You are completing the tasks, but get validation inconclusive?
Can you copy the readout from Stderr output on the task page if it is anything other than something like this:
<core_client_version>7.16.20</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pdblite_boinc_998_10_tfirst--fuse--predictor_v11_boinc_fix--fuse--tslp_design_v1_boinc_fix_tyr.xml @tau_site_altern_row2_V_gggraft_bcov_flags -in:file:silent tau_site_altern_row2_V_gggraft_bcov_v1_xaa_SAVE_ALL_OUT_IGNORE_THE_REST_2oa4rj8j.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip tau_site_altern_row2_V_gggraft_bcov_v1_xaa_SAVE_ALL_OUT_IGNORE_THE_REST_2oa4rj8j.zip @tau_site_altern_row2_V_gggraft_bcov_v1_xaa_SAVE_ALL_OUT_IGNORE_THE_REST_2oa4rj8j.flags -nstruct 100 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2406449
Using database: database_357d5d93529_n_methylminirosetta_database
======================================================
DONE :: 100 starting structures 11805.6 cpu seconds
This process generated 100 decoys from 100 attempts
======================================================
BOINC :: WS_max 6.34278e+08
12:22:14 (16996): called boinc_finish(0)

</stderr_txt>
]]>

This was a valid task....I haven't had any invalids in so long....
Can't get to such things on my machine, due to grcpool owning the account.
ID: 104052 · Rating: 0
.clair.

Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 104053 - Posted: 7 Jan 2022, 1:56:48 UTC - in response to Message 104052.  

I'm having problems with AMD, and not Intel

That is interesting, because I was working with the Intel vs AMD idea,
except I have more problems with my Intel Xeon cruncher than my AMD Opteron.
Pop goes another theory as to why this stuff happens.
I have tried 5xx and 6xx Vbox and it seemed to make no difference to my problems.
ID: 104053 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104054 - Posted: 7 Jan 2022, 2:16:41 UTC - in response to Message 104053.  
Last modified: 7 Jan 2022, 2:18:51 UTC

I'm having problems with AMD, and not Intel

That is interesting, because I was working with the Intel vs AMD idea,
except I have more problems with my Intel Xeon cruncher than my AMD Opteron.
Pop goes another theory as to why this stuff happens.
I have tried 5xx and 6xx Vbox and it seemed to make no difference to my problems.
So far I've only proved an Intel i5 works, and an AMD Ryzen 9 doesn't. I have 4 old Intel Xeons (X5650, 3 years older than yours) running overnight on python, I'll find out tomorrow if they work and post here.
ID: 104054 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104057 - Posted: 7 Jan 2022, 9:18:59 UTC - in response to Message 104054.  

I'm having problems with AMD, and not Intel

That is interesting, because I was working with the Intel vs AMD idea,
except I have more problems with my Intel Xeon cruncher than my AMD Opteron.
Pop goes another theory as to why this stuff happens.
I have tried 5xx and 6xx Vbox and it seemed to make no difference to my problems.
So far I've only proved an Intel i5 works, and an AMD Ryzen 9 doesn't. I have 4 old Intel Xeons (X5650, 3 years older than yours) running overnight on python, I'll find out tomorrow if they work and post here.



Why would Intel process the data any differently than AMD?
Data is data, a program is a program.
Or is Intel garbling the data?


And Peter, I am responding late at night and reading fast, so I might misread some details of your post. Sorry.
ID: 104057 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104060 - Posted: 7 Jan 2022, 16:07:50 UTC - in response to Message 104057.  

I'm having problems with AMD, and not Intel

That is interesting, because I was working with the Intel vs AMD idea,
except I have more problems with my Intel Xeon cruncher than my AMD Opteron.
Pop goes another theory as to why this stuff happens.
I have tried 5xx and 6xx Vbox and it seemed to make no difference to my problems.
So far I've only proved an Intel i5 works, and an AMD Ryzen 9 doesn't. I have 4 old Intel Xeons (X5650, 3 years older than yours) running overnight on python, I'll find out tomorrow if they work and post here.



Why would Intel process the data any differently than AMD?
Data is data, a program is a program.
Or is Intel garbling the data?
Anything using virtualbox, like Rosetta's Python, or anything from LHC, requires hardware virtualisation, which is done differently with Intel and AMD. I can't find any info on what is different other than it's not very significant, but there may be something that causes a bug in one and not the other. But my Intels all work as far as I know (haven't had a validation from the xeons yet) and my AMD doesn't, which is the opposite of what you get, so perhaps it's nothing to do with AMD/Intel. I do notice however that if I have virtualbox on all the AMD's cores, the Windows interface slows to a crawl, and I've not seen that with an Intel, so something is different.
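
For anyone who wants to see which flavour of hardware virtualisation a CPU advertises, here is a rough sketch. It assumes a Linux host, where `/proc/cpuinfo` lists the `vmx` flag for Intel VT-x and the `svm` flag for AMD-V; the function name `virtualization_kind` is made up, and the Windows machines discussed in this thread would need a different tool. It only reports what the CPU advertises and says nothing about why a task fails to validate.

```python
def virtualization_kind(cpuinfo_text: str) -> str:
    """Report the virtualisation extension named in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the first colon is a space-separated flag list.
            flags.update(line.partition(":")[2].split())
    if "vmx" in flags:
        return "Intel VT-x"
    if "svm" in flags:
        return "AMD-V"
    return "none reported"


# Shortened, made-up flag lines for illustration.
print(virtualization_kind("flags\t\t: fpu vme sse2 svm"))
print(virtualization_kind("flags\t\t: fpu vme sse2 vmx"))
```

On a real Linux box you would pass in the contents of `/proc/cpuinfo`, e.g. `virtualization_kind(open("/proc/cpuinfo").read())`.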

And Peter, I am responding late at night and reading fast, so I might misread some details of your post. Sorry.
I have trouble sleeping, so my hours are weird, I'm possibly half dozed off sometimes too.
ID: 104060 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104062 - Posted: 7 Jan 2022, 16:30:27 UTC - in response to Message 104060.  

I have trouble sleeping, so my hours are weird, I'm possibly half dozed off sometimes too.

Not me, 1am is my limit. Then I am off to bed and need the 8 hour recharge. 7.5 is the minimum.
But I have a very physically demanding job.
ID: 104062 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104063 - Posted: 7 Jan 2022, 16:51:48 UTC - in response to Message 104062.  

I have trouble sleeping, so my hours are weird, I'm possibly half dozed off sometimes too.

Not me, 1am is my limit. Then I am off to bed and need the 8 hour recharge. 7.5 is the minimum.
But I have a very physically demanding job.
Lucky you, I have chronic fatigue :-(
ID: 104063 · Rating: 0
.clair.

Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 104066 - Posted: 7 Jan 2022, 19:22:01 UTC - in response to Message 104057.  

Why would Intel process the data any differently than AMD?
Data is data, a program is a program.
Or is Intel garbling the data?

This isn't just any data, this is `Python` data,
and it will do what funky stuff it wants.
[that is a skit on the M&S adverts on UK TV]
ID: 104066 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104067 - Posted: 7 Jan 2022, 20:04:36 UTC - in response to Message 104066.  
Last modified: 7 Jan 2022, 20:22:22 UTC

Well so far my i5 has worked perfectly, my Ryzen got banned, and my old Xeons I just noticed have spent 24 hours running python tasks with a total of 13 minutes CPU time. I wondered why they felt cold to the touch. There's something terribly wrong with these WUs.

These are the two Xeons, I'm in the process of aborting the tasks, if anyone can look and interpret the outputs. https://boinc.bakerlab.org/rosetta/results.php?hostid=6169682 https://boinc.bakerlab.org/rosetta/results.php?hostid=6169697 Make sure you look at the right ones, the ones aborted just now, not the ones aborted yesterday (that was something else when I was trying to set things up).

Here is a dodgy one, many errors, please interpret: https://boinc.bakerlab.org/rosetta/result.php?resultid=1463541284

It includes many of these lines:

Hypervisor System Log:
24:11:34.575288 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0"
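
In case it helps anyone sift through a long log, here is a rough sketch for pulling the interesting fields out of lines like that one. The field layout is inferred only from the single line quoted above, not from any documented VirtualBox log format, and `parse_com_error` is a made-up name.

```python
import re

# Pattern guessed from the one log line quoted above.
COM_ERROR = re.compile(
    r"ERROR \[COM\]: aRC=(?P<rc>\w+) \((?P<code>0x[0-9a-fA-F]+)\)"
    r".*?aComponent=\{(?P<component>[^}]*)\}"
    r"\s+aText=\{(?P<text>[^}]*)\}"
)


def parse_com_error(line: str):
    """Return the error name, hex code, component and text, or None if no match."""
    match = COM_ERROR.search(line)
    return match.groupdict() if match else None


sample = (
    '24:11:34.575288 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) '
    'aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} '
    'aText={The object functionality is limited}, preserve=false aResultDetail=0"'
)

print(parse_com_error(sample))
```

Running it over every line of the log and counting the distinct `rc` and `text` values would show whether it is all the same error repeating or several different ones.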

I have asked over in the main Boinc forum too, https://boinc.berkeley.edu/dev/forum_thread.php?id=14532
ID: 104067 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104068 - Posted: 7 Jan 2022, 21:37:14 UTC - in response to Message 104067.  

I've asked in the LHC forum, since they use vbox on almost all tasks and might know what the problem is: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5781
ID: 104068 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 104069 - Posted: 7 Jan 2022, 23:00:47 UTC - in response to Message 104068.  

I've asked in the LHC forum, since they use vbox on almost all tasks and might know what the problem is: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5781

Don't confuse vbox (which handles 32-bit work) with vbox64 (which handles 64-bit work).
ID: 104069 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104072 - Posted: 7 Jan 2022, 23:28:31 UTC - in response to Message 104069.  
Last modified: 7 Jan 2022, 23:48:15 UTC

I assume everyone is on vbox64 by now?

LHC will be, and they seem to use the same wrapper as Rosetta.

I'm not sure what it is you're trying to tell me. I only installed one piece of software, VirtualBox, from the Oracle site, same version that Boinc issues. Are you telling me there are two halves, and Rosetta uses a different one from LHC? My i5, which does python ok, has vboxheadless and virtualbox interface listed in the Windows task manager as running, with no mention of 32 or 64 bit.

After following the advice from the LHC forum, I am no further forwards. My old xeons don't do any CPU time, my Ryzen (I think, can't check as it's now banned) computes but is not validated, and my i5 runs them perfectly. Same version of everything on all of them.
ID: 104072 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 104074 - Posted: 7 Jan 2022, 23:47:11 UTC - in response to Message 104072.  
Last modified: 7 Jan 2022, 23:56:40 UTC

RNA World is still on vbox, but they're down to 19 unfinished workunits. So, not everyone.

Virtualbox (at least the latest versions) has two parts, the vbox part for 32-bit work and the vbox64 part for 64-bit work.

Rosetta, and probably also LHC, use the vbox64 part. I don't participate in LHC, so I haven't seen what they use.
ID: 104074 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 104075 - Posted: 7 Jan 2022, 23:48:51 UTC - in response to Message 104074.  
Last modified: 7 Jan 2022, 23:49:28 UTC

RNA World is still on vbox, but they're down to 19 unfinished workunits. So, not everyone.
But LHC and Rosetta are 64 bit?

And how does RNA world work, do you have to download an old 32 bit version?
ID: 104075 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104077 - Posted: 8 Jan 2022, 0:04:15 UTC - in response to Message 104067.  
Last modified: 8 Jan 2022, 0:06:01 UTC

Well so far my i5 has worked perfectly, my Ryzen got banned, and my old Xeons I just noticed have spent 24 hours running python tasks with a total of 13 minutes CPU time. I wondered why they felt cold to the touch. There's something terribly wrong with these WUs.

These are the two Xeons, I'm in the process of aborting the tasks, if anyone can look and interpret the outputs. https://boinc.bakerlab.org/rosetta/results.php?hostid=6169682 https://boinc.bakerlab.org/rosetta/results.php?hostid=6169697 Make sure you look at the right ones, the ones aborted just now, not the ones aborted yesterday (that was something else when I was trying to set things up).

Here is a dodgy one, many errors, please interpret: https://boinc.bakerlab.org/rosetta/result.php?resultid=1463541284

It includes many of these lines:

Hypervisor System Log:
24:11:34.575288 ERROR [COM]: aRC=E_ACCESSDENIED (0x80070005) aIID={85cd948e-a71f-4289-281e-0ca7ad48cd89} aComponent={MachineWrap} aText={The object functionality is limited}, preserve=false aResultDetail=0"

I have asked over in the main Boinc forum too, https://boinc.berkeley.edu/dev/forum_thread.php?id=14532



It was chugging along just fine and then blew up with access denied? That's weird.
Did Windows all of a sudden block it, or did it run into a fault with the data?
That it ran 24 hours is really odd. These finish in 4 hours or less.
A quick look at the "object functionality" statement says something went wrong in Vbox.
If that happens repeatedly, then you need to remove Vbox and reinstall it.


Again, it's very late in the EU, so I will have to dig into it more later.
Maybe our two experts can help you more.
ID: 104077 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org