Minirosetta 3.52

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1860
Credit: 8,154,097
RAC: 8,001
Message 77364 - Posted: 19 Aug 2014, 6:31:57 UTC

I'll look into the server upgrade. It will be a long process since there is a lot of R@h specific code. Priorities for now are first to release our android app and then to add a replica DB and upgrade the server code. The latter may require significant downtime so we need to plan this with the ongoing research projects in the lab. We also have to look into hardware upgrades.

That's great!!

P.S. Please try to optimize the code for Android (memory footprint, for example).
ID: 77364 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1860
Credit: 8,154,097
RAC: 8,001
Message 77369 - Posted: 20 Aug 2014, 14:14:55 UTC
Last modified: 20 Aug 2014, 14:17:10 UTC

[25675] Add feature for specifying plan classes in an XML file
[25321] Move antique file deletion to a separate program
[22778] Server support for Virtualbox applications


This is an old list (updated to June 2012). After my request, DA has updated it.
These are other changes to server code.

15-18 Aug 2014 Add support for per-app credit
8 Aug 2014 Convey user CPID to client (for BoincTasks?)
29 Jul 2014 version.xml can specify API version (for compressed apps)
25 Jul 2014 partial support in scheduler for generic coprocessors (e.g. ASICs)
16 Jul 2014 scheduler support for client "brand"; store in DB
14 Jul 2014 add <maintenance_delay> config option
8 Jul 2014 matchmaker (score-based) scheduling is now the default
3 Jul 2014 fix bugs in changing code signing key
3 Jul 2014 scheduler: fix bugs if project has both NCI and regular apps
10 Jun 2014 add "delete_spammers.php" for removing various types of spam accounts
6 Jun 2014 app versions (as well as apps) can be marked as "beta"
4 Jun 2014 support CPU OpenCL apps in plan class spec
27 May 2014 fully implement targeted jobs
18 May 2014 include badges in XML stats export
8 May 2014 send notices w/ video or images only to 7.3+ clients
6 May 2014 file_deleter: delete .gz versions also
6 May 2014 add web page showing top CPU models and their stats
4 May 2014 apps can be marked as "exact fraction done" (base completion time est only on FD)
30 Apr 2014 generalize interface to PHPMailer
20 Apr 2014 support remote input files in create_work
18 Apr 2014 let projects disable forums and/or teams
10 Apr 2014 support efficient bulk job creation in create_work
2 Apr 2014 store job peak mem/disk usage in DB
26 Mar 2014 support gzipped input files
21 Mar 2014 use mysqli PHP functions if available
18 Mar 2014 add validator that checks for string in stderr
8 Mar 2014 enforce GPU job limits separately for each GPU type
6 Mar 2014 store gpu_active_frac, and use it in runtime estimation
5-20 Dec 2013 add generic support for badges
23 May 2013 parse client "product name" (e.g. phone model) and store in DB
9 May 2013 use HTTPS for forms containing password
25 Apr 2013 add support for multi-size apps
9 Apr 2013 add new score-based scheduling
27 Aug 2012 add support for limited locality scheduling
17 Aug 2012 add support for volunteer data archival
11 Jul 2012 pagination in forums
25 Jun 2012 scheduler: support Intel GPUs
ID: 77369 · Rating: 0
Orgil
Joined: 11 Dec 05
Posts: 82
Credit: 169,751
RAC: 0
Message 77371 - Posted: 21 Aug 2014, 2:18:00 UTC

Finished WUs have not been validating for a full day. I checked the server status and everything looks green. What happened?!
ID: 77371 · Rating: 0
Miklos M
Joined: 8 Dec 13
Posts: 29
Credit: 5,277,251
RAC: 0
Message 77375 - Posted: 21 Aug 2014, 17:42:44 UTC

Are we getting longer WUs effective 8/31/14? Their estimated time to completion seems to be 40 hours or so. My preferences have not changed and are still set for a max of 1 day to get a WU done per CPU.
ID: 77375 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 77376 - Posted: 21 Aug 2014, 17:48:16 UTC

The jobs should still run based on the target cpu run time preference. The estimate is likely off because the workunit estimated FLOPS value has been doubled. The client should make better estimates as more jobs get processed but if the problem persists or if the job is actually running significantly longer than your target run time, let us know. Thanks.
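
To illustrate why a doubled FLOPS estimate inflates the displayed time, here is a minimal sketch. It assumes the estimate is simply the workunit's estimated floating-point operations divided by the host's speed, scaled by a correction factor that drifts toward what finished jobs actually took. The function names, the smoothing weight, and the numbers are made up for illustration and are not the actual BOINC client code.

```python
def estimate_hours(est_fpops: float, host_flops: float, correction: float = 1.0) -> float:
    """Estimated run time in hours: operations / speed, scaled by a correction factor."""
    return correction * est_fpops / host_flops / 3600.0


def update_correction(correction: float, raw_hours: float, actual_hours: float,
                      weight: float = 0.1) -> float:
    """Move the correction factor a little toward actual/raw after each finished job.

    The exponential smoothing used here is an assumption, not the client's real rule.
    """
    return (1.0 - weight) * correction + weight * (actual_hours / raw_hours)


# Illustrative numbers only: a host doing 2 GFLOPS and a workunit estimated at
# 1.44e14 operations gives 20 hours. Doubling the operation estimate doubles it.
HOST_FLOPS = 2e9
OLD_FPOPS = 1.44e14
NEW_FPOPS = 2 * OLD_FPOPS

old_estimate = estimate_hours(OLD_FPOPS, HOST_FLOPS)
new_estimate = estimate_hours(NEW_FPOPS, HOST_FLOPS)

# If jobs keep finishing in about 20 hours, the corrected estimate drifts back.
correction = 1.0
for _ in range(50):
    correction = update_correction(correction, raw_hours=new_estimate, actual_hours=20.0)
settled_estimate = estimate_hours(NEW_FPOPS, HOST_FLOPS, correction)

print(old_estimate, new_estimate, round(settled_estimate, 1))
```

The point of the sketch is only that the displayed estimate and the actual run time are separate things: the target CPU run time preference governs how long the job really runs, while the estimate catches up as results come in.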
ID: 77376 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1860
Credit: 8,154,097
RAC: 8,001
Message 77377 - Posted: 21 Aug 2014, 19:03:58 UTC

I'll look into the server upgrade. It will be a long process since there is a lot of R@h specific code.


Do Rosetta@Home and Ralph@Home run on the same version of the server code? If not, you could try updating Ralph first and see what happens before updating Rosetta...
ID: 77377 · Rating: 0
Orgil
Joined: 11 Dec 05
Posts: 82
Credit: 169,751
RAC: 0
Message 77381 - Posted: 22 Aug 2014, 4:31:33 UTC
Last modified: 22 Aug 2014, 4:32:44 UTC

My WUs have been waiting 48 hrs to validate or are still in upload state. Houston, we have a problem?!
ID: 77381 · Rating: 0
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77387 - Posted: 22 Aug 2014, 17:42:58 UTC - in response to Message 77354.  
Last modified: 22 Aug 2014, 18:05:25 UTC


I suspect you have too high an expectation of most users. Target CPU runtime has always existed. It's just a little more flexible now. But the people who post here, like you and me, are very much the exception. The "set & forget" option is much more the norm. A document would be nice - no objection to it - but unlikely to gain much of a readership beyond what it is now.

Aborting tasks is clearly different from tasks being timed out. One is an active choice, the other the result of no choice at all. I doubt there's much of a "discouragement" factor. More that defaults don't coincide with a normal pattern of behaviour for ordinary people.

That's why I suggested the proportion of tasks failing to meet deadline should be monitored following the changes. Personally I'd have gone to 4hrs first, but obviously the vast increase in users required a more extreme and urgent response at the time.

I trust TPTB will make the appropriate assessment, seeing as they're the ultimate beneficiaries.


i'd guess server side resource constraints could be a part of the reasons for some of these bottlenecks. i'd guess there are 'other solutions' e.g. an even more 'distributed' computing paradigm / design or partnering with 'mirror' servers say with a willing partner / institution may help alleviate some of the issues. but i'd think that those software changes would possibly affect the design of boinc itself and could take considerable effort to diagnose, develop and integrate with rosetta

i'd guess that for the immediate term having a somewhat longer default run time is hence a *practical* consideration to alleviate some of the issues.

nevertheless, i'm attempting to make do with a somewhat longer self-defined run time (4 hrs) as a compromise for that.

i do agree that running long jobs does not coincide with say an average 'normal' usage pattern of a desktop or even notebook computer as for various reasons users would want to shut down their PC/notebook. a simple example could be that a computer could be running with a rather loud fan, and that'd be simply annoying at night in a bedroom and the (naive?) user could simply decide to abort the jobs and shutdown.

i used to run a PC that had a fan which almost runs like a jet engine (*noisy*) mainly due to an old graphic card lol,
ID: 77387 · Rating: 0
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77391 - Posted: 23 Aug 2014, 3:02:05 UTC - in response to Message 77387.  
Last modified: 23 Aug 2014, 3:13:18 UTC


Personally I'd have gone to 4hrs first, but obviously the vast increase in users required a more extreme and urgent response at the time.

I trust TPTB will make the appropriate assessment, seeing as they're the ultimate beneficiaries.


i'd guess server side resource constraints could be a part of the reasons for some of these bottlenecks. i'd guess there are 'other solutions' e.g. an even more 'distributed' computing paradigm / design or partnering with 'mirror' servers say with a willing partner / institution may help alleviate some of the issues. but i'd think that those software changes would possibly affect the design of boinc itself and could take considerable effort to diagnose, develop and integrate with rosetta



i'd guess other possible 'designs/paradigms' such as a qos (quality of service) design can also be used to alleviate some of the high server load issues.
an example is that when the server is busy it can 'announce' qos controls and issue tokens with a number and a waiting period. this is to issue 'queue numbers' to the participating hosts and to request the hosts to back off and wait for the pre-determined period before retrying.

however in the same way these could involve various changes to boinc (both client and server) and integration with rosetta and could require rather large effort to develop them.

qos has similar limitations as a lengthened run time, however a big difference is that the participant host computer is *idle* while waiting for re-contact with the server. this could alleviate cases where for instance the jobs run with a noisy pc fan, as the fan would likely wind down and run at lower speeds, hence less noise.
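
The token idea above can be sketched roughly as follows. This is only a toy model under stated assumptions: `BusyServer`, `Token`, `contact_with_backoff` and the 60-seconds-per-position wait are hypothetical names and values chosen for illustration, and nothing here corresponds to an existing BOINC client or server feature.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    queue_number: int   # position in line handed out by the busy server
    wait_seconds: int   # how long the host is asked to stay idle before retrying


class BusyServer:
    """Toy server that serves up to `capacity` requests and queues the rest."""

    def __init__(self, capacity: int, seconds_per_position: int = 60):
        self.capacity = capacity
        self.seconds_per_position = seconds_per_position
        self.in_flight = 0
        self.queued = 0

    def request(self) -> Optional[Token]:
        """Return None when served immediately, otherwise a back-off token."""
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return None
        self.queued += 1
        return Token(self.queued, self.queued * self.seconds_per_position)

    def finish(self) -> None:
        """A served request completes: free a slot and let the queue advance."""
        self.in_flight -= 1
        if self.queued:
            self.queued -= 1


def contact_with_backoff(server: BusyServer,
                         sleep: Callable[[float], None] = time.sleep,
                         max_tries: int = 10) -> int:
    """Retry until served, idling for the period named in each token.

    Returns the total number of seconds the host was asked to wait.
    """
    waited = 0
    for _ in range(max_tries):
        token = server.request()
        if token is None:
            return waited
        sleep(token.wait_seconds)   # the host is idle here, so the fan can wind down
        waited += token.wait_seconds
    raise RuntimeError("server still busy after max_tries attempts")
```

Passing `sleep` as a parameter is just a convenience so the back-off can be exercised without really waiting.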
ID: 77391 · Rating: 0
Norman
Joined: 3 Oct 06
Posts: 3
Credit: 1,872,215
RAC: 233
Message 77392 - Posted: 23 Aug 2014, 6:55:38 UTC

I have discovered a serious memory leak in Rosetta Mini 3.52 on a Mac OS X version 10.9.4 system on a Macbook Pro with 8 GB of physical memory. I watched as my system slowed to a crawl and then hung over several hours while my system was otherwise idle.

On another occasion I watched with the Memory panel of the Activity Monitor as my system slowed down, virtual memory grew to 49 GB and the swap file grew to 13 GB. The three Rosetta processes that were running were using about 1 GB each, but were not growing.

I interpret this as filling the available disk space with the swap file. Mac OS X apparently does not cope well with a full disk because there were also many weird error messages in the Console log. When I suspended the Rosetta project in BOINC Manager, my system returned to normal and has been running smoothly all day.

I will not resume running Rosetta until you tell me this bug is fixed.
ID: 77392 · Rating: 0
Orgil
Joined: 11 Dec 05
Posts: 82
Credit: 169,751
RAC: 0
Message 77396 - Posted: 24 Aug 2014, 5:27:13 UTC

I have had a few completed WU results in upload state for 4 days. And why is no one from the project answering my questions?! These results are not my property, nor the project staff's property; they are scientific property. It is shocking that the project server status is showing a false green light while a cruncher cannot upload results.
ID: 77396 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 77397 - Posted: 24 Aug 2014, 5:40:55 UTC

Not sure why your client isn't uploading results. Is anyone else having this issue? Is there any useful info in your client log?

Norman, that's a pretty serious bug/bad workunit. Any specifics? WU id?
ID: 77397 · Rating: 0
Orgil
Joined: 11 Dec 05
Posts: 82
Credit: 169,751
RAC: 0
Message 77398 - Posted: 24 Aug 2014, 8:47:39 UTC
Last modified: 24 Aug 2014, 8:52:28 UTC

The status says: Upload pending, project backoff .. (counting time)

WU id's:
1

application Rosetta Mini
created 18 Aug 2014 14:04:30 UTC
name tj_8_7_ordered_X_25_h20_BAB_20_BAB_wD_fragments_abinitio_SAVE_ALL_OUT_185149_3629
minimum quorum 1
initial replication 1
max # of error/total/success tasks 1, 2, 1

2

application Rosetta Mini
created 6 Aug 2014 9:32:18 UTC
name 1L-18H-2L-8E-4L-8E-1L_1-2.A.0.rsmn_0060_2_fold_SAVE_ALL_OUT_183385_151
minimum quorum 1
initial replication 1
max # of error/total/success tasks 1, 2, 1

3

application Rosetta Mini
created 16 Aug 2014 8:56:24 UTC
name flu.c05g_3_input_0244_0001_ss1_1_ss2_2_ss3_2_ss4_2_ss5_2_0001_0001_0001.B_fragments_fold_188798_181
minimum quorum 1
initial replication 1
max # of error/total/success tasks 1,

4

application Rosetta Mini
created 16 Aug 2014 6:23:12 UTC
name db_triangle104B_fold_SAVE_ALL_OUT_189886_7480
minimum quorum 1
initial replication 1
max # of error/total/success tasks 1, 2, 1
ID: 77398 · Rating: 0
Norman
Joined: 3 Oct 06
Posts: 3
Credit: 1,872,215
RAC: 233
Message 77403 - Posted: 24 Aug 2014, 12:27:27 UTC

For Mac OS X memory leak, three task names:
5htube05_relax_SAVE_ALL_OUT_189789_1457_0
batch2_pdb16_relax_SAVE_ALL_OUT_189866_5873_0
1L-7E-2L-11H-3L-7E-2L-11H-1L_1-2.P.0_0002_fold_SAVE_ALL_OUT_190736_101_0
ID: 77403 · Rating: 0
Miklos M
Joined: 8 Dec 13
Posts: 29
Credit: 5,277,251
RAC: 0
Message 77404 - Posted: 25 Aug 2014, 11:24:20 UTC

Errors in the new long tasks.
ID: 77404 · Rating: 0
Chunfu Xu
Joined: 2 Oct 13
Posts: 2
Credit: 8,816
RAC: 0
Message 77409 - Posted: 25 Aug 2014, 18:11:00 UTC - in response to Message 77403.  

For Mac OS X memory leak, three task names:
5htube05_relax_SAVE_ALL_OUT_189789_1457_0
batch2_pdb16_relax_SAVE_ALL_OUT_189866_5873_0
1L-7E-2L-11H-3L-7E-2L-11H-1L_1-2.P.0_0002_fold_SAVE_ALL_OUT_190736_101_0



The 5htube* work unit was submitted by me. I am sorry that it caused a problem to your computer. I have identified the problem and will avoid it in the future. Sorry about that.

ID: 77409 · Rating: 0
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77412 - Posted: 25 Aug 2014, 21:58:21 UTC
Last modified: 25 Aug 2014, 22:01:42 UTC

The "extremely long file name that goes over the Windows character limit" issue is back:

684142940, 684308973

WARNING! attempt to create gzipped file ../../projects/boinc.bakerlab.org_rosetta
/benchmark_0008_master_babd28351e57425d68b32333be5a837fb7cd5818_ploops
_64_input_0002_no_lig_fragments_contact_opt_iteration_2_50447fb2412049d0b1fecfb10acecfee
_fold_SAVE_ALL_OUT_170398_3761_0_0 failed.


As Windows has a path limit of 256 characters and the above path is 228 characters (excluding the file extension and higher levels of the path) you are bound to generate errors on a regular basis.

This issue has come up before but I guess that some of the scientists missed the memo.

Can you put in place a character limit for scientists submitting work?
I guess there will be a small inconvenience for the scientists in not being as descriptive as they want to be, but at least you don't scare the crunchers away with swathes of compute errors.
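
A character limit of the kind requested could look something like the sketch below. It is not the project's submission code: the function names, the example BOINC data directory prefix, and the number of characters reserved for suffixes the application appends are all assumptions, and the 256-character limit is simply the figure quoted above.

```python
PATH_LIMIT = 256        # limit quoted in this post; adjust to the real platform limit
SUFFIX_RESERVE = 16     # assumed room for suffixes such as "_0_0" plus an extension


def max_name_length(project_dir: str,
                    path_limit: int = PATH_LIMIT,
                    suffix_reserve: int = SUFFIX_RESERVE) -> int:
    """Longest job name that still fits once the directory, separator and suffix are added."""
    return path_limit - len(project_dir) - 1 - suffix_reserve


def name_fits(job_name: str, project_dir: str,
              path_limit: int = PATH_LIMIT,
              suffix_reserve: int = SUFFIX_RESERVE) -> bool:
    """True if a job name can be submitted without overflowing the path limit."""
    return len(job_name) <= max_name_length(project_dir, path_limit, suffix_reserve)


# Hypothetical install location; only the project folder name comes from the log above.
EXAMPLE_DIR = "C:\\ProgramData\\BOINC\\projects\\boinc.bakerlab.org_rosetta"

if __name__ == "__main__":
    for name in ("db_triangle104B_fold_SAVE_ALL_OUT_189886_7480", "x" * 228):
        verdict = "ok" if name_fits(name, EXAMPLE_DIR) else "too long"
        print(f"{len(name):4d} chars: {verdict}")
```

Checking against the worst-case install path on the cruncher's side, rather than only the name itself, is the design choice that matters here, since the scientists never see how deep the BOINC data directory is on a volunteer's machine.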
ID: 77412 · Rating: 0
Norman
Joined: 3 Oct 06
Posts: 3
Credit: 1,872,215
RAC: 233
Message 77414 - Posted: 26 Aug 2014, 2:31:42 UTC

"For Mac OS X memory leak, three task names:
5htube05_relax_SAVE_ALL_OUT_189789_1457_0
batch2_pdb16_relax_SAVE_ALL_OUT_189866_5873_0
1L-7E-2L-11H-3L-7E-2L-11H-1L_1-2.P.0_0002_fold_SAVE_ALL_OUT_190736_101_0"

"The 5htube* work unit was submitted by me. I am sorry that it caused a problem to your computer. I have identified the problem and will avoid it in the future. Sorry about that."

"Avoid it" in the future is not enough. You have described changing the input data for the work unit, but since I am a retired software engineer, I know that the root cause of this problem probably is a software bug.
If such a problem can go wrong in the future, then it will. This software bug caused me to lose a week of work tracking it down. I will not use Minirosetta again until someone tells me that this bug is fixed.
ID: 77414 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 77419 - Posted: 26 Aug 2014, 18:07:45 UTC - in response to Message 77412.  
Last modified: 26 Aug 2014, 18:37:07 UTC

The "extremely long file name that goes over the Windows character limit" issue is back
Can you put in place a character limit for scientists submitting work?
I guess there will be a small inconvenience for the scientists in not being as descriptive as they want to be, but at least you don't scare the crunchers away with swathes of compute errors.


Thanks for catching this. Yes, there is a character limit imposed but this job somehow slipped through. I'll have to reduce the max characters allowed so this doesn't happen again. Edit: I see now how it slipped through and have fixed our submission code. Thanks!
ID: 77419 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 77420 - Posted: 26 Aug 2014, 18:08:25 UTC

We'll definitely track this bug down and make sure it's fixed in the next app update.
ID: 77420 · Rating: 0
©2024 University of Washington
https://www.bakerlab.org