This is from the running log of material. Hopefully it is now summarized
elsewhere, but it is kept here for historical purposes.
September / October.
My friend at Sun (a Sr. Technical Specialist on SPARC Servers) suggested I
look into a T1000 / T2000 under Sun's Try and Buy program. I have what I
think is a novel approach to factoring large integers and it looks like it
would be a pretty good fit. (Factoring large integers generally takes
massive amounts of CPU time, often measured in cpu-YEARS / DECADES /
CENTURIES.
The RSA
Factoring Challenge was the original motivation for this research.) I
looked on the web site, checked the conditions and such. Looks like it
might be... It seemed somewhat unlikely that Sun would let me (one guy
doing research in his basement in his spare time) have one, either
permanently (the folks getting awarded systems are doing much more
professional evaluations with the boxes than I am) or even for 60 days, but
hey...
Ported application to Solaris 10. Built several versions of GCC along the
way.
October 16.
Ordered the T10Z10A-16GA2F, better known as the "large" T1000 (8 cores, two
73GB drives, 16GB), at just under $15,000. I got this because it has the
most memory of the systems available under this program. The application as
currently written is memory hungry.
I really wanted to order the T20Z108B-32GA2D, because it has 32GB of memory
(the application can use about 800MB per process) and has 1.2 GHz cores,
versus the 1.0GHz in the other model, but (a) that model is not on the
try-it-and-buy-it program, and (b) at much closer to $30,000, I can be
almost certain that it will be out of consideration for actually
purchasing. Let's see what they say...
October 17.
Wow - I've been "credit approved" in one day (with a little email exchange
about my Dun & Bradstreet number)! I received a quote with all the gory
legal terms and conditions.
October 19.
I sent an email to the person who told me about my credit approval asking
about when the system might ship.
October 23.
I receive a "Welcome" email indicating shipment within two weeks and links
to a resource web page
that I start looking through. Lots of useful stuff.
October 26.
I went to the web site in the "Welcome" email to check on the order
status, which just redirects me to the generic order status
page for the try-it-and-buy-it program, which just tells me to call an
800 number... That's kind of silly. But I called the 800 number and got a
ship date of November 2 and an expected delivery date of November 9. I also
filled out the comment form at the bottom of the page to indicate that I had
expected a more "online" experience. Why couldn't the information that was
available to the person I called be available to me on the web? And why
redirect me through a web page just to get a phone number when the phone
number could have been provided initially?
I received a followup email in response to my comment, confirming that the system is set
to ship November 2.
Two and a half weeks from order to shipment - don't know if that's product
demand, just-in-time manufacturing or what...
November 2.
I've taken to putting my notes down in a chronological manner for future
import into blog entries (after all - I have as much likelihood of "winning"
one of these systems as anybody else!).
I'm on pins and needles waiting to hear that the system has shipped.
So much so that at 7 pm (EST) I called the order status number again...
The system shipped yesterday! This time I got a tracking number to track the
shipment online. Strange, it says it's only two pounds... From California and
it's already in Indianapolis! More than halfway here. Maybe tomorrow,
certainly by Monday.
November 3.
The tracking number indicates that the package is in town... On the truck
for delivery... It's here! Oops. That's package 2 of 2. It's only the
power cord... Huh? Maybe it's the "second" power cord. Called the 800
number again and got tracking information for package 1 of 2 (at 36 pounds,
that must be the rest of it). It's somewhere between Memphis and here...
Probably Monday...
November 6.
The system has arrived. Going to let it have a couple hours to acclimate
(it was definitely a little colder than room temperature when it arrived).
No second power cord. Heck - no place to even attach a second power cord.
Gosh, it's been ages since I ran a Solaris box. Let's see what to do...
Read the manual...
Hooked up a VT100 (yes - those still work fine) to the Serial Controller
port. Set the admin password, and "poweron -c".
WOW is that loud! I'll definitely be moving it into the rack once it gets
configured... Not designed for an "office" environment...
Running through all the POST tests... Passed all the tests. Shutdown and
restart (wow - that's unnerving the way it shuts off and on like that).
Booting the OS...
Set language to English / English. Hey - there's an option for a VT100
terminal!
Enter hostname, network configuration, dns, time zone, etc. Syncing and
rebooting... (expected the shutdown this time).
Create a non-root user so I can ssh into the box and shut it down for moving
into the rack. That noise was giving me a headache. Put a Kill-A-Watt
meter on it to see how much power it takes.
Installing development tools (GCC) and preliminary testing shows that I can
crank it up to use lots of processors... (For those of you who are going to
point out that there's a copy of GCC in /usr/sfw/bin, thanks, but that
compiler is only built with C/C++ support and the application under test is
written in Ada.)
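For anyone curious how an Ada program keeps that many hardware threads busy,
here's a purely illustrative sketch (not the actual application; all the
names are made up) of a fixed pool of worker tasks, which GNAT maps onto
Solaris threads that the T1000 can spread across its 32 hardware strands:

with Ada.Text_IO; use Ada.Text_IO;

procedure Worker_Pool_Demo is
   Worker_Count : constant := 32;

   task type Worker is
      entry Start (Id : in Natural);
   end Worker;

   task body Worker is
      My_Id : Natural;
   begin
      accept Start (Id : in Natural) do
         My_Id := Id;
      end Start;
      --  A real application would search partition My_Id of the problem
      --  space here; this demo just reports in.
      Put_Line ("Worker" & Natural'Image (My_Id) & " done");
   end Worker;

   Pool : array (1 .. Worker_Count) of Worker;
begin
   --  Hand each task its identity; the program exits only after all
   --  32 tasks have terminated.
   for I in Pool'Range loop
      Pool (I).Start (I);
   end loop;
end Worker_Pool_Demo;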
November 7.
Preliminary results with 32 threads on T1000:
34227.72u 67.72s 18:27.50 3096.6%
Versus 2 threads on 3.2GHz Pentium IV running Fedora Core 5.
2684.499u 2.980s 24:33.32 182.4% 0+0k 0+0io 0pf+0w
Or one thread on a 2.2GHz Athlon 64 (3200+ - /proc/cpuinfo below), also
running Fedora Core 5:
1573.162u 4.572s 26:19.08 99.9% 0+0k 0+0io 0pf+0w
Burning up a lot of CPU time, but not getting more than a 33% improvement in
elapsed time. And that's compared against systems that cost $900 and $600 in
2003 and 2005 dollars, respectively.
It appears that the release of GCC in use (4.1.0) doesn't particularly
target the Niagara processor. Off getting and building later releases
(gcc-4.2-20061031 looks like a good choice).
Also, need to run the system upgrades. That's not working well. Contacted
Sun to find out why the command
# sconadm register -a -r /tmp/myreg.profile
fails to authenticate after trying several things referenced on their web
site. No response... Maybe I got something wrong there. Well, not a high
priority for me at the moment...
November 9.
After updating sed, libiconv, m4, flex, bison, ..., got gcc-4.2-20061031
built and installed successfully. Recompiled the application with
-mcpu=niagara and -mtune=niagara. First quick test with a single task appears to have
improved performance by just under 40%! Running more detailed test... Hmm,
not that good... Barely 2%. Need to investigate... Maybe I've got some
horrendous tasking or function call overhead...
34056.31u 68.64s 17:49.54 3190.6%
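For reference, a recompile like the one above can be invoked along these
lines (server.adb stands in for the actual main unit, which isn't shown
here):

gnatmake -O2 server.adb -cargs -mcpu=niagara -mtune=niagara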
November 14.
Reimplemented to provide more optimizations to the infrastructure (or so I
thought). Actually running slower.
With two clients on the hyperthreaded Pentium IV CPU (it was 24 minutes wall
clock before; the CPU time seems almost constant, so perhaps the slowdown is
due to overhead someplace):
2864.018u 2.736s 27:21.81 174.6% 0+0k 0+0io 0pf+0w
On the T1000:
39157.21u 73.07s 20:28.64 3192.9%
Not good - a 20% INCREASE in time.
gprof has indicated four major time sinks - I'll look into those.
On Linux / Pentium4:
58.93 1340.60 1340.60 237405021 0.00 0.00 server__solve_last_item
17.42 1736.83 396.23 237116148 0.00 0.00 server__incremental_last_multiply
12.48 2020.65 283.83 __divdi3
3.66 2104.03 83.37 __moddi3
On Solaris / Niagara:
53.15 7379.44 7379.44 195694643 0.00 0.00 server__solve_last_item
16.80 9711.64 2332.20 194689084 0.00 0.00 server__incremental_last_multiply
15.90 11919.06 2207.42 __divdi3
4.59 12556.02 636.96 __moddi3
The first two are where I expected the time to be, in my code. The other
two, __divdi3 and __moddi3, are for division and modulus arithmetic and make
up 15 - 20% of the CPU time, which is surprising, since the algorithm is
skewed to reduce division. I need to look into whether I'm doing redundant
operations.
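A side note on those two libgcc routines: in a 32-bit binary, 64-bit "/" and
"mod" aren't single instructions, so GCC emits calls to __divdi3 and
__moddi3. A tiny illustrative fragment (not from the application) that
exercises both:

with Ada.Text_IO; use Ada.Text_IO;

procedure Div_Demo is
   --  A 64-bit signed type, comparable to what a 64-bit-heavy
   --  application would use.
   type Big is range -2 ** 63 .. 2 ** 63 - 1;
   X : Big := 1_000_000_007;
begin
   --  Compiled as a 32-bit binary, the "/" and "mod" below become calls
   --  to the libgcc helpers __divdi3 and __moddi3; compiled with -m64
   --  they become single divide instructions.
   Put_Line (Big'Image (X / 97) & Big'Image (X mod 97));
end Div_Demo;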
November 15.
More minor tweaks. Got a 15 - 20% improvement from the original times!
Have thought of another algorithmic change to consider making.
On Solaris / Niagara:
28754.24u 53.52s 15:02.40 3192.3%
On Linux / Pentium IV:
1873.421u 4.036s 20:53.53 149.7% 0+0k 0+0io 0pf+0w
On Linux / Athlon64:
1450.622u 2.568s 24:14.28 99.9% 0+0k 0+0io 0pf+0w
Interesting. On the Athlon64, gprof reports no __divdi3 or __moddi3. Maybe
it has hardware instructions for that (being a 64 bit CPU). But I would
have expected the same on the Niagara... Sent a note to a friend at
Sun... Hmm, compiling with -m64 works on the Athlon64, but not on the
Niagara. Maybe I should recompile the compiler with some other
configuration switches... That doesn't appear to be it.
November 16.
I definitely need a better CM system so I can reliably reproduce these
results...
Followup from my friend at Sun - there's a GCCFSS (GCC For Sparc Systems)
specifically targeted for Sparc (and Niagara). Binaries only for C/C++, but
sources available (for a GCC 4.0.3 baseline), so I've grabbed the generic
gcc-4.0.3, imported the gcc/ada subtree and am off building now. This
should be fun, since I only have 4.1.0 and 4.2-20061031 available....
November 17 .. 19.
Built a self-generated gcc 4.0.3 (down-compiled from 4.1.0) with minor code
changes (formatting). Tried putting the ada and libada directories into the
gccfss_src tree. Bombs trying to link gnat1. Clearly, it's missing some
Sun stuff. Tweaked to get most of the symbols resolved, but not quite all.
Looks like the right place to properly fix this is in gcc/ada/Makefile.in.
Made a posting to the cooltools forum.
Got a hit on the web site requesting the log file from sun.com (and
googlebot.com!). That means at least somebody has looked at it!
November 20 .. 21.
Worked on the application, unrolling the recursion. Got it working on a
small scale. Improvement is on the order of HUGE (6 seconds versus the 15 -
25 minutes before); now I can run larger tests in "reasonable" time. Need
to write a program to automate the generation of the unrolling. This should
be very interesting...
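To make "unrolling the recursion" concrete, here's a generic illustration
(emphatically not the factoring code itself): when the depth is fixed and
known, a recursive enumeration can be replaced by explicit nested loops,
eliminating the call and block overhead.

with Ada.Text_IO; use Ada.Text_IO;

procedure Unroll_Demo is
   subtype Digit is Natural range 0 .. 9;
   Count_Rec, Count_Unrolled : Natural := 0;

   --  Recursive version: one call per level of the search.
   procedure Search (Depth : Natural) is
   begin
      if Depth = 0 then
         Count_Rec := Count_Rec + 1;          --  "test a candidate"
      else
         for D in Digit loop
            Search (Depth - 1);
         end loop;
      end if;
   end Search;
begin
   Search (3);
   --  Unrolled version of the same three-level search: no calls at all.
   for D1 in Digit loop
      for D2 in Digit loop
         for D3 in Digit loop
            Count_Unrolled := Count_Unrolled + 1;
         end loop;
      end loop;
   end loop;
   --  Both visit the same 1000 candidates.
   Put_Line (Natural'Image (Count_Rec) & Natural'Image (Count_Unrolled));
end Unroll_Demo;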
November 21.
Hmm, maybe I should offer some Duke Stars (formerly Duke Dollars) for
somebody to help create the necessary Makefile.in infrastructure changes
for getting the Ada front end hooked in with gccfss. Maybe I should
wait until the Thanksgiving holidays are over.
November 24.
Got the first part of the automation of the unrolling complete in the
server. Testing some size issues. Trying unsuccessfully to build with
large address space on Solaris. Can only get larger objects on an AMD
Athlon64 platform. Must be something I'm still missing. Try some compiler
rebuilds...
Have another "unrolling" optimization that should reduce the number of mods
and divides down into the noise. Hold that one in my back pocket until I
have other automation completed and running.
November 25.
Attempted to contact some folks from the FSF "services" page to see if I can
find somebody interested in making the necessary Makefile.in changes.
Automated a small case of unrolling in the client. Now to raise the bar and
see what breaks for larger test cases.
November 30.
Completed unrolling on a grand scale, putting declarations at the top level,
reducing runtime overhead of blocks. New results at comparable executions:
T1000 (32 threads):
10506.76u 31.90s 5:30.37 3189.9%
Athlon64 (1 thread):
383.979u 0.552s 6:25.01 99.8% 0+0k 0+0io 0pf+0w
Let's crank up to a higher level, let it run a while and see what we can
get... Duh, let's not... This implementation's size grows exponentially.
Need to make it at worst linear...
December 4.
The implementation granularity is now all wrong the other way (time): it
uses hardly any space (less than 500MB), but there are way too many
iterations to be able to make any meaningful measurements (can you say
10 ** 99!), because there's no convenient fencepost on which to take the
measurements automatically... Going to have to think about this...
Since the current implementation has now done away with much of the need for
64-bit arithmetic, change the types to 32 bits and see what improvement there
is. Yep, that quick little change generates an additional 10% improvement.
Time for a little reflection... And a break to do my real work, have a
life, etc.
Just about halfway through the evaluation period and what has been learned?
Clearly, there are multiple axes of performance here that need to be aligned
in order to maximize throughput: memory, CPU performance, disk space,
bandwidth (currently nothing is implemented that would take advantage of the
T1000's four gigabit Ethernet interfaces), algorithm, etc. Of course, there's
also that power consumption advantage that the T1000 is supposed to have over
comparable systems.
December 5.
Followup on the GCCfss forum about my attempted merge of GCCfss with Ada
support. Just not high on the priority list at the moment to invest the
additional time...
December 20.
Decided to reimplement with byte alignment rather than powers of 10 (that's
part of why there were no entries for the past couple of weeks). WOW! Run
time reduced by almost 95% for the small test case on the Athlon64 (380
seconds down to 20) and by 90% on the T1000 (330 seconds down to 32)!
Duh! Slap forehead. I'll bet all the divides and mods (there are still
some, despite my best efforts) are now just simple shifts and masks, running
at clock speeds, rather than in some C library...
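To spell that out in code, here's an illustrative fragment (again, not the
application itself): dividing by a power of two lets the compiler use a few
shift and mask instructions with no library call, whereas dividing by ten
forces a real division, which for 64-bit operands in a 32-bit binary means
exactly those library routines from the profiles above.

with Ada.Text_IO; use Ada.Text_IO;

procedure Shift_Demo is
   type S64 is range -2 ** 63 .. 2 ** 63 - 1;
   X : S64 := 123_456_789;
begin
   --  "/ 256" and "mod 256" compile down to shifts and masks.
   Put_Line (S64'Image (X / 256) & S64'Image (X mod 256));
   --  "/ 10" and "mod 10" cannot, so in a 32-bit binary they become the
   --  __divdi3 / __moddi3 calls seen in the gprof output.
   Put_Line (S64'Image (X / 10) & S64'Image (X mod 10));
end Shift_Demo;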
Running gprof indicates slightly more activity on the T1000 and the Pentium
IV, probably because the remaining tasks now run to completion rather than
being terminated once one of them finds the solution.
T1000 (32 threads):
989.16u 4.56s 0:32.47 3060.4%
Athlon64 (1 thread):
19.941u 0.132s 0:20.14 99.6% 0+0k 0+0io 0pf+0w
Pentium IV (2 threads):
51.895u 0.312s 0:26.53 196.7% 0+0k 0+0io 0pf+0w
December 22.
Larger example (strange timing report there with all the negative numbers)
clearly favors the T1000 now...
T1000 (32 threads):
-37836.-46u 74.62s 3:24:18.71 -308.0%
Athlon64 (1 thread):
15088.478u 20.813s 4:12:03.36 99.9% 0+0k 0+0io 0pf+0w
Pentium IV (2 threads):
35527.756u 5.668s 4:56:41.53 199.6% 0+0k 0+0io 0pf+0w
Want to run this in "production" on some longer test cases over the Christmas
holiday.
Also want to look into making it work with words (16 bits, maybe double
words, etc.) rather than bytes (8 bits), but I know the memory / disk space
demands will require some novel data storage approaches. At 8GB per client
task (256GB total for 32 clients!), we can't conveniently afford to keep
everything in memory or on disk, since there are only a couple of 73GB disks
on the machine currently.
December 24.
Contacted Sun about picking up the box. It was depressing to have to do so,
since there's been so much learned. But, I've really specified the system
wrong for the way I'm using it (I'm not using the 16GB of memory, for
example), and I simply can't afford that kind of expense for my R&D effort
while I can use the Athlon64 for much of the algorithm development effort.
I'm also not expecting that I'm going to be competitive in the Open
Performance Contest, but I'll probably go ahead and submit what I've done -
maybe somebody will read it and be interested enough in my work to help fund
it... :-) At least it's not just another database / http application!
Another Aha! on the axes of performance - as currently written, the
algorithm does a breadth-first search across 32 CPUs. Now, there are only
64 partitions (when implemented at the byte boundary), so if we just ran 64
threads, we should get broader coverage. So, unless I actually DO implement
another partitioning scheme and / or ordering (search one partition front to
back, another back to front), there's no way to effectively utilize multiple
(as in greater than 2) T1000s anyway, especially with the new machines
sporting 64 threads...
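A toy illustration of that ordering idea (not the real partitioning code;
sizes and names are invented): two workers assigned to the same partition
could sweep it from opposite ends.

procedure Two_Way_Demo is
   type Candidate is mod 2 ** 32;
   Partition_Last : constant Candidate := 2 ** 20 - 1;   --  invented size

   procedure Sweep_Up is
   begin
      for C in Candidate range 0 .. Partition_Last loop
         null;   --  test candidate C here
      end loop;
   end Sweep_Up;

   procedure Sweep_Down is
   begin
      for C in reverse Candidate range 0 .. Partition_Last loop
         null;   --  test candidate C here
      end loop;
   end Sweep_Down;
begin
   --  In a real scheme each sweep would run in its own task and stop as
   --  soon as either direction finds a hit.
   Sweep_Up;
   Sweep_Down;
end Two_Way_Demo;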
Mostly final thoughts...
Don't try to do anything like this in November / December. Not with a
family, a life and two paying jobs at the same time. I had been hoping to
write a paper that came up with a good estimate of the number of Niagara
boxes that could effectively be used to break a particular number within a
period of a month or year, with a certain budget (I think the number was
$1,000,000). This Franke article talks about the development of a $25,000
system, while the SHARK paper discusses a $160M machine for doing so.
But I clearly have yet to identify a more "optimal" configuration and
implementation.
Hopefully I'll have time to run some more of those timing tests on the box
after Christmas before having to package it up to ship it back...
Running tests...
December 27.
Contacted FedEx...
December 28.
System going down...
January 1 .. 2, 2007.
Wow! Several hits from *.sun.com addresses on the web site, indicating
somebody is actually reviewing it!
January 2, 2007.
Email from Azul Systems indicating that their product doesn't have a native
interface for languages other than Java.
January 5, 2007.
Received a letter from Sun congratulating me and informing me that I've been
selected as a potential (my emphasis) winner in the Open Performance
Contest. Had to pick my jaw up off the floor on that one. Even just
potential was a pleasant (and shocking) surprise to me.