This is from the running log of material. Hopefully it is now summarized
elsewhere, but it is kept here for historical purposes.
September / October.
My friend at Sun (a Sr. Technical Specialist on SPARC Servers) suggested I
look into a T1000 / T2000 under Sun's Try and Buy program. I have what I
think is a novel approach to factoring large integers and it looks like it
would be a pretty good fit. (Factoring large integers generally takes
massive amounts of CPU time, often measured in cpu-YEARS / DECADES /
CENTURIES.
The RSA
Factoring Challenge was the original motivation for this research.) I
looked on the web site, checked the conditions and such. Looks like it
might be... It seemed somewhat unlikely that Sun would let me (one guy
doing research in his basement in his spare time) have one, either
permanently (the folks getting awarded systems are doing much more
professional evaluations with the boxes than I am) or even for 60 days, but
hey...
Ported application to Solaris 10. Built several versions of GCC along the
way.
October 16.
Ordered the T10Z10A-16GA2F, better known as the "large" T1000 (8 cores, two
73GB drives, 16GB), at just under $15,000. I got this because it has the
most memory of the systems available under this program. The application as
currently written is memory hungry.
I really wanted to order the T20Z108B-32GA2D, because it has 32GB of memory
(the application can use about 800MB per process) and has 1.2 GHz cores,
versus the 1.0GHz in the other model, but (a) that model is not on the
try-it-and-buy-it program, and (b) at much closer to $30,000, I can be
almost certain that it will be out of consideration for actually
purchasing. Let's see what they say...
October 17.
Wow - I've been "credit approved" in one day (with a little email exchange
about my Dun & Bradstreet number)! I received a quote with all the gory
legal terms and conditions.
October 19.
I sent an email to the person who told me about my credit approval asking
about when the system might ship.
October 23.
I receive a "Welcome" email indicating shipment within two weeks and links
to a resource web page
that I start looking through. Lots of useful stuff.
October 26.
I went to the web site in the "Welcome" email to check on the order
status, which just redirects me to the generic order status
page for the try-it-and-buy-it program, which just tells me to call an
800 number... That's kind of silly. But I called the 800 number and got a
ship date of November 2 and an expected delivery date of November 9. I also
filled out the comment form at the bottom of the page to indicate that I had
expected a more "online" experience. Why couldn't the information that was
available to the person I called be available to me on the web? And why
redirect me through a web page just to get a phone number when the phone
number could have been provided initially?
I received a followup email in response to my comment, confirming that the system is set
to ship November 2.
Two and a half weeks from order to shipment - don't know if that's product
demand, just-in-time manufacturing or what...
November 2.
I've taken to putting my notes down in a chronological manner for future
import into blog entries (after all - I have as much likelihood of "winning"
one of these systems as anybody else!).
I'm on pins and needles waiting to hear that the system has shipped.
So much so that at 7 pm (EST) I called the order status number again...
The system shipped yesterday! This time I got a tracking number to track the
shipment online. Strange, it says it's only two pounds... From California and
it's already in Indianapolis! More than halfway here. Maybe tomorrow,
certainly by Monday.
November 3.
The tracking number indicates that the package is in town... On the truck
for delivery... It's here! Oops. That's package 2 of 2. It's only the
power cord... Huh? Maybe it's the "second" power cord. Called the 800
number again and got tracking information for package 1 of 2 (at 36 pounds,
that must be the rest of it). It's somewhere between Memphis and here...
Probably Monday...
November 6.
The system has arrived. Going to let it have a couple hours to acclimate
(it was definitely a little colder than room temperature when it arrived).
No second power cord. Heck - no place to even attach a second power cord.
Gosh, it's been ages since I ran a Solaris box. Let's see what to do...
Read the manual...
Hooked up a VT100 (yes - those still work fine) to the Serial Controller
port. Set the admin password, and "poweron -c".
WOW is that loud! I'll definitely be moving it into the rack once it gets
configured... Not designed for an "office" environment...
Running through all the POST tests... Passed all the tests. Shutdown and
restart (wow - that's unnerving the way it shuts off and on like that).
Booting the OS...
Set language to English / English. Hey - there's an option for a VT100
terminal!
Enter hostname, network configuration, dns, time zone, etc. Syncing and
rebooting... (expected the shutdown this time).
Create a non-root user so I can ssh into the box and shut it down for moving
into the rack. That noise was giving me a headache. Put a Kill-A-Watt
meter on it to see how much power it takes.
Installing development tools (GCC) and preliminary testing shows that I can
crank it up to use lots of processors... (For those of you who are going to
point out that there's a copy of GCC in /usr/sfw/bin, thanks, but that
compiler is only built with C/C++ support and the application under test is
written in Ada.)
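For anyone curious how an Ada program keeps that many hardware threads busy,
here's a purely illustrative sketch (not the actual application; all the
names are made up) of a fixed pool of worker tasks, which GNAT maps onto
Solaris threads that the T1000 can spread across its 32 hardware strands:

with Ada.Text_IO; use Ada.Text_IO;

procedure Worker_Pool_Demo is
   Worker_Count : constant := 32;

   task type Worker is
      entry Start (Id : in Natural);
   end Worker;

   task body Worker is
      My_Id : Natural;
   begin
      accept Start (Id : in Natural) do
         My_Id := Id;
      end Start;
      --  A real application would search partition My_Id of the problem
      --  space here; this demo just reports in.
      Put_Line ("Worker" & Natural'Image (My_Id) & " done");
   end Worker;

   Pool : array (1 .. Worker_Count) of Worker;
begin
   --  Hand each task its identity; the program exits only after all
   --  32 tasks have terminated.
   for I in Pool'Range loop
      Pool (I).Start (I);
   end loop;
end Worker_Pool_Demo;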
November 7.
Preliminary results with 32 threads on T1000:
34227.72u 67.72s 18:27.50 3096.6%
Versus 2 threads on 3.2GHz Pentium IV running Fedora Core 5.
2684.499u 2.980s 24:33.32 182.4% 0+0k 0+0io 0pf+0w
Or one thread on a 2.2GHz Athlon 64 (3200+ - /proc/cpuinfo below), also
running Fedora Core 5:
1573.162u 4.572s 26:19.08 99.9% 0+0k 0+0io 0pf+0w
Burning up a lot of CPU time, but not getting more than a 33% improvement in
elapsed time. And that's compared against systems that cost $900 and $600 in
2003 and 2005 dollars, respectively.
It appears that the release of GCC in use (4.1.0) doesn't particularly
target the Niagara processor. Off getting and building later releases
(gcc-4.2-20061031 looks like a good choice).
Also, need to run the system upgrades. That's not working well. Contacted
Sun to find out why the command
# sconadm register -a -r /tmp/myreg.profile
fails to authenticate after trying several things referenced on their web
site. No response... Maybe I got something wrong there. Well, not a high
priority for me at the moment...
November 9.
After updating sed, libiconv, m4, flex, bison, ..., got gcc-4.2-20061031
built and installed successfully. Recompiled the application with
-mcpu=niagara and -mtune=niagara. First quick test with a single task appears to have
improved performance by just under 40%! Running more detailed test... Hmm,
not that good... Barely 2%. Need to investigate... Maybe I've got some
horrendous tasking or function call overhead...
34056.31u 68.64s 17:49.54 3190.6%
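For reference, a recompile like the one above can be invoked along these
lines (server.adb stands in for the actual main unit, which isn't shown
here):

gnatmake -O2 server.adb -cargs -mcpu=niagara -mtune=niagara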
November 14.
Reimplemented to provide more optimizations to the infrastructure (or so I
thought). Actually running slower.
With two clients on the hyperthreaded Pentium IV CPU (it was 24 minutes wall
clock before; the CPU time seems almost constant, so perhaps the slowdown is
due to overhead someplace):
2864.018u 2.736s 27:21.81 174.6% 0+0k 0+0io 0pf+0w
On the T1000:
39157.21u 73.07s 20:28.64 3192.9%
Not good - a 20% INCREASE in time.
gprof has indicated four major time sinks - I'll look into those.
On Linux / Pentium4:
58.93 1340.60 1340.60 237405021 0.00 0.00 server__solve_last_item
17.42 1736.83 396.23 237116148 0.00 0.00 server__incremental_last_multiply
12.48 2020.65 283.83 __divdi3
3.66 2104.03 83.37 __moddi3
On Solaris / Niagara:
53.15 7379.44 7379.44 195694643 0.00 0.00 server__solve_last_item
16.80 9711.64 2332.20 194689084 0.00 0.00 server__incremental_last_multiply
15.90 11919.06 2207.42 __divdi3
4.59 12556.02 636.96 __moddi3
The first two are where I expected the time to be, in my code. The other
two, __divdi3 and __moddi3, are for division and modulus arithmetic and make
up 15 - 20% of the CPU time, which is surprising, since the algorithm is
skewed to reduce division. I need to look into whether I'm doing redundant
operations.
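A side note on those two libgcc routines: in a 32-bit binary, 64-bit "/" and
"mod" aren't single instructions, so GCC emits calls to __divdi3 and
__moddi3. A tiny illustrative fragment (not from the application) that
exercises both:

with Ada.Text_IO; use Ada.Text_IO;

procedure Div_Demo is
   --  A 64-bit signed type, comparable to what a 64-bit-heavy
   --  application would use.
   type Big is range -2 ** 63 .. 2 ** 63 - 1;
   X : Big := 1_000_000_007;
begin
   --  Compiled as a 32-bit binary, the "/" and "mod" below become calls
   --  to the libgcc helpers __divdi3 and __moddi3; compiled with -m64
   --  they become single divide instructions.
   Put_Line (Big'Image (X / 97) & Big'Image (X mod 97));
end Div_Demo;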
November 15.
More minor tweaks. Got a 15 - 20% improvement from the original times!
Have thought of another algorithmic change to consider making.
On Solaris / Niagara:
28754.24u 53.52s 15:02.40 3192.3%
On Linux / Pentium IV:
1873.421u 4.036s 20:53.53 149.7% 0+0k 0+0io 0pf+0w
On Linux / Athlon64:
1450.622u 2.568s 24:14.28 99.9% 0+0k 0+0io 0pf+0w
Interesting. On the Athlon64, gprof reports no __divdi3 or __moddi3. Maybe
it has hardware instructions for that (being a 64 bit CPU). But I would
have expected the same on the Niagara... Sent a note to a friend at
Sun... Hmm, compiling with -m64 works on the Athlon64, but not on the
Niagara. Maybe I should recompile the compiler with some other
configuration switches... That doesn't appear to be it.
November 16.
I definitely need a better CM system so I can reliably reproduce these
results...
Followup from my friend at Sun - there's a GCCFSS (GCC For Sparc Systems)
specifically targeted for Sparc (and Niagara). Binaries only for C/C++, but
sources available (for a GCC 4.0.3 baseline), so I've grabbed the generic
gcc-4.0.3, imported the gcc/ada subtree and am off building now. This
should be fun, since I only have 4.1.0 and 4.2-20061031 available....
November 17 .. 19.
Built a self-generated gcc 4.0.3 (down-compiled from 4.1.0) with minor code
changes (formatting). Tried putting the ada and libada directories into the
gccfss_src tree. Bombs trying to link gnat1. Clearly, it's missing some
Sun stuff. Tweaked to get most of the symbols resolved, but not quite all.
Looks like the right place to properly fix this is in gcc/ada/Makefile.in.
Made a posting to the cooltools forum.
Got a hit on the web site requesting the log file from sun.com (and
googlebot.com!). That means at least somebody has looked at it!
November 20 .. 21.
Worked on the application, unrolling the recursion. Got it working on a
small scale. Improvement is on the order of HUGE (6 seconds versus the 15 -
25 minutes before); now I can run larger tests in "reasonable" time. Need
to write a program to automate the generation of the unrolling. This should
be very interesting...
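To make "unrolling the recursion" concrete, here's a generic illustration
(emphatically not the factoring code itself): when the depth is fixed and
known, a recursive enumeration can be replaced by explicit nested loops,
eliminating the call and block overhead.

with Ada.Text_IO; use Ada.Text_IO;

procedure Unroll_Demo is
   subtype Digit is Natural range 0 .. 9;
   Count_Rec, Count_Unrolled : Natural := 0;

   --  Recursive version: one call per level of the search.
   procedure Search (Depth : Natural) is
   begin
      if Depth = 0 then
         Count_Rec := Count_Rec + 1;          --  "test a candidate"
      else
         for D in Digit loop
            Search (Depth - 1);
         end loop;
      end if;
   end Search;
begin
   Search (3);
   --  Unrolled version of the same three-level search: no calls at all.
   for D1 in Digit loop
      for D2 in Digit loop
         for D3 in Digit loop
            Count_Unrolled := Count_Unrolled + 1;
         end loop;
      end loop;
   end loop;
   --  Both visit the same 1000 candidates.
   Put_Line (Natural'Image (Count_Rec) & Natural'Image (Count_Unrolled));
end Unroll_Demo;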
November 21.
Hmm, maybe I should offer some Duke Stars (formerly Duke Dollars) for
somebody to help create the necessary Makefile.in infrastructure changes
for getting the Ada front end hooked in with gccfss. Maybe I should
wait until the Thanksgiving holidays are over.
November 24.
Got the first part of the automation of the unrolling complete in the
server. Testing some size issues. Trying unsuccessfully to build with
large address space on Solaris. Can only get larger objects on an AMD
Athlon64 platform. Must be something I'm still missing. Try some compiler
rebuilds...
Have another "unrolling" optimization that should reduce the number of mods
and divides down into the noise. Hold that one in my back pocket until I
have other automation completed and running.
November 25.
Attempted to contact some folks from the FSF "services" page to see if I can
find somebody interested in making the necessary Makefile.in changes.
Automated a small case of unrolling in the client. Now to raise the bar and
see what breaks for larger test cases.
November 30.
Completed unrolling on a grand scale, putting declarations at the top level,
reducing runtime overhead of blocks. New results at comparable executions:
T1000 (32 threads):
10506.76u 31.90s 5:30.37 3189.9%
Athlon64 (1 thread):
383.979u 0.552s 6:25.01 99.8% 0+0k 0+0io 0pf+0w
Let's crank up to a higher level, let it run a while and see what we can
get... Duh, let's not... This implementation's size grows exponentially.
Need to make it at worst linear...
December 4.
The implementation granularity is now all wrong the other way (time): it
uses hardly any space (less than 500MB), but there are way too many
iterations to be able to make any meaningful measurements (can you say
10 ** 99!), because there's no convenient fencepost on which to take the
measurements automatically... Going to have to think about this...
Since the current implementation has now done away with much of the need for
64-bit arithmetic, change the types to 32 bits and see what improvement there
is. Yep, that quick little change generates an additional 10% improvement.
Time for a little reflection... And a break to do my real work, have a
life, etc.
Just about halfway through the evaluation period and what has been learned?
Clearly, there are multiple axes of performance here that need to be aligned
in order to maximize throughput: memory, CPU performance, disk space,
bandwidth (currently nothing is implemented that would take advantage of the
T1000's four gigabit Ethernet interfaces), algorithm, etc. Of course, there's
also that power consumption advantage that the T1000 is supposed to have over
comparable systems.
December 5.
Followup on the GCCfss forum about my attempted merge of GCCfss with Ada
support. Just not high on the priority list at the moment to invest the
additional time...
December 20.
Decided to reimplement with byte alignment rather than powers of 10 (that's
part of why there were no entries for the past couple of weeks). WOW! Run
time reduced by almost 95% for the small test case on the Athlon64 (380
seconds down to 20) and by 90% on the T1000 (330 seconds down to 32)!
Duh! Slap forehead. I'll bet all the divides and mods (there are still
some, despite my best efforts) are now just simple shifts and masks, running
at clock speeds, rather than in some C library...
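To spell that out in code, here's an illustrative fragment (again, not the
application itself): dividing by a power of two lets the compiler use a few
shift and mask instructions with no library call, whereas dividing by ten
forces a real division, which for 64-bit operands in a 32-bit binary means
exactly those library routines from the profiles above.

with Ada.Text_IO; use Ada.Text_IO;

procedure Shift_Demo is
   type S64 is range -2 ** 63 .. 2 ** 63 - 1;
   X : S64 := 123_456_789;
begin
   --  "/ 256" and "mod 256" compile down to shifts and masks.
   Put_Line (S64'Image (X / 256) & S64'Image (X mod 256));
   --  "/ 10" and "mod 10" cannot, so in a 32-bit binary they become the
   --  __divdi3 / __moddi3 calls seen in the gprof output.
   Put_Line (S64'Image (X / 10) & S64'Image (X mod 10));
end Shift_Demo;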
Running gprof indicates slightly more activity on the T1000 and the Pentium
IV, probably because the remaining tasks now run to completion rather than
being terminated once one of them finds the solution.
T1000 (32 threads):
989.16u 4.56s 0:32.47 3060.4%
Athlon64 (1 thread):
19.941u 0.132s 0:20.14 99.6% 0+0k 0+0io 0pf+0w
Pentium IV (2 threads):
51.895u 0.312s 0:26.53 196.7% 0+0k 0+0io 0pf+0w
December 22.
Larger example (strange timing report there with all the negative numbers)
clearly favors the T1000 now...
T1000 (32 threads):
-37836.-46u 74.62s 3:24:18.71 -308.0%
Athlon64 (1 thread):
15088.478u 20.813s 4:12:03.36 99.9% 0+0k 0+0io 0pf+0w
Pentium IV (2 threads):
35527.756u 5.668s 4:56:41.53 199.6% 0+0k 0+0io 0pf+0w
Want to run this in "production" on some longer test cases over the Christmas
holiday.
Also want to look into making it work with words (16 bits, maybe double
words, etc.) rather than bytes (8 bits), but I know the memory / disk space
demands will require some novel data storage approaches. At 8GB per client
task (256GB total for 32 clients!), we can't conveniently afford to keep
everything in memory or on disk, since there are only a couple of 73GB disks
on the machine currently.
December 24.
Contacted Sun about picking up the box. It was depressing to have to do so,
since there's been so much learned. But, I've really specified the system
wrong for the way I'm using it (I'm not using the 16GB of memory, for
example), and I simply can't afford that kind of expense for my R&D effort
while I can use the Athlon64 for much of the algorithm development effort.
I'm also not expecting that I'm going to be competitive in the Open
Performance Contest, but I'll probably go ahead and submit what I've done -
maybe somebody will read it and be interested enough in my work to help fund
it... :-) At least it's not just another database / http application!
Another Aha! on the axes of performance - as currently written, the
algorithm does a breadth-first search across 32 CPUs. Now, there are only
64 partitions (when implemented at the byte boundary), so if we just ran 64
threads, we should get broader coverage. So, unless I actually DO implement
another partitioning scheme and / or ordering (search one partition front to
back, another back to front), there's no way to effectively utilize multiple
(as in greater than 2) T1000s anyway, especially with the new machines
sporting 64 threads...
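A toy illustration of that ordering idea (not the real partitioning code;
sizes and names are invented): two workers assigned to the same partition
could sweep it from opposite ends.

procedure Two_Way_Demo is
   type Candidate is mod 2 ** 32;
   Partition_Last : constant Candidate := 2 ** 20 - 1;   --  invented size

   procedure Sweep_Up is
   begin
      for C in Candidate range 0 .. Partition_Last loop
         null;   --  test candidate C here
      end loop;
   end Sweep_Up;

   procedure Sweep_Down is
   begin
      for C in reverse Candidate range 0 .. Partition_Last loop
         null;   --  test candidate C here
      end loop;
   end Sweep_Down;
begin
   --  In a real scheme each sweep would run in its own task and stop as
   --  soon as either direction finds a hit.
   Sweep_Up;
   Sweep_Down;
end Two_Way_Demo;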
Mostly final thoughts...
Don't try to do anything like this in November / December. Not with a
family, a life and two paying jobs at the same time. I had been hoping to
write a paper that came up with a good estimate of the number of Niagara
boxes that could effectively be used to break a particular number within a
period of a month or year, with a certain budget (I think the number was
$1,000,000). This Franke article talks about the development of a $25,000
system, while the SHARK paper discusses a $160M machine for doing so.
But I clearly have yet to identify a more "optimal" configuration and
implementation.
Hopefully I'll have time to run some more of those timing tests on the box
after Christmas before having to package it up to ship it back...
Running tests...
December 27.
Contacted FedEx...
December 28.
System going down...
January 1 .. 2, 2007.
Wow! Several hits from *.sun.com addresses on the web site, indicating
somebody is actually reviewing it!
January 2, 2007.
Email from Azul Systems indicating that their product doesn't have a native
interface for languages other than Java.
January 5, 2007.
Received a letter from Sun congratulating me and informing me that I've been
selected as a potential (my emphasis) winner in the Open Performance
Contest. Had to pick my jaw up off the floor on that one. Even just
potential was a pleasant (and shocking) surprise to me.