# Master Thesis

General description: Parallel SIQS

Period: 2003-3005

Technical details:
In this atricle I explain the problem that I studied during my master program [1]. My master thesis was on "Parallel Programming" in which I designed and implemented a practical framework for parallel execution of computational intensive applications. To make my proposed design more practical and concrete, I decided to choose a well-defined and hard mathematical problem. Thus, I chose to crack large RSA-keys whose complexity is similar to integer factorization problem. I focused on implementing an optimized parallel program for factorizing integers. To provide a general framework for other computational expensive problems I divided the main task to smaller pieces as follows:

1) First, I started studying the integer factorization problem by reading the number theory literature and identifying the existing algorithms and comparing their running time complexity. In this phase, I read a few mathematical books and papers to get familiar with number theory and integer factorization methods. Next, I chose the Self Initialization Quadratic Sieve (SIQS) algorithm.

2) In the next step, I analyzed the SIQS algorithm more deeply to identify the most computational part in terms of time and space. Any parallelization efforts must be focused on the most computational part of the serial algorithm (i.e. its bottleneck). This part in SIQS algorithm was the sieving part.

3) To design an optimized solution for the integer factorization problem, I reviewed the previous work in the field of large number factorization. The most important fact which was neglected by many researchers was the fact that they didnot spend enough time to optimize their serial implementation. I found that our efforts for parallelizing a serial task would be wasted unless we use the resources on a single machine efficiently. Practically we can obtain an optimized parallel solution for a huge task when our sub-tasks (i.e. little processes) run efficiently.

4) To satisfy the previous requirement, I concentrated my optimization efforts on two levels: firstly on the algorithm level (i.e. the techniques which can be used to reduce the complexity of the involved algorithms), and secondly on the implementation level (i.e different techniques should be used to optimize the code such as using gcc's flags to get faster code, using C code programming methods like inline functioning, assembly programming, loop unrolling, using multiplication operator instead of division operator, maximal usage of 32-bit data structures and so on and so forth). After identifying the existing optimization methods I tried to find a serial implementation that maximally meet these requirements instead of developing all code from the scratch. One of the best codes which matched my needs was developed by "J. Papdopolous". After this, I tried to optimize his code further and for this reason I concentrated on the sieving part of the SIQS algorithm. One of the most valuable solution for optimizing the code was proposed by G.Wabmach in "Block Sieving Algorithm" and K.Aoki in "Sieving Using Bucket Sort". Sieving part of the algorithm needs an enormous number of memory updates; however, the updates usually cause cache misses. For solving this problem and increasing the speed of sieving step I used the theoretical ideas from Aoki, Webmach and practical code of Papdopolous and tried to demonstrate analytically and scientifically that by integrating these ideas we can obtain an efficient application for integer factorization. The integration of these ideas would improve the cache locality by using cache blocking technique for partitioning the sieve array data space to smaller blocks that fit into layer-1 cache and hashtable data structure for processing the large factor base primes data space for factoring large integers to fit other active data structures during sieving process into L1 and L2 caches. All of these efforts resulted in an efficient serial version for sieving code.

5) When I attained an optimized serial implementation that efficiently uses the single node resources, I focused my efforts on parallelization step. For distributed computing, I used the idea of data decomposition and tried to decompose my data space to separate and independent parts. These separated data can run on different Cluster nodes. The dynamic load balancing technique was used to distribute the whole load for factorization between the Slave nodes.

6) The proposed parallel algorithm would give this opportunity to Slaves nodes to generate their needed input data (e.g. the coefficients of the quadratic polynomials) independently from each other. Thus, the Cluster would obtain a near ideal efficiency.

7) For the first time in Sharif University (Iran) I built a 17-node Linux Cluster for running highly computational tasks. I did the installation, configuration, optimization and administration of the Cluster nodes from A to Z. When my parallel implementation was ready, I started my big task on the designed Cluster and collected the required data for benchmarking. I cracked the RSA-key with a length of more than 330 bits in 2 hours. My parallel implementation is faster than other solutions at its time due to the time that I spent to optimize the serial and parallel codes.

8) To compare my performance results with other researchers, I refer the reader to Absrink's paper titled "Factoring large integers using parallel Quadratic Sieve" which was published in 2000. Absrink cracked a 60 decimal digit integer on his Cluster (16 nodes) in about 400 minutes, but I cracked this number on just a single node around 18 seconds giving a speed up around 400min/18sec=1333. By using our 16 nodes Cluster for factorizing the same number, we achieved a theoretical speed up factor around 21328. If we consider the Moor's law by which the speed of processors would double every two years, we will conclude that my designed Cluster is about 16 times faster than Absrink's Cluster at his time. However, we still obtain a speed up factor around 21328/16=1333! Indeed, this difference between my parallel implementation and Absrink's implementation is due to optimization methods that we used for our serial and parallel implementations.

You can download the source code of my thesis which is the parallel version of "msieve" application from here: msieve-1.01.parallel.tar.

Refs:
[1] Kazem Jahanbakhsh, "RUNNING MULTIPLE PROCESSES EFFICIENTLY BY CLUSTERING", master thesis in Persian language, November 2005, Sharif University of Technology.