Core 139

Fall, 2004

Algorithms and Complexity:

Why Encryption Systems Work, Part 2

Complexity

    When we look to develop or use an algorithm on large problems, it is important to understand how long the algorithm might take to run.   The time for most algorithms depends on the amount of data or size of the problem.  In order to analyze an algorithm, we try to find a relationship showing how the time needed for the algorithm depends on the amount of data.  This is call the "complexity" of the algorithm.  A simple algorithm may have a high complexity, whereas an algorithm which is very complex in its organization sometimes pays off by having a lower complexity in the sense of time needed for computation.

    Factors other than the algorithm selected to solve a problem affect the time needed for a program on a computer using the algorithm to work.  Just as different people carrying out a solution to a problem may work at different speeds, even when they use the same method, different computers work at different speeds.  The different speeds of computers on the same program can be due to different "clock speeds," the rate at which steps in the program are executed by the machine, and different "architectures," the way in which the internal insructions and circuitry of the computer are organized.  Consequently, an analysis of an algorithm can not predict exactly how long it will take on a particular computer.  What analysis can do is tell us how the time needed depends on the amount of data.  For example, an analysis of an algorithm might tell us that when we double the amount of data, the time needed is also doubled, in other words the time needed is proportional to the amount of data.  The analysis of another algorithm might tell us that when we double the amount of data, the time is increased by a factor of four, that is the time needed is proportional to the amount of data squared.  The latter algorithm would have the time needed increase much more rapidly than the first.

Examples

    We will analyze the complexity of the algorithms we used as examples.
Summing a list of numbers
    The algorithm for summing a list of numbers added each number in turn to the sum.  If there were N numbers in the list, the time needed would be some constant (the time needed to add a pair of numbers) times N.  Thus if we doubled the number of numbers to 2N, then the time needed would also be doubled.  This algorithm is "linear" and behaves like the straight line in the graph above.
 
Sorting a list of names
    The first sorting algorithm which we looked at, selection sort, takes each unsorted list and goes through each item in order to identify the smallest.  The first unsorted list consists of all N items in the original, and each successive unsorted list is one smaller, as the smallest item each time is placed in its correct position.  So the first search for the smallest will take N operations, the next search for the smallest will take N-1 operations, the next N-2 operations, and so on until the last searach for the smallest take 2 operations (we don't need to search when the list has only one item).  Thus the time needed will be a constant times the following sum:

                    N + (N-1) + (N-2) + ... + 3 + 2 + 1

(We include the 1 to create a sum for which a standard formula applies, it adds only a small constant which does not affect the overall result.)  This sum, the sum of the first N integers, is known to be

                   N (N+1) / 2  which is approximately    N2/2.

Since we know that this is multiplied by some constant representing the time the operations take on a specific machine, we can see that the time needed for this algorithm varies like N2, the square of the amount of data.  Thus the time needed for the selection sort algorithm behaves like the upper curve in the graph above; each time the size of the list is doubled, the amount of time needed is increased by a factor of four.  Such an algorithm is called quadratic.

    The second sorting algorithm, bin sort, depended on the fact that we sort a list of numbers which have a limited  number of decimal places, as might be the case for a set of identity numbers.  If the numbers each have k or fewer digits (we'll assume they are all positive), then simply think of them as all having exactly k digits, with zeroes filling in the higher digits when needed.  Recall that the algorithm goes through the entire set of numbers once for each digit place, first placing them in "bins" according to the least significant digit, then the next, and so on.  Thus the total number of operations to sort N numbers is kN.  Since we are assuming a fixed k, the time needed to sort the list is simply a constant times N, a linear relationship like the lower curve in the graph above.  As the length of the list to be sorted gets larger and larger, the bin sort will work faster than the selection sort by a wide margin -- it has a smaller computational complexity.
 

Factoring an integer
    The problem of factoring an integer had a fairly simple algorithm.  We simply try dividing by every odd number, until the number being factored is reduced to one by the divisions.  The worst possibility will be when  the number to be factored is prime.  In that case, we will need to try every odd number up to and including the target number.  If the target number is T, then we try dividing by T/2 odd numbers, so the time needed for this algorithm is some constant times the value of the target number.

    However, the value of the target number is not the same as the size of the data -- meaning the amount of space needed to store the number. The amount of data is the number of digits in the number.  Each time we add an additional digit to a number (assuming the highest digit has the same value), we in effect multiply the number by ten:  230 is tens times as large as 23, and 2300 is ten times as large as 230.  If we let N be the number of digits, then the value of the number is approximately 10N, so the time needed to carry out the algorithm is some constant times 10N.  This is an exponential function which grows faster than any power of N (like N2, N3, or N10, or N100).
 

Reasonable and "Hard" Problems

    Computer scientists sometimes call problems which can be computed in a time proportional to a power of N, the size of the data,  reasonable.  The summing algorithm and both sort algorithms discussed above would be "reasonable."  This is because as computers get faster and faster with advances in technology, the size of problems which can be solved using these methods grows fairly fast as well.

    Problems for which the best algorithm has a time complexity which grows proportional to an exponential function of N, the amount of data, are considered hard.  The factoring algorithm we discussed above is considered hard, because the time needed to factor a number is multiplied by 10 for each added digit.  In CS 102 a couple years ago, students wrote a program to factor arbitrarily large integers using this method.  An 8 digit integer took about an hour to factor with the faster version of the algorithm on a faster machine.  That means a 9 digit number would take 10 hours, a 10 digit number would take 100 hours, or about 4 days, an 11 digit number would therefore take about 40 days and a 12 digit number would take 400 days or more than a year.  If we keep projecting like this, a 20 digit number, using our algorithm, would take 100,000,000 years (one hundred million years), using our algorithm.

    Now, faster factoring algorithms are known, and faster computers are available.  However, there is no known factoring algorithm that does better than the exponential pattern, just ones that make the constant smaller.  Consequently, even if we found a combination of algorithm and machine that was a million times faster than ours, it would take 100 years to factor a 20 digit prime, and by going only up to 25 digits, we would increase the time to 10,000,000 years.  This is why this type of problem is considered "hard."
 

Factoring and Encryption

    Some modern encryption techniques are based on algorithms involving primes, such that in order to break the code, it would be necessary to factor a number which is the product of two very large primes.  As we have seen above, this would take a time which is an exponential function of the number of digits in the primes.  Consequently, this is a "hard" problem. As computers get faster, the problem can be put out of the range of any reasonable time by adding just a few digits to the numbers used for the code.  On the other hand, the algorithms which are used to do the encoding and decoding have a complexity which is a small power of the number of digits in the numbers used (N,  N2,or N3), so that they are quite reasonable to compute in a short time, even when the numbers have a large number of digits.