A Known Plaintext Attack on the PKZip Stream Cipher


If you have a PostScript printer and/or GNU GhostScript (available free
via FTP), the PostScript version of this paper will look nicer and
will include the graph accompanying Figure 1.

Superscripts and subscripts are in pseudo-Latex format, where a
superscript b is denoted a^{b} and a subscript b is a_{b}.

The paper will appear in the proceedings of the December 1994
Algorithms workshop held in Leuven, Belgium (which presumably
will be printed by Springer-Verlag).

Cheers,
Paul Kocher
kocherp@leland.stanford.edu

----------------------------------------------------------------------------



         A Known Plaintext Attack on the PKZIP Stream Cipher

                Eli Biham (*)    Paul C. Kocher (**)


(*)   Computer Science Department, Technion - Israel Institute of
            Technology, Haifa 32000, Israel.

(**)  Independent cryptographic consultant, 7700 N.W. Ridgewood Dr.,
            Corvallis, OR 97330, USA.  (415) 354-8004.



ABSTRACT:

The PKZIP program is one of the more widely used archive/compression
programs on personal computers.  It also has many compatible variants
on other computers, and is used by most BBS's and ftp sites to
compress their archives.  PKZIP provides a stream cipher which allows
users to scramble files with variable length keys (passwords).  In
this paper we describe a known plaintext attack on this cipher, which
can find the internal representation of the key within a few hours on
a personal computer using a few hundred bytes of known plaintext.  In
many cases, the actual user keys can also be found from the internal
representation.  We conclude that the PKZIP cipher is weak, and
should not be used to protect valuable data.



SECTION 1:  Introduction

The PKZIP program is one of the more widely used archive/compression
programs on personal computers.  It also has many compatible variants
on other computers (such as Infozip's zip/unzip), and is used by most
BBS's and ftp sites to compress their archives.  PKZIP provides a
stream cipher which allows users to scramble the archived files under
variable length keys (passwords).  This stream cipher was designed by
Roger Schlafly.

      In this paper we describe a known plaintext attack on the PKZIP
stream cipher which takes a few hours on a personal computer and
requires about 13--40 (compressed) known plaintext bytes, or the
first 30--200 uncompressed bytes, when the file is compressed.  The
attack primarily finds the 96-bit internal representation of the key,
which suffices to decrypt the whole file and any other file encrypted
under the same key.  Later, the original key can be constructed.
This attack was used to find the key of the PKZIP contest.

      The analysis in this paper holds for both versions of PKZIP:
version 1.10 and version 2.04g.  The ciphers used in the two versions
differ in minor details, which do not affect the analysis.

      The structure of this paper is as follows: Section 2 describes
PKZIP and the PKZIP stream cipher.  The attack is described in
Section 3, and a summary of the results is given in Section 4.



SECTION 2:  The PKZIP Stream Cipher

PKZIP manages a ZIP file[1] which is an archive containing many files
in a compressed form, along with file headers describing (for each
file) the file name, the compression method, whether the file is
encrypted, the CRC-32 value, the original and compressed sizes of the
file, and other auxiliary information.

      The files are kept in the zip-file in the shortest form
produced by any of several compression methods.  If the
compression methods do not shrink the file, the file is
stored without compression.  If encryption is required, 12 bytes
(called the _encryption_header_) are prepended to the compressed
form, and the encrypted form of the result is kept in the zip-file.
The 12 prepended bytes are used for randomization, but also include
header dependent data to allow identification of wrong keys when
decrypting.  In particular, in PKZIP 1.10 the last two bytes of these
12 bytes are derived from the CRC-32 field of the header, and many of
the other prepended bytes are constant or can be predicted from other
values in the file header.  In PKZIP 2.04g, only the last byte of
these 12 bytes is derived from the CRC-32 field.  The file headers
are not encrypted in either version.

      The cipher is byte-oriented, encrypting under variable length
keys.  It has a 96-bit internal memory, divided into three 32-bit
words called key0, key1 and key2.  An 8-bit variable key3 (not part
of the internal memory) is derived from key2.  The key initializes
the memory: each key has an equivalent internal representation as
three 32-bit words.  Two keys are equivalent if their internal
representations are the same.  The plaintext bytes update the memory
during encryption.

      The main function of the cipher is called update_keys, and is
used to update the internal memory and to derive the variable key3,
for each given input (usually plaintext) byte:

update_keys_{i} (char):
  local unsigned short temp
  key0_{i+1} <-- crc32(key0_{i},char)
  key1_{i+1} <-- (key1_{i} + LSB(key0_{i+1})) * 134775813 + 1 (mod 2^{32})
  key2_{i+1} <-- crc32(key2_{i},MSB(key1_{i+1}))
  temp_{i+1} <-- key2_{i+1} | 3    (16 LS bits)
  key3_{i+1} <-- LSB((temp_{i+1} * (temp_{i+1} xor 1)) >> 8)
end update_keys

where | is the binary inclusive-or operator, and >> denotes the right
shift operator (as in the C programming language).  LSB and MSB
denote the least significant byte and the most significant byte of
the operands, respectively.  The indices are used only for later
reference and are not part of the algorithm.  The original algorithm
computes temp with an inclusive-or with 2; using an inclusive-or with
3 instead gives the same values of key3, and we prefer this notation
because it removes one bit of uncertainty about temp in the following
discussion.
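
The equivalence of the two inclusive-or constants can be checked
exhaustively over the 16 bits that enter temp.  The following Python
sketch does so; the function names key3_or3 and key3_or2 are ours and
are not part of PKZIP.

```python
def key3_or3(key2):
    """key3 as written in update_keys above (temp = key2 | 3)."""
    temp = (key2 | 3) & 0xFFFF          # only the 16 least significant bits are used
    return ((temp * (temp ^ 1)) >> 8) & 0xFF


def key3_or2(key2):
    """key3 computed with an inclusive-or with 2 instead of 3."""
    temp = (key2 | 2) & 0xFFFF
    return ((temp * (temp ^ 1)) >> 8) & 0xFF


# Both variants multiply the same pair {temp, temp xor 1}, so they should
# agree on every possible value of the 16 low bits of key2.
all_equal = all(key3_or3(k) == key3_or2(k) for k in range(1 << 16))
print(all_equal)
```

The check works because flipping bit 0 of temp only swaps the two
factors of the product.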

      Before encrypting, the key (password) is processed to update
the initial value of the internal memory by:

process_keys(key):
  key0_{1-l} <-- 0x12345678
  key1_{1-l} <-- 0x23456789
  key2_{1-l} <-- 0x34567890
  loop for i <-- 1 to l
    update_keys_{i-l}(key_{i})
  end loop
end process_keys

where l is the length of the key (in bytes) and hexadecimal numbers
are prefixed by 0x (as in the C programming language).  After
executing this procedure, the internal memory contains the internal
representation of the key, which is denoted by key0_{1}, key1_{1} and
key2_{1}.

      The encryption algorithm itself processes the bytes of the
compressed form along with the prepended 12 bytes by:

      Encryption                      Decryption
      ----------                      ----------
      prepend P_{1},...,P_{12}
      loop for i <-- 1 to n           loop for i <-- 1 to n
        C_{i} <-- P_{i} xor key3_{i}    P_{i} <-- C_{i} xor key3_{i}
        update_keys_{i}(P_{i})        update_keys_{i}(P_{i})
      end loop                        discard P_{1},...,P_{12}

The decryption process is similar, except that it discards the 12
prepended bytes.
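
The procedures above can be put together as a minimal Python sketch,
intended for experimentation only.  The names Keys, encrypt and
decrypt are ours, and header12 is a hypothetical parameter standing
for the 12 prepended bytes, which the caller must supply.  The crc32
helper follows the bitwise definition of crc32 used in this paper.

```python
MASK32 = 0xFFFFFFFF


def crc32(pval, char):
    """One crc32 step: xor the byte in, then 8 shift/xor rounds (polynomial 0xEDB88320)."""
    temp = (pval ^ char) & MASK32
    for _ in range(8):
        temp = (temp >> 1) ^ 0xEDB88320 if temp & 1 else temp >> 1
    return temp


class Keys:
    """The 96-bit internal memory (key0, key1, key2)."""

    def __init__(self, password=b""):
        self.key0, self.key1, self.key2 = 0x12345678, 0x23456789, 0x34567890
        for byte in password:           # process_keys
            self.update(byte)

    def update(self, char):             # update_keys
        self.key0 = crc32(self.key0, char)
        self.key1 = ((self.key1 + (self.key0 & 0xFF)) * 134775813 + 1) & MASK32
        self.key2 = crc32(self.key2, self.key1 >> 24)

    def key3(self):
        temp = (self.key2 | 3) & 0xFFFF
        return ((temp * (temp ^ 1)) >> 8) & 0xFF


def encrypt(password, header12, data):
    """Encrypt header12 + data; header12 must be exactly 12 bytes."""
    keys, out = Keys(password), bytearray()
    for p in header12 + data:
        out.append(p ^ keys.key3())
        keys.update(p)                  # the memory is updated with the plaintext byte
    return bytes(out)


def decrypt(password, ciphertext):
    keys, out = Keys(password), bytearray()
    for c in ciphertext:
        p = c ^ keys.key3()
        keys.update(p)
        out.append(p)
    return bytes(out[12:])              # discard the 12 prepended bytes
```

For example, decrypt(b"pw", encrypt(b"pw", bytes(12), b"hello"))
returns b"hello" again.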

      The crc32 operation takes a previous 32-bit value and a byte,
XORs them, and calculates the next 32-bit value using the CRC
polynomial denoted by 0xEDB88320.  In practice, a table crctab can be
precomputed, and the crc32 calculation becomes:

crc32 = crc32(pval,char) = (pval>>8) xor crctab[LSB(pval) xor char]

The crc32 equation is invertible in the following sense:

pval = crc32^{-1}(crc32,char) =
      (crc32 << 8) xor crcinvtab[MSB(crc32)] xor char

crctab and crcinvtab are precomputed as:

init_crc():
  local unsigned long temp
  loop for i <-- 0 to 255
    temp <-- crc32(0,i)
    crctab[i] <-- temp
    crcinvtab[temp >> 24] <-- (temp << 8) xor i
  end loop
end init_crc

in which crc32 refers to the original definition of crc32:

crc32(temp,i):
  temp <-- temp xor i
  loop for j <-- 0 to 7
    if odd(temp) then
      temp <-- (temp >> 1) xor 0xEDB88320
    else
      temp <-- temp >> 1
    endif
  end loop
  return temp
end crc32
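
The table construction and the inversion property can be checked with
a short Python sketch.  The names crc32_bitwise and crc32_inv are
ours; the truncation to 32 bits after the left shift is implicit in
the pseudocode above and is made explicit here.

```python
MASK32 = 0xFFFFFFFF


def crc32_bitwise(temp, i):
    """The original (bitwise) definition of crc32 given above."""
    temp = (temp ^ i) & MASK32
    for _ in range(8):
        temp = (temp >> 1) ^ 0xEDB88320 if temp & 1 else temp >> 1
    return temp


# init_crc: build the forward table and the inverse table together.
crctab = [0] * 256
crcinvtab = [0] * 256
for i in range(256):
    temp = crc32_bitwise(0, i)
    crctab[i] = temp
    crcinvtab[temp >> 24] = ((temp << 8) ^ i) & MASK32


def crc32(pval, char):
    """Table-driven crc32 step."""
    return (pval >> 8) ^ crctab[(pval ^ char) & 0xFF]


def crc32_inv(crc, char):
    """Recover pval from crc32(pval, char) and char."""
    return ((crc << 8) ^ crcinvtab[crc >> 24] ^ char) & MASK32


pval, char = 0x12345678, 0x41
print(crc32_inv(crc32(pval, char), char) == pval)
```

The inverse table is fully populated because the most significant
bytes of the 256 crctab entries are all distinct.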


SECTION 3:  The Attack

The attack we describe works even if the known plaintext bytes are
not the first bytes (if the file is compressed, it needs the
compressed bytes, rather than the uncompressed bytes).  In the
following discussion the subscripts of the n known plaintext bytes
are denoted by 1,...,n, even if the known bytes are not the first
bytes.  We ignore the subscripts when the meaning is clear and the
discussion holds for all the indices.

      Under a known plaintext attack, both the plaintext and the
ciphertext are known.  In the PKZIP cipher, given a plaintext byte
and the corresponding ciphertext byte, the value of the variable key3
can be calculated by

                   key3_{i} = P_{i} xor C_{i}.

Given P_{1},...,P_{n} and C_{1},...,C_{n}, we receive the values of
key3_{1},...,key3_{n}.  The known plaintext bytes are the inputs of
the update_keys function, and the derived key3's are the outputs.
Therefore, in order to break the cipher, it suffices to solve the set
of equations derived from update_keys, and find the initial values of
key0, key1 and key2.

In the following subsections we describe how we find many possible
values for key2, then how we extend these possible values to possible
values of key1 and key0, and how we discard all the wrong values.
We are then left with only the right values, which correspond to the
internal representation of the key.

Subsection 3.1:  key2

The value of key3 depends only on the 14 bits of key2 that
participate in temp.  Any value of key3 is suggested by exactly 64
possible values of temp (and thus 64 possible values of the 14 bits
of key2).  The two least significant bits of key2 and the 16 most
significant bits affect neither key3 nor temp.
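
The count of 64 values per key3 can be confirmed by enumerating the
14 relevant bits of key2.  This is a small Python check; the names
key3_from_key2 and preimages are ours.

```python
from collections import Counter


def key3_from_key2(key2):
    temp = (key2 | 3) & 0xFFFF
    return ((temp * (temp ^ 1)) >> 8) & 0xFF


# Enumerate bits 2..15 of key2; bits 0 and 1 are forced to 1 by the or with 3,
# so stepping by 4 visits every distinct value of temp once.
preimages = {k3: [] for k3 in range(256)}
for bits in range(0, 1 << 16, 4):
    preimages[key3_from_key2(bits)].append(bits)

# Histogram of "how many 14-bit values map to each key3 value".
counts = Counter(len(values) for values in preimages.values())
print(counts)
```

The preimages table is the kind of lookup an implementation of the
attack would consult for each known key3 byte.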

Given the 64 possibilities of temp in one location of the encrypted
text, we complete the 16 most significant bits of key2 with all the
2^{16} possible values, and get 2^{22} possible values for the 30
most significant bits of key2.  key2_{i+1} is calculated by
key2_{i+1} <-- crc32(key2_{i},MSB(key1_{i+1})).  Thus,

(Equation 1)

      key2_{i} = crc32^{-1}(key2_{i+1},MSB(key1_{i+1}))
               = (key2_{i+1}<<8) xor crcinvtab[MSB(key2_{i+1})]
                       xor MSB(key1_{i+1}).

Given any particular value of key2_{i+1}, for each term of this
equation we can calculate the value of the 22 most significant bits
of the right hand side of the equation, and we know 64 possibilities
for the value of 14 bits of the left hand side, as described in Table
1.  From the table, we can see that six bits are common to the right
hand side and the left hand side.  Only about 2^{-6} of the possible
values of the 14 bits of key2_{i} have the same value of the common
bits as in the right hand side, and thus, we remain with only one
possible value of the 14 bits of key2_{i} on average, for each
possible value of the 30 bits of key2_{i+1}.  When this equation
holds, we can complete additional bits of the right and the left
sides, up to the total of the 30 bits known in at least one of the
sides.  Thus, we can deduce the 30 most significant bits of key2_{i}.
We get on average one value for these 30 most significant bits of
key2_{i}, for each value of the 30 most significant bits of
key2_{i+1}.  Therefore, we are now in the same situation with
key2_{i} as we were before with key2_{i+1}, and we can now find
values of 30 bits of key2_{i-1}, key2_{i-2},...,key2_{1}.  Given this
list of 30-bit values, we can complete the 32-bit values of key2_{n},
key2_{n-1},...,key2_{2} (excluding key2_{1}) using the same equation.
We remain with about 2^{22} lists of possible values
(key2_{n},key2_{n-1},...,key2_{2}), of which one must be the list
actually calculated during encryption.


Table 1:

  Side  Term                          Bits#  BitsPosition  #OfValues
  ------------------------------------------------------------------
  Left  key2_{i}                        14      2-15          64
  Right key2_{i+1} << 8                 22     10-31           1
        crcinvtab[MSB(key2_{i+1})]      32      0-31           1
        MSB(key1_{i+1})                 24      8-31           1
  ------------------------------------------------------------------
  Total Left Hand Side                  14      2-15          64
  Total Right Hand Side                 22     10-31           1
  ------------------------------------------------------------------
  Common bits                            6     10-15
  Total bits                            30      2-31
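
One backward step on key2 can be sketched in Python as follows.  This
is an illustration under our own naming (key2_step_back, PREIMAGES),
not the authors' implementation: candidates are 32-bit words whose
two least significant bits are cleared, standing for the 30 known
bits.

```python
MASK32 = 0xFFFFFFFF

crctab, crcinvtab = [0] * 256, [0] * 256
for i in range(256):
    t = i
    for _ in range(8):
        t = (t >> 1) ^ 0xEDB88320 if t & 1 else t >> 1
    crctab[i] = t
    crcinvtab[t >> 24] = ((t << 8) ^ i) & MASK32


def crc32(pval, char):
    return (pval >> 8) ^ crctab[(pval ^ char) & 0xFF]


def key3_from_key2(key2):
    temp = (key2 | 3) & 0xFFFF
    return ((temp * (temp ^ 1)) >> 8) & 0xFF


# For every key3 value, the 64 possible values of bits 2..15 of key2.
PREIMAGES = [[] for _ in range(256)]
for bits in range(0, 1 << 16, 4):
    PREIMAGES[key3_from_key2(bits)].append(bits)


def key2_step_back(key2_next_hi30, key3_prev):
    """From bits 2..31 of key2_{i+1} and the byte key3_{i}, list the
    candidates for bits 2..31 of key2_{i}."""
    # Bits 10..31 of the right hand side do not depend on the unknowns.
    rhs = ((key2_next_hi30 << 8) ^ crcinvtab[key2_next_hi30 >> 24]) & MASK32
    candidates = []
    for low in PREIMAGES[key3_prev]:
        if (low ^ rhs) & 0xFC00 == 0:          # the six common bits 10..15 agree
            candidates.append((rhs & 0xFFFF0000) | low)
    return candidates
```

Applying key2_step_back repeatedly, once per known key3 byte, walks a
candidate for key2_{n} back towards key2_{1}.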



Subsection 3.2:  Reducing the number of possible values of key2

The total complexity of our attack is (as described later) 2^{16}
times the number of possible lists of key2's.  If we remain with
2^{22} lists, the total complexity becomes 2^{38}.  This complexity
can be reduced if we can reduce the number of lists of key2's without
discarding the right list.

      We observed that the attack requires only 12--13 known
plaintext bytes (as we describe later).  Our idea is to use longer
known plaintext streams, and to reduce the number of lists based on
the additional plaintext.  In particular, we are interested only in
the values of key2_{13}, and not in the whole list of key2_{i},
i=13,...,n.  key2_{13} is then used in the attack as described
above.

We start with the 2^{22} possible values of key2_{n}, and calculate
the possible values of key2_{n-1}, key2_{n-2}, etc. using Equation 1.
The number of possible values of key2_{i} (i=n-1, n-2, etc.) remains
about 2^{22}.  However, some of the values are duplicates of other
values.  When these duplicates are discarded, the number of possible
values of key2_{i} is substantially decreased.  To speed things up,
we calculate all the possible values of key2_{n-1}, and remove the
duplicates.  Then we calculate all the possible values of key2_{n-2},
and remove the duplicates, and so on.  When the duplicates fraction
becomes smaller, we can remove the duplicates only every several
bytes, to save overhead.  Figure 1 shows the number of remaining
values for any given size of known plaintext participating in the
reduction, as was measured on the PKZIP contest file (which is
typical).  We observed that using about 40 known plaintext bytes (28
of them are used for the reduction and 12 for the attack), the number
of possible values of key2_{13} is reduced to about 2^{18}, and the
complexity of the attack is 2^{34}.  Using 10000 bytes of known
plaintext, the complexity of our attack is reduced to 2^{24}--2^{27}.


Figure 1:

      bytes     key2 list entries
        1      2^{22}=4194304
        2             3473408
        3             2152448         +-----------------------------+
        4             1789183         | The PostScript version of   |
        5             1521084         | the paper has a graph here  |
       10              798169         | showing the number of key2  |
       15              538746         | values as a function of the |
       20              409011         | number of plaintext bytes.  |
       25              332606         +-----------------------------+
       30              283930
       40              213751
       50              174471
      100               88248
      200               43796
      300               31088
      500               16822
     1000                7785
     2000                5196
     4000                3976
     6000                3000
     8000                1296
    10000                1857
    12000                 243
    12289                 801

    Fig 1.  Decrease in the number of key2 candidates using varying
    amounts of known plaintext.  These results are for the PKZIP
    contest file and are fairly typical, though the entry 12000 is
    unusually low.
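
    The relation between the entries of Figure 1 and the complexity
    figures quoted in the text (2^{16} times the number of key2
    values) can be reproduced with a few lines of Python.  The
    function name log2_complexity is ours; the rows are copied from
    Figure 1.

```python
import math


def log2_complexity(key2_values):
    """log2 of (number of key2 values) * 2^16, the cost estimate used in the text."""
    return math.log2(key2_values) + 16


# A few (bytes used for the reduction, remaining key2 values) rows from Figure 1.
rows = [(1, 4194304), (30, 283930), (100, 88248), (1000, 7785), (10000, 1857)]
for nbytes, key2_values in rows:
    print(f"{nbytes:6d} bytes: about 2^{log2_complexity(key2_values):.1f}")
```

    The byte counts here are those used for the reduction only; the
    attack itself needs about 12 further known bytes.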


Subsection 3.3:  key1

From the list of (key2_{n}, key2_{n-1},...,key2_{2}) we can calculate
the values of the most significant bytes of the key1's by

      MSB(key1_{i+1}) =
       (key2_{i+1} << 8) xor crcinvtab[MSB(key2_{i+1})] xor key2_{i}.

We receive the list (MSB(key1_{n}),MSB(key1_{n-1}),...,MSB(key1_{3}))
(excluding MSB(key1_{2})).
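
As an illustration, the completion of the two missing low bits of a
key2 value and the extraction of MSB(key1) can be written in Python
as below.  The helper names complete_key2 and msb_key1 are ours, and
30-bit values are represented as 32-bit words with bits 0 and 1
cleared.

```python
MASK32 = 0xFFFFFFFF

crctab, crcinvtab = [0] * 256, [0] * 256
for i in range(256):
    t = i
    for _ in range(8):
        t = (t >> 1) ^ 0xEDB88320 if t & 1 else t >> 1
    crctab[i] = t
    crcinvtab[t >> 24] = ((t << 8) ^ i) & MASK32


def crc32(pval, char):
    return (pval >> 8) ^ crctab[(pval ^ char) & 0xFF]


def complete_key2(key2_next_hi30, key2_prev_hi30):
    """Fill in bits 0..1 of key2_{i+1} from bits 8..9 of key2_{i}."""
    low2 = ((key2_prev_hi30 ^ crcinvtab[key2_next_hi30 >> 24]) >> 8) & 3
    return key2_next_hi30 | low2


def msb_key1(key2_next, key2_prev):
    """MSB(key1_{i+1}) from the full 32-bit key2_{i+1} and key2_{i}.
    For a consistent pair the result fits in one byte."""
    return ((key2_next << 8) ^ crcinvtab[key2_next >> 24] ^ key2_prev) & MASK32


key2_prev = 0x34567890
key2_next = crc32(key2_prev, 0xA7)      # 0xA7 plays the role of MSB(key1_{i+1})
print(hex(msb_key1(key2_next, key2_prev)))
```

Because msb_key1 needs all 32 bits of key2_{i}, and key2_{1} is known
only to 30 bits, MSB(key1_{2}) is not available.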

      Given MSB(key1_{n}) and MSB(key1_{n-1}), we can calculate about
2^{16} values for the full values of key1_{n} and
key1_{n-1}+LSB(key0_{n}).  This calculation can be done efficiently
using lookup tables of size 256--1024.  Note that

      key1_{n-1}+LSB(key0_{n}) =
        (key1_{n}-1) * 134775813^{-1} (mod 2^{32})

and that LSB(key0_{n}) is in the range 0,...,255. At this point we
have about 2^{11} * 2^{16} = 2^{27} possible lists of key2's and
key1_{n} (or 2^{22} * 2^{16} = 2^{38} if the number of key2 lists was
not reduced). Note that the remainder of the attack adds no further
complexity: all the additional operations take a fixed number of
instructions for each of the already existing lists.

      The values of key1_{n-1}+LSB(key0_{n}) are very close to the
values of key1_{n-1} (since we lack only the 8-bit value
LSB(key0_{n})). Thus, on average only 256 * 2^{-8} = 1 possible
value of key1_{n-1} leads to the most significant byte of
key1_{n-2} from the list. This value can be found efficiently using
the same lookup tables used for finding key1_{n}, with only a few
operations. We are then in a similar situation, in which
key1_{n-1} is known and we lack only eight bits of key1_{n-2}. We
find key1_{n-2} with the same algorithm, and then find
key1_{n-3}, key1_{n-4}, and so on in the same way. We end up
with about 2^{27} possible lists, each containing the values of
(key2_{n}, key2_{n-1},...,key2_{2}, and key1_{n},
key1_{n-1},...,key1_{4}) (again, key1_{3} cannot be fully recovered
since two successive values of MSB(key1) are required to find each
value of key1).
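
The backward step on key1 can be sketched in Python as below.  This
sketch simply loops over the 256 possible values of LSB(key0) instead
of using the lookup tables mentioned above, so it illustrates the
filtering rather than the optimized implementation.  The name
key1_prev_candidates is ours, and pow with a negative exponent
requires Python 3.8 or later.

```python
MASK32 = 0xFFFFFFFF
MULT = 134775813
MULT_INV = pow(MULT, -1, 1 << 32)       # 134775813^{-1} (mod 2^{32})


def key1_prev_candidates(key1_next, msb_prev, msb_prev2):
    """Given the full key1_{i+1}, MSB(key1_{i}) and MSB(key1_{i-1}),
    return the surviving (key1_{i}, LSB(key0_{i+1})) pairs."""
    total = ((key1_next - 1) * MULT_INV) & MASK32     # key1_{i} + LSB(key0_{i+1})
    found = []
    for lsb in range(256):
        key1_prev = (total - lsb) & MASK32
        if key1_prev >> 24 != msb_prev:
            continue
        # key1_{i-1} lies at most 255 below this sum, so its most
        # significant byte is one of the two values tested here.
        total_prev = ((key1_prev - 1) * MULT_INV) & MASK32
        if msb_prev2 in (total_prev >> 24, ((total_prev - 255) & MASK32) >> 24):
            found.append((key1_prev, lsb))
    return found


# Forward simulation of two steps, then one backward step.
key1_a = 0x23456789
key1_b = ((key1_a + 0x12) * MULT + 1) & MASK32
key1_c = ((key1_b + 0xF0) * MULT + 1) & MASK32
print((key1_b, 0xF0) in key1_prev_candidates(key1_c, key1_b >> 24, key1_a >> 24))
```

The second element of each pair is the byte LSB(key0_{i+1}), which is
what the next stage of the attack consumes.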


Subsection 3.4:  key0

Given a list of (key1_{n}, key1_{n-1},...,key1_{4}), we can easily
calculate the values of the least significant bytes of (key0_{n},
key0_{n-1},...,key0_{5}) by

      LSB(key0_{i+1}) =
        ((key1_{i+1}-1) * 134775813^{-1})-key1_{i}   (mod 2^{32}).

key0_{i+1} is calculated by

      key0_{i+1} <-- crc32(key0_{i},P_{i})
        = (key0_{i} >> 8) xor crctab[LSB(key0_{i}) xor P_{i}].

Crc32 is a linear function, and from any four consecutive LSB(key0)
values, together with the corresponding known plaintext bytes it is
possible to recover the full four key0's. Moreover, given one full
key0, it is possible to reconstruct all the other key0's by
calculating forward and backward, when the plaintext bytes are given.
Thus, we can now obtain key0_{n},...,key0_{1} (this time including
key0_{1}). We can now compare the values of the least significant
bytes of key0_{n-4},...,key0_{n-7} to the corresponding values from
the lists. Only a fraction of 2^{-32} of the lists satisfy the
equality. Since we have only about 2^{27} possible lists, it is
expected that only one list remains. This list must have the right
values of the key0's, key1's, and key2's, and in particular the right
values of key0_{n}, key1_{n} and key2_{n}. In total we need 12 known
plaintext bytes for this analysis (except for reducing the number of
key2 lists) since in the lists the values of LSB(key0_{i}) start with
i=5, and n-7=5 ==> n=12.
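
The recovery of a full key0 from four consecutive least significant
bytes can be written out byte by byte.  The following Python sketch
is our own unrolling of the crc32 recurrence; recover_key0 is a
hypothetical helper name, lsbs holds LSB(key0_{i}),...,LSB(key0_{i+3})
and pts holds P_{i}, P_{i+1}, P_{i+2}.

```python
crctab = [0] * 256
for i in range(256):
    t = i
    for _ in range(8):
        t = (t >> 1) ^ 0xEDB88320 if t & 1 else t >> 1
    crctab[i] = t


def crc32(pval, char):
    return (pval >> 8) ^ crctab[(pval ^ char) & 0xFF]


def recover_key0(lsbs, pts):
    """Return the full 32-bit key0_{i}."""
    # The table entries used in the three updates are known, because each
    # index is LSB(key0) xor the plaintext byte.
    t = [crctab[lsbs[j] ^ pts[j]] for j in range(3)]
    b0 = lsbs[0]
    b1 = lsbs[1] ^ (t[0] & 0xFF)
    b2 = lsbs[2] ^ (t[1] & 0xFF) ^ ((t[0] >> 8) & 0xFF)
    b3 = lsbs[3] ^ (t[2] & 0xFF) ^ ((t[1] >> 8) & 0xFF) ^ ((t[0] >> 16) & 0xFF)
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)


# Forward simulation: three updates of key0, then recovery from the LSBs.
key0, plain = 0x12345678, [0x50, 0x4B, 0x03]
chain = [key0]
for p in plain:
    chain.append(crc32(chain[-1], p))
print(recover_key0([k & 0xFF for k in chain], plain) == key0)
```

Each shift by 8 in the crc32 update moves one more byte of key0_{i}
into the least significant position, which is why four consecutive
least significant bytes suffice.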

If no reduction of the number of key2 lists is performed, 2^{38}
lists of (key0, key1, key2) remain at this point, rather than
2^{27}.  Thus, we need to compare five bytes
key0_{n-4},...,key0_{n-8} in order to remain with only one list.  In
this case, 13 known plaintext bytes are required for the whole
attack, and the complexity of analysis is 2^{38}.

Subsection 3.5:  The Internal Representation of the Key

Given key0_{n}, key1_{n} and key2_{n}, it is possible to construct
key0_{i}, key1_{i} and key2_{i} for any i<n using only the ciphertext
bytes, without using the known plaintext.  This construction works
even if the known plaintext starts in the middle of the encrypted
file, and it also provides the unknown plaintext and the 12
prepended bytes.  In particular it can find the internal
representation of the key, denoted by key0_{1}, key1_{1} and key2_{1}
(where the index denotes again the index in the encrypted text,
rather than in the known plaintext).  The calculation is as follows:

(Equation 2)

  key2_{i} = crc32^{-1}(key2_{i+1},MSB(key1_{i+1}))
  key1_{i} = ((key1_{i+1}-1) * 134775813^{-1}) -
                  LSB(key0_{i+1})  (mod 2^{32})
  temp_{i} = key2_{i} | 3
  key3_{i} = LSB((temp_{i} * (temp_{i} xor 1)) >> 8)
     P_{i} = C_{i} xor key3_{i}
  key0_{i} = crc32^{-1}(key0_{i+1},P_{i})

The resulting value of (key0_{1},key1_{1},key2_{1}) is the internal
representation of the key.  It is independent of the plaintext and
the prepended bytes, and depends only on the key.  With this internal
representation of the key we can decrypt any ciphertext encrypted
under the same key.  The two bytes of crc32 (one byte in version
2.04g) which are included in the 12 prepended bytes allow further
verification that the file is really encrypted under the found
internal representation of the key.
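
The backward walk can be sketched in Python as follows.  Here key0 is
recovered with the inverse crc32 operation, since key0_{i+1} was
produced from key0_{i} by crc32.  The names step_back, update_keys
and key3 are ours, and the state is held as a (key0, key1, key2)
tuple; pow with a negative exponent requires Python 3.8 or later.

```python
MASK32 = 0xFFFFFFFF
MULT = 134775813
MULT_INV = pow(MULT, -1, 1 << 32)

crctab, crcinvtab = [0] * 256, [0] * 256
for i in range(256):
    t = i
    for _ in range(8):
        t = (t >> 1) ^ 0xEDB88320 if t & 1 else t >> 1
    crctab[i] = t
    crcinvtab[t >> 24] = ((t << 8) ^ i) & MASK32


def crc32(pval, char):
    return (pval >> 8) ^ crctab[(pval ^ char) & 0xFF]


def crc32_inv(crc, char):
    return ((crc << 8) ^ crcinvtab[crc >> 24] ^ char) & MASK32


def key3(key2):
    temp = (key2 | 3) & 0xFFFF
    return ((temp * (temp ^ 1)) >> 8) & 0xFF


def update_keys(state, char):
    """Forward step, used here only to build a test case."""
    key0, key1, key2 = state
    key0 = crc32(key0, char)
    key1 = ((key1 + (key0 & 0xFF)) * MULT + 1) & MASK32
    key2 = crc32(key2, key1 >> 24)
    return key0, key1, key2


def step_back(state_next, c_prev):
    """From (key0, key1, key2) at index i+1 and the ciphertext byte C_{i},
    return the state at index i together with the plaintext byte P_{i}."""
    key0n, key1n, key2n = state_next
    key2 = crc32_inv(key2n, key1n >> 24)
    key1 = ((key1n - 1) * MULT_INV - (key0n & 0xFF)) & MASK32
    p = c_prev ^ key3(key2)
    key0 = crc32_inv(key0n, p)
    return (key0, key1, key2), p
```

Calling step_back once per ciphertext byte, from the last known
position down to the first byte of the encrypted data, yields the
internal representation of the key and the plaintext along the way.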


Subsection 3.6:  The Key (Password)

The internal representation of the key suffices to break the cipher.
However, we can go even further and find the key itself from this
internal representation with the complexities summarized in Table 2.
The algorithm tries all key lengths 0, 1, 2, ..., up to some maximal
length; for each key length it does as described in the following
paragraphs.


Table 2:  Complexity of finding the key itself

  -----------------------------------------------------------------
  Key length    1-6   7     8      9     10     11     12     13
  Complexity     1  2^{8} 2^{16} 2^{24} 2^{32} 2^{40} 2^{48} 2^{56}
  -----------------------------------------------------------------

For l <= 4 the algorithm knows key0_{1-l} and key0_{1}.  Only l <= 4
key bytes enter the crc32 calculations that update key0_{1-l} into
key0_{1}.  Crc32 is a linear function, and these l <= 4 key bytes can
be recovered, just as key0_{n},...,key0_{n-3} were recovered above.  Given
the l key bytes, we reconstruct the internal representation, and
verify that we get key1_{1} and key2_{1} as expected (key0_{1} must
be as expected, due to the construction).  If the verification
succeeds, we found the key (or an equivalent key).  Otherwise, we try
the next key length.
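
For the case l <= 4, one way to recover the key bytes from key0 alone
is sketched below in Python.  It is our own illustration, not the
authors' code: recover_short_password and key0_after are hypothetical
names, and the final verification against key1_{1} and key2_{1}
described above is left out.

```python
MASK32 = 0xFFFFFFFF
KEY0_INIT = 0x12345678

crctab, crcinvtab = [0] * 256, [0] * 256
for i in range(256):
    t = i
    for _ in range(8):
        t = (t >> 1) ^ 0xEDB88320 if t & 1 else t >> 1
    crctab[i] = t
    crcinvtab[t >> 24] = ((t << 8) ^ i) & MASK32


def key0_after(password):
    """key0 component of the internal representation of a password."""
    x = KEY0_INIT
    for b in password:
        x = (x >> 8) ^ crctab[(x ^ b) & 0xFF]
    return x


def recover_short_password(key0_target, length):
    """Return a password of the given length (at most 4) that maps the
    initial key0 to key0_target, or None if there is none."""
    # Backward pass: the unknown key bytes only disturb the low bytes, so the
    # most significant byte of each intermediate key0 is still correct.
    msbs = [0] * (length + 1)
    x = key0_target
    for j in range(length, 0, -1):
        msbs[j] = x >> 24
        x = ((x << 8) ^ crcinvtab[x >> 24]) & MASK32
    # Forward pass: each most significant byte identifies the table index used.
    x, password = KEY0_INIT, []
    for j in range(1, length + 1):
        idx = crcinvtab[msbs[j]] & 0xFF
        password.append(idx ^ (x & 0xFF))
        x = (x >> 8) ^ crctab[idx]
    return bytes(password) if x == key0_target else None


print(recover_short_password(key0_after(b"abcd"), 4))
```

A caller would try the lengths 0, 1, ..., 4 in turn and keep the
first result whose full internal representation matches.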

For 5 <= l <= 6 we can calculate key1_{0}, key2_{0} and key2_{-1}, as
in Equation 2.  Then, key2_{2-l},...,key2_{-2} can be recovered since
they are also calculated with crc32, and depend on l-2 <= 4 unknown
bytes (of key1's).  These unknown bytes MSB(key1_{2-l}),...,
MSB(key1_{-1}) are also recovered at the same time.  key1_{1-l} is
known.  Thus, we can receive an average of one possible value for
key1_{2-l} and for key1_{3-l}, together with LSB(key0_{2-l}) and
LSB(key0_{3-l}), using the same lookup tables used in the attack.
From LSB(key0_{2-l}) and LSB(key0_{3-l}) and key0_{1-l}, we can
complete key0_{2-l} and key0_{3-l} and get key_{1} and key_{2}.  The
remaining l-2 key bytes are found by solving the l-2 <= 4 missing
bytes of the crc32 as is done for the case of l <= 4.  Finally, we
verify that the received key has the expected internal
representation.  If so, we have found the key (or an equivalent key).
Otherwise, we try the next key length.

For l>6, we try all the possible values of key_{1},...,key_{l-6},
calculating key0_{-5}, key1_{-5} and key2_{-5}.  Then we use the l=6
algorithm to find the remaining six key bytes.  In total we try about
2^{8 * (l-6)} keys.  Only a fraction of 2^{-64} of them pass the
verification (2^{-32} due to each of key1 and key2).  Thus, we expect
to remain with only the right key (or an equivalent) in trials of up
to 13-byte keys.  Note that keys longer than 12 bytes will almost
always have equivalents with up to 12 (or sometimes 13) bytes, since
the internal representation is only 12 bytes long.


SECTION 4:  Summary

In this paper we describe a new attack on the PKZIP stream cipher
which finds the internal representation of the key, which suffices to
decrypt the whole file and any other file which is encrypted by the
same key.  This known plaintext attack breaks the cipher using 40
(compressed) known plaintext bytes, or about the first 200
uncompressed bytes (if the file is compressed), with complexity
2^{34}.  Using about 10000 known plaintext bytes, the complexity is
reduced to about 2^{27}.  Table 3 describes the complexity of the
attack for various sizes of known plaintext.  The original key
(password) can be constructed from the internal representation.  An
implementation of this attack in software was applied against the
PKZIP cipher contest.  It found the key "f7 30 69 89 77 b1 20" (in
hexadecimal) within a few hours on a personal computer.


Table 3:  Complexity of the attack by the size of the known plaintext

            Bytes      Complexity
          -------      ----------
               13        2^{38}
               40        2^{34}
              110        2^{32}
              310        2^{31}
              510        2^{30}
             1000        2^{29}
             4000        2^{28}
            10000        2^{27}


A variant of the attack requires only 13 known plaintext bytes, at
the price of a higher complexity of 2^{38}.  Since the last two bytes (one
in version 2.04g) of the 12 prepended bytes are always known, if the
known plaintext portion of the file is in its beginning, the attack
requires only 11 (12) known plaintext bytes of the compressed file.
(In version 1.10 several additional prepended bytes might be
predictable, thus the attack might actually require even fewer known
plaintext bytes.)

We conclude that the PKZIP cipher is weak and that it should not be
used to protect valuable information.


[1] PKWARE, Inc., General Format of a ZIP File, technical note,
      included in PKZIP 1.10 distribution (pkz110.exe: file
      appnote.txt).