Pattern Analysis for Computer Security
Peter V. Radatti
radatti@cyber.com
CyberSoft
June 01 2005
This paper is somewhat philosophical in nature but sometimes that is exactly what is needed to convey meaning. It is my belief that after you have read this short paper you will have a different view of pattern analysis. Pattern analysis is not the “end all” of computer security but it is a very big hammer. Sometimes (often) when you have a big hammer other methods just don’t matter. What works, works!Most of computer security is about being able to tell the good guys from the bad. This is called pattern analysis. To a certain extent all of us do pattern analysis. When you look at an orange and recognize it as such your brain just did pattern analysis. You brain interpreted what your eyes saw, compared it to known objects, rejected things that were similar but not “it” and arrived at an answer. In fact, you may have been wrong. It may not have been an orange but a tangerine. The ability to tell the difference between oranges and tangerines requires additional parameters that you might not have.
If you think clearly about pattern analysis you will find that it is an overwhelming problem that is usually done wrong. Two examples of where pattern analysis was done wrong that changed the course of human history was Pearl Harbor in Hawaii and the World Trade Center in New York City. In both cases, the information necessary to predict and prevent these events existed prior to the event but was lost in the noise. That is because in real life, pattern analysis is a very difficult problem. The volume of data itself makes it difficult and frequently you do not know what patterns to look for until it is too late. Fortunately, this is about digital pattern analysis where the problems of volume and hindsight are still large but within the realm of the possible.
There are two aspects to volume making problems complex. The first is that a whole lot of stuff needs to be processed. The second is that given large volumes of data, patterns you are looking for will naturally occur as a misidentification. This is actually related to the well known postulate that an infinite number of monkeys typing on an infinite number of typewriters will eventually reproduce the entire works of Shakespeare. It is, however, different because when you are dealing with an infinite number, everything is possible. In computer security everything is finite even if it seems infinite from a practical point of view. An example of this is that many digital photos are 640 by 480 pixels which yield a total of 307,200 pixels. Using 256 colors, you get the number 256^307200. This number represents every single photo that has ever been taken or will ever be taken at that size. It contains every picture of every face, building, animal, aliens on distant planets, in fact anything that can be contained in a photo of that size. This number is finite but still so large that it becomes meaningless.
When a pattern exists but is not found, it is called a miss. Pearl Harbor and the Word Trade Center were misses. When a pattern is correctly found it is called a hit. An example is when a virus scanner finds a virus. When a pattern is found but it is not correct it is called a false hit. There are several types of false hits, including a special type called a “ghost hit”. An example of a false hit is when a virus scanner detects a virus that is not there. An example of a ghost is when a virus scanner detects a virus that was disinfected. The virus is actually there but moot. There are excellent reasons to want to know when ghosts are present but it is most useful to know they are ghosts.
In 1988 I wrote a program called “ring toss”. I named it that based on the backyard game called quates where you toss a ring onto a stake in the ground. It is basically the game horseshoes but with rubber rings. The object of the program is to find all of the letters in a complex sentence in positional order within a single file. I had an entire server to run the program against. What I found was that many files triggered the ring toss program as a hit. When I examined the files I found that what I was detecting was data below the noise level. In reality, there was no hit. My pattern was complex and long and the amount of data to search was finite. There was no significant data in the detected files but the way I was looking for my pattern gave me the false hits. This brings us to the point that the way you look for a pattern matters.
The problem of volume exists in computer security. The volume of data is much smaller than in the real world but still very large. The issue is that we are looking for patterns that are small. Many computer viruses are only about 60 bytes long! Lexical patterns are rarely more than 300 bytes long. The ratio of the pattern size to the volume of data to be searched is critical to the difference between a hit and a false hit. To avoid false hits, the pattern size and complexity needs to increase when the volume of data to be searched increases. When you are looking for a virus, you are limited by the maximum size of the virus so additional data such as target file identification and pattern location become meaningful.
I call pattern complexity “data richness”. For example, the hexadecimal pattern zero, zero repeated a few hundred times is not data rich. There is no statical significance to the pattern. In fact, this exact pattern will false hit on almost all executables since many executables set aside variable space using the byte value zero zero.
What I have just said is that pattern analysis is a game of statistics. The pattern to be searched needs to have a significant length and needs to have significant data richness IN RELATION to the target files that need to be searched. In addition, the pattern must be searched for in a way that is appropriate for the problem at hand. Do any of this wrong and you get a false hit.
Another view is that pattern analysis is puzzle solving. There are many different types of parameters that can add data richness beyond the actual pattern. For example, if you know that the specific pattern you are interested in can only validly exist in a file of a specific type, you can increase the data richness of the pattern master by restricting its use to files where the pattern can exist. This also helps to reduce false hits because if a matching pattern does exist in a file of the wrong type it will not be flagged. Positional information can also be of value when enhancing data richness. For example, if a pattern only exists at a specific location in a file, restricting the search for that pattern to where it can exist positionally in the file enhances data richness.
The next part of understanding pattern analysis is that it really exists as two separate problems. The first is to create the master search pattern; “The thing you are looking for”. This is not always obvious or easy. The second is to find the master pattern in target files. This is actually a different set of problems which require different skills. An example is that the analyst who detects a virus for the first time and writes the master pattern can be a different person from the person who wrote the program that executes the master pattern. They are almost always different from the people who run the program in the field.
The good news for the mathematically challenged is that you do not need to know statistics to do this well. In fact, certain people with mental problems do this well. In the movie, “A Beautiful Mind” about Dr. John Forbes Nash, Jr. you can see scenes where Dr. Nash sees patterns that don’t exist. The problem as told in the movie was that sometimes with schizophrenia it can be hard to tell the difference between a hit and a false hit. (The movie was not a true representation of the problem Dr. Nash and others suffer but is a good example to make this point.) Another type of mental condition that seems to make people good at pattern analysis is Attention Deficit Hyperactivity Disorder, ADHD. Some people with ADHD trained in pattern analysis can see the patterns they are looking for, whereas people without ADHD can not. In one case, I found an ADHD person with a high IQ who rarely has false hits. They are a natural talent at creating pattern masters.
So far, I have only discussed simple pattern masters where there is a near one to one correspondence between the pattern master and the search pattern. There are additional types of pattern masters which can match valid patterns where the exact pattern is unknown in advance. A simple example of that would be if you were searching for the pattern, “hello” and you knew that some people might spell the word “hello” with the number zero in the letter-O place or the numbers one in the letter-L place. Of course they may use any or every combination of the above. A pattern master for a pattern of that type could be described verbally as:
Search for the letter “h” while ignoring case.
Search for the letter “e”
Search for the letter “l” or the number “1”
Search for the letter “l” or the number “1”
Search for the letter “o” or the number “0”
As you can see, this is a very simple little pattern master yet it can detect 16 actual patterns, all of which are valid. Add the ability to ignore white space between letters or search for a prefix and postfix of white space and the data richness is still increased while the number of actual patterns that can be detected increases substantially. I call techniques that deal with this problem as “pattern looseness”. Again, while pattern looseness opens the number of patterns you can find with a single master, data richness dictates that you can not include false hits!
An area where pattern looseness is extremely useful is when you are trying to detect a member of a family of data. Most commonly, this is to detect a new member of a family of computer viruses such as the Begal virus. It is, of course, extremely valuable to be able to have a pattern master that detects all or most of the members of a family. Imagine when a new strain of the Begal virus is released and even before it becomes known, it is already detected by a family pattern master that included enough looseness to detect it! Pattern looseness also has applications in the biological sciences and in lexical analysis such as preserving secrets.
Another area of study in pattern analysis is data enhancement through transitional formatting. I believe that I implemented the first transitional pattern analysis engine in VFind in 1999 when we noted that pattern masters of Java Byte Code was not reliable because of the way the Java system itself worked. The solution was to create a transitional engine that prepared the data for pattern analysis. In the case of the Java Byte Code, we created a fast, light, disassembly system that processed the binary byte code into reliable and easily searched disassembly code. This is backward from a straight pattern search because instead of modifying the pattern master you modify the data being searched. Transitional formatting of data to enhance pattern analysis is a very powerful and effective valid tool that can reveal data when no other tool can. Always remember to look at all sides of the problem!
Another aspect of transitional pattern analysis is the meta-data creation that is the transitional phase of the problem. It is necessary to arrive at a transitional function that will process the raw data into a meta-data that by its nature lends itself to pattern analysis. This is not as easy as it seems and sometimes the resulting meta-data can not be searched by a simple pattern master but by an enriched pattern master that includes information about the resulting meta-data. An example of this is when a CPU emulator is invoked to emulate a binary. For the purposes of pattern analysis, this is most often used to detect attack programs such as viruses. The results of emulation may be several useful patterns. The first may be a hexadecimal string of values that have been decrypted or decoded by the emulated code. The second may be a string of events that triggered flags in the emulator. For example, if a program generates a list of target files and then modifies those target files, the setting of the emulator flags are useful information in themselves. In fact, you might generate separate pattern masters for a problem of this type. The first is the decoded string of values. That can be a straight forward pattern master that is used solely against values provided to it by the emulator. (Notice that since the pattern master is only applied against results from the emulator, that data richness is enhanced.) The second is a pattern master that detects dangerous results of an emulated program. In both cases, the results of the emulation must be interpreted. That interpretation is also “pattern analysis”.
Another method of pattern analysis is straight statistical analysis such as the Bayes method. In Bayes analysis, a database is created. The target is statically compared to the database to see how closely it compares to the database. The higher the statistical probability that the target under examination matches the contents of the Bayes database the better the chances are that the target belongs to the class of targets in the database. The drawback to this is that everything depends upon a properly created database. If the database contains flaws, then the analysis will be flawed. This is why an unsolicited bulk email (UBE) filter will have problems detecting some email as UBE while detecting desired email as such. One of the ways that CyberSoft solved the problem of the comparative database in the Bayes method was to not directly reply upon human population of the database. This solution was implemented in our Safe Internet Email program and found to be very successful. (Additional reading can be found on both www.cybersoft.com and www.safeinternetemail.com ) We create pattern masters with both enough data richness and data looseness to detect a good number of UBE messages. The result is still not good enough to significantly reduce UBE messages from a reader’s perspective but it is good enough to populate the Bayes database. The pattern masters and the Bayes method can then be applied against our target file synergistically. Notice that this is another form of transitional data. We created a pattern master which detected patterns which were then used as a method of detection of other patterns. The universe of pattern analysis is only limited by your imagination.
Just as a matter of clarification, I have only discussed our implementation of Bayes pattern analysis to determine if something is UBE or not. While this is the most common use of Bayes, it is not limited to that. It can be used to determine if a target conforms to any type of database. This is why VFind allows you to create multiple Bayes databases and CVDL programs. You could create a database to determine if a target has a high probability of being a terrorist communication, criminal communication or is in a specific foreign language. For example, if you create a database that contains classical literature written in Latin then you would have a way of determining if a target is a classical literature written in Latin. If you dumped a large amount of unspecialized communications written in French, then you would have a way of detecting French. Put in a large collection of love letters and you can detect love letters. The secret is that you can statically determine if something conforms to the Bayes database. What you put in the database is unlimited. I believe that you could even put in the hexadecimal strings that compose virus bodies and statically determine if a target was a virus. There is no practical limit to this technology.
The reader should note that so far I have discussed the use of pattern analysis for the purpose of hostile software detection or UBE. (Antivirus professionals are very fussy about the use of the term “computer virus” but the general public uses the term to mean any hostile software program) Pattern analysis is not limited to these areas. You can apply pattern analysis to anything. One of the more interesting uses for pattern analysis is searching stock market results real time looking for trends that can result in a gain for the operator. Pattern analysis can also be used to determine if something is a forgery. In fact, if very large databases of pattern analysis programs were developed, they could be used to analyze and determine even very complex problems such as medical diagnosis or structural analysis. In this view, pattern analysis is a way of capturing expert knowledge in a very specific and highly useable way.
One of the esoteric aspects of pattern analysis is that the tools you use define the pattern masters you can use. Many end users create their own tools to solve single use pattern analysis problems. More sophisticated users may use a language like Perl or Regex but there are significant limitations to these languages. The most important limitation being that there are practical upper limits to what these languages can perform, especially in the number of simultaneous pattern masters that can be solved for. If you only have a small number of pattern masters or a small amount of data to be targeted, then any size hammer will work. If on the other hand, your pattern master is complex, is dependent upon other pattern masters, requires a large number of simultaneous searches, uses Boolean logic or you just have a ton of data to search, then only a really big hammer will solve that problem. I like to think that the pattern analysis system that I designed and have been developing since 1990 is the largest general purpose pattern analysis hammer in the world. It started out life as a Unix virus scanner but very quickly became a pattern analysis system for use with lexical analysis. It has continued to be enhanced over the years and is now capable to do pattern analysis in all fields. That tool is called VFind and is part of the VFind Security ToolKit family of products. You can find out more about VFind at www.cybersoft.com.
There is one more aspect of pattern analysis that is often overlooked by users. That aspect is as important as the creation of the pattern master and often goes hand in hand with it. That aspect is what to do with the knowledge created by the pattern analysis. In the event that the knowledge is that the pattern fits a known virus, delete the virus. If it is UBE, do not forward it. If the knowledge is more complex, knowing something does or does not match a pattern is of no value without a plan for action. In the example given earlier, what would have happen if the airport screeners had been able to recognize the terrorists who boarded the planes that crashed into the Word Trade Center but did not have a plan or authority to deal with them? My guess is that they would have allowed them to board and the event would still have happened. If, however, they had a plan to detain the terrorists, then the event would not have happened. Sometimes even a simple plan is effective and almost always, any plan is more effective than no plan.






