![w32.spybot worm virus](https://static.wixstatic.com/media/4929aa_b5392067352642349dc5f9b174b27b46~mv2_d_2048_1532_s_2.jpg)
Instead of looking for n-grams, I looked only for meaningful strings, such as filenames, IP addresses, email addresses, CLSIDs and URLs, using regular expressions. These strings, or 'components', are special in that a program communicates with the system through them. I compiled the components from known malware and known clean files into a malware knowledgebase, and labelled each entry with either the malware family name or 'clean'.
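The extraction and labelling step described above could look roughly like the following. This is a minimal sketch, not the author's actual implementation: the regular expressions, the list of file extensions, the `extract_components` and `add_to_knowledgebase` helpers, and the 'FruitBasket' family label are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Illustrative patterns only; the original expressions are not given in the text.
PATTERNS = {
    "url": re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "clsid": re.compile(
        r"\{[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}"
    ),
    # Assumed extension list; a real scanner would likely cover more.
    "filename": re.compile(r"\b[\w-]+\.(?:exe|dll|sys|bat|scr)\b", re.IGNORECASE),
}


def extract_components(text):
    """Return the set of (kind, value) components found in a text dump."""
    found = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.add((kind, match.group(0).lower()))
    return found


def add_to_knowledgebase(kb, text, label):
    """Count each component of a sample under its label (family name or 'clean')."""
    for component in extract_components(text):
        kb[component][label] += 1


# knowledgebase[component][label] -> number of samples containing the component
knowledgebase = defaultdict(lambda: defaultdict(int))

malware_strings = (
    "GET http://example.com/update.php from 192.0.2.10; "
    "drops winupd32.exe; mail admin@example.org; "
    "{12345678-1234-1234-1234-1234567890AB}"
)
clean_strings = "loads helper.dll and opens http://example.com/update.php"

add_to_knowledgebase(knowledgebase, malware_strings, "FruitBasket")  # hypothetical family
add_to_knowledgebase(knowledgebase, clean_strings, "clean")

for component, labels in sorted(knowledgebase.items()):
    print(component, dict(labels))
```

A component that appears under both a family label and 'clean', like the URL in this toy data, says little on its own, which is why the counts per label are kept rather than a single tag.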
![w32.spybot worm virus](https://rrav.altervista.org/wp-content/uploads/2015/07/wechia-remover.png)
Semantically, a fruit basket is any basket containing any fruit. A more human approach to pattern matching would use multiple short string signatures, or 'mini-signatures'. A mini-signature is shorter than a usual signature, and is too short and too generic to be used by itself. However, when used together with other mini-signatures, the risk of false positives is reduced. My mini-signatures for 'fruit basket' would be 'apple', 'banana', 'orange' and 'basket'. A comma, a space, and the words 'in' and 'a' are omitted from the strings since they are not unique to 'fruit basket'. We can say that matching three out of four is good enough for detection, but it is not sufficiently flexible. So, a weight is added to each mini-signature. This scanning method relies mainly on statistics.

Every anti-virus company has its own collection of malware. These are very valuable resources, and many companies trade and share them, mainly to expand their signature databases to cover all known malware. I believe these resources have not been fully utilized, and more valuable data could be extracted from these collections. Theoretically, by training neural networks with a sample set, a scanner should be able to distinguish malware from clean files. Work has already been done in this area by constructing neural networks of small sequences of bytes, called n-grams. I took a different approach from n-grams and neural networks, mainly to reduce noise.
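To make the weighted mini-signature idea concrete, here is a minimal sketch using the 'fruit basket' example. The weights and the detection threshold are hypothetical values chosen for illustration; the text does not state the actual figures, and the `score` and `is_detected` helpers are likewise assumptions.

```python
# Hypothetical integer weights; the source does not give the real values.
MINI_SIGNATURES = {
    b"apple": 20,
    b"banana": 20,
    b"orange": 20,
    b"basket": 40,
}
THRESHOLD = 80  # assumed detection threshold


def score(data, signatures=MINI_SIGNATURES):
    """Sum the weights of every mini-signature that occurs in the data."""
    return sum(weight for sig, weight in signatures.items() if sig in data)


def is_detected(data, threshold=THRESHOLD):
    """Flag the sample when the combined weight reaches the threshold."""
    return score(data) >= threshold


samples = [
    b"apple, banana, orange in a yellow basket",
    b"banana, orange, apple in a basket",
    b"banana, orange in a basket",
    b"apple, banana, orange on a plate",
]

for data in samples:
    print(data.decode(), "->", score(data), is_detected(data))
```

With these assumed weights, the last sample matches three of the four mini-signatures but is still not flagged because 'basket' is missing, while the third sample also matches three and is flagged. A plain "three out of four" rule cannot express that distinction, which is the flexibility the weights are meant to add.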
![w32.spybot worm virus](https://bughira.files.wordpress.com/2009/02/fixsystem.png)
One of the most common non-heuristic anti-virus scanning methods is string matching. If we want to detect a piece of malware called 'fruit basket', containing the string 'apple, banana, orange in a basket', we might add a string signature like 'ana, orange in a baske'. This means the scanner will detect 'apple, banana, orange in a basket', but it will miss variants like 'apple, banana, orange in a yellow basket', or 'banana, orange, apple in a basket'.
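The limitation of a single string signature can be shown in a few lines. This is only a sketch of exact substring matching using the example strings from the text; the `matches_signature` helper is a hypothetical name, and real scanners use far more efficient multi-pattern search.

```python
SIGNATURE = b"ana, orange in a baske"


def matches_signature(data, signature=SIGNATURE):
    """Return True if the exact byte signature occurs anywhere in the data."""
    return signature in data


samples = {
    "original": b"apple, banana, orange in a basket",
    "yellow variant": b"apple, banana, orange in a yellow basket",
    "reordered variant": b"banana, orange, apple in a basket",
}

for name, data in samples.items():
    print(f"{name}: {matches_signature(data)}")
```

Only the original string matches; inserting a single word or reordering the fruit is enough to break the signature.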