Review cheat sheet for DATA ANALYTICS FOR CYBER test 2

ML in practice: Malware detection

JSTAP: a project for detecting malicious JavaScript
Premise/problem: malicious JS can mine bitcoin or abuse browser vulnerabilities
Abstract Syntax Tree (AST) - derived from the grammar of the programming language
JSTAP principle: perform static analysis with abstract syntax trees and random forests
Static Analyses
Static analysis - we don't run the code at all; we inspect (reverse engineer) the code itself
Dynamic analysis - run the code in a virtual machine or debugger. Malware writers deliberately obfuscate code to defeat static tools. Example: GozNym runs a trivial infinite loop in a thread, then suspends the thread and overwrites the code with a jump to previously dead code
Dynamic analysis pitfalls - 1. Malware can easily detect that it is running in a debugger, in a VM, or under anti-virus: query the registry, call IsDebuggerPresent, use VM-specific instructions. 2. Malware can delay for a long time in the hope that the simulator gives up and goes away
Control Flow Graph (CFG) - shows program flow (calls, selection, loops)
Program Dependence Graph (PDG) - includes data and control dependencies
Token - a lexical unit; tokenizing splits a program into lexical units (like words in an English sentence)
N-gram - a sequence of n consecutive units; a simple way to analyze token sequences
JSTAP n-grams
- Depth-first pre-order traversal of the AST
- For the CFG, also traverse the AST, but only nodes linked by a control flow edge. Traverse the sub-AST of each node with control flow once
- Similar for the PDG, considering data flow
- Independent n-grams for tokens, AST, CFG, PDG-Data Flow and PDG-Control Flow
- n = 4 is the best value
- Use a chi-squared test to check for correlation (whether the n-gram is associated with benign or malicious samples); keep n-grams with chi-squared (χ²) >= 6.63 (99% confidence)
- If an n-gram appears in both (benign and malicious), throw it away
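The n-gram extraction and chi-squared filter above can be sketched as follows. This is a minimal illustration, not JSTAP's actual code: the function names and the example node sequence are hypothetical, and the statistic is the standard chi-squared test on a 2x2 table (n-gram present/absent vs. malicious/benign), which is assumed to be the test the notes refer to.

```python
from typing import List, Sequence, Tuple

CHI2_THRESHOLD = 6.63  # threshold from the notes (99% confidence)


def ngrams(units: Sequence[str], n: int = 4) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams over a sequence of units (tokens or AST node types)."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


def chi_squared(mal_with: int, mal_total: int, ben_with: int, ben_total: int) -> float:
    """Chi-squared statistic for a 2x2 table: n-gram present/absent vs. malicious/benign."""
    a, b = mal_with, mal_total - mal_with  # malicious files with / without the n-gram
    c, d = ben_with, ben_total - ben_with  # benign files with / without the n-gram
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of correlation
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical pre-order traversal of a tiny AST (node types only).
example_units = ["Program", "ExpressionStatement", "CallExpression", "Identifier", "Literal"]
example_ngrams = ngrams(example_units)  # two 4-grams

# An n-gram seen in 90 of 100 malicious files but only 10 of 100 benign files is kept.
keep = chi_squared(90, 100, 10, 100) >= CHI2_THRESHOLD
```

An n-gram that is equally common in both classes scores 0 and is dropped by the threshold.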
JSTAP dataset - 131,448 malicious, 141,768 benign
JSTAP classifier training • Randomly select 10,000 malicious and benign samples for training, plus an additional 5,000 of each for validation • Repeat 5 times and average the detection results
JSTAP results • Two-step process • First phase: unanimous voting, classifies 93% of the data with 99.73% accuracy • Second phase: unanimous voting, classifies 6.5% of the data with accuracy still over 99%
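A rough sketch of two-phase unanimous voting, assuming each phase is simply a group of classifiers and a sample falls through to the next phase when the group disagrees. The notes do not say which classifiers each phase uses, so `phase_one`, `phase_two` and the `Classifier` type are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical: a classifier maps a JS sample to "malicious" or "benign".
Classifier = Callable[[str], str]


def unanimous_vote(sample: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return the shared label if every classifier agrees, otherwise None."""
    labels = {clf(sample) for clf in classifiers}
    return labels.pop() if len(labels) == 1 else None


def two_phase_classify(
    sample: str,
    phase_one: Sequence[Classifier],
    phase_two: Sequence[Classifier],
) -> str:
    """Try each phase in order; a sample with no unanimous vote stays unclassified."""
    for phase in (phase_one, phase_two):
        label = unanimous_vote(sample, phase)
        if label is not None:
            return label
    return "unclassified"
```

Requiring unanimity trades coverage for accuracy: samples the classifiers disagree on are deferred rather than guessed.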
Evasion techniques - Add more benign features • Copy malicious code into a larger benign file
Extremely abstract OS - learn about the sample without implementing the underlying OS. An over-approximation has more behaviors than system S; an under-approximation has fewer. Less precise than virtualization or emulation
Abstract execution - a technique for efficiently tracing programs. In dynamic analysis, it uses an emulator, an extremely abstract OS, and paths. Less precise than virtualization or emulation

ML in practice: Phishing detection

Phishing websites - fake websites that induce users to enter personal information; often used to collect credentials.
Techniques for finding Phish:
- Industrial toolbar-based: e.g. SpoofGuard, TrustWatch, Netcraft (found to be ineffective)
- User-interface-based: e.g. provide a custom image per user, or a password manager (only provides the password to certain domains)
- Web page content-based: use web page info (URL, links, terms, images, forms) to detect phishing
-- CANTINA: compute term frequency-inverse document frequency (TF-IDF) for terms, then Google a few terms to see if the current website is a top result
-- B-APT: Bayesian, based on tokens from the DOM
Some definitions: Surface-level content - URL, hyperlinks; Textual content - terms or words; Visual content - color, font size, style, location of images
Textual and visual classification: text classifiers examine the text within a page to detect whether certain words are more likely in a fraudulent page. Image classifiers transform the webpage into an image and then compare its similarity to genuine webpages.
Steps of Bayesian analysis: 1. Obtain the webpage and normalize it 2. Compute the signature 3. Calculate the EMD (earth mover's distance) and the similarity between the website and the protected web page 4. Classify via threshold
Overall framework: 1. Train the text and image classifiers; collect similarity measurements for the different classifiers 2. Partition the similarity range into sub-intervals 3. Estimate probabilities for the text classifier 4. Estimate probabilities for the image classifier 5. Classify each test image 6. If the two classifiers give different results, calculate a decision factor 7. Return the final classification
High quality dataset:
Accessibility: publicly available; Completeness: encompasses the full breadth of phishing; Consistency: the range and variance of the dataset are stable, so the data won't change substantively; Integrity: data and labels are correct and non-corrupted; Validity: data is properly representative; Interpretability: data is understandable; Timeliness: data is up to date, or still valid today and in the future
Bagging classifier - an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction
Boosting classifier - random forests build each tree independently, while gradient boosting builds one tree at a time. This additive model (ensemble) works in a forward stage-wise manner, introducing a weak learner to improve on the shortcomings of the existing weak learners.

Social network security - Spam

Spam - irrelevant messages sent to many recipients. Spamming is the use of messaging systems to send multiple unsolicited messages (spam) to large numbers of recipients
Criminal accounts tend to be socially connected: maybe they are less discriminating in who they follow, or maybe it is intentional
Criminal hubs are more inclined to follow criminal accounts
K-anonymity - the publisher decides which attributes are public/private. Public attributes are “quasi-identifiers” • Every quasi-identifier tuple appears in at least k records in the anonymized DB
Determine if a database is k-anonymous for a particular value of k - check that every quasi-identifier tuple appears in at least k records in the DB. For k = 2, every public tuple appears at least twice, so we can't uniquely identify someone. A database of click traces is 2-anonymous if no click trace is unique
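The k-anonymity check described above amounts to counting quasi-identifier tuples. A minimal sketch, assuming records are dictionaries; the function name, column names and sample values are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def is_k_anonymous(records: List[Dict[str, str]], quasi_identifiers: Sequence[str], k: int) -> bool:
    """True if every quasi-identifier tuple occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


# Hypothetical database: zip and age are public (quasi-identifiers), disease is private.
db = [
    {"zip": "47906", "age": "20-29", "disease": "flu"},
    {"zip": "47906", "age": "20-29", "disease": "cold"},
    {"zip": "47907", "age": "30-39", "disease": "flu"},
]
```

Here `is_k_anonymous(db, ["zip", "age"], 2)` is false because the third record's (zip, age) tuple is unique, so that person can be singled out.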
How an attacker might deanonymize a database with auxiliary information (background info related to a record):
- Amplification of background knowledge - uses Aux(r), which is close to r on a subset of attributes, to find r' close to r on all attributes - Extended to a subset
1. Compute score(aux, r') for each r' in the sample 2. Apply matching criteria 3. Output a record or a probability distribution over records
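The three steps above can be sketched as below. The scoring function (fraction of auxiliary attributes that match) and the matching criterion (best score must beat the runner-up by `min_gap`) are assumptions chosen for illustration; the notes do not specify either, and all names are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]


def score(aux: Record, record: Record) -> float:
    """Assumed scoring: fraction of the auxiliary attributes that the record matches."""
    if not aux:
        return 0.0
    matches = sum(1 for key, value in aux.items() if record.get(key) == value)
    return matches / len(aux)


def best_match(aux: Record, records: List[Record], min_gap: float = 0.2) -> Optional[Record]:
    """Return the top-scoring record only if it clearly stands out from the runner-up."""
    scored = sorted(((score(aux, r), i) for i, r in enumerate(records)), reverse=True)
    if not scored or scored[0][0] == 0.0:
        return None  # nothing matches the background knowledge
    if len(scored) > 1 and scored[0][0] - scored[1][0] < min_gap:
        return None  # ambiguous: several records fit the background knowledge
    return records[scored[0][1]]
```

Returning nothing when the match is ambiguous reflects the idea that the attacker only claims a re-identification when one record stands out.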
Bystander - Someone who is “present but not taking part” in the photo, Someone who is “not a subject of the photo and is thus not important for the meaning of the photo”
How bystander detection could improve privacy: it can stop bystanders from being recorded without their knowledge, or let them know. Self-centered photos can put bystanders in awkward situations, show poor posture, or reveal information they don't want on record.
Unicity - proportion of unique pieces of information. U = 0 means the database is k-anonymous with k >= 2. U = 0.25 means 1/4 of the click traces are unique.
How to get < 10% unicity • Remove all info pertaining to clients and website visits • Coarsen time to at least hours
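Unicity as defined above can be computed by counting how many traces occur exactly once. A small sketch, assuming each click trace is represented as a tuple of visited sites; the function name and data are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def unicity(traces: Sequence[Hashable]) -> float:
    """Fraction of traces that occur exactly once in the dataset."""
    if not traces:
        return 0.0
    counts = Counter(traces)
    unique = sum(1 for t in traces if counts[t] == 1)
    return unique / len(traces)


# Hypothetical click traces: three users share a trace, one user is unique.
click_traces = [
    ("news", "mail"),
    ("news", "mail"),
    ("news", "mail"),
    ("bank", "forum"),
]
u = unicity(click_traces)  # 1 of 4 traces is unique
```

Coarsening the data (for example rounding timestamps to hours) makes more traces identical, which is why it lowers unicity.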

Strategic manipulation, propaganda, and fake news

Fake news - news that is intentionally false, published by a news outlet.
Challenges in defining “fake news” - apart from the validity of the information: is it satire, actual misinformation, intended for deception, clickbait, rumor, etc.
Automatic fact-checking - compare with a knowledge/expert base (references); use a base of SPO triples: subject, predicate, object
Fact extraction challenges: redundancy (DonaldJohnTrump vs Donald-Trump), timeliness (Britain, joinIn, EuropeanUnion), conflict, unreliability (TheOnion), incompleteness (may need to infer if something is missing)
Why temporal analysis may help with fake news detection: time can change the validity of information
Why source analysis may help with fake news detection: is the news satire, or is the source credible?
How textual and visual analysis may help with fake news detection:
- Textual analysis can detect fake news using quantity, complexity, uncertainty, subjectivity, sentiment, informality, specificity and readability
- Visual content can be analyzed using clarity, coherence, similarity distribution, diversity and clustering score
- SVMs and CNNs are used for text analysis
Mixed code - use of different languages, symbols, scripts, shapes to avoid detection • Text on Document: defined from standard alphabetic characters • Text in Visual Media: text in pictures • Text as Art Form: uses symbols that are not part of the alphabet to depict a simple code
Term frequency-inverse document frequency (TF-IDF) - reflects how important a word is to a document in a collection
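A minimal TF-IDF sketch, assuming the common variant tf = count / document length and idf = log(N / document frequency); other weightings exist, and the notes do not say which one is meant. The function name and the toy documents are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Toy collection: "login" appears everywhere, so it carries no weight.
toy_docs = [["bank", "login", "bank"], ["weather", "login"]]
toy_weights = tf_idf(toy_docs)
```

Terms that occur in every document get weight 0, while terms concentrated in one document score highest, which is what makes the top-weighted terms useful as a page signature.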
Bi-clique - a complete bipartite subgraph: every vertex of the first set is connected to every vertex of the second set
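The bi-clique definition can be checked directly on a user-article graph. A small sketch, assuming edges are stored as (user, article) pairs; the names and the sample graph are hypothetical.

```python
from typing import Iterable, Set, Tuple


def is_biclique(users: Iterable[str], articles: Iterable[str], edges: Set[Tuple[str, str]]) -> bool:
    """True if every given user is connected to every given article."""
    articles = list(articles)
    return all((u, a) in edges for u in users for a in articles)


# Hypothetical mention graph: an edge means the user mentioned the article.
mentions = {
    ("u1", "a1"), ("u1", "a2"),
    ("u2", "a1"), ("u2", "a2"),
    ("u3", "a1"),
}
```

With this graph, users u1 and u2 form a bi-clique with articles a1 and a2, but adding u3 breaks it because u3 never mentioned a2.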
Build a labeled bipartite graph with articles and users as nodes, and an edge if a user mentions an article. Find maximal bi-cliques.
Compute temporal cohesion and textual cohesion and combine them in a weighted sum. For an article, average its score over all of its bi-cliques.
The top 5% of these are seeded as fake; the bottom 5% are seeded as true.
Spread labels if articles: are part of the same bi-cliques; have a lot of common users; are textually similar.
Spread labels based on: common users; textual similarity.

Dark Web

Deep Web: (password) the part of the internet not indexed by search engines (such as social media)
Dark Web: (Tor) overlay networks that use the Internet but require specific software, configurations, or authorization to access: behind password logins, encrypted, not linked, Tor Hidden Services
Ransomware: threatens to publish the victim's data, or holds the data hostage, unless paid
Tor browsing: uses several (3) different machines to create an onion network. Each connection is encrypted except the one leaving the exit node. The exit node will appear to be the one browsing.
Tor hidden service - uses introduction points, a directory service, and a rendezvous point.
1. Pick introduction points and build encrypted tunnels to them 2. Announce the service in the directory DB 3. The user gets back the 3 introduction points and creates a rendezvous point (3 hops away) 4. The user sends a message to an introduction point 5. The rendezvous point is now 6 hops away from the introduction point
Beneficial uses of Tor and anonymous browsing: can circumvent control by authoritarian regimes; people cannot be banned from accessing information
Socially detrimental uses of Tor and anonymous browsing: can be used as a harbor for illegal/illicit activity
How Tor traffic could be deanonymized by a large organization: an organization with enough resources can control or observe both an entry and an exit point, and then work out what goes on in between (for example by correlating the traffic at both ends)
How researchers have crawled the dark web: first gain access by identifying dark web forums. Then collect data through anonymous access, then process it and identify relationships/link data sources etc., then produce visualizations and reports
Why dark web crawling is beneficial for security practitioners - they are able to limit the damage of a data breach and take the necessary steps to protect the business, employees, customers, etc. from potential attacks. Can be used to detect/collect any leaked information
Information gain - reduction in entropy gained by knowing feature x: IG(y|x) = H(y) - H(y|x)
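The information gain formula can be worked through in a few lines. This sketch assumes discrete features and labels and Shannon entropy in bits (log base 2); the function names and the toy data are illustrative.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy H(y) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(features: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """IG(y|x) = H(y) - H(y|x), with H(y|x) weighted by how often each x value occurs."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[x].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy example: the feature fully determines the label, so the gain equals H(y) = 1 bit.
toy_labels = ["spam", "spam", "ham", "ham"]
perfect_feature = ["link", "link", "nolink", "nolink"]
useless_feature = ["a", "b", "a", "b"]
```

A feature that splits the labels perfectly recovers all of H(y), while a feature unrelated to the labels has a gain of 0.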
Stemming - remove suffixes to get the stem word; can be used for handling misspellings with 3-7 n-grams


Abstract execution records a small set of events during the traced program's execution. These events serve as input to an abstract version of the program that generates a full trace by re-executing selected portions of the original program.
Insider threat and accidental insider threat: threats from within (employees, associates). Accidental: weak passwords, unlocked devices. Intentional: can be injecting rogue software
Techniques for host-based user profiling on Unix and Windows: Markov chain model; Bayes factor to determine whether a transition is consistent (command A -> command B); on Windows, measure “properties” which vote with weights on whether an intrusion has occurred
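A first-order Markov chain over commands can be estimated from a user's history as sketched below. This only covers the transition probabilities (command A -> command B); the Bayes factor test mentioned above is not implemented, and the function names and command history are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def transition_probs(commands: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next command | current command) from one user's command history."""
    counts = defaultdict(Counter)
    for current, nxt in zip(commands, commands[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for current, following in counts.items()
    }


def unseen_transitions(
    probs: Dict[str, Dict[str, float]], session: Sequence[str]
) -> List[Tuple[str, str]]:
    """List transitions in a new session that never occurred in the training history."""
    return [
        (a, b) for a, b in zip(session, session[1:])
        if probs.get(a, {}).get(b, 0.0) == 0.0
    ]


# Hypothetical training history for one user.
history = ["ls", "cd", "ls", "cat", "ls", "cd"]
profile = transition_probs(history)
```

A session full of transitions the profile has never seen is a hint that someone other than the profiled user may be typing.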
Advantage of a hidden Markov model over an SVM for classifying command sequences: a Markov model assigns a probability to each transition; this can easily grow very big, so pick a small K; an SVM can be very accurate but does not address concept drift very well
Honeypot: a computer security mechanism set to detect, deflect, or counteract attempts at unauthorized use of information systems. Generally consists of data that appears legitimate and valuable but is isolated and monitored, so that attackers can be blocked or analyzed

