Review cheat sheet for DATA ANALYTICS FOR CYBER test 2

ML in practice: Malware detection

JSTAP: a project for detecting malicious JavaScript (JS)
Premise/problem: malicious JS can be used for bitcoin mining and for abusing browser vulnerabilities
Abstract Syntax Tree (AST) – Derived from the grammar of the programming language
JSTAP Principle: Perform static analysis with abstract syntax trees and random forests
Static Analyses
Static analysis - the code is never run; it is analyzed by inspecting (reverse engineering) the code itself
Dynamic analysis - run the code in a virtual machine or debugger. Malware writers deliberately obfuscate to defeat static tools. Example: GozNym runs a trivial infinite loop in a thread, then suspends the thread and overwrites the code with a jump to previously dead code
Dynamic analysis pitfalls - 1. It is easy for malware to detect that it is running in a debugger, in a VM, or under antivirus – Query the registry – IsDebuggerPresent – VM-specific instructions. 2. Malware can insert a long delay in the hope that the simulator will give up and go away
Control Flow Graph – Shows program flow (calls, selection, loops)
Program Dependence Graph (PDG) – Includes data and control dependencies
Token - a lexical unit obtained by splitting a program (like words in an English sentence)
N-gram - simple way to analyze token sequences
JSTAP n-grams
- Depth-first pre-order traversal of the AST
- For the CFG, also traverse the AST, but only nodes linked by a control flow edge. Traverse the sub-AST of each node with control flow once
- Similar for the PDG, considering data flow
- Independent n-grams for tokens, AST, CFG, PDG-Data Flow and PDG-Control Flow
- n = 4 is the best value
- Use a chi-squared test to check for correlation between the presence of an n-gram and the class (benign or malicious); keep n-grams with x^2 (chi-squared) >= 6.63 (99% confidence)
- If an n-gram is in both (benign and malicious), throw the n-gram away
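The n-gram extraction and chi-squared filter above can be sketched roughly as follows. This is a minimal illustration, not JSTAP's actual code: the function names, the per-file presence counting, and the input format (one list of units per file) are assumptions. Note that an n-gram appearing equally often in both classes scores 0 and is dropped.

```python
from collections import Counter

def ngrams(units, n=4):
    """Return the n-grams (as tuples) over a sequence of units (tokens, AST nodes, ...)."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic (1 degree of freedom) for a 2x2 contingency table.

    n11: malicious files containing the n-gram
    n10: benign files containing the n-gram
    n01: malicious files without the n-gram
    n00: benign files without the n-gram
    """
    total = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return total * (n11 * n00 - n10 * n01) ** 2 / denom

def select_features(malicious, benign, n=4, threshold=6.63):
    """Keep n-grams whose presence is correlated with the class label."""
    # Count in how many files of each class every n-gram occurs at least once.
    mal_counts = Counter(g for units in malicious for g in set(ngrams(units, n)))
    ben_counts = Counter(g for units in benign for g in set(ngrams(units, n)))
    kept = {}
    for gram in set(mal_counts) | set(ben_counts):
        n11, n10 = mal_counts[gram], ben_counts[gram]
        score = chi_squared(n11, n10, len(malicious) - n11, len(benign) - n10)
        if score >= threshold:  # 6.63 corresponds to 99% confidence
            kept[gram] = score
    return kept
```

Calling select_features(malicious_files, benign_files) returns a dictionary mapping each retained 4-gram to its chi-squared score.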
JSTAP Dataset - 131,448 malicious, 141,768 benign
JSTAP Classifier Training • Select 10,000 malicious and benign randomly for training – Additional 5,000 of each for validation • Repeat 5 times and average detection results
JSTAP results • Two step process • First phase – Unanimous voting, classifies 93% of data with 99.73% accuracy • Second phase – Unanimous voting, classifies 6.5% of data with accuracy still over 99%
Evasion techniques • Add more benign features • Copy malicious code into a larger benign file
Extremely abstract OS - learn about the sample without implementing the underlying OS. An over-approximation has more behaviors than system S; an under-approximation has fewer. Less precise than virtualization or emulation
Abstract execution - A technique for efficiently tracing programs. As a dynamic analysis, it has an emulator, an extremely abstract OS, and paths. Less precise than virtualization or emulation

ML in practice: Phishing detection

Phishing Websites - Fake websites that induce users to give up personal info; often used to collect credentials.
Techniques for finding Phish:
- Industrial toolbar-based: e.g. SpoofGuard, TrustWatch, Netcraft (found to be ineffective)
- User-interface-based: e.g. provide a custom image per user, or a password manager (only provides the password to certain domains)
- Web page content-based: Use web page info (URL, links, terms, images, forms) to detect phishing
-- CANTINA: compute term frequency-inverse document frequency (tf-idf) for terms, then Google a few terms to see if the current website is a top result
-- B-APT: Bayesian, based on tokens from the DOM
Some definitions: Surface-level content - URL, hyperlinks; Textual content - terms or words; Visual content - color, font size, style, location of images
Textual and visual classification: text classifiers examine the text within a page to detect whether certain words are more likely in a fraudulent page. Image classifiers transform the webpage into an image and then compare its similarity to genuine webpages.
Steps of the Bayes analysis: 1. Obtain the webpage and normalize 2. Compute the signature 3. Calculate the EMD (Earth Mover's Distance) and similarity between the website and the protected web page 4. Classify via threshold
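Steps 3 and 4 can be illustrated with a simplified sketch. The assumptions here are significant: the page "signature" is reduced to a one-dimensional histogram (e.g. color bins), the similarity is defined as 1 minus the normalized EMD, and the 0.9 threshold and all function names are hypothetical.

```python
from itertools import accumulate

def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]

def emd_1d(p, q):
    """Earth Mover's Distance between two normalized histograms over the
    same ordered bins, with unit distance between adjacent bins.
    In one dimension this equals the summed absolute difference of the CDFs."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

def similarity(p, q):
    """Map the EMD to [0, 1]: 1 means identical signatures (assumed definition)."""
    max_emd = max(len(p) - 1, 1)  # all mass moved from the first bin to the last
    return 1.0 - emd_1d(p, q) / max_emd

def resembles_protected_page(page_hist, protected_hist, threshold=0.9):
    """Flag a suspect page whose signature is very close to a protected page."""
    return similarity(normalize(page_hist), normalize(protected_hist)) >= threshold
```

A page that visually resembles a protected page but is served from a different site would be a phishing candidate under this scheme.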
Overall framework: 1. Train the text and image classifiers; collect similarity measurements for the different classifiers 2. Partition the similarity range into sub-intervals 3. Estimate probabilities for the text classifier 4. Estimate probabilities for the image classifier 5. Classify each test image 6. If the two classifiers disagree, calculate a decision factor 7. Return the final classification
High quality dataset:
- Accessibility: publicly available
- Completeness: encompasses the full breadth of phishing
- Consistency: the range and variance of the dataset ensure the data won't change substantively
- Integrity: data and labels are correct and non-corrupted
- Validity: data is properly representative
- Interpretability: data is understandable
- Timeliness: data is up to date, or still valid today and in the future
Bagging classifier - an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction
Boosting classifier - random forests build each tree independently, while gradient boosting builds one tree at a time. This additive model (ensemble) works in a forward stage-wise manner, introducing a weak learner to improve on the shortcomings of the existing weak learners.
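A minimal sketch of bagging with majority voting is shown here. The base classifier (a one-dimensional decision stump), the function names, and the default of 25 estimators are all assumptions chosen to keep the example small; a real system would use a library implementation.

```python
import random
from collections import Counter

def fit_stump(samples):
    """Fit a 1-D decision stump that predicts 1 when x >= threshold.
    samples is a list of (x, label) pairs with labels 0 or 1."""
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum((x >= threshold) != bool(label) for x, label in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_bagging(samples, n_estimators=25, seed=0):
    """Fit each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(samples) for _ in samples]
        stumps.append(fit_stump(bootstrap))
    return stumps

def predict(stumps, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(int(x >= threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]
```

Boosting would differ in the training loop: each new weak learner would be fitted to the errors of the learners already in the ensemble instead of to an independent bootstrap sample.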

Social network security - Spam

Spam - irrelevant messages sent to many recipients. Spamming is the use of messaging systems to send multiple unsolicited messages (spam) to large numbers of recipients
Criminal accounts tend to be socially connected – Maybe they are less discriminating in who they follow – Maybe it is intentional
Criminal hubs are more inclined to follow criminal accounts
K-anonymity - The publisher decides which attributes are public/private – Public attributes are “quasi-identifiers” • Every quasi-identifier tuple appears in at least k records in the anonymized DB
Determining if a database is k-anonymous for a particular value of k - check that every quasi-identifier tuple appears in at least k records in the DB. For k = 2, every public tuple appears at least twice, so we can't uniquely identify someone. A database of click traces is 2-anonymous if no click trace is unique
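The k-anonymity check can be written in a few lines. This is a sketch under the assumption that records are dictionaries and that the quasi-identifier column names are given; the function names are illustrative.

```python
from collections import Counter

def anonymity_level(records, quasi_identifiers):
    """Return the largest k for which the records are k-anonymous:
    the size of the smallest group sharing one quasi-identifier tuple."""
    groups = Counter(tuple(r[attr] for attr in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier tuple appears in at least k records."""
    return anonymity_level(records, quasi_identifiers) >= k
```

For example, a table where every (zip, age) pair occurs at least twice is 2-anonymous, and a single record with a unique pair drops the level to 1.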
How an attacker might deanonymize a database with auxiliary information (background info related to a record):
- Amplification of background knowledge - Uses Aux(r), which is close to r on a subset of attributes, to find an r’ close to r on all attributes - Extended to a subset
1. Compute score(aux, r’) for each r’ in the sample 2. Apply matching criteria 3. Output a record or a probability distribution over records
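The three steps above could look roughly like this. The scoring function (fraction of matching auxiliary attributes) and the matching criterion (the best score must beat the runner-up by a gap) are assumptions made for illustration; the notes do not specify either, and the names and the 0.25 gap are hypothetical.

```python
def score(aux, record):
    """Fraction of the auxiliary attributes that the candidate record matches."""
    matches = sum(1 for attr, value in aux.items() if record.get(attr) == value)
    return matches / len(aux)

def deanonymize(aux, sample, min_gap=0.25):
    """Return the best-matching record, or None if the match is ambiguous."""
    # Step 1: score every candidate record in the sample.
    scored = sorted(((score(aux, r), i) for i, r in enumerate(sample)), reverse=True)
    best_score, best_index = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Step 2: matching criterion, the best must stand out from the rest.
    if best_score - runner_up >= min_gap:
        # Step 3: output the record.
        return sample[best_index]
    return None
```

Returning None when two candidates score similarly avoids guessing; a variant could instead return a probability distribution over the candidates.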
Bystander - Someone who is “present but not taking part” in the photo, or someone who is “not a subject of the photo and is thus not important for the meaning of the photo”
How bystander detection could improve privacy: it can stop bystanders from being recorded without their knowledge, or let them know they are being recorded. Self-centered photos can put bystanders in awkward situations or poor posture, or reveal information they don't want on record.
Unicity - Proportion of unique pieces of information. U = 0 means the data is k-anonymous with k >= 2. U = 0.25 means 1/4 of the click traces are unique.
How to get < 10% unicity • Remove all info pertaining to clients and website visits • Coarsen time to at least hours
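Unicity and the effect of coarsening time can be sketched as follows, assuming each click trace is a tuple of (site, timestamp in seconds) pairs. The trace format and function names are assumptions for illustration.

```python
from collections import Counter

def unicity(traces):
    """Proportion of traces that occur exactly once in the dataset."""
    counts = Counter(traces)
    unique = sum(1 for trace in traces if counts[trace] == 1)
    return unique / len(traces)

def coarsen_to_hours(trace):
    """Replace each timestamp (in seconds) by the hour it falls in."""
    return tuple((site, seconds // 3600) for site, seconds in trace)
```

Comparing unicity(traces) with unicity([coarsen_to_hours(t) for t in traces]) shows how much coarsening the time resolution reduces the share of unique traces.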

Strategic manipulation, propaganda, and fake news

Fake news - news that is intentionally false, published by a news outlet.
Challenges in defining “fake news” - apart from the validity of the information: is it satire, actual misinformation, intended for deception, clickbait, rumor, etc.?
Automatic fact-checking - compare with a knowledge/expert base (references); use a base of SPO triples: (subject, predicate, object)
Fact extraction: redundancy (Donaldjohntrump vs donald-trump), timeliness (Britain, joinIn, EuropeanUnion), conflict, unreliability (TheOnion), incompleteness (may need to infer if something is missing)
Why temporal analysis may help with fake news detection: time can change the validity of information
Why source analysis may help with fake news detection: is the source satire or credible?
How textual and visual analysis may help with fake news detection:
- Textual content can indicate fake news through quantity, complexity, uncertainty, subjectivity, sentiment, informality, specificity and readability
- Visual content can be assessed by clarity, coherence, similarity distribution, diversity and clustering score
- SVMs and CNNs can be used for text analysis
Mixed code - Use of different languages, symbols, scripts, shapes to avoid detection • Text on Document – Defined from standard alphabetic characters • Text in Visual Media – Text in pictures • Text as Art Form – Use symbols not part of the alphabet to depict a simple code
Term frequency-inverse document frequency (tf-idf) - reflects how important a word is to a document in a collection
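A small tf-idf computation makes the definition concrete. This sketch uses one common variant (relative term frequency times the natural log of N over document frequency); other weightings exist, and the function name and input format are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents is a list of token lists.
    Returns one {term: weight} dictionary per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document gets weight 0, while a term concentrated in one document gets a high weight, which is why the top tf-idf terms serve as a page's characteristic keywords.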
Bi-clique - a complete bipartite graph: every vertex of the first set is connected to every vertex of the second set
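The definition can be checked directly. In this sketch the two vertex sets are called users and articles, with an edge meaning that a user mentions an article; the function name and the edge representation as (user, article) pairs are assumptions.

```python
def is_biclique(edges, users, articles):
    """True if every user in `users` is connected to every article in `articles`.
    edges is an iterable of (user, article) pairs."""
    edge_set = set(edges)
    return all((user, article) in edge_set for user in users for article in articles)
```

This only verifies a candidate; enumerating all maximal bi-cliques is a harder problem and needs a dedicated algorithm.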
Label a bipartite graph with articles and users as nodes, and an edge if a user mentions an article. Find maximal bi-cliques,
find temporal cohesion and textual cohesion, and create a weighted sum. For an article, average its score over all of its bi-cliques,
the top 5% of these are seeded fake, the bottom 5% are seeded true
Spread labels if articles – Are part of the same bi-cliques – Have a lot of common users – Are textually similar
Spread labels based on – Common users – Textual similarity

Dark Web

Deep Web: the part of the internet not indexed by search engines, e.g. behind password logins (such as social media)
Dark Web: (Tor) overlay networks that use the Internet but require specific software, configurations, or authorization to access – Behind password logins – Encrypted – Not linked – Tor Hidden Services
Ransomware: threatens to publish the victim's data or holds the data hostage unless a ransom is paid
Tor browsing: traffic is relayed through several (3) different machines to form an onion network. Every connection is encrypted except the one leaving the exit node, so the exit node appears to be the one browsing.
Tor hidden service - introduction points, directory service, and rendezvous point.
1. The service picks introduction points and builds encrypted tunnels to them 2. It announces the service in the directory DB 3. The user gets back the 3 introduction points and creates a rendezvous point (3 hops away) 4. The user sends a message to an introduction point 5. User and service then communicate through the rendezvous point over a 6-hop path (3 hops on each side).
Beneficial uses of Tor and anonymous browsing: can prevent control by authoritarian regimes; people cannot be banned from accessing information
Socially detrimental uses of Tor and anonymous browsing: can be used as a harbor for illegal/illicit activity
How Tor traffic could be deanonymized by a large organization: an organization with enough resources can control or observe both the entry and the exit node, and then correlate the traffic entering and leaving the network to link users to what they access
How researchers have crawled the dark web: first get access by identifying dark web forms. Then collect data through anonymous access, then process it and identify relationships/link data sources etc., then produce visualizations and reports
Why dark web crawling is beneficial for security practitioners - they are able to limit the damage of a data breach and take the necessary steps to protect the business, employees, customers, etc. from potential attacks. It can be used to detect/collect any leaked information
Information gain - reduction of entropy gained by knowing feature x: IG(y|x) = H(y) – H(y|x)
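The formula can be computed directly for a discrete feature. This is a minimal sketch with entropy in bits; the function names and the list-based input format are assumptions.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy H(y) in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(features, labels):
    """IG(y|x) = H(y) - H(y|x) for one discrete feature x."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[x].append(y)
    # H(y|x): entropy of each group, weighted by the group's share of the data.
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that separates the classes perfectly recovers all of H(y), while a feature independent of the label has an information gain of 0.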
Stemming - remove suffixes to get the stem word. Can be used for handling misspellings with 3-7 n-grams


Abstract execution records a small set of events during the traced program's execution. These events serve as input to an abstract version of the program that generates a full trace by re-executing selected portions of the original program.
Insider threat and accidental insider threat: threats from within (employees, associates). Accidental: weak passwords, unlocked devices. Intentional: can be injecting rogue software
Techniques for host-based user profiling on Unix and Windows: Markov chain model; Bayes factor to determine if a transition is consistent (command A -> command B); Windows measures “properties”, which vote with weights on whether an intrusion has occurred
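The Markov chain idea for Unix commands can be sketched as below. This is not the Bayes factor test itself: it only estimates transition probabilities from a user's history and scores a new session by its average log-likelihood. The add-alpha smoothing, the function names, and the vocabulary_size parameter are assumptions.

```python
import math
from collections import Counter, defaultdict

def fit_transitions(commands):
    """Count command A -> command B transitions in a user's history."""
    counts = defaultdict(Counter)
    for current, following in zip(commands, commands[1:]):
        counts[current][following] += 1
    return counts

def transition_probability(counts, current, following, vocabulary_size, alpha=1.0):
    """P(following | current) with add-alpha smoothing for unseen transitions."""
    row = counts.get(current, Counter())
    return (row[following] + alpha) / (sum(row.values()) + alpha * vocabulary_size)

def average_log_likelihood(counts, commands, vocabulary_size):
    """Average log-probability of the transitions in a new session.
    Low values mean the session is inconsistent with the user's profile."""
    pairs = list(zip(commands, commands[1:]))
    total = sum(
        math.log(transition_probability(counts, a, b, vocabulary_size))
        for a, b in pairs
    )
    return total / len(pairs)
```

A session whose score falls well below the scores of the user's own past sessions would be a candidate for masquerade or intrusion review.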
Advantage of a hidden Markov model over an SVM for classifying command sequences: a Markov model gives the probability of each transition; the model can easily grow very big, so pick a small K; an SVM can be very accurate, but it does not address concept drift very well
Honeypot: a computer security mechanism set to detect, deflect, or counteract attempts at unauthorized use of information systems. It generally consists of data that appears legitimate and valuable but is isolated and monitored, so that attackers can be blocked or analyzed


