Diving into Data Cheat Sheet by djjang2

CPCU 500_CH.10_Study Guide


Big Data: refers to the large, diverse sets of inform­ation that grow at ever-i­ncr­easing rates. It encomp­asses the volume of inform­ation, the velocity or speed at which it is created and collected, and the variety or scope of the data points being covered (known as the "­three v's" of big data). Big data often comes from data mining and arrives in multiple formats.
Big data is a great quantity of diverse inform­ation that arrives in increasing volumes and with ever-h­igher velocity.
Big data can be structured (often numeric, easily formatted and stored) or unstru­ctured (more free-form, less quanti­fia­ble).
Nearly every dept in a company can utilize findings from big data analysis, but handling its clutter and noise can pose problems.
Big data can be collected from publicly shared comments on social networks and websites, volunt­arily gathered from personal electr­onics and apps, through questi­onn­aires, product purchases, and electronic check-ins.
Big data is most often stored in computer databases and is analyzed using software specif­ically designed to handle large, complex data sets.
Data analysts look at the relati­onship between different types of data, such as demogr­aphic data and purchase history, to determine whether a correl­ation exists. Such assess­ments may be done in-house or externally by a third-­party that focuses on processing big data into digestible formats. Businesses often use the assessment of big data by such experts to turn it into actionable inform­ation.
Volume: Quantity of data. Size determines the value and potential insight, and whether the data qualifies as big data at all.
Variety: Type and nature of data. The shift from structured to semi-structured or unstructured data challenges existing technologies.
Velocity: Speed at which data is generated and processed. Big data is often available in real time.
Veracity: Comple­teness & Accuracy of data. Quality can vary, affecting accurate analysis.
Value: Derived from results of big data analysis.


Economic Data: Data regarding Interest rates, Asset prices, Exchange rates, and the Consumer Price Index; and other info about the global, national, or regional economy.
Stru­ctured Data: Data organized into databases with defined fields, including links between databases.
Unst­ruc­tured Data: Data that is not organized into predet­ermined formats, such as databases, and often consists of text, images, or other nontra­dit­ional media.
Internal Data: Is owned, captured, and stored by an organi­zation. Includes: Master data identi­fying customers, vendors, prospects; HR records; Employ­ee/­Cus­tomer corres­pon­dence; and Files specific to the type of business, such as Mfr's inventory records; banks' customer financial records; and insurer's premium records & rating factors.
External Data: Facts and figures available in locations outside a company. Refers to published data from outside the business.
Expl­oratory Data Analysis (EDA): an approach to analyzing data sets to summarize their main charac­ter­istics, often with visual methods.
A statis­tical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Promoted to encourage statis­ticians to explore the data, and possibly formulate hypotheses that could lead to new data collection & experi­ments.
EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assump­tions required for model fitting and hypothesis testing, and handling missing values and making transf­orm­ations of variables as needed. EDA encomp­asses IDA.
EDA Techni­ques incl: Scatter Plot & Bubble Plot
Initial Data Analysis (IDA): The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analysis aimed at answering the original research question.
IDA phase is guided by the following (4) questions: Quality of Data, Quality of Measurements, Initial transformations, and whether the implementation of the study fulfilled the intentions of the research design.
Text Mining: Obtains info through language recogn­ition; more difficult than w/ other models b/c there are no organized fields & no numerical values.
Steps of the Text Mining Process:
1. Retrieve & prepare the text.
2. Convert unstru­ctured data into structured data.
3. Create a data mining model to help the Org. achieve its object­ives.
4. Evaluate the model's effect­iveness in multiple areas.
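Steps 1 and 2 above can be sketched as a simple bag-of-words conversion. This is only an illustration: the sample notes are made up, and the helper name `to_term_counts` is hypothetical. Real text mining usually adds more preparation (stop-word removal, stemming) before modeling.

```python
import re
from collections import Counter

def to_term_counts(text):
    """Turn free-form text into a structured {term: count} record."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical adjuster notes (invented for illustration, not real claims data).
notes = [
    "Water damage in kitchen, pipe burst",
    "Rear bumper damage, minor collision",
]

# Step 1: retrieve and prepare the text (lowercase, split into word tokens).
rows = [to_term_counts(note) for note in notes]

# Step 2: convert to structured data, one row per note and one column per term.
vocabulary = sorted(set().union(*rows))
matrix = [[row[term] for term in vocabulary] for row in rows]

for note, counts in zip(notes, matrix):
    print(note, "->", counts)
```

Each row of `matrix` now has defined fields and numeric values, so it can feed the data mining model in step 3.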
Examples of:
External & Unstru­ctured Data: Social Media, News Reports, Internet Videos.
Internal & Structured Data: Policy Inform­ation, Claims History, Customer Data.
External & Structured Data: Telema­tics, Financial Data, Labor Statis­tics.
Internal & Unstructured Data: Adjustor's notes, Customer voice records, Surveillance videos.


Data loss occurs when valuable or sensitive information on a computer is compromised due to theft, human error, viruses, malware, or power failure. It may also occur due to physical damage to, or mechanical failure of, equipment or a building.
Data loss can be caused by external factors, such as a power outage, theft, or a broad-based phishing attack.
Companies can protect themselves by using data loss prevention procedures in software and by having protocols in place for employees that enable them to safely work with and share business documents.


Data Mining: is a process used by companies to turn raw data into useful inform­ation. By using software to look for patterns in large batches of data, businesses can learn more about their customers to develop more effective marketing strate­gies, increase sales and decrease costs. Data mining depends on effective data collec­tion, wareho­using, and computer processing.
Data mining programs break down patterns and connec­tions in data based on what inform­ation users request or provide.
Data Science: provides meaningful inform­ation based on large amounts of complex data or big data. Data science, or data-d­riven science, combines different fields of work in statistics and comput­ation to interpret data for decisi­on-­making purposes.
Data science uses techniques such as machine learning and artificial intell­igence to extract meaningful inform­ation and to predict future patterns and behavi­ors.
Disr­uptive Innova­tion: Disruptive Innovation refers to a technology whose applic­ation signif­icantly affects the way a market or industry functions. An example of modern disruptive innovation is the Internet, which signif­icantly altered the way companies did business and which negatively impacted companies that were unwilling to adapt to it.
Disru­ptive innovation refers to a new develo­pment that dramat­ically changes the way a structure or industry functi­ons.
Sequential Pattern Mining: A topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. The values are usually presumed to be discrete, so time series mining is closely related but usually considered a different activity. Sequential pattern mining is a special case of structured data mining.


Electronic Commerce (e-commerce): Electronic commerce or e-commerce (sometimes written as eCommerce) is a business model that lets firms and individuals buy and sell things over the internet. E-commerce operates in all four of the following major market segments:
- Business to business
- Business to consumer
- Consumer to consumer
- Consumer to business


Scatter Plot: A graphed cluster of dots, each of which represents the values of two variables. The slope of the points suggests the direction of the relati­onship between the two variables. The amount of scatter suggests the strength of the correl­ation.
two dimens­ional plot of point values
Bubble Plot: A Scatter Plot in which the size of the bubble represents a 3rd attribute, such as average accident severity.
Best option for conveying the numerical relati­onship between three or four sets of values.
Corr­elation Matrix: A table that summarizes a series of correl­ations among several variables.
rectangular display of all the correl­ations between all pairs of data sets with a key (such as color coding) that indicates the strength of the correl­ation
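A correlation matrix like the one described above can be sketched with the Pearson correlation coefficient. The variable names and values below are hypothetical, chosen only to show the shape of the result; the sketch assumes every variable has the same number of observations and is not constant.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical data: five policyholders, three variables.
data = {
    "driver_age": [22, 35, 47, 58, 63],
    "annual_miles": [15000, 12000, 11000, 9000, 6000],
    "claims": [3, 2, 1, 1, 0],
}

names = list(data)
# One correlation for every pair of variables (the diagonal is always 1).
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(13), " ".join(f"{matrix[a][b]:+.2f}" for b in names))
```

The sign of each entry matches the slope direction seen in a scatter plot of the same pair, and values near +1 or -1 correspond to little scatter.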
Regr­ession Model: Estimates relati­onships between or among variables.
Model uses mathem­atical functions of statis­tical regression to predict the numerical value of a target variable based on the values of the explan­atory variables
Regr­ession Analys­is: A set of statis­tical processes for estimating the relati­onships between a dependent variable (often called the 'outco­me/­target variable') and one or more indepe­ndent variables (often called 'predi­ctors', 'covar­iates', or 'featu­res').
Primarily used for (2) concep­tually distinct purpos­es.
First, widely used for prediction and foreca­sting, where its use has substa­ntial overlap with the field of machine learning; Second, in some situat­ions, can be used to infer causal relati­onships between the indepe­ndent and dependent variables.
Linear Regres­sion: Statis­tical method that predicts the numerical value of a target variable based on the value of one or more attributes or explan­atory variables.
A linear approach to modelling the relati­onship between a scalar response and 1 or more explan­atory variables (also known as dependent & indepe­ndent variab­les).
Linear Regres­sion: Falls into 1 of 2 catego­ries:
If the goal is predic­tion, foreca­sting, or error reduction, linear regression can be used to fit a predictive model to an observed data set of values of the response and explan­atory variables. After developing such a model, if additional values of the explan­atory variables are collected without an accomp­anying response value, the fitted model can be used to make a prediction of the response.
If the goal is to explain variation in the response variable that can be attributed to variation in the explan­atory variables, linear regression analysis can be applied to quantify the strength of the relati­onship between the response and the explan­atory variables, and in particular to determine whether some explan­atory variables may have no linear relati­onship with the response at all, or to identify which subsets of explan­atory variables may contain redundant inform­ation about the response.
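The prediction use case above can be sketched with ordinary least squares on a single explanatory variable. The data is invented so that the fit is easy to check by hand, and `fit_line` is a hypothetical helper name, not a library function.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observed data that follows response = 2 * explanatory + 1.
explanatory = [1, 2, 3, 4]
response = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(explanatory, response)

# Predict the response for a new explanatory value with no observed response.
prediction = intercept + slope * 5
print(slope, intercept, prediction)
```

For the explanation use case, the size of the slope (and, with real data, its statistical significance) indicates how strongly the response varies with the explanatory variable.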
Generalized Linear Model (GLM): Removes the normality and constant-variance assumptions of a linear model and specifies a link function that defines the relationship between the expected value of the response variable and a linear combination of the predictor variables.
A flexible genera­liz­ation of ordinary linear regression that allows for response variables that have error distri­bution models other than a normal distri­bution.
GLM consists of (3) elements:
1. An expone­ntial family of probab­ility distri­but­ions.
2. A linear predictor - the quantity which incorp­orates the inform­ation about the indepe­ndent variables into the model.
3. A link function - provides the relati­onship between the linear predictor and the mean of the distri­bution function.
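The three elements can be illustrated with a Poisson claim-count model and a log link, a common GLM setup. The coefficients below are assumed values for illustration, not fitted results, and the function names are hypothetical.

```python
from math import exp, log

def linear_predictor(coefficients, features):
    """Element 2: intercept plus weighted sum of the independent variables."""
    intercept, *weights = coefficients
    return intercept + sum(w * x for w, x in zip(weights, features))

def inverse_log_link(eta):
    """Element 3: with a log link, mean = exp(linear predictor)."""
    return exp(eta)

# Element 1 (assumed): the response is a claim count with a Poisson distribution.
# Hypothetical coefficients: base rate of 0.05 claims, growing 2% (on the log
# scale) per thousand miles driven.
coefficients = [log(0.05), 0.02]
thousand_miles = [10]

eta = linear_predictor(coefficients, thousand_miles)
expected_claims = inverse_log_link(eta)
print(round(expected_claims, 4))
```

Because the link is a log, the predicted mean is always positive, which suits count data better than an ordinary linear model would.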

Data-D­riven Decision Making

Data-Driven Decision Making: Refers to the collection and analysis of data to guide decisions that improve success.
Data-Informed Decision Making (DIDM): (2) basic approaches: Descriptive & Predictive.
Process for Data-d­riven Decision Making:
1. Define the Problem - Provide a business context for using the data. This step is crucial because modeling and analyzing data is not effective without a business context.
2. Prepare the Data - Identify the necessary data; Gather quality data; Verify its quality
3. Analyze & Model - Model the data using big data techniques. Use the appropriate descriptive or predictive approach.
4. Develop Insights - identify trends, relati­ons­hips, behaviors, and events
5. Make an Actionable Decision - develop and implement a solution to the problem


Data Wareho­usi­ng: the electronic storage of a large amount of inform­ation by a business or organi­zation. Data wareho­using is a vital component of business intell­igence that employs analytical techniques on business data.
A data warehouse is designed to run query and analysis on historical data derived from transa­ctional sources for business intell­igence and data mining purposes.
Data Analyt­ics: Data analytics is the science of analyzing raw data in order to make conclu­sions about that inform­ation. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consum­ption. Data analytics techniques can reveal trends and metrics that would otherwise be lost in the mass of inform­ation. This inform­ation can then be used to optimize processes to increase the overall efficiency of a business or system.
Data Analytics Process: involves several different steps:
1. The first step is to determine the data requir­ements or how the data is grouped. Data may be separated by age, demogr­aphic, income, or gender. Data values may be numerical or be divided by category.
2. The second step in data analytics is the process of collecting it. This can be done through a variety of sources such as computers, online sources, cameras, enviro­nmental sources, or through personnel.
3. Once the data is collected, it must be organized so it can be analyzed. Organi­zation may take place on a spread­sheet or other form of software that can take statis­tical data.
4. The data is then cleaned up before analysis. This means it is scrubbed and checked to ensure there is no duplic­ation or error, and that it is not incomp­lete. This step helps correct any errors before it goes on to a data analyst to be analyzed.
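The cleaning step (step 4) can be sketched as removing exact duplicates and setting aside incomplete records. The field names and records are hypothetical, and real cleaning rules depend on the data requirements defined in step 1.

```python
# Hypothetical policy records gathered in steps 2 and 3.
records = [
    {"policy_id": "P1", "age": 34, "premium": 820.0},
    {"policy_id": "P1", "age": 34, "premium": 820.0},    # exact duplicate
    {"policy_id": "P2", "age": None, "premium": 640.0},  # incomplete
    {"policy_id": "P3", "age": 51, "premium": 910.0},
]

def clean(rows):
    """Drop exact duplicates, then separate complete rows from incomplete ones."""
    seen = set()
    kept, rejected = [], []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the record
        if key in seen:
            continue  # duplication: skip the repeated record
        seen.add(key)
        if any(value is None for value in row.values()):
            rejected.append(row)  # incomplete: hold back for correction
        else:
            kept.append(row)
    return kept, rejected

kept, rejected = clean(records)
print([r["policy_id"] for r in kept], [r["policy_id"] for r in rejected])
```

Only the `kept` records would go on to the analyst; the `rejected` ones need correcting first.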
Neural Network: A data analysis technique that operates similarly to the human brain in its ability to infer rules from data patterns and construct logic to use for data analytics.
A network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes.
Form of AI that enables a computer to learn as it accumu­lates more data (deep learni­ng).
Neural Network: Disadv­ant­ages: The processes for developing the rules and logic may not be transp­arent.
A neural network can be overtrained if it reviews data in such detail that it cannot then operate in a larger framework with other types of data.
3 layers of Neural Network:
1. Input layer - provides data for the network to analyze
2. Hidden layer - uses mathem­atical functions to learn and recode input data
3. Output layer - provides results of the analysis
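The three layers can be sketched as a single forward pass. The weights below are arbitrary assumed values (a real network learns them from data during training), and the sigmoid is just one possible choice of mathematical function for the hidden layer.

```python
from math import exp

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1 / (1 + exp(-z))

def layer(inputs, weights, biases):
    """One layer: a weighted sum per node, passed through the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    hidden = layer(inputs, hidden_w, hidden_b)  # hidden layer recodes the input
    return layer(hidden, output_w, output_b)    # output layer gives the result

# Input layer: two hypothetical features for one record.
features = [1.0, 2.0]

# Assumed (untrained) weights: two hidden nodes, one output node.
hidden_w, hidden_b = [[0.5, -0.2], [0.1, 0.4]], [0.0, -0.3]
output_w, output_b = [[1.2, -0.7]], [0.1]

output = forward(features, hidden_w, hidden_b, output_w, output_b)
print(output)
```

The hidden-layer values have no direct business meaning, which illustrates the transparency disadvantage noted above.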


Social Network Analysis (Network analys­is): Studies the connec­tions and relati­onships among people in a social network.
Useful tool for making predic­tions based on trends
Social Network - group of indivi­duals or entities who share relati­onships and the flow of commun­ication
Node: Each individual or entity in a social network is known as a node.
a basic unit used to build data struct­ures
Cent­rality Measur­es: In a social network context, the quanti­fic­ation of a node's relati­onship to other nodes in the same network.
Determines the efficiency of the flow between social network connections.
indicators of centrality identify the most important vertices within a graph.
(3) Centrality measur­es:
1. Degree - the number of connec­tions each node has
2. Clos­eness - the average distance or path length btwn a given node and other nodes in the network
3. Betw­een­ness - how many times a given node is part of the shortest path btwn 2 other nodes in a network
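The three measures can be computed on a small example network. The graph below is hypothetical, and the sketch assumes a connected network in which each pair of nodes has exactly one shortest path (true here because the graph is a tree). Closeness follows the definition above (average distance); some tools report its reciprocal instead, and standard betweenness splits credit when several shortest paths tie.

```python
from collections import deque
from itertools import combinations

# Hypothetical social network: each node maps to its direct connections.
graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]

# 1. Degree: number of connections each node has.
degree = {node: len(neighbors) for node, neighbors in graph.items()}

# 2. Closeness: average path length from a node to every other node.
closeness = {
    node: sum(len(shortest_path(graph, node, other)) - 1
              for other in graph if other != node) / (len(graph) - 1)
    for node in graph
}

# 3. Betweenness: times a node lies on the shortest path between two others.
betweenness = {node: 0 for node in graph}
for source, target in combinations(sorted(graph), 2):
    for middle in shortest_path(graph, source, target)[1:-1]:
        betweenness[middle] += 1

print(degree, closeness, betweenness, sep="\n")
```

Node B scores highest on degree and betweenness and has the lowest average distance, so it is the most central node in this example.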

Financial Technology & Automated Investing

Arti­ficial Intell­igence (AI): Artificial intell­igence (AI) refers to the simulation of human intell­igence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and proble­m-s­olving.
Deep Learni­ng: is an artificial intell­igence (AI) function that imitates the workings of the human brain in processing data and creating patterns for use in decision making. Deep learning is a subset of machine learning in artificial intell­igence that has networks capable of learning unsupe­rvised from data that is unstru­ctured or unlabeled. Also known as deep neural learning or deep neural network.
Deep learning AI is able to learn without human supervision, drawing from data that is both unstructured and unlabeled. As a form of machine learning, it can be used to help detect fraud or money laundering, among other functions.
Machine Learni­ng: Machine learning is the concept that a computer program can learn and adapt to new data without human interv­ention. Machine learning is a field of artificial intell­igence (AI) that keeps a computer’s built-in algorithms current regardless of changes in the worldwide economy.
Machine learning is useful in parsing the immense amount of inform­ation that is consis­tently and readily available in the world to assist in decision making.
Algorithm: An algorithm is a set of instructions for solving a problem or accomplishing a task. One common example of an algorithm is a recipe, which consists of specific instructions for preparing a dish or meal. Every computerized device uses algorithms to perform its functions.
Disr­uptive Techno­logy: Disruptive technology is an innovation that signif­icantly alters the way that consumers, indust­ries, or businesses operate. A disruptive technology sweeps away the systems or habits it replaces because it has attributes that are recogn­izably superior. Recent disruptive technology examples include e-comm­erce, online news sites, ride-s­haring apps, and GPS systems.
A disruptive technology supersedes an older process, product, or habit.
Asso­ciation Rule Learni­ng: a rule-based machine learning method for discov­ering intere­sting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of intere­sti­ngness.
association rules are employed today in many applic­ation areas including Web usage mining, intrusion detection, continuous produc­tion, and bioinf­orm­atics.
Process of Asso­ciation Rule Genera­tion is usually split up into two separate steps:
1. A minimum support threshold is applied to find all frequent itemsets in a database.
2. A minimum confidence constraint is applied to these frequent itemsets in order to form rules.
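The two steps can be sketched on a handful of transactions. The product names, transactions, and thresholds are all hypothetical, and the sketch only considers pairs of items; full algorithms such as Apriori extend the same idea to larger itemsets.

```python
from itertools import combinations

# Hypothetical transactions: products held by each of five customers.
transactions = [
    {"auto", "home"},
    {"auto", "home", "umbrella"},
    {"auto"},
    {"home", "umbrella"},
    {"auto", "home"},
]

MIN_SUPPORT = 0.3      # assumed threshold for step 1
MIN_CONFIDENCE = 0.7   # assumed threshold for step 2

def support(itemset):
    """Share of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))

# Step 1: keep only the item pairs that meet the minimum support.
frequent_pairs = [
    frozenset(pair)
    for pair in combinations(items, 2)
    if support(set(pair)) >= MIN_SUPPORT
]

# Step 2: from each frequent pair, keep rules that meet the minimum confidence.
rules = []
for pair in frequent_pairs:
    for antecedent in pair:
        consequent = next(iter(pair - {antecedent}))
        confidence = support(pair) / support({antecedent})
        if confidence >= MIN_CONFIDENCE:
            rules.append((antecedent, consequent, confidence))

for antecedent, consequent, confidence in sorted(rules):
    print(f"{antecedent} -> {consequent} (confidence {confidence:.2f})")
```

Confidence here is the share of transactions containing the antecedent that also contain the consequent.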

Decision Tree Analysis: (5) Steps

1. Define the problem with a statement of the decision being considered
2. Create pathways (sequence of events) for each altern­ative, with each pathway leading to an outcome
3. Assign a probab­ility to each event on a pathway and estimate the value (cost or gain) of the outcome of each pathway
4. Multiply the probab­ility of each event by the value of its outcome to determine the expected value of each pathway
5. Compare expected values to determine the pathway with the highest expected value
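Steps 3 to 5 reduce to a short expected-value calculation. The alternatives, probabilities, and dollar values below are invented for illustration; costs are entered as negative values so that the highest expected value is the best choice.

```python
# Hypothetical decision: each alternative lists (probability, value) per outcome.
# Values are gains, so costs and losses are negative.
alternatives = {
    "install_sprinklers": [(0.95, -10_000), (0.05, -30_000)],
    "do_nothing": [(0.95, 0), (0.05, -500_000)],
}

def expected_value(outcomes):
    """Step 4: sum of probability times value over a pathway's outcomes."""
    return sum(probability * value for probability, value in outcomes)

expected = {name: expected_value(outcomes)
            for name, outcomes in alternatives.items()}

# Step 5: pick the pathway with the highest expected value.
best = max(expected, key=expected.get)

for name, value in expected.items():
    print(f"{name}: {value:,.0f}")
print("Best:", best)
```

The probabilities for each alternative must sum to 1, otherwise the expected values are not comparable.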

DECISION TREE: Analysis, Use, Features, and Inputs

Anal­ysis: Analyzes the conseq­uences, costs, and gains of decisions to compare altern­ative decisions.
Use: Decision tree analysis helps risk managers choose the best strategy to meet a goal.
Feat­ures: The process can be used to analyze both negative and positive conseq­uences.
Inpu­ts: The risk manager inputs the project plan with decision points and inform­ation on possible outcomes.

DECISION TREE: Outputs, Advant­ages, Disadv­antages

Outp­uts: Decision tree analysis produces an analysis of risk for each pathway with options and an expected value for each pathway.
Adva­nta­ges: Presents a visual portrayal, provides both quanti­tative and qualit­ative inform­ation, and offers a way to calculate the best pathway through a problem.
Disa­dva­nta­ges: Can be compli­cated and difficult to explain
also suscep­tible to oversi­mpl­ifi­cation, which can result in less accurate decision making

EVENT TREE Analysis: (6) Steps:

1. Identify the initiating event (first accidental event that could result in unwanted conseq­uences)
2. Determine conseq­uences of events that could follow the accidental event
3. Construct an event tree diagram that lists barriers in the sequence that would be activated if the designated event occurred
4. Design each pathway to fork at each barrier depending on whether the barrier succeeds or fails
5. Assign an estimated probability to the likelihood of success or failure of each barrier
6. Calculate the frequency of outcomes for each pathway
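The last steps can be sketched by multiplying the initiating event's frequency by the success or failure probability at each barrier. The event frequency, barrier names, and probabilities are assumed values, and the sketch treats barriers as independent, which a real analysis would need to verify.

```python
from itertools import product

# Hypothetical inputs: a fire starts 0.01 times per year; two barriers follow.
initiating_frequency = 0.01
barriers = [("sprinkler", 0.90), ("alarm", 0.80)]  # (name, P(success))

pathways = {}
# Each pathway forks at every barrier: True = succeeds, False = fails.
for results in product([True, False], repeat=len(barriers)):
    frequency = initiating_frequency
    for (name, p_success), succeeded in zip(barriers, results):
        frequency *= p_success if succeeded else 1 - p_success
    label = ", ".join(
        f"{name} {'succeeds' if ok else 'fails'}"
        for (name, _), ok in zip(barriers, results)
    )
    pathways[label] = frequency

for label, frequency in pathways.items():
    print(f"{label}: {frequency:.4f} per year")
```

The pathway frequencies always add up to the initiating event's frequency, which is a quick check that no branch was missed.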

EVENT TREE: Analysis, Use, and Features

Event Tree Analys­is: Analyzes the conseq­uences of accidental events rather than decisions.
Use - Risk managers use event tree analysis to evaluate risk treatment measures and identify, recommend, and justify improv­ements.
Feat­ures - Process typically analyzes only negative conseq­uences.

EVENT TREE: Outputs, Process, Procedures

- List of potential problems, with estimated values for outcomes and freque­ncies
- Recomm­end­ations regarding the effect­iveness of barriers

EVENT TREE: Adv & Disadvantages

Advantages:
- offers a visual portrayal of sequences of events following an accident
- shows the effectiveness of control systems
- provides both quantitative and qualitative information
Disadvantages:
- effective only if all potential events are identified
- analysis considers only two options (success or failure of barriers)
- analysis may ignore dependencies that arise within a sequence


Clas­sif­ication Trees: A supervised learning technique that uses a structure similar to a tree to segment data according to known attributes to determine the value of a catego­rical target variable.
Cluster Analysis: A model that determines previously unknown groupings of data.
Data Mining: The process of extracting hidden patterns from data that is used in a wide variety of applic­ations for research and fraud detection.
Big Data: Sets of data that are too large to be gathered and analyzed by tradit­ional methods.
Data Science: An interdisciplinary field involving the design and the use of techniques to process very large amounts of data from a variety of sources to provide knowledge based on data.
