title: Data Science
subtitle: DLMBDSA01
authors: Prof. Dr. Claudia Heß
publisher: IU International University of Applied Sciences
date: 2022
p. 12 - 32
We will explore a framework to create value from data through the study of data science. You will learn…
Study Goals
Definition - Data: Facts, observations, assumptions or incidences that can be analysed to get meaningful information.
Definition - Information: Patterns and relationships among data elements are instances of information. It is data that has been processed, interpreted, organized, structured or presented to be meaningful or useful.
Definition - Data Science: A field concerned with the systematic practice of analysing data, exploring the information contained in the data, and creating useful predictions to advise and guide the decision making process.
Data in its raw form is typically not useful. Data Science is concerned with the arrangement, analysis, visualization, and interpretation of collected data for the purpose of extracting embedded knowledge and useful information and predicting scenarios should new or different data be introduced. The output of the analysis can be used by decision-makers.
Definition - Data Mining: The Process of discovering patterns in large datasets.
Data Science is sort of a cross between many fields:
Per the book’s Venn diagram, it looks like Data Science is also a subset of Knowledge Discovery, and Data Visualization also crosses over into every field as well, if only a little.
Definition - Business Intelligence: A set of strategies for identifying, extracting, analyzing, managing, and delivering important trends relevant to business metrics.
Another book, “Intelligent Techniques for Data Science” I believe, also covers Business Intelligence (BI) and Data Science, and how business intelligence is very different in that it requires less computing skills and much more domain knowledge. BI is meant to generate a descriptive analysis of data for use in reporting the past behaviour of a business. Data Science uses predictive analysis for estimating a business’s future behaviour.
Along with our error metrics are sensitivity and specificity.
The book's formulas seem a little backwards at first, but that is probably what the "minus one" is for: specificity is often written as one minus the false positive rate. The standard definitions are below.
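In standard notation (using counts of true/false positives and negatives; the book's symbols may differ):

$$\text{sensitivity (TPR)} = \frac{TP}{TP + FN} \qquad \text{specificity (TNR)} = \frac{TN}{TN + FP} = 1 - \text{FPR}$$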
Regression prediction model output is measured with metrics like absolute error, mean square error, and relative error.
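In standard notation, with $y_i$ the observed values and $\hat{y}_i$ the predictions (my notation, since the book's symbols are not shown here):

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \qquad \text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad \text{relative error} = \frac{\lvert y_i - \hat{y}_i \rvert}{\lvert y_i \rvert}$$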
There are 3 dimensions:
Each has its own set of challenges, solution methodologies and numerical techniques.
Starts with collection of data including a list of possible sources and attributes. Stored data must be transparent, complete, and accessible.
Put simply, data curation is the process of refining collected data. There are different ways to do this:
Data analytics uncovers hidden patterns in data and transforms data into relevant and useful information. Techniques include modelling and simulation, machine learning, artificial intelligence, and statistical analysis.
Sources must be trustworthy to ensure collected data are robust and high quality. There’s a saying, “Garbage In, Garbage Out,” with accompanying acronym GIGO. Sources include:
Think quantitative and qualitative data types.
Definition - Quantitative Data: measurable values. Subtypes include discrete and continuous data.
Definition - Qualitative Data: information about quality of a good or service such as customer feedback.
Data comes in the following shapes:
Definition - Structured Data: These data have high level of organization, such as information shaped in tabular forms of rows and columns. Rows may also be called “records”.
Definition - Unstructured Data: Data with unknown form or structure. Think text and multi-media content. It is data in raw form that requires some processing.
Definition - Semi-Structured Data: All data shapes between structured and unstructured. Data might not be tabular but do have organizational properties that make them easier to analyze.
These are the main obstacles to handling any type of data, and they describe data overload (the "V's" of big data).
Definition - Volume: The amount and scale of data. Volume of data captured is expected to double every two years. Experts believe with advances in computational power and decreasing storage costs, this will not be a problem.
Definition - Variety: Data come from many sources, are of many types, and have different levels of complexity. Much data generated today can be considered unstructured.
Definition - Velocity: Speed at which data are created, stored, analysed, and visualized.
Definition - Veracity/Validity: Veracity refers to data quality. Validity refers to the value of data in extracting useful information for making a decision. Data lacks veracity when it contains elevated levels of noise picked up during data collection or data processing. Data can also become outdated and invalid, even if noise-free.
Definition - Value: Also called usage value, refers to the application the data are used for and the frequency of their use.
Statistics provides a summary of the data. Several statistical parameters are calculated to describe a variable clearly (a Random Variable?). These include minimum, maximum, mean, median, mode, variance, and standard deviation. Definitions are better left to standard texts in statistics.
Important to note that the mean, $\bar{x}$, is more sensitive to extreme values than the median. However, it is mathematically more convenient to work with. That is, many other statistical formulas can be derived from or with the mean.
Definition - Probability: The chance of an event happening.
Essentially, data science overlaps a great deal with statistics. And statistics is a branch of mathematics that is intertwined with Probability Theory.
I think core topics should be left to text in probability theory, but sure we can cover some here as well.
The probability of an event occurring must be between 0 and 1 inclusive. The probabilities of all possibilities must always sum to 1 as well. That is, an event must always occur, even if that event is "do nothing". Like rolling a die (the singular of dice). If you roll it, the chance of any particular side appearing, given a fair 6-sided die, is $\frac{1}{6}$. The probability that it lands on some side, given you roll the die, is the sum over all sides, or $6 \times \frac{1}{6} = 1$.
Please ignore the possibility that it would amazingly land and balance on an edge. It is so tiny we can say $P(\text{edge}) \approx 0$.
If you roll the die, it cannot simultaneously land on two sides at once. That is, you cannot roll a 3 and a 4 with the same die in one roll. Events of this nature are called mutually exclusive, and are represented as $P(A \cap B) = 0$.
So, the capital $P$ represents a generic probability function, and other capital letters (like $A$ or $X$) typically represent events or random variables, being the input parameters. For now, think of it like a whole set of values (or outcomes) in an event space, each with its own probability of occurring given that an event has occurred, like given we rolled the die.
In a Venn diagram, mutually exclusive events are non-overlapping fields. A direct result of mutually exclusive events is that combining their probabilities is as easy as actually summing them. That is, the probability of either a 3 or a 4 being rolled is $\frac{1}{6} + \frac{1}{6} = \frac{1}{3}$.
For events that are not mutually exclusive, we then consider if the events are independent. Independence just means that the events occurring simultaneously do not affect the probability of the other event occurring, increasing or decreasing the odds. You can roll two dice and the outcome of one does not affect the outcome of the other. However, if you have a standard deck of cards and you draw one card, the probability of the second draw is dependent on the first because the deck goes from 52 cards to 51. Maybe there are better examples…
Anyway, events that are not mutually exclusive have their own maths: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, and if they are also independent, $P(A \cap B) = P(A)\,P(B)$.
If you look at a Venn diagram, you can see that adding $P(A)$ and $P(B)$ double counts the $P(A \cap B)$ portion, which is why we subtract one part.
If two events are correlated, so perhaps neither mutually exclusive nor independent, we run into conditional probability. The probability that $A$ occurs given event $B$ has occurred is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$.
You can see that if they are independent events, $P(A \cap B) = P(A)\,P(B)$ and the $P(B)$ factors cancel out, leaving $P(A \mid B) = P(A)$.
In Data Science, all predictions from developed models are probabilities or a probability density distribution (for regression models). A probability density function is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would be equal to that sample.
The PDF as a function takes in possible values as inputs and returns a density. Density is a continuous concept, so the probability of any one single point is essentially 0; actual probabilities come from integrating the density over a range.
Wiki - Probability Density Function
For discrete distributions, we should refer to them as Probability Mass Functions. It appears the text confuses them. But no problem, as when you integrate a density over a region, you obtain a probability mass.
The text shows how to calculate probability as the count of outcomes you desire over the total number of possible outcomes. For the sum of two dice, you can create a more interesting probability mass function graph that actually appears kind of normal.
Every variable of a dataset follows a particular frequency distribution that reflects how often each value of this specific variable occurs. There are some general and classical distributions that appear regularly across many datasets. A perk of matching a known distribution is that it can usually be described mathematically with a closed-form expression of a few parameters.
We will only cover a few distributions. Since this is not a course in probability and statistics, I would refer readers to those notes (when I write them).
The bell curve, appearing in nature at every turn. The normal distribution has about $68\%$ of values within one standard deviation, $\mu \pm \sigma$. Then, about $95\%$ within 2 standard deviations, $\mu \pm 2\sigma$, and about $99.7\%$ within 3, $\mu \pm 3\sigma$.
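A quick sketch in base R to confirm those fractions (any normal distribution gives the same values):

# Fraction of a normal distribution within 1, 2, and 3 standard deviations
sapply(1:3, function(k) pnorm(k) - pnorm(-k))
# ~ 0.683 0.954 0.997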
It models the number of successes of an event occurring over repeated trials, like a coin toss.
The PMF is $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ for $k$ successes in $n$ trials, each with success probability $p$.
The “n-choose-k” term $\binom{n}{k}$ is the binomial coefficient. It has some other convenient properties such as $\binom{n}{k} = \binom{n}{n-k}$ and $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$.
Wiki also has a bit about moments, possibly the moment generating function.
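A small R sketch of the binomial PMF and the binomial coefficient (toy numbers, not from the text):

# Probability of exactly 7 heads in 10 fair coin tosses
dbinom(7, size = 10, prob = 0.5)   # ~0.117
# The binomial coefficient itself, plus its symmetry property
choose(10, 7)                      # 120
choose(10, 7) == choose(10, 3)     # TRUE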
A discrete probability distribution that expresses probability of given number of events occurring in a time interval. Events need properties like constant mean rate and independence from time since last event.
The PMF is $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$, where $\lambda$ is the average number of events per interval.
It has a fun property of $E[X] = \mathrm{Var}(X) = \lambda$.
An example might be how many calls a call centre receives in a day. There’s also cosmic rays, radioactive decay and sales records!
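A quick R illustration of the call-centre example (the rate of 4 calls per hour is made up):

# Poisson PMF for a call centre averaging 4 calls per hour
dpois(0:8, lambda = 4)        # P(X = k) for k = 0..8
ppois(2, lambda = 4)          # P(X <= 2), the cumulative probability
sum(dpois(0:20, lambda = 4))  # sums to ~1 over a wide enough range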
Back to Bayes' Theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$.
Naming convention: $P(A)$ is the prior, $P(B \mid A)$ the likelihood, $P(B)$ the evidence, and $P(A \mid B)$ the posterior.
An example is a drug test, with positive and negative results. Even if the test is 99% accurate, if the pool of drug users is relatively small, like 1%, then the chance a positive result actually means the user is on drugs is the probability the user is on drugs given they tested positive: $P(\text{user} \mid +) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = 0.5$.
If the denominator is confusing, it is the total probability of a positive result. That is the 99% true positive rate applied to the active users (1%) plus the unlikely false positives (1%) from the non-users (99%).
But isn’t that just amazing. If we have a known pool of users, we can determine the test is positive with 99% accuracy. But just relying on the test itself to determine who is a user can only give us 50%. That means if a person tested positive, it’s a coin-flip whether or not they are actually a user.
This comes from the disproportionate sample sizes. Even with 1% error, the very large set of non-users ramps up the false positives, decreasing predictive power.
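A minimal base-R sketch of that calculation (variable names are my own):

# Posterior probability of being a user given a positive test
sens  <- 0.99   # P(positive | user)
spec  <- 0.99   # P(negative | non-user); false positive rate is 1 - spec
prior <- 0.01   # P(user), the base rate
p_pos <- sens * prior + (1 - spec) * (1 - prior)  # P(positive), the denominator
post  <- sens * prior / p_pos                     # P(user | positive)
post  # 0.5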
This is why domain knowledge is very important. If we know the person does not have a history of drug abuse, maybe we would set a much smaller prior $P(\text{user})$ for their case. Or if you know the person actively does drugs, you might set $P(\text{user})$ close to 1. In fact, just knowing there's a 50% chance they should test positive (a prior of 0.5) could increase the accuracy to $\frac{0.99 \times 0.5}{0.99 \times 0.5 + 0.01 \times 0.5} = 0.99$.
It's an example to show how the prior probability $P(A)$ is adjusted into the posterior probability $P(A \mid B)$. This can be the result of designing a classifier to predict the occurrence of an output for a new training set, and is the main idea behind the Naïve Bayes classifier for categorical data of independent random variables.
title: Business Analytics Using R
subtitle: A Practical Approach
authors:
- Dr. Umesh R. Hodeghatta
- Umesha Nayak
publisher: New York Apress
year: 2017
doi: 10.1007/978-1-4842-2514-1_5
iu_an: ihb.44870
Just a side note, these will not be the most intensive notes taken.
This chapter covers data exploration, validation, and cleaning required for data analysis. Good to know why we clean and prepare data, and some useful methods and techniques.
The purpose of business analytics is to derive information from data to help make good business decisions. There are about 8 phases to the business analytics project life cycle.
There’s also the Cross-Industry Standard Process for Data Mining (CRISP-DM), which has the following 6 phases:
Know exactly what the client really wants you to solve and document it. Determine availability of data, data format, quality, amount, and the data stream for final model deployment.
Document business objective, data source, risks, limitation. Define timeline, required infrastructure to support the model, and expertise required to support project.
Quality data is the most important factor determining the accuracy of results. Data can be:
A data warehouse is an integrated database created from multiple databases within the organization. Warehouse technology typically cleans, normalizes and pre-processes data before it is stored. Warehouse technology also supports Online Analytical Processing (OLAP).
NoSQL databases were developed to overcome limitations of relational databases and meet the challenges of enormous amounts of data being generated on the Web. Stores structured and unstructured data.
Unless you have big data infrastructure, only a sample of the population is used to build the analytical model. A sample is a smaller collection of units from a population used to determine truths about said population. It should represent the population. Techniques depend on the type of business problem.
Probability sampling ensures all members of the population have an equal chance of being selected for the sample. There are different variations:
Calculating sample sizes is left to statistics or research methodology books.
For finding a good relationship between the outcome variable $y$ and the predictor variables $x_1, \dots, x_p$, you need enough data. How much?
A rule of thumb is to have at least some multiple of $m \times p$ records, where $m$ is the number of outcome classes and $p$ is the number of predictor variables. The more records, the better the results.
Data in a database may be susceptible to noise and inconsistencies. Various sources and collection methods and multiple people handling data over time can cause this. You must understand data in terms of data types, variables and characteristics of variables, and data tables.
Data can be…
Now you know the data’s types, study the data! Check values, find missing values, correct unknown characters, etc… this can all impact your model’s accuracy. How do you handle missing values?
It is important to predict the values based on the most probable value.
What about handling duplicate data, junk, and null values? You should clean these from the database before the analytics process.
There are multiple solutions to perform the same tasks, so here’s one:
Variables in R are represented as vectors. Here are basic data types in R:
Basic operations in R are performed as vector operations (element-wise). The book then covers an example in R. They note that converting factors to numeric values may not give actual values. So, first they convert the value to a character and then a numeric.
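A short R illustration of that factor-to-numeric gotcha (toy values, not the book's data):

# as.numeric() on a factor returns the internal level codes, not the values.
# Convert to character first, then to numeric.
f <- factor(c("10", "20", "30"))
as.numeric(f)                 # 1 2 3  -- level codes, not what we want
as.numeric(as.character(f))   # 10 20 30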
This is the step of exploring and understanding what kind of data you have. You also need to understand the distribution of individual variables and the relationships between variables. Expect to use graphs and tables.
Goals of exploratory analysis:
Provides example of showing a table in R. Then some graphs.
Univariate analysis analyzes one variable at a time. A histogram represents frequency distribution of the data, typically as a bar chart. I personally do not like the box plots, whisker plots, or box and whisker plots (all the same thing). They are a bit of work to just show a mere description of the data, like quartiles. The book shows the max and min values, and then the outliers beyond that, which is a nice touch.
There’s also the notched plots, similar to the box plot.
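A self-contained R sketch of the univariate plots mentioned above (toy data, not the book's dataset):

# Univariate look at a single numeric variable
x <- rnorm(200, mean = 50, sd = 10)
hist(x, main = "Histogram", xlab = "x")  # frequency distribution as bars
boxplot(x, notch = TRUE)                 # notched box plot: quartiles plus outliers
summary(x)                               # min, quartiles, mean, max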
Bivariate data analysis is used to compare the relationship between 2 variables. The main goal is to explain the correlation of two variables. Then, you can suspect a causal relationship. If more than two variables are measured on each observation, then multivariate analysis is applied.
A scatter plot is the most common data visualization for bivariate analysis and needs little to no introduction. Too little data makes it hard to draw an inference. Too much data can create too much noise to draw an inference.
There’s a scatter plot matrix, which takes pairs of variables and creates graphs in a table.
hou <- read.table("housing.data", header = TRUE, sep = "\t")  # load the housing data
str(hou)    # shows the structure of the table
pairs(hou)  # creates the scatter plot matrix
So, each variable is represented on both the x- and y-axis and their pairwise plots are shown in the table. The diagonal would be plots of a variable against itself and is ignored (unless you changed the order of variables on the x-axis to differ from the y-axis).
Trellis Graphics is a framework for data visualization that lets you examine multiple variable relationships. The trellis plot tries to solve the overplotting issue of scatter plots with different depths and intervals.
A correlation graph is like a matrix of correlation values between pairs of variables. The numbers can be transformed into a plot with colours and shapes, also referred to as a heat map. Note that sometimes variables are not related directly, but could be related polynomially, exponentially, logarithmically, inversely…
library(corrplot)                    # provides corrplot()
corel <- cor(stk1[, 2:9])            # correlation matrix of columns 2 to 9
corel                                # prints the correlation matrix
corrplot(corel, method = "circle")   # draws the correlation plot
You can also plot density functions to illustrate the separation by class.
Your data might not always be the best. It could be skewed, not normally distributed, or on different measurement scales. This is when you must rely on data transformation techniques. Techniques include normalization, data aggregation, and smoothing. After the analysis, the inverse transformation should be applied to bring results back to the original scale.
We will go into detail about normalization. Techniques such as regression assume data is normally distributed and that variables should be treated equally. However, different measuring units can lead to some variables having more influence than others. All predictor variable data should be normalized to one single scale.
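A small base-R sketch of two common normalizations, min-max and z-score (the vector is made up):

x <- c(2, 5, 9, 14, 20)
(x - min(x)) / (max(x) - min(x))  # min-max rescaling to [0, 1]
scale(x)                          # z-score: subtract the mean, divide by the sd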
Ready to perform further analysis. Analytics is about explaining the past and predicting the future.
Descriptive Analytics explains the patterns hidden in data. You can group observations into the same clusters and call it cluster analysis. There is also affinity analysis, or association rules, that can uncover patterns in a transactional database.
There’s either classification or regression analysis.
Then there is logistic regression, which is also described in this Wikipedia article, and is like continuous classification. The output isn't a hard true or false, but a probability, like being 80% sure of the class.
The typical logistic function is $f(x) = \frac{1}{1 + e^{-(x - \mu)/s}}$
where $\mu$ is the location parameter and $f(\mu) = \frac{1}{2}$. And $s$ is a scale parameter.
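A quick R sketch of that S-shaped curve (the default parameter values are my own choice):

# Logistic function with location mu and scale s
logistic <- function(x, mu = 0, s = 1) 1 / (1 + exp(-(x - mu) / s))
curve(logistic(x), from = -6, to = 6, ylab = "f(x)")  # the "S" shape
logistic(0)  # 0.5 at the location parameter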
This is about making computers learn and perform tasks better based on historical data. Instead of humans writing code, we provide the computer with instructions on how to build mathematical models. The computers build the models and can perform tasks without the intervention of humans.
Machine learning has two main flavours: supervised and unsupervised. There’s also reinforcement learning, not discussed here apparently.
You need to evaluate your model to understand if it is good at making predictions on new data. To remove bias from assessing with data from development, the data must be partitioned into:
Let’s discuss the different sets:
The usual analogy is studying for a test. The training set is the course material, the validation might be practice exams, and the testing set is the maths test.
To avoid bias, the data set is partitioned randomly. If you have a limited amount of data, you can still achieve an unbiased performance estimate with the k-fold cross-validation technique. This divides the data into $k$ folds and builds the model using $k-1$ of them. The last fold is used for testing. You repeat the process $k$ times, and each time the test fold is different.
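A base-R sketch of how the folds could be assigned (the dataset and model are placeholders, not the book's example):

# k-fold cross-validation indices; model fitting is only hinted at
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))  # random fold assignment
for (i in 1:k) {
  train <- mtcars[folds != i, ]  # build the model on k-1 folds
  test  <- mtcars[folds == i, ]  # evaluate on the held-out fold
  # fit <- lm(mpg ~ wt, data = train); predict(fit, newdata = test); ...
}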
To assess the performance of a classifier, we look at the number of mistakes. This is the misclassification error: the observation belongs to a class other than what was predicted. A confusion matrix gives an estimate of the true classification and misclassification rates.
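In R, a confusion matrix is just a cross-tabulation of actual versus predicted labels (toy vectors below):

actual    <- factor(c("yes", "yes", "no", "no", "yes", "no"))
predicted <- factor(c("yes", "no",  "no", "yes", "yes", "no"))
table(Predicted = predicted, Actual = actual)  # the confusion matrix
mean(predicted != actual)                      # misclassification error rate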
A lift chart is used for marketing problems by helping to determine how effectively an advertisement campaign is doing. The lift is a measure of the effectiveness of the model, calculated as the ratio of the results obtained with the model to the results obtained without it. So, it sounds like a placebo-controlled trial.
A Receiver Operating Characteristic (ROC) chart is similar to a lift chart, with true positives plotted on the y-axis and false positives on the x-axis. Good for representing the performance of a classifier; that is, sensitivity versus $1 -$ specificity. The area under the curve (AUC) should be between $0.5$ and $1$.
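A sketch using the pROC package, assuming it is installed (the class labels and scores are invented):

library(pROC)
actual <- c(1, 1, 0, 0, 1, 0, 1, 0)                    # true classes
scores <- c(0.9, 0.8, 0.4, 0.35, 0.7, 0.2, 0.6, 0.55)  # model probabilities
r <- roc(actual, scores)
plot(r)  # ROC curve: sensitivity vs. 1 - specificity
auc(r)   # area under the curve; 0.5 is random guessing, 1 is perfect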
There are many metrics for regression
Now, your model is presented to business leaders. This requires business communication. If changes are required, you would start over here. Usual topics of discussion are:
Your model may perform predictive analytics or simple reporting. Note that the model can behave differently in the production environment because it might see completely different data. This may require revisiting the model for improvements.
Success depends on
What are typical issues observed?
There are many more, but that’s just a general outline.
Should probably have a dedicated R section, but this will do for now. Check out this article from DataQuest.io about the R language. It is a lexically scoped language, AKA static scoping. So, the structure of the code determines the scope of a variable, not the most recent assignment.
x <- 5 # <- is R's assignment operator
y <- 2
multiple <- function(y) y*x
func <- function() {
x <- 3
multiple(y)
}
func() # returns 10
Although we define an `x` in the `func()` function, the `multiple()` function refers to the `x` in the global scope. If this is confusing, that is OK. These functions are impure and I wouldn't recommend this architecture for anything other than learning purposes.
If you declare a `y` in the `func()` function, that will have an effect, because we pass `y` in to `multiple()` as an argument. See the variant below.
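A variant of the earlier snippet to show that, with a local `y` added:

x <- 5
y <- 2
multiple <- function(y) y * x
func <- function() {
  x <- 3       # still ignored: lexical scoping means multiple() sees the global x
  y <- 4       # this one matters, because it is what we pass in
  multiple(y)
}
func()  # returns 20 (4 * 5), instead of 10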
To download R, go to CRAN.R-Project.org and download it. It comes with its own basic GUI. You can also look for the RStudio IDE or use a Jupyter Notebook, which requires an R kernel. This may require Anaconda.
Data Science resembles sciences in that:
Data Science is different from other sciences because it focuses on:
Knowledge pyramid:
Data is not super useful without context, which is why we add metadata. We interpret data to create information, which provides meaning. Then we can understand it and its relationships to create knowledge. That eventually leads to a theory, basically an explanation.
Data are like facts; they describe a quantity or a quality. There are ways to categorise data, and three ways are:
Metadata is beyond data, it provides context. So, it tells you what data is about. It also comes in three flavours:
A value combined with a unit gives things like measurements.
Now discussing discrete and continuous quantitative variables. There’s also quantitative variable scales:
Qualitative variables are always categorical and discrete. You can label them with numbers, but they are still qualitative. There are no scales; that wouldn't really make sense, they say. We have types of categories though:
Typical business scenarios try to convert unstructured data into structured data. Documents are not structured for computers, only for humans. We add metadata to make them at least semi-structured, like a webpage. We use HTML tags for the web browser to read and understand. The goal is to get enough metadata to create tables of structured data.
This leads us to Data Science Activities. There are three activity containers to remember:
Cleaning data can be a lot of things, checking the dataset for accuracy, completeness, and adherence to predefined rules or constraints:
Feature extraction is like taking the maximum temperature from a day. Transformation of data can be different data types, like:
We can also transform data by changing visualizations:
We can also transform the data domain, which is more mathematically complicated:
This leads to modelling! Models are abstract and simplified representations of complex systems or phenomena in a concrete scenario. In basic terms, they explain a causal relationship. Models can be mathematical, statistical, logical, computational, or visual representations. They include classes and relationships, and if we can quantify the relationship, we can: explain, simulate, and predict!
We hop over to statistics now. Descriptive statistics covers the methods for summarizing data via:
Some useful terms perhaps… Descriptive statistics is used to describe features of a data set through measures of:
The overall goal is to identify the distribution of the population, which is the foundation for inferential statistics. The underlying distribution has some fixed parameters, which we estimate from the data used. Distribution functions are typically characterized by three parameters:
We then look at the Poisson Distribution (discrete). It models the number of events occurring within a given time interval. Below are the probability mass function and the cumulative distribution function for a Poisson distribution:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \qquad F(k) = e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^i}{i!}$$

$\lambda$ happens to be the mean and, I believe, also the variance for this distribution.
The Normal Distribution is a continuous distribution.
The standard normal distribution is when the mean is 0 and standard deviation is 1. There is also an example of a probability tree to show what the values of the standard deviation represent which is very interesting.
We are looking at Normal Distribution again.
Then we look at Venn Diagrams to represent classes. Overlapping areas represent similarities of classes. Then, we express:
Now we can represent Joint and Disjoint events. Below is the probability that either A or B occurs, or both: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Looking at a Venn diagram, you can see that the intersection would be included twice if we just add the sets. Now, for the probability that either A or B occurs but not both, it is the same thing but: $P(A \oplus B) = P(A) + P(B) - 2\,P(A \cap B)$
Apparently referred to as the symmetric difference.
And if events are disjoint (mutually exclusive), the intersection is the empty set, so you don’t need to worry about it at all.
This naturally leads to unconditional and conditional events.
We look at a probability tree which basically tracks the sequence of events of all possible events. It helps you see:
We cover an example of a tree with a box of balls, say 2 red and 3 blue. Going over the terminology is important.
Q1: Probability of drawing a red ball on first attempt?
That is $P(R_1) = \frac{2}{5}$.
Q2: Probability to draw a red ball individually in any second attempt?
This means if we previously drew a red, we have $P(R_2 \mid R_1) = \frac{1}{4}$, and if we drew a blue, the next red is drawn at $P(R_2 \mid B_1) = \frac{2}{4} = \frac{1}{2}$. So the total probability would be $\frac{2}{5}\cdot\frac{1}{4} + \frac{3}{5}\cdot\frac{1}{2} = \frac{2}{5}$.
Note: with the word ”individually” we actually do not want the combined probability. If asked like this on an exam, give individual cases and associated probabilities, not conditional either apparently.
Q3: What is probability to draw two red balls in a row?
I know there is a distribution to solve this (the hypergeometric, I think). He says it is $P(R_1 \cap R_2) = \frac{2}{5}\cdot\frac{1}{4} = \frac{1}{10}$, as it is the intersection of draw red and draw red. The question should say within the first 2 attempts.
Q4: What is probability to draw any red ball in second consecutive attempt?
It is the sum of 2 intersections: $P(R_2) = P(R_1 \cap R_2) + P(B_1 \cap R_2) = \frac{2}{5}\cdot\frac{1}{4} + \frac{3}{5}\cdot\frac{2}{4} = \frac{1}{10} + \frac{3}{10} = \frac{2}{5}$.
Q5: What is the probability that the first ball drawn is red given the second drawn is blue?
This is true conditional probability: $P(R_1 \mid B_2) = \frac{P(R_1 \cap B_2)}{P(B_2)} = \frac{\frac{2}{5}\cdot\frac{3}{4}}{\frac{2}{5}\cdot\frac{3}{4} + \frac{3}{5}\cdot\frac{2}{4}} = \frac{3/10}{3/5} = \frac{1}{2}$.
We use the conditional probability formula here; since the draws are without replacement, the events are not actually independent. I was nearly there but lost confidence. Don't lose confidence!
We now must look at Bayes Theorem, because it allows us to flip the conditional event, solving for new probabilities:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Where: $P(A \mid B)$ is the posterior, $P(B \mid A)$ the likelihood, $P(A)$ the prior, and $P(B)$ the evidence.
A Derivation is provided but isn’t too tough. The concept is used throughout Bayesian statistics and creates new distributions based on prior knowledge.
Inferential statistics covers methods for generalizing from data, such as prediction, estimation, and hypothesis testing. It allows researchers to assess their ability to draw conclusions that extend beyond the immediate data:
Example of inferential statistics include:
We now go into Linear Regression, which relies on the general model for ordinary least squares. I think the formulas would be listed above or elsewhere; the standard form is below.
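In standard notation (which may differ slightly from the slides), the general linear model and the least-squares objective are:

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i \qquad \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$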
When we use ANNs (artificial neural networks), we do something very similar to how the brain works in nature. We are shown a biological neuron diagram…
The presentation draws a parallel, which is nice. So the inputs are similar to the dendrites, and the synaptic strengths are the weights we give to the inputs as they come in. In the nucleus, we sum all the individually weighted inputs to get a net input. The axon hillock is represented by a transfer function, an example being a sigmoid function giving an "S" shape. And it provides an output if the axon threshold is met, or something like that.
Let's dive a bit deeper into an Artificial Neural Network. Input values are passed into the input layer of your ANN model. Then, the processing happens in all the hidden layers. Finally, an output comes out of the output layer. All nodes in one layer are attached to all nodes in the succeeding layer. Then, we train!
We start with feedforward, where information is passed in a forward fashion. During training, if the output is incorrect, we perform a backpropagation to calculate the difference to a desired output and correct the weights accordingly.
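A minimal sketch of a single artificial neuron's forward pass in R, following the biological analogy above (the weights and inputs are invented, and this is not the course's exact network):

sigmoid <- function(z) 1 / (1 + exp(-z))  # "axon hillock" transfer function
inputs  <- c(0.5, 1.0, -0.3)              # signals arriving at the "dendrites"
weights <- c(0.8, -0.4, 0.2)              # "synaptic strengths"
bias    <- 0.1
net     <- sum(inputs * weights) + bias   # the "nucleus" sums the weighted inputs
output  <- sigmoid(net)                   # activation between 0 and 1
output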
Supervised learning has inputs and labels. We tell the model what we expect exactly. Good for predictions and classification.
Unsupervised learning is more complicated: we only provide a model/network with inputs and ask it to create its own labels. It can be used to group data into classes of its own making (clustering). This network can help us discover hidden patterns or relationships!
We also have reinforcement learning. We make a model and:
This helps discover optimal strategies.
Short Summary:
Data and metadata can be transformed into information which can be derived into knowledge.
We find some raw data and clean it to have regular data. Then, we extract useful features from data and classify those features to find patterns. Then, those patterns are learned to create a relationship in the data. Once we have relationship, we can create models for predicting.
If we do the leg-work up to the learning step, then we have supervised learning. If we don’t know how to classify the data, then we perform unsupervised learning. And if we are quite lazy or there is too much data and features to figure it all out, we let Deep Learning extract the features for us. Then, if we only feed a system raw data, we are probably going for reinforcement learning.
Good to have an overview like that.
Which of the following is the blind machine learning task of inferring a binary function for unlabeled training data?
This is unsupervised learning.
Incorrect choices include regression, supervised learning and data processing. The last isn’t machine learning. The first two are supervised.
The true positive rate achieved by a developed machine learning model is defined as…
This is when a "yes" record is correctly labelled "yes". Accuracy is the fraction of all records, both "yes" and "no", that are correctly labelled. Precision is the fraction of records labelled "yes" that really are "yes".
However, the answer is not "Precision"; the true positive rate is the fraction of actual "yes" records that get labelled "yes" (sensitivity/recall).
Possible Choices:
In which process are the data cleared from noise and the missing values are estimated/ignored?
This is part of data curation process. It is Data Preservation.
Incorrect choices are: data description, data publication, data security, all topics to be familiar with.
The data source which crosses all demographical borders and provides quantitative and qualitative perspectives on the characteristics of user interaction is the…
“Media includes videos, audios, and podcasts that provide quantitative and qualitative information on the characteristics of user interaction. Since media crosses all demographical borders, it is the quickest method for businesses to identify and extract patterns in data to enhance decision-making.”
Reader should be familiar with other sources as well!
The probability p(A|B) measures…
The probability that event $A$ occurs given that event $B$ has occurred.