title: Data Science
subtitle: DLMBDSA01
authors: Prof. Dr. Claudia Heß
publisher: IU International University of Applied Sciences
date: 2022

Unit 1: Introduction to Data Science

p. 12 - 32

We will explore a framework for creating value from data through the study of data science. You will learn…

Study Goals

Introduction

Data science - Wikipedia

Definition - Data: Facts, observations, assumptions or incidences that can be analysed to get meaningful information.

Definition - Information: Patterns and relationships among data elements are instances of information. It is data that has been processed, interpreted, organized, structured or presented to be meaningful or useful.

Definition - Data Science: A field concerned with the systematic practice of analysing data, exploring the information contained in the data, and creating useful predictions to advise and guide the decision making process.

1.1 - Overview of Data Science

Data in its raw form is typically not useful. Data Science is concerned with the arrangement, analysis, visualization, and interpretation of collected data for the purpose of extracting embedded knowledge and useful information and predicting scenarios should new or different data be introduced. The output of the analysis can then be used by decision-makers.

Definition - Data Mining: The Process of discovering patterns in large datasets.

Data Science is sort of a cross between many fields:

Per the book’s Venn diagram, it looks like Data Science is also a subset of Knowledge Discovery, and Data Visualization crosses over into every field as well, if only a little.

Definition - Business Intelligence: A set of strategies for identifying, extracting, analyzing, managing, and delivering important trends relevant to business metrics.

Another book, “Intelligent Techniques for Data Science” I believe, also covers Business Intelligence (BI) and Data Science, and how BI is quite different in that it requires fewer computing skills and much more domain knowledge. BI is meant to generate a descriptive analysis of data for use in reporting the past behaviour of a business, whereas Data Science uses predictive analysis for estimating a business’s future behaviour.

Data Science Terms

Along with the type I and type II errors, we have sensitivity and specificity:

\begin{align*} \text{sensitivity} &= 1 - \frac{\text{type II errors}}{\text{number of real positives}}\\ \text{specificity} &= 1 - \frac{\text{type I errors}}{\text{number of real negatives}} \end{align*}

The formulas may seem a little backwards, but the “1 minus” flips the error rate into a success rate: sensitivity is the fraction of real positives correctly found, and specificity is the fraction of real negatives correctly rejected.
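A quick sketch of these in R from raw counts (the counts and variable names are my own, not from the text):

tp <- 90; fn <- 10   # fn = type II errors; tp + fn = number of real positives
tn <- 80; fp <- 20   # fp = type I errors;  tn + fp = number of real negatives

sensitivity <- 1 - fn / (tp + fn)   # = tp / (tp + fn) = 0.9
specificity <- 1 - fp / (tn + fp)   # = tn / (tn + fp) = 0.8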

Regression prediction model output is measured with metrics like absolute error, mean square error, and relative error.

Data Science Applications

1.2 - Data Science Activities

There are 3 dimensions:

Each has its own set of challenges, solution methodologies and numerical techniques.

First Dimension: Data Flow

Starts with collection of data including a list of possible sources and attributes. Stored data must be transparent, complete, and accessible.

Second Dimension: Data Curation

Put simply, data curation is the process of refining collected data. There are different ways to do this:

Third Dimension: Data Analytics

Data analytics uncovers hidden patterns in data and transforms data into relevant and useful information. Techniques include modelling and simulation, machine learning, artificial intelligence, and statistical analysis.

1.3 - Sources of Data

Sources must be trustworthy to ensure collected data are robust and high quality. There’s a saying, “Garbage In, Garbage Out,” with accompanying acronym GIGO. Sources include:

Data Types

Think quantitative and qualitative data types.

Definition - Quantitative Data: measurable values. Subtypes include

Definition - Qualitative Data: information about quality of a good or service such as customer feedback.

Data Shapes

You have the following shapes

Definition - Structured Data: These data have high level of organization, such as information shaped in tabular forms of rows and columns. Rows may also be called “records”.

Definition - Unstructured Data: Data with unknown form or structure. Think text and multi-media content. It is data in raw form that requires some processing.

Definition - Semi-Structured Data: All data shapes between structured and unstructured. Data might not be tabular but do have organizational properties that make them easier to analyze.

The Five “V”s of Data

These are the main obstacles to handling any type of data and describe data overload.

Definition - Volume: The amount and scale of data. The volume of data captured is expected to double every two years. Experts believe that, with advances in computational power and decreasing storage costs, this will not be a problem.

Definition - Variety: Data come from many sources, are of many types, and have different levels of complexity. Much data generated today can be considered unstructured.

Definition - Velocity: Speed at which data are created, stored, analysed, and visualized.

Definition - Veracity/Validity: Veracity refers to data quality. Validity refers to the value of data in extracting useful information for making a decision. Data lacks veracity when it contains elevated levels of noise introduced during data collection or data processing. Data can also become outdated and invalid, even if noise-free.

Definition - Value: Also called usage value, refers to the application the data are used for and the frequency of their use.

1.4 - Descriptive Statistics

Statistics provides a summary of the data. Several statistical parameters are calculated to describe a variable clearly (a Random Variable?). These include minimum, maximum, mean, median, mode, variance, and standard deviation. Definitions are better left to standard texts in statistics.

Important to note that the mean, $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$, is more sensitive to extreme values than the median. However, it is mathematically more convenient to work with. That is, many other statistical formulas can be derived from or with the mean.

Probability Theory

Definition - Probability: The chance of an event happening.

Essentially, data science overlaps a great deal with statistics. And statistics is a branch of mathematics that is intertwined with Probability Theory.

I think core topics should be left to text in probability theory, but sure we can cover some here as well.

The probability of an event occurring must be between 0 and 1 inclusive. The probabilities of all possibilities must always sum to 1 as well. That is, some event must always occur, even if that event is “do nothing”. Take rolling a die (the singular of dice). If you roll a fair 6-sided die, the chance of any particular side appearing is $1/6$. Summing over all sides, given that you roll the die, gives $6 \times 1/6 = 1$.

Please ignore the possibility that it would amazingly land and balance on an edge. It is so tiny we can say $P(\text{edge}) \approx 0$.

If you roll the die, it cannot simultaneously land on two sides at once. That is, you cannot roll a 3 and a 4 with the same die in one roll. Events of this nature are called mutually exclusive and are represented as

P(M \text{ and } N) = P(M \cap N) = 0

So, the capital $P$ represents a generic probability function, and other capital letters typically represent random variables, being the input parameters. For now, think of a random variable as a whole set of values (or outcomes) in an event space, each with its own probability of occurring given that an event has occurred, like given we rolled the die.

In a Venn diagram, mutually exclusive events are non-overlapping fields. A direct result of mutual exclusivity is that combining events is as easy as summing their probabilities. That is, the probability of either a 3 or a 4 being rolled is $1/6 + 1/6 = 1/3$.

P(M \text{ or } N) = P(M \cup N) = P(M) + P(N)

For events that are not mutually exclusive, we then consider whether the events are independent. Independence means that one event occurring does not affect the probability of the other event occurring, neither increasing nor decreasing the odds. You can roll two dice and the outcome of one does not affect the outcome of the other. However, if you have a standard deck of cards and you draw one card, the probability of the second draw depends on the first because the deck goes from 52 cards to 51. Maybe there are better examples…

Anyway, independent events that are not mutually exclusive have their own maths:

\begin{gather*} P(A \text{ and } B) = P(A \cap B) = P(A) \cdot P(B)\\ P(A \text{ or } B) = P(A \cup B) = P(A) + P(B) - P(A) \cdot P(B) \end{gather*}

If you look at a Venn diagram, you can see that adding $P(A)$ and $P(B)$ double counts the overlap $P(A \cap B) = P(A) \cdot P(B)$, which is why we subtract that part once.
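A quick R simulation to sanity-check both formulas; the events A (first die shows a 6) and B (second die shows a 6) are my own choice of example:

set.seed(42)
n  <- 100000
d1 <- sample(1:6, n, replace = TRUE)   # first die
d2 <- sample(1:6, n, replace = TRUE)   # second, independent die
A  <- d1 == 6
B  <- d2 == 6
mean(A & B)   # ~ 1/36, matching P(A) * P(B)
mean(A | B)   # ~ 11/36, matching 1/6 + 1/6 - 1/36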

Conditional Probability

If two events are correlated, so perhaps neither mutually exclusive nor independent, we run into conditional probability. The probability that $A$ occurs given that event $B$ has occurred is

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

You can see that if the events are independent, then $P(A \cap B) = P(A)P(B)$ and the $P(B)$ factors cancel, leaving $P(A \mid B) = P(A)$.

In Data Science, all predictions from developed models are probabilities or a probability density distribution (for regression models). A probability density function is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would be equal to that sample.

Probability Density Function

The PDF as a function takes a possible value as input and returns its probability density. Density is a continuous concept, so the probability of any one single point is essentially 0; probabilities come from integrating the density over a range.

Wiki - Probability Density Function

For discrete distributions, we should refer to them as probability mass functions. It appears the text confuses the two. But no problem, as when you integrate density, you obtain mass.

The text shows how to calculate a probability as the number of desired events over the total number of possible events. Using the sum of two dice, you can create a more interesting probability mass function whose graph actually looks kind of normal.
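Here is a minimal R sketch of that dice-sum mass function (the plotting choice is mine):

sums <- outer(1:6, 1:6, `+`)         # all 36 equally likely outcomes
pmf  <- table(sums) / length(sums)   # probability of each sum from 2 to 12
pmf                                  # peaks at 7 with probability 6/36
barplot(pmf, xlab = "sum of two dice", ylab = "probability")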

Probability Distributions

Every variable of a dataset follows a particular frequency distribution that reflects how often each value of this specific variable occurs. There are some general, classical distributions that appear regularly across many datasets. A perk of matching a known distribution is that it can usually be described mathematically with a closed-form expression of a few parameters.

We will only cover a few distributions. Since this is not a course in probability and statistics, I would refer readers to those notes (when I write them).

Normal Distribution

The bell curve, appearing in nature at every turn. The normal distribution has $68\%$ of values within one standard deviation of the mean ($\mu \pm 1\sigma$), $95\%$ within two standard deviations ($\mu \pm 2\sigma$), and $99.7\%$ within three ($\mu \pm 3\sigma$).
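Those coverage figures are easy to verify in R with pnorm (standard normal assumed):

pnorm(1) - pnorm(-1)   # ~ 0.683, within one standard deviation
pnorm(2) - pnorm(-2)   # ~ 0.954, within two
pnorm(3) - pnorm(-3)   # ~ 0.997, within three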

Binomial Distribution

Models the number of successes of an event over repeated independent trials, like coin tosses.

Wiki - Binomial Distribution


PMF is

\begin{gather*} f(k,n,p) = \Pr(k;n,p) = \Pr(X=k) = \binom{n}{k} p^k(1-p)^{n-k}\\ \text{where } \binom{n}{k} = {}_{n}C_k = \frac{n!}{k!(n-k)!} \end{gather*}

The “n-choose-k” is the binomial coefficient. The distribution has some other convenient properties, such as

\begin{gather*} E[X] = np\\ \operatorname{Var}(X) = npq, \quad q = 1-p \end{gather*}

Wiki also has a bit about moments, possibly the moment generating function.
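A short R check of the PMF and those mean/variance identities (the parameters n and p are arbitrary choices of mine):

n <- 10; p <- 0.3
dbinom(3, size = n, prob = p)          # Pr(X = 3) from the PMF
choose(n, 3) * p^3 * (1 - p)^(n - 3)   # same value from the formula above
n * p              # E[X]   = np  = 3
n * p * (1 - p)    # Var(X) = npq = 2.1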

Poisson Distribution

A discrete probability distribution that expresses the probability of a given number of events occurring in a time interval. The events need properties like a constant mean rate and independence from the time since the last event.

Wiki - Poisson Distribution

The PMF is

f(k;\lambda) = \Pr(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}

It has a fun property of

\lambda = E[X] = \operatorname{Var}(X)

An example might be how many calls a call centre receives in a day. There’s also cosmic rays, radioactive decay and sales records!
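A small sketch in R, assuming a call-centre rate of lambda = 4 calls per day (my own made-up rate):

lambda <- 4
dpois(0:3, lambda)           # Pr(X = 0), ..., Pr(X = 3) from the PMF
x <- rpois(100000, lambda)   # simulate many days of call counts
mean(x); var(x)              # both close to lambda, as claimed above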

Bayesian Statistics

Back to Bayes Theorem

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

Naming convention?

An example is a drug test with positive and negative results. Even if the test is 99% accurate, if the pool of drug users is relatively small, like 1%, then the chance that a positive result actually means the person is on drugs is… the probability the person is on drugs given they tested positive.

P(U \mid +) = \frac{P(+ \mid U)\,P(U)}{P(+)} = \frac{0.99 \times 0.01}{(0.99 \times 0.01)+(0.01 \times 0.99)} = 0.5

If the denominator is confusing, it is the total probability of a positive result: a 99% likely positive from the users (1% of the pool) plus the unlikely false positive (1%) from the non-users (99% of the pool).
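The same arithmetic spelled out in R, so each piece of Bayes’ theorem is explicit:

p_user     <- 0.01   # prior: 1% of the pool are users
p_pos_user <- 0.99   # P(+|U), the test's sensitivity
p_pos_non  <- 0.01   # P(+|not U), the false positive rate

p_pos <- p_pos_user * p_user + p_pos_non * (1 - p_user)   # P(+), the denominator
p_pos_user * p_user / p_pos                               # P(U|+) = 0.5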

But isn’t that just amazing? If we have a known pool of users, the test detects them with 99% accuracy. But relying on the test alone to determine who is a user only gives us 50%. That means if a person tests positive, it’s a coin flip whether or not they are actually a user.

This comes from the disproportionate sample sizes. Even with 1% error, the very large set of non-users ramps up the false positives, decreasing predictive power.

This is why domain knowledge is very important. If we know the person does not have a history of drug abuse, maybe we would set $P(U) = 0.01\%$ for their case. Or if we know the person actively does drugs, we might set $P(U) = 50\%$. In fact, just setting a 50% prior already pushes the posterior up toward $99\%$.

It’s an example to show how the prior probability $P(U)$ is updated into the posterior probability $P(U \mid +)$. This is what happens when designing a classifier to predict the occurrence of an output for a new training set, and it is the main idea behind the Naïve Bayes classifier for categorical data of independent random variables.


title: Business Analytics Using R
subtitle: A Practical Approach
authors:
	- Dr. Umesh R. Hodeghatta
	- Umesha Nayak
publisher: Apress, New York
year: 2017
doi: 10.1007/978-1-4842-2514-1_5
iu_an: ihb.44870

Just a side note, these will not be the most intensive notes taken.

Ch. 5 - Business Analytics Process and Data Exploration

This chapter covers data exploration, validation, and cleaning required for data analysis. Good to know why we clean and prepare data, and some useful methods and techniques.

5.1 - Business Analytics Life Cycle

The purpose of business analytics is to derive information from data to help make good business decisions. There are about 8 phases to the business analytics project life cycle.

There’s also the Cross-Industry Standard Process for Data Mining (CRISP-DM), which has the following 6 phases:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

5.2 - Understanding the Business Problem

Know exactly what the client really wants you to solve and document it. Determine availability of data, data format, quality, amount, and the data stream for final model deployment.

Document business objective, data source, risks, limitation. Define timeline, required infrastructure to support the model, and expertise required to support project.

5.3 - Collecting and Integrating the Data

Quality data is the most important factor determining the accuracy of results. Data can be:

A data warehouse is an integrated database created from multiple databases within the organization. Warehouse technology typically cleans, normalizes and pre-processes data before it is stored. Warehouse technology also supports Online Analytical Processing (OLAP).

NoSQL databases were developed to overcome limitations of relational databases and to meet the challenges of the enormous amounts of data being generated on the Web. They store structured and unstructured data.

5.3.1 - Sampling

Unless you have big data infrastructure, only a sample of the population is used to build analytical models. A sample is a smaller collection of units from a population used to determine truths about that population. It should represent the population. Techniques depend on the type of business problem.

Probability sampling ensures all members of the population have an equal chance of being selected for the sample. There are different variations:

Calculating sample sizes is left to statistics or research methodology books.

5.3.2 - Variable Selection

For finding a good relationship between the outcome $Y$ and the predictor variables $X$, you need enough data. How much?

6 \times m \times p

Where $m$ is the number of outcome classes and $p$ is the number of variables. The more records, the better the results.

5.4 - Preprocessing the Data

Data in a database may be susceptible to noise and inconsistencies. Various sources and collection methods and multiple people handling data over time can cause this. You must understand data in terms of data types, variables and characteristics of variables, and data tables.

5.4.1 - Data Types

Data can be…

5.4.2 - Data Preparation

Now that you know the data’s types, study the data! Check values, find missing values, correct unknown characters, etc. This can all impact your model’s accuracy. How do you handle missing values?

It is important to predict the values based on the most probable value.

What about handling duplicate data, junk and null values? You should clean these from the database before analytics process.

5.4.3 - Data Preprocessing with R

There are multiple solutions to perform the same tasks, so here’s one:

Variables in R are represented as vectors. Here are basic data types in R:

Basic operations in R are performed as vector operations (element-wise). The book then covers an example in R. They note that converting factors to numeric values may not give the actual values. So, first they convert the value to a character and then to a numeric.
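A minimal illustration of that factor pitfall (the toy vector is mine, not the book’s):

f <- factor(c("10", "20", "20", "30"))
as.numeric(f)                 # 1 2 2 3 -- the internal level codes, not the values
as.numeric(as.character(f))   # 10 20 20 30 -- convert to character first, then numeric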

5.5 - Exploring and Visualizing the Data

This is the step of exploring and understanding what kind of data you have. You also need to understand the distribution of individual variables and the relationships between variables. Expect to use graphs and tables.

Goals of exploratory analysis:

Provides example of showing a table in R. Then some graphs.

Univariate analysis analyzes one variable at a time. A histogram represents the frequency distribution of the data, typically as a bar chart. I personally do not like box plots, whisker plots, or box-and-whisker plots (all the same thing). They are a bit of work just to show a mere description of the data, like quartiles. The book shows the max and min values, and then the outliers beyond those, which is a nice touch.

There’s also the notched plots, similar to the box plot.

Bivariate data analysis is used to compare the relationship between two variables. The main goal is to explain the correlation of two variables; then, you can suspect a causal relationship. If more than one measurement is made on each observation, multivariate analysis is applied.

A Scatter Plot is most common data visualization for bivariate analysis and needs little to no introduction. Too little data makes it hard to draw an inference. Too much data can create too much noise to draw an inference.

There’s a scatter plot matrix, which takes pairs of variables and creates graphs in a table.

hou <- read.table("housing.data", header = TRUE, sep = "\t")  # load the housing data
str(hou)    # show the table's structure
pairs(hou)  # create a scatter plot matrix

So, each variable is represented on both the x- and y-axis and the pairwise plots are shown in the table. The diagonal would be plots of a variable against itself and is ignored (unless you changed the order of variables on the x-axis to differ from the y-axis).

Trellis Graphics is a framework for data visualization that lets you examine multiple variable relationships. The trellis plot tries to solve the overplotting issue of scatter plots with different depths and intervals.

A correlation graph is like a matrix of correlation values between pairs of variables. The numbers can be transformed into a plot with colours and shapes, also referred to as a heat map. Note that sometimes variables are not related linearly; the relationship could be polynomial, exponential, logarithmic, inverse…

library(corrplot)                   # provides corrplot()
corel <- cor(stk1[, 2:9])           # correlation matrix of columns 2 through 9
corel                               # print the correlation matrix
corrplot(corel, method = "circle")  # draw the correlation plot

You can also plot density functions to illustrate the separation by class.

5.5.5 - Data Transformation

Your data might not always be the best. It could be skewed, not normally distributed, or on different measurement scales. This is when you must rely on data transformation techniques. Techniques include normalization, data aggregation, and smoothing. Then the inverse transformation should be applied.

We will go into detail about normalization. Techniques such as regression assume data are normally distributed and that variables should be treated equally. However, different measuring units can lead to some variables having more influence than others. All predictor variable data should be normalized to one single scale.
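Two common normalizations sketched in R on a toy vector of my own; scale() gives the same z-scores as the manual version:

x  <- c(10, 20, 30, 40, 50)              # a predictor on its own scale
z  <- (x - mean(x)) / sd(x)              # z-score standardization
mm <- (x - min(x)) / (max(x) - min(x))   # min-max scaling to [0, 1]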

5.6 - Using Modeling Techniques and Algorithms

Ready to perform further analysis. Analytics is about explaining the past and predicting the future.

5.6.1 - Descriptive Analytics

Descriptive Analytics explains the patterns hidden in data. You can group observations into the same clusters and call it cluster analysis. There is also affinity analysis, or association rules, that can uncover patterns in a transactional database.

5.6.2 - Predictive Analytics

There’s either classification or regression analysis.

Then there is Logistic Regression, which is also described in this Wikipedia article and is like continuous classification. The classification isn’t a hard true or false, but more like a probability, e.g. 80% sure.

The typical logistic function is like

p(x) = \frac{1}{1+e^{-(x-\mu)/s}}

Where $\mu$ is the location parameter, with $p(\mu) = 1/2$, and $s$ is a scale parameter.
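The logistic function is easy to plot in R; here mu = 0 and s = 1 are my own illustrative choices:

logistic <- function(x, mu = 0, s = 1) 1 / (1 + exp(-(x - mu) / s))
curve(logistic(x), from = -6, to = 6, ylab = "p(x)")   # the S-shaped curve
logistic(0)   # 0.5 at the location parameter, as stated above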

5.6.3 - Machine Learning

This is about making computers learn and perform tasks better based on historical data. Instead of humans writing the task-specific code, we give the computer instructions on how to build mathematical models. The computer builds the models and can then perform tasks without human intervention.

Machine learning has two main flavours: supervised and unsupervised. There’s also reinforcement learning, not discussed here apparently.

5.7 - Evaluating the Model

You need to evaluate your model to understand whether it is good at making predictions on new data. To avoid the bias of assessing the model with the same data used to develop it, the data must be partitioned into:

Let’s discuss the different sets:

The usual analogy is studying for a test. The training set is the course material, the validation set might be practice exams, and the testing set is the actual maths test.

5.7.1 - Cross-Validation

To avoid bias, the data set is partitioned randomly. If you have a limited amount of data, you can still achieve an unbiased performance estimate with the k-fold cross-validation technique. This divides the data into k folds and builds the model using $k-1$ of them. The last fold is used for testing. You repeat the process $k$ times, and each time the test fold is different.
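A base-R sketch of the fold bookkeeping (not the book’s code), assuming a data frame dat with a numeric response column y and using lm() as a placeholder model:

k <- 5
set.seed(1)
folds  <- sample(rep(1:k, length.out = nrow(dat)))   # assign each row to a fold
errors <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]         # k-1 folds for building the model
  test  <- dat[folds == i, ]         # the held-out fold for testing
  fit   <- lm(y ~ ., data = train)   # placeholder model; swap in your own
  pred  <- predict(fit, newdata = test)
  errors[i] <- mean((test$y - pred)^2)
}
mean(errors)   # cross-validated estimate of the prediction error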

5.7.2 - Classification Model Evaluation

To assess the performance of a classifier, we look at the number of mistakes. This is the misclassification error: the observation belongs to a class other than the one predicted. A confusion matrix gives an estimate of the true classification and misclassification rates.
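In R, a confusion matrix is just a two-way table of predicted versus actual labels; the vectors below are invented for illustration:

actual    <- factor(c("yes", "yes", "no", "no", "yes", "no"))
predicted <- factor(c("yes", "no",  "no", "yes", "yes", "no"))
cm <- table(Predicted = predicted, Actual = actual)
cm                        # counts of correct and misclassified records
sum(diag(cm)) / sum(cm)   # overall accuracy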

A Lift Chart is used for marketing problems by helping to determine how effective an advertising campaign is. The lift is a measure of the effectiveness of the model, calculated as the ratio of results with and without the model. So, it sounds like a placebo-controlled trial.

A Receiver Operating Characteristic (ROC) chart is similar to a lift chart, with the true positive rate plotted on the y-axis and the false positive rate on the x-axis, i.e. sensitivity versus 1 − specificity. It is good for representing the performance of a classifier. The area under the curve (AUC) should be between 0.5 and 1.

5.7.3 - Regression Model Evaluation

There are many metrics for regression

\text{RMSE} = \sqrt{\frac{\sum_{k=1}^{n}(\hat{y}_k-y_k)^2}{n}}
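The same formula as a small R helper (y being the actual values and y_hat the predictions; the example numbers are made up):

rmse <- function(y, y_hat) sqrt(mean((y_hat - y)^2))
rmse(c(3, 5, 7), c(2.5, 5.5, 8))   # ~ 0.71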

5.8 - Presenting a Management Report and Review

Now, your model is presented to business leaders. This requires business communication. If changes are required, you would start over here. Usual topics of discussion are:

5.9 - Deploying the Model

Your model may perform predictive analytics or simple reporting. Note that the model can behave differently in the production environment because it might see completely different data. This may require revisiting the model for improvements.

Success depends on

What are typical issues observed?

There are many more, but that’s just a general outline.


R

Should probably have an R section, but this will do for now. Check out this article from DataQuest.io about the R language. It is a lexically scoped language, AKA static scoping. So, the structure of the code determines the scope of a variable, not the most recent assignment at run time.

x <- 5  # <- is R's assignment operator
y <- 2

multiple <- function(y) y*x  # x is resolved in the scope where the function was defined (global)

func <- function() {
	x <- 3        # local x, ignored by multiple()
	multiple(y)   # y here is the global y = 2
}
func() # returns 10, i.e. 2 * 5, not 2 * 3

Although we define an x in the func() function, the multiple() function refers to the x in the global scope. If this is confusing, that is OK. These functions are impure and I wouldn’t recommend this architecture for anything other than learning purposes.

If you declare a y in the func() function, that will have an effect, because we pass y in as an argument to multiple().

To download R, go to CRAN.R-Project.org and download it. It comes with its own basic GUI. You can also look at the RStudio IDE or use a Jupyter Notebook, which requires an R kernel. This may require Anaconda.


Video Lecture

Data Science resembles sciences in that:

Data Science is different from other sciences because it focuses on:

Knowledge pyramid:

Data is not super useful without context, which is why we add metadata. We interpret data to create information, which provides meaning. Then we can understand it and its relationships to create knowledge. That eventually leads to a theory, basically an explanation.

Data are like facts; they describe a quantity or quality. There are ways to categorise data, and three ways are:

Metadata is “beyond data”; it provides context. So, it tells you what the data are about. It also comes in three flavours:

A unit together with data gives things like measurements.

Now discussing discrete and continuous quantitative variables. There’s also quantitative variable scales:

Qualitative variables are always categorical and discrete. You can name categories with numbers, but the data are still qualitative. There are no scales; that wouldn’t really make sense, they say. We do have types of categories though:

Typical business scenarios try to convert unstructured data into structured data. Documents are not structured for computers, only for humans. We add metadata to make them at least semi-structured, like a webpage. We use HTML tags for the web browser to read and understand. The goal is to get enough metadata to create tables of structured data.

This leads us to Data Science Activities. There are three activity containers to remember:

Cleaning data can be a lot of things, checking the dataset for accuracy, completeness, and adherence to predefined rules or constraints:

Feature extraction is like taking the maximum temperature of a day. Transformation of data can be between different data types, like:

We can also transform data by changing visualizations:

We can also transform the data domain, which is mathematically more involved:

This leads to modelling! Models are abstract and simplified representations of complex systems or phenomena in a concrete scenario. In basic terms, they explain a causal relationship. Models can be mathematical, statistical, logical, computational, or visual representations. They include classes and relationships, and if we can quantify the relationships, we can explain, simulate, and predict!

We hop over to statistics now. Descriptive statistics covers the methods for summarizing data via:

Some useful terms perhaps… Descriptive statistics is used to describe features of a data set through measures of:

The overall goal is to identify the distribution of the population, which is the foundation for inferential statistics. The underlying distribution has some static parameters, estimated from the data used. Distribution functions are typically characterized by three parameters:

We then look at the Poisson Distribution (discrete). It models the number of events occurring within a given time interval. Below is the probability mass function and the cumulative distribution function for a Poisson distribution.

\begin{array}{cc} P(x) = \frac{1}{x!} \lambda^x e^{-\lambda} & \text{PMF}\\ F(x) = \displaystyle\sum_{i=0}^{x} \frac{1}{i!}\lambda^i e^{-\lambda} & \text{CDF} \end{array}

$\lambda$ happens to be both the mean and the variance for this distribution.

The Normal Distribution is a continuous distribution.

\begin{array}{cc} P(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)} & \text{PDF}\\ F(x) = \frac{1}{\sqrt{2\pi}}\displaystyle\int_{-\infty}^{x} e^{-t^2/2}\,dt & \text{CDF (standard normal)} \end{array}

The standard normal distribution is when the mean is 0 and standard deviation is 1. There is also an example of a probability tree to show what the values of the standard deviation represent which is very interesting.

We are looking at Normal Distribution again.

Then we look at Venn Diagrams to represent classes. Overlapping areas represent similarities of classes. Then, we express:

Now we can represent Joint and Disjoint events. Below is the probability that either A or B occurs, or both:

P(A \cup B) = P(A)+P(B)-P(A \cap B)

Looking at a Venn diagram, you can see that the intersection would be included twice if we just added the sets. Now, for the XOR, same thing but:

P(A \,\Delta\, B) = P(A)+P(B)-2P(A \cap B)

Apparently referred to as the symmetric difference.

And if events are disjoint (mutually exclusive), the intersection is the empty set, so you don’t need to worry about it at all.

This naturally leads to unconditional and conditional events.

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

We look at a probability tree which basically tracks the sequence of events of all possible events. It helps you see:

We cover an example of a tree with a box of balls, say 2 red and 3 blue. Going over the terminology is important.

Q1: Probability of drawing a red ball on first attempt?

That is $2/5$.

Q2: Probability to draw a red ball individually in any second attempt?

This means that if we previously drew a red, we have $1/4$, and if we drew a blue, the next red is drawn at $2/4$. Per the note below, we report these individual case probabilities rather than combining them.

Note: with the word “individually” we actually do not want the combined probability. If asked like this on an exam, give the individual cases and their associated probabilities, not the conditional ones either, apparently.

Q3: What is probability to draw two red balls in a row?

I know there is a distribution to solve this; I forgot what it is. He says it is $1/10$, as it is the intersection of “draw red” and “draw red”: $2/5 \times 1/4 = 1/10$. The question should say “within the first 2 attempts”.

Q4: What is probability to draw any red ball in second consecutive attempt?

It is the sum of two intersections: $2/5 \times 1/4 + 3/5 \times 2/4 = 1/10 + 3/10 = 2/5$.

Q5: What is the probability that the first ball drawn is red given the second drawn is blue?

This is true conditional probability.

\begin{align*} P(A \mid B) &= \frac{P(A \cap B)}{P(B)}\\ &= \frac{(2/5 \times 3/4)}{(2/5 \times 3/4)+(3/5 \times 2/4)}\\ &= \frac{3/10}{3/10+3/10} = 1/2 \end{align*}

We expand the denominator with the law of total probability over what was drawn first. I was nearly there but lost confidence. Don’t lose confidence!
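A quick simulation in R to back up Q5 (my own sketch, not from the lecture):

set.seed(7)
n <- 100000
first_red   <- logical(n)
second_blue <- logical(n)
for (i in 1:n) {
  draws <- sample(c("R", "R", "B", "B", "B"), 2)   # draw 2 of the 5 balls without replacement
  first_red[i]   <- draws[1] == "R"
  second_blue[i] <- draws[2] == "B"
}
mean(first_red[second_blue])   # ~ 0.5, matching the conditional probability above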

We now must look at Bayes Theorem, because it allows us to flip the conditional event, solving for new probabilities.

P(A \mid B) = \frac{P(B \mid A)\cdot P(A)}{P(B)}

Where:

A Derivation is provided but isn’t too tough. The concept is used throughout Bayesian statistics and creates new distributions based on prior knowledge.

Inferential statistics covers methods for generalizing from data, such as prediction, estimation, and hypothesis testing. It allows researchers to assess their ability to draw conclusions that extend beyond the immediate data:

Examples of inferential statistics include:

We now go into Linear Regression, which relies on the general model for ordinary least squares. I think the formulas are listed above or elsewhere.
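A minimal ordinary least squares fit in R on synthetic data (all names and numbers are mine):

set.seed(3)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)   # true intercept 2, slope 0.5, plus noise
fit <- lm(y ~ x)                        # ordinary least squares
coef(fit)                               # estimates should land near 2 and 0.5
summary(fit)$r.squared                  # proportion of variance explained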

When we use ANNs (artificial neural networks), we do something very similar to how the brain works in nature. We are shown a biological neuron diagram…

The presentation draws a parallel, which is nice. The inputs are similar to the dendrites, and the synaptic strengths are the weights we give to the inputs as they come in. In the nucleus, we sum all the individually weighted inputs to get a net input. The axon hillock is represented by a transfer function, an example being a sigmoid function giving an “S” shape. It provides an output if the axon threshold is met, or something like that.

Let’s dive a bit deeper into an artificial neural network. Input values are passed into the input layer of your ANN model. Then, the processing happens in the hidden layers. Finally, an output comes out of the output layer. All nodes in one layer are connected to all nodes in the following layer. Then, we train!

We start with feedforward, where information is passed in a forward fashion. During training, if the output is incorrect, we perform backpropagation to calculate the difference from the desired output and correct the weights accordingly.
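A toy feedforward step for a single neuron in R, mirroring that description; the inputs, weights, and bias are invented:

sigmoid <- function(z) 1 / (1 + exp(-z))   # the S-shaped transfer function

inputs  <- c(0.5, -1.2, 0.8)    # "dendrite" signals
weights <- c(0.4,  0.3, -0.6)   # synaptic strengths
bias    <- 0.1

net    <- sum(inputs * weights) + bias   # summation in the "nucleus"
output <- sigmoid(net)                   # axon-hillock transfer function
output                                   # ~ 0.37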

Supervised learning has inputs and labels. We tell the model what we expect exactly. Good for predictions and classification.

Unsupervised learning is more complicated and we only provide a model/network with inputs and ask it to create its own labels. It can be used for classification. This network can help us discover hidden patterns or relationships!

We also have reinforcement learning. We make a model and:

This helps discover optimal strategies.

Short Summary:

Data and metadata can be transformed into information which can be derived into knowledge.

We find some raw data and clean it to have regular data. Then, we extract useful features from the data and classify those features to find patterns. Then, those patterns are learned to create a relationship in the data. Once we have a relationship, we can create models for predicting.

If we do the leg-work up to the learning step, then we have supervised learning. If we don’t know how to classify the data, then we perform unsupervised learning. And if we are quite lazy or there is too much data and features to figure it all out, we let Deep Learning extract the features for us. Then, if we only feed a system raw data, we are probably going for reinforcement learning.

Good to have an overview like that.


Check Your Understanding

Which of the following is the blind machine learning task of inferring a binary function for unlabeled training data?

This is unsupervised learning.

Incorrect choices include regression, supervised learning and data processing. The last isn’t machine learning. The first two are supervised.

The true positive rate achieved by a developed machine learning model is defined as…

This is when a “yes” record is correctly labelled “yes”, out of all real “yes” records. Accuracy is the total of correctly labelled “yes” and “no” records over all records. Precision is the fraction of records labelled “yes” that are actually “yes”.

However, the answer is not “Precision”.

Possible Choices:

In which process are the data cleaned of noise and the missing values estimated or ignored?

This is part of the data curation process. It is Data Preservation.

Incorrect choices are: data description, data publication, data security, all topics to be familiar with.

The data source which crosses all demographical borders and provides quantitative and qualitative perspectives on the characteristics of user interaction is the…

“Media includes videos, audios, and podcasts that provide quantitative and qualitative information on the characteristics of user interaction. Since media crosses all demographical borders, it is the quickest method for businesses to identify and extract patterns in data to enhance decision-making.”

Reader should be familiar with other sources as well!

The probability p(A|B) measures…

The probability that event $A$ occurs given that event $B$ has occurred.