title: Data Science
subtitle: DLMBDSA01
authors: Prof. Dr. Claudia Heß
publisher: IU International University of Applied Sciences
date: 2022

Unit 4: Processing of Data

pp. 67 - 80

Learning objectives include:

Introduction

Data processing is the extraction of useful information from collected raw data. It is a similar concept to industrial production: raw inputs go in (data), labour transforms the inputs (data processing), and a useful product comes out (insights and information).

Benefits of data processing to organizations:

4.1 - Stages of Data Processing

p. 69

Data processing transforms data into information. The stages are data collection, data preparation, data input, data analysis, data interpretation, and data storage.

These stages should be completed in order, but can also be cyclical.

Data Collection

After raw data are collected from one or more sources, they are converted into a computer-friendly format to form a data lake.

Definition - Data Lake: A repository of data stored in both its natural and transformed formats.

Collecting data can be difficult when the data are noisy, redundant, or contradictory.

Data Preparation

This is the pre-processing stage where data are cleaned, organized, standardized, and checked for errors. This stage may require significant domain knowledge and is meant to handle missing values and eliminate redundant, duplicate, and incorrect records. See the previous unit on data pre-processing.
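As a rough illustration (not from the course materials), here is a minimal pandas sketch of this kind of cleaning; the column names and rules are hypothetical.

import pandas as pd
import numpy as np

# Hypothetical raw records with a duplicate, a missing value, and an invalid age
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "age": [34, 34, np.nan, 29, -5],
    "country": ["DE", "DE", "de", "AT", "CH"],
})

clean = (
    raw.drop_duplicates()                                    # eliminate duplicate records
       .assign(country=lambda d: d["country"].str.upper())   # standardize values
       .assign(age=lambda d: d["age"].where(d["age"] >= 0))  # mark incorrect ages as missing
)
clean["age"] = clean["age"].fillna(clean["age"].median())    # handle missing values
print(clean)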

Data Input

After the data have been prepared and cleaned, they are input into their destination and translated into a format for the consumers to understand. An example of the destination is a data warehouse.

Definition - Data Warehouse: A store of data gathered from data sources and used to guide decision-making in an organization.

To understand the data, one needs to have a grasp of their key characteristics such as

Data Analysis

Data analysis may be performed through multiple threads of simultaneously executed instructions using machine learning and AI algorithms. This is arguably the heart of the process, and the part that features in many YouTube videos.

This stage can involve converting data into a more suitable format, ensuring correctness, distilling detailed data into main points, and combining multiple groups of data records.
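As a hedged illustration (not from the course), the small pandas sketch below shows distilling detailed records into main points and combining two groups of records; the data and column names are made up.

import pandas as pd

# Hypothetical detailed records: individual sales transactions
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})
regions = pd.DataFrame({"region": ["North", "South"], "manager": ["Ada", "Bo"]})

# Distil detailed data into main points (aggregate) ...
summary = sales.groupby("region", as_index=False)["amount"].sum()
# ... and combine multiple groups of data records (join)
report = summary.merge(regions, on="region")
print(report)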

Data analysis has its own 5 steps:

The Difference Between Feature Extraction and Selection | GeeksForGeeks: extraction transforms the original features into a new set of features that are more informative and compact. The goal is to capture the essential information from the original features and represent it in a lower-dimensional feature space. Linear transformation methods such as principal component analysis (PCA) and linear discriminant analysis (LDA), and non-linear methods such as kernel PCA and autoencoders, are used to extract features. Feature selection’s goal is to reduce the dimensionality of the feature space, simplify the model, and improve its generalization performance. It has three methods: filter, wrapper, and embedded methods.
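As a rough sketch (not from the course materials), the scikit-learn snippet below contrasts extraction (PCA builds new combined features) with selection (SelectKBest keeps a subset of the original ones); the dataset and parameters are just placeholders.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)   # 4 original features

# Feature extraction: create 2 new features as combinations of the original 4
X_pca = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the 2 original features that score best against the target
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)

print(X_pca.shape, X_sel.shape)     # both (150, 2), but the columns mean different things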

Data Interpretation

Outcomes of machine learning predictions must be translated into actions. To do this, those outcomes must be interpreted to obtain beneficial information for guiding decisions.

Data Storage

Finally, we store the data, instructions, developed numerical models, and information for future use. They should be stored so they can be accessed easily and quickly when needed.

4.2 - Methods and Types of Data Processing

Manual Data Processing

Manual data processing may seem primitive, but it was common when technology was young and expensive. This is the paper-and-pencil way to perform calculations and data transformations. It is very prone to human error and takes a lot of time.

Mechanical Data Processing

Data are processed with mechanical devices rather than computers: anything from printers and calculators to typewriters. This is still prone to errors.

Electronic Data Processing

Electronic data processing is data processed automatically with computers and software. It is both fast and accurate.

Types of electronic data processing:

4.3 - Output Formats of Processed Data

p. 74

Processed data should be presented in a format that meets the following criteria:

Definition - ASCII Text: American Standard Code for Information Interchange. ASCII code represents text for electronic communication in computers.
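As a tiny illustration (not from the course materials), Python exposes these character codes directly:

# ASCII maps each character to a numeric code; ord()/chr() convert between the two
print(ord("A"), chr(65))        # 65 'A'
print("data".encode("ascii"))   # b'data' - the underlying byte values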

Processed data can be obtained in different forms:

We will look at several common software-specific data formats


Video Lectures

Data processing cycle - very important:

The professor does not agree 100% with this cycle but it is what we must learn for our exam.

What is a data warehouse? It’s like a real warehouse, but for data: a central repository that stores and organizes large volumes of structured data from various sources. Data warehouses employ a process known as Extract, Transform, Load (ETL) to collect data from source systems, transform it into a consistent format, and load it into the warehouse.
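As a hedged sketch of ETL (not from the lecture), the snippet below extracts from an inline stand-in for a source-system export, applies a couple of consistency transforms, and loads into a local SQLite database standing in for the warehouse.

import sqlite3
from io import StringIO
import pandas as pd

# Extract: read raw records (inline CSV standing in for a source system export)
raw_csv = StringIO("order_id,amount_eur,order_date\n1,19.999,2022-01-03\n2,5.5,2022-01-04\n")
orders = pd.read_csv(raw_csv)

# Transform: enforce a consistent format across sources
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["amount_eur"] = orders["amount_eur"].round(2)

# Load: write into the warehouse (here a local SQLite database for illustration)
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="append", index=False)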

There’s a diagram; it might be worth looking at diagrams to visualize where the warehouse sits in the process. You start with the collection of data from various sources, from operational systems to applications. The load manager collects the data and puts it into a staging area, essentially a large database. Then the warehouse manager software performs the ETL work, loading the data into another database, which at this point is the data warehouse itself. From there, the data can be used in data marts for reporting, data cubes for OLAP analysis, or for data mining.

Data mining is automated AI used to look for patterns in data and such.

A look at OLAP cubes: they are literally cubes whose dimensions are categories of data, such as which sector the data came from, the region, the product line, etc. OLAP is online analytical processing; OLTP is online transaction processing and handles real transactions. OLAP is more for analysis, for example pushing related products and what the system thinks you would like to purchase.
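As a rough stand-in for a cube slice (not from the lecture), a pandas pivot table aggregates a hypothetical fact table along two of its dimensions:

import pandas as pd

# Hypothetical fact table with three cube dimensions: region, product, quarter
facts = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US"],
    "product": ["A",  "B",  "A",  "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "revenue": [100,  150,  200,  120],
})

# A pivot table is a 2-D slice of the cube: revenue by region and product
cube_slice = facts.pivot_table(index="region", columns="product",
                               values="revenue", aggfunc="sum", fill_value=0)
print(cube_slice)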

Exam questions may ask about processing. Take care you are not answering the wrong thing, which can happen. If asked what the processing techniques are, some answer serial transmission and parallel transmission. That is not the data processing the exam is looking for; those are data transmission methods, not processing techniques. With raw data, processing techniques include

The term “online” is confusing because technically the real-time, distributed, and time-sharing types all happen online as well. So, you need to think beyond the internet. Sometimes data are processed online, distributed, and in real time all at once!

The exam may ask you to explain what a type is and give an example!

Another part is how data are stored: the output formats of processed data. The comma-separated values (CSV) format is common. We also know Excel (.xlsx) is quite popular. The second “x” stands for “XML”, apparently.
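As a minimal illustration (file name and data made up), Python’s standard csv module can write and read such a file:

import csv

rows = [["name", "score"], ["Alice", 90], ["Bob", 85]]

# Write and then read back a small comma-separated values file
with open("scores.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open("scores.csv", newline="") as f:
    print(list(csv.reader(f)))   # [['name', 'score'], ['Alice', '90'], ['Bob', '85']]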

Office Open XML documents exist in 3 markup languages: WordprocessingML (Word), SpreadsheetML (Excel), and PresentationML (PowerPoint).

Don’t get confused by “OpenOffice”. The professor goes through how, if you have an unzip program, you can unzip an Excel file and see its file structure: a bunch of XML files. You can do that for any Office Open XML file, even Word documents and such.
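A quick way to see this (a sketch; the workbook name is hypothetical) is to list the archive members with Python’s zipfile module:

import zipfile

# An .xlsx workbook is just a ZIP archive of XML files; list its contents
with zipfile.ZipFile("workbook.xlsx") as z:   # hypothetical file name
    for name in z.namelist():
        print(name)   # e.g. [Content_Types].xml, xl/workbook.xml, xl/worksheets/sheet1.xml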

Other usable schemes include:

You don’t gain much by zipping a Word document because, basically, it’s already been compressed.

Now, to dive into XML. A real tutorial can be found here, XML Introduction | mozilla.org:

<?xml version="1.0" encoding="UTF-8" ?>
<!-- Above is processing instruction -->
<!-- Below is a Start tag -->
<catalog>
	<book id="bk001">
		<title>XML Developer's Guide</title>
	</book>
</catalog>
<!-- above is an end tag -->

There’s also a whole website, XML.com. Basically information is stored in tags, and metadata is stored as attributes in the tags. The tags themselves are also metadata and very important for structure.
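As a small sketch (not from the lecture), Python’s built-in xml.etree.ElementTree can pull both the tag content and the attribute metadata out of the catalog example above:

import xml.etree.ElementTree as ET

xml_text = """<catalog>
  <book id="bk001">
    <title>XML Developer's Guide</title>
  </book>
</catalog>"""

root = ET.fromstring(xml_text)          # the <catalog> root element
for book in root.findall("book"):
    print(book.get("id"),               # metadata stored as an attribute
          book.find("title").text)      # information stored inside a tag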

There’s also the JSON format, and like XML it is handy for data transmission. However, XML requires an XML parser and an XML DOM, while JSON can be parsed with plain JavaScript, is shorter, and can use arrays. Both are self-describing, human-readable, hierarchical, and can be fetched with XMLHttpRequest.
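For comparison (a sketch, using Python rather than JavaScript to stay consistent with the other snippets), roughly the same catalog expressed as JSON and parsed with the standard json module:

import json

# Roughly the same catalog as the XML example, expressed as JSON (arrays come for free)
json_text = '{"catalog": [{"id": "bk001", "title": "XML Developer\'s Guide"}]}'

data = json.loads(json_text)            # no separate parser/DOM needed
print(data["catalog"][0]["title"])      # XML Developer's Guide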

XML allows for more complex structures apparently.

Protocol Buffers - Serialisation, like XML or JSON, with interface specification.

Another interesting approach is Apache Parquet. This is a column-oriented storage format rather than the conventional row-oriented database format.

Advantages:

Disadvantages:

Sounds like fast reads with high memory use and low write performance. Hard drives are cheap and reliable for database storage, so they are still the go-to, but it takes time to read from them. Hence, storing data in memory makes reads faster, but you need a lot of memory to hold that much data.
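A minimal Parquet round trip might look like this (a sketch; it assumes the pyarrow or fastparquet package is installed, and the data are made up):

import pandas as pd

df = pd.DataFrame({"sensor": ["a", "b", "a"], "value": [1.0, 2.5, 3.1]})

# Column-oriented storage: values of one column are stored together on disk
df.to_parquet("readings.parquet")                   # requires pyarrow or fastparquet
only_values = pd.read_parquet("readings.parquet", columns=["value"])   # read a single column
print(only_values)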

Finally, we are looking into SQL. It is suggested to check out MySQL Workbench, a free, open-source tool to practice using SQL if you would like. You SELECT data FROM a table, under certain conditions. That is the general idea, but not everything you need. There’s also JOIN, using one table to pull information from another based on relationships.

You can create and save databases with SQL, but using it as a transfer method probably isn’t the most popular method.
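A self-contained sketch using Python’s built-in sqlite3 module (table and column names are made up) shows a SELECT with a WHERE condition and a JOIN across two related tables:

import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders    VALUES (1, 1, 99.0), (2, 1, 15.5), (3, 2, 42.0);
""")

# SELECT ... FROM ... WHERE, plus a JOIN over the relationship customer_id -> customers.id
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    WHERE o.amount > 10
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)   # [('Alice', 114.5), ('Bob', 42.0)]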


Knowledge Check

Q1: What is defined as facts, observations, assumptions, and incidences?

Data is defined as that.

Not to be confused with features, AKA variables, which are aspects of data, like name or date.

Q2: What is defined as the patterns and relationships among data elements?

Information is defined as such.

So, it sounds like information is made up of data, which is described by features.

Q3: What are the stages of data processing, and which stage is pre-processing performed?

The stages of data processing are data collection, data preparation, data input, data analysis, data interpretation, and data storage (see section 4.1).

Steps are kind of self-describing, but pre-processing is performed in data preparation.

Q4: Name some common data formats and name which format is the following

<img fig="Alice.jpg" tag="Alice" />

That is XML, AKA Extensible Markup Language, which has a format similar to HTML.

Other formats include XLS (spreadsheets), CSV (comma separated values), and JSON (JavaScript Object Notation).

SQL is not a format but a query language for databases. However, I have seen files written in SQL for populating databases.

Q5: What are (generally) the five stages of Data Analysis? When do we handle data with missing values?

This is a tricky question.

The five stages of Data Analysis are:

Missing values are handled during data pre-processing, which was the topic of the last chapter, and occurs in the Data Preparation stage, well before we hit the Data Analysis stage.