title: Data Quality
subtitle: Dimensions, Measurement, Strategy, Management, and Governance
authors:
- Rupa Mahanti
ISBN: 978-0-87389-977-2
ISBN: 9781951058678
publisher: ASQ Quality Press
date: 2019
url: https://www.ebooks.com/en-ie/book/210693540/data-quality/rupa-mahanti/
price: €137.88
Required reading is chapters 1 and 3. Further reading is the entire book.
I have maybe less than a week to get through a lot of material, so this is going to be very light note-taking.
The text discusses the significance of data in the 21st century, likening its importance to what steam power was in the 18th century, electricity in the 19th century, and hydrocarbons in the 20th century. It highlights the omnipresence of data due to advances in information technology and the Internet of Things, which has transformed how businesses operate, research is conducted, and communication occurs.
The definition of data is presented as “facts and statistics collected together for reference or analysis.” From an IT perspective, data are abstract representations of real-world entities and concepts, expressed through definable conventions related to their meaning, format, collection, and storage.
The text traces the evolution of data handling, from manual storage in physical files to electronic processing and storage on a massive scale. It emphasizes that everything in modern life is associated with data sources and is captured and stored electronically.
Furthermore, the text discusses the role of databases and how technological advancements have made accessing information easier. It mentions the various channels through which data is collected and stored, including call centres, web forms, and databases of different types.
The chapter’s content is structured to start with the importance of data and data quality, followed by an exploration of data quality dimensions, causes of poor data quality, and the associated costs. It also touches upon the “butterfly effect of data quality,” illustrating how minor data issues can escalate within an organization. Finally, the chapter concludes by summarizing the key points discussed.
In essence, the text emphasizes the transformative role of data in contemporary society and lays the groundwork for discussing data quality and its implications.
Data is no longer a by-product of IT systems and applications; it can be an organization’s most valuable asset. Good-quality data are essential to create the best models and produce the best results and services. Data can also be reused, so quality goes a long way.
For medical fields, the best data can lead to new discoveries to improve health.
Data is very important, as noted above. Is the quality of data important? Yes, of course. But what is “data quality”? Apparently we first need to understand the categorization of data, and then the business impact and cost of bad and poor-quality data.
An enterprise can have about five types of data: master data, reference data, transactional data, historical data, and metadata.
Master data: this is non-transactional data used to define the primary business entities and is used by multiple business processes. It’s typically referenced often and doesn’t change much. Errors in master data can have substantial cost implications; if a customer’s address is wrong, correspondence is sent to the wrong address.
Usually described with nouns like “customer”.
Reference data: these are sets of permissible values referenced and shared by a number of systems. They are meant to be referenced by other data records, like the master data, and they provide standard terminology and structure across different systems. Think of them like abbreviations or status codes.
Organizations (should) use internal reference data to describe or standardize their own internal business data, like status codes for account status or something.
Reference data also does not change much, but it can be amended from time to time.
Transactional data: these describe business events, a record of a “fact” that transpired at a point in time. They depend on the master data and are usually described by verbs. The customer and the account are master data; the customer opening an account is the action, generating transactional data.
When you think about it, actions always have a time component: opening, purchasing, ordering, etc. So a transaction is usually some master data, maybe some reference data, and a date/time something occurred.
At this point, you can see that corruption in the master and/or reference data will corrupt the transactional data as well. Transactional data can also be extremely volatile, since it’s constantly being updated.
Historical data: this is the transactional data once complete. It shouldn’t be altered except to correct an error. If a customer’s surname changes after marriage, then the old surname (a piece of master data) becomes historical data. Interesting!
Metadata: this is data that defines other data. It describes characteristics that make interpreting data easier. There are three main types of metadata.
The quality of metadata can have a positive or negative impact on the other data categories: master, transactional, reference, and historical data.
Data Quality is the capability of data to satisfy the stated business, system, and technical requirements of an enterprise.
Breaking it down, there are two aspects of data quality: intrinsic and contextual.
So, for example, whale sighting data may have very high intrinsic data quality but very low contextual data quality for the purpose of selling iPhones.
Perfection usually isn’t practical: there is typically enough data to draw meaningful insights, and the additional work to improve data quality may be more costly, and nearly unnecessary, compared with working with the data as it is.
You can establish a degree of conformance to represent a data quality threshold: a minimum confidence level and a maximum tolerance level for errors.
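Here's a quick sketch of what a conformance threshold might look like in practice. Not from the book; the rules, field names, and the 95% threshold are my own illustrative assumptions.

```python
# Sketch: measure conformance of records against simple rules and compare
# the result to an assumed minimum confidence threshold.

records = [
    {"customer_id": 101, "age": 34, "email": "a@example.com"},
    {"customer_id": 102, "age": -1, "email": "b@example.com"},   # bad age
    {"customer_id": 103, "age": 57, "email": ""},                # missing email
]

def conforms(record):
    """A record conforms if age is plausible and email is present (illustrative rules)."""
    return 0 <= record["age"] <= 130 and bool(record["email"])

conformance = sum(conforms(r) for r in records) / len(records)
THRESHOLD = 0.95  # assumed minimum confidence level

print(f"Conformance: {conformance:.0%}")
print("Within tolerance" if conformance >= THRESHOLD else "Below threshold, investigate")
```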
Six Sigma DMAIC (define, measure, analyze, improve, and control) methodology can be utilized to measure and improve data quality. See Ch. 6 for more details on DMAIC.
Quality of information depends on business processes, databases, software design, the data, and software production processes involved in producing the information. It’s similar to physical products except data and information can be reused.
This is good and bad. If you have a physical product of bad quality, maybe the next one will be ok. However, if you have bad data, it will continue to be bad when reused.
Data characteristics are also intangible. They don’t have weight or colour; they are more like data type, format, and so on. If data is stored in a system poorly, it may be hard to use and maintain. Date data is interesting: what is 01/02/08? Users must understand the format used in the system they are accessing to correctly interpret the data.
I had an issue like this before where I mistakenly took the dates to be European style when they were American. After processing, I realized there were many errors because of this. The error was hard to spot because there was so much data that the issue (e.g., seeing a month of 13) didn’t show up until much later.
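A tiny sketch of the ambiguity (my own example, not the book's): the same string parses to two different dates depending on the assumed format, and the mistake only surfaces when an "impossible" value appears.

```python
# Sketch: the same string parsed under two different format assumptions.
from datetime import datetime

raw = "01/02/08"

european = datetime.strptime(raw, "%d/%m/%y")   # 1 February 2008
american = datetime.strptime(raw, "%m/%d/%y")   # 2 January 2008

print(european.date(), american.date())  # 2008-02-01 vs 2008-01-02

# A value like "13/02/08" only fails under the American assumption, which is
# why mixed-format errors can hide until an "impossible" month finally appears.
try:
    datetime.strptime("13/02/08", "%m/%d/%y")
except ValueError as err:
    print("American parse fails:", err)
```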
Data quality is a function of imperceptible properties called data quality dimensions. An example: a data element such as “age” cannot be less than 0.
As an aside, if you know something will be created in the future, could you technically consider it to have a negative age? I suppose it wouldn’t necessarily be a certain value, though.
Basically, it is easier to measure the quality of a physical product because of its physical characteristics. Some examples of data quality dimensions are accuracy, completeness, consistency, uniqueness, and timeliness.
Definition - Data Quality Dimensions: A set of data quality attributes that represent a single aspect or construct of data quality.
Dimensions provide a means to quantify and manage the quality of data. Some commonly used data quality dimensions are completeness, accuracy, timeliness, conformity, and consistency. More on this in Ch. 3 - 4.
Data quality issues can be injected into every phase of the data life cycle, from initial creation and procurement/collection touch points (like the Internet), to data processing, transfer, and storage. You can also consider backing up data, fetching archived data, and disaster recovery as other points where data can get messy.
Again, date format is an example, where different databases may interpret the same value differently.
There’s a saying: garbage in, garbage out. Humans are very susceptible to errors, especially with data entry. Many businesses cite data entry mistakes as a “… barrier to achieving more accurate customer information.”
People entering data may do so incorrectly by accident, intentionally, or because they don’t know better. Think of a form that asks for gender and only provides “Male” or “Female”. As this is but a mere subset of all genders, the form is set up to collect inaccurate data.
The figure on p. 15 shows many causes of bad data quality. Manual data entry is just one of well over ten.
The ease of entering data may also be a factor. Make it too hard and employees might not provide correct data, especially under time constraints and with a bad UI. Bankers have also been motivated by greed to provide incorrect data, whether for bonuses or commissions.
Inadequate validation can fail to catch simple data entry errors.
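A minimal sketch of what field-level validation could look like; the rules (age range, email pattern, postcode presence) are my own illustrative assumptions, not the book's.

```python
# Sketch: simple field-level validation that would catch basic entry errors.
import re

def validate_entry(entry):
    """Return a list of problems with a single form entry (illustrative rules)."""
    problems = []
    if not (0 <= entry.get("age", -1) <= 130):
        problems.append("age out of range")
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", entry.get("email", "")):
        problems.append("email malformed")
    if not entry.get("postcode", "").strip():
        problems.append("postcode missing")
    return problems

print(validate_entry({"age": 200, "email": "not-an-email", "postcode": " "}))
# ['age out of range', 'email malformed', 'postcode missing']
```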
Data decay (sounds cooler) is data degradation over time, resulting in progressively worse data quality. In a data set, the average can shift over time, like housing prices.
Some data, like a birthday, will never change. Other data, like address or job title, are subject to frequent change. Every data element has a different expiration date.
Data has a shelf life and requires continuous maintenance. It’s important to set guidelines for how often each field should be updated. There are different ways to update data in a database:
A comments field can be included in the table to state what field or fields were updated. I did work on an app that chose the first method for customer business locations. This confused customers, as they would see their business as closed and would complain that they were not closed but had moved. Be aware of who will use the data, as clearly option 2 or 3 would have been more appropriate in that circumstance.
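Not from the book, just a rough sketch of two common update strategies for something like the business-location case: overwrite in place versus append a new row and mark the old one superseded. Which (if either) corresponds to the book's numbered options is my guess; the field names are illustrative.

```python
# Sketch: two ways to apply a business address change.
from datetime import date

locations = [
    {"business_id": 7, "address": "1 Old Street", "status": "open", "valid_from": date(2015, 3, 1)},
]

def overwrite(rows, business_id, new_address):
    """Update in place: simple, but the history of the move is lost."""
    for row in rows:
        if row["business_id"] == business_id:
            row["address"] = new_address

def append_new_record(rows, business_id, new_address):
    """History-preserving: mark the old row superseded and add a new current one."""
    for row in rows:
        if row["business_id"] == business_id and row["status"] == "open":
            row["status"] = "superseded"
    rows.append({"business_id": business_id, "address": new_address,
                 "status": "open", "valid_from": date.today()})

append_new_record(locations, 7, "2 New Avenue")
for row in locations:
    print(row)   # the old address is still visible, labelled superseded
```

The second approach keeps the trail of what changed, which matters for anyone (or any downstream system) that still looks at the old record.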
Certain practices around data collection will have an adverse impact on data quality. An example is a call centre incentivized to take as many calls as possible while filling in forms: no checks were in place on data collection, so employees filled in the forms incorrectly in order to take more calls.
Data migration generally involves the transfer of data from an existing data source to a new one. This can be a new database or just a new schema within the same database. One issue is that data can change considerably in the new database or new schema.
Data migration projects migrate data from one data structure to another, or transform data from one platform to another with a modified data structure.
An example is migrating data from a legacy system to a relational database management system. Every system should (ideally) be made of three layers: database (storage), business rules, and a user interface. A legacy system would rarely have complete specifications or well-documented business rules. So, storing data in CSVs or spreadsheets? You are just storing data for the sake of storing it, with little to no thought of how it will be used later.
The result may be poorly formatted data, and given the lack of documentation, data formatting can be inconsistent across sources.
Data Integration processes bring together and combine data from numerous heterogeneous source systems into a new technology application. This involves moving data between applications and can combine conflicting data, which then requires conflict resolution business rules.
Mapping Specifications are used to transform data from one source system and load them into the target application. An example is a phone number stored as a ten-digit string in one system that needs to be a nine-digit integer in another.
Data quality thus depends on robustness of data mapping specifications and conflict resolution rules.
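A rough sketch of a mapping rule like the phone number example above. The truncation logic and names are my own guesses at how such a spec might be implemented, and it deliberately shows how the mapping itself can damage quality.

```python
# Sketch: a mapping rule converting a 10-digit phone string into a 9-digit integer.
def map_phone(source_value: str) -> int:
    """Illustrative mapping: strip formatting, drop the leading digit to fit 9 digits."""
    digits = "".join(ch for ch in source_value if ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {source_value!r}")
    return int(digits[1:])  # the target system only holds 9 digits

print(map_phone("087-123-4567"))  # 871234567 -- the leading 0 is silently lost
```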
Data Cleansing is the process of correcting the erroneous data elements of a dataset. Cleansing is meant to resolve data quality issues, but it can be the source of new ones. This is because the vast amount of data these days cannot be cleansed manually, and automated methods may make unexpected corrections.
Automated data cleansing algorithms can have bugs and then be applied over an entire set of data. Hence, thousands or millions of records could be ruined.
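A toy illustration (entirely mine, not from the book) of how a well-intentioned automated cleansing rule can quietly damage good records when run across a whole table:

```python
# Sketch: a "standardize street abbreviations" rule that over-corrects.
import re

def cleanse_address(address: str) -> str:
    """Expand 'St' to 'Street' -- but the naive pattern also hits 'St' meaning Saint."""
    return re.sub(r"\bSt\b\.?", "Street", address)

good = "12 St Patrick's Rd"        # 'St' here means Saint, not Street
print(cleanse_address(good))       # "12 Street Patrick's Rd" -- the record is now wrong
```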
Also, sensitive data needs protection:
Data Anonymization is the process of removing PII from a dataset.
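A minimal anonymization sketch, assuming removal of direct identifiers plus one-way hashing of the key. The column names are illustrative, and real anonymization also has to worry about re-identification, which this ignores.

```python
# Sketch: drop direct identifiers and replace the customer ID with a one-way hash.
import hashlib

def anonymize(record: dict) -> dict:
    anonymized = dict(record)
    for pii_field in ("name", "email", "phone"):       # direct identifiers removed
        anonymized.pop(pii_field, None)
    anonymized["customer_id"] = hashlib.sha256(
        str(record["customer_id"]).encode()
    ).hexdigest()[:12]                                  # pseudonymous key
    return anonymized

print(anonymize({"customer_id": 42, "name": "Jane Doe",
                 "email": "jane@example.com", "purchase_total": 99.50}))
```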
Organizational changes, described below, can increase the chances of data quality errors.
Data consolidation in these cases can lead to duplicate and inconsistent data. Each IT system may have its own rules and methods of handling data; hence, a merger can be messy. There’s also the risk of losing key employees during the process.
Going global is great for revenue but adds another dimension of complexity to data, with different date formats, addresses, telephone numbers, currencies, etc. Current data may need to be transformed to a new schema, for example to include the country code in telephone numbers or the country in the address.
Additionally, countries have different legal requirements for data, where it can be stored, how it is stored (encryption), retention requirements, etc…
When staff who have knowledge of the detailed workings of IT systems leave, their knowledge is lost with them. That can have a negative impact on data quality.
Examples are tax law changes and regulatory requirements such as financial reporting, compliance, data protection, etc. These require hasty changes and patches to systems. There’s a saying that you can have only three of four: cheap, fast, good (quality), done. Companies sadly pick quality as the flex point.
Apparently, data fit for one system often require some massaging to fit into another. This can cause issues and expose inaccuracies.
Data Purging is the process of permanently deleting old data from storage space. This is done once the retention limit is reached or data become obsolete. It can impact the wrong data accidentally: you can purge too much or too little.
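A small sketch of a retention-based purge. The seven-year limit and the boundary comparison are my own illustrative choices; the point is how easily "too much or too little" happens.

```python
# Sketch: purge records older than a retention limit.
from datetime import date, timedelta

RETENTION_DAYS = 7 * 365  # assumed 7-year retention policy

records = [
    {"id": 1, "created": date.today() - timedelta(days=10 * 365)},  # should go
    {"id": 2, "created": date.today() - timedelta(days=7 * 365)},   # boundary case
    {"id": 3, "created": date.today() - timedelta(days=30)},        # should stay
]

cutoff = date.today() - timedelta(days=RETENTION_DAYS)
kept = [r for r in records if r["created"] >= cutoff]   # '>' vs '>=' decides the boundary row
print([r["id"] for r in kept])  # [2, 3] -- flip the comparison and record 2 is purged too
```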
Again, data quality depends on the context in which data are used. So, if collected for one purpose, data may not port over to other purposes. If the data collectors knew what the data would be used for, they could collect good-quality data for everyone.
p. 26
Typically one person on a project knows the ins and outs of all the data. This can be especially true in a legacy system. If there’s no documentation, then if said individual leaves, the knowledge silo goes with them.
Metadata helps users understand the data. Simple help text for an address or date format can go a long way. Consistent terminology also helps; it doesn’t help when some sources reference “customer” and others use “client” when talking about the same thing.
p. 27
Important data assets may have several providers or producers and numerous consumers who are often oblivious of each other. Data quality ends up not being a concern of the providers. When there’s no transparency or accountability for data quality, data can be incomplete, outdated, inaccurate, and altogether unreliable and inconsistent.
Hackers can corrupt, steal, and delete data. It can be serious, as with medical devices and GPS, or sillier, as with games like Pokemon Go. However, if PII, PHI, or PCI data are stolen, organizations also face loss of customer confidence, reputational damage, and/or regulatory action.
p. 28
Understanding the impact of bad-quality data helps us understand the importance of data quality in general. Poor-quality data can lead to incorrect decisions, which can negatively impact a business in a number of ways.
(There’s an organization called The Data Warehouse Institute?)
Bad data can lead to wasted money in poorly targeted sales and marketing campaigns. Customer loyalty programs also suffer from bad contact data. A business can under-promote or over-promote the wrong products.
Apparently one of the key factors in the 2007 global financial crisis was inadequate IT and data architecture to support management of financial risk.
Poor data quality can also pose serious health risks. Many people die each year from medical errors. One example is a woman in the UK whose address was recorded as “16” instead of “1b”. She missed many letters and appointments, and unfortunately died because of delayed treatment.
And that’s only the half of it; there are people undergoing unnecessary surgeries because of swapped results.
The author lists the major impacts of bad data quality, with many examples, starting on p. 32.
Data produced or generated in one business process are consumed by several other business processes.
There’s also an interesting picture showing that the cost to prevent bad data quality is quite low, the cost to correct it is a bit more but still OK, and the cost of failure due to bad quality can be quite a lot.
p. 37
You probably know what it is, but the butterfly effect is apparently an aspect of chaos theory to describe how small changes in one state of a deterministic non-linear system can result in large differences in a later state.
Data issues often follow the same trajectory: a small error in a product code, customer ID, account number, etc., can propagate across systems in an exponential manner.
Author provides nice summary on pp. 40 - 42.
pp. 43 - 68
This chapter is about the evolution of data collection, data structures, and storage over time. We’ll also look at the inception of the database, relational databases, and the data warehouse.
Then a discussion about databases, data models, normalization, database schema, relational databases, data warehousing, and data hierarchy.
We have come a long way since paper files and filing cabinets. Then came punch cards, paper tape, and then magnetic tape. Some solutions didn’t hold much data and others were expensive, the typical struggle. Finally, we hit the hard disk drive era in 1956 thanks to IBM, though early drives were expensive and low capacity.
That would work itself out eventually. In the early 1960s, Charles W. Bachman developed the first database management system, the Integrated Data Store (IDS). IBM would also develop its first DBMS in the 1960s, the Information Management System (IMS).
There’s much more interesting history around the inception of the database. There were pre-relational databases before relational ones, and IBM developed SQL in 1974; it became an ANSI standard in 1986 and an ISO standard in 1987.
The data warehouse began to show up in the 1970s with something first called the “data mart”. It sounds like it came from a need to clean and integrate data stored in legacy systems, but the concept and practice of data warehousing as we know it today came in the 1980s.
Data warehousing exposed data quality issues as organizations attempted to integrate their disparate data. “Disparate” means things so unlike that there is no basis for comparison, which might be a bit harsh; however, the situation was a result of limited technology producing isolated applications.
The database timeline:
Definition - database: a structured collection of interrelated data that are organized and stored in computer systems in a way to facilitate efficient retrieval.
Definition - database model: type of data model that determines the logical structure of a database and fundamentally determines in which manner data can be stored, organized and manipulated.
Definition - Database Schema: the structure in a database management system that holds the physical implementation of a data model. Think of it like a blueprint of how data is stored, a reference for the DBMS.
Definition - Database Instance: a snapshot of the data in the database at a particular point in time. So, you might have backups of data saved as instances.
Definition - Relational Database: A digital database based on the relational model of data.
That definition isn’t great at explaining. A relational database is made up of tables (relations) and is managed by a relational database management system (RDBMS).
Definition - Relational Model: A framework of structuring data using relationships. It’s the theoretical basis of relational databases.
Definition - Primary Key: Each record would be given a unique identifier known as the primary key.
Definition - Natural Key: an attribute, or set of attributes, that already exists in the real world and can uniquely identify a record.
A record is a row in a table. Columns in a table are called field names. The records are instances of something, and the field names are the attributes of each occurrence.
Definition - Candidate Key: This is the set of natural keys that can be declared as the primary key (choices).
Definition - Surrogate Key: Any column or set of columns in a relational table that does not have a business meaning but can be used to uniquely identify records in the table. So, it can be used as a primary key instead of a natural key.
Definition - Entity: Any object in the system that we want to model and store data about. Can be people, places, objects, phenomena, or events.
Definition - Attributes: Characteristics or features of an entity.
Definition - Instance: An occurrence of an entity. We store data of an instance in our relational database as a record with fields.
Definition - Relationship: An association or link between entities with an attribute or set of attributes establishing the link.
A relationship is established by a foreign key in one entity linking to the primary key in another entity.
Definition - Foreign Key: a column, or set of columns, from one relational table whose values must match another relational table’s primary key. This establishes the link.
Definition - Domain: A set or range of allowable values for a field (column).
Four main elements in a relational database:
Definition - Structured Query Language (SQL): a relational database query and manipulation language allowing for the creation, modification, truncation, and deletion of databases and tables, and the manipulation and querying of data.
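To tie several of these definitions together, here's a tiny runnable sketch using Python's built-in sqlite3. The customer/account tables and columns are my own illustrative choices, not the book's.

```python
# Sketch: primary key, surrogate key, and foreign key in a toy customer/account model.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys

conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- surrogate key, no business meaning
        email       TEXT UNIQUE NOT NULL   -- a natural/candidate key
    )""")
conn.execute("""
    CREATE TABLE account (
        account_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),  -- foreign key
        opened_on   TEXT NOT NULL          -- transactional: the event has a time component
    )""")

conn.execute("INSERT INTO customer (customer_id, email) VALUES (1, 'jane@example.com')")
conn.execute("INSERT INTO account (customer_id, opened_on) VALUES (1, '2024-05-01')")

# Violating the relationship fails, which is exactly the point of the foreign key.
try:
    conn.execute("INSERT INTO account (customer_id, opened_on) VALUES (99, '2024-05-02')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```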
p. 52
Definition - Data Model: An abstract representation of the database. It organizes data elements and data objects and standardizes how they relate to one another and the properties of the real-world entities at different levels of abstraction.
The three basic styles of data model are conceptual, logical, and physical.
To me, this sounds like the progression of building a schema. The main purpose of data modelling in database construction is to make sure that entities, attributes, attribute structure, and relationships between entities are completely and correctly represented. Typically, models are a pictorial representation.
Without careful planning, the database could omit attributes, resulting in incomplete and/or inaccurate data. Leads to poor quality by design.
p. 55
Definition - Data Normalization: a refinement process and systematic approach to restructuring data in a database to eliminate redundancy and promote logical data dependencies.
In contrast, you can have an unnormalized relation, AKA unnormalized form or non-first normal form. These databases are often simple models that suffer from data redundancy and the like.
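A small illustration (my own example) of the redundancy that normalization removes: the unnormalized rows repeat customer details on every order, so a single address change has to be applied in many places, while the normalized form stores the customer once.

```python
# Sketch: unnormalized rows repeat customer details; the normalized form factors them out.
unnormalized_orders = [
    {"order_id": 1, "customer": "Jane Doe", "address": "1 Old Street", "item": "Widget"},
    {"order_id": 2, "customer": "Jane Doe", "address": "1 Old Street", "item": "Gadget"},
    {"order_id": 3, "customer": "Jane Doe", "address": "1 Old Street", "item": "Widget"},
]

# Normalized: customer details stored once, orders reference them by key.
customers = {101: {"name": "Jane Doe", "address": "1 Old Street"}}
orders = [
    {"order_id": 1, "customer_id": 101, "item": "Widget"},
    {"order_id": 2, "customer_id": 101, "item": "Gadget"},
    {"order_id": 3, "customer_id": 101, "item": "Widget"},
]

# An address change: one update in the normalized form, versus three updates
# (and the risk of missing one, creating an inconsistency) in the unnormalized form.
customers[101]["address"] = "2 New Avenue"
for row in unnormalized_orders:
    row["address"] = "2 New Avenue"
```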
p. 55
There’s a reference to a book “The Data Warehouse Toolkit”.
Definition - Data Warehouse: A copy of transaction data specifically structured for query and analysis. A collection of integrated, subject-oriented, non-volatile, and time variant databases where each unit of data is specific to some period of time.
The process of data warehousing involves the capture and integration of data from multiple heterogeneous sources. An enterprise data warehouse (basically the same thing but specific for business) usually covers a wide range of subject areas depending on the organization’s business process domain.
Poor data can creep into a data warehouse through a number of routes. Data in a warehouse is sourced from disparate sources, and each source has its own method of storing data. A source offering any kind of unsecured access can become unreliable.
Definition - Data Overloading: when a particular field holds more than one data element.
Facts are measurements from business processes. Dimensions are groups of hierarchies and descriptors that define the facts. And a grain describes what a single fact table row represents. Different grains shouldn’t be mixed in one fact table.
Dimensional modelling involves four key steps: select the business process, declare the grain, identify the dimensions, and identify the facts.
The Star and Snowflake schemas are dimensional modelling techniques that use dimension tables to describe data aggregated in a fact table.
The dimension tables in a star schema are not normalized. Normalizing dimension tables is called “snowflaking”, and hence a snowflake schema is a star schema with fully normalized dimensions.
Definition - Hierarchy: Set of levels having many-to-one relationships between each other. The set of levels jointly makes up a dimension or a tree.
The book has examples around p. 59: several dimension tables, like Customer, Product, and Time, come together in a “Sales Fact” table to give facts like sales recorded and quantity sold.
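A minimal star-schema sketch in sqlite3, loosely following the Customer/Product/Time-around-Sales-Fact example; the actual columns and figures are mine.

```python
# Sketch: a tiny star schema -- dimension tables joined to a fact table at sale grain.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_time     (time_key     INTEGER PRIMARY KEY, sale_date TEXT, month TEXT);
    CREATE TABLE sales_fact (            -- grain: one row per product per customer per day
        customer_key INTEGER, product_key INTEGER, time_key INTEGER,
        sales_amount REAL, quantity_sold INTEGER
    );
    INSERT INTO dim_customer VALUES (1, 'Jane Doe', 'West');
    INSERT INTO dim_product  VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
    INSERT INTO dim_time     VALUES (1, '2024-05-01', '2024-05');
    INSERT INTO sales_fact   VALUES (1, 1, 1, 20.0, 2), (1, 2, 1, 5.0, 1);
""")

# A typical dimensional query: facts aggregated and sliced by dimension attributes.
for row in conn.execute("""
        SELECT c.region, p.category, t.month, SUM(f.sales_amount), SUM(f.quantity_sold)
        FROM sales_fact f
        JOIN dim_customer c ON f.customer_key = c.customer_key
        JOIN dim_product  p ON f.product_key  = p.product_key
        JOIN dim_time     t ON f.time_key     = t.time_key
        GROUP BY c.region, p.category, t.month"""):
    print(row)   # ('West', 'Hardware', '2024-05', 25.0, 3)
```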
Definition - Online Analytical Processing (OLAP) Cube: a dimensional structure used for storing multidimensional data and implemented in a multidimensional database, and optimized for advanced analytics. If more than 3 dimensions, it’s called a hyper cube. Often the final step in deployment of a dimensional data warehouse system.
The OLAP cube contains facts and dimensions but is accessed through the multidimensional expressions (MDX) language.
Two approaches to designing a data warehouse:
Definition - Source System: A transactional system, or data store, that provides data to another system.
Definition - Target System: A system that receives data from one or more data sources.
Definition - System of Data Origin: where data were created originally.
Definition - Direct Source System: Where the target system sources data.
Note, the target system will not always source data directly from the system of data origin.
A data warehouse is populated through a sequence of steps that extract data from the source systems, transform them, and load them into the warehouse.
You can see now the abbreviation ETL for extract, transform, and load; a rough sketch follows below. A data warehouse architecture consists of several components built around this flow.
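A toy end-to-end ETL sketch; the two sources, the transformations, and the in-memory "warehouse" are all illustrative assumptions on my part.

```python
# Sketch: extract from two source "systems", transform to a common format, load.
import sqlite3

# Extract: two sources with different conventions for the same data.
crm_rows = [{"cust": "Jane Doe", "joined": "01/02/08"}]              # dd/mm/yy
webshop  = [{"customer_name": "JANE DOE", "signup": "2008-02-01"}]   # ISO dates

def transform(row):
    """Map either source layout onto the warehouse layout (ISO dates, title case)."""
    if "cust" in row:
        d, m, y = row["joined"].split("/")
        return {"name": row["cust"].title(), "joined": f"20{y}-{m}-{d}"}
    return {"name": row["customer_name"].title(), "joined": row["signup"]}

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE dim_customer (name TEXT, joined TEXT)")

# Load: transformed rows from both sources end up in one integrated table.
for row in crm_rows + webshop:
    clean = transform(row)
    warehouse.execute("INSERT INTO dim_customer VALUES (?, ?)", (clean["name"], clean["joined"]))

print(warehouse.execute("SELECT * FROM dim_customer").fetchall())
# [('Jane Doe', '2008-02-01'), ('Jane Doe', '2008-02-01')] -- a duplicate to deduplicate,
# which is exactly the kind of quality issue integration surfaces.
```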
Qualitative data contains descriptive information. Quantitative data contains numerical information. Elementary data represents real-world atomic phenomena. Aggregated data is a collection of elementary data with some aggregation function applied to them.
A data element is the most basic unit of data and the smallest named unit in the database with any meaning. It typically represents an attribute, or collection of attributes, of a real-world entity.
I feel like we covered some of this stuff already.
Since not all data is stored in a database, we refer to a broader sense of collected data as a data store.
pp. 66 - 67 have a very nice table of all the terms. Not going to rewrite it here.
Note that normalizing a database and normalizing data itself appear to be two different concepts. The former is a refinement process to clean up the structure of a data store; the latter is a data transformation technique to reduce the weight (scale) of data values.
pp. 69 - 128
We now must manage data quality. Somehow, we need to measure quality in order to assess and improve it. Accuracy is not the only component of data quality.
Definition - Data Quality Dimension: a term used by information management professionals to describe an attribute/characteristic of data element(s) that can be measured or gauged against well-defined standards in order to determine, analyse, improve, and control or keep track of the quality of data. Common dimensions include completeness, accuracy, timeliness, conformity, consistency, and uniqueness.
Important to note that since the 1990s, there has not been general agreement on data quality dimensions.
Referencing “Data Quality for the Information Age” by Thomas C. Redman. The author looks at data quality from three perspectives: data modeling, data value, and data presentation. He goes on to describe many different data quality dimensions.
Referencing “Data Quality: The Field Guide” by Redman. Redman expands his grouping of data quality dimensions to include 55 dimensions across 7 categories of quality.
Referencing “Improving Data Warehouse and Business Information Quality” by Larry English. Larry takes a broader approach with wider categories. Larry also wrote “Information Quality Applied”, where he revisits and clarifies the earlier set of information quality characteristics.
Referencing “The Practitioner’s Guide to Data Quality Improvement” by David Loshin. David classified data quality dimensions into 3 categories: Intrinsic, Contextual, and Qualitative. He recognizes 16 data quality dimensions across 3 categories and identifies a total of 38 characteristics.
There are then just a ton of definitions: completeness, timeliness, currency, consistency, reasonableness, and identifiability; plus authoritative sources, trust, anonymity/privacy, quality indicator, edit and imputation, and standards and policies.
There is figure 3.3 on p. 77 listing around 30 different data quality dimensions. I could see all of these dimensions as a good checklist. Will I list them?
Some of the words have very similar meanings to me. Ironically, like duplication and redundancy. Maybe it is a joke?
Ch. 4 discusses actually measuring data quality dimensions.
Then, much of the rest of the chapter goes into detail of the dimensions. I won’t be as granular.
p. 123
p. 129
As of now, reading is on hold as I have more pressing information to read.