Every day, your organization collects a vast amount of customer data from online and internal systems. That data passes through several collection and transformation steps before it reaches your analytics platforms.
It is important to consider what can go wrong during that collection and transformation. Common issues include timestamps recorded in the wrong time zone, duplicate or inaccurate entries, and undetected non-human traffic that skews metrics. Different departments may also lack alignment on how basic terms are defined: what precisely constitutes an “active user,” for example, can mean one thing to product management and another to marketing.
When such inconsistencies are not properly addressed, they can lead to discrepancies in the final data. These discrepancies undermine the reliability and trustworthiness of the information. As a consequence, teams may hesitate to use the flawed data to inform important projects and decisions. In severe cases, some stakeholders may even decide to rely solely on their own assumptions rather than data analysis.
Research underscores how prevalent poor data quality has become. According to one survey of over 500 data experts, around 77% said their companies experience some level of data quality issues. An overwhelming majority, 91%, believe these problems negatively impact organizational performance.
Clearly, a proactive strategy is needed to safeguard data reliability before flawed data misguides your analytics. Here are some tips for taking a proactive approach to preventing data discrepancies.
Definition of data discrepancy
A data discrepancy is a disagreement between two corresponding data sets. For example, two analytics platforms might report different bounce rates for the same landing page.
Data discrepancies are common when data is collected from multiple sources, such as SaaS tools and online platforms. This is because different tools may use different definitions for the same terms, track data differently, and use different criteria to calculate metrics.
While it is impossible to eliminate data discrepancies completely, it is important to minimize and prevent them as much as possible.
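To see how this happens, here is a minimal, hypothetical sketch: the session data and both bounce definitions below are illustrative and not tied to any particular vendor, yet the same sessions yield two different bounce rates.

```python
# Hypothetical sessions: (pageviews, seconds_on_site) per session.
sessions = [(1, 5), (1, 45), (2, 8), (3, 120), (1, 200)]

# Definition A: a bounce is any single-page session.
bounces_a = sum(1 for pages, _ in sessions if pages == 1)

# Definition B: a bounce is a session that is not "engaged"
# (here: fewer than 2 pageviews AND under 10 seconds on site).
bounces_b = sum(1 for pages, secs in sessions if pages < 2 and secs < 10)

print(f"Definition A bounce rate: {bounces_a / len(sessions):.0%}")  # 60%
print(f"Definition B bounce rate: {bounces_b / len(sessions):.0%}")  # 20%
```

Neither number is wrong; they simply answer different questions, which is exactly how two platforms can disagree about the same landing page.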
The hidden costs of data discrepancies
Data discrepancies can have a significant financial impact on businesses. In a survey of 1,200 CRM users, 44% of respondents estimated that poor-quality CRM data costs their company more than 10% of annual revenue, and Gartner estimates that poor or inconsistent data costs organizations an average of $12.9 million per year.
The top three causes of data discrepancies in business analytics
- Non-standardized data collection:
Data discrepancies often arise from variations in tracking methods, naming conventions, and data definitions across data sources. Different platforms may measure and record the same metric differently, and attribution models, event locations, time zones, and tracking methods can vary as well. These differences make it difficult to analyze and compare data accurately, especially when working with multiple software providers or fulfilling reporting requirements for different entities.
- Inadequate data cleansing and quality control:
Insufficient data cleansing and a lack of robust quality control introduce errors and discrepancies into the data: invalid data types, syntax errors, incomplete records, and duplicate entries. Discrepancies also occur when different analytics tools apply different filtering or transformation steps, producing variations in the final analysis. Effective data cleansing, including automated diagnostics and validation checks, is essential to keep the data used for analysis accurate and reliable (the sketch after this list shows a few of these basic cleansing steps).
- Sampling and data limitations:
Data sampling is a common technique for estimating results from a subset of the data, but it can introduce discrepancies if not properly managed. Both the sample size and the selection method affect accuracy: the sampled data may not represent the full dataset, or may be biased toward a specific slice of traffic or certain demographics. It’s important to understand the limitations of data sampling and consider its potential impact on the accuracy of the analysis results.
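As a rough illustration of the first two causes above, the following sketch uses hypothetical event records and field names to apply a few basic cleansing steps: normalizing timestamps to UTC, dropping duplicates, and discarding records that fail to parse.

```python
from datetime import datetime, timezone

# Hypothetical raw events; timestamps arrive in mixed time zones,
# and one event is an exact duplicate.
raw_events = [
    {"id": "e1", "ts": "2024-03-01T23:30:00-05:00", "user_id": "u1"},
    {"id": "e1", "ts": "2024-03-01T23:30:00-05:00", "user_id": "u1"},  # duplicate
    {"id": "e2", "ts": "2024-03-02T01:15:00+01:00", "user_id": "u2"},
    {"id": "e3", "ts": "not-a-timestamp", "user_id": "u3"},  # invalid timestamp
]

def clean(events):
    seen, cleaned = set(), []
    for event in events:
        if event["id"] in seen:          # drop exact duplicates by event id
            continue
        try:                             # drop records whose timestamp cannot be parsed
            ts = datetime.fromisoformat(event["ts"])
        except ValueError:
            continue
        seen.add(event["id"])
        # Normalize every timestamp to UTC so daily counts agree across sources.
        cleaned.append({**event, "ts": ts.astimezone(timezone.utc).isoformat()})
    return cleaned

for event in clean(raw_events):
    print(event["id"], event["ts"])
# Note that e1 lands on 2024-03-02 in UTC even though its local date was
# 2024-03-01 -- exactly how per-day metrics drift apart between tools.
```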
How to prevent and resolve data discrepancies
Data discrepancies lead to missed insights, delayed decisions, and wasted engineering hours, all of which add up to expenses and opportunity costs. To prevent and resolve data discrepancies, you can follow these steps:
- Centralize data collection:
When data is collected and stored in different places, it can be difficult to identify and resolve discrepancies. For example, marketing and sales teams might unknowingly use different attribution models for the same campaigns, while product and customer support teams may have different definitions for disengaged users.
To address these discrepancies, consolidate data from multiple sources into a single repository. A customer data platform (CDP) such as DataS can centralize that data and create a shared source of truth for your organization.
- Data tracking plans:
Implementing a data tracking plan is a crucial step in preventing data discrepancies within an organization. A tracking plan is a document that outlines the data events to be collected, including their properties and naming conventions, the rationale for tracking each event in support of business goals, and the tracking methods used.
By having all departments within the organization adhere to a single tracking plan, data discrepancies can be avoided. To ensure company-wide adoption, it’s important to create a tracking plan that addresses the specific data collection requirements and use cases of each team. The tracking plan should be treated as a dynamic document that can be updated and modified as needed to improve data standards and accommodate additional event types as the business evolves.
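What a tracking plan contains varies by organization. As a minimal sketch (the event names, properties, owners, and methods below are purely illustrative), it can be kept as a structured spec that both people and validation code can read:

```python
# Hypothetical tracking plan kept in code: each event documents its owner,
# the reason it is tracked, and the properties it must carry.
TRACKING_PLAN = {
    "signup_completed": {
        "owner": "growth",
        "why": "Measures top-of-funnel conversion toward the signup goal.",
        "properties": {"user_id": str, "plan": str, "referrer": str},
        "method": "server-side",
    },
    "order_placed": {
        "owner": "commerce",
        "why": "Revenue reporting and campaign attribution.",
        "properties": {"user_id": str, "order_id": str, "value_usd": float},
        "method": "server-side",
    },
}

# Naming convention for all events: lowercase snake_case, object_action order.
for name, spec in TRACKING_PLAN.items():
    print(f"{name}: owned by {spec['owner']}, {len(spec['properties'])} required properties")
```

Because the plan is machine-readable, the automated checks described later in this article can enforce it rather than rely on manual review.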
- Shared data dictionary:
A data dictionary provides a comprehensive list of data elements, their definitions, and associated attributes. To create a successful data dictionary, it is important to involve different departments in its development and enforcement. Even seemingly common terms like “users” or “sessions” can lead to disputes, so engaging stakeholders from various teams is crucial. While the process may require significant time, coordination, and negotiation, it is worthwhile in order to establish a common understanding.
Google Analytics, for instance, defines the metric “Percent scrolled” as the percentage of a page that a user has scrolled. Metrics and significant events such as conversions may require more detailed explanations, including their calculation methods and their relationships to other data objects.
To enhance the data definitions within the dictionary, it can be helpful to include additional resources such as links to related terms or external materials that explain the distinctions between similar data objects. In cases where certain terms are frequently disputed, it is beneficial to create and link to an internal resource that provides a thorough explanation and justification for the defined terms.
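As a minimal sketch of what dictionary entries might look like (all terms, definitions, owners, and links below are illustrative), each term pairs a precise definition with its calculation and pointers to related material:

```python
# Hypothetical entries in a shared data dictionary.
DATA_DICTIONARY = {
    "active_user": {
        "definition": "A user with at least one tracked session in the trailing 7 days.",
        "calculation": "count(distinct user_id) where session_start >= now() - 7 days",
        "owner": "analytics",
        "see_also": ["engaged_user", "session"],
        "reference": "https://intranet.example.com/data/active-user",  # illustrative link
    },
    "percent_scrolled": {
        "definition": "The share of a page's height a user has scrolled.",
        "calculation": "max scroll depth reached / page height, per pageview",
        "owner": "web-analytics",
        "see_also": ["engagement_rate"],
        "reference": "https://intranet.example.com/data/percent-scrolled",  # illustrative link
    },
}

# Any report or dashboard can cite the shared definition instead of restating it.
print(DATA_DICTIONARY["active_user"]["definition"])
```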
- Detect data quality problems automatically:
Automating the diagnosis of data quality issues is essential, especially when dealing with a high volume of daily events; manual checks and audits alone cannot keep up. Tasks that can be automated include the following (a minimal sketch follows the list):
- Flagging bad data: Implementing automated processes to identify and flag data that is inaccurate, invalid, duplicate, or incomplete. This helps in quickly identifying and addressing data quality issues.
- Preventing bad data from reaching repositories and downstream tools: Automation can be utilized to stop flawed data from being sent to data repositories and downstream tools. By implementing validation checks and filters, erroneous data can be intercepted and prevented from further propagation.
- Transforming, cleaning, deduplicating, and validating data: Automation tools can be employed to perform data transformations, cleaning tasks, deduplication processes, and data validation. This ensures that the data is standardized, consistent, and conforms to predefined rules and standards.
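As a minimal sketch of what this automation can look like (the schema, event shapes, and routing below are hypothetical and not tied to any particular tool), incoming events are validated and deduplicated before they reach the warehouse, and failures are flagged rather than silently loaded:

```python
# Hypothetical required fields and types for every incoming event.
EVENT_SCHEMA = {"event": str, "user_id": str, "timestamp": str}

def route_events(events):
    """Split incoming events into valid records and flagged (quarantined) records."""
    seen_ids, valid, flagged = set(), [], []
    for event in events:
        problems = [
            field for field, expected_type in EVENT_SCHEMA.items()
            if not isinstance(event.get(field), expected_type)
        ]
        if event.get("event_id") in seen_ids:
            problems.append("duplicate event_id")
        if problems:
            flagged.append({"event": event, "problems": problems})  # flag, don't load
        else:
            seen_ids.add(event.get("event_id"))
            valid.append(event)  # safe to forward to the warehouse and downstream tools
    return valid, flagged

valid, flagged = route_events([
    {"event_id": "1", "event": "signup", "user_id": "u1", "timestamp": "2024-03-01T00:00:00Z"},
    {"event_id": "1", "event": "signup", "user_id": "u1", "timestamp": "2024-03-01T00:00:00Z"},
    {"event_id": "2", "event": "signup", "user_id": None, "timestamp": "2024-03-01T00:05:00Z"},
])
print(len(valid), "valid;", len(flagged), "flagged")  # 1 valid; 2 flagged
```

Flagged events can then be quarantined for review instead of quietly skewing downstream metrics.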
DataS can also help you to identify data discrepancies. For example, DataS can identify when a tracked event deviates from the established tracking plan, such as using different naming styles or input formats, or containing incomplete or invalid properties. It cleanses the data before sending it to the data warehouse, analytics tools, and business applications. By automating these processes, discrepancies are resolved in real time, preventing data quality issues from affecting the integrity of the data stored in the data repository.