
Common data problems in big data: silos, disconnection, lack, difficulty, and dirty data

2024-07-08


Imagine you have just joined, as a big data development engineer, a large enterprise that claims to be undergoing "digital transformation". In your first week, you are full of enthusiasm, eager to show your skills and advance the company's data-driven decision-making.

However, as you begin to gain a deeper understanding of your company’s data infrastructure and processes, you gradually realize that the challenges ahead are much greater than you expected:

  • You try to obtain some historical sales data for analysis, but discover that the sales department's data is stored in an old database that is completely isolated from the company's main system.
  • When you try to integrate customer data from different departments, you find that each department uses a different customer ID format, making data matching extremely difficult.
  • You write a data processing script, but when you run it you discover a number of data quality issues, including missing values, outliers, and obvious erroneous inputs.
  • When you ask what certain data fields mean, no one can give you a clear answer, and you can't find any relevant data dictionary or documentation.
  • You propose moving some sensitive data to the cloud for processing, but the IT security team raises serious concerns about the risk of data leakage.
  • You develop a predictive model that performs well, but when you present it to the business, they say they don't understand what the data means.

Faced with these challenges, you realize that there is still a long way to go to achieve true data-driven decision making in this company. You decide to systematically sort through these issues to better understand and address them.

Common data problems in big data

1. Data silos


Data silos are situations in which data cannot be effectively shared between information systems or organizational units, leading to duplicated development effort and wasted resources.

Examples:

  • The sales department and inventory management department of a large retail company use different systems and cannot share data in real time.
  • The information systems between different government departments are not interoperable, resulting in citizens having to provide the same information repeatedly.

Code example (Python):

# The sales department's database
sales_db = {
    "product_a": {"sales": 1000, "revenue": 50000},
    "product_b": {"sales": 800, "revenue": 40000}
}

# The inventory department's database
inventory_db = {
    "product_a": {"stock": 500},
    "product_b": {"stock": 200}
}

# Because of the data silo, there is no single source for combined
# sales and inventory information; we have to stitch it together by hand.
def get_product_info(product):
    if product in sales_db and product in inventory_db:
        return {
            "sales": sales_db[product]["sales"],
            "revenue": sales_db[product]["revenue"],
            "stock": inventory_db[product]["stock"]
        }
    return None

print(get_product_info("product_a"))
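
The manual lookup above does not scale past a handful of products. A common first step toward breaking a silo is to consolidate both stores into a single joined view. The sketch below does this with pandas on the in-memory dicts from above, standing in for what would in practice be an ETL job against the real systems:

import pandas as pd

# Turn each departmental store into a DataFrame keyed by product
sales_df = pd.DataFrame.from_dict(sales_db, orient="index")
inventory_df = pd.DataFrame.from_dict(inventory_db, orient="index")

# An outer join keeps products that exist in only one system,
# which also makes gaps between the two silos visible
combined = sales_df.join(inventory_df, how="outer")
print(combined)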

2. Disconnection - Breaks in the data value chain


A data value chain disconnection is a break somewhere in the chain from data collection to final use, which prevents the value of the data from being fully realized.

Examples:

  • An e-commerce platform collects a large amount of user browsing data, but the analysis team lacks the skills to interpret it.
  • Medical institutions collect genetic data from patients but lack the ability to translate this data into personalized treatment plans.

Code example (Python):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Suppose we have collected user browsing data
df = pd.DataFrame({
    'user_id': range(1000),
    'page_views': np.random.randint(1, 100, 1000),
    'time_spent': np.random.randint(10, 3600, 1000),
    'purchases': np.random.randint(0, 5, 1000)
})

# Try to build a predictive model
X = df[['page_views', 'time_spent']]
y = df['purchases']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Model score (R^2 on the held-out test set)
print(f"Model Score: {model.score(X_test, y_test)}")

# But if the analysis team does not understand this model or know how to
# interpret its results, it cannot provide useful guidance for business
# decisions.
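
One modest way to bridge this kind of disconnection is to hand the business interpretable statements instead of raw metrics. Continuing from the model fitted above, a minimal sketch:

# Translate coefficients into plain-language statements the business
# can act on, instead of reporting only a score
for feature, coef in zip(X.columns, model.coef_):
    print(f"Each additional unit of '{feature}' is associated with "
          f"{coef:+.4f} expected purchases, all else being equal.")
print(f"Baseline purchases (intercept): {model.intercept_:.4f}")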

3. Lack - Lack of standards, governance, data, etc.

This problem spans many aspects of data management: the lack of unified standards, data governance mechanisms, necessary data, standardized processes, specialized organizations, and management systems.


Examples:

  • A multinational company's branches in different countries use different customer information formats, which makes data integration difficult.
  • A research project was missing key demographic data, which affected the accuracy of the analysis.

Code example (Python):

# Suppose we have customer data from different countries in inconsistent formats
us_customers = [
    {"name": "John Doe", "phone": "1234567890"},
    {"name": "Jane Smith", "phone": "0987654321"}
]

uk_customers = [
    {"full_name": "David Brown", "tel": "+44 1234567890"},
    {"full_name": "Emma Wilson", "tel": "+44 0987654321"}
]

# With no shared standard, we have to normalize the data by hand
def standardize_customer(customer, country):
    if country == "US":
        return {
            "full_name": customer["name"],
            "phone_number": "+1 " + customer["phone"]
        }
    elif country == "UK":
        return {
            "full_name": customer["full_name"],
            "phone_number": customer["tel"]
        }

# Normalize all records into the shared format
standardized_customers = (
    [standardize_customer(c, "US") for c in us_customers] +
    [standardize_customer(c, "UK") for c in uk_customers]
)

print(standardized_customers)
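
A more durable fix than ad-hoc conversion functions is to declare one canonical schema and validate every record against it at ingestion time. Below is a minimal sketch using a dataclass; the field names and the validation rule are illustrative assumptions, not an established standard:

from dataclasses import dataclass

@dataclass
class Customer:
    # Canonical customer record: every source system must map into this shape
    full_name: str
    phone_number: str

    def __post_init__(self):
        # Lightweight validation at ingestion time: require an
        # international dialing prefix on every phone number
        if not self.phone_number.startswith("+"):
            raise ValueError(f"Missing country code: {self.phone_number!r}")

# Reuse the conversion logic from above, but now every record is
# checked the moment it enters the canonical store
validated = [Customer(**standardize_customer(c, "US")) for c in us_customers]
print(validated)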

4. Difficulty - Data is difficult to obtain, understand and trace

This issue involves the accessibility, understandability and traceability of data.

Examples:

  • A company's historical data was stored in legacy systems, making it difficult for new employees to access and understand the data.
  • Some results in a data analysis project cannot be traced back to the original data source, which affects the credibility of the results.

Code example (Python):

import hashlib
import json
from datetime import datetime

class DataRecord:
    def __init__(self, data, source):
        self.data = data
        self.source = source
        self.timestamp = datetime.now().isoformat()
        self.hash = self._calculate_hash()

    def _calculate_hash(self):
        record = json.dumps({"data": self.data, "source": self.source, "timestamp": self.timestamp})
        return hashlib.sha256(record.encode()).hexdigest()

    def __str__(self):
        return f"Data: {self.data}, Source: {self.source}, Timestamp: {self.timestamp}, Hash: {self.hash}"

# Create some data records
record1 = DataRecord("User A purchased Product X", "Sales System")
record2 = DataRecord("Product X inventory decreased by 1", "Inventory System")

print(record1)
print(record2)

# This approach helps trace where data came from and how it changed,
# but an additional system is still needed to manage these records.
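
As a usage example, the stored hash makes each record tamper-evident: recomputing the hash and comparing it with the stored value reveals any change to the data. A small helper (not part of the class above) illustrates this:

# Verify a record by recomputing its hash; any change to the data,
# source, or timestamp breaks the match
def verify(record):
    return record.hash == record._calculate_hash()

print(verify(record1))  # True: the record is unchanged

record1.data = "User A purchased Product Y"  # simulate tampering
print(verify(record1))  # False: the stored hash no longer matches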

5. Dirty - Poor data quality

Data quality issues include inaccuracy, incompleteness, inconsistency, duplication, etc.


Examples:

  • There is a lot of duplicate or outdated contact information in the customer database.
  • Sensor data contains outliers, which affects the accuracy of data analysis.

Code example (Python):

import pandas as pd
import numpy as np

# Create a DataFrame containing some "dirty" data
df = pd.DataFrame({
    'name': ['John', 'Jane', 'John', 'Bob', 'Alice', np.nan],
    'age': [30, 25, 30, -5, 200, 35],
    'email': ['john@example.com', 'jane@example', 'john@example.com',
              'bob@example.com', 'alice@example.com', 'invalid']
})

print("Original data:")
print(df)

# Data cleaning
def clean_data(df):
    # Remove duplicate rows
    df = df.drop_duplicates()

    # Handle missing values
    df['name'] = df['name'].fillna('Unknown')

    # Correct outliers: flag implausible ages as missing rather than guessing
    df.loc[(df['age'] < 0) | (df['age'] > 120), 'age'] = np.nan

    # Drop rows whose email does not look like a valid address
    df = df[df['email'].str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$', na=False)]

    return df

print("\nCleaned data:")
print(clean_data(df))
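
Note that cleaning rules like these encode judgment calls: whether an implausible age should become missing or cause the whole row to be dropped, and how strict the email check should be, are decisions for the data owner rather than the script. Cleaning that silently "fixes" values can itself become a data quality problem.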