Outline
5.1 Information Theory
5.2 Information Technology
5.3 Data quality
5.4 Data cleaning
5.5 Data fusion
5.6 Data storage
5.7 Data mining
5.8 Multimedia information processing
5.3 Data Quality
Uncertain Data
- Data uncertainty occurs during:
① Data collection
② Data transmission
③ Data processing
Causes of Data Uncertainty
- Environmental factors
- Low battery power
- Packet losses
Classification of Data Uncertainty
- Source Classification: by the source of the uncertain data (key point)
Name | Examples |
---|---|
Undesirable uncertainty | Noisy sensor data; imprecise GPS data; unreliable extracted/integrated data |
Desirable uncertainty | Medical data with generalized attributes; cloaked trajectory data |
- Granularity Classification: by granularity
① Tuple uncertainty
② Attribute uncertainty
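The difference between the two granularities can be sketched as data structures. This is a minimal illustration under the usual probabilistic-database reading (tuple uncertainty: the record itself may not exist; attribute uncertainty: the record exists but a value is a distribution). The class and field names (`UncertainTuple`, `existence_prob`, `temperature_dist`) are hypothetical and not from the notes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class UncertainTuple:
    """Tuple uncertainty: the whole record may or may not exist."""
    sensor_id: str
    temperature: float
    existence_prob: float  # hypothetical field: probability the record is real


@dataclass
class UncertainAttribute:
    """Attribute uncertainty: the record exists, but one value is uncertain."""
    sensor_id: str
    temperature_dist: Dict[float, float]  # hypothetical field: value -> probability

    def expected_temperature(self) -> float:
        # Probability-weighted average of the possible values.
        return sum(value * prob for value, prob in self.temperature_dist.items())


maybe_reading = UncertainTuple("s1", 21.5, existence_prob=0.8)
fuzzy_reading = UncertainAttribute("s2", {20.0: 0.5, 22.0: 0.5})
expected = fuzzy_reading.expected_temperature()  # 21.0
```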
- Correlations Classification: by correlation
① Independent uncertainty
② Correlated uncertainty
③ Uncertainty with local correlations
Meaning of Data Quality (key point)
- Generally, you have a problem if the data doesn't mean what you think it does, or should.
- Data quality problems are expensive and pervasive.
Conventional Definition of Data Quality
Name | Explanation |
---|---|
Accuracy | The data was recorded correctly |
Completeness | All data was recorded |
Uniqueness | Each item is recorded once |
Timeliness | The data is kept up to date |
Consistency | The data agrees with itself |
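Some of these criteria can be checked mechanically. The sketch below is only an illustration on made-up records: the field names, the sample rows, and the 0 to 120 age range are assumptions, and the range check is a rough stand-in for accuracy rather than a full test of it.

```python
# Hypothetical sample records; None marks a value that was never recorded.
records = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": None},  # incomplete
    {"id": 2, "name": "Bob", "age": None},  # recorded twice
    {"id": 3, "name": "Cy", "age": -5},     # implausible value
]


def completeness(rows):
    """Fraction of rows in which every field has a value."""
    complete = sum(all(v is not None for v in row.values()) for row in rows)
    return complete / len(rows)


def duplicate_ids(rows):
    """Ids that appear more than once (a uniqueness violation)."""
    ids = [row["id"] for row in rows]
    return sorted({i for i in ids if ids.count(i) > 1})


def out_of_range(rows, field, low, high):
    """Ids of rows whose recorded value falls outside [low, high]."""
    return [
        row["id"]
        for row in rows
        if row[field] is not None and not low <= row[field] <= high
    ]


completeness_score = completeness(records)          # 0.5
repeated = duplicate_ids(records)                   # [2]
suspicious = out_of_range(records, "age", 0, 120)   # [3]
```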
5.4 Data Cleaning
Data cleaning is the process of detecting and correcting (or removing) errors and inconsistencies in data in order to improve data quality.
It aims to identify data that is incomplete, incorrect, inaccurate, irrelevant, etc.
Data Cleaning Tasks (key point)
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
Methods to Handle Noisy Data
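Name | Explanation |
---|---|
Binning | Group the data into bins and smooth the values within each bin, removing edge values |
Regression | Fit the data to a regression function |
Clustering | Detect the elements that do not belong to any large cluster and remove them |
Combined inspection | Combine computer and human inspection |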
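As an illustration of binning, the sketch below uses equal-frequency bins and smoothing by bin means, which is one common variant (smoothing by bin medians or bin boundaries are alternatives). The function name and the sample values are assumptions chosen for the example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size`, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```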
Sensor Cleaning Pipeline
Uses the temporal and spatial characteristics of sensor data.
Step 1: Point
- Operates on: a single value of the sensor stream
- Purpose: filter individual values
① Errant (dirty / faulty) RFID tags
② Obvious outliers
③ Conversion of raw data into tuples
Step 2: Smoothing
- Purpose: interpolate (insert) lost readings
① Temporal interpolation
② Outlier detection
- Method: window-based queries
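A window-based temporal interpolation might look like the sketch below. It is a simplified illustration, not the pipeline's actual query: a lost reading is represented as `None` and is filled with the mean of the readings inside a sliding window of the preceding `window` positions. The function name and the window size are assumptions.

```python
def interpolate_missing(readings, window=2):
    """Fill each missing reading (None) with the mean of the known readings
    among the `window` positions just before it. If the window holds no
    known reading, the gap is left as None."""
    filled = []
    for i, value in enumerate(readings):
        if value is None:
            recent = [v for v in readings[max(0, i - window):i] if v is not None]
            value = sum(recent) / len(recent) if recent else None
        filled.append(value)
    return filled


stream = [10.0, 12.0, None, 14.0]
smoothed_stream = interpolate_missing(stream)  # [10.0, 12.0, 11.0, 14.0]
```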
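Step 3: Merge
- Purpose: spatial interpolation
- Example: within one spatial granule, compute the average of the readings from the different motes, ignoring individual readings that lie more than two standard deviations from the mean.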
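The merge example can be sketched as follows, assuming that "two deviations" means two population standard deviations and that all readings passed in belong to one spatial granule. The function name and the sample readings are hypothetical.

```python
import statistics


def merge_granule(readings, max_deviations=2.0):
    """Average the readings of one spatial granule, ignoring readings that
    lie more than `max_deviations` standard deviations from the mean.
    Assumes `readings` is non-empty."""
    mean = statistics.fmean(readings)
    std = statistics.pstdev(readings)
    if std == 0:
        return mean  # all readings agree; nothing to discard
    kept = [r for r in readings if abs(r - mean) <= max_deviations * std]
    return statistics.fmean(kept)


# Seven motes agree on 20.0; one faulty mote reports 36.0.
granule = [20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 36.0]
merged = merge_granule(granule)  # 20.0
```

With very few motes a single outlier inflates the standard deviation so much that it can never exceed the two-deviation threshold, so this rule only helps when a granule contains enough readings.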
Step 4: Arbitrate
- Purpose: remove
① Conflicting readings
② Duplicates (de-duplication)
Step 5: Virtualize
- Purpose: multi-source integration
5.5 Data Fusion
Concept (key point)
Data fusion combines data from multiple sources and gathers that information in order to achieve inferences that are more efficient and potentially more accurate than those achieved by means of a single source. (Likely fill-in-the-blank question.)
Sensors only give an estimate of the measured physical property.
The nature of the errors often determines the preferred fusion algorithm.
Three Processing Architectures
Name | Description |
---|---|
Data-level fusion | Direct fusion of sensor data |
Feature-level fusion | Representation of sensor data via feature vectors, with subsequent fusion of the feature vectors |
Decision-level fusion | Processing of each sensor to achieve high-level inferences or decisions, which are subsequently combined |
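For decision-level fusion, one simple way to combine per-sensor decisions is a majority vote. The notes do not say which combination rule is used, so the vote here is an assumption, as are the function name and the labels.

```python
from collections import Counter


def fuse_decisions(decisions):
    """Combine per-sensor decisions by majority vote.
    On a tie, the decision that appeared first wins."""
    return Counter(decisions).most_common(1)[0][0]


# Each sensor has already produced its own high-level decision.
sensor_decisions = ["occupied", "occupied", "empty"]
room_state = fuse_decisions(sensor_decisions)  # "occupied"
```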
Data-level Fusion
- Applicable when: the sensors are measuring the same physical phenomenon.
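When several sensors measure the same quantity, their raw readings can be fused directly. The sketch below uses inverse-variance weighting, a common choice when each sensor's error variance is known. The notes do not name a specific algorithm, so this technique, the function name, and the sample numbers are assumptions.

```python
def fuse_measurements(measurements):
    """Fuse (value, variance) pairs from sensors that measure the same
    quantity. Each reading is weighted by 1/variance, so more precise
    sensors count for more. Returns (fused value, fused variance)."""
    weights = [1.0 / variance for _, variance in measurements]
    total = sum(weights)
    value = sum(w * x for w, (x, _) in zip(weights, measurements)) / total
    return value, 1.0 / total


# Sensor A: 10.0 with variance 1.0; sensor B: 12.0 with variance 4.0.
fused_value, fused_variance = fuse_measurements([(10.0, 1.0), (12.0, 4.0)])
# fused_value is about 10.4, fused_variance is about 0.8
```

The fused variance (0.8) is smaller than that of either sensor alone, which matches the claim that combining sources can be more accurate than relying on a single one.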
5.6 Data Storage
Database System
- Database: a collection of persistent data
- Data: known facts that can be recorded and have an implicit meaning
- Database Management System (DBMS): a software system that supports the creation, population, and querying of a database
- Database System: DBMS + Database
DBMS Functions
- Define a database: specify the data types, structures, and constraints of a particular database.
- Construct or load the initial database contents on a secondary storage medium.
- Manipulate the database (insert, delete, update, and query):
① Retrieval, modification
② Accessing the database through Web applications
- Share a database: allow multiple users and programs to access the database concurrently.
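The define, construct, and manipulate functions can be seen in a few lines with Python's built-in `sqlite3` module. This is a minimal sketch using an in-memory database; the `reading` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Define: data types, structure, and constraints.
conn.execute(
    "CREATE TABLE reading ("
    "id INTEGER PRIMARY KEY, sensor TEXT NOT NULL, value REAL)"
)

# Construct: load the initial contents.
conn.executemany(
    "INSERT INTO reading (sensor, value) VALUES (?, ?)",
    [("s1", 20.5), ("s2", 21.0)],
)

# Manipulate: modification, then retrieval.
conn.execute("UPDATE reading SET value = ? WHERE sensor = ?", (22.0, "s2"))
rows = conn.execute(
    "SELECT sensor, value FROM reading ORDER BY sensor"
).fetchall()
# rows == [("s1", 20.5), ("s2", 22.0)]

conn.close()
```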
Data Storage Solutions (key point)
Name | Abbreviation | Characteristics |
---|---|---|
Direct Attached Storage | DAS | Storage devices attached directly to servers, which are the only point of access |
Network Attached Storage | NAS | More reliable than DAS; limited by LAN bandwidth |
Storage Area Network | SAN | More expensive |
5.7 Data Mining
Major Data Mining Tasks
- Classification: predict the class of an item
- Association rule discovery
- Clustering: find classes (groups) of items
- Sequential pattern discovery
- Deviation detection
- Forecasting
- Description
- Link analysis: find links and associations
Classification
Definition
Find a model for the class attribute as a function of the values of the other attributes.
Test set
A test set is used to determine the accuracy of the model.
Classification methods
- Decision tree
- Naive Bayesian classifiers
- Association rules
- Neural networks
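To make the roles of the model and the test set concrete, the sketch below trains a one-level decision tree (a "decision stump") on a single numeric attribute and measures its accuracy on held-out test data. The data, labels, and function names are made up for illustration, and a real decision tree learner would consider many attributes and levels.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(train):
    """Pick the threshold with the fewest training errors.
    `train` is a list of (attribute value, class label) pairs and must
    contain at least two distinct attribute values."""
    xs = sorted({x for x, _ in train})
    best = None
    for low, high in zip(xs, xs[1:]):
        t = (low + high) / 2
        left = majority([y for x, y in train if x <= t])
        right = majority([y for x, y in train if x > t])
        errors = sum((left if x <= t else right) != y for x, y in train)
        if best is None or errors < best[0]:
            best = (errors, t, left, right)
    _, t, left, right = best
    return t, left, right


def predict(model, x):
    t, left, right = model
    return left if x <= t else right


def accuracy(model, test):
    """Fraction of test examples the model classifies correctly."""
    return sum(predict(model, x) == y for x, y in test) / len(test)


train_set = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
             (6.0, "high"), (7.0, "high"), (8.0, "high")]
test_set = [(2.5, "low"), (5.0, "high"), (9.0, "high"), (4.0, "high")]

model = fit_stump(train_set)             # (4.5, "low", "high")
test_accuracy = accuracy(model, test_set)  # 0.75
```

The model fits the training set perfectly but misclassifies one test example, which is why accuracy is measured on data the model has not seen.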
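Clustering
Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that data points in one cluster are more similar to one another than to data points in other clusters.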
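One widely used way to group points by a similarity measure is k-means. The sketch below is a one-dimensional version in which similarity is absolute distance; the choice of k-means, the fixed starting centroids, and the sample points are all assumptions made for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Repeatedly assign each point to its nearest centroid, then move each
    centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(points, centroids=[1.0, 12.0])
# centers == [2.0, 11.0]
# groups == [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```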
5.8 Multimedia Information Processing
- Definition
Multimedia is a combination of text, graphics, sound, animation, and video that is delivered interactively to the user by electronic or digitally manipulated means.
Digital Image Processing
- Digital image
A digital image is a representation of a two-dimensional image as a finite set of digital values, called picture elements or pixels.
- Pixel values
Pixel values typically represent gray levels, colours, opacities, etc.
- Fill-in-the-blank: Remember, digitization implies that a digital image is an approximation of a real scene.
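The approximation introduced by digitization can be seen in a small sketch: continuous intensities are mapped onto a finite set of gray levels, so nearby intensities can collapse to the same pixel value. The 2x2 "scene", the 0.0 to 1.0 intensity range, and the function name are assumptions for illustration.

```python
def quantize(intensity, levels=256):
    """Map a continuous intensity in [0.0, 1.0] to the nearest of `levels`
    integer gray levels (0 .. levels - 1)."""
    clamped = min(max(intensity, 0.0), 1.0)
    return int(clamped * (levels - 1) + 0.5)


# A tiny "real scene": continuous intensities sampled on a 2x2 grid.
scene = [[0.0, 0.25],
         [0.5, 1.0]]

# The digital image: a finite set of pixel values.
image = [[quantize(v) for v in row] for row in scene]
# image == [[0, 64], [128, 255]]
```

Two scene intensities that differ by less than one gray level, such as 0.2500 and 0.2501, both map to 64, which is the sense in which the digital image only approximates the scene.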
Major Tasks of Digital Image Processing
- Improvement of pictorial information for human interpretation.
- Processing of image data for storage, transmission, and representation for autonomous machine perception.