CRISP data mining process

CRISP-DM (the Cross-Industry Standard Process for Data Mining) is a six-phase process model that naturally describes the data science life cycle. It helps you plan, organize, and implement a data science (or machine learning) project.

I: The data mining process

  • Business understanding – What does the business need?
  • Data understanding – What data do we have / need? Is it clean?
  • Data preparation – How do we organize the data for modeling?
  • Modeling – What modeling techniques should we apply?
  • Evaluation – Which model best meets the business objectives?
  • Deployment – How do stakeholders access the results?

[Figure: The Most Common Methodology]

Data science teams that combine a loose implementation of CRISP-DM with overarching team-based agile project management approaches will likely see the best results. Even teams that don’t explicitly follow CRISP-DM can still use the framework diagram to explain the differences between data science and software projects.

II: What are the six CRISP-DM phases?

2.1 Business Understanding

The Business Understanding phase focuses on understanding the objectives and requirements of the project. It has four tasks. Aside from the third task, they are foundational project management activities that are universal to most projects.

    1. Determine business objectives: You should first “thoroughly understand, from a business perspective, what the customer really wants to accomplish” (CRISP-DM Guide) and then define business success criteria.
    2. Assess situation: Determine resource availability and project requirements, assess risks and contingencies, and conduct a cost-benefit analysis.
    3. Determine data mining goals: In addition to defining the business objectives, you should also define what success looks like from a technical data mining perspective.
    4. Produce project plan: Select technologies and tools and define detailed plans for each project phase.

2.2 Data Understanding

Next is the Data Understanding phase. Adding to the foundation of Business Understanding, it drives the focus to identify, collect, and analyze the data sets that can help you accomplish the project goals. This phase also has four tasks:

  • Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.

  • Describe data: Examine the data and document its surface properties like data format, number of records, or field identities.

  • Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.

  • Verify data quality: How clean/dirty is the data?
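
The four tasks above can be sketched in a few lines of code. This is a minimal illustration using only the Python standard library, not a prescribed CRISP-DM tool. The inline CSV, its column names (customer_id, age, plan, monthly_spend), and the helper is_number are all hypothetical and exist only so the example runs on its own.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data, inlined so the sketch runs on its own.
RAW_CSV = """customer_id,age,plan,monthly_spend
1,34,basic,20.5
2,,premium,79.0
3,45,basic,not available
4,29,premium,81.2
4,29,premium,81.2
"""

# Collect initial data: load the records into the analysis tool.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Describe data: surface properties such as record count and field names.
n_records = len(rows)
fields = list(rows[0].keys())

# Explore data: a simple frequency query on one categorical field.
plan_counts = Counter(row["plan"] for row in rows)


def is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Verify data quality: count missing values, non-numeric values, duplicates.
missing_per_field = {f: sum(1 for r in rows if r[f] == "") for f in fields}
bad_spend = [r["customer_id"] for r in rows
             if r["monthly_spend"] != "" and not is_number(r["monthly_spend"])]
duplicate_rows = n_records - len({tuple(r.values()) for r in rows})

print(n_records, fields)
print(plan_counts)
print(missing_per_field, bad_spend, duplicate_rows)
```

On a real project the same questions are usually asked with a data frame library or SQL, but the checks stay the same: how many records, which fields, what distributions, and how many missing, malformed, or duplicated values.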

2.3 Data Preparation

This phase, which is often referred to as “data munging”, prepares the final data set(s) for modeling. It has five tasks:

  • Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.

  • Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage-in, garbage-out. A common practice during this task is to correct, impute, or remove erroneous values.

  • Construct data: Derive new attributes that will be helpful. For example, derive someone’s body mass index from height and weight fields.

  • Integrate data: Create new data sets by combining data from multiple sources.

  • Format data: Re-format data as necessary. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.
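
As a rough sketch of how the clean, construct, integrate, and format tasks fit together, the example below works through the body mass index case mentioned above. The records, the field names (person_id, height_cm, weight_kg), the regions lookup, and the to_float helper are all made up for illustration. Removing bad records is only one of the cleaning options; correcting or imputing values may suit a real data set better.

```python
# Hypothetical records from two sources; all names and values are made up.
measurements = [
    {"person_id": "1", "height_cm": "180", "weight_kg": "81"},
    {"person_id": "2", "height_cm": "165", "weight_kg": ""},    # missing weight
    {"person_id": "3", "height_cm": "-1", "weight_kg": "70"},   # erroneous height
    {"person_id": "4", "height_cm": "160", "weight_kg": "64"},
]
regions = {"1": "north", "2": "south", "4": "north"}


def to_float(text):
    """Format data: convert a numeric string to float, or None if unusable."""
    try:
        return float(text)
    except ValueError:
        return None


prepared = []
for record in measurements:
    height_cm = to_float(record["height_cm"])
    weight_kg = to_float(record["weight_kg"])

    # Clean data: drop records with missing or implausible values.
    if height_cm is None or weight_kg is None or height_cm <= 0 or weight_kg <= 0:
        continue

    # Construct data: derive body mass index = weight (kg) / height (m) squared.
    height_m = height_cm / 100
    bmi = round(weight_kg / height_m ** 2, 1)

    # Integrate data: attach the region from the second source.
    prepared.append({
        "person_id": record["person_id"],
        "height_cm": height_cm,
        "weight_kg": weight_kg,
        "bmi": bmi,
        "region": regions.get(record["person_id"], "unknown"),
    })

print(prepared)
```

Whatever cleaning rule you choose, document it along with the reasons for including or excluding data, so that the choice can be revisited during the review in the Evaluation phase.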

2.4 Modeling

Typical modeling tasks, both supervised and unsupervised, include classification and probability estimation, regression, similarity matching, clustering, co-occurrence grouping, profiling, link prediction, data reduction, and causal modeling.

Here you’ll likely build and assess various models based on several different modeling techniques. This phase has four tasks:

  • Select modeling techniques: Determine which algorithms to try (e.g. regression, neural net).
  • Generate test design: Depending on your modeling approach, you might need to split the data into training, test, and validation sets.
  • Build model: As glamorous as this might sound, this might just be executing a few lines of code like “reg = LinearRegression().fit(X, y)”.
  • Assess model: Generally, multiple models are competing against each other, and the data scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.

2.5 Evaluation

Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation phase looks more broadly at which model best meets the business needs and what to do next. This phase has three tasks:

  • Evaluate results: Do the models meet the business success criteria? Which one(s) should we approve for the business?
  • Review process: Review the work accomplished. Was anything overlooked? Were all steps properly executed? Summarize findings and correct anything if needed.
  • Determine next steps: Based on the previous two tasks, determine whether to proceed to deployment, iterate further, or initiate new projects.
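
One way to make the "evaluate results" and "determine next steps" tasks concrete is to write the business success criteria down as explicit thresholds and check each candidate model against them. The sketch below is purely illustrative: the model names, metric values, and thresholds (MIN_RECALL, MAX_LATENCY_MS) are invented, and real criteria come from the Business Understanding phase.

```python
# Hypothetical candidates and criteria; every name and number is made up.
candidates = {
    "logistic_regression": {"recall": 0.81, "latency_ms": 12},
    "gradient_boosting": {"recall": 0.88, "latency_ms": 95},
    "neural_net": {"recall": 0.90, "latency_ms": 240},
}

# Business success criteria, as defined during Business Understanding.
MIN_RECALL = 0.85
MAX_LATENCY_MS = 100

# Evaluate results: keep only the models that meet every business criterion.
approved = [
    name for name, metrics in candidates.items()
    if metrics["recall"] >= MIN_RECALL and metrics["latency_ms"] <= MAX_LATENCY_MS
]

# Determine next steps: deploy if something qualifies, otherwise iterate.
next_step = f"deploy {approved[0]}" if approved else "iterate further"
print(approved, next_step)
```

Note that the model with the best technical score is not necessarily the one that gets approved. Here the most accurate candidate is rejected because it misses the latency criterion.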

2.6 Deployment

A model is not particularly useful unless the customer can access its results. The complexity of this phase varies widely. This final phase has four tasks:

  • Plan deployment: Develop and document a plan for deploying the model.
  • Plan monitoring and maintenance: Develop a thorough monitoring and maintenance plan to avoid issues during the operational phase (or post-project phase) of a model.
  • Produce final report: The project team documents a summary of the project which might include a final presentation of data mining results.
  • Review project: Conduct a project retrospective about what went well, what could have been better, and how to improve in the future.