CMU 15445 2.SQL + homework1

相信大家对SQL一定都不陌生，但SQL套路十分多，今天我们一次性来梳理一下

关系语言

• Edgar Codd在20世纪70年代早期发表了关于关系模型的主要论文。他最初只定义了DBMS如何在关系模型DBMS上执行查询的数学符号。
• 用户只需要使用声明性语言（即SQL）指定他们想要的结果。 DBMS负责确定产生该答案的最有效计划。
• 关系代数基于集合（set）（无序，无重复）。 SQL基于包(bag)（无序，允许重复）。

SQL 历史

•SQL：结构化查询语言， IBM最初将其称为“SEQUEL”。
•由不同类别的命令组成：
1.数据操作语言（DML）：SELECT，INSERT，UPDATE，DELETE。
2.数据定义语言（DDL）：模式定义。
3.数据控制语言（DCL）：安全性，访问控制。
•SQL不是一种死语言。每隔几年就会更新一次新功能。 SQL-92是DBMS必须支持的最小值才能声称它们支持SQL。每个供应商都遵循这个标准，但又会在一定程度上有许多专有扩展。

各个关系型数据库系统的比较可以参考
http://troels.arvin.dk/db/rdbms/

本章主题

image.png

1. 聚合

下面用到的样例数据库

image.png

聚合函数有5种，分别为
→ AVG(col)→ Return the average col value.
→ MIN(col)→ Return minimum col value.
→ MAX(col)→ Return maximum col value.
→ SUM(col)→ Return sum of values in col.
→ COUNT(col)→ Return # of values for col.

image.png

COUNT, SUM, AVG 支持 DISTINCT

image.png

聚合之外的其他列的输出是未定义(不合法)。

image.png

GROUP BY 关键词原理

image.png

HAVING 关键词是在GROUP BY基础上做的FILTER

image.png

String操作

•SQL标准表示字符串区分大小写且仅为单引号。

image.png

•有一些函数可以处理在查询任何部分使用的字符串

image.png

模式匹配：LIKE关键字用于谓词中的字符串匹配。

“％”匹配任何子串（包括空）。
“_”匹配任何一个字符。
•连接：两个垂直条（“||”）将两个或多个字符串连接成一个字符串。

输出重定向

•我们可以让DBMS将结果存储到另一个表中，而不是将结果返回给客户端（例如，终端）。然后，您可以在后续查询中访问此数据。
•比如重定向到新表：将查询的输出存储到新（永久）表中。

image.png

•比如重定向到现有的表

image.png

输出控制

主要的关键词是ORDER BY [ASC|DESC]，还有OFFSET 和 LIMIT

image.png

嵌套QUERY

image.png

在嵌套QUERY中，有4个关键词分别是ANY, ALL, EXISTS, IN
下面这个QUERY 上图等价。

image.png

如果要查一个学生的ID 要比其他的都大，就会用到ALL了

image.png

下面是一个用IN的等价写法

image.png

NOT EXISTS 可以用来找差集

image.png

WINDOW FUNCTION

image.png

FUNC-NAME 那里我们是可以用聚合函数的，比如SUM,COUNT之类。
除了聚合函数外，还有2个特殊函数也是可以使用。
→ ROW_NUMBER()→ # of the current row
→ RANK()→ Order position of the current row.

比如ROW_NUMBER 就会有如下效果

image.png

可以结合PARTITION BY 来对每个块做函数处理

image.png

如果要找每个课程排名第一的学生可以用如下SQL来实现

image.png

公用表表达式

公用表表达式（CTE）是Windows或嵌套查询的替代方法，用于编写更复杂的查询。人们可以将CTE看作只是一个查询的临时表。
这个在HOMEWORK1 里非常有用。
•WITH子句将内部查询的输出绑定到具有该名称的临时结果。示例：生成名为“cteName”的CTE，其中包含单个元组，其中单个属性设置为“1”。底部的查询然后只返回“cteName”中的所有属性。

image.png

Home work1

CMU的第一次作业，有些SQL十分复杂，工程中是绝对不建议用这种复杂的SQL的。
一个是效率不高，另外就是不易DEBUG，SQL也很不好维护，因为可读性差。
不过当做作业，用来训练思维还都是很好的题目。
作业链接如下：
https://15445.courses.cs.cmu.edu/fall2018/homework1/

一些难的题目，我把思路写在图片里

第一题是送的，就跳过。

第二题是热身

image.png

第三题开始后就难了

image.png

sqlite> select city, (ROUND(city_trip_cnt.cnt * 1.0 / all_trip_cnt.cnt, 4)) as ratio from (select city, count(distinct(id)) as cnt from trip, station where trip.start_station_id = station_id or trip.end_station_id = station_id group by city) as city_trip_cnt, (select count(1) as cnt from trip) as all_trip_cnt order by ratio DESC, city ASC;
San Francisco|0.9011
San Jose|0.0566
Mountain View|0.0278
Palo Alto|0.0109
Redwood City|0.0052

第四题

image.png

sqlite> select city, station_name, MAX(cnt) from (select city, station_name, count(distinct(id)) as cnt from trip, station where trip.start_station_id = station_id or trip.end_station_id = station_id group by station_name, city) group by city order by city ASC;

Mountain View|Mountain View Caltrain Station|12735
Palo Alto|Palo Alto Caltrain Station|3534
Redwood City|Redwood City Caltrain Station|2654
San Francisco|San Francisco Caltrain (Townsend at 4th)|111738
San Jose|San Jose Diridon Caltrain Station|18782

第五题（最难题）

两个辅助网站
http://nethelp.wikidot.com/date-add-subtract-in-sqlite
https://www.w3resource.com/sqlite/sqlite-strftime.php

image.png

with avail_date as (select date(start_time) as tdate from trip union select date(end_time) as tdate from trip where bike_id <= 100) 
select tdate, (ROUND(SUM((strftime('%s',MIN(datetime(tdate, '+1 day'),datetime(end_time))))-(strftime('%s',MAX(datetime(tdate),datetime(start_time))))) * 1.0 / (select count(distinct (bike_id)) from trip where bike_id <= 100),4)) as utilization from avail_date, trip 
where bike_id <= 100 and datetime(tdate) < datetime(end_time) and datetime(tdate) > datetime(start_time, '-1 day')
group by tdate order by utilization desc limit 10

2014-07-13|3884.1758
2014-10-12|3398.9011
2015-02-14|2728.3516
2014-08-29|2669.011
2015-07-04|2666.3736
2014-06-23|2653.1868
2013-10-01|2634.7253
2014-05-18|2618.2418
2015-02-15|2582.6374
2014-10-11|2555.6044

第6题

这题根据HINT，并不难实现

image.png

with tar_trip as ( select * from trip where bike_id >= 100 and bike_id <= 200)
select t1.bike_id, t1.id, t1.start_time, t1.end_time, 
t2.id, t2.start_time, t2.end_time from tar_trip as t1, tar_trip as t2 where t1.bike_id == t2.bike_id and t1.id < t2.id and t1.start_time < t2.end_time and t1.end_time > t2.start_time order by t1.bike_id ASC, t1.id ASC, t2.id ASC

144|815060|2015-06-19 21:26:00|2015-06-19 22:17:00|815073|2015-06-19 22:10:00|2015-06-19 22:17:00
158|576536|2014-12-15 15:05:00|2014-12-15 23:11:00|576591|2014-12-15 16:07:00|2014-12-15 16:17:00
158|576536|2014-12-15 15:05:00|2014-12-15 23:11:00|576604|2014-12-15 16:28:00|2014-12-15 16:40:00

第7题

比较简单

image.png

select bike_id, count(distinct city) as city_cnt from trip, station 
where station_id = start_station_id or station_id = end_station_id
group by bike_id
having city_cnt > 1
order by city_cnt desc, bike_id asc

15|5
25|5
27|5
31|5
43|5
51|5
56|5
59|5
64|5
69|5
76|5
90|5
94|5
116|5
119|5
...
579|2
609|2
664|2
665|2
666|2
687|2
697|2
709|2
717|2

第8题

image.png

with weacnt as (select count(distinct date) as wea_cnt ,events from weather group by events)
select w.events, (round((count(distinct t.id) * 1.0 / wea_cnt),4)) as avg_trips 
from trip t, station s, weather w, weacnt wc 
where t.start_station_id = s.station_id and s.zip_code = w.zip_code 
and w.events = wc.events and date(t.start_time) = w.date
group by w.events
order by avg_trips desc, w.events asc

\N|801.0528
Fog|682.4375
Fog-Rain|573.0833
Rain|560.0394
Rain-Thunderstorm|491.6667
rain|488.0

第9题

先实现红圈内的SQL，然后用WITH大法，把2个结果融合为1行

image.png

with shortt as (
select round(avg(mean_temp),4) from trip t, station s, weather w
where t.start_station_id = s.station_id and s.zip_code = w.zip_code 
and w.date = date(t.start_time) and
strftime('%s',datetime(t.end_time)) - strftime('%s',datetime(t.start_time)) <= 60
),
longt as (
select round(avg(mean_temp),4) from trip t, station s, weather w
where t.start_station_id = s.station_id and s.zip_code = w.zip_code 
and w.date = date(t.start_time) and
strftime('%s',datetime(t.end_time)) - strftime('%s',datetime(t.start_time)) > 60
)
select * from shortt, longt

60.9391|61.2227

第10题

先实现红框部分，然后想怎么获得到每个ZIP_CODE的MAX，就实现了蓝框部分，发现还要带上STATION_NAME，最后做个整合

image.png

with storm_res as (
select s.zip_code, s.station_name, count(distinct t.id) as cnt
from trip t, station s, weather w 
where t.start_station_id = s.station_id and s.zip_code = w.zip_code
and date(t.start_time) = w.date and w.events = 'Rain-Thunderstorm'
group by s.zip_code,s.station_name
having cnt > 0
) select tmp.zip_code, station_name, max_cnt 
from (select zip_code, max(cnt) as max_cnt from storm_res
group by zip_code ) as tmp, storm_res 
where tmp.max_cnt = storm_res.cnt
order by tmp.zip_code asc

94107|San Francisco Caltrain (Townsend at 4th)|114

CMU 15445 2.SQL + homework1

关系语言

SQL 历史

1. 聚合

String操作

输出重定向

输出控制

嵌套QUERY

WINDOW FUNCTION

公用表表达式

Home work1

第一题是送的，就跳过。

第二题是热身

第三题开始后就难了

第四题

第五题（最难题）

第6题

第7题

第8题

第9题

第10题

推荐阅读更多精彩内容