【翻译自:https://neo4j.com/blog/common-confusions-cypher/ 】
【由Neo4j APAC授权编译发布】
本文整理自Stack Overflow上用户提到的Neo4j的图查询语言Cypher的最常见问题。
Cypher是一种直观的、充满艺术性、完全基于ASCII的查询语言,它允许您通过指定节点和关系的模式来查询属性图。虽然Cypher的简洁性让很多开发人员选择了Neo4j,但它并不能避免常见的误区。
在这篇文章中,先回顾一下我在Stack Overflow或Neo4j培训中看到的一些反复出现的问题和错误。我的所有示例都将使用Neo4j浏览器附带的电影数据库。您可以在浏览器查询框中执行“:play movies”来加载数据库。
1、LIMIT x vs. collect()[..x]
一般来说,我们知道何时使用LIMIT,何时使用collect()。
1.1 LIMIT
LIMIT 的常用用法是:“根据演过的电影数量找出前五名演员。”
MATCH (actor:Person)
RETURN actor.name, size((actor)-[:ACTED_IN]->(:Movie)) AS movies
ORDER BY movies DESC
LIMIT 5;actor.name | movies
------------------+-------
TomHanks | 12
KeanuReeves | 7
HugoWeaving | 5
JackNicholson | 5
MegRyan | 5
1.2 collect()
collect() 的常用用法是:查询电影并返回其导演集合。
MATCH (director:Person)-[:DIRECTED]->(movie:Movie)
WHERE movie.title STARTS WITH "The Matrix"
RETURN movie.title, collect(director.name) AS directors;movie.title | directors
----------------------------------+-------------------------------------
The Matrix Revolutions | ['Andy Wachowski', 'Lana Wachowski']
The Matrix | ['Lana Wachowski', 'Andy Wachowski']
The Matrix Reloaded | ['Lana Wachowski', 'Andy Wachowski']
1.3 LIMIT 和collect()
但是,当你需要对一个实体通过某种聚合进行分组,并返回前x条记录,会变得非常棘手。以下是用户碰到该问题的几个例子:
1)Cypher:选定节点限制返回的链接数
https://stackoverflow.com/questions/33914217/cypher-limiting-the-number-of-returned-links-for-a-selection-of-nodes
2)Neo4j:对子查询使用“order by”和“limit”
https://stackoverflow.com/questions/30580401/neo4j-using-order-by-and-limit-with-subqueries
假设我们想在图中找到两个年龄最大的人和他们最近出演的三部电影。这需要用到 LIMIT 和 collect() 以及 ORDER BY 的组合。
// Get the two oldest people.
MATCH (actor:Person)
WITH actor
ORDER BY actor.born
LIMIT 2// Get their three most recent movies.
MATCH (actor)-[:ACTED_IN]->(movie:Movie)
WITH actor, movie
ORDER BY movie.released DESC
RETURN actor.name, 2016 -actor.born AS age, collect(movie.title)[..3] AS movies;actor.name | age | movies
-----------------------+--------+---------------------------------------------------
Max von Sydow | 87 | ['Snow Falling on Cedars', 'What DreamsMay Come']
Gene Hackman | 86 | ['The Replacements', 'The Birdcage', 'Unforgiven]
常用查询模式是根据需要,用 ORDER BY 对数据进行排序,然后使用 collect()[..x] 为每行聚合 x 条记录。
如你所想:如果返回这些演员和电影的笛卡尔积(所有组合)怎么办?这就是 UNWIND 的作用所在,如果您想继续查询,并与已经查询的演员和电影编写更多 MATCH 语句,它就非常有用了。
// Get the two oldest people.
MATCH (actor:Person)
WITH actor
ORDER BY actor.born
LIMIT 2// Get their three most recent movies.
MATCH (actor)-[:ACTED_IN]->(movie:Movie)
WITH actor, movie
ORDER BY movie.released DESC
WITH actor, collect(movie)[..3] AS m// Unwind the collection into rows.
UNWIND m AS movie
RETURN actor.name, 2016 -actor.born AS age, movie.title;actor.name | age | movie.title
---------------------+------+-----------------------
Gene Hackman | 86 | The Replacements
Gene Hackman | 86 | The Birdcage
Gene Hackman | 86 | Unforgiven
Max von Sydow | 87 | Snow Falling on Cedars
Max von Sydow | 87 | What Dreams May Come
2、MERGE
这是最容易被误用的Cypher关键词。按照帮助文档说明的方式使用它,但 MERGE 的实际效果常常与用户期望不同。以下是用户碰到该问题的几个例子:
1)Neo4j–使用Merge依然重复
http://stackoverflow.com/questions/22520418/neo4j-duplicates-despite-using-merge
2)Cypher使用Merge出现“Node Already Exists”的问题
https://stackoverflow.com/questions/35381968/cypher-node-already-exists-issue-with-merge
3)当某个属性具有唯一性约束时如何使用MERGE
https://stackoverflow.com/questions/23971829/merge-when-one-of-the-property-has-unique-constraint
在每次的Neo4j基础培训时,我都会让学员做下面的练习,以消除困惑。假设我们在 :Person 节点的 name 属性创建唯一性约束。
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;
我们知道图中 :Person 节点中有一个属性值为 name:"Tom Hanks" 的节点。如果我们希望找到它,不存在则创建,同时添加一个新的属性 oscar_winner 。此时,用下面的查询来看看会发生什么:
MERGE (p:Person {name:"Tom Hanks", oscar_winner: true})
RETURN p.name, p.oscar_winner;
几乎所有人都说:它会找到属性为 name:"Tom Hanks"的 :Person 节点,然后添加属性 oscar_winner:true 。错了!
Node 23478 alreadyexists with label Person and property "name"=[Tom Hanks]
MERGE 匹配的是语句中指定的全部模式。当Neo4j确定 :Person 标签和 name:"Tom Hanks 属性的节点存在,而其属性 oscar_winner:true 不存在时,Neo4j尝试创建这个节点。但是属性为 name:"Tom Hanks" 的 :Person 节点已经存在,这违反了唯一性约束规则。
解决办法就是 MERGE 匹配唯一性属性,然后用 SET 更新其它的属性。
MERGE (p:Person {name:"Tom Hanks"})
SET p.oscar_winner = true
RETURN p.name, p.oscar_winner;p.name | p.oscar_winner
----------------+-----------------------
Tom Hanks | True
当然我们也可以用 ON MATCH SET 或者 ON CREATE SET,去操作查询或创建的节点。关于这个话题的更多信息,大家可以阅读下面的链接地址:
http://neo4j.com/docs/stable/query-merge.html#_use_on_create_and_on_match 。
这种方法同样适用于 MERGE 语句中的查询关系。下面我们来看一个有意思的语句,属性为name:"Tom Hanks" 的 :Person 节点已存在,而属性为 name:"Nicole White" 的 :Person 节点不存在。
MERGE (tom:Person {name:"Tom Hanks"})-[:KNOWS]->(nicole:Person {name:"Nicole White"})
RETURN tom.name, nicole.name;
通过前面的分析,我们可以发现这句话存在的问题,并猜测出问题原因所在。
Node 23478 already exists with label Person and property "name"=[Tom Hanks]
一旦Neo4j确定 MERGE 语句中指定的整个模式不存在,它就会尝试在模式中创建全部内容,包括:
1)属性为 name:"Tom Hanks" 的 :Person 节点a
2)属性为 name:" Nicole White " 的 :Person 节点a
3)两个节点a之间的 :KNOWS 关系
执行查询会抛出一个错误,2)和3) 没问题,但是 1)违反了唯一性约束规则。解决办法是在模式中可能存在或不存在的部分使用 MERGE 的最佳实践:
MERGE (tom:Person {name:"Tom Hanks"})
MERGE (nicole:Person {name:"Nicole White"})
MERGE (tom)-[:KNOWS]->(nicole)
RETURN tom.name, nicole.name;tom.name | nicole.name
----------------+-------------------
Tom Hanks | Nicole White
3、WITH
WITH 语句能够更改变量的作用范围,在使用时容易出问题。若语句中存在聚合,则会按语句中的其它变量自动分组。以下是用户碰到该问题的几个例子:
1)子节点无法正确排序和限制;WITH + OPTIONAL MATCH
https://stackoverflow.com/questions/33220696/unable-to-correctly-order-and-limit-subnodes-with-optional-match
2)聚合关系属性的Cypher查询
https://stackoverflow.com/questions/32658169/cypher-query-to-aggregate-relationship-properties
3.1 未绑定变量
如果您想在 WITH 或 WHERE 子句后面的语句中使用变量,则需要将变量加入到 WITH 子句中,移除会导致它未绑定,再次使用则会抛出错误。
例如,假设我们想找到既是 导演 又是 作家 的所有影片,先尝试写一个明显的未绑定变量问题的语句:
MATCH (p:Person)-[:DIRECTED]->(m:Movie)
WITH m, collect(p) AS directors
WHERE (p)-[:WROTE]->(m)
RETURN m.title, [x IN directors | x.name];p not defined (line 3, column 8 (offset: 79))
"WHERE (p)-[:WROTE]->(m)"
如果您想在 WITH 后的 MATCH 子句中使用该变量,则不会抛出错误,因为 MATCH 子句认为您正在绑定一个新的变量(但是这不是我们想要的)。
MATCH (p:Person)-[:DIRECTED]->(m:Movie)
WITH m, collect(p) AS directors
MATCH (p)-[:WROTE]->(m)
RETURN m.title, [x IN directors | x.name];m.title | [x IN directors | x.name]
----------------------------------+-------------------------------------
A Few Good Men | ['Rob Reiner']
Something's Gotta Give | ['Nancy Meyers']
Speed Racer | ['Andy Wachowski', 'Lana Wachowski']
Speed Racer | ['Andy Wachowski', 'Lana Wachowski']
Jerry Maguire | ['Cameron Crowe']
Top Gun | ['Tony Scott']
V for Vendetta | ['James Marshall']
V for Vendetta | ['James Marshall']
When Harry Met Sally | ['Rob Reiner']
看到上面的查询结果,有人会说返回的所有影片,它们的导演也都是作家,但结果是不正确的。实际上查询返回了有编剧影片的所有导演。变量 p 在该查询的第2行解除了绑定,第3行的 MATCH 子句返回的是与新变量 p 存在外向 :WROTE 关系的所有节点。该查询的正确书写为:
MATCH (p:Person)-[:DIRECTED]->(m:Movie)
WHERE (p)-[:WROTE]->(m)
WITH m, collect(p) AS directors
RETURN m.title, [x IN directors | x.name];m.title | [x IN directors | x.name]
---------------------------------+-------------------------------------
Something's Gotta Give | ['Nancy Meyers']
Speed Racer | ['Andy Wachowski', 'Lana Wachowski']
Jerry Maguire | ['Cameron Crowe']
3.2 自动分组
除了带变量外,如果存在聚合集,WITH 子句还自动按变量分组。WITH 常见用法是过滤聚合函数结果。例如,我们想找到演员平均年龄超过70岁的所有影片。
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH m, avg(2016 - p.born) AS avg_age
WHERE avg_age > 70
RETURN m.title, avg_age
ORDER BY avg_age DESC;m.title | avg_age
----------------------------------------------+---------
Unforgiven | 86.00
One Flew Over the Cuckoo's Nest | 76.50
The Birdcage | 70.33
第2行,WITH m, avg(...) 子句,绑定了影片的变量 m 和聚合函数 avg(),以便 WITH 子句按 m 对数据自动分组。如果此时您想返回影片和演员姓名,会出现下面的错误:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH m, p, avg(2016 - p.born) AS avg_age
WHERE avg_age > 70
RETURN m.title, avg_age, collect(p.name) AS actors
ORDER BY avg_age DESC;m.title | avg_age | actors
----------------------------------------+-------------+-----------------------------------------------------
Snow Falling on Cedars | 87.0 | ['Max von Sydow']
What Dreams May Come | 87.0 | ['Max von Sydow']
The Birdcage | 86.0 | ['Gene Hackman']
The Replacements | 86.0 | ['Gene Hackman']
Unforgiven | 86.0 | ['Gene Hackman', 'Richard Harris','Clint Eastwood']
Top Gun | 83.0 | ['Tom Skerritt']
Hoffa | 79.0 | ['Jack Nicholson']
A Few Good Men | 79.0 | ['Jack Nicholson']
As Good as It Gets | 79.0 | ['Jack Nicholson']
Something's Gotta Give | 79.0 | ['Jack Nicholson']
Frost/Nixon | 78.0 | ['Frank Langella']
The Da Vinci Code | 77.0 | ['Ian McKellen']
V for Vendetta | 76.0 | ['John Hurt']
The Green Mile | 76.0 | ['James Cromwell']
The Devil's Advocate | 76.0 | ['Al Pacino']
Snow Falling on Cedars | 76.0 | ['James Cromwell']
RescueDawn | 74.0 | ['Marshall Bell']
Stand By Me | 74.0 | ['Marshall Bell']
What Dreams May Come | 74.0 | ['Werner Herzog']
Hoffa | 73.0 | ['J.T. Walsh']
A Few Good Men | 73.0 | ['J.T. Walsh']
Hoffa | 72.0 | ['Danny DeVito']
WITH 子句中包含 p,数据也按 p 分组,avg_age 按影片和人分组,返回了错误的结果:我们正在计算一个人的平均年龄。正确的书写应该在 WITH 子句中按电影使用 collect() 聚合演员:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH m, avg(2016 - p.born) AS avg_age,collect(p.name)AS actors
WHERE avg_age > 70
RETURN m.title, avg_age, actors
ORDER BY avg_age DESC;m.title | avg_age | actors
----------------------------------------+-------------+-----------------------------------------------------
Unforgiven | 86.00 | ['Gene Hackman', 'Clint Eastwood','Richard Harris']
Flew Over the Cuckoo's Nest | 76.50 | ['Jack Nicholson', 'DannyDeVito']
The Birdcage | 70.33 | ['Gene Hackman', 'Nathan Lane','Robin Williams']
4、结束语
当然,提高Cypher能力的最好方法是多写、勤学苦练!
通过阅读他人的查询,以获取新知识和技巧也很有帮助;每周我都会学习 Michael Hunger的Cypher查询技巧。您可以注册参加在线培训(http://neo4j.com/graphacademy/online-course-getting-started/),在Cypher开发者页面(http://neo4j.com/developer/cypher/)上阅读更多内容,以及随时翻阅Cypher refcard(http://neo4j.com/docs/stable/cypher-refcard/)。
更多技术咨询:
Email:yusonglin@we-yun.com