Week2 hw1: MongoDB

Relationships in MongoDB

MongoDB can be compared to MS Excel.
Each database (db) is like a separate .xls file, and each collection is like a table in that file.
Each collection, in turn, holds many items, and each item is stored as key-value pairs.
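
To make the analogy concrete, here is a minimal sketch in plain Python. It does not talk to MongoDB at all; the nested containers only mimic the layout, and the sample items (including the `words` values) are invented for illustration.

```python
# A MongoDB server, modeled as nested Python containers (illustration only).
server = {
    "myDatabase": {                      # database   ~ one .xls file
        "itemList": [                    # collection ~ one table in that file
            {"title": "Introduction", "words": 120},      # item ~ one row
            {"title": "Conjugate priors", "words": 0},
        ],
    },
}

database_names = list(server)                       # roughly what `show dbs` lists
collection_names = list(server["myDatabase"])       # roughly what `show tables` lists
first_item = server["myDatabase"]["itemList"][0]    # one item with its keys and values

print(database_names, collection_names, first_item["title"])
```

Unlike rows in a spreadsheet, the items in one collection do not have to share the same keys.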

Start MongoDB

Type **mongod** in a terminal. The server will run on localhost, port 27017.

Basic operations in the terminal

Open another terminal tab and type **mongo** to enter the mongo shell, connected to the server running on your computer.

Check the Database

show dbs

This command lists the databases stored on your disk and shows how much space each one takes up.

use xx

xx is a database name. This command switches the current database to the one you select.

show tables

Prints all the tables (collections) in the current db.

Back up a table (collection)

Here is an example of how to back up the collection "xxCollect" into "bakCollectionName".

  1. Create an empty collection.

db.createCollection('bakCollectionName')

  2. Copy your collection into the backup collection.

db.xxCollect.copyTo('bakCollectionName')

Note that copyTo() was deprecated in MongoDB 3.0 and removed in 4.2, so this step only works on older servers.
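
A backup can also be made from Python. The sketch below assumes pymongo and uses an aggregation pipeline ending in `$out` instead of copyTo(). The function names `backup_pipeline` and `backup_collection` are made up for this example, and `db` is expected to be a pymongo Database object that you create yourself.

```python
def backup_pipeline(target_name):
    """Build an aggregation pipeline that copies every document into target_name."""
    return [
        {"$match": {}},           # select all documents
        {"$out": target_name},    # write them to target_name, replacing its contents
    ]


def backup_collection(db, source_name, target_name):
    """Copy db[source_name] into db[target_name] on the server side.

    `db` is assumed to be a pymongo Database, for example
    pymongo.MongoClient('localhost', 27017)['myDatabase'].
    """
    # aggregate() sends the pipeline to the server; with $out the
    # returned cursor is empty, so there is nothing to read back.
    db[source_name].aggregate(backup_pipeline(target_name))


print(backup_pipeline("bakCollectionName"))
```

Be aware that `$out` overwrites the target collection if it already exists, so pick a backup name that is not in use.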

Import a JSON file into MongoDB

Suppose there is a JSON file like this:

[ 
{
"title":"Introduction",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture01.pdf",
"description":""
}
,
{
"title":"Conjugate priors",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture02.pdf",
"description":"T. Griffiths and A. Yuille A primer on probabilistic inference; Chapters 8 and 9 of D. Barber Bayesian Reasoning and Machine Learning. See also this diagram of conjugate prior relationships"
}
]

It can be imported into mongo as a collection in two steps.

  1. Create an empty collection (in the mongo shell).

db.createCollection('newCollect')

  2. Use mongoimport (in the terminal).

mongoimport --db databaseName --collection newCollect --file /home/tmp/course_temp.json --jsonArray

It can also be written in short form:

mongoimport -d dbName -c collectName --jsonArray path/file.json
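
Before running the import it can help to check that the file really is a JSON array, since that is what `--jsonArray` expects. The following is a small sketch using only the standard library; the helper names `load_json_array` and `mongoimport_args` are invented here, and the sample string is a shortened stand-in for the file above.

```python
import json


def load_json_array(text):
    """Parse text and check that it is a JSON array of objects."""
    docs = json.loads(text)
    if not isinstance(docs, list) or not all(isinstance(d, dict) for d in docs):
        raise ValueError("expected a JSON array of objects")
    return docs


def mongoimport_args(db_name, collection_name, path):
    """Build the mongoimport command line as a list of arguments."""
    return ["mongoimport", "--db", db_name, "--collection", collection_name,
            "--file", path, "--jsonArray"]


sample = '[{"title": "Introduction", "description": ""}, {"title": "Conjugate priors"}]'
docs = load_json_array(sample)
args = mongoimport_args("databaseName", "newCollect", "/home/tmp/course_temp.json")
print(len(docs), " ".join(args))
```

The `args` list could be handed to `subprocess.run(args, check=True)` to run the import, or the parsed `docs` could be inserted from Python with pymongo's `insert_many`.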

Modify a table (collection) with PyMongo

Assume there is a table named itemList in a database named myDatabase.
All the code below uses functions from the pymongo module, which lets us manage MongoDB from Python.

Start a connection

import pymongo

client = pymongo.MongoClient('localhost', 27017)
myDB = client['myDatabase']
myTable = myDB['itemList']

If the database or collection doesn't exist yet, this code still works: MongoDB creates it when the first record is inserted, much like Python's open function creates a missing file in write mode.

Add a record

Every record must be a dict before it is added to the collection.

myTable.insert_one(dataDict)

Delete a record

myTable.delete_many({'words': 0})

The argument is also a dict; every item whose key and value match it is deleted. (The older myTable.remove() was deprecated in PyMongo 3 and removed in PyMongo 4.)
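
The matching rule can be illustrated without a database. This is a simplified plain-Python sketch of equality matching only; real MongoDB queries also support operators, nested fields, and arrays, which are ignored here. The function name `matches` and the sample items are made up.

```python
def matches(item, query):
    """Return True if every key in query is present in item with an equal value."""
    return all(key in item and item[key] == value for key, value in query.items())


items = [
    {"title": "a", "words": 0},
    {"title": "b", "words": 35},
    {"title": "c", "words": 0},
]
query = {"words": 0}

# Keep only the items the query does NOT match, i.e. what survives the delete.
remaining = [item for item in items if not matches(item, query)]
print(remaining)
```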

Modify a record

myTable.update_one(arg1, arg2)
e.g.
myTable.update_one({'id': 1}, {'$set': {'name': 2}})

arg1 selects the record, and arg2 is the operation to apply to it.
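
What `$set` does to a matching record can be sketched in plain Python. This is only an approximation for top-level fields (dotted paths into nested fields are not handled), and `apply_set` is a hypothetical helper, not a pymongo function.

```python
import copy


def apply_set(item, update):
    """Return a copy of item with the fields listed under '$set' overwritten or added."""
    result = copy.deepcopy(item)
    result.update(update.get("$set", {}))
    return result


before = {"id": 1, "name": 1, "price": 300}
after = apply_set(before, {"$set": {"name": 2}})
print(before, after)
```

Fields that are not mentioned under `$set` are left untouched, and a field that does not exist yet is added.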

Check a record

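myTable.find()

Called without arguments, find() returns a cursor over every record in the collection. Pass a dict to get only the matching records.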



Homework 1: Find all rooms whose price is greater than 500

Target

First, crawl the info for all rooms on the first three pages.
Second, select the rooms whose price is greater than 500.

Coding

import requests
from bs4 import BeautifulSoup
import pymongo


def getBriefFromListPage(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # print(soup.prettify())
    itemsA = soup.select('#page_list > ul > li')
    itemsB = soup.select('#page_list li i')
    infos = soup.select('#page_list li em.hiddenTxt')

    dataList = []
    for itemA, itemB, info in zip(itemsA, itemsB, infos):
        link = itemA.a.get('href')
        # image = itemA.a.img.get('src')  # images are loaded asynchronously, so the src cannot be fetched here
        title = itemA.a.img.get('title')
        price = int(itemB.string)
        otherInfo = info.get_text().replace(' ', '').replace('\n', '')
        data = {  # stored in the database as a dict
            'title': title,
            'price': price,
            'otherInfo': otherInfo,
            'link': link
        }
        dataList.append(data)
    return dataList


def putListDataInMongo(ListData, DBname, SHEETname):
    '''Put a list of dicts into the given place in the database: DBname -> SHEETname.'''
    client = pymongo.MongoClient('localhost', 27017)
    myDataBase = client[DBname]
    mysheet = myDataBase[SHEETname]
    for eachData in ListData:
        mysheet.insert_one(eachData)
    print('Already put:', len(ListData), 'records into DB.')

These are the utility functions. They are used as follows.

start_url = 'http://bj.xiaozhu.com/search-duanzufang-p{pageNumber}-0/'  # pageNumber=1 is the first page

for index in range(1, 4):
    listPageLink = start_url.format(pageNumber=index)
    listDataDict = getBriefFromListPage(listPageLink)
    print(listPageLink)
    print(listDataDict)
    putListDataInMongo(listDataDict, 'testDB', 'sheetXiaoZhu')

client = pymongo.MongoClient('localhost', 27017)
dbname = client['testDB']
sheet = dbname['sheetXiaoZhu']
for index, item in enumerate(sheet.find({'price': {'$gt': 500}})):
    print(index, item)
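
The comparison operator decides whether a room priced at exactly 500 is included: `$gt` means strictly greater, `$gte` means greater than or equal. Here is a plain-Python sketch of the difference. The `select` helper and the three sample rooms are invented for illustration and do not touch the database.

```python
import operator

# Plain-Python stand-ins for two MongoDB comparison operators.
COMPARISONS = {"$gt": operator.gt, "$gte": operator.ge}


def select(items, field, op, threshold):
    """Return the items whose field compares true against threshold under op."""
    compare = COMPARISONS[op]
    return [item for item in items if field in item and compare(item[field], threshold)]


rooms = [
    {"title": "room A", "price": 499},
    {"title": "room B", "price": 500},
    {"title": "room C", "price": 501},
]

strictly_greater = select(rooms, "price", "$gt", 500)
greater_or_equal = select(rooms, "price", "$gte", 500)
print([r["title"] for r in strictly_greater], [r["title"] for r in greater_or_equal])
```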

Meanwhile, I found that MongoDB accepts duplicate items. So I wrote a piece of code to remove the duplicates.

client = pymongo.MongoClient('localhost', 27017)
dbname = client['testDB']
sheet = dbname['sheetXiaoZhu']
allData = sheet.find()

for each in allData:
    linkAddr = each['link']
    # Delete surplus copies one at a time until a single record with this link is left
    while sheet.count_documents({'link': linkAddr}) > 1:
        sheet.delete_one({'link': linkAddr})
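
Duplicates can also be filtered out before they ever reach the database. The sketch below is one way to do that in plain Python, keeping the first record seen for each link. The helper name `dedupe_by_key` and the example.com links are made up for this example.

```python
def dedupe_by_key(records, key):
    """Return records with later repeats of the same key value dropped; order is kept."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


crawled = [
    {"link": "http://example.com/room/1", "price": 520},
    {"link": "http://example.com/room/2", "price": 300},
    {"link": "http://example.com/room/1", "price": 520},
]
unique_rooms = dedupe_by_key(crawled, "link")
print(len(crawled), len(unique_rooms))
```

Another option is a unique index, for example `sheet.create_index('link', unique=True)` in pymongo, which makes the server reject a second record with the same link. The index can only be created once the existing duplicates are gone.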

Appendix

MongoDB_Tutorial ( cn_Zh )
MongoDB_CheatSheet.pdf (en)
