The relationship in MongoDB
MongoDB can be pictured as a set of Excel files.
Each database (db) is like a separate .xls file, and each collection is like a table in that file.
Each collection, in turn, holds many items, and each item is a set of key-value pairs.
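The analogy above can be sketched with plain Python containers. This is only an illustration of the nesting, not how MongoDB stores data; the names myDatabase and itemList and the fields title, price, and tags are made up for the example.

```python
# Hypothetical names throughout; this models only the nesting, not real storage.
server = {
    "myDatabase": {            # database   ~ one .xls file
        "itemList": [          # collection ~ one table in that file
            {"title": "Room A", "price": 300},                    # item ~ one row
            {"title": "Room B", "price": 650, "tags": ["loft"]},  # keys may differ
        ],
    },
}

# Unlike rows in a spreadsheet, items in one collection need not share the same keys.
for item in server["myDatabase"]["itemList"]:
    print(sorted(item.keys()))
```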
Start MongoDB
Type **mongod** in a terminal. The server runs on localhost and listens on port 27017 by default.
Basic moves in the terminal
Open another terminal tab and type **mongo** to enter the mongo console, which connects to the server running on your computer. (In MongoDB 6.0 and later the shell is named mongosh.)
Check the databases
show dbs
This command lists the databases stored on disk and shows how much space each one takes.
use xx
xx is a database name. This command switches the current database to the one you select.
show tables
Prints all the tables (collections) in the current database.
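The same overview can be collected from Python. The sketch below assumes the pymongo package; list_layout is a hypothetical helper name, while list_database_names and list_collection_names are pymongo methods.

```python
def list_layout(client):
    """Return {database name: [collection names]} for every database on the server."""
    layout = {}
    for db_name in client.list_database_names():                   # like `show dbs`
        layout[db_name] = client[db_name].list_collection_names()  # like `show tables`
    return layout

# Usage, assuming pymongo is installed and a local server is running:
#   import pymongo
#   print(list_layout(pymongo.MongoClient('localhost', 27017)))
```

Passing the client in as an argument keeps the helper free of connection details, so it works with any server address.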
Back up a table (collection)
Here is an example of how to back up the collection "xxCollect" into "bakCollectionName".
- Create an empty collection.
db.createCollection('bakCollectionName')
- Copy your collection into the backup collection. (copyTo was deprecated in MongoDB 3.0 and removed in 4.2, so this step only works in older shells.)
db.xxCollect.copyTo('bakCollectionName')
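A backup can also be made from Python by reading every item and writing it to a second collection. This is a minimal sketch, assuming pymongo and a collection small enough to fit in memory; backup_collection is a hypothetical helper name.

```python
def backup_collection(db, source_name, backup_name):
    """Copy every item of db[source_name] into db[backup_name]; return the count."""
    documents = list(db[source_name].find())
    if documents:  # insert_many rejects an empty list
        db[backup_name].insert_many(documents)
    return len(documents)

# Usage, assuming pymongo is installed and a local server is running:
#   import pymongo
#   db = pymongo.MongoClient('localhost', 27017)['myDatabase']
#   backup_collection(db, 'xxCollect', 'bakCollectionName')
```

The copied items keep their original _id values, so running the backup twice into the same collection would raise duplicate-key errors.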
Import a JSON file into MongoDB
Suppose there is a JSON file like this:
[
{
"title":"Introduction",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture01.pdf",
"description":""
}
,
{
"title":"Conjugate priors",
"url":"http://courses.engr.illinois.edu/cs598jhm/sp2013/Slides/Lecture02.pdf",
"description":"T. Griffiths and A. Yuille A primer on probabilistic inference; Chapters 8 and 9 of D. Barber Bayesian Reasoning and Machine Learning. See also this diagram of conjugate prior relationships"
}
]
It can be imported into MongoDB as a collection in two steps.
- Create an empty collection. (mongo shell; optional, because mongoimport creates the collection if it does not exist)
db.createCollection('newCollect')
- Use mongoimport. (in the terminal)
mongoimport --db databaseName --collection newCollect --file /home/tmp/course_temp.json --jsonArray
The same command can also be written with short options:
mongoimport -d dbName -c collectName --jsonArray path/file.json
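The import can also be done from Python instead of mongoimport. The sketch below only covers reading the file; load_json_array is a hypothetical helper name, and the commented usage assumes pymongo and a local server.

```python
import json

def load_json_array(path):
    """Read a file holding one JSON array and return its items as a list."""
    with open(path, encoding='utf-8') as handle:
        items = json.load(handle)
    if not isinstance(items, list):
        raise ValueError('expected a top-level JSON array')
    return items

# Usage, assuming pymongo is installed and a local server is running:
#   import pymongo
#   collection = pymongo.MongoClient('localhost', 27017)['dbName']['newCollect']
#   collection.insert_many(load_json_array('/home/tmp/course_temp.json'))
```

The type check plays the role of --jsonArray: it makes sure the file really is one array of items before anything is written.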
Modify a table (collection) with PyMongo
Assume there is a table named itemList in a database named myDatabase.
All the code below uses functions from the pymongo module, which lets us manage MongoDB from Python.
Start a connection
import pymongo
client = pymongo.MongoClient('localhost', 27017)
myDB = client['myDatabase']
myTable = myDB['itemList']
If the database or collection doesn't exist, MongoDB creates it when the first record is inserted, much like Python's open function creates a missing file in write mode.
Add a record
Every record must be a dict before it is added to the collection.
myTable.insert_one(dataDict)
Delete a record
myTable.delete_many({'words': 0})
The argument is also a dict; every item whose key and value match it is deleted. (delete_many replaces the older remove method, which PyMongo 4 no longer provides.)
Modify a record
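myTable.update_one(arg1, arg2)
e.g.
myTable.update_one({'id': 1}, {'$set': {'name': 2}})
arg1 selects the record to change, and arg2 describes the change to apply. (update_one replaces the older update method, which PyMongo 4 no longer provides.)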
Check a record
myTable.find()
With no argument, find returns a cursor over every record in the collection.
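find also accepts a filter dict, in the same style as the selection argument used for deleting and updating. Below is a small sketch of building such a filter for a numeric range; price_filter is a hypothetical helper, and the field name price is an assumption, while $gt and $lte are MongoDB comparison operators.

```python
def price_filter(min_price=None, max_price=None):
    """Build a find() filter for a numeric 'price' field (the field name is an assumption)."""
    bounds = {}
    if min_price is not None:
        bounds['$gt'] = min_price    # strictly greater than min_price
    if max_price is not None:
        bounds['$lte'] = max_price   # less than or equal to max_price
    return {'price': bounds} if bounds else {}

# Usage with the collection from the connection step:
#   for item in myTable.find(price_filter(min_price=500)):
#       print(item)
```

An empty dict matches every record, so calling the helper with no bounds behaves like find with no argument.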
Homework 1: Find all rooms whose price is greater than 500
Target
First, crawl the info of all rooms on the first three pages;
Second, select the rooms whose price is greater than 500.
Coding
import requests
from bs4 import BeautifulSoup
import pymongo
def getBriefFromListPage(url):
    """Crawl one list page and return a list of dicts, one per room."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # print(soup.prettify())
    itemsA = soup.select('#page_list > ul > li')
    itemsB = soup.select('#page_list li i')
    infos = soup.select('#page_list li em.hiddenTxt')
    dataList = []
    for itemA, itemB, info in zip(itemsA, itemsB, infos):
        link = itemA.a.get('href')
        # image = itemA.a.img.get('src')  # images load asynchronously, so they cannot be fetched here
        title = itemA.a.img.get('title')
        price = int(itemB.string)
        otherInfo = info.get_text().replace(' ', '').replace('\n', '')
        data = {  # stored in the database as a dict
            'title': title,
            'price': price,
            'otherInfo': otherInfo,
            'link': link
        }
        dataList.append(data)
    return dataList

def putListDataInMongo(ListData, DBname, SHEETname):
    """Put a list of dicts into the given place in the database: DBname -> SHEETname."""
    client = pymongo.MongoClient('localhost', 27017)
    myDataBase = client[DBname]
    mysheet = myDataBase[SHEETname]
    for eachData in ListData:
        mysheet.insert_one(eachData)
    print('Inserted', len(ListData), 'records into the DB.')
The utility functions are defined above; they are used as follows.
start_url = 'http://bj.xiaozhu.com/search-duanzufang-p{pageNumber}-0/'  # pageNumber=1 is the first page
for index in range(1, 4):
    listPageLink = start_url.format(pageNumber=index)
    listDataDict = getBriefFromListPage(listPageLink)
    print(listPageLink)
    print(listDataDict)
    putListDataInMongo(listDataDict, 'testDB', 'sheetXiaoZhu')
client = pymongo.MongoClient('localhost', 27017)
dbname = client['testDB']
sheet = dbname['sheetXiaoZhu']
# '$gt' selects prices strictly greater than 500
for index, item in enumerate(sheet.find({'price': {'$gt': 500}})):
    print(index, item)
Meanwhile, I found that MongoDB accepts duplicate items, so I wrote a piece of code to remove the duplicates.
client = pymongo.MongoClient('localhost', 27017)
dbname = client['testDB']
sheet = dbname['sheetXiaoZhu']
allData = sheet.find()
for each in allData:
    linkAddr = each['link']
    # Count the records that share this link
    count = sheet.count_documents({'link': linkAddr})
    # Delete the current record while other copies remain, so exactly one survives
    if count > 1:
        sheet.delete_one({'_id': each['_id']})
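Duplicates can also be avoided at insert time rather than cleaned up afterwards. The sketch below is one possible approach, not the method used above: it writes each record with an upsert keyed on link, so a second record with the same link overwrites the first instead of being added. save_unique is a hypothetical helper name; update_one with upsert=True is a pymongo call.

```python
def save_unique(collection, data, key='link'):
    """Insert data, or overwrite the stored record that has the same key value."""
    collection.update_one(
        {key: data[key]},   # match on the de-duplication key
        {'$set': data},     # write all fields of the new record
        upsert=True,        # insert when nothing matches
    )

# Usage, in place of insert_one inside putListDataInMongo:
#   for eachData in ListData:
#       save_unique(mysheet, eachData)
```

A stricter alternative is a unique index, created once with collection.create_index('link', unique=True); the server then rejects any insert that repeats a link.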