Walkthrough: CS 3640, Web Scraping, HTML/CSS/Web, R

CS 3640: Introduction to Networks and Their Applications [Fall 2018]
Assignment 4 | Web Scraping: Record and Replay
Instructor: Rishab Nithyanand | Office hours: Wednesday 9-10 am or by appointment
Teaching assistant: Md. Kowsar Hossain | Office hours: Monday 1:30-2:30 pm
Released on: October 25th | Due on: November 9th (11:59:59 pm)
Maximum score: 100 | Value towards final grade: 13%

Groups
Group ID  Group Hawk IDs
1         [jblue, kzhang24, zluo1, ywang391, susmerano]
2         [xchen117, jstoltz, jpflint, godkin]
3         [mcagley, kdzhou, lye1, okueter, yitzhou]
4         [msmith3, zzhang103, yonghfan, tnlowry]
5         [mfmrphy, jmagri, trjns, jpthiede, uupadhyay]
6         [dstutz, cweiske, hrunning, nicgoh]
7         [awestemeier, nsonalkar, bzhang22, tsimonson]
8         [xiaosong, jdhatch, tgoodmn, apatrck]
9         [atran4, ymann, bchoskins, hpen]
10        [apizzimenti, jglowacki, xxing2, yzheng19]
11        [gongyzhou, ywang455, shangwchen, ppeterschmidt]
12        [sklemm, weigui, lburden, gmich]

Learning goals
This assignment is intended to familiarize you with the HTTP protocol. HTTP is (arguably) the most important application-level protocol on the Internet today: the Web runs on HTTP, and increasingly other applications use HTTP as well (including BitTorrent, streaming video, Facebook and Twitter social APIs, etc.). You will also get very familiar with webdrivers and the HAR format.

Download your VM
VM link: https://drive.google.com/file/d/1rwdZkCJS8fVLwpNLEYgAUYJUrwvBQ6jx/view?usp=sharing
- Extract the tar.gz file that you just downloaded.
- Download Virtual Box (it’s free and open source) from here: www.virtualbox.org.
- Open Virtual Box and create a new machine. See instructions below!
- The system will boot up Ubuntu 18.04. Your username and password on this system are both cs3640.

Virtual machine setup
- Click New.
- Type in the name of your new virtual machine. Select Linux as Type and Ubuntu 64bit as Version. Click next.
- Allocate at least 2GB (2048MB) RAM to your VM.
- When you’re asked about creating a disk, click the “Use an existing virtual hard disk file” option.
  The disk you should select is cs3640-assignment-4.vmdk, located in the folder you just extracted.
- That’s it! Now every time you need to boot up the VM, just open Virtual Box and select your VM.

Task I: Crawling the Web (30 points)
As part of your first task, you will learn how to programmatically scrape the Web by instrumenting a Web browser such as Chrome or Firefox to automatically load webpages for you and record content from these webpages.
Specifically, you will do the following:
- [15 points] You will write a program that will read an input file containing a list of URLs and open a Web browser to visit each of the URLs in sequence.
  - You might consider using the Selenium Webdriver Python API to do this. I’ve used it for years and it’s always been (in my opinion) the best webdriver out there.
    - https://selenium-python.readthedocs.io/getting-started.html
  - Create your own input text file named “url_list” which contains a list of URLs that the browser must visit in order. One URL per line.
- [15 points] Your program should also record all HTTP requests being issued by the browser and HTTP responses from the web servers. These requests and responses should be saved in an HTTP Archive Record (HAR) format.
  - All recorded HAR files should be saved in a “har-data” folder contained in the same working directory. This folder will contain one folder for each URL crawled.
    - All HAR files generated while crawling this URL should be placed in this folder. For example, you will have a ./har-data/ folder which will contain the HAR files recorded while visiting .
  - There are many different ways to generate a HAR file. The easiest is to use a tool which sits between the network and your browser and records this information for you.
    Another way is to use browser extensions such as Firebug + NetExport or HAR Export Trigger to do this for you (this is more complicated).
    - https://browsermob-proxy-py.readthedocs.io
    - https://selenium-python.readthedocs.io/faq.html

Task I submission instructions:
- You will submit a single file named “task-1.zip”. This is the zipped version of a folder named “task-1”.
  - This folder will contain a file named “web-scraper.py”. This file should take as input the path to the file containing the list of URLs.
  - This folder will contain a file “requirements.txt” which will contain a list of python packages that need to be installed to make your code work.
  - This folder will contain a file “install.sh” which will contain bash code to automatically install any system tools/packages required to make your code work.
  - You will not submit your “har-data” folder.
- Here is how we will evaluate this task:
  - We will run “sudo pip install -r requirements.txt”
  - We will then run “sudo chmod +x install.sh; ./install.sh”
  - We will then run “python web-scraper.py url_list” where “url_list” will be supplied by us -- expect it to at least contain www.google.com.
  - We will then check if the “har-data” folder created by your program contains the HAR files expected to be generated while crawling each of the URLs on our list.

Task II: Parsing HAR files (30 points)
As part of your second task, you will learn how to parse files in the HAR format. This will also get you very familiar with the HTTP protocol and the request/response message types.
Specifically, you will do the following:
- [15 points] You will write a program to parse HAR files and extract a list of all the requested URLs, corresponding HTTP response status codes, response content types/sizes, and host-names.
  - The input to your program will be the path to a folder containing HARs.
  - The output generated by your program will be a collection of JSON files -- 1 per host-name seen in the HTTP requests present in your input HAR files.
    These should be generated in a folder named “parsed-requests”.
    - For example: If “file-1.har” has requests to the host-names “github.io” and “godaddy.com” and “file-2.har” has requests to the host-names “github.io” and “google.com”, your program will generate 3 json files: “github.io”, “google.com”, and “godaddy.com”. These files will contain the extracted information for each request made to the corresponding host.
- [15 points] Your program should also parse the generated HAR files and recreate “html”, “png”, and “svg” content contained in all the HTTP responses.
  - The output generated by your program should be a folder named “parsed-objects” containing the “html”, “png”, and “svg” received from the HTTP responses in each HAR file.
    - There should be one folder per observed host-name in “parsed-objects”. Continuing from the above example, there should be a “parsed-objects/github.io” folder containing all “html”, “png”, and “svg” objects loaded from “github.io”.
    - You might find it useful to know that you can decode the images (from base64 text) as follows: base64 -d text.txt > decoded.png (this is the GNU coreutils form used on the Ubuntu VM; on macOS the flag is -D).

Task II submission instructions:
- You will submit a single file named “task-2.zip”. This is the zipped version of a folder named “task-2”.
  - This folder will contain a file named “har-parser.py”.
    This file should take as input the path to the folder containing the HAR files to be analyzed.
  - This folder will contain a file “requirements.txt” which will contain a list of python packages that need to be installed to make your code work.
  - This folder will contain a file “install.sh” which will contain bash code to automatically install any system tools/packages required to make your code work.
- Here is how we will evaluate this task:
  - We will run “sudo pip install -r requirements.txt”
  - We will then run “sudo chmod +x install.sh; ./install.sh”
  - We will then run “python har-parser.py task-1/har-data/www.google.com”
  - We will then check if the “parsed-requests” folder created by your program contains the expected output.
  - We will then check if the “parsed-objects” folder created by your program contains the “html”, “png”, and “svg” objects loaded from each of the web hosts.

Task III: Re-serving web content (30 points)
As part of your third task, you will learn how to serve web content and replicate the HTTP request and response process in an emulated network.
Specifically, you will do the following:
- [15 points] You will write a program to create a mininet virtual network containing one host named “client” and one “server” for each host-name contained in the input folder. The input folder will have one sub-folder for each host-name, which contains the objects expected to be served by the web server mimicking that host. Each of your emulated “web server” hosts will run an actual web server which is expected to serve these objects. Each server will write the list of files that it is able to serve to disk.
  - The input to your program will be the path to a “parsed-objects” folder derived from Task II.
  - Output: Each of the mininet web servers (1 for each host seen in the “parsed-objects” folder) will create a file named listing the files that they are able to serve.
    This output needs to be stored in a folder named “servable-content”.
- [15 points] Your program will automatically rewrite the HTTP requests exiting from your client as follows: If the request is for an object that is already being served by one of our emulated web servers, then the HTTP request should be re-written to fetch that object from that web server. Other objects may be fetched from the Web as normal (i.e., do not re-write the request URLs of objects not available through your emulated servers).
  - For example: If the client is loading “www.google.com” and the “index.html” file is already available through our emulated “google.com” web host, then the object should be fetched from there instead.
  - The input to your program will be the path to a file containing a list of URLs, one URL per line.
  - The output should be the responses to the GET requests. These responses should be written into a “get-responses” folder. Files should be named by their index in the input file.

Task III submission instructions:
- You will submit a single file named “task-3.zip”. This is the zipped version of a folder named “task-3”.
  - This folder will contain a file named “emulated-web.py”.
    This file should take as input the path to a root “parsed-objects” folder.
  - This folder will contain a file “requirements.txt” which will contain a list of python packages that need to be installed to make your code work.
  - This folder will contain a file “install.sh” which will contain bash code to automatically install any system tools/packages required to make your code work.
- Here is how we will evaluate this task:
  - We will run “sudo pip install -r requirements.txt”
  - We will then run “sudo chmod +x install.sh; ./install.sh”
  - We will then run “python emulated-web.py ./input/parsed-objects ./input/url_list”
  - We will then check if the “servable-content” folder contains the expected set of files/content based on the supplied “./input/parsed-objects” folder.
  - We will then monitor the outgoing HTTP requests from your client to see if the URLs are being rewritten as expected and if the responses in “get-responses” match the expected output.

Task IV: The credit reel (10 points)
As always, you will get 10 points for submitting a well formatted credit reel. This should be in a file named “credit-reel.txt”. Follow the same instructions as the previous assignments.

Submission instructions
Each group is to submit a single zip file (which will contain 3 zip files -- 1 per task). The submissions are due on ICON at 23:59:59 on November 9th, 2018. The last submission submitted by a team member before midnight on the due date will be the one graded unless ALL team members let the TA and me know that they want another submission to be graded (the late penalty applies if a submission made past the due date is chosen).

Late submissions
I am being generous in the amount of time allotted to this assignment to account for difficulties in scheduling meetings, etc. There will be no extensions of the due date under any circumstances.
If a submission is received past the due date, the late policy detailed on the course webpage will apply.

Team-mate feedback
Each team member may also send me an email (rishab-nithyanand@uiowa.edu) with the subject “Feedback: Assignment 4, Group N” detailing their experience working with each of their team-mates. For each team member, tell me at least one good thing and one thing they could improve. These will be anonymized and released to each individual at the end of the term. It’s important to know how to work well in a team, and early feedback before you move on to bigger and better things is always helpful. Sending feedback for all 4 assignments will earn you a 4% bonus at the end of the term. Note: Sending with an incorrect subject line means that the email will not get forwarded to the right inbox.

Reposted from: http://ass.3daixie.com/2018110131238842.html
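
Illustration for Task II: the extraction step described there (requested URLs, status codes, content types/sizes, and host-names, grouped per host-name) can be sketched roughly as below. This is a minimal sketch and not a reference solution. It assumes the HAR files follow the HAR 1.2 layout (log.entries, request.url, response.status, response.content). The names summarize_har, body_bytes, and SAMPLE_HAR are hypothetical, and the sample data is made up for the illustration.

```python
import base64
import json
from collections import defaultdict
from urllib.parse import urlsplit


def summarize_har(har):
    """Group the entries of one parsed HAR document by request host-name."""
    by_host = defaultdict(list)
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        host = urlsplit(entry["request"]["url"]).hostname or "unknown-host"
        by_host[host].append({
            "url": entry["request"]["url"],
            "status": entry["response"]["status"],
            "content_type": content.get("mimeType", ""),
            "size": content.get("size", -1),
        })
    return dict(by_host)


def body_bytes(content):
    """Return a response body as bytes, undoing base64 when the HAR marks it."""
    text = content.get("text", "")
    if content.get("encoding") == "base64":
        return base64.b64decode(text)
    return text.encode("utf-8")


# Tiny hand-written HAR fragment (made-up data) standing in for a recorded file.
SAMPLE_HAR = {"log": {"entries": [
    {"request": {"url": "https://github.io/index.html"},
     "response": {"status": 200, "content": {
         "mimeType": "text/html", "size": 13, "text": "<html></html>"}}},
    {"request": {"url": "https://github.io/logo.png"},
     "response": {"status": 200, "content": {
         "mimeType": "image/png", "size": 4, "encoding": "base64",
         "text": base64.b64encode(b"\x89PNG").decode("ascii")}}},
]}}

if __name__ == "__main__":
    # One JSON document per host-name would be written to "parsed-requests";
    # here the grouped summary is only printed.
    print(json.dumps(summarize_har(SAMPLE_HAR), indent=2))
```

In a real run, the dictionary would come from json.load on each .har file in the input folder, and body_bytes would be applied only to entries whose content type is html, png, or svg before writing them under “parsed-objects”.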
