
CS 3640: Introduction to Networks and Their Applications [Fall 2018]
Assignment 4 | Web Scraping: Record and Replay
Instructor: Rishab Nithyanand | Office hours: Wednesday 9-10 am or by appointment
Teaching assistant: Md. Kowsar Hossain | Office hours: Monday 1:30-2:30 pm
Released on: October 25th | Due on: November 9th (11:59:59 pm)
Maximum score: 100 | Value towards final grade: 13%

Groups

Group ID | Group Hawk IDs
1  | [jblue, kzhang24, zluo1, ywang391, susmerano]
2  | [xchen117, jstoltz, jpflint, godkin]
3  | [mcagley, kdzhou, lye1, okueter, yitzhou]
4  | [msmith3, zzhang103, yonghfan, tnlowry]
5  | [mfmrphy, jmagri, trjns, jpthiede, uupadhyay]
6  | [dstutz, cweiske, hrunning, nicgoh]
7  | [awestemeier, nsonalkar, bzhang22, tsimonson]
8  | [xiaosong, jdhatch, tgoodmn, apatrck]
9  | [atran4, ymann, bchoskins, hpen]
10 | [apizzimenti, jglowacki, xxing2, yzheng19]
11 | [gongyzhou, ywang455, shangwchen, ppeterschmidt]
12 | [sklemm, weigui, lburden, gmich]

Learning goals

This assignment is intended to familiarize you with the HTTP protocol. HTTP is (arguably) the most important application-level protocol on the Internet today: the Web runs on HTTP, and increasingly other applications use HTTP as well (including BitTorrent, streaming video, the Facebook and Twitter social APIs, etc.). You will also get very familiar with webdrivers and the HAR format.

Download your VM

VM link: https://drive.google.com/file/d/1rwdZkCJS8fVLwpNLEYgAUYJUrwvBQ6jx/view?usp=sharing
- Extract the tar.gz file that you just downloaded.
- Download VirtualBox (it's free and open source) from here: www.virtualbox.org.
- Open VirtualBox and create a new machine. See instructions below!
- The system will boot up Ubuntu 18.04. Your username and password on this system is cs3640.

Virtual machine setup

- Click New.
- Type in the name of your new virtual machine. Select Linux as Type and Ubuntu 64-bit as Version. Click Next.
- Allocate at least 2GB (2048MB) of RAM to your VM.
- When you're asked about creating a disk, click the "Use an existing virtual hard disk file" option. The disk you should select is cs3640-assignment-4.vmdk, located in the folder you just extracted.
- That's it! Every time you need to boot up the VM, just open VirtualBox and select your VM.

Task I: Crawling the Web (30 points)

As part of your first task, you will learn how to programmatically scrape the Web by instrumenting a Web browser such as Chrome or Firefox to automatically load webpages for you and record content from these webpages.

Specifically, you will do the following:
- [15 points] You will write a program that will read an input file containing a list of URLs and open a Web browser to visit each of the URLs in sequence.
  - You might consider using the Selenium Webdriver Python API to do this. I've used it for years and it's always been (in my opinion) the best webdriver out there.
  - https://selenium-python.readthedocs.io/getting-started.html
  - Create your own input text file named "url_list" which contains a list of URLs that the browser must visit in order. One URL per line.
- [15 points] Your program should also record all HTTP requests being issued by the browser and HTTP responses from the web servers. These requests and responses should be saved in the HTTP Archive (HAR) format.
  - All recorded HAR files should be saved in a "har-data" folder contained in the same working directory. This folder will contain one folder for each URL crawled, and all HAR files generated while crawling a URL should be placed in that URL's folder.
    - For example, you will have a ./har-data/<url> folder which will contain the HAR files recorded while visiting <url>.
  - There are many different ways to generate a HAR file. The easiest is to use a tool which sits between the network and your browser and records this information for you (see the links below, and the sketch after this list). Another way is to use browser extensions such as Firebug + NetExport or HAR Export Trigger to do this for you (this is more complicated).
  - https://browsermob-proxy-py.readthedocs.io
  - https://selenium-python.readthedocs.io/faq.html
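One possible shape for "web-scraper.py" is sketched below. It pairs Selenium with BrowserMob Proxy (the first link above); the proxy sits between the browser and the network and hands back the recorded HAR as a Python dict. The proxy binary path, the "capture.har" file name, and the use of Python 3 are illustrative assumptions, not requirements of the assignment.

# web-scraper.py -- sketch: visit each URL from an input file, record one HAR per URL.
import json
import os
import sys

from browsermobproxy import Server
from selenium import webdriver

def main(url_list_path):
    # Assumed install location of the BrowserMob Proxy binary -- adjust to yours.
    server = Server("/opt/browsermob-proxy/bin/browsermob-proxy")
    server.start()
    proxy = server.create_proxy()

    # Route Firefox's traffic through the recording proxy.
    profile = webdriver.FirefoxProfile()
    profile.set_proxy(proxy.selenium_proxy())
    driver = webdriver.Firefox(firefox_profile=profile)

    with open(url_list_path) as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        proxy.new_har(url)  # start a fresh capture for this page
        target = url if url.startswith("http") else "http://" + url
        driver.get(target)
        out_dir = os.path.join("har-data", url.replace("/", "_"))
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "capture.har"), "w") as out:
            json.dump(proxy.har, out)  # proxy.har is a plain dict

    driver.quit()
    server.stop()

if __name__ == "__main__":
    main(sys.argv[1])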
Task I submission instructions:
- You will submit a single file named "task-1.zip". This is the zipped version of a folder named "task-1".
  - This folder will contain a file named "web-scraper.py". This file should take as input the path to the file containing the list of URLs.
  - This folder will contain a file "requirements.txt" which will contain a list of Python packages that need to be installed to make your code work.
  - This folder will contain a file "install.sh" which will contain bash code to automatically install any system tools/packages required to make your code work.
  - You will not submit your "har-data" folder.
- Here is how we will evaluate this task:
  - We will run "sudo pip install -r requirements.txt".
  - We will then run "sudo chmod +x install.sh; ./install.sh".
  - We will then run "python web-scraper.py url_list", where "url_list" will be supplied by us -- expect it to at least contain www.google.com.
  - We will then check if the "har-data" folder created by your program contains the HAR files expected to be generated while crawling each of the URLs on our list.

Task II: Parsing HAR files (30 points)

As part of your second task, you will learn how to parse files in the HAR format. This will also get you very familiar with the HTTP protocol and the request/response message types.

Specifically, you will do the following:
- [15 points] You will write a program to parse HAR files and extract a list of all the requested URLs, corresponding HTTP response status codes, response content types/sizes, and host-names.
  - The input to your program will be the path to a folder containing HARs.
  - The output generated by your program will be a collection of JSON files -- one per host-name seen in the HTTP requests present in your input HAR files. These should be generated in a folder named "parsed-requests".
    - For example: if "file-1.har" has requests to the host-names "github.io" and "godaddy.com", and "file-2.har" has requests to the host-names "github.io" and "google.com", your program will generate 3 JSON files: "github.io", "google.com", and "godaddy.com". These files will contain the extracted information for each request made to the corresponding host.
- [15 points] Your program should also parse the generated HAR files and recreate the "html", "png", and "svg" content contained in all the HTTP responses.
  - The output generated by your program should be a folder named "parsed-objects" containing the "html", "png", and "svg" received from the HTTP responses in each HAR file.
    - There should be one folder per observed host-name in "parsed-objects". Continuing from the above example, there should be a "parsed-objects/github.io" folder containing all "html", "png", and "svg" objects loaded from "github.io".
  - You might find it useful to know that you can decode the images (from base64 text) as follows: base64 -d text.txt > decoded.png (the flag is -D on macOS).
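A HAR file is plain JSON: everything you need lives under log.entries[i].request and log.entries[i].response, where response.content carries the mimeType, size, and (optionally base64-encoded) body text. Below is a minimal sketch of how "har-parser.py" might look; the helper name save_object and the "object-N.ext" naming scheme are illustrative choices, not part of the spec.

# har-parser.py -- sketch: per-host request summaries plus decoded objects.
import base64
import json
import os
import sys
from urllib.parse import urlparse

# Map response MIME types to the object types we must recreate.
MIME_TO_EXT = {"text/html": "html", "image/png": "png", "image/svg+xml": "svg"}

def save_object(host, index, content):
    """Write one html/png/svg response body under parsed-objects/<host>/."""
    ext = MIME_TO_EXT.get(content.get("mimeType", "").split(";")[0])
    text = content.get("text")
    if ext is None or text is None:
        return
    out_dir = os.path.join("parsed-objects", host)
    os.makedirs(out_dir, exist_ok=True)
    # Binary bodies (e.g. png) are base64-encoded in the HAR; html usually isn't.
    if content.get("encoding") == "base64":
        data = base64.b64decode(text)
    else:
        data = text.encode("utf-8")
    with open(os.path.join(out_dir, "object-%d.%s" % (index, ext)), "wb") as f:
        f.write(data)

def main(har_dir):
    per_host = {}
    index = 0
    for root, _dirs, files in os.walk(har_dir):
        for name in sorted(files):
            if not name.endswith(".har"):
                continue
            with open(os.path.join(root, name)) as f:
                har = json.load(f)
            for entry in har["log"]["entries"]:
                url = entry["request"]["url"]
                host = urlparse(url).hostname
                content = entry["response"]["content"]
                per_host.setdefault(host, []).append({
                    "url": url,
                    "status": entry["response"]["status"],
                    "content_type": content.get("mimeType", ""),
                    "content_size": content.get("size", 0),
                })
                save_object(host, index, content)
                index += 1
    # One JSON file per host-name, named after the host as the spec requires.
    os.makedirs("parsed-requests", exist_ok=True)
    for host, records in per_host.items():
        with open(os.path.join("parsed-requests", host), "w") as f:
            json.dump(records, f, indent=2)

if __name__ == "__main__":
    main(sys.argv[1])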
Task II submission instructions:
- You will submit a single file named "task-2.zip". This is the zipped version of a folder named "task-2".
  - This folder will contain a file named "har-parser.py". This file should take as input the path to the folder containing the HAR files to be analyzed.
  - This folder will contain a file "requirements.txt" which will contain a list of Python packages that need to be installed to make your code work.
  - This folder will contain a file "install.sh" which will contain bash code to automatically install any system tools/packages required to make your code work.
- Here is how we will evaluate this task:
  - We will run "sudo pip install -r requirements.txt".
  - We will then run "sudo chmod +x install.sh; ./install.sh".
  - We will then run "python har-parser.py task-1/har-data/www.google.com".
  - We will then check if the "parsed-requests" folder created by your program contains the expected output.
  - We will then check if the "parsed-objects" folder created by your program contains the "html", "png", and "svg" objects loaded from each of the web hosts.

Task III: Re-serving web content (30 points)

As part of your third task, you will learn how to serve web content and replicate the HTTP request and response process in an emulated network.

Specifically, you will do the following:
- [15 points] You will write a program to create a mininet virtual network containing one host named "client" and one "server" for each host-name contained in the input folder. The input folder will have one sub-folder for each host-name, which contains the objects expected to be served by the web server mimicking that host. Each of your emulated "web server" hosts will run an actual web server which is expected to serve these objects. Each server will write the list of files that it is able to serve to disk. (A sketch of this sub-task follows this list.)
  - The input to your program will be the path to a "parsed-objects" folder derived from Task II.
  - Output: each of the mininet web servers (one for each host seen in the "parsed-objects" folder) will create a file named <host-name> listing the files that it is able to serve. This output needs to be stored in a folder named "servable-content".
- [15 points] Your program will automatically rewrite the HTTP requests exiting from your client as follows: if the request is for an object that is already being served by one of our emulated web servers, then the HTTP request should be re-written to fetch that object from that web server. Other objects may be fetched from the Web as normal (i.e., do not re-write the request URLs of objects not available through your emulated servers).
  - For example: if the client is loading "www.google.com" and the "index.html" file is already available through our emulated "google.com" web host, then the object should be fetched from there instead.
  - The input to your program will be the path to a file containing a list of URLs, one URL per line.
  - The output should be the responses to the GET requests. These responses should be written into a "get-responses" folder. Files should be named by their index in the input file.
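A minimal sketch of the first sub-task is below: build the topology, start one web server per host-name, and have each server write its own file listing. The single-switch layout, the srvN node names, and serving via Python's stdlib http.server are illustrative assumptions; the request-rewriting half (and the url_list argument) is left out.

# emulated-web.py -- sketch: a client plus one emulated web server per host-name.
# Assumes it is run as root (mininet requires it) and Open vSwitch is available.
import os
import sys

from mininet.net import Mininet
from mininet.node import Controller

def main(parsed_objects_dir):
    hostnames = sorted(os.listdir(parsed_objects_dir))
    net = Mininet(controller=Controller)
    net.addController("c0")
    switch = net.addSwitch("s1")
    client = net.addHost("client")
    net.addLink(client, switch)

    servers = {}
    for i, hostname in enumerate(hostnames):
        # Mininet node names must be short identifiers, hence srv0, srv1, ...
        server = net.addHost("srv%d" % i)
        net.addLink(server, switch)
        servers[hostname] = server

    net.start()
    os.makedirs("servable-content", exist_ok=True)
    for hostname, server in servers.items():
        content_dir = os.path.abspath(os.path.join(parsed_objects_dir, hostname))
        # Each server writes the list of files it can serve, then serves them.
        server.cmd("ls %s > servable-content/%s" % (content_dir, hostname))
        server.cmd("cd %s && python3 -m http.server 80 &" % content_dir)

    # The second sub-task -- rewriting the client's outgoing HTTP requests so
    # that locally available objects are fetched from these servers -- would
    # hook in here (e.g. a proxy on the client) before tearing the network down.
    net.stop()

if __name__ == "__main__":
    main(sys.argv[1])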
Task III submission instructions:
- You will submit a single file named "task-3.zip". This is the zipped version of a folder named "task-3".
  - This folder will contain a file named "emulated-web.py". This file should take as input the path to a root "parsed-objects" folder.
  - This folder will contain a file "requirements.txt" which will contain a list of Python packages that need to be installed to make your code work.
  - This folder will contain a file "install.sh" which will contain bash code to automatically install any system tools/packages required to make your code work.
- Here is how we will evaluate this task:
  - We will run "sudo pip install -r requirements.txt".
  - We will then run "sudo chmod +x install.sh; ./install.sh".
  - We will then run "python emulated-web.py ./input/parsed-objects ./input/url_list".
  - We will then check if the "servable-content" folder contains the expected set of files/content based on the supplied "./input/parsed-objects" folder.
  - We will then monitor the outgoing HTTP requests from your client to see if the URLs are being rewritten as expected and if the responses in "get-responses" match the expected output.

Task IV: The credit reel (10 points)

As always, you will get 10 points for submitting a well-formatted credit reel. This should be in a file named "credit-reel.txt". Follow the same instructions as the previous assignments.

Submission instructions

Each group is to submit a single zip file (which will contain 3 zip files -- 1 per task). The submissions are due on ICON at 23:59:59 on November 9th, 2018. The last submission made by a team member before midnight on the due date will be the one graded, unless ALL team members let the TA and me know that they want another submission to be graded (the late penalty will apply if a submission made past the due date is chosen).

Late submissions

I am being generous in the amount of time allotted to this assignment to account for difficulties in scheduling meetings, etc. There will be no extensions of the due date under any circumstances. If a submission is received past the due date, the late policy detailed on the course webpage will apply.

Team-mate feedback

Each team member may also send me an email (rishab-nithyanand@uiowa.edu) with the subject "Feedback: Assignment 4, Group N" detailing their experience working with each of their team-mates. For each team member, tell me at least one good thing and one thing they could improve. These will be anonymized and released to each individual at the end of the term. It's important to know how to work well in a team, and early feedback before you move on to bigger and better things is always helpful. Sending feedback for all 4 assignments will fetch you a 4% bonus at the end of the term. Note: sending with an incorrect subject line means that the email will not get forwarded to the right inbox.
