CS 3640: Introduction to Networks and Their Applications [Fall 2018]
Assignment 4 | Web Scraping: Record and Replay

Instructor: Rishab Nithyanand | Office hours: Wednesday 9-10 am or by appointment
Teaching assistant: Md. Kowsar Hossain | Office hours: Monday 1:30-2:30 pm
Released on: October 25th | Due on: November 9th (11:59:59 pm)
Maximum score: 100 | Value towards final grade: 13%

Groups

Group ID   Group Hawk IDs
1          [jblue, kzhang24, zluo1, ywang391, susmerano]
2          [xchen117, jstoltz, jpflint, godkin]
3          [mcagley, kdzhou, lye1, okueter, yitzhou]
4          [msmith3, zzhang103, yonghfan, tnlowry]
5          [mfmrphy, jmagri, trjns, jpthiede, uupadhyay]
6          [dstutz, cweiske, hrunning, nicgoh]
7          [awestemeier, nsonalkar, bzhang22, tsimonson]
8          [xiaosong, jdhatch, tgoodmn, apatrck]
9          [atran4, ymann, bchoskins, hpen]
10         [apizzimenti, jglowacki, xxing2, yzheng19]
11         [gongyzhou, ywang455, shangwchen, ppeterschmidt]
12         [sklemm, weigui, lburden, gmich]

Learning goals

This assignment is intended to familiarize you with the HTTP protocol. HTTP is (arguably) the most important application-level protocol on the Internet today: the Web runs on HTTP, and increasingly other applications use HTTP as well (including BitTorrent, streaming video, Facebook and Twitter social APIs, etc.). You will also get very familiar with webdrivers and the HAR format.

Download your VM

VM link: https://drive.google.com/file/d/1rwdZkCJS8fVLwpNLEYgAUYJUrwvBQ6jx/view?usp=sharing
- Extract the tar.gz file that you just downloaded.
- Download VirtualBox (it’s free and open source) from www.virtualbox.org.
- Open VirtualBox and create a new machine. See the instructions below!
- The system will boot up Ubuntu 18.04. Your username and password on this system are both cs3640.

Virtual machine setup

- Click New.
- Type in the name of your new virtual machine. Select Linux as Type and Ubuntu (64-bit) as Version. Click Next.
- Allocate at least 2GB (2048MB) RAM to your VM.
- When you’re asked about creating a disk, choose the “Use an existing virtual hard disk file” option.
  The disk you should select is cs3640-assignment-4.vmdk, located in the folder you just extracted.
- That’s it! Now every time you need to boot up the VM, just open VirtualBox and select your VM.

Task I: Crawling the Web (30 points)

As part of your first task, you will learn how to programmatically scrape the Web by instrumenting a Web browser such as Chrome or Firefox to automatically load webpages for you and record content from these webpages.

Specifically, you will do the following:
- [15 points] You will write a program that will read an input file containing a list of URLs and open a Web browser to visit each of the URLs in sequence.
  - You might consider using the Selenium Webdriver Python API to do this. I’ve used it for years and it’s always been (in my opinion) the best webdriver out there.
    - https://selenium-python.readthedocs.io/getting-started.html
  - Create your own input text file named “url_list” which contains a list of URLs that the browser must visit in order. One URL per line.
- [15 points] Your program should also record all HTTP requests issued by the browser and all HTTP responses from the web servers. These requests and responses should be saved in the HTTP Archive (HAR) format.
  - All recorded HAR files should be saved in a “har-data” folder contained in the same working directory. This folder will contain one folder for each URL crawled.
    - All HAR files generated while crawling this URL should be placed in this folder. For example, you will have a ./har-data/ folder which will contain the HAR files recorded while visiting .
  - There are many different ways to generate a HAR file. The easiest is to use a tool which sits between the network and your browser and records this information for you (see the links below).
    Another way is to use browser extensions such as Firebug + NetExport or HAR Export Trigger to do this for you (this is more complicated).
    - https://browsermob-proxy-py.readthedocs.io
    - https://selenium-python.readthedocs.io/faq.html

Task I submission instructions:
- You will submit a single file named “task-1.zip”. This is the zipped version of a folder named “task-1”.
  - This folder will contain a file named “web-scraper.py”. This file should take as input the path to the file containing the list of URLs.
  - This folder will contain a file “requirements.txt” which will contain a list of Python packages that need to be installed to make your code work.
  - This folder will contain a file “install.sh” which will contain bash code to automatically install any system tools/packages required to make your code work.
  - You will not submit your “har-data” folder.
- Here is how we will evaluate this task:
  - We will run “sudo pip install -r requirements.txt”
  - We will then run “sudo chmod +x install.sh; ./install.sh”
  - We will then run “python web-scraper.py url_list” where “url_list” will be supplied by us; expect it to at least contain www.google.com.
  - We will then check if the “har-data” folder created by your program contains the HAR files expected to be generated while crawling each of the URLs on our list.

Task II: Parsing HAR files (30 points)

As part of your second task, you will learn how to parse files in the HAR format. This will also get you very familiar with the HTTP protocol and the request/response message types.

Specifically, you will do the following:
- [15 points] You will write a program to parse HAR files and extract a list of all the requested URLs, corresponding HTTP response status codes, response content types/sizes, and host-names.
  - The input to your program will be the path to a folder containing HARs.
  - The output generated by your program will be a collection of JSON files, 1 per host-name seen in the HTTP requests present in your input HAR files.
    These should be generated in a folder named “parsed-requests”.
    - For example: If “file-1.har” has requests to the host-names “github.io” and “godaddy.com” and “file-2.har” has requests to the host-names “github.io” and “google.com”, your program will generate 3 JSON files: “github.io”, “google.com”, and “godaddy.com”. These files will contain the extracted information for each request made to the corresponding host.
- [15 points] Your program should also parse the generated HAR files and recreate the “html”, “png”, and “svg” content contained in all the HTTP responses.
  - The output generated by your program should be a folder named “parsed-objects” containing the “html”, “png”, and “svg” objects received from the HTTP responses in each HAR file.
    - There should be one folder per observed host-name in “parsed-objects”. Continuing from the above example, there should be a “parsed-objects/github.io” folder containing all “html”, “png”, and “svg” objects loaded from “github.io”.
    - You might find it useful to know that you can decode the images (from base64 text) as follows: base64 -D text.txt > decoded.png (the -D flag is the macOS spelling; with GNU coreutils on Linux, including the course VM, use the lowercase flag: base64 -d text.txt > decoded.png)

Task II submission instructions:
- You will submit a single file named “task-2.zip”. This is the zipped version of a folder named “task-2”.
  - This folder will contain a file named “har-parser.py”.
    This file should take as input the path to the folder containing the HAR files to be analyzed.
  - This folder will contain a file “requirements.txt” which will contain a list of Python packages that need to be installed to make your code work.
  - This folder will contain a file “install.sh” which will contain bash code to automatically install any system tools/packages required to make your code work.
- Here is how we will evaluate this task:
  - We will run “sudo pip install -r requirements.txt”
  - We will then run “sudo chmod +x install.sh; ./install.sh”
  - We will then run “python har-parser.py task-1/har-data/www.google.com”
  - We will then check if the “parsed-requests” folder created by your program contains the expected output.
  - We will then check if the “parsed-objects” folder created by your program contains the “html”, “png”, and “svg” objects loaded from each of the web hosts.

Task III: Re-serving web content (30 points)

As part of your third task, you will learn how to serve web content and replicate the HTTP request and response process in an emulated network.

Specifically, you will do the following:
- [15 points] You will write a program to create a Mininet virtual network containing one host named “client” and one “server” for each host-name contained in the input folder. The input folder will have one sub-folder for each host-name which contains the objects expected to be served by the web server mimicking that host. Each of your emulated “web server” hosts will run an actual web server which is expected to serve these objects. Each server will write the list of files that it is able to serve to disk.
  - The input to your program will be the path to a “parsed-objects” folder derived from Task II.
  - Output: Each of the Mininet web servers (1 for each host seen in the “parsed-objects” folder) will create a file named listing the files that they are able to serve.
    This output needs to be stored in a folder named “servable-content”.
- [15 points] Your program will automatically rewrite the HTTP requests exiting from your client as follows: If the request is for an object that is already being served by one of our emulated web servers, then the HTTP request should be rewritten to fetch that object from that web server. Other objects may be fetched from the Web as normal (i.e., do not rewrite the request URLs of objects not available through your emulated servers).
  - For example: If the client is loading “www.google.com” and the “index.html” file is already available through our emulated “google.com” web host, then the object should be fetched from there instead.
  - The input to your program will be the path to a file containing a list of URLs, one URL per line.
  - The output should be the responses to the GET requests. These responses should be written into a “get-responses” folder. Files should be named by their index in the input file.

Task III submission instructions:
- You will submit a single file named “task-3.zip”. This is the zipped version of a folder named “task-3”.
  - This folder will contain a file named “emulated-web.py”.
    This file should take as input the path to a root “parsed-objects” folder.
  - This folder will contain a file “requirements.txt” which will contain a list of Python packages that need to be installed to make your code work.
  - This folder will contain a file “install.sh” which will contain bash code to automatically install any system tools/packages required to make your code work.
- Here is how we will evaluate this task:
  - We will run “sudo pip install -r requirements.txt”
  - We will then run “sudo chmod +x install.sh; ./install.sh”
  - We will then run “python emulated-web.py ./input/parsed-objects ./input/url_list”
  - We will then check if the “servable-content” folder contains the expected set of files/content based on the supplied “./input/parsed-objects” folder.
  - We will then monitor the outgoing HTTP requests from your client to see if the URLs are being rewritten as expected and if the responses in “get-responses” match the expected output.

Task IV: The credit reel (10 points)

As always, you will get 10 points for submitting a well-formatted credit reel. This should be in a file named “credit-reel.txt”. Follow the same instructions as the previous assignments.

Submission instructions

Each group is to submit a single zip file (which will contain 3 zip files, 1 per task). The submissions are due on ICON at 23:59:59 on November 9th, 2018. The last submission submitted by a team member before midnight on the due date will be the one graded unless ALL team members let the TA and me know that they want another submission to be graded (the late penalty applies if a submission made past the due date is chosen).

Late submissions

I am being generous in the amount of time allotted to this assignment to account for difficulties in scheduling meetings, etc. There will be no extensions of the due date under any circumstances.
If a submission is received past the due date, the late policy detailed on the course webpage will apply.

Team-mate feedback

Each team member may also send me an email (rishab-nithyanand@uiowa.edu) with the subject “Feedback: Assignment 4, Group N” detailing their experience working with each of their team-mates. For each team member, tell me at least one good thing and one thing they could improve. These will be anonymized and released to each individual at the end of the term. It’s important to know how to work well in a team, and early feedback before you move on to bigger and better things is always helpful. Sending feedback for all 4 assignments will fetch you a 4% bonus at the end of the term. Note: Sending with an incorrect subject line means that the email will not get forwarded to the right inbox.
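
As a rough illustration of the Task II parsing step, the sketch below (Python, standard library only) groups the requests in one HAR document by host-name and decodes a response body that the HAR stores as base64 text. The names summarize_har, decode_body, and SAMPLE_HAR are made up for this example, and the sample data is hand-written. The field names (log.entries, request.url, response.status, response.content with mimeType, size, text, and encoding) follow the HAR 1.2 layout. Treat it as a starting point under these assumptions, not as the expected solution or output format.

```python
import base64
import json
from collections import defaultdict
from urllib.parse import urlsplit


def summarize_har(har):
    """Group per-request summaries by host-name for one parsed HAR document (a dict)."""
    by_host = defaultdict(list)
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        response = entry["response"]
        content = response.get("content", {})
        # Fall back to a placeholder if the URL has no host part (assumption).
        host = urlsplit(url).hostname or "unknown-host"
        by_host[host].append({
            "url": url,
            "status": response.get("status"),
            "content_type": content.get("mimeType"),
            "content_size": content.get("size"),
        })
    return dict(by_host)


def decode_body(content):
    """Return a response body as bytes, undoing base64 when the HAR marks it as encoded."""
    text = content.get("text", "")
    if content.get("encoding") == "base64":
        return base64.b64decode(text)
    return text.encode("utf-8")


# Tiny hand-written HAR fragment (made-up data, for illustration only).
SAMPLE_HAR = {
    "log": {"entries": [
        {"request": {"url": "https://github.io/index.html"},
         "response": {"status": 200, "content": {
             "mimeType": "text/html", "size": 13, "text": "<html></html>"}}},
        {"request": {"url": "https://github.io/logo.png"},
         "response": {"status": 200, "content": {
             "mimeType": "image/png", "size": 4, "encoding": "base64",
             "text": base64.b64encode(b"\x89PNG").decode("ascii")}}},
    ]}
}

if __name__ == "__main__":
    # Print the per-host summary as JSON, similar in spirit to "parsed-requests".
    print(json.dumps(summarize_har(SAMPLE_HAR), indent=2))
```

In a real run, the dict passed to summarize_har would come from json.load on each file in the input folder, and the bytes returned by decode_body would be written under the matching “parsed-objects” host folder.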