1
00:00:00,060 --> 00:00:00,750
Hey guys,

2
00:00:00,750 --> 00:00:05,330
welcome to Day 45 of 100 Days of Code. Now,

3
00:00:05,330 --> 00:00:08,450
today, we're going to be getting back to coding with Python,

4
00:00:08,900 --> 00:00:12,080
and we're going to be learning how to scrape the web for data

5
00:00:12,320 --> 00:00:14,810
using a module called BeautifulSoup.

6
00:00:15,770 --> 00:00:19,040
Now we've been working with APIs for quite a while now,

7
00:00:19,550 --> 00:00:20,600
and we know that

8
00:00:20,630 --> 00:00:25,630
we can use a website's API to access their data or to interact with the

9
00:00:26,120 --> 00:00:27,860
website using code.

10
00:00:28,520 --> 00:00:32,810
But some websites don't have an API or their API

11
00:00:32,810 --> 00:00:35,630
doesn't let us do all the things that we want to do.

12
00:00:36,500 --> 00:00:40,460
So this is where we start thinking about using web scraping

13
00:00:41,090 --> 00:00:44,540
where we look through the underlying HTML code

14
00:00:44,570 --> 00:00:48,110
of a website to get hold of the information that we want.

15
00:00:49,130 --> 00:00:52,880
So the aim of today is to learn how to make soup,

16
00:00:53,390 --> 00:00:57,410
but not this kind of soup. We're going to be making BeautifulSoup.

17
00:00:57,980 --> 00:00:59,960
What exactly is BeautifulSoup? Well,

18
00:00:59,990 --> 00:01:04,989
it's a module that helps developers like us make sense of websites.

19
00:01:06,170 --> 00:01:09,770
We could think of a lot of websites as a bit of a spaghetti soup,

20
00:01:10,190 --> 00:01:14,000
even something seemingly as simple as the Google front page,

21
00:01:14,270 --> 00:01:16,850
when you right click on it and view page source,

22
00:01:17,120 --> 00:01:20,060
you can see that it's horrendously complicated.

23
00:01:20,570 --> 00:01:25,100
And if you wanted to make sense of this webpage and pull out the relevant parts

24
00:01:25,100 --> 00:01:25,933
of the data,

25
00:01:26,270 --> 00:01:31,130
then you'll need an HTML parser like BeautifulSoup so that you can

26
00:01:31,130 --> 00:01:36,130
find and pull out the HTML elements that you're interested in from this

27
00:01:36,680 --> 00:01:39,140
soup of jumbled HTML code.

28
00:01:39,770 --> 00:01:42,380
And once we've mastered this skill,

29
00:01:42,500 --> 00:01:46,040
then we'll be able to take any website, for example,

30
00:01:46,070 --> 00:01:48,980
Empire's 100 Greatest Movies Of All Time,

31
00:01:49,220 --> 00:01:53,000
this is a huge list of a hundred movies that apparently everyone should have

32
00:01:53,000 --> 00:01:54,950
watched at some point in their life,

33
00:01:55,400 --> 00:01:58,460
and we can pull out the relevant parts to us

34
00:01:58,700 --> 00:02:02,540
namely the title and the ranking of each movie

35
00:02:02,840 --> 00:02:07,840
and we're going to use it to compile a list of movies that we have to watch so

36
00:02:07,880 --> 00:02:11,390
that we can look at the list, cross out the ones that we've already seen,

37
00:02:11,720 --> 00:02:16,280
and then pick at random one from the list so that we can watch all of the

38
00:02:16,280 --> 00:02:19,550
hundred movies of all time. That's the goal.

39
00:02:19,760 --> 00:02:24,290
And once you're ready head over to the next lesson and we're going to get started

40
00:02:24,350 --> 00:02:25,850
using BeautifulSoup.
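
To make the goal a bit more concrete before the next lesson, here is a rough sketch of the idea: pull the ranking and title of each movie out of some HTML, cross out the ones already seen, and pick one of the rest at random. It is only an illustration under stated assumptions. It uses Python's built-in `html.parser` as a stand-in rather than BeautifulSoup itself, and the markup, the `title` class, the placeholder movie names, and the helper names `TitleCollector`, `parse_movies` and `pick_unwatched` are all made up for this example; the real Empire page will look different.

```python
import random
from html.parser import HTMLParser

# Made-up markup for illustration only; the real Empire page is
# structured differently and the titles here are placeholders.
SAMPLE_HTML = """
<html><body>
  <h3 class="title">1) Placeholder Movie A</h3>
  <p>Some review text we do not care about.</p>
  <h3 class="title">2) Placeholder Movie B</h3>
  <h3 class="title">3) Placeholder Movie C</h3>
</body></html>
"""


class TitleCollector(HTMLParser):
    """Collects the text inside every <h3 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h3" and ("class", "title") in attrs:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self.in_title:
            self.titles[-1] += data


def parse_movies(html):
    """Return a list of (ranking, title) tuples found in the HTML."""
    collector = TitleCollector()
    collector.feed(html)
    movies = []
    for text in collector.titles:
        rank, _, title = text.strip().partition(") ")
        movies.append((int(rank), title))
    return movies


def pick_unwatched(movies, seen_titles):
    """Cross out the seen titles and pick one of the rest at random."""
    remaining = [movie for movie in movies if movie[1] not in seen_titles]
    return random.choice(remaining) if remaining else None


movies = parse_movies(SAMPLE_HTML)
print(movies)
print(pick_unwatched(movies, seen_titles={"Placeholder Movie A"}))
```

A parser library such as BeautifulSoup is meant to replace the hand-written `TitleCollector` class with a couple of lookups, which is why the course reaches for it instead of the standard library.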