1
00:00:00,510 --> 00:00:05,250
Now that we've learned how to do web scraping with requests and Beautiful Soup,

2
00:00:05,880 --> 00:00:10,880
it's time to step back for a moment and have a think about what we're allowed to

3
00:00:11,490 --> 00:00:16,490
do and what might not be a good idea when we're scraping data from other

4
00:00:16,710 --> 00:00:20,640
websites. Because after all, we don't own that data, right?

5
00:00:21,450 --> 00:00:26,370
When you think about services like Google or Bing or any other search engine,

6
00:00:26,790 --> 00:00:31,790
essentially what they're doing is they're constantly scraping data from all of the

7
00:00:32,940 --> 00:00:35,280
websites that are listed on the internet.

8
00:00:35,850 --> 00:00:39,780
And that's how they manage to get the information about what's on each page

9
00:00:40,110 --> 00:00:43,740
and for it to show up for users who use their search service.

10
00:00:44,490 --> 00:00:49,490
Now we have to step back for a moment and think about what is the law on web

11
00:00:49,920 --> 00:00:52,530
scraping? What is legal and what is illegal?

12
00:00:53,250 --> 00:00:55,530
Even as we were looking at Hacker News

13
00:00:55,530 --> 00:00:59,850
just now I noticed that one of the articles in fact talks about the Genius law

14
00:00:59,880 --> 00:01:04,170
suit with Google. And in terms of recent history,

15
00:01:04,170 --> 00:01:08,310
there's two really famous cases, which is Genius suing Google

16
00:01:08,640 --> 00:01:12,810
because they're saying that Google is scraping the song lyrics from their

17
00:01:12,810 --> 00:01:17,340
website and they're actually displaying it without taking people to Genius.

18
00:01:17,910 --> 00:01:21,270
So for example, if we're looking at the lyrics for Code Monkey,

19
00:01:21,930 --> 00:01:26,930
you can see that Google automatically shows the lyrics straight inside of

20
00:01:26,940 --> 00:01:27,773
Google.

21
00:01:27,810 --> 00:01:32,760
That means that a user can potentially just get all the information they need,

22
00:01:33,120 --> 00:01:34,410
say all of the song

23
00:01:34,410 --> 00:01:39,410
lyrics to this song, without ever needing to visit the website where this lyric

24
00:01:39,960 --> 00:01:40,793
might come from.

25
00:01:41,430 --> 00:01:45,600
And that lyric might've been compiled by somebody on Genius.

26
00:01:46,170 --> 00:01:50,190
Genius has a lyric annotation website. And of course,

27
00:01:50,280 --> 00:01:51,510
as with all websites,

28
00:01:51,510 --> 00:01:56,510
they rely on users actually visiting their website to make money or to show ads.

29
00:01:57,630 --> 00:02:00,570
And if Google simply just shows it in their search results,

30
00:02:00,870 --> 00:02:03,930
then this can be a problem for websites like Genius.

31
00:02:04,470 --> 00:02:09,470
So they sued them over this and actually ended up losing the lawsuit.

32
00:02:10,560 --> 00:02:15,560
Another really famous example of a lawsuit over scraping is hiQ versus

33
00:02:17,130 --> 00:02:22,080
LinkedIn. So hiQ was scraping data from LinkedIn to use commercially.

34
00:02:22,620 --> 00:02:27,360
So LinkedIn sued them and ended up losing in the lawsuit.

35
00:02:28,050 --> 00:02:29,250
Based on these lawsuits,

36
00:02:29,280 --> 00:02:33,960
we have a little bit of a better idea of what is legal when it comes to web

37
00:02:33,960 --> 00:02:36,180
scraping and what is not legal.

38
00:02:36,780 --> 00:02:41,460
The law actually seems to favor web scraping in the sense that you're allowed to

39
00:02:41,460 --> 00:02:43,350
scrape a website's data

40
00:02:43,920 --> 00:02:47,430
as long as you think about a couple of things.

41
00:02:48,090 --> 00:02:53,090
A lot of people have been writing about web scraping being legal based on the

42
00:02:53,370 --> 00:02:55,860
LinkedIn versus hiQ case.

43
00:02:56,250 --> 00:03:01,060
But the important thing to remember is that this is not a blanket sort of,

44
00:03:01,300 --> 00:03:04,690
you can do whatever you want, scraping any website's data.

45
00:03:05,320 --> 00:03:10,320
It only means that data that is publicly available and not copyrighted is

46
00:03:11,560 --> 00:03:16,390
probably legal for companies to scrape. Now,

47
00:03:16,420 --> 00:03:18,400
if you are using this data privately

48
00:03:18,400 --> 00:03:22,510
like we are creating some sort of service for ourselves, then it doesn't really

49
00:03:22,510 --> 00:03:24,160
matter. You're just a user.

50
00:03:24,820 --> 00:03:28,150
The difficulty comes when you're trying to commercialize that data,

51
00:03:28,150 --> 00:03:32,590
when you set up a business and your business kind of involves somebody else's

52
00:03:32,590 --> 00:03:35,890
data. That is a bit of a gray area. Now,

53
00:03:35,950 --> 00:03:38,170
the things that we definitely know are

54
00:03:38,170 --> 00:03:40,720
that you can't commercialize copyrighted content.

55
00:03:40,990 --> 00:03:45,790
So if you scrape data from YouTube and you scraped the video data,

56
00:03:45,820 --> 00:03:50,820
you can't just use that video on your own website. That is still not allowed

57
00:03:51,040 --> 00:03:56,040
because that video is copyrighted and it's created by a YouTube user and the

58
00:03:56,650 --> 00:04:00,820
copyright belongs to that user, not to you. So this is still illegal.

59
00:04:01,600 --> 00:04:05,620
This might also apply to other things like a Medium blog post that somebody else

60
00:04:05,620 --> 00:04:09,310
wrote or a piece of music that's being hosted on Spotify.

61
00:04:09,760 --> 00:04:12,430
So copyrighted content you can't commercialize.

62
00:04:13,030 --> 00:04:17,110
The second thing is that you can't scrape data that's behind authentication.

63
00:04:17,470 --> 00:04:21,100
So if you have to log into Facebook in order to scrape the data,

64
00:04:21,310 --> 00:04:22,810
then that's pretty much illegal.

65
00:04:23,380 --> 00:04:27,460
And the reason for this is when you sign up as a user to any of these services

66
00:04:27,460 --> 00:04:30,400
like Facebook or Twitter or Instagram,

67
00:04:30,820 --> 00:04:35,020
there's a policy in there that you are agreeing to when you sign up that says

68
00:04:35,050 --> 00:04:39,610
I agree to not use this data that I obtained on this website commercially.

69
00:04:40,180 --> 00:04:43,120
But the data that is not behind authentication,

70
00:04:43,420 --> 00:04:46,720
so any website that you can access as it is,

71
00:04:47,140 --> 00:04:51,490
they can't bind you to a policy because you haven't agreed to anything.

72
00:04:51,970 --> 00:04:56,260
So if the website has data that's just out there in the open that you can access

73
00:04:56,260 --> 00:05:00,430
without logging in and the content is not something that can be copyrighted,

74
00:05:00,670 --> 00:05:03,640
then it is fair game legally. Now,

75
00:05:03,670 --> 00:05:07,780
just because it's legal doesn't mean that you can actually do it.

76
00:05:08,350 --> 00:05:13,350
A lot of websites will use captcha or recaptcha in order to prevent bots like

77
00:05:13,750 --> 00:05:18,610
our Python code from getting data from their websites. Every single time

78
00:05:18,640 --> 00:05:21,850
you're agreeing to one of these captchas, it's testing

79
00:05:21,850 --> 00:05:24,100
to see whether you're actually a real human

80
00:05:24,340 --> 00:05:28,840
or if you're just a bit of code that is trying to access their data. Captcha was the

81
00:05:28,840 --> 00:05:33,340
old version where you had to type in some squiggly letters and recaptcha is the

82
00:05:33,340 --> 00:05:36,130
new version where you just have to tick a checkbox.

83
00:05:36,460 --> 00:05:38,560
And it's actually really interesting how it works.

84
00:05:39,400 --> 00:05:43,210
It looks at things like how your mouse approaches the checkbox,

85
00:05:43,210 --> 00:05:47,020
how you maybe quiver a little bit before you actually check it

86
00:05:47,260 --> 00:05:51,280
and other things like your cookies and the stored data that they have on you.

87
00:05:51,970 --> 00:05:56,170
Essentially, this service is used by websites to prevent people

88
00:05:56,230 --> 00:06:00,590
from scraping their data using a bot. The other thing to remember is that,

89
00:06:00,830 --> 00:06:03,560
you know, if you get sued by somebody like LinkedIn

90
00:06:03,560 --> 00:06:08,420
because you're using their data and you're building a business on it

91
00:06:08,450 --> 00:06:10,970
like hiQ is, then you can

92
00:06:10,970 --> 00:06:14,060
at any moment be hit with a really expensive lawsuit

93
00:06:14,450 --> 00:06:18,620
and you are going to have to pay a lot of money to lawyer up in order to contest

94
00:06:18,620 --> 00:06:20,420
this and actually to fight them in court.

95
00:06:20,930 --> 00:06:25,400
Unless you have the money to lawyer up and fight a company like LinkedIn,

96
00:06:26,000 --> 00:06:29,810
it's really important to know what are the implications of web scraping,

97
00:06:29,930 --> 00:06:33,590
especially when you're selling that data as a part of your business.

98
00:06:34,250 --> 00:06:37,040
But in addition to the sort of legal side of things,

99
00:06:37,190 --> 00:06:40,940
the other part that you should really think about is the ethics of web scraping.

100
00:06:41,390 --> 00:06:44,810
This is basically putting aside what is legal and what is illegal,

101
00:06:45,020 --> 00:06:46,640
but more thinking about what is right

102
00:06:46,640 --> 00:06:51,640
and what's wrong because let's say that you've built a website and you've got

103
00:06:51,770 --> 00:06:56,000
some sort of bot that's constantly scraping it for data, data that you know,

104
00:06:56,300 --> 00:06:58,730
has been generated by your own users

105
00:06:58,970 --> 00:07:03,950
that's really precious and that you might even charge for it, then,

106
00:07:03,980 --> 00:07:07,400
is it really right for somebody to do that?

107
00:07:07,940 --> 00:07:11,780
So I often follow the rule where if I don't want something to happen to me,

108
00:07:11,840 --> 00:07:16,160
I try to not do that to others. In terms of the ethics, a couple of things

109
00:07:16,170 --> 00:07:18,470
I would recommend abiding by is

110
00:07:18,770 --> 00:07:21,800
if you come across a website and they have a public API

111
00:07:21,860 --> 00:07:26,210
which we've already learned about and we know how to use, then always

112
00:07:26,210 --> 00:07:30,770
always go for the API. If it requires an application, then apply for it.

113
00:07:31,100 --> 00:07:35,570
Don't just go ahead and try to take their data when there's already a route for

114
00:07:35,570 --> 00:07:37,310
you to use and access their data.

115
00:07:38,480 --> 00:07:42,590
The second thing is to respect the web owner, because you know,

116
00:07:42,590 --> 00:07:46,550
you don't want somebody to access your website a million times a second,

117
00:07:46,610 --> 00:07:49,520
potentially making your website go down

118
00:07:49,670 --> 00:07:51,680
or it could count as a DDoS

119
00:07:51,680 --> 00:07:55,310
attack where it affects other users using the website.

120
00:07:56,090 --> 00:07:57,290
When you are on a website,

121
00:07:57,590 --> 00:08:02,360
they actually provide a way for you to tell what it is that you can scrape and

122
00:08:02,360 --> 00:08:02,810
what it is

123
00:08:02,810 --> 00:08:07,810
you can't. At the very end of the URL, after the .com or .co.uk,

124
00:08:08,930 --> 00:08:13,220
if you put a forward slash and put robots.txt, you can see

125
00:08:13,220 --> 00:08:18,220
this is the advice that they give to any bots that are potentially scraping

126
00:08:18,260 --> 00:08:19,093
their website.

127
00:08:19,610 --> 00:08:24,610
User agent is the person who is scraping, the person or the bot that's scraping,

128
00:08:25,280 --> 00:08:27,890
and it tells you what are the things that it disallows.

129
00:08:28,220 --> 00:08:32,690
So it doesn't want you to access the /vote?, /reply?, 

130
00:08:32,690 --> 00:08:35,299
/submitted?, /threads?.

131
00:08:35,600 --> 00:08:39,950
So basically any of these end points are ones that they don't really want you to

132
00:08:39,950 --> 00:08:41,840
use. For example,

133
00:08:41,840 --> 00:08:45,050
here I've accessed the /reply?

134
00:08:45,380 --> 00:08:48,890
which is a way to log in and reply to a particular comment.

135
00:08:49,280 --> 00:08:51,740
Now that really shouldn't be a bot kind of action

136
00:08:51,740 --> 00:08:54,230
because then it means the data that's generated

137
00:08:54,530 --> 00:08:57,690
or the replies on here will be automated, right?

138
00:08:57,690 --> 00:09:01,980
You actually want humans to comment and reply on the articles rather than some

139
00:09:01,980 --> 00:09:02,813
sort of robot.

140
00:09:03,660 --> 00:09:07,590
So these are the paths that they don't want you to access as a bot.

141
00:09:08,040 --> 00:09:10,890
And finally, it even tells you a crawl-delay.

142
00:09:10,920 --> 00:09:15,630
So this is the number of seconds that you should leave between each time you hit

143
00:09:15,630 --> 00:09:16,463
up the website.
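
As a rough sketch of how a bot could read those rules, Python's standard library has urllib.robotparser. The robots.txt text below is a made-up sample modelled on the paths mentioned above, and the crawl-delay value of 30 and the "my-scraper" name are assumptions for illustration, not the site's real settings.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, modelled on the disallowed paths
# described above. The Crawl-delay value is an assumed placeholder.
SAMPLE_ROBOTS_TXT = """\
User-Agent: *
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?
Crawl-delay: 30
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is made here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "my-scraper" is an illustrative user agent name for our bot.
can_read_news = parser.can_fetch("my-scraper", "https://news.ycombinator.com/news")
can_vote = parser.can_fetch("my-scraper", "https://news.ycombinator.com/vote?id=1")
delay_seconds = parser.crawl_delay("my-scraper")

print("May fetch /news:", can_read_news)
print("May fetch /vote?id=1:", can_vote)
print("Seconds to wait between requests:", delay_seconds)
```

For a live site you would probably call set_url() with the site's /robots.txt address and then read() instead of parse(), which does make a request.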

144
00:09:17,250 --> 00:09:22,200
If we're writing Python code and we're using Beautiful Soup and requests to

145
00:09:22,200 --> 00:09:24,210
scrape data from YCombinator,

146
00:09:24,480 --> 00:09:28,590
we could potentially get that code to run every fraction of a second, right?

147
00:09:28,590 --> 00:09:33,450
I could just write a for loop and just get this to keep scraping again and again

148
00:09:33,450 --> 00:09:34,283
and again.

149
00:09:34,350 --> 00:09:38,880
But that means that you're adding a lot of extra traffic and a lot of extra

150
00:09:38,880 --> 00:09:43,560
demand on their servers which could potentially mean that real users,

151
00:09:43,560 --> 00:09:48,560
real humans who want to access their website might not be able to do it at a fast

152
00:09:48,780 --> 00:09:51,090
speed. So this is the reason why

153
00:09:51,120 --> 00:09:53,880
when a lot of people are accessing the same website,

154
00:09:53,910 --> 00:09:58,910
say when a new ticket has been released for Glastonbury or some sort of big

155
00:09:59,100 --> 00:10:01,950
concert, that the website can go down.

156
00:10:02,010 --> 00:10:05,430
It's because a lot of servers can't cope with so much demand.

157
00:10:05,850 --> 00:10:08,070
And when that demand is coming from a for loop,

158
00:10:08,340 --> 00:10:12,150
then you can imagine that you're just adding a lot of extra work onto the web

159
00:10:12,150 --> 00:10:15,480
server. So always respect their crawl-delay

160
00:10:15,480 --> 00:10:20,190
if you see one in the robots.txt, and even if you don't see one,

161
00:10:20,280 --> 00:10:24,450
just try to limit your rate so that you don't max out their server.

162
00:10:24,840 --> 00:10:27,450
I recommend not scraping more than once a minute.
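
One way to keep to that kind of limit is to pause between requests instead of looping as fast as possible. This is only a minimal sketch: scrape_politely, its fetch parameter and the 60 second default are names and values assumed here for illustration.

```python
import time
from typing import Callable, Iterable, List


def scrape_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, waiting delay_seconds between requests.

    fetch is whatever function downloads a page (hypothetical here);
    sleep can be swapped out so the loop can be tried without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With the requests library, fetch could be something like lambda url: requests.get(url).text, and delay_seconds could be set to the crawl-delay from the site's robots.txt when there is one.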

163
00:10:28,200 --> 00:10:32,340
YCombinator's robots.txt is actually quite permissive.

164
00:10:32,370 --> 00:10:35,430
It allows you to do pretty much anything you want,

165
00:10:35,760 --> 00:10:37,950
but that's not true for all websites.

166
00:10:38,160 --> 00:10:40,260
If you look at the robots.txt for LinkedIn,

167
00:10:40,590 --> 00:10:43,770
you can see that they really don't want anyone to scrape it.

168
00:10:43,770 --> 00:10:45,450
There is a bit of legal jargon,

169
00:10:45,480 --> 00:10:49,470
there's a lot more disallows that you can see, right?

170
00:10:49,950 --> 00:10:53,820
This is probably not a website where I would scrape their data and try to build a

171
00:10:53,820 --> 00:10:54,690
company around.

172
00:10:55,620 --> 00:10:59,940
So remember that this is a piece of text that the website owners have written

173
00:11:00,180 --> 00:11:04,920
for you to look at to see what you can and can't do with their website.

174
00:11:05,280 --> 00:11:06,990
So before you scrape a website,

175
00:11:07,320 --> 00:11:12,320
always go to the root of their URL and check out their robots.txt and follow

176
00:11:14,490 --> 00:11:18,420
the ethical codes of conduct when you're trying to commercialize a project.

177
00:11:18,810 --> 00:11:22,770
So this is just a quick tip on the law and ethics of web scraping

178
00:11:22,980 --> 00:11:24,960
just so that you don't get into trouble in the future.