<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<title>TubeStats</title>
<link rel="stylesheet" href="dist/reset.css">
<link rel="stylesheet" href="dist/reveal.css">
<link rel="stylesheet" href="dist/theme/simple.css" id="theme">
<!-- Theme used for syntax highlighted code -->
<link rel="stylesheet" href="plugin/highlight/monokai.css" id="highlight-theme">
</head>
<body>
<div class="reveal">
<div class="slides">
<section data-markdown>
# TubeStats
*A hobby project: Consistency in a YouTube channel*
Shivan Sivakumaran
</section>
<section data-markdown>
## Inspiration
- Ali Abdaal
- Consistency - how consistent?
- Getting better as a beginner
- www.tubestats.app
</section>
<section data-markdown>
## What does TubeStats do?
1. Takes user input
2. Provides statistics
</section>
<section data-background-image="tubestats_parsing.gif"
data-background-size="750px">
</section>
<section data-markdown>
## 1. User input
```python
# Channel ID
'UCoOae5nYA7VqaXzerajD0lg'
# Link to channel
'https://www.youtube.com/channel/UCoOae5nYA7VqaXzerajD0lg'
# Link to video
'https://www.youtube.com/watch?v=epF2SYpWtos'
# Video ID
'epF2SYpWtos'
```
</section>
<section data-markdown>
## 2. Statistics
![](all-graph.png)
</section>
<section data-markdown>
![](time-diff.png)
</section>
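<section data-markdown>
### Measuring consistency
One way to put a number on consistency is the gap between consecutive uploads, which is presumably what the time-difference graph shows. A minimal sketch using only the standard library; `upload_gaps_in_days` is a hypothetical helper and the timestamps are made up, written in the ISO 8601 style the API uses for `publishedAt`.

```python
from datetime import datetime


def upload_gaps_in_days(published_at):
    """Return the whole-day gap between each pair of consecutive uploads."""
    dates = sorted(
        datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ") for ts in published_at
    )
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


# Made-up publish timestamps (assumption: UTC, second precision).
example = ["2021-05-01T09:00:00Z", "2021-05-08T09:00:00Z", "2021-05-11T21:00:00Z"]
gaps = upload_gaps_in_days(example)
print(gaps)
```

Small, evenly sized gaps suggest a regular upload schedule.
</section>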
<section data-markdown>
## How does TubeStats work?
### Part 1 of 2
1. How to set up a development environment?
2. How to access the video information?
3. How to store passwords and API keys?
4. How do we get and store the video statistics?
</section>
<section data-markdown>
## 1. Development environment
```bash
$ mkdir tubestats
$ cd tubestats
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ git init
```
</section>
<section data-markdown>
## 2. Video information
- use `beautifulsoup`, `scrapy`, `selenium`?
- YouTube Data API
- `google-api-python-client`
</section>
<section data-markdown>
## 3. Storing API keys
- Hard code?
- `python-dotenv`
- `.env`
- `.gitignore`
</section>
<section data-markdown>
### 3. Storing API keys
```
# .env
API_KEY=xxxxxxxx
```
```python
# tubestats/youtube_api.py
import os

from dotenv import load_dotenv

load_dotenv()
API_KEY = os.getenv('API_KEY')
```
</section>
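<section data-markdown>
### 3. Storing API keys
`os.getenv` returns `None` when the variable is missing, so a typo in `.env` only shows up later as a failed API request. A minimal sketch of failing early instead; `get_api_key` and its `environ` parameter are hypothetical and not part of TubeStats.

```python
import os


def get_api_key(name="API_KEY", environ=None):
    """Return the key stored under `name`, or raise a clear error."""
    # `environ` can be injected so the check runs without touching
    # the real process environment (hypothetical convenience).
    environ = os.environ if environ is None else environ
    key = environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to .env or export it in the shell"
        )
    return key
```

Call it once after `load_dotenv()` so a missing key stops the app at start-up.
</section>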
<section data-markdown>
## 4. Get YouTube video statistics
- Access API
- Channel upload playlist
- Video statistics
- `pandas` dataframe
</section>
<section data-markdown>
### 4. Get YouTube video statistics
```python
import os

import googleapiclient.discovery
from dotenv import load_dotenv

load_dotenv()
api_service_name = 'youtube'
api_version = 'v3'
youtube = googleapiclient.discovery.build(
    api_service_name,
    api_version,
    developerKey=os.getenv('API_KEY'))
```
</section>
<section data-markdown>
```python [|3|5-16|17-18|20-29|30-32]
# tubestats/youtube_api.py

upload_playlist_ID = channel_data['upload_playlist_ID']

video_response = []
next_page_token = None
while True:
    # obtaining video ID + titles
    playlist_request = youtube.playlistItems().list(
        part='snippet,contentDetails',
        maxResults=50, # API Limit is 50
        pageToken=next_page_token,
        playlistId=upload_playlist_ID,
        )
    playlist_response = playlist_request.execute()
    # isolating video ID
    vid_subset = [ vid_ID['contentDetails']['videoId']
                   for vid_ID in playlist_response['items'] ]
    # retrieving video statistics
    vid_info_subset_request = youtube.videos().list(
        part='snippet,contentDetails,statistics',
        id=vid_subset
        )
    vid_info_subset_response = vid_info_subset_request.execute()
    video_response.append(vid_info_subset_response)
    # obtaining page token
    next_page_token = playlist_response.get('nextPageToken') # get method used because token may not exist
    if next_page_token is None:
        break

df = pd.json_normalize(video_response, 'items')
return df
```
</section>
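<section data-markdown>
### 4. Get YouTube video statistics
The loop on the previous slide follows a general pattern: request a page, keep its items, and stop once the response has no `nextPageToken`. A minimal sketch of that pattern on its own; `fetch_all_items` and `FAKE_PAGES` are hypothetical stand-ins, so no API call is made.

```python
def fetch_all_items(fetch_page):
    """Collect the items from every page of a token-paginated API."""
    items = []
    token = None
    while True:
        response = fetch_page(token)
        items.extend(response["items"])
        # The last page carries no 'nextPageToken', so .get() returns None.
        token = response.get("nextPageToken")
        if token is None:
            break
    return items


# Stand-in for the real API: three pages keyed by page token (made-up data).
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c", "d"], "nextPageToken": "p3"},
    "p3": {"items": ["e"]},
}

all_items = fetch_all_items(lambda token: FAKE_PAGES[token])
print(all_items)
```

Passing the page fetcher in as a function keeps the loop testable without network access.
</section>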
<section data-markdown>
### Video statistics
![](dataframe.png)
</section>
<section data-markdown>
## How does TubeStats work?
### Part 2 of 2
5. How to organise the code?
6. How to test the code?
7. How to display the data and allow interaction?
8. How to account for variable input?
</section>
<section data-markdown>
## 5. Organising code
```bash [1-9|10-15|16-20|21]
tubestats/
├── data
│   ├── channel_data.pkl
│   └── video_data.pkl
├── LICENSE
├── Procfile
├── README.MD
├── requirements.txt
├── setup.sh
├── tests
│   ├── __init__.py
│   ├── test_settings.py
│   ├── test_youtube_api.py
│   ├── test_youtube_data.py
│   └── test_youtube_parser.py
├── tubestats
│   ├── __init__.py
│   ├── youtube_api.py
│   ├── youtube_data.py
│   └── youtube_parser.py
└── youtube_presenter.py
```
</section>
<section data-markdown>
## 6. Testing
```python [|15-20]
# tests/test_youtube_api.py

from tubestats.youtube_api import create_api, YouTubeAPI
from tests.test_settings import set_channel_ID_test_case

from pathlib import Path
import pytest
import googleapiclient.discovery
import pandas

def test_create_api():
    youtube = create_api()
    assert isinstance(youtube, googleapiclient.discovery.Resource)

@pytest.fixture()
def youtubeapi():
    channel_ID = set_channel_ID_test_case()
    yt = YouTubeAPI(channel_ID)
    return yt

def test_get_video_data(youtubeapi):
    df = youtubeapi.get_video_data()
    assert isinstance(df, pandas.DataFrame)

    # saving video data to save API calls for later testing
    BASE_DIR = Path(__file__).parent.parent
    df.to_pickle(BASE_DIR / 'data' / 'video_data.pkl')
```
</section>
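<section data-markdown>
### 6. Testing
Pickling results is one way to save API calls. Another option, sketched here under assumptions, is to replace the client with a `unittest.mock.MagicMock` so that tests never reach the network. `count_videos` is a hypothetical helper and the canned response is made up.

```python
from unittest.mock import MagicMock


def count_videos(youtube, video_ids):
    """Hypothetical helper: request the videos and count the results."""
    request = youtube.videos().list(part="statistics", id=",".join(video_ids))
    response = request.execute()
    return len(response["items"])


# The mock stands in for the client object built by googleapiclient.
fake_youtube = MagicMock()
fake_youtube.videos.return_value.list.return_value.execute.return_value = {
    "items": [{"id": "epF2SYpWtos", "statistics": {"viewCount": "100"}}]
}

video_count = count_videos(fake_youtube, ["epF2SYpWtos"])
print(video_count)
```

The mock also records its calls, so a test can check which parameters were sent.
</section>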
<section data-markdown>
## 7. Sharing with the world
- graphs with tool tips, `altair`
- creating interaction with `streamlit`
- hosting on Heroku
</section>
<section data-markdown>
### 7. Sharing with the world
```python []
# tubestats/youtube_data.py
import altair as alt
import pandas as pd

def scatter_all_videos(self, df: pd.DataFrame) -> alt.Chart:
    df_views = df
    # raw strings keep the backslash that escapes '.' in nested field names
    c = alt.Chart(df_views, title='Plot of videos over time').mark_point().encode(
        x=alt.X(r'snippet\.publishedAt_REFORMATED:T', axis=alt.Axis(title='Date Published'), scale=alt.Scale(type='log')),
        y=alt.Y(r'statistics\.viewCount:Q', axis=alt.Axis(title='View Count')),
        color=alt.Color(r'statistics\.like-dislike-ratio:Q', scale=alt.Scale(scheme='turbo'), legend=None),
        tooltip=[r'snippet\.title:N', r'statistics\.viewCount:Q', r'statistics\.like-dislike-ratio:Q'],
        size=alt.Size(r'statistics\.viewCount:Q', legend=None)
    )
    return c
```
</section>
<section data-markdown>
### 7. Sharing with the world
```python [|5-14|16-19]
# youtube_presenter.py
from datetime import datetime, timedelta
import streamlit as st

def date_slider(date_end=datetime.today()):
    date_start, date_end = st.slider(
        'Select date range to include:',
        min_value=first_video_date, # first video
        max_value=last_video_date, # most recent video
        value=(first_video_date, last_video_date), # default: full range
        step=timedelta(days=2),
        format='YYYY-MM-DD',
        key=999)
    return date_start, date_end

date_start, date_end = date_slider()
transformed_df = youtuber_data.transform_dataframe(date_start=date_start, date_end=date_end)
c = youtuber_data.scatter_all_videos(transformed_df)
st.altair_chart(c, use_container_width=True)
```
</section>
<section data-markdown>
### 7. Sharing with the world
```shell []
$ streamlit run youtube_presenter.py
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://192.0.0.0.1
```
</section>
<section data-markdown>
![](all-graph.png)
</section>
<section data-markdown>
### 7. Sharing with the world
```bash
(venv) $ pip freeze > requirements.txt
```
```bash
# setup.sh
mkdir -p ~/.streamlit/
echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml
```
```bash
# Procfile
web: sh setup.sh && streamlit run youtube_presenter.py
```
```bash
$ heroku login
$ heroku create tubestats
$ git push heroku main
```
</section>
<section data-markdown>
## 8. Different user input
- taking video ID
- URL links
- using regex, `re` module
</section>
<section data-markdown>
### 8. Different user input
```python []
import re
LINK_MATCH = r'(^.*youtu)(\.be|be\.com)(\/watch\?v\=|\/)([a-zA-Z0-9_-]+)(\/)?([a-zA-Z0-9_-]+)?'
m = re.search(LINK_MATCH, for_parse)
video_id = m.group(4) # video ID
if video_id == 'channel':
    return m.group(6) # Channel ID
elif video_id == 'user':
    channel_username = m.group(6) # Channel Username
```
</section>
<section data-markdown>
### 8. Different user input
![](regex.png)
</section>
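<section data-markdown>
### 8. Different user input
The regex only matches links, so bare IDs need a separate check. A minimal sketch that sorts all four input forms from the "User input" slide; `classify_input` is hypothetical, and the length rules (24 characters starting with `UC` for channel IDs, 11 characters for video IDs) are assumptions rather than anything the API guarantees.

```python
import re

LINK_MATCH = r'(^.*youtu)(\.be|be\.com)(\/watch\?v\=|\/)([a-zA-Z0-9_-]+)(\/)?([a-zA-Z0-9_-]+)?'


def classify_input(for_parse):
    """Return a (kind, value) pair for a channel/video ID or link."""
    m = re.search(LINK_MATCH, for_parse)
    if m is None:
        # Not a link. Assumption: channel IDs are 24 characters starting
        # with 'UC' and video IDs are 11 characters.
        if len(for_parse) == 24 and for_parse.startswith('UC'):
            return ('channel_id', for_parse)
        if len(for_parse) == 11:
            return ('video_id', for_parse)
        raise ValueError(f'Unrecognised input: {for_parse!r}')
    if m.group(4) == 'channel':
        return ('channel_id', m.group(6))
    if m.group(4) == 'user':
        return ('username', m.group(6))
    return ('video_id', m.group(4))


print(classify_input('https://www.youtube.com/watch?v=epF2SYpWtos'))
```

A video ID still has to be resolved to its channel with a further API call.
</section>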
<section data-markdown>
## Things to consider for the future
- Error handling
- Maxing out API calls
- Comparing different channels
- DataFrame and memory
</section>
<section data-markdown>
### DataFrame immutability and memory?
```python []
df = self.df
df = df[['snippet.publishedAt',
         'snippet.title',
         ...
         'statistics.favoriteCount',
         'statistics.commentCount']]
df = df.fillna(0)
df = df.astype({'statistics.viewCount': 'int',
                ...
                'statistics.commentCount': 'int',})
df['statistics.viewCount_NLOG'] = df['statistics.viewCount'].apply(lambda x : np.log(x))
df = df.sort_values(by='snippet.publishedAt_REFORMATED', ascending=True)
```
</section>
<section data-markdown>
## What did I learn?
- Project-based learning
- 'minimum viable product'
</section>
<section data-markdown>
## Conclusion
- Analysing consistency
- YouTube Data API --> pandas --> altair --> Heroku
- Share your work!
</section>
<section data-markdown>
## Acknowledgements
- Menno Finlay-Smits
</section>
</div>
</div>
<script src="dist/reveal.js"></script>
<script src="plugin/notes/notes.js"></script>
<script src="plugin/markdown/markdown.js"></script>
<script src="plugin/highlight/highlight.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
});
</script>
</body>
</html>