<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

<title>TubeStats</title>

<link rel="stylesheet" href="dist/reset.css">
<link rel="stylesheet" href="dist/reveal.css">
<link rel="stylesheet" href="dist/theme/simple.css" id="theme">

<!-- Theme used for syntax highlighted code -->
<link rel="stylesheet" href="plugin/highlight/monokai.css" id="highlight-theme">
</head>
<body>
<div class="reveal">
<div class="slides">
<section data-markdown>
# TubeStats
*A hobby project: Consistency in a YouTube channel*

Shivan Sivakumaran
</section>
<section data-markdown>
## Inspiration
- Ali Abdaal
- Consistency - how consistent?
- Getting better as a beginner
- www.tubestats.app
</section>
<section data-markdown>
## What does TubeStats do?
1. Takes user input
2. Provides statistics
</section>
<section data-background-image="tubestats_parsing.gif"
data-background-size="750px">
</section>
<section data-markdown>
## 1. User input
```python
# Channel ID
'UCoOae5nYA7VqaXzerajD0lg'
# Link to channel
'https://www.youtube.com/channel/UCoOae5nYA7VqaXzerajD0lg'
# Link to video
'https://www.youtube.com/watch?v=epF2SYpWtos'
# Video ID
'epF2SYpWtos'
```
</section>
<section data-markdown>
## 2. Statistics
![](all-graph.png)
</section>
<section data-markdown>
![](time-diff.png)
</section>
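<section data-markdown>
### 2. Statistics

A minimal sketch of one way to put a number on consistency, assuming it means the number of days between consecutive uploads (which is what the time-difference graph suggests). The upload dates and the `upload_gaps` helper are hypothetical and not taken from TubeStats.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical upload dates; real ones would come from snippet.publishedAt.
uploads = [date(2021, 1, 1), date(2021, 1, 8), date(2021, 1, 15), date(2021, 1, 29)]


def upload_gaps(dates):
    """Days between consecutive uploads, oldest first."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


gaps = upload_gaps(uploads)
print(gaps)  # [7, 7, 14]
print(mean(gaps), pstdev(gaps))  # a small spread suggests a regular schedule
```
</section>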
<section data-markdown>
## How does TubeStats work?
### Part 1 of 2
1. How to set up a development environment?
2. How to access the video information?
3. How to store passwords and API keys?
4. How do we get and store the video statistics?
</section>
<section data-markdown>
## 1. Development environment
```bash
$ mkdir tubestats
$ cd tubestats

$ python3 -m venv venv
$ source venv/bin/activate
$ (venv)

$ git init
```
</section>
<section data-markdown>
## 2. Video information
- use `beautifulsoup`, `scrapy`, `selenium`?
- YouTube Data API
- `google-api-python-client`
</section>
<section data-markdown>
## 3. Storing API keys
- Hard code?
- `python-dotenv`
- `.env`
- `.gitignore`
</section>
<section data-markdown>
### 3. Storing API keys
```
# .env

API_KEY=xxxxxxxx
```

```python
# tubestats/youtube_api.py

import os

from dotenv import load_dotenv

# read the key-value pairs in .env into the process environment
load_dotenv()
API_KEY = os.getenv('API_KEY')
```
</section>
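<section data-markdown>
### 3. Storing API keys

`os.getenv` returns `None` when the variable is missing, so a forgotten `.env` file only shows up later as a failed API call. A minimal sketch of failing early instead; `get_api_key` and its error message are hypothetical, not part of TubeStats, and the dummy value only keeps the snippet runnable without a real key.

```python
import os


def get_api_key(name='API_KEY'):
    """Return the key stored in the environment, or fail with a clear message."""
    key = os.getenv(name)
    if not key:
        # Hypothetical error text; adjust to taste.
        raise RuntimeError(f'{name} is not set; add it to .env or export it')
    return key


# Demo with a dummy value so the snippet runs without a real key.
os.environ['API_KEY'] = 'xxxxxxxx'
print(get_api_key())
```
</section>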
<section data-markdown>
## 4. Get YouTube video statistics
- Access API
- Channel upload playlist
- Video statistics
- `pandas` dataframe
</section>
<section data-markdown>
### 4. Get YouTube video statistics
```python
import os

import googleapiclient.discovery
from dotenv import load_dotenv

load_dotenv()
api_service_name = 'youtube'
api_version = 'v3'

# build a client for the YouTube Data API, authenticated with the key from .env
youtube = googleapiclient.discovery.build(
    api_service_name,
    api_version,
    developerKey=os.getenv('API_KEY'))
```

</section>
<section data-markdown>
```python [|3|5-16|17-18|20-29|30-32]
# tubestats/youtube_api.py

upload_playlist_ID = channel_data['upload_playlist_ID']

video_response = []
next_page_token = None
while True:
    # obtaining video ID + titles
    playlist_request = youtube.playlistItems().list(
        part='snippet,contentDetails',
        maxResults=50, # API Limit is 50
        pageToken=next_page_token,
        playlistId=upload_playlist_ID,
    )
    playlist_response = playlist_request.execute()
    # isolating video ID
    vid_subset = [ vid_ID['contentDetails']['videoId']
                   for vid_ID in playlist_response['items'] ]
    # retrieving video statistics
    vid_info_subset_request = youtube.videos().list(
        part='snippet,contentDetails,statistics',
        id=vid_subset
    )
    vid_info_subset_response = vid_info_subset_request.execute()
    video_response.append(vid_info_subset_response)
    # obtaining page token
    next_page_token = playlist_response.get('nextPageToken') # get method used because token may not exist
    if next_page_token is None:
        break

df = pd.json_normalize(video_response, 'items')
return df
```
</section>
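<section data-markdown>
### 4. Get YouTube video statistics

The loop above follows `nextPageToken` until the API stops returning one. A minimal sketch of that pattern on its own, with a fake three-page response standing in for `playlistItems().list(...).execute()`; `fetch_all` and `FAKE_PAGES` are hypothetical names used only for illustration.

```python
def fetch_all(list_page):
    """Collect items from every page of a token-paginated endpoint.

    list_page is any callable that takes a page token (None for the first
    page) and returns a dict with 'items' and, optionally, 'nextPageToken'.
    """
    items = []
    token = None
    while True:
        page = list_page(token)
        items.extend(page['items'])
        token = page.get('nextPageToken')  # absent on the last page
        if token is None:
            return items


# Fake paged responses keyed by the token that requests them.
FAKE_PAGES = {
    None: {'items': ['a', 'b'], 'nextPageToken': 'p2'},
    'p2': {'items': ['c', 'd'], 'nextPageToken': 'p3'},
    'p3': {'items': ['e']},
}

print(fetch_all(lambda token: FAKE_PAGES[token]))  # ['a', 'b', 'c', 'd', 'e']
```

Keeping the loop separate from the request makes it testable without spending API quota.
</section>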
<section data-markdown>
### Video statistics
![](dataframe.png)
</section>
<section data-markdown>
## How does TubeStats work?
### Part 2 of 2
5. How to organise the code?
6. How to test the code?
7. How to display the data and allow interaction?
8. How to account for variable input?
</section>
<section data-markdown>
## 5. Organising code
```bash [1-9|10-15|16-20|21]
tubestats/
├── data
│   ├── channel_data.pkl
│   └── video_data.pkl
├── LICENSE
├── Procfile
├── README.MD
├── requirements.txt
├── setup.sh
├── tests
│   ├── __init__.py
│   ├── test_settings.py
│   ├── test_youtube_api.py
│   ├── test_youtube_data.py
│   └── test_youtube_parser.py
├── tubestats
│   ├── __init__.py
│   ├── youtube_api.py
│   ├── youtube_data.py
│   └── youtube_parser.py
└── youtube_presenter.py
```
</section>
<section data-markdown>
## 6. Testing
```python [|15-20]
# tests/test_youtube_api.py
from tubestats.youtube_api import create_api, YouTubeAPI
from tests.test_settings import set_channel_ID_test_case

from pathlib import Path

import pytest
import googleapiclient.discovery
import pandas

def test_create_api():
    youtube = create_api()
    assert isinstance(youtube, googleapiclient.discovery.Resource)

@pytest.fixture()
def youtubeapi():
    channel_ID = set_channel_ID_test_case()
    yt = YouTubeAPI(channel_ID)
    return yt

def test_get_video_data(youtubeapi):
    df = youtubeapi.get_video_data()
    assert isinstance(df, pandas.core.frame.DataFrame)

    # saving video data to save API calls for later testing
    BASE_DIR = Path(__file__).parent.parent
    df.to_pickle(BASE_DIR / 'data' / 'video_data.pkl')
```
</section>
<section data-markdown>
## 7. Sharing to the world
- graphs with tool tips, `altair`
- creating interaction with `streamlit`
- hosting on Heroku
</section>
<section data-markdown>
### 7. Sharing to the world
```python []
# tubestats/youtube_data.py

import altair as alt
import pandas as pd

def scatter_all_videos(self, df: pd.core.frame.DataFrame) -> alt.vegalite.v4.Chart:
    df_views = df
    # raw strings keep the backslash that escapes the dot in nested column names
    c = alt.Chart(df_views, title='Plot of videos over time').mark_point().encode(
        x=alt.X(r'snippet\.publishedAt_REFORMATED:T', axis=alt.Axis(title='Date Published'), scale=alt.Scale(type='log')),
        y=alt.Y(r'statistics\.viewCount:Q', axis=alt.Axis(title='View Count')),
        color=alt.Color(r'statistics\.like-dislike-ratio:Q', scale=alt.Scale(scheme='turbo'), legend=None),
        tooltip=[r'snippet\.title:N', r'statistics\.viewCount:Q', r'statistics\.like-dislike-ratio:Q'],
        size=alt.Size(r'statistics\.viewCount:Q', legend=None)
    )
    return c
```
</section>
<section data-markdown>
### 7. Sharing to the world
```python [|5-14|16-19]
# youtube_presenter.py
from datetime import datetime, timedelta
import streamlit as st

def date_slider(date_end=datetime.today()):
    date_start, date_end = st.slider(
        'Select date range to include:',
        min_value=first_video_date, # first video
        max_value=last_video_date, # last video
        value=(first_video_date , last_video_date), # default: the full range
        step=timedelta(days=2),
        format='YYYY-MM-DD',
        key=999)
    return date_start, date_end

date_start, date_end = date_slider()
transformed_df = youtuber_data.transform_dataframe(date_start=date_start, date_end=date_end)
c = youtuber_data.scatter_all_videos(transformed_df)
st.altair_chart(c, use_container_width=True)
```
</section>
<section data-markdown>
### 7. Sharing to the world
```shell []
$ streamlit run youtube_presenter.py

You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://192.0.0.0.1
```
</section>
<section data-markdown>
![](all-graph.png)
</section>
<section data-markdown>
### 7. Sharing to the world
```bash
$ (venv) pip freeze > requirements.txt
```

```bash
# setup.sh

mkdir -p ~/.streamlit/

echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml
```

```bash
# Procfile

web: sh setup.sh && streamlit run youtube_presenter.py
```

```bash
$ heroku login
$ heroku create tubestats

$ git push heroku main
```
</section>
<section data-markdown>
## 8. Different user input
- taking video ID
- URL links
- using regex, `re` module
</section>
<section data-markdown>
### 8. Different user input
```python []
import re

LINK_MATCH = r'(^.*youtu)(\.be|be\.com)(\/watch\?v\=|\/)([a-zA-Z0-9_-]+)(\/)?([a-zA-Z0-9_-]+)?'
m = re.search(LINK_MATCH, for_parse)
video_id = m.group(4) # video ID
if video_id == 'channel':
    return m.group(6) # Channel ID
elif video_id == 'user':
    channel_username = m.group(6) # Channel Username
```
</section>
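<section data-markdown>
### 8. Different user input

A minimal, self-contained sketch of how the pattern above could sort all four kinds of input. The `classify` helper is hypothetical, and the fallback for bare IDs rests on an assumption: channel IDs are 24 characters starting with `UC`, and video IDs are 11 characters.

```python
import re

# Pattern copied from the previous slide.
LINK_MATCH = r'(^.*youtu)(\.be|be\.com)(\/watch\?v\=|\/)([a-zA-Z0-9_-]+)(\/)?([a-zA-Z0-9_-]+)?'


def classify(for_parse):
    """Return a (kind, value) pair; kind is 'video', 'channel', 'user' or 'unknown'."""
    m = re.search(LINK_MATCH, for_parse)
    if m is None:
        # Not a link. Assumption: bare channel IDs are 24 characters starting
        # with 'UC' and bare video IDs are 11 characters.
        if len(for_parse) == 24 and for_parse.startswith('UC'):
            return 'channel', for_parse
        if len(for_parse) == 11:
            return 'video', for_parse
        return 'unknown', for_parse
    if m.group(4) in ('channel', 'user') and m.group(6):
        # /channel/ID or /user/NAME: the value sits in the sixth group
        return m.group(4), m.group(6)
    return 'video', m.group(4)


print(classify('https://www.youtube.com/watch?v=epF2SYpWtos'))  # ('video', 'epF2SYpWtos')
print(classify('https://www.youtube.com/channel/UCoOae5nYA7VqaXzerajD0lg'))
```

Checking for `None` before calling `m.group` avoids an `AttributeError` when the input is not a link.
</section>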
<section data-markdown>
### 8. Different user input
![](regex.png)
</section>
<section data-markdown>
## Things to consider for the future
- Error handling
- Maxing out API calls
- Comparing different channels
- DataFrame and memory
</section>
<section data-markdown>
### DataFrame immutability and memory?
```python []
df = self.df
df = df[['snippet.publishedAt',
         'snippet.title',
         ...
         'statistics.favoriteCount',
         'statistics.commentCount']]

df = df.fillna(0)

df = df.astype({'statistics.viewCount': 'int',
                ...
                'statistics.commentCount': 'int',})
df['statistics.viewCount_NLOG'] = df['statistics.viewCount'].apply(lambda x: np.log(x))

df = df.sort_values(by='snippet.publishedAt_REFORMATED', ascending=True)
```
</section>
<section data-markdown>
## What did I learn?
- Project-based learning
- 'Minimum viable product'
</section>
<section data-markdown>
## Conclusion
- Analysing consistency
- YouTube Data API --> pandas --> altair --> Heroku
- Share your work!
</section>
<section data-markdown>
## Acknowledgements
- Menno Finlay-Smits
</section>

</div>
</div>

<script src="dist/reveal.js"></script>
<script src="plugin/notes/notes.js"></script>
<script src="plugin/markdown/markdown.js"></script>
<script src="plugin/highlight/highlight.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
    hash: true,

    // Learn about plugins: https://revealjs.com/plugins/
    plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
});
</script>
</body>
</html>