
What I Learned from Scraping 15k Data Science Articles on Medium

Khuyen Tran

And why having 5 claps on your article is okay

This is a Datapane import of the original Medium post, which can be found here

Motivation

Have you ever wondered what factors make an article receive a high number of claps? As a data science writer, I also wonder:

  • What is the average number of claps? Some articles I came across have 100 or even 1,000 claps. Is that typical for a data science article?
  • Which titles are most used in data science articles?
  • What is the ideal reading time for a good article?
  • Will publishing on weekdays bring more claps than publishing on weekends?

To answer these questions, I scraped all data science articles on Medium published within the last year.

Tools

To scrape Medium, I used the excellent repository from Harrison Jansma, with slight changes to the packages to deal with errors in the requirements. I chose 6 tags related to data science:

  • Data science
  • Machine learning
  • AI
  • Python
  • Data visualization
  • Big data

The articles were published between July 2019 and July 2020. It took me 4 to 5 hours to scrape all of these tags, but I ended up with good data ready for cleaning and analysis. I merged the data from the 6 tags and added a column Tag showing which tag each article was scraped under.
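A minimal sketch of that merging step, assuming one CSV per scraped tag (the file names here are hypothetical):

import pandas as pd

# Hypothetical output files from the scraper, one per tag
tag_files = {'Data science': 'data-science.csv',
             'Machine learning': 'machine-learning.csv',
             'AI': 'artificial-intelligence.csv',
             'Python': 'python.csv',
             'Data visualization': 'data-visualization.csv',
             'Big data': 'big-data.csv'}

frames = []
for tag, path in tag_files.items():
    df = pd.read_csv(path)
    df['Tag'] = tag  # record which tag the article was scraped under
    frames.append(df)

medium = pd.concat(frames, ignore_index=True)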

If you want to play with the data and follow along with the article, you can download the data here

or use Datapane Blob to get direct access to the data

import datapane as dp

medium = dp.Blob.get(name='medium', owner='khuyentran1401').download_df()

Make sure to sign up and log in to Datapane if you want to use Blob. More detailed instructions on using Blob can be found here.

Get Started

Save the data as medium and import the libraries. Make sure to change the string 'nan' to null values and drop duplicates. We will keep the same articles with different tags for now.

import datapane as dp
import pandas as pd
import numpy as np

medium = medium.replace('nan', np.nan)

# Drop duplicates
medium = medium.drop_duplicates(subset=['Title', 'Subtitle', 'Author', 'Year',
                                        'Month', 'Day', 'Tag'])

and take a look at what we got

medium.info()

[Output of medium.info()]

What I Found

Topics

Which tags are most popular among data science-related tags?

import plotly.express as px

# Save the charts to build an interactive report later
charts = []

tag_plot = px.bar(x=medium.Tag.value_counts().index,
                  y=medium.Tag.value_counts().values,
                  labels={'y': 'Number of Articles',
                          'x': 'Tags'},
                  title='Number of articles in each data science-related topic')
tag_plot

Since different rows may just be the same articles with different tags, we will drop these articles to make sure we have only unique articles in our data.

# Number of duplicated articles with different tags
>>> sum(medium.iloc[:, :8].duplicated())
38516

# Drop duplicates
medium = medium.drop_duplicates(subset=['Title', 'Subtitle', 'Author', 'Year', 'Month', 'Day'])

Comment

From looking at the data, we can see that the number of comments on Medium is often really low. But how low exactly?

comment = px.histogram(medium, x='Comment')
comment

No article has more than 1 comment, and 96.1% of them have no comments at all!
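That figure can be checked directly; a minimal sketch, assuming the Comment column stores the comment count as a number:

# Share of articles with zero comments
no_comment_share = (medium['Comment'] == 0).mean() * 100
print(f'{no_comment_share:.1f}% of articles have no comments')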

It is surprising that even articles with a high number of claps have very few comments. The hidden comment tab on Medium may discourage readers from commenting; a more visible comment section might increase the number of comments.

Claps

What is the average number of claps for an article? Convert strings such as 1.5K to numbers such as 1500, then use describe()

# Change type category to object
medium.Claps = medium.Claps.astype('object')

def str_to_float(feature):
    '''Convert strings ending in K or M (e.g., 1.5K) to floats'''
    feature = feature.replace(r'[KM]+$', '', regex=True).astype(float) * \
              feature.str.extract(r'[\d\.]+([KM]+)', expand=False).fillna(1).replace(['K', 'M'], [10**3, 10**6]).astype(int)
    return feature

# Apply the conversion so Claps becomes numeric
medium.Claps = str_to_float(medium.Claps)
medium.Claps.describe()

[Output of medium.Claps.describe()]

The average is 55. Not so bad. But look at the 50th percentile: it is 3! And the max is 26,000. This looks like highly skewed data. Let's double-check with a histogram.

Since the data is large and the number of claps is highly skewed, we sort the data by the number of claps and plot the first 80k instances.

claps = px.histogram(medium.sort_values(by='Claps')[:80000],
                     x='Claps',
                     title='Number of Claps')

From the plot, we can see that the majority of articles receive between 0 and 10 claps. When dealing with highly skewed data like this, the median captures the 'middle' of the data better than the mean!
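A quick way to see why is to compare the two summaries on the cleaned, numeric Claps column:

# The mean is pulled up by a handful of viral articles; the median is not
print('Mean claps:  ', medium['Claps'].mean())
print('Median claps:', medium['Claps'].median())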

If you are a data science writer who feels discouraged because you have 0–10 claps, you should feel OK, because this is pretty typical!

Reading Time vs Claps

You may have heard that reading time can affect how much the audience likes an article. Let's test it out.

readingTime_claps = px.scatter(medium,
                               x='Reading_Time',
                               y='Claps',
                               title='Claps vs Reading Time')
readingTime_claps.show()

There seems to be only a weak correlation between claps and reading time:

>>> medium.corr().loc['Reading_Time', 'Claps']
0.1301349558669967

But one thing to notice is that articles with a long reading time have a really low number of claps, while articles with a high number of claps tend to have short reading times.

Let's find out the average reading time for the articles in the top 25% by number of claps
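The lower bound of 63 used in the filter below presumably corresponds to the 75th percentile of the claps distribution; a quick check of that assumption:

# Articles above this value are in the top 25% by claps
print(medium['Claps'].quantile(0.75))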

medium[medium.Claps.between(63, medium.Claps.max())].Reading_Time.describe()

[Output of Reading_Time.describe() for the top-25% articles]

The average reading time of these well-received articles is around 6.6 minutes, with a standard deviation of 3.9. This suggests that articles with a reading time of roughly 2.7 to 10.5 minutes are ideal.
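That range is just the mean plus or minus one standard deviation of the reading time; a quick computation:

# Reading-time range (mean +/- one standard deviation) for the top-25% articles
top = medium[medium.Claps.between(63, medium.Claps.max())].Reading_Time
print(top.mean() - top.std(), top.mean() + top.std())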

Author

What is the typical publishing frequency of a data science writer? We use groupby() to group the data by author and count the total number of articles that each author published within the last year.

# Group df by author and count the number of articles they publish
author_groupby = medium.groupby(['Author']).count().sort_values(by='Year', ascending=False).reset_index()

fig = px.bar(author_groupby[:100],
             x='Author',
             y='Year',
             labels={'Year': 'Number of articles'},
             title='Top 100 most active authors with topics related to data science'
             )
fig.update_layout({
    'plot_bgcolor': 'rgba(133, 227, 239, 0.04)',
})
fig.show()

I was curious where I stand in this ranking, so I used a few lines of code to find out

>>> author_rank = medium.Author.value_counts().index
>>> 100 - (list(author_rank).index('Khuyen Tran') + 1) / len(author_rank) * 100
99.85684944295761

I am in the 99.86th percentile of authors who published data science articles last year. Considering I only started writing in December 2019, this is a rewarding finding for the effort I put into articles every week.

Let's find the median number of articles an author published last year

>>> author_groupby.Year.median()
1

This means the typical author published just one article over the whole year.

Publications

Which publications publish data science articles most frequently among all publications on Medium?

publication_groupby = medium.groupby(by='Publication').count().sort_values(by='Title', ascending=False).reset_index()[:40000]

fig = px.bar(publication_groupby[:50],
             x='Publication',
             y='Title',
             labels={'Title': 'Number of articles'},
             title='Top 50 most active data science publications',
             )
fig.update_layout({
    'plot_bgcolor': 'rgba(133, 227, 239, 0.04)',
    'margin': dict(b=250),
    'height': 600,
})
fig.update_traces(textposition='outside')
fig.update_xaxes(title_font_family="Arial", tickangle=45)
fig.show()

From the chart, the top four most active data science publications within the last year are:

  • Towards Data Science
  • Analytics Vidhya
  • The Startup
  • Data Driven Investor

Let's take it one step further and find out what percentage of articles the top 1% of publications post

>>> sum(publication_groupby.sort_values(by='Year', ascending=False).head(int(len(publication_groupby) * 0.01)).Year) / sum(publication_groupby.Year)
0.6225407930121115

The top 1% of publications posted 62% of the articles last year!

Trend

What is the trend of data science articles? Has the number of data science articles remained stable or changed within the last year?

import datetime

# Turn year, month, and day columns into a datetime column
medium['Dates'] = medium.apply(lambda row: datetime.date(row.Year, row.Month, row.Day), axis=1)
dates_groupby = medium.groupby('Dates').count().reset_index()
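The trend itself can then be plotted as a time series; a minimal sketch with Plotly, reusing dates_groupby from above (any counted column, such as Title, gives the number of articles per day):

trend = px.line(dates_groupby,
                x='Dates',
                y='Title',
                labels={'Title': 'Number of articles'},
                title='Data science articles published per day')
trend.show()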

The number of data science articles was fairly stable until March and has increased significantly from March to the present. Since this is about the same time the coronavirus hit many countries, could staying at home have given writers more time to write? Or has data science become a more attractive topic to readers and writers?

Day of the Week

On which day of the week do authors prefer to publish their articles?

def date_to_weekday(year, month, day):
    '''Find the day of the week for the given date'''
    return datetime.date(year, month, day).weekday()

medium['week_days'] = medium.apply(lambda row: date_to_weekday(row.Year, row.Month, row.Day), axis=1)

# Map the number to the day of the week
day_of_week = {0: 'Monday',
               1: 'Tuesday',
               2: 'Wednesday',
               3: 'Thursday',
               4: 'Friday',
               5: 'Saturday',
               6: 'Sunday'}
medium['week_days'].replace(day_of_week, inplace=True)

day_groupby = medium.groupby(by='week_days').count()
day_groupby = day_groupby.reindex(['Monday', 'Tuesday', 'Wednesday',
                                   'Thursday', 'Friday', 'Saturday',
                                   'Sunday']).reset_index()

publish_dates = px.bar(day_groupby,
                       x='week_days',
                       y='Year',
                       labels={'week_days': 'Day of the week',
                               'Year': 'Number of articles'})
publish_dates.show()

More articles are published on weekdays than on weekends. Is that because articles published on weekdays get more readers?

fig = px.bar(medium.groupby(by='week_days').mean()['Claps'].reset_index(),
             x='week_days',
             y='Claps',
             labels={'week_days': 'Days of the week'},
             title='Average number of claps on each day of a week')
fig.show()

Not quite. The average number of claps is similar on weekdays and weekends.
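To put one number on each group, we could also collapse the days into weekday vs weekend and compare average claps; a rough sketch using the week_days column created above:

# Average claps: weekdays vs weekends
medium['is_weekend'] = medium['week_days'].isin(['Saturday', 'Sunday'])
print(medium.groupby('is_weekend')['Claps'].mean())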

Titles

What are the most used titles of data science articles?

For the 6,488 missing titles, I use the URL to recover the title, since the URL contains the name of the article, such as this one:

https://towardsdatascience.com/to-become-a-better-data-scientist-you-need-to-think-like-a-programmer-18d0a00994dc?source=your_stories_page---------------------------. From the URL, the title is: to become a better data scientist you need to think like a programmer

import re
import string
import math
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def url_to_title(url):
    '''Recover the article title from its URL'''
    url = url.replace('https://towardsdatascience.com/', '')
    url = url.replace('https://medium.com/', '')
    url = re.sub(r'.*/', '', url)
    # Strip the trailing article id (a token mixing letters and digits) and everything after it
    url = re.sub(r'([A-Za-z]+[0-9]+[0-9]*|[0-9]+[A-Za-z]+[0-9]*).+', '', url)
    title = url.replace('-', ' ')
    return title

null_urls = list(medium.loc[medium.Title.isnull(), 'url'])
null_titles = []
for url in null_urls:
    null_titles.append(url_to_title(url))

medium.loc[medium.Title.isnull(), 'Title'] = null_titles

Then we can process the text, combine the subtitles and titles, and visualize them with a word cloud

def process_text(texts: list):
    processed = []
    for text in texts:
        # lowercase
        text = text.lower()
        # remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # remove stopwords
        stop_words = set(stopwords.words('english'))
        # tokenize
        tokens = word_tokenize(text)
        new_text = [i for i in tokens if i not in stop_words]
        new_text = ' '.join(new_text)
        processed.append(new_text)
    return processed

text_features = ['Title', 'Subtitle']

# Replace nan subtitles with a placeholder
subtitle = medium.Subtitle.fillna('None')
subtitle = process_text(list(subtitle))
titles = process_text(list(medium.Title))
combine_titles = ' '.join(titles) + ' ' + ' '.join([text for text in subtitle if text != 'none'])

def make_wordcloud(new_text):
    '''Make a word cloud from the given text'''
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          min_font_size=10).generate(new_text)
    fig = plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
    return fig

cloud = make_wordcloud(combine_titles)
cloud.show()

The most popular words are what we would expect from data science article titles: machine learning, python, data science, algorithm, data scientist, data analyst, etc.

Conclusion

Key takeaways from this article:

  • The majority of articles have no comments
  • It is totally normal to have a low number of claps. In fact, that should be expected
  • There is no magic reading time, but the ideal reading time should not be too long. A lengthy article could scare the audience away.
  • Many authors prefer to publish on weekdays, but they do not necessarily gain more claps by doing so
  • A typical author publishes one article per year
  • Some publications publish significantly more data science articles than others
  • The most popular words in data science article titles are machine learning, python, data science, algorithm, data scientist, data analyst, etc.

I hope this article has given you interesting insights into data science articles. I encourage you to play with the data and use it to your advantage, whether to gain more claps for your articles or to find the next data science article you should read.

All the visualizations for this report can be found here

and the notebook for this article can be found here.

Need to share Python analyses?

Datapane is an API and framework which makes it easy for people analysing data in Python to publish interactive reports and deploy their analyses.
