
What I Found After Scraping More Than 1k Top Machine Learning Github Profiles

Khuyen Tran

These findings might not be what you expected

This is a Datapane import of the original Medium post, which can be found here

Introduction

When searching for the keyword “machine learning” on Github, I found 246,632 machine learning repositories. Since the top search results are the most relevant machine learning repositories, I expect their owners and contributors to be experts or at least competent in machine learning. Thus, I decided to extract the profiles of these users to gain some interesting insights into their backgrounds as well as their statistics.

My Method for Scraping

Tools

To scrape, I use three tools:

  • Beautiful Soup to extract the URLs of all the repositories under the machine learning tag. If you are not familiar with Beautiful Soup, I wrote a tutorial on scraping with it in this article.
  • PyGithub to extract the information about the users. PyGithub is a Python library to use the Github API v3. With it, you can manage your Github resources (repositories, user profiles, organizations, etc.) from Python scripts.
  • Requests to extract the information about the repositories and the links to contributors’ profiles (an alternative sketch using Github’s search API follows this list).
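
The snippet below is a minimal sketch, not the article’s actual scraper: it pulls the same kind of repository list from Github’s public search API with Requests instead of parsing the search pages with Beautiful Soup. The endpoint and JSON fields come from the public REST API documentation, not from the original code.

    import requests

    # The search API returns "best match" results by default, 30 per page,
    # so three pages cover the top 90 repositories used in this article.
    repo_urls = []
    for page in range(1, 4):
        response = requests.get(
            'https://api.github.com/search/repositories',
            params={'q': 'machine learning', 'per_page': 30, 'page': page},
        )
        repo_urls += [repo['html_url'] for repo in response.json()['items']]

    print(len(repo_urls))  # 90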

Methods

I scrape the owners as well as the top 30 contributors of the top 90 repositories that pop up in the search.


By removing duplicates as well as profiles that are organizations such as udacity, I obtain a list of 1,208 users. For each user, I scrape the 20 data points listed below:

    new_profile.info()


Specifically, the first 13 data points are obtained from the user’s profile.

The rest of the data points are obtained from the repositories of a user (a sketch of pulling these fields with PyGithub follows the list):

  • total_stars is the total number of stars of all repositories
  • max_star is the maximum number of stars among all repositories
  • forks is the total number of forks of all repositories
  • descriptions are the descriptions of all repositories of a user
  • contribution is the number of contributions within the last year

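The sketch below is not the article’s exact script, but it shows how these per-user data points might be pulled with PyGithub. "YOUR_TOKEN" and the helper name get_user_stats are placeholders.

    from github import Github

    g = Github("YOUR_TOKEN")

    def get_user_stats(login):
        '''Collect the profile and repository data points for one user (a sketch).'''
        user = g.get_user(login)
        repos = list(user.get_repos())
        return {
            'user_name': user.login,
            'followers': user.followers,
            'following': user.following,
            'hireable': user.hireable,
            'bio': user.bio,
            'location': user.location,
            'total_stars': sum(repo.stargazers_count for repo in repos),
            'max_star': max((repo.stargazers_count for repo in repos), default=0),
            'forks': sum(repo.forks_count for repo in repos),
            'descriptions': [repo.description for repo in repos],
            'languages': [repo.language for repo in repos],
            # contribution (commits within the last year) is not exposed by
            # PyGithub; the article scrapes it from the profile page instead.
        }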

Visualize the Data

Bar graphs

After cleaning the data, we come to the fun part: data visualization. Visualizing the data can give us many insights. I use Plotly because it makes creating interactive plots easy.

    import matplotlib.pyplot as plt
    import numpy as np
    import plotly.express as px  # for plotting
    import altair as alt  # for plotting
    import datapane as dp  # for creating a report for your findings

    top_followers = new_profile.sort_values(by='followers', axis=0, ascending=False)

    fig = px.bar(top_followers,
                 x='user_name',
                 y='followers',
                 hover_data=['followers'],
                 )
    fig.show()

The graph is hard to read because of the long tail of users with fewer than 100 followers. We can zoom in on the left-most portion of the graph to get a better view.
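
One way to zoom in without re-plotting is to restrict the categorical x-axis to the first few bars. This is a sketch, assuming the fig object from the snippet above; the cutoff of 50 users is arbitrary.

    # Show only the first 50 bars of the categorical x-axis
    fig.update_xaxes(range=[-0.5, 49.5])
    fig.show()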

As we can see, llSourcell (Siraj Raval) has by far the most followers (36,261). The next user has about one-third as many (12,682).

We can do further analysis to determine what percentage of all followers the top 1% of users have.

    >>> top_n = int(len(top_followers) * 0.01)
    >>> top_n
    12
    >>> sum(top_followers.iloc[0: top_n, :].loc[:, 'followers']) / sum(top_followers.followers)
    0.41293075864408607

The top 1% of users get 41% of all the followers!

We see the same pattern with other data points such as total_stars, max_star, and forks. To get a better look at these columns, we change the y-axis of these features to a logarithmic scale; contribution’s y-axis is left unchanged.

    figs = []  # list to save all the plots and tables

    features = ['followers',
                'following',
                'total_stars',
                'max_star',
                'forks',
                'contribution']

    for col in features:
        top_col = new_profile.sort_values(by=col, axis=0, ascending=False)

        # change the scale of the y-axis to log for every feature except contribution
        log_y = False
        if col != 'contribution':
            log_y = True

        fig = px.bar(top_col,
                     x='user_name',
                     y=col,
                     hover_data=[col],
                     log_y=log_y
                     )
        fig.update_layout({'plot_bgcolor': 'rgba(36, 83, 97, 0.06)'})  # change background color
        fig.show()

        figs.append(dp.Plot(fig))

These graphs follow Zipf’s law, the statistical distribution observed in data sets such as words in a linguistic corpus, in which the frequency of an item is inversely proportional to its rank.

For example, the most common word in English is “the,” which appears about one-tenth of the time in a typical text even though it is not as important as other words.

We see Zipf’s law a lot in other rankings, such as the population ranks of cities in various countries, income rankings, the number of people buying a book, etc. Now we see this pattern again in Github data.
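
A quick way to eyeball the Zipf-like shape is to plot follower count against rank on a log-log scale: an approximately straight line suggests a power-law tail. This is a sketch, reusing the top_followers dataframe from above.

    import numpy as np
    import plotly.express as px

    # Drop users with zero followers so the log scale is well defined
    nonzero = top_followers[top_followers['followers'] > 0]
    ranks = np.arange(1, len(nonzero) + 1)

    zipf = px.scatter(x=ranks, y=nonzero['followers'],
                      log_x=True, log_y=True,
                      labels={'x': 'rank', 'y': 'followers'},
                      title='Follower count vs. rank (log-log scale)')
    zipf.show()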

Correlation

But what are the relationships between these data points? Is there a strong relationship between them? I use scatter_matrix to get a big picture of the correlation among these data points.

    correlation = px.scatter_matrix(new_profile,
                                    dimensions=['forks', 'total_stars', 'followers',
                                                'following', 'max_star', 'contribution'],
                                    title='Correlation between datapoints',
                                    width=800, height=800)
    correlation.show()

    corr = new_profile.corr()

    figs.append(dp.Plot(correlation))
    figs.append(dp.Table(corr))
    corr

The data points tend to cluster in the bottom-left corner because the majority of users’ values lie in that range. There is a strong positive relationship between the following pairs (a snippet for reading these values off corr follows the list):

  • Maximum number of stars and the total number of stars (0.939)
  • Number of forks (from others) and the total number of stars (0.929)
  • The number of forks and the number of followers (0.774)
  • The number of followers and the total number of stars (0.632)
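
The pairwise values quoted above can be read directly off the correlation matrix computed earlier; a small sketch, assuming the corr dataframe from the previous snippet:

    print(corr.loc['max_star', 'total_stars'])    # ~0.939
    print(corr.loc['total_stars', 'forks'])       # ~0.929
    print(corr.loc['forks', 'followers'])         # ~0.774
    print(corr.loc['total_stars', 'followers'])   # ~0.632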

Languages

What are the favorite languages of the top machine learning users? What are the percentages of Python, Jupyter Notebook, C, and R? We can use a bar chart to find out. To get a better look at the most popular languages, we remove languages with a count of 10 or below.

    import pandas as pd
    from collections import Counter

    # Collect languages from all repos of all users
    languages = []
    for language in list(new_profile['languages']):
        try:
            languages += language
        except:
            languages += ['None']

    # Count the frequency of each language
    occ = dict(Counter(languages))

    # Remove languages with a count of 10 or below
    top_languages = [(language, frequency) for language, frequency in occ.items() if frequency > 10]
    top_languages = list(zip(*top_languages))

    language_df = pd.DataFrame(data={'languages': top_languages[0],
                                     'frequency': top_languages[1]})
    language_df.sort_values(by='frequency', axis=0, inplace=True, ascending=False)

    language = px.bar(language_df, y='frequency', x='languages',
                      title='Frequency of languages')
    figs.append(dp.Plot(language))
    language.show()

From the bar chart above, we have the ranking of languages among machine learning users:

  • Python
  • JavaScript
  • HTML
  • Jupyter Notebook
  • Shell and so on

Hireable

We use Altair to visualize the percentage of users listing themselves as hireable.

    import altair as alt

    hireable = alt.Chart(new_profile).transform_aggregate(
        count='count()',
        groupby=['hireable']
    ).mark_bar().encode(
        x='hireable:O',
        y='count:Q')

    figs.append(dp.Plot(hireable))
    hireable
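
The underlying percentage can also be computed directly. This is a sketch that assumes hireable holds True or None as returned by the Github API, so we count explicit True values:

    pct_hireable = new_profile['hireable'].eq(True).mean() * 100
    print(f'{pct_hireable:.1f}% of users list themselves as hireable')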

Locations

To get a sense of where the users are in the world, our next task is to visualize their locations. We will use the locations of the 31% of users who list one on their profile. We start by extracting the list of locations from the dataframe and geocoding them with geopy.geocoders.Nominatim.

    from geopy.geocoders import Nominatim

    # Nominatim requires a user_agent string identifying your app
    geolocator = Nominatim(user_agent='github-ml-profiles')

    # Keep only the users who list a location
    location_df = new_profile[new_profile['location'].notnull()]
    locations = list(location_df['location'])

    # Extract lats and lons
    lats = []
    lons = []
    exceptions = []

    for loc in locations:
        try:
            location = geolocator.geocode(loc)
            lats.append(location.latitude)
            lons.append(location.longitude)
            print(location.address)
        except:
            print('exception', loc)
            exceptions.append(loc)

    print(len(exceptions))  # output: 17

    # Remove the locations not found on the map
    location_df = location_df[~location_df['location'].isin(exceptions)]
    location_df['latitude'] = lats
    location_df['longitude'] = lons

Then use Plotly’s scatter_geo to create a map!

    # Visualize with Plotly's scatter_geo
    m = px.scatter_geo(location_df, lat='latitude', lon='longitude',
                       color='total_stars', size='forks',
                       hover_data=['user_name', 'followers'],
                       title='Locations of Top Users')
    m.show()

    figs.append(dp.Plot(m))

Word Clouds of Descriptions and Bios

Our data also includes the users’ bios as well as the descriptions of all their repositories. We will use these to answer the question: what are their main focuses and backgrounds?

Generating word clouds gives us a big picture of the words used in the descriptions and bios and how frequently they appear. And creating word clouds with Python could not be easier with WordCloud!

    import string
    import matplotlib.pyplot as plt
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from wordcloud import WordCloud, STOPWORDS

    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')

    def process_text(features):
        '''Function to process texts'''
        features = [row for row in features if row != None]
        text = ' '.join(features)

        # lowercase
        text = text.lower()

        # remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))

        # remove stopwords and tokenize
        stop_words = set(stopwords.words('english'))
        tokens = word_tokenize(text)
        new_text = [i for i in tokens if not i in stop_words]
        new_text = ' '.join(new_text)

        return new_text

    def make_wordcloud(new_text):
        '''Function to make a word cloud'''
        wordcloud = WordCloud(width=800, height=800,
                              background_color='white',
                              min_font_size=10).generate(new_text)
        fig = plt.figure(figsize=(8, 8), facecolor=None)
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.tight_layout(pad=0)
        plt.show()

        return fig

    # Collect the descriptions of all repositories of all users
    descriptions = []
    for desc in new_profile['descriptions']:
        try:
            descriptions += desc
        except:
            pass

    text = process_text(descriptions)
    cloud = make_wordcloud(text)
    figs.append(dp.Plot(cloud))

[Word cloud of the repository descriptions]

Now let’s make a word cloud with the users’ bios:

    bios = []
    for bio in new_profile['bio']:
        try:
            bios.append(bio)
        except:
            pass

    text = process_text(bios)
    cloud = make_wordcloud(text)
    figs.append(dp.Plot(cloud))

[Word cloud of the users’ bios]

The keywords look like what we expect to see from machine learning users.

Share your findings

We have been collecting all the plots and tables in the figs list. It is time to create a report and share what we found! Datapane is an ideal tool for this.

    dp.Report(*figs).publish(name='finding')

Now all the plots we created in this article are on a website hosted by Datapane and ready to share!

Conclusion

The data is obtained from the owners and contributors of the first 90 best-match repositories for the “machine learning” keyword. Thus, this data is not guaranteed to cover all the top machine learning users on Github.

But I hope you can use this article as a guide or inspiration to scrape and visualize your own data. You will most likely be surprised by what you find. Data science is impactful and interesting when you can use your knowledge to analyze the things around you.

Need to share Python analyses?

Datapane is an API and framework which makes it easy for people analysing data in Python to publish interactive reports and deploy their analyses.