https://github.com/ghtorrent/ghtorrent.org
Raw File
Tip revision: 51890965af72da85bdd0954a50ef1a71603fb4e7 authored by Georgios Gousios on 18 September 2020, 08:15:22 UTC
Merge pull request #866 from catgofire/patch-1
Tip revision: 5189096
gcloud.md
---
layout: page
title: GHTorrent on the Google cloud
tagline:
---

GHTorrent can be accessed over Google Cloud services. To access the data
requires you to have a Google Cloud account. Reasonable use is free of charge
and, in the case of BigQuery, it [should no longer require a credit
card](https://cloud.google.com/blog/big-data/2017/01/how-to-run-a-terabyte-of-google-bigquery-queries-each-month-without-a-credit-card). (Pub/Sub still requires a credit card). You can check what Google considers reasonable at any given moment
[here](https://cloud.google.com/pricing/free).

* [Google BigQuery](https://bigquery.cloud.google.com/dataset/ghtorrent-bq:ght)
  contains an up to date import of the latest GHTorrent MySQL dump.

* [Google Pub/Sub](https://console.cloud.google.com/cloudpubsub/topicList?project=ghtorrent-bq) exposes real-time streams of GitHub activity.

Both services can be accessed through the Web, the command line (after
installing the Google Cloud [command line utils](https://cloud.google.com/sdk/)) or though various programming languages.

### BigQuery

With BigQuery, you can query GHTorrent's MySQL dataset using an SQL-like
language (lately, BigQuery also supports vanilla SQL); more importantly, you can 
join the dataset with other open datasets (e.g. GitHub's own project data, Reddit, 
[TravisTorrent](https://travistorrent.testroots.org/page_access/) etc) hosted on BigQuery.

To get the most popular programming languages by number of bytes written,
run the following:

{% highlight sql %}
select pl3.lang, sum(pl3.size) as total_bytes
from (
  select pl2.bytes as size, pl2.language as lang
  from (
    select pl.language as lang, max(pl.created_at) as latest, pl.project_id as project_id
    from [ght.project_languages] pl
      join [ght.projects] p on p.id = pl.project_id
    where p.deleted is false
      and p.forked_from is null
    group by lang, project_id
  ) pl1 join [ght.project_languages] pl2 on pl1.project_id = pl2.project_id
                                        and pl1.latest = pl2.created_at
                                        and pl1.lang = pl2.language
) pl3
group by pl3.lang
order by total_bytes desc
{% endhighlight %}

To get the user with the most Java commits in the Netherlands in June 2016,
do the following:

{% highlight sql %}
select u.login as login, u.location as location, count(c.id) as num_commits
from [ghtorrent-bq.ght.project_commits] pc join
     (SELECT id, author_id FROM [ghtorrent-bq.ght.commits] WHERE
     date(created_at) between date('2016-06-01')
                          and date('2016-07-01') )c on pc.commit_id = c.id join
     (SELECT id
     FROM [ghtorrent-bq.ght.projects] WHERE language = 'Java') p on p.id = pc.project_id join
     (SELECT login, location, id
     FROM [ghtorrent-bq.ght.users]
     WHERE country_code = 'nl') u on c.author_id = u.id,
group by login, location
order by num_commits desc;
{% endhighlight %}

See also some queries by [Felipe Hoffa](https://medium.com/@hoffa/github-top-countries-201608-13f642493773).

### Pub/Sub

Pub/Sub allows subscribers to get events of what is happening on GitHub (or at
least GHTorrent's interpretation of what is happening on GitHub) in almost real time. 
To do so, one needs to *subscribe* to one of the available *topics* with
a client in order to start receiving *events*.

The service is complimentary, even though less fine-grained, to GHTorrent's own
[streaming interface](streaming.html). As is also the case with GHTorrent
streaming, the contents of the streams are generated by following the live
MongoDB server replication stream. See the code [here](https://github.com/ghtorrent/ghtorrent-streaming).

To subscribe to a topic, e.g. `commits`, run the following:

```
gcloud beta pubsub subscriptions create my_commits_subscription --topic projects/ghtorrent-bq/topics/commits
```

To start receiving events, you can try the command line

```
gcloud beta pubsub subscriptions pull --auto-ack --max-messages 5 -- my_commits_subscription
```

The available topics are the following:

{% highlight txt%}
projects/ghtorrent-bq/topics/commits
projects/ghtorrent-bq/topics/events
projects/ghtorrent-bq/topics/followers
projects/ghtorrent-bq/topics/forks
projects/ghtorrent-bq/topics/issue_comments
projects/ghtorrent-bq/topics/issue_events
projects/ghtorrent-bq/topics/issues
projects/ghtorrent-bq/topics/org_members
projects/ghtorrent-bq/topics/pull_request_comments
projects/ghtorrent-bq/topics/pull_requests
projects/ghtorrent-bq/topics/repo_collaborators
projects/ghtorrent-bq/topics/repo_labels
projects/ghtorrent-bq/topics/repos
projects/ghtorrent-bq/topics/users
projects/ghtorrent-bq/topics/watchers
{% endhighlight %}

back to top