Revision 632036cd933fd0bf4525a19451341bbf1532d03b authored by Dirk Roorda on 08 September 2022, 15:17:15 UTC, committed by Dirk Roorda on 08 September 2022, 15:17:15 UTC
1 parent c8fb775
repo.py
"""
# Auto downloading from a backend repository
## Description
Text-Fabric maintains local copies of subfolders of backend repositories,
where it stores the feature data of corpora that the user is working with.
Currently GitHub and GitLab are supported as backends.
In case of GitLab, not only [gitlab.com](https://gitlab.com) is supported,
but also GitLab instances on other servers that support the GitLab API.
There is some bookkeeping to account for which release and commit the
feature files come from.
Users can request data from any repo according to any release and/or commit.
## Rate limiting
The `checkRepo()` function uses the GitHub and GitLab APIs.
GitHub has a rate limiting policy for its API.
See below to deal with this if it becomes a problem.
On GitLab we ignore the rate limiting.
# GitHub
GitHub has a rate limiting policy for its API of max 60 calls per hour.
This can be too restrictive, and here are two ways to keep working nevertheless.
## Increase the rate limit
If you use this function in an application of yours that uses it very often,
you can increase the limit to 5000 calls per hour by making yourself known.
* [create a personal access token](https://github.com/settings/tokens)
* Copy your token and put it in an environment variable named `GHPERS`
on the system where your app runs.
See below how to do that.
* If `checkoutRepo` finds this variable, it will add the
token to every GitHub API call it makes, and that will
increase the rate.
* Never pass your personal credentials on to others, let them obtain their own!
You might want to read this:
* [Read more about rate limiting on GitHub](https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting)
# GitLab
In order to reach an on-premise GitLab and have access to the repository in
question, you may need to have a VPN connection with the GitLab backend.
Additionally, you may need to make your identity known.
If you have an account on the GitLab instance, go to your settings and request
a personal token with *api* privileges.
On your own system, make an environment variable named GL_*BACKEND*`_PERS` whose
content is exactly the value of this token.
And *BACKEND* should be the uppercase variant of the name of the GitLab backend,
where every character that is not a letter or digit or `_` is replaced by a `_`.
For example, for `gitlab.huc.knaw.nl` use `GL_GITLAB_HUC_KNAW_NL_PERS`
and for `gitlab.com` use `GL_GITLAB_COM_PERS`.
See below how to put this in an environment variable.
# Token in environment variables
How to put your personal access token into an environment variable?
!!! note "What is an environment variable?"
It is a setting on your system that various programs/processes can read.
On Windows it is part of the `Registry`.
In this particular case, you put a personal token
that you obtain from GitHub/GitLab
in such an environment variable.
When Text-Fabric accesses the backend, it will look up this token
first, and pass it to the backend API. The backend then knows who you are and
will give you more privileges.
### On Mac and Linux
Find the file that contains your terminal settings. In many cases that is
`.bash_profile` in your home directory.
Some people put commands like these in their `~/.bashrc` file, which is also fine.
If you do not see a `.bashrc` file, put it into your `.bash_profile` file.
A slightly more advanced shell than `bash` is `zsh` and it is the default on newer
Macs. If that is your case, look for a file `.zshrc` in your home directory or
create one.
Whatever is your case, pick the file indicated above and edit it.
!!! hint "How to edit a file in your terminal?"
If you are already familiar with `vi`, `vim`, `emacs`, or `nano`
you already know how to do it.
If not, `nano` is simple editor that is useful for tasks like this.
Assuming that you want to edit the `.zshrc` in your home directory,
go to your terminal and say this:
nano ~/.zshrc
Then you get a view on your file. Then
* press `Ctrl V` a number of times till you are at the end of the file,
* type the two lines lines of text (specified in the next step), or
copy them from the clipboard
* type `Ctrl X` to exit; nano will ask you to save changes, type `Y`,
it will then verify the file name, type `Enter` and you're done
**GitHub**
Put the following lines in this file:
``` sh
GHPERS="xxx"
export GHPERS
```
**GitLab**
Put the following lines in this file:
``` sh
GL_BACKEND_PERS="xxx"
export GL_BACKEND_PERS
```
where
* `xxx` is replaced by your actual token.
* `BACKEND` is replaced by the uppercase GitLab backend
e.g.
* `gitlab.com` becomes `GL_GITLAB_COM_PERS`
* `gitlab.huc.knaw.nl` becomes `GL_GITLAB_HUC_KNAW_NL_PERS`
In this way you can store tokens for multiple GitLab backends.
Then restart your terminal or say in an existing terminal
``` sh
source ~/.zshrc
```
### On Windows
Click on the Start button and type in `environment variable` into the search box.
Click on `Edit the system environment variables`.
This will open up the System Properties dialog to the Advanced tab.
Click on the `Environment Variables button` at the bottom.
Click on `New ...` under `User environment variables`.
**GitHub**: Then fill in `GHPERS` under *name* and the token string under *value*.
**GitLab**: Then fill in `GL_BACKEND_PERS`
under *name* and the token string under *value*.
Then quit the command prompt and start a new one.
### Result
**GitHub**
With this done, you will automatically get the good rate limit,
whenever you fire up Text-Fabric in the future.
**GitLab**
You are now known to the GitLab backend, and you have the same access to its
repository as when you log in via the web interface.
## Minimize accessing GitHub
Another way te avoid being bitten by the rate limit is to reduce the number
of your access actions to GitHub.
There are two instances where Text-Fabric wants to access GitHub:
1. when you start the Text-Fabric browser from the command line
2. when you give the `use()` command in your Python program (or in a Jupyter Notebook).
### Using a corpus for the first time, within the rate limit
If you are still within the rate limit, just give the usual commands, such as
``` sh
text-fabric org/repo
```
or
``` python
use('org/repo', hoist=globals())
```
where `corpus` should be replaced with the real name of your corpus.
The data will be downloaded to your computer and stored in your
`~/text-fabric-data` directory tree.
### Using a corpus for the first time, after hitting the rate limit
If you want to load a new corpus after having passed the rate limit, and not
wanting to wait an hour, you could directly clone the repos from GitHub/GitLab:
Open your terminal, and go to (or create) directory `~/github` or `~/gitlab`
(in your home directory).
Inside that directory, go to or create directory `org`
Go to that directory.
Then do
``` sh
git clone https://github.com/org/repo
```
or
``` sh
git clone https://gitlab.com/org/repo
```
(replacing `org` and `repo` with the values that apply to your corpus).
This will fetch the Text-Fabric *data*, *app*, and *tutorials* for that corpus.
Now you have all data you need on your system.
If you want to see by example how to use this data, have a look at
[repo](https://nbviewer.jupyter.org/github/annotation/banks/blob/master/tutorial/repo.ipynb),
especially when it discusses `clone`.
In order to run Text-Fabric without further access to the backend, say
``` sh
text-fabric corpus:clone checkout=clone
```
or, in a program,
``` python
A = use('org/repo:clone', checkData='clone', hoist=globals())
```
This will instruct Text-Fabric to use the app and data from within your `~/github`
or `~/gitlab` directory tree.
### Using a corpus that you already have
Depending on how you got the corpus, it is in your
`~/github`, `~/gitlab` or in your `~/text-fabric-data` directory tree:
1. if you cloned it from GitHub, it is in your `~/github` tree;
2. if you cloned it from GitLab, it is in your `~/gitlab` tree;
3. if you cloned it from an other instance of GitLab, say hosted at
`gitlab.huc.knaw.nl`, it is in your `~/gitlab.huc.knaw.nl` tree;
4. if you used the autoload of Text-Fabric it is in your `~/text-fabric-data`.
In the first case, do this:
``` sh
text-fabric corpus:clone checkout=clone
```
or, in a program,
``` python
A = use('org/repo:clone', checkData='clone', hoist=globals())
```
In the second case, do this:
``` sh
text-fabric corpus:clone checkout=clone --backend=gitlab
```
or, in a program,
``` python
A = use('org/repo:clone', checkData='clone', backend="gitlab", hoist=globals())
```
In the third case, do this:
``` sh
text-fabric corpus:clone checkout=clone --backend=gitlab.huc.knaw.nl
```
or, in a program,
``` python
A = use(
'org/repo:clone',
checkData='clone',
backend="gitlab".huc.knaw.nl,
hoist=globals(),
)
```
In the fourth case, do just this:
``` sh
text-fabric corpus
```
or, in a program,
``` python
A = use('org/repo', hoist=globals())
```
See also `tf.advanced.app.App`.
### Updating a corpus that you already have
If you cloned it from GitHub/GitLab:
In your terminal:
``` sh
cd ~/github/organization/repo
```
or
``` sh
cd ~/gitlab/organization/repo
```
(replacing `organization` with the name of the organization where the corpus resides
and `corpus` with the name of your corpus).
And then:
git pull origin master
```
Now you have the newest corpus data on your system.
and you can use it as follows
(we show the example for `github`):
``` sh
text-fabric corpus:clone checkout=clone
```
or, in a program,
``` python
A = use('org/repo:clone', checkData='clone', hoist=globals())
```
If you have autoloaded it from the backend,
you have to add the `latest` or `hot` specifier:
``` sh
text-fabric corpus:latest checkout=latest
```
or, in a program,
``` python
A = use('org/repo:latest', checkData='latest', hoist=globals())
```
And after that, you can omit `latest` or `hot` again, until you need new data again.
!!! hint "App versus data"
The checkout specifiers such as `latest`, `hot`, `clone` apply
to either the corpus data or the TF App.
If the specifier follows the app name, separated with a colon,
it directs how the app code is being obtained.
If it is the value of the `checkout` parameter, it directs how the corpus data
is being obtained.
See further under `checkoutRepo`.
"""
import os
import io
import re
from shutil import rmtree
import requests
import base64
from zipfile import ZipFile
from github import Github, GithubException, UnknownObjectException
from gitlab import Gitlab
from gitlab.exceptions import GitlabGetError
from ..parameters import (
GH,
URL_TFDOC,
backendRep,
EXPRESS_SYNC,
EXPRESS_SYNC_LEGACY,
DOWNLOADS,
)
from ..core.helpers import console, htmlEsc, expanduser, initTree
from ..core.timestamp import SILENT_D, AUTO, TERSE, VERBOSE, silentConvert
from .helpers import dh, runsInNotebook
from .zipdata import zipData
VERSION_DIGIT_RE = re.compile(r"^([0-9]+).*")
SHELL_VAR_RE = re.compile(r"[^A-Z0-9_]")
def GLPERS(backend):
return f"GL_{SHELL_VAR_RE.sub('_', backend.upper())}_PERS"
class Repo:
"""Auxiliary class for `releaseData`"""
def __init__(
self,
backend,
org,
repo,
folder,
version,
increase,
source=backendRep(GH, "clone"),
dest=DOWNLOADS,
):
self.org = org
self.repo = repo
self.folder = folder
self.version = version
self.increase = increase
self.source = expanduser(source)
self.dest = expanduser(dest)
self.repoOnline = None
self.backend = backend
onGithub = backend is None
self.onGithub = onGithub
self.conn = None
def newRelease(self):
if not self.makeZip():
return False
conn = self.connect()
if not conn:
return False
if not self.fetchInfo():
return False
if not self.bumpRelease():
return False
if not self.makeRelease():
return False
if not self.uploadZip():
return False
return True
def makeZip(self):
source = self.source
dest = self.dest
backend = self.backend
org = self.org
repo = self.repo
folder = self.folder
version = self.version
dataIn = f"{source}/{org}/{repo}/{folder}/{version}"
if not os.path.exists(dataIn):
console(f"No data found in {dataIn}", error=True)
return False
zipData(
backend,
org,
repo,
version=version,
relative=folder,
source=source,
dest=dest,
)
return True
def connect(self):
backend = self.backend
conn = self.conn
if not conn:
ghPerson = os.environ.get("GHPERS", None)
if ghPerson:
conn = Github(ghPerson)
else:
ghClient = os.environ.get("GHCLIENT", None)
ghSecret = os.environ.get("GHSECRET", None)
if ghClient and ghSecret:
conn = Github(client_id=ghClient, client_secret=ghSecret)
else:
conn = Github()
try:
rate = conn.get_rate_limit().core
self.info(
f"rate limit is {rate.limit} requests per hour,"
f" with {rate.remaining} left for this hour"
)
if rate.limit < 100:
self.warning(
f"To increase the rate,"
f"see {URL_TFDOC}/advanced/repo.html#github"
)
self.info(
f"\tconnecting to online {backend} repo {self.org}/{self.repo} ... ",
newline=False,
)
self.repoOnline = conn.get_repo(f"{self.org}/{self.repo}")
self.info("connected")
except GithubException as why:
self.warning("failed")
self.warning(f"{backend} says: {why}")
except IOError:
self.warning("no internet")
self.conn = conn
return conn
def fetchInfo(self):
g = self.repoOnline
if not g:
return False
self.commitOn = None
self.releaseOn = None
self.releaseCommitOn = None
result = self.getRelease()
if result:
self.releaseOn = result
result = self.getCommit()
if result:
self.commitOn = result
return True
def bumpRelease(self):
increase = self.increase
latestR = self.releaseOn
if latestR:
console(f"Latest release = {latestR}")
else:
latestR = "v0.0.0"
console("No releases yet")
# bump the release version
v = ""
if latestR.startswith("v"):
v = "v"
r = latestR[1:] if latestR.startswith("v") else latestR
parts = [int(p) for p in r.split(".")]
nParts = len(parts)
if nParts < increase:
for i in range(nParts, increase):
parts.append(0)
parts[increase - 1] += 1
parts[increase:] = []
newTag = f"{v}{'.'.join(str(p) for p in parts)}"
console(f"New release = {newTag}")
self.newTag = newTag
return True
def makeRelease(self):
g = self.repoOnline
if not g:
return False
commit = self.commitOn
newTag = self.newTag
tag_message = "data update"
release_name = "data update"
release_message = "data update"
try:
newReleaseObj = g.create_git_tag_and_release(
newTag,
tag_message,
release_name,
release_message,
commit,
"commit",
)
except Exception as e:
self.error("\tcannot create release", newline=True)
console(str(e), error=True)
return False
self.newReleaseObj = newReleaseObj
return True
def uploadZip(self):
newTag = self.newTag
newReleaseObj = self.newReleaseObj
dest = self.dest
org = self.org
repo = self.repo
folder = self.folder
version = self.version
dataFile = f"{folder}-{version}.zip"
dataDir = f"{dest}/{org}-release/{repo}"
dataPath = f"{dataDir}/{dataFile}"
if not os.path.exists(dataPath):
console(f"No release data found: {dataPath}", error=True)
return False
try:
newReleaseObj.upload_asset(
dataPath, label="", content_type="application/zip", name=dataFile
)
console(f"{dataFile} attached to release {newTag}")
except Exception as e:
self.error("\tcannot attach zipfile to release", newline=True)
console(str(e), error=True)
return False
return True
def getRelease(self):
r = self.getReleaseObj()
if not r:
return None
return r.tag_name
def getReleaseObj(self):
g = self.repoOnline
if not g:
return None
r = None
try:
r = g.get_latest_release()
except UnknownObjectException:
self.error("\tno releases", newline=True)
except Exception:
self.error("\tcannot find releases", newline=True)
return r
def getCommit(self):
c = self.getCommitObj()
if not c:
return None
return c.sha
def getCommitObj(self):
g = self.repoOnline
if not g:
return None
c = None
try:
cs = g.get_commits()
if cs.totalCount:
c = cs[0]
else:
self.error("\tno commits")
except Exception:
self.error("\tcannot find commits")
return c
def info(self, msg, newline=True):
silent = self.silent
if silent in {VERBOSE, AUTO}:
console(msg, newline=newline)
def warning(self, msg, newline=True):
silent = self.silent
if silent in {VERBOSE, AUTO, TERSE}:
console(msg, newline=newline)
def error(self, msg, newline=True):
console(msg, error=True, newline=newline)
def releaseData(
backend,
org,
repo,
folder,
version,
increase,
source=None,
dest=DOWNLOADS,
):
"""Makes a new data release for a repository.
!!!caution "GitHub only"
Only GitHub repositories are supported.
GitLab support will be implemented if there is a need for it.
Parameters
----------
backend: string
`github` or `gitlab` or a GitLab instance such as `gitlab.huc.knaw.nl`.
org: string
The organization name of the repo
repo: string
The name of the repo
folder: string
The subfolder in the repo that contains the text-fabric files.
If the tf files are versioned, it is the directory that
contains the version directories.
In most cases it is `tf` or it ends in `/tf`.
version: string
The version of the data that should be attached as a zip file to the release
increase:
The way in which the release version should be increased:
1 = bump major version;
2 = bump intermediate version;
3 = bump minor version
source: string, optional `None`
Path to where the local GitHub clones are stored
dest: string, optional `DOWNLOADS`
Path to where the zipped data should be stored
"""
if source is None or not source:
source = (backendRep(GH, "clone"),)
R = Repo(GH, org, repo, folder, version, increase, source=source, dest=dest)
return R.newRelease()
class Checkout:
"""Auxiliary class for `checkoutRepo`"""
@staticmethod
def fromString(string):
commit = None
release = None
local = None
if not string:
commit = ""
release = ""
elif string == "latest":
commit = None
release = ""
elif string == "hot":
commit = ""
release = None
elif string in {"local", "clone"}:
commit = None
release = None
local = string
elif "." in string or len(string) < 12:
commit = None
release = string
else:
commit = string
release = None
return (commit, release, local)
@staticmethod
def toString(commit, release, local, backend, source=None, dest=None):
extra = ""
if local:
if source is None:
source = backendRep(backend, "clone")
if dest is None:
dest = backendRep(backend, "cache")
baseRep = source if local == "clone" else dest
extra = f" offline under {baseRep}"
if local == "clone":
result = "repo clone"
elif commit and release:
result = f"r{release}=#{commit}"
elif commit:
result = f"#{commit}"
elif release:
result = f"r{release}"
elif commit is None and release is None:
result = "unknown release or commit"
elif commit is None:
result = "latest release"
elif release is None:
result = "latest commit"
else:
result = "latest release or commit"
return f"{result}{extra}"
def isClone(self):
return self.local == "clone"
def isOffline(self):
return self.local in {"clone", "local"}
def __init__(
self,
backend,
org,
repo,
relative,
checkout,
source,
dest,
keep,
withPaths,
silent,
_browse,
inNb,
version=None,
label="data",
):
self.backend = backend
onGithub = backend == GH
self.onGithub = onGithub
self.conn = None
self._browse = _browse
self.inNb = inNb
self.label = label
self.org = org
self.repo = repo
self.source = source
self.dest = dest
(self.commitChk, self.releaseChk, self.local) = self.fromString(checkout)
clone = self.isClone()
offline = self.isOffline()
self.relative = relative
self.version = version
versionRep = f"/{version}" if version else ""
self.versionRep = versionRep
relativeRep = f"/{relative}" if relative else ""
self.dataDir = f"{relative}{versionRep}"
self.baseLocal = expanduser(self.dest)
self.dataRelLocal = f"{org}/{repo}{relativeRep}"
self.dirPathSaveLocal = f"{self.baseLocal}/{org}/{repo}"
self.dirPathLocal = f"{self.baseLocal}/{self.dataRelLocal}{versionRep}"
self.dataPathLocal = f"{self.dataRelLocal}{versionRep}"
self.filePathLocal = f"{self.dirPathLocal}/{EXPRESS_SYNC}"
self.baseClone = expanduser(self.source)
self.dataRelClone = f"{org}/{repo}{relativeRep}"
self.dirPathClone = f"{self.baseClone}/{self.dataRelClone}{versionRep}"
self.dataPathClone = f"{self.dataRelClone}{versionRep}"
self.dataPath = self.dataRelClone if clone else self.dataRelLocal
self.keep = keep
self.withPaths = withPaths
self.commitOff = None
self.releaseOff = None
self.commitOn = None
self.releaseOn = None
self.releaseCommitOn = None
self.silent = silentConvert(silent)
self.repoOnline = None
self.localBase = False
self.localDir = None
if clone:
self.commitOff = None
self.releaseOff = None
else:
self.fixInfo()
self.readInfo()
if not offline:
self.connect()
self.fetchInfo()
def login(self):
onGithub = self.onGithub
conn = self.conn
backend = self.backend
self.canDownloadSubfolders = False
if onGithub:
person = os.environ.get("GHPERS", None)
if person:
conn = Github(person)
else:
client = os.environ.get("GHCLIENT", None)
secret = os.environ.get("GHSECRET", None)
if client and secret:
conn = Github(client_id=client, client_secret=secret)
else:
conn = Github()
else:
bUrl = backendRep(backend, "url")
bMachine = backendRep(backend, "machine")
person = os.environ.get(GLPERS(bMachine), None)
if person:
conn = Gitlab(bUrl, private_token=person)
else:
conn = Gitlab(bUrl)
backendVersion = conn.version()
if (
not backendVersion
or backendVersion[0] == "unknown"
or backendVersion[-1] == "unknown"
):
self.conn = None
self.error(f"Cannot connect to GitLab instance {backend}\n")
return
versionThreshold = (14, 4, 0)
if backendVersion:
backendVersion = [
int(VERSION_DIGIT_RE.sub(r"\1", vc))
for vc in backendVersion[0].split(".")
]
if len(backendVersion) < 3:
backendVersion.extend([0] * (3 - len(backendVersion)))
canDownloadSubfolders = True
for (t, v) in zip(versionThreshold, backendVersion):
if t != v:
canDownloadSubfolders = t < v
break
self.canDownloadSubfolders = canDownloadSubfolders
self.conn = conn
return conn
def connect(self):
conn = self.conn
onGithub = self.onGithub
backend = self.backend
org = self.org
repo = self.repo
bName = backendRep(backend, "name")
if not conn:
conn = self.login()
if not self.conn:
return
if onGithub:
try:
rate = conn.get_rate_limit().core
self.info(
f"rate limit is {rate.limit} requests per hour,"
f" with {rate.remaining} left for this hour"
)
if rate.limit < 100:
self.warning(
f"To increase the rate,"
f"see {URL_TFDOC}/advanced/repo.html#github"
)
except GithubException as why:
self.warning("Could not get rate limit details")
self.warning(f"{bName} says: {why}")
self.info(
f"\tconnecting to online {bName} repo {org}/{repo} ... ",
newline=False,
)
repoOnline = None
try:
if onGithub:
try:
repoOnline = conn.get_repo(f"{org}/{repo}")
self.info("connected")
except GithubException as why:
self.warning("failed")
self.warning(f"{bName} says: {why}")
else:
try:
repoOnline = conn.projects.get(f"{org}/{repo}")
self.info("connected")
except GitlabGetError as why:
self.warning("failed")
self.warning(f"{bName} says: {why}")
except IOError as why:
self.warning("no internet")
self.warning("failed")
self.warning(f"{bName} says: {why}")
self.repoOnline = repoOnline
def info(self, msg, newline=True):
silent = self.silent
if silent in {VERBOSE, AUTO}:
console(msg, newline=newline)
def warning(self, msg, newline=True):
silent = self.silent
if silent in {VERBOSE, AUTO, TERSE}:
console(msg, newline=newline)
def error(self, msg, newline=True):
console(msg, error=True, newline=newline)
def display(self, msg, msgPlain):
inNb = self.inNb
silent = self.silent
if silent in {VERBOSE, AUTO, TERSE}:
if inNb:
dh(msg)
else:
console(msgPlain)
def possibleError(self, msg, showErrors, again=False, indent="\t", newline=False):
if showErrors:
self.error(msg, newline=newline)
else:
self.warning(msg, newline=newline)
if again:
self.warning(f"{indent}Will try something else")
def makeSureLocal(self, attempt=False):
_browse = self._browse
backend = self.backend
label = self.label
offline = self.isOffline()
clone = self.isClone()
cOff = self.commitOff
rOff = self.releaseOff
cChk = self.commitChk
rChk = self.releaseChk
cOn = self.commitOn
rOn = self.releaseOn
rcOn = self.releaseCommitOn
askExact = rChk or cChk
askExactRelease = rChk
askExactCommit = cChk
askLatest = not askExact and (rChk == "" or cChk == "")
askLatestAny = rChk == "" and cChk == ""
askLatestRelease = rChk == "" and cChk is None
askLatestCommit = cChk == "" and rChk is None
isExactReleaseOff = rChk and rChk == rOff
isExactCommitOff = cChk and cChk == cOff
isExactReleaseOn = rChk and rChk == rOn
isExactCommitOn = cChk and cChk == cOn
isLatestRelease = rOff and rOff == rOn or cOff and cOff == rcOn
isLatestCommit = cOff and cOff == cOn
isLocal = (
askExactRelease
and isExactReleaseOff
or askExactCommit
and isExactCommitOff
or askLatestAny
and (isLatestRelease or isLatestCommit)
or askLatestRelease
and isLatestRelease
or askLatestCommit
and isLatestCommit
)
mayLocal = (
askLatestAny
and (rOff or cOff)
or askLatestRelease
and rOff
or askLatestCommit
and cOff
)
canOnline = self.repoOnline
isOnline = canOnline and (
askExactRelease
and isExactReleaseOn
or askExactCommit
and isExactCommitOn
or askLatestAny
or askLatestRelease
or askLatestCommit
)
if offline:
if clone:
dirPath = self.dirPathClone
self.localBase = self.baseClone if os.path.exists(dirPath) else False
else:
self.localBase = (
self.baseLocal
if (
cChk
and cChk == cOff
or cChk is None
and cOff
or rChk
and rChk == rOff
or rChk is None
and rOff
)
else False
)
if not self.localBase:
method = self.warning if attempt else self.error
method(f"The requested {label} is not available offline")
# base = self.baseClone if clone else self.baseLocal
dirVersion = self.dirPathClone if clone else self.dirPathLocal
# method(f"\t{base}/{self.dataPath} not found")
method(f"\t{dirVersion} not found")
else:
if isLocal:
self.localBase = self.baseLocal
else:
if not canOnline:
if askLatest:
if mayLocal:
self.warning(f"The offline {label} may not be the latest")
self.localBase = self.baseLocal
else:
self.error(
f"The requested {label} is not available offline"
)
else:
self.warning(f"The requested {label} is not available offline")
self.error("No online connection")
elif not isOnline:
self.error(f"The requested {label} is not available online")
else:
self.localBase = self.baseLocal if self.download() else False
if self.localBase:
self.localDir = self.dataPath
state = (
"requested"
if askExact
else "latest release"
if rChk == "" and canOnline and self.releaseOff
else "latest? release"
if rChk == "" and not canOnline and self.releaseOff
else "latest commit"
if cChk == "" and canOnline and self.commitOff
else "latest? commit"
if cChk == "" and not canOnline and self.commitOff
else "local release"
if self.local == "local" and self.releaseOff
else "local commit"
if self.local == "local" and self.commitOff
else "local github"
if self.local == "clone"
else "for whatever reason"
)
offString = self.toString(
self.commitOff,
self.releaseOff,
self.local,
backend,
dest=self.dest,
source=self.source,
)
labelEsc = htmlEsc(label)
stateEsc = htmlEsc(state)
offEsc = htmlEsc(offString)
loc = f"{self.localBase}/{self.localDir}{self.versionRep}"
locEsc = htmlEsc(loc)
if _browse:
self.info(
f"Using {label} in {self.localBase}/{self.localDir}{self.versionRep}:"
)
self.info(f"\t{offString} ({state})")
else:
self.display(
(
f'<b title="{stateEsc}">{labelEsc}:</b>'
f' <span title="{offEsc}">{locEsc}</span>'
),
f"{label}: {loc}",
)
def download(self):
cChk = self.commitChk
rChk = self.releaseChk
fetched = False
if rChk is not None:
fetched = self.downloadRelease(rChk, showErrors=cChk is None)
if not fetched and cChk is not None:
fetched = self.downloadCommit(cChk, showErrors=True)
if fetched:
self.writeInfo()
return fetched
def downloadRelease(self, release, showErrors=True):
cChk = self.commitChk
r = self.getReleaseObj(release, showErrors=showErrors)
if not r:
return False
onGithub = self.onGithub
version = self.version
g = self.repoOnline
(commit, release) = self.getReleaseFromObj(r)
fetched = False
if onGithub:
assets = None
try:
assets = r.get_assets()
except Exception:
pass
assetUrl = None
versionRep3 = f"-{version}" if version else ""
relativeFlat = self.relative.replace("/", "-")
dataFile = f"{relativeFlat}{versionRep3}.zip"
if assets and assets.totalCount > 0:
for asset in assets:
if asset.name == dataFile:
assetUrl = asset.browser_download_url
break
if assetUrl:
fetched = self.downloadZip(assetUrl, showErrors=False)
if not fetched:
thisShowErrors = not cChk == ""
fetched = self.downloadCommit(commit, showErrors=thisShowErrors)
else:
fetched = self.downloadZip(g, shiftUp=True, commit=commit, showErrors=True)
if fetched:
self.commitOff = commit
self.releaseOff = release
return fetched
def downloadCommit(self, commit, showErrors=True):
c = self.getCommitObj(commit)
if not c:
return False
commit = self.getCommitFromObj(c)
fetched = self.downloadDir(commit, exclude=r"\.tfx", showErrors=showErrors)
if fetched:
self.commitOff = commit
self.releaseOff = None
return fetched
def downloadZip(self, where, shiftUp=False, commit=None, showErrors=True):
# commit parameter only supported for GitLab
conn = self.conn
backend = self.backend
label = self.label
repo = self.repo
dataDir = self.dataDir
canDownloadSubfolders = self.canDownloadSubfolders
dirPathLocal = self.dirPathLocal
withPaths = self.withPaths
dataUrl = None
g = None
if type(where) is str:
dataUrl = where
notice = where
again = True
else:
g = where
notice = backendRep(backend, "name")
again = False
self.info(f"\tdownloading from {notice} ... ")
try:
if dataUrl is not None:
r = requests.get(dataUrl, allow_redirects=True)
zf = r.content
elif g is not None:
if canDownloadSubfolders:
# zf = g.repository_archive(format="zip", sha=commit, path=dataDir)
response = conn.http_get(
f"/projects/{g.id}/repository/archive.zip",
query_data=dict(sha=commit, path=dataDir),
raw=True,
)
zf = response.content
if len(zf) == 0:
self.possibleError(
f"No directory {dataDir} in #{commit}",
showErrors,
again=False,
)
msg = "\tFailed"
self.possibleError(msg, showErrors=showErrors)
return False
else:
zf = g.repository_archive(format="zip", sha=commit)
zf = io.BytesIO(zf)
except Exception as e:
msg = f"\t{str(e)}\n\tcould not download from {notice}"
self.possibleError(msg, showErrors, again=again)
return False
self.info(f"\tsaving {label}")
cwd = os.getcwd()
destZip = (
os.path.dirname(dirPathLocal) if shiftUp and withPaths else dirPathLocal
)
good = True
if g:
gitlabSlugRe = re.compile(f"^{repo}(?:-master)?-[^/]*/")
try:
z = ZipFile(zf)
initTree(destZip, fresh=not self.keep)
os.chdir(destZip)
if withPaths:
if g:
nItems = 0
for zInfo in z.infolist():
zInfo.filename = gitlabSlugRe.sub("", zInfo.filename) or "/"
if zInfo.filename[-1] == "/":
continue
if zInfo.filename.startswith("__MACOS"):
continue
if not canDownloadSubfolders:
if not zInfo.filename.startswith(dataDir):
continue
z.extract(zInfo)
nItems += 1
if nItems == 0:
self.possibleError(
f"No directory {dataDir} in #{commit}",
showErrors,
again=False,
)
good = False
else:
z.extractall()
if os.path.exists("__MACOSX"):
rmtree("__MACOSX")
else:
nItems = 0
for zInfo in z.infolist():
if g:
zInfo.filename = gitlabSlugRe.sub("", zInfo.filename) or "/"
if zInfo.filename[-1] == "/":
continue
if zInfo.filename.startswith("__MACOS"):
continue
if (
g
and not canDownloadSubfolders
and not zInfo.filename.startswith(dataDir)
):
continue
zInfo.filename = os.path.basename(zInfo.filename)
z.extract(zInfo)
nItems += 1
if nItems == 0:
msg = f"#{commit}" if g else notice
self.possibleError(
f"No directory {dataDir} in {msg}",
showErrors,
again=False,
)
msg = "\tFailed"
self.possibleError(msg, showErrors=showErrors)
good = False
except Exception:
msg = f"\tcould not save {label} to {destZip}"
self.possibleError(msg, showErrors=showErrors, again=True)
os.chdir(cwd)
return False
os.chdir(cwd)
return good
def downloadDir(self, commit, exclude=None, showErrors=False):
g = self.repoOnline
if not g:
return None
onGithub = self.onGithub
backend = self.backend
destDir = f"{self.dirPathLocal}"
destSave = f"{self.dirPathSaveLocal}"
initTree(destDir, fresh=not self.keep)
excludeRe = re.compile(exclude) if exclude else None
good = True
if onGithub:
def _downloadDir(subPath, level=0):
nonlocal good
if not good:
return
lead = "\t" * level
try:
contents = g.get_contents(subPath, ref=commit)
except UnknownObjectException:
msg = (
f"{lead}No directory {subPath} in "
f"{self.toString(commit, None, False, backend)}"
)
self.possibleError(msg, showErrors, again=True, indent=lead)
good = False
return
for content in contents:
thisPath = content.path
self.info(f"\t{lead}{thisPath}...", newline=False)
if exclude and excludeRe.search(thisPath):
self.info("excluded")
continue
if content.type == "dir":
self.info("directory")
os.makedirs(f"{destSave}/{thisPath}", exist_ok=True)
_downloadDir(thisPath, level + 1)
else:
try:
fileContent = g.get_git_blob(content.sha)
fileData = base64.b64decode(fileContent.content)
fileDest = f"{destSave}/{thisPath}"
with open(fileDest, "wb") as fd:
fd.write(fileData)
self.info("downloaded")
except (GithubException, IOError):
msg = "error"
self.possibleError(msg, showErrors, again=True, indent=lead)
good = False
_downloadDir(self.dataDir, 0)
else:
good = self.downloadZip(g, shiftUp=True, commit=commit, showErrors=True)
if good:
self.info("\tOK")
else:
if onGithub:
msg = "\tFailed"
self.possibleError(msg, showErrors=showErrors if onGithub else True)
return good
def getRelease(self, release, showErrors=True):
r = self.getReleaseObj(release, showErrors=showErrors)
if not r:
return None
return self.getReleaseFromObj(r)
def getCommit(self, commit):
c = self.getCommitObj(commit)
if not c:
return None
return self.getCommitFromObj(c)
def getReleaseObj(self, release, showErrors=True):
g = self.repoOnline
if not g:
return None
onGithub = self.onGithub
r = None
msg = f' tagged "{release}"' if release else "s"
if onGithub:
try:
r = g.get_release(release) if release else g.get_latest_release()
except UnknownObjectException:
self.possibleError(
f"\tcannot find release{msg}", showErrors, newline=True
)
except Exception:
self.possibleError(
f"\tcannot find release{msg}",
showErrors,
newline=True,
)
else:
try:
if release:
r = g.releases.get(release)
else:
releases = g.releases.list(all=True)
r = (
sorted(releases, key=lambda r: r.released_at)[-1]
if releases
else None
)
except Exception:
r = None
if r is None:
self.possibleError(
f"\tcannot find release{msg}", showErrors, newline=True
)
return r
def getCommitObj(self, commit):
g = self.repoOnline
if not g:
return None
onGithub = self.onGithub
c = None
msg = f' with hash "{commit}"' if commit else "s"
if onGithub:
try:
cs = g.get_commits(sha=commit) if commit else g.get_commits()
if cs.totalCount:
c = cs[0]
else:
self.error(f"\tcannot find commit{msg}")
except Exception:
self.error(f"\tcannot find commit{msg}")
else:
try:
cs = g.commits.list(all=True)
if not len(cs):
self.error(f"\tno commit{msg}")
else:
cs = sorted(cs, key=lambda x: x.created_at)
if commit:
for com in cs:
if com.id == commit:
c = com
break
else:
if len(cs):
c = cs[-1]
if c is None:
self.error(f"\tcannot find commit{msg}")
except Exception:
self.error(f"\tcannot find commit{msg}")
return c
def getReleaseFromObj(self, r):
g = self.repoOnline
if not g:
return None
onGithub = self.onGithub
release = r.tag_name
if onGithub:
ref = g.get_git_ref(f"tags/{release}")
commit = ref.object.sha
else:
commit = r.commit["id"]
return (commit, release)
def getCommitFromObj(self, c):
g = self.repoOnline
if not g:
return None
onGithub = self.onGithub
return c.sha if onGithub else c.id
def fetchInfo(self):
g = self.repoOnline
if not g:
return
self.commitOn = None
self.releaseOn = None
self.releaseCommitOn = None
if self.releaseChk is not None:
result = self.getRelease(self.releaseChk, showErrors=self.commitChk is None)
if result:
(self.releaseCommitOn, self.releaseOn) = result
if self.commitChk is not None:
result = self.getCommit(self.commitChk)
if result:
self.commitOn = result
def fixInfo(self):
sDir = self.dirPathLocal
if not os.path.exists(sDir):
return
for sFile in EXPRESS_SYNC_LEGACY:
sPath = f"{sDir}/{sFile}"
if os.path.exists(sPath):
goodPath = f"{sDir}/{EXPRESS_SYNC}"
if os.path.exists(goodPath):
os.remove(sPath)
else:
os.rename(sPath, goodPath)
def readInfo(self):
if os.path.exists(self.filePathLocal):
with open(self.filePathLocal, encoding="utf8") as f:
for line in f:
string = line.strip()
(commit, release, local) = self.fromString(string)
if commit:
self.commitOff = commit
if release:
self.releaseOff = release
def writeInfo(self):
if not os.path.exists(self.dirPathLocal):
os.makedirs(self.dirPathLocal, exist_ok=True)
with open(self.filePathLocal, "w", encoding="utf8") as f:
if self.releaseOff:
f.write(f"{self.releaseOff}\n")
if self.commitOff:
f.write(f"{self.commitOff}\n")
def checkoutRepo(
backend,
_browse=False,
org="annotation",
repo="banks",
folder="tf",
version="",
checkout="",
source=None,
dest=None,
withPaths=True,
keep=True,
silent=SILENT_D,
label="data",
):
"""Checks out text-fabric data from an (online) repository.
The copy may be taken from any point in the commit history of the online repo.
If you call this function, it will check whether the requested data is already
on your computer in the expected location.
If not, it may check whether the data is online and if so, download it to the
expected location.
Parameters
----------
backend: string
`github` or `gitlab` or a GitLab instance such as `gitlab.huc.knaw.nl`.
org: string, optional "annotation"
The *org* on GitHub or the group on GitLab
repo: string, optional "banks"
The *repo* on GitHub or the project on GitLab
folder: string, optional `tf`
The subfolder in the repo that contains the text-fabric files.
If the tf files are versioned, it is the directory that
contains the version directories.
In most cases it is `tf` or it ends in `/tf`.
version: string, optional, the empty string
The version of the tf feature data
checkout: string, optional the mepty string
From which version/release/local copy we should extract the data.
* `""`: whatever you have locally in `~/text-fabric-data`.
If there is no data there, data will be downloaded.
* `local`: whatever you have locally in `~/text-fabric-data`.
If there is no data there, you get an error message.
* `clone`: whatever you have locally as a GitHub/GitLab clone
If there is no data there, you get an error message.
* `latest`: make sure the latest release has been fetched from online
* `hot`: make sure the latest commit has been fetched from online
* `vx.y.z`: make sure this specific release has been fetched from online
* `1234567890abcdef`: make sure this specific commit
has been fetched from online
See the
[repo](https://nbviewer.jupyter.org/github/annotation/banks/blob/master/tutorial/repo.ipynb)
notebook for an exhaustive demo of all the checkout options.
source: string, optional empty string
The base of your local repository clones.
If given, it overrides the semi-baked in `~/github` value.
dest: string, optional empty string
The base of your local cache of downloaded tf feature files.
If given, it overrides the semi-baked in `~/text-fabric-data` value.
withPaths: boolean, optional `True`
The data will be saved without the directory structure
of files that are being downloaded.
keep: boolean, optional `True`
If False, the destination directory will be cleared
before a download takes place.
silent: string, optional `tf.core.timestamp.SILENT_D`
See `tf.core.timestamp.Timestamp`
label: string, optional `data`
If passed, it will will change the word "data" in info messages
to what you choose.
We use `label='TF-app'` when we use this function to checkout the code
of a TF-app.
Returns
-------
(commitOffline, releaseOffline, kindLocal, localBase, localDir)
* *commitOffline* is the commit hash of the data you have offline afterwards
* *releaseOffline* is the release tag of the data you have offline afterwards
* *kindLocal* indicates whether an online check has been performed:
it is `None` if there has been an online check. Otherwise it is
`clone` if the data is in your `~/github` directory else it is `local`.
* *localBase* where the data is under: `~/github` or `~/text-fabric-data`,
or whatever you have passed as *source* and *dest*.
* *localDir* releative path from *localBase* to your data.
If your data has versions, *localDir* points to directory that has the versions,
not to a specific version.
Your local copy can be found:
* in the cache under your `~/text-fabric-data`, and from
there under *backend*/*org/repo* where *backend* is github or gitlab or
the server name of a gitlab instance.
or
* in the place where you store your clones from GitHub/GitLab:
`~/github` or `~/gitlab` or `~/`*backend*
(whatever the value of the *backend* parameter is.
From there it is under *org/repo*.
The actual feature files are in *folder/version* if there is a *version*,
else *folder*.
"""
inNb = runsInNotebook()
silent = silentConvert(silent)
if source is None:
source = backendRep(backend, "clone")
if dest is None:
dest = backendRep(backend, "cache")
def resolve(chkout, attempt=False):
rData = Checkout(
backend,
org,
repo,
folder,
chkout,
source,
dest,
keep,
withPaths,
silent,
_browse,
inNb,
version=version,
label=label,
)
rData.makeSureLocal(attempt=attempt)
return (
(
rData.commitOff,
rData.releaseOff,
rData.local,
rData.localBase,
rData.localDir,
)
if rData.localBase
else (None, None, False, False, None)
)
if checkout == "":
rData = resolve("local", attempt=True)
if rData[3]:
return rData
return resolve(checkout)
Computing file changes ...