https://github.com/snowballstem/pystemmer
Tip revision: cbe740d525d01ae668ad59354aa75eff2b325f4e authored by Stefano Rivera on 26 November 2023, 19:48:16 UTC
Add Python 3.12 to CI
PyStemmer
=========

What is PyStemmer?
------------------

PyStemmer is a Python interface to the stemming algorithms from the Snowball
project (https://snowballstem.org/).

Snowball can generate pure-Python stemmer code, but if you want to stem a
lot of words this can be rather slow.

PyStemmer instead wraps the "libstemmer_c" library which is built from C
code generated by Snowball.

An alternative to using PyStemmer directly is to use the snowballstemmer
module from Snowball, which will automatically use PyStemmer if available,
falling back to the pure Python implementations if not.  This allows your
users to choose between the convenience of only dealing with pure Python
code and the significantly better performance of PyStemmer.
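The fallback described above can be sketched as follows.  This is only an illustration: ``load_stemmer`` is a hypothetical helper, not part of either package, and it assumes the ``snowballstemmer.stemmer()`` factory and PyStemmer's ``Stemmer.Stemmer`` class.  It returns ``None`` when neither package is installed.

```python
def load_stemmer(language="english"):
    """Return a stemmer for `language`, or None if no backend is installed.

    Hypothetical helper for illustration only.
    """
    try:
        # snowballstemmer picks PyStemmer itself when it is importable,
        # and otherwise uses its pure-Python implementation.
        import snowballstemmer
        return snowballstemmer.stemmer(language)
    except ImportError:
        pass
    try:
        # "Stemmer" is the module that PyStemmer installs.
        import Stemmer
        return Stemmer.Stemmer(language)
    except ImportError:
        return None


stemmer = load_stemmer("english")
if stemmer is not None:
    # Both backends expose stemWord() and stemWords().
    print(stemmer.stemWords(["connected", "connecting"]))
```

Code written against ``stemWord()`` and ``stemWords()`` should then work unchanged with either backend.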

What is Stemming?
-----------------

Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*.  So a search for *connected*
would also find documents which contain only the other forms.
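As a small illustration, the example words above can be run through the English stemmer.  This sketch assumes PyStemmer is installed and importable as ``Stemmer``; ``stem_words`` is a hypothetical helper that returns ``None`` if it is not.

```python
WORDS = ["connection", "connections", "connective", "connected", "connecting"]


def stem_words(words, language="english"):
    """Stem each word with PyStemmer, or return None if it is not installed."""
    try:
        import Stemmer  # the module that PyStemmer installs
    except ImportError:
        return None
    return Stemmer.Stemmer(language).stemWords(words)


stems = stem_words(WORDS)
if stems is not None:
    # Print each word next to the stem it is mapped to.
    for word, stem in zip(WORDS, stems):
        print(f"{word} -> {stem}")
```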

This stem form is often a word itself, but not always, because text search
systems (the intended field of use) do not require it to be one.
We also aim to conflate words with the same meaning, rather than all
words with a common linguistic root (so *awe* and *awful* don't have the same
stem), and over-stemming is more problematic than under-stemming so we tend not
to stem in cases that are hard to resolve.  If you want to always reduce words
to a root form and/or get a root form which is itself a word then Snowball's
stemming algorithms likely aren't the right answer.

Requirements
------------

Python header files should be installed.

This version of PyStemmer has been CI tested using Python series 3.6, 3.7,
3.8, 3.9, 3.10, 3.11, 3.12, pypy and pypy3.

We no longer actively support Python 2 as the Python developers stopped
supporting it at the start of 2020. PyStemmer 2.2.0.1 was the final version
which we tested with Python 2.

PyStemmer can use a system install of libstemmer_c (from a package manager or
an install you've previously done by hand).  To do this, make sure that the
development headers are installed (these may be in a separate binary package
with a ``-dev`` or ``-devel`` suffix) and set the environment variable
``PYSTEMMER_SYSTEM_LIBSTEMMER`` to a non-empty value.

Otherwise PyStemmer will do a private build of libstemmer_c and use that.
It looks for a tarball of the corresponding libstemmer_c release in the top
level directory, and will attempt to automatically download it if not
present (with a checksum check).  If you want to avoid the downloading step
(for example, to build in an environment which doesn't allow internet access,
or to avoid build failures due to connectivity problems) you can make sure
that the tarball is already present before building.
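The choice between these build modes can be summarised in a short sketch.  It only restates the rules above and is not PyStemmer's actual build code: the function name ``libstemmer_source`` and its return labels are hypothetical.

```python
import os


def libstemmer_source(environ, tarball_present):
    """Describe where libstemmer_c would come from under the rules above.

    Hypothetical illustration; the labels returned are not used by PyStemmer.
    """
    # Any non-empty value selects the system-wide libstemmer_c.
    if environ.get("PYSTEMMER_SYSTEM_LIBSTEMMER"):
        return "system"
    # Otherwise a private build is done from the release tarball,
    # which is used directly if it is already in the top level directory...
    if tarball_present:
        return "local-tarball"
    # ...and downloaded (with a checksum check) if it is not.
    return "download"


print(libstemmer_source(os.environ, tarball_present=False))
```

Passing the environment as an argument keeps the sketch easy to try with different settings.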

Installation
------------

PyStemmer uses distutils, so all that is necessary to build and install
PyStemmer is the usual distutils invocation::

    python setup.py install

You can also install using ``pip``:

    * from PyPI: ``pip install pystemmer``
    * from a local copy of the code: ``pip install .``
    * from git: ``pip install git+https://github.com/snowballstem/pystemmer``

API
---

PyStemmer's API is documented in its docstrings.

A brief overview can be found in ``docs/quickstart.txt``.

License
-------

PyStemmer is copyright (c) 2006, Richard Boulton, and is licensed under the MIT
license: see the file "LICENSE" for the full text of this.  It was inspired
by an earlier implementation (which was copyright (c) 2001, Andreas Jung, and
also licensed under the MIT license, but no portions of which remain in this
package, and had a different API).

The Snowball algorithms, and the Snowball library, are copyright (c) 2001-2006,
Dr Martin Porter and Richard Boulton, and are licensed under the BSD license.