https://github.com/cran/stringi
Raw File
Tip revision: 29ff8be9fae3ad316384d836756ecf06e3a9da8c authored by Marek Gagolewski on 28 June 2015, 00:00:00 UTC
version 0.5-5
Tip revision: 29ff8be
NEWS
                   stringi package NEWS and CHANGELOG
===============================================================================

## Known bugs in ICU

* [REGEX] #147: regex look-behind assertions may fail to find a multibyte
Unicode search pattern [scheduled for ICU4C 56.1, see
http://bugs.icu-project.org/trac/ticket/11554]

-------------------------------------------------------------------------------

## 0.5-5 (2015-06-28) **CRAN**

* [BACKWARD INCOMPATIBILITY] `stri_install_check()` and `stri_install_icudt()`
are now deprecated. From now on they are supposed to be used only
by the `stringi` installer.

* [BUGFIX] #176: a patch for `sys/feature_tests.h` no longer included
(the original file was copyrighted by Sun Microsystems); fixed the *Compiler
or options invalid for pre-UNIX 03 X/Open applications and pre-2001 POSIX
applications* error by forcing (conditionally) `_XPG6` conformance.

* [BUGFIX] #174: `stri_paste()` did not generate any warning when
the recycling rule is violated and `sep==""`.

* [BUGFIX] #170: `icu::setDataDirectory` no longer called if our ICU source bundle
is not used (this used to cause build problems on openSUSE).

* [BUILD TIME] #169: `./configure` now tries to switch to the "standard"
C++ compiler if a C++11 one is not properly configured.

* [BUILD TIME] `configure.win` (`Biarch: TRUE`) now mimics `autoconf`'s
`AC_SUBST` and `AC_CONFIG_FILES` so that the build process is now
more similar across different platforms.

* [NEW FEATURE] `stri_info()` now also gives information on which ICU4C is
used (system or bundle).

-------------------------------------------------------------------------------

## 0.5-2 (2015-06-21) **CRAN**

* [BACKWARD INCOMPATIBILITY] The second argument to `stri_pad_*()` has
been renamed `width`.

* [GENERAL] #69: `stringi` is now bundled with ICU4C 55.1.

* [NEW FUNCTIONS] `stri_extract_*_boundaries()` extract text between text
boundaries.

* [NEW FUNCTION] #46: `stri_trans_char()` is a `stringi`-flavoured
`chartr()` equivalent.

* [NEW FUNCTION] #8: `stri_width()` approximates the *width* of a string
in a more Unicodish fashion than `nchar(..., "width")`

* [NEW FEATURE] #149: `stri_pad()` and `stri_wrap()` now by default bases on
code point widths instead of the number of code points. Moreover, the default
behavior of `stri_wrap()` is now such that it does not get rid
of non-breaking, zero width, etc. spaces

* [NEW FEATURE] #133: `stri_wrap()` silently allows for `width <= 0`
(for compatibility with `strwrap()`).

* [NEW FEATURE] #139: `stri_wrap()` gained a new argument: `whitespace_only`.

* [NEW FUNCTIONS] #137: date-time formatting/parsing:
     * `stri_timezone_list()` - lists all known time zone identifiers
     * `stri_timezone_set()`, `stri_timezone_get()` - manage current default time zone
     * `stri_timezone_info()` - basic information on a given time zone
     * `stri_datetime_symbols()` - localizable date-time formatting data
     * `stri_datetime_fstr()` - convert a `strptime`-like format string
            to an ICU date/time format string
     * `stri_datetime_format()` - convert date/time to string
     * `stri_datetime_parse()` - convert string to date/time object
     * `stri_datetime_create()` - construct date-time objects
            from numeric representations
     * `stri_datetime_now()` - return current date-time
     * `stri_datetime_fields()` - get values for date-time fields
     * `stri_datetime_add()` - add specific number of date-time units
            to a date-time object

* [GENERAL] #144: Performance improvements in handling ASCII strings
(these affect `stri_sub()`, `stri_locate()` and other string index-based
 operations)

* [GENERAL] #143: Searching for short fixed patterns (`stri_*_fixed()`) now
relies on the current `libC`'s implementation of `strchr()` and `strstr()`.
This is very fast e.g. on `glibc` utilizing the `SSE2/3/4` instruction set.

* [BUILD TIME] #141: a local copy of `icudt*.zip` may be used on package
install; see the `INSTALL` file for more information.

* [BUILD TIME] #165: the `./configure` option `--disable-icu-bundle`
forces the use of system ICU when building the package.

* [BUGFIX] locale specifiers are now normalized in a more intelligent way:
e.g. `@calendar=gregorian` expands to `DEFAULT_LOCALE@calendar=gregorian`.

* [BUGFIX] #134: `stri_extract_all_words()` did not accept `simplify=NA`.

* [BUGFIX] #132: incorrect behavior in `stri_locate_regex()` for matches
of zero lengths

* [BUGFIX] stringr/#73: `stri_wrap()` returned `CHARSXP` instead of `STRSXP`
on empty string input with `simplify=FALSE` argument.

* [BUGFIX] #164: `libicu-dev` usage used to fail on Ubuntu
(`LIBS` shall be passed after `LDFLAGS` and the list of `.o` files).

* [BUGFIX] #168: Build now fails if `icudt` is not available.

* [BUGFIX] #135: C++11 is now used by default (see the `INSTALL` file,
however) to build `stringi` from sources. This is because ICU4C uses the
`long long` type which is not part of the C++98 standard.

* [BUGFIX] #154: Dates and other objects with a custom class attribute
were not coerced to the character type correctly.

* [BUGFIX] Force ICU `u_init()` call on `stringi` dynlib load.

* [BUGFIX] #157: many overfull hboxes in the package PDF manual has been
corrected.

-------------------------------------------------------------------------------

## 0.4-1 (2014-12-11) **CRAN**

* [IMPORTANT CHANGE] `n_max` argument in `stri_split_*()` has been renamed `n`.

* [IMPORTANT CHANGE] `simplify=FALSE` in `stri_extract_all_*()` and
`stri_split_*()` now calls `stri_list2matrix()` with `fill=""`.
`fill=NA_character_` may be obtained by using `simplify=NA`.

* [IMPORTANT CHANGE, NEW FUNCTIONS] #120: `stri_extract_words` has been
renamed `stri_extract_all_words` and `stri_locate_boundaries` -
`stri_locate_all_boundaries` as well as `stri_locate_words` -
`stri_locate_all_words`. New functions are now available:
`stri_locate_first_boundaries`, `stri_locate_last_boundaries`,
`stri_locate_first_words`, `stri_locate_last_words`,
`stri_extract_first_words`, `stri_extract_last_words`.

* [IMPORTANT CHANGE] #111: `opts_regex`, `opts_collator`, `opts_fixed`, and
`opts_brkiter` can now be supplied individually via `...`.
In other words, you may now simply call e.g.
`stri_detect_regex(str, pattern, case_insensitive=TRUE)` instead of
`stri_detect_regex(str, pattern, opts_regex=stri_opts_regex(case_insensitive=TRUE))`.

* [NEW FEATURE] #110: Fixed pattern search engine's settings can
now be supplied via `opts_fixed` argument in `stri_*_fixed()`,
see `stri_opts_fixed()`. A simple (not suitable for natural language
processing) yet very fast `case_insensitive` pattern matching can be
performed now. `stri_extract_*_fixed` is again available.

* [NEW FEATURE] #23: `stri_extract_all_fixed`, `stri_count`, and
`stri_locate_all_fixed` may now also look for overlapping pattern
matches, see `?stri_opts_fixed`.

* [NEW FEATURE] #129: `stri_match_*_regex` gained a `cg_missing` argument.

* [NEW FEATURE] #117: `stri_extract_all_*()`, `stri_locate_all_*()`,
`stri_match_all_*()` gained a new argument: `omit_no_match`.
Setting it to `TRUE` makes these functions compatible with their
`stringr` equivalents.

* [NEW FEATURE] #118: `stri_wrap()` gained `indent`, `exdent`, `initial`,
and `prefix` arguments. Moreover Knuth's dynamic word wrapping algorithm
now assumes that the cost of printing the last line is zero, see #128.

* [NEW FEATURE] #122: `stri_subset()` gained an `omit_na` argument.

* [NEW FEATURE] `stri_list2matrix()` gained an `n_min` argument.

* [NEW FEATURE] #126: `stri_split()` now is also able to act
just like `stringr::str_split_fixed()`.

* [NEW FEATURE] #119:  `stri_split_boundaries()` now have
`n`, `tokens_only`, and `simplify` arguments. Additionally,
`stri_extract_all_words()` is now equipped with `simplify` arg.

* [NEW FEATURE] #116: `stri_paste()` gained a new argument:
`ignore_null`. Setting it to `TRUE` makes this function more compatible
with `paste()`.

* [OTHER] #123: `useDynLib` is used to speed up symbol look-up in
the compiled dynamic library.

* [BUGFIX] #114: `stri_paste()`: could return result in an incorrect order.

* [BUGFIX]  #94: Run-time errors on Solaris caused by setting
`-DU_DISABLE_RENAMING=1` -- memory allocation errors in i.a. ICU's
UnicodeString. This setting also caused some ABSan sanity check
failures within ICU code.

-------------------------------------------------------------------------------

## 0.3-1 (2014-11-06) **CRAN**

* [IMPORTANT CHANGE] #87: `%>%` overlapped with the pipe operator from
the `magrittr` package; now each operator like `%>%` has been renamed `%s>%`.

* [IMPORTANT CHANGE] #108: Now the BreakIterator (for text boundary analysis)
may be better controlled via `stri_opts_brkiter()` (see options `type`
and `locale` which aim to replace now-removed `boundary` and `locale` parameters
to `stri_locate_boundaries`, `stri_split_boundaries`, `stri_trans_totitle`,
`stri_extract_words`, `stri_locate_words`).

* [NEW FUNCTIONS] #109: `stri_count_boundaries()` and `stri_count_words()`
count the number of text boundaries in a string.

* [NEW FUNCTIONS] #41: `stri_startswith_*()` and `stri_endswith_*()`
determine whether a string starts or ends with a given pattern.

* [NEW FEATURE] #102: `stri_replace_all_*()` gained a `vectorize_all()` parameter,
which defaults to TRUE for backward compatibility.

* [NEW FUNCTION] #91: `stri_subset_*()`, a convenient and more efficient
substitute for `str[stri_detect_*(str, ...)]`, added.

* [NEW FEATURE] #100: `stri_split_fixed()`, `stri_split_charclass()`,
`stri_split_regex()`, `stri_split_coll()` gained a `tokens_only` parameter,
which defaults to `FALSE` for backward compatibility.

* [NEW FUNCTION] #105: `stri_list2matrix()` converts lists of atomic vectors
to character matrices, useful in connection with `stri_split`
and `stri_extract`.

* [NEW FEATURE] #107: `stri_split_*()` now allow setting an `omit_empty=NA` argument.

* [NEW FEATURE] #106: `stri_split()` and `stri_extract_all()` gained a `simplify`
argument (if `TRUE`, then `stri_list2matrix(..., byrow=TRUE)`
is called on the resulting list.

* [NEW FUNCTION] #77: `stri_rand_lipsum()` generates
(pseudo)random dummy *lorem ipsum* text.

* [NEW FEATURE] #98: `stri_trans_totitle()` gained a `opts_brkiter`
parameter; it indicates which ICU BreakIterator should be used when
performing case mapping.

* [NEW FEATURE] `stri_wrap()` gained a new parameter: `normalize`.

* [BUGFIX] #86: `stri_*_fixed()`, `stri_*_coll()`, and `stri_*_regex()` could
give incorrect results if one of search strings were of length 0.

* [BUGFIX] #99: `stri_replace_all()` did not use the `replacement` arg.

* [BUGFIX] #112: Some of the objects were not PROTECTed from
being garbage collected, which might have caused spontaneous SEGFAULTS.

* [BUGFIX] Some collator's options were not passed correctly to ICU services.

* [BUGFIX] Memory leaks causes as detected by
`valgrind --tool=memcheck --leak-check=full` have been removed.

* [DOCUMENTATION] Significant extensions/clean ups in the `stringi` manual.

-------------------------------------------------------------------------------

## 0.2-5 (2014-05-16) **CRAN**

* ~~icudt-dependent examples are no longer run if `icudt` is not available.~~

-------------------------------------------------------------------------------

## 0.2-4 (2014-05-15) **CRAN**

* [BUGFIX] Issues with loading of misaligned addresses in `stri_*_fixed()`.

-------------------------------------------------------------------------------

## 0.2-3 (2014-05-14) **CRAN**

* [IMPORTANT CHANGE] `stri_cmp*()` now do not allow for passing
 `opts_collator=NA`. From now on, `stri_cmp_eq`, `stri_cmp_neq`,
 and the new operators `%===%`, `%!==%`, `%stri===%`, and `%stri!==%`
 are locale-independent operations, which base on code point comparisons.
 New functions `stri_cmp_equiv` and `stri_cmp_nequiv`
 (and from now on also `%==%`, `%!=%`, `%stri==%`, and `%stri!=%`)
 test for canonical equivalence.

* [IMPORTANT CHANGE] `stri_*_fixed()` search functions now perform
 a locale-independent exact (byte-wise, of course after conversion to UTF-8)
 pattern search. All the Collator-based, locale-dependent search routines
 are now available via `stri_*_coll()`. The reason for this is that
 ICU USearch has currently very poor performance and in many search tasks
 in fact it is sufficient to do exact pattern matching.

* [GENERAL] `stri_*_fixed` now use a tweaked Knuth-Morris-Pratt search
 algorithm, which improves the search performance drastically.

* [IMPORTANT CHANGE] `stri_enc_nf*()` and `stri_enc_isnf*()` function families
have been renamed to `stri_trans_nf*()` and `stri_trans_isnf*()`,
respectively. This is because they deal with text transforming,
and not with character encoding. Moreover, all such operations may
also be performed by ICU's Transliterator (see below).

* [NEW FUNCTION] `stri_trans_general()` and `stri_trans_list()` give access
to ICU's Transliterator: may be used to perform very general
text transforms.

* [NEW FUNCTION `stri_split_boundaries()` utilizes ICU's BreakIterator
to split strings at specific text boundaries. Moreover,
stri_locate_boundaries indicates positions of these boundaries.

* [NEW FUNCTION] `stri_extract_words()` uses ICU's BreakIterator to
extract all words from a text. Additionally, `stri_locate_words`
locates start and end positions of words in a text.

* [NEW FUNCTION] `stri_pad()`, `stri_pad_left()`, `stri_pad_right()`,
and `stri_pad_both` pad a string with a specific code point.

* [NEW FUNCTION] `stri_wrap()` breaks paragraphs of text into lines.
Two algorithms (greedy and minimal raggedness) are available.

* [IMPORTANT CHANGE] `stri_*_charclass()` search functions now
rely solely on ICU's UnicodeSet patterns. All previously accepted
charclass identifiers became invalid. However, new patterns
should now be more familiar to the users (they are regex-like).
Moreover, we observe a very nice performance gain.

* [IMPORTANT CHANGE] `stri_sort()` now does not include `NA`s
in output vectors by default, for compatibility with `sort()`.
Moreover, currently none of the input vector's attributes are preserved.

* [NEW FUNCTION] `stri_unique()` extracts unique elements from
a character vector.

* [NEW FUNCTIONS] `stri_duplicated()` and `stri_duplicated_any()`
determine duplicate elements in a character vector.

* [NEW FUNCTION] `stri_replace_na()` replaces `NA`s in a character vector
with a given string, useful for emulating e.g. R's `paste()` behavior.

* [NEW FUNCTION] `stri_rand_shuffle()` generates a random permutation
of code points in a string.

* [NEW FUNCTION] `stri_rand_strings()` generates random strings.

* [NEW FUNCTIONS] New functions and binary operators for string comparison:
`stri_cmp_eq()`, `stri_cmp_neq()`, `stri_cmp_lt()`, `stri_cmp_le()`,
`stri_cmp_gt()`, `stri_cmp_ge()`, `%==%`, `%!=%`, `%<%`, `%<=%`, `%>%`, `%>=%`.

* [NEW FUNCTION] `stri_enc_mark()` reads declared encodings of character
strings as seen by `stringi`.

* [NEW FUNCTION] `stri_enc_tonative(str)` is an alias to
`stri_encode(str, NULL, NULL)`.

* [NEW FEATURE] `stri_order()` and `stri_sort()` now have an additional argument
`na_last` (defaults to `TRUE` and `NA`, respectively).

* [NEW FEATURE] `stri_replace_all_charclass()`, `stri_extract_all_charclass()`,
and `stri_locate_all_charclass()` now have a new arg, `merge`
(defaults to `FALSE` for backward-compatibility). It may be used
to e.g. replace sequences of white spaces with a single space.

* [NEW FEATURE] `stri_enc_toutf8()` now has a new `validate` arg (defaults
to FALSE for backward-compatibility). It may be used in a (rare) case
in which a user wants to fix an invalid UTF-8 byte sequence.
stri_length (among others) now detect invalid UTF-8 byte sequences.

* [NEW FEATURE] All binary operators `%???%` now also have aliases `%stri???%`.

* [GENERAL] Performance improvements in `StriContainerUTF8`
and `StriContainerUTF16` (they affect most other functions).

* [GENERAL] Significant performance improvements in `stri_join()`,
`stri_flatten()`, `stri_cmp()`, `stri_trans_to*()`, and others.

* [GENERAL] Added 3rd mirror site for our `icudt` binary distribution.

* `U_MISSING_RESOURCE_ERROR` message in `StriException` now suggests
calling `stri_install_check()`.

* [BUGFIX] UTF-8 BOMs are now silently removed from input strings.

* [BUGFIX] no more attempts to re-encode UTF-8 encoded strings
if native encoding=UTF-8 in `StriContainerUTF8`.

* [BUGFIX] possible memory leaks when throwing errors via `Rf_error()`.

* [BUGFIX] `stri_order()` and `stri_cmp()` could return incorrect results
for `opts_collator=NA`.

* [BUGFIX] `stri_sort()` did not guarantee to return strings in UTF-8.

-------------------------------------------------------------------------------

## 0.1-25 (2014-03-12) **CRAN**

* LICENSE tweaks.

* Initial CRAN release.

-------------------------------------------------------------------------------

## 0.1-24 (2014-03-11) **devel**

* Fixed bugs detected with `ASan` and `UBSan`,
   e.g. fixed `CharClass::gcmask` type (`enum` -> `uint32_t`)
   (reported by `UBSan`).

* Fixed array over-runs detected with `valgrind` in `string8.h`.

* Fixed unitialized class fields in `StriContainerUTF8`
   (reported by `valgrind`).

-------------------------------------------------------------------------------

## 0.1-23 (2014-03-11) **devel**

* License changed to BSD-3-clause, COPYRIGHTS updated.

* `icudt` is not shipped with `stringi` anymore;
   it is now downloaded in `install.libs.R` from one of our servers.

* New functions: `stri_install_check()`, `stri_install_icudt()`.

-------------------------------------------------------------------------------

## 0.1-22 (2014-02-20) **devel**

* System ICU is used on systems which do have one (version >= 50 needed).
  ICU is autodetected with `pkg-config` in `./configure`.
  Pass `'--disable-pkg-config'` to `./configure` to force building
  ICU from sources.

* `icudt52b` (custom subset) is now shipped with `stringi`
  (for big-endian, ASCII systems).

-------------------------------------------------------------------------------

## 0.1-21 (2014-02-19) **devel**

* Fixed some Solaris-related issues while preparing `stringi`
  for CRAN submission.

-------------------------------------------------------------------------------

## 0.1-20 (2014-02-17) **devel**

* ICU4C 52.1 sources included (common, i18n, stubdata + icu52dt.dat
  loaded dynamically). Compilation via Makevars.

* `stringi` now does not depend on any external libraries.

-------------------------------------------------------------------------------

## 0.1-11 (2013-11-16) **devel**

* ICU4C is now statically linked on Windows.

* First OS X binary build.

* The package is being intensively tested by our students @ FMIS WUT.

-------------------------------------------------------------------------------

## 0.1-10 (2013-11-13) **devel**

* Using `pkg-config` via `./configure` to look for ICU4C libs.

-------------------------------------------------------------------------------

## 0.1-6 (2013-07-05) **devel**

* First Windows binary build.

* Compilation passed on Oracle Sun Studio compiler collection.

* By now we have implemented most of the functionality
  scheduled for milestone 0.1.

-------------------------------------------------------------------------------

## 0.1-1 (2013-01-05) **devel**

* The `stringi` project has been established on GitHub.

-------------------------------------------------------------------------------
back to top