Raw File
basic.jl
# This file is a part of Julia. License is MIT: https://julialang.org/license

"""
The `AbstractString` type is the supertype of all string implementations in
Julia. Strings are encodings of sequences of [Unicode](https://unicode.org/)
code points as represented by the `AbstractChar` type. Julia makes a few assumptions
about strings:

* Strings are encoded in terms of fixed-size "code units"
  * Code units can be extracted with `codeunit(s, i)`
  * The first code unit has index `1`
  * The last code unit has index `ncodeunits(s)`
  * Any index `i` such that `1 ≤ i ≤ ncodeunits(s)` is in bounds
* String indexing is done in terms of these code units:
  * Characters are extracted by `s[i]` with a valid string index `i`
  * Each `AbstractChar` in a string is encoded by one or more code units
  * Only the index of the first code unit of an `AbstractChar` is a valid index
  * The encoding of an `AbstractChar` is independent of what precedes or follows it
  * String encodings are [self-synchronizing] – i.e. `isvalid(s, i)` is O(1)

[self-synchronizing]: https://en.wikipedia.org/wiki/Self-synchronizing_code

Some string functions that extract code units, characters or substrings from
strings error if you pass them out-of-bounds or invalid string indices. This
includes `codeunit(s, i)` and `s[i]`. Functions that do string
index arithmetic take a more relaxed approach to indexing and give you the
closest valid string index when in-bounds, or when out-of-bounds, behave as if
there were an infinite number of characters padding each side of the string.
Usually these imaginary padding characters have code unit length `1` but string
types may choose different "imaginary" character sizes as makes sense for their
implementations (e.g. substrings may pass index arithmetic through to the
underlying string they provide a view into). Relaxed indexing functions include
those intended for index arithmetic: `thisind`, `nextind` and `prevind`. This
model allows index arithmetic to work with out-of- bounds indices as
intermediate values so long as one never uses them to retrieve a character,
which often helps avoid needing to code around edge cases.

See also: [`codeunit`](@ref), [`ncodeunits`](@ref), [`thisind`](@ref),
[`nextind`](@ref), [`prevind`](@ref)
"""
AbstractString

## required string functions ##

"""
    ncodeunits(s::AbstractString) -> Int

Return the number of code units in a string. Indices that are in bounds to
access this string must satisfy `1 ≤ i ≤ ncodeunits(s)`. Not all such indices
are valid – they may not be the start of a character, but they will return a
code unit value when calling `codeunit(s,i)`.

See also: [`codeunit`](@ref), [`checkbounds`](@ref), [`sizeof`](@ref),
[`length`](@ref), [`lastindex`](@ref)
"""
ncodeunits(s::AbstractString)

"""
    codeunit(s::AbstractString) -> Type{<:Union{UInt8, UInt16, UInt32}}

Return the code unit type of the given string object. For ASCII, Latin-1, or
UTF-8 encoded strings, this would be `UInt8`; for UCS-2 and UTF-16 it would be
`UInt16`; for UTF-32 it would be `UInt32`. The unit code type need not be
limited to these three types, but it's hard to think of widely used string
encodings that don't use one of these units. `codeunit(s)` is the same as
`typeof(codeunit(s,1))` when `s` is a non-empty string.

See also: [`ncodeunits`](@ref)
"""
codeunit(s::AbstractString)

"""
    codeunit(s::AbstractString, i::Integer) -> Union{UInt8, UInt16, UInt32}

Return the code unit value in the string `s` at index `i`. Note that

    codeunit(s, i) :: codeunit(s)

I.e. the value returned by `codeunit(s, i)` is of the type returned by
`codeunit(s)`.

See also: [`ncodeunits`](@ref), [`checkbounds`](@ref)
"""
@propagate_inbounds codeunit(s::AbstractString, i::Integer) = typeof(i) === Int ?
    throw(MethodError(codeunit, (s, i))) : codeunit(s, Int(i))

"""
    isvalid(s::AbstractString, i::Integer) -> Bool

Predicate indicating whether the given index is the start of the encoding of a
character in `s` or not. If `isvalid(s, i)` is true then `s[i]` will return the
character whose encoding starts at that index, if it's false, then `s[i]` will
raise an invalid index error or a bounds error depending on if `i` is in bounds.
In order for `isvalid(s, i)` to be an O(1) function, the encoding of `s` must be
[self-synchronizing](https://en.wikipedia.org/wiki/Self-synchronizing_code) this
is a basic assumption of Julia's generic string support.

See also: [`getindex`](@ref), [`iterate`](@ref), [`thisind`](@ref),
[`nextind`](@ref), [`prevind`](@ref), [`length`](@ref)

# Examples

```jldoctest
julia> str = "αβγdef";

julia> isvalid(str, 1)
true

julia> str[1]
'α': Unicode U+03b1 (category Ll: Letter, lowercase)

julia> isvalid(str, 2)
false

julia> str[2]
ERROR: StringIndexError("αβγdef", 2)
Stacktrace:
[...]
```
"""
@propagate_inbounds isvalid(s::AbstractString, i::Integer) = typeof(i) === Int ?
    throw(MethodError(isvalid, (s, i))) : isvalid(s, Int(i))

"""
    iterate(s::AbstractString, i::Integer) -> Union{Tuple{<:AbstractChar, Int}, Nothing}

Return a tuple of the character in `s` at index `i` with the index of the start
of the following character in `s`. This is the key method that allows strings to
be iterated, yielding a sequences of characters. If `i` is out of bounds in `s`
then a bounds error is raised. The `iterate` function, as part of the iteration
protocol may assume that `i` is the start of a character in `s`.

See also: [`getindex`](@ref), [`checkbounds`](@ref)
"""
@propagate_inbounds iterate(s::AbstractString, i::Integer) = typeof(i) === Int ?
    throw(MethodError(iterate, (s, i))) : iterate(s, Int(i))

## basic generic definitions ##

eltype(::Type{<:AbstractString}) = Char # some string types may use another AbstractChar

"""
    sizeof(str::AbstractString)

Size, in bytes, of the string `str`. Equal to the number of code units in `str` multiplied by
the size, in bytes, of one code unit in `str`.

# Examples
```jldoctest
julia> sizeof("")
0

julia> sizeof("∀")
3
```
"""
sizeof(s::AbstractString) = ncodeunits(s) * sizeof(codeunit(s))
firstindex(s::AbstractString) = 1
lastindex(s::AbstractString) = thisind(s, ncodeunits(s))

function getindex(s::AbstractString, i::Integer)
    @boundscheck checkbounds(s, i)
    @inbounds return isvalid(s, i) ? iterate(s, i)[1] : string_index_err(s, i)
end

getindex(s::AbstractString, i::Colon) = s
# TODO: handle other ranges with stride ±1 specially?
# TODO: add more @propagate_inbounds annotations?
getindex(s::AbstractString, v::AbstractVector{<:Integer}) =
    sprint(io->(for i in v; write(io, s[i]) end), sizehint=length(v))
getindex(s::AbstractString, v::AbstractVector{Bool}) =
    throw(ArgumentError("logical indexing not supported for strings"))

function get(s::AbstractString, i::Integer, default)
# TODO: use ternary once @inbounds is expression-like
    if checkbounds(Bool, s, i)
        @inbounds return s[i]
    else
        return default
    end
end

## bounds checking ##

checkbounds(::Type{Bool}, s::AbstractString, i::Integer) =
    1 ≤ i ≤ ncodeunits(s)
checkbounds(::Type{Bool}, s::AbstractString, r::AbstractRange{<:Integer}) =
    isempty(r) || (1 ≤ minimum(r) && maximum(r) ≤ ncodeunits(s))
checkbounds(::Type{Bool}, s::AbstractString, I::AbstractArray{<:Real}) =
    all(i -> checkbounds(Bool, s, i), I)
checkbounds(::Type{Bool}, s::AbstractString, I::AbstractArray{<:Integer}) =
    all(i -> checkbounds(Bool, s, i), I)
checkbounds(s::AbstractString, I::Union{Integer,AbstractArray}) =
    checkbounds(Bool, s, I) ? nothing : throw(BoundsError(s, I))

## construction, conversion, promotion ##

string() = ""
string(s::AbstractString) = s

(::Type{Vector{UInt8}})(s::AbstractString) = unsafe_wrap(Vector{UInt8}, String(s))
(::Type{Array{UInt8}})(s::AbstractString) = unsafe_wrap(Vector{UInt8}, String(s))
(::Type{Vector{T}})(s::AbstractString) where {T<:AbstractChar} = collect(T, s)

Symbol(s::AbstractString) = Symbol(String(s))
Symbol(x...) = Symbol(string(x...))

convert(::Type{T}, s::T) where {T<:AbstractString} = s
convert(::Type{T}, s::AbstractString) where {T<:AbstractString} = T(s)

## string & character concatenation ##

"""
    *(s::Union{AbstractString, AbstractChar}, t::Union{AbstractString, AbstractChar}...) -> AbstractString

Concatenate strings and/or characters, producing a [`String`](@ref). This is equivalent
to calling the [`string`](@ref) function on the arguments. Concatenation of built-in
string types always produces a value of type `String` but other string types may choose
to return a string of a different type as appropriate.

# Examples
```jldoctest
julia> "Hello " * "world"
"Hello world"

julia> 'j' * "ulia"
"julia"
```
"""
(*)(s1::Union{AbstractChar, AbstractString}, ss::Union{AbstractChar, AbstractString}...) = string(s1, ss...)

one(::Union{T,Type{T}}) where {T<:AbstractString} = convert(T, "")

## generic string comparison ##

"""
    cmp(a::AbstractString, b::AbstractString) -> Int

Compare two strings. Return `0` if both strings have the same length and the character
at each index is the same in both strings. Return `-1` if `a` is a prefix of `b`, or if
`a` comes before `b` in alphabetical order. Return `1` if `b` is a prefix of `a`, or if
`b` comes before `a` in alphabetical order (technically, lexicographical order by Unicode
code points).

# Examples
```jldoctest
julia> cmp("abc", "abc")
0

julia> cmp("ab", "abc")
-1

julia> cmp("abc", "ab")
1

julia> cmp("ab", "ac")
-1

julia> cmp("ac", "ab")
1

julia> cmp("α", "a")
1

julia> cmp("b", "β")
-1
```
"""
function cmp(a::AbstractString, b::AbstractString)
    a === b && return 0
    a, b = Iterators.Stateful(a), Iterators.Stateful(b)
    for (c, d) in zip(a, b)
        c ≠ d && return ifelse(c < d, -1, 1)
    end
    isempty(a) && return ifelse(isempty(b), 0, -1)
    return 1
end

"""
    ==(a::AbstractString, b::AbstractString) -> Bool

Test whether two strings are equal character by character (technically, Unicode
code point by code point).

# Examples
```jldoctest
julia> "abc" == "abc"
true

julia> "abc" == "αβγ"
false
```
"""
==(a::AbstractString, b::AbstractString) = cmp(a, b) == 0

"""
    isless(a::AbstractString, b::AbstractString) -> Bool

Test whether string `a` comes before string `b` in alphabetical order
(technically, in lexicographical order by Unicode code points).

# Examples
```jldoctest
julia> isless("a", "b")
true

julia> isless("β", "α")
false

julia> isless("a", "a")
false
```
"""
isless(a::AbstractString, b::AbstractString) = cmp(a, b) < 0

# faster comparisons for symbols

cmp(a::Symbol, b::Symbol) = Int(sign(ccall(:strcmp, Int32, (Cstring, Cstring), a, b)))

isless(a::Symbol, b::Symbol) = cmp(a, b) < 0

## character index arithmetic ##

"""
    length(s::AbstractString) -> Int
    length(s::AbstractString, i::Integer, j::Integer) -> Int

The number of characters in string `s` from indices `i` through `j`. This is
computed as the number of code unit indices from `i` to `j` which are valid
character indices. With only a single string argument, this computes the
number of characters in the entire string. With `i` and `j` arguments it
computes the number of indices between `i` and `j` inclusive that are valid
indices in the string `s`. In addition to in-bounds values, `i` may take the
out-of-bounds value `ncodeunits(s) + 1` and `j` may take the out-of-bounds
value `0`.

See also: [`isvalid`](@ref), [`ncodeunits`](@ref), [`lastindex`](@ref),
[`thisind`](@ref), [`nextind`](@ref), [`prevind`](@ref)

# Examples
```jldoctest
julia> length("jμΛIα")
5
```
"""
length(s::AbstractString) = @inbounds return length(s, 1, ncodeunits(s))

function length(s::AbstractString, i::Int, j::Int)
    @boundscheck begin
        0 < i ≤ ncodeunits(s)+1 || throw(BoundsError(s, i))
        0 ≤ j < ncodeunits(s)+1 || throw(BoundsError(s, j))
    end
    n = 0
    for k = i:j
        @inbounds n += isvalid(s, k)
    end
    return n
end

@propagate_inbounds length(s::AbstractString, i::Integer, j::Integer) =
    length(s, Int(i), Int(j))

"""
    thisind(s::AbstractString, i::Integer) -> Int

If `i` is in bounds in `s` return the index of the start of the character whose
encoding code unit `i` is part of. In other words, if `i` is the start of a
character, return `i`; if `i` is not the start of a character, rewind until the
start of a character and return that index. If `i` is equal to 0 or `ncodeunits(s)+1`
return `i`. In all other cases throw `BoundsError`.

# Examples
```jldoctest
julia> thisind("α", 0)
0

julia> thisind("α", 1)
1

julia> thisind("α", 2)
1

julia> thisind("α", 3)
3

julia> thisind("α", 4)
ERROR: BoundsError: attempt to access String
  at index [4]
[...]

julia> thisind("α", -1)
ERROR: BoundsError: attempt to access String
  at index [-1]
[...]
```
"""
thisind(s::AbstractString, i::Integer) = thisind(s, Int(i))

function thisind(s::AbstractString, i::Int)
    z = ncodeunits(s) + 1
    i == z && return i
    @boundscheck 0 ≤ i ≤ z || throw(BoundsError(s, i))
    @inbounds while 1 < i && !isvalid(s, i)
        i -= 1
    end
    return i
end

"""
    prevind(str::AbstractString, i::Integer, n::Integer=1) -> Int

* Case `n == 1`

  If `i` is in bounds in `s` return the index of the start of the character whose
  encoding starts before index `i`. In other words, if `i` is the start of a
  character, return the start of the previous character; if `i` is not the start
  of a character, rewind until the start of a character and return that index.
  If `i` is equal to `1` return `0`.
  If `i` is equal to `ncodeunits(str)+1` return `lastindex(str)`.
  Otherwise throw `BoundsError`.

* Case `n > 1`

  Behaves like applying `n` times `prevind` for `n==1`. The only difference
  is that if `n` is so large that applying `prevind` would reach `0` then each remaining
  iteration decreases the returned value by `1`.
  This means that in this case `prevind` can return a negative value.

* Case `n == 0`

  Return `i` only if `i` is a valid index in `str` or is equal to `ncodeunits(str)+1`.
  Otherwise `StringIndexError` or `BoundsError` is thrown.

# Examples
```jldoctest
julia> prevind("α", 3)
1

julia> prevind("α", 1)
0

julia> prevind("α", 0)
ERROR: BoundsError: attempt to access String
  at index [0]
[...]

julia> prevind("α", 2, 2)
0

julia> prevind("α", 2, 3)
-1
```
"""
prevind(s::AbstractString, i::Integer, n::Integer) = prevind(s, Int(i), Int(n))
prevind(s::AbstractString, i::Integer)             = prevind(s, Int(i))
prevind(s::AbstractString, i::Int)                 = prevind(s, i, 1)

function prevind(s::AbstractString, i::Int, n::Int)
    n < 0 && throw(ArgumentError("n cannot be negative: $n"))
    z = ncodeunits(s) + 1
    @boundscheck 0 < i ≤ z || throw(BoundsError(s, i))
    n == 0 && return thisind(s, i) == i ? i : string_index_err(s, i)
    while n > 0 && 1 < i
        @inbounds n -= isvalid(s, i -= 1)
    end
    return i - n
end

"""
    nextind(str::AbstractString, i::Integer, n::Integer=1) -> Int

* Case `n == 1`

  If `i` is in bounds in `s` return the index of the start of the character whose
  encoding starts after index `i`. In other words, if `i` is the start of a
  character, return the start of the next character; if `i` is not the start
  of a character, move forward until the start of a character and return that index.
  If `i` is equal to `0` return `1`.
  If `i` is in bounds but greater or equal to `lastindex(str)` return `ncodeunits(str)+1`.
  Otherwise throw `BoundsError`.

* Case `n > 1`

  Behaves like applying `n` times `nextind` for `n==1`. The only difference
  is that if `n` is so large that applying `nextind` would reach `ncodeunits(str)+1` then
  each remaining iteration increases the returned value by `1`. This means that in this
  case `nextind` can return a value greater than `ncodeunits(str)+1`.

* Case `n == 0`

  Return `i` only if `i` is a valid index in `s` or is equal to `0`.
  Otherwise `StringIndexError` or `BoundsError` is thrown.

# Examples
```jldoctest
julia> nextind("α", 0)
1

julia> nextind("α", 1)
3

julia> nextind("α", 3)
ERROR: BoundsError: attempt to access String
  at index [3]
[...]

julia> nextind("α", 0, 2)
3

julia> nextind("α", 1, 2)
4
```
"""
nextind(s::AbstractString, i::Integer, n::Integer) = nextind(s, Int(i), Int(n))
nextind(s::AbstractString, i::Integer)             = nextind(s, Int(i))
nextind(s::AbstractString, i::Int)                 = nextind(s, i, 1)

function nextind(s::AbstractString, i::Int, n::Int)
    n < 0 && throw(ArgumentError("n cannot be negative: $n"))
    z = ncodeunits(s)
    @boundscheck 0 ≤ i ≤ z || throw(BoundsError(s, i))
    n == 0 && return thisind(s, i) == i ? i : string_index_err(s, i)
    while n > 0 && i < z
        @inbounds n -= isvalid(s, i += 1)
    end
    return i + n
end

## string index iteration type ##

struct EachStringIndex{T<:AbstractString}
    s::T
end
keys(s::AbstractString) = EachStringIndex(s)

length(e::EachStringIndex) = length(e.s)
first(::EachStringIndex) = 1
last(e::EachStringIndex) = lastindex(e.s)
iterate(e::EachStringIndex, state=firstindex(e.s)) = state > ncodeunits(e.s) ? nothing : (state, nextind(e.s, state))
eltype(::Type{<:EachStringIndex}) = Int

"""
    isascii(c::Union{AbstractChar,AbstractString}) -> Bool

Test whether a character belongs to the ASCII character set, or whether this is true for
all elements of a string.

# Examples
```jldoctest
julia> isascii('a')
true

julia> isascii('α')
false

julia> isascii("abc")
true

julia> isascii("αβγ")
false
```
"""
isascii(c::Char) = bswap(reinterpret(UInt32, c)) < 0x80
isascii(s::AbstractString) = all(isascii, s)
isascii(c::AbstractChar) = UInt32(c) < 0x80

## string map, filter, has ##

function map(f, s::AbstractString)
    out = IOBuffer(sizehint=sizeof(s))
    for c in s
        c′ = f(c)
        isa(c′, AbstractChar) || throw(ArgumentError(
            "map(f, s::AbstractString) requires f to return AbstractChar; try map(f, collect(s)) or a comprehension instead"))
        write(out, c′::AbstractChar)
    end
    String(take!(out))
end

function filter(f, s::AbstractString)
    out = IOBuffer(sizehint=sizeof(s))
    for c in s
        f(c) && write(out, c)
    end
    String(take!(out))
end

## string first and last ##

"""
    first(s::AbstractString, n::Integer)

Get a string consisting of the first `n` characters of `s`.

```jldoctest
julia> first("∀ϵ≠0: ϵ²>0", 0)
""

julia> first("∀ϵ≠0: ϵ²>0", 1)
"∀"

julia> first("∀ϵ≠0: ϵ²>0", 3)
"∀ϵ≠"
```
"""
first(s::AbstractString, n::Integer) = @inbounds s[1:min(end, nextind(s, 0, n))]

"""
    last(s::AbstractString, n::Integer)

Get a string consisting of the last `n` characters of `s`.

```jldoctest
julia> last("∀ϵ≠0: ϵ²>0", 0)
""

julia> last("∀ϵ≠0: ϵ²>0", 1)
"0"

julia> last("∀ϵ≠0: ϵ²>0", 3)
"²>0"
```
"""
last(s::AbstractString, n::Integer) = @inbounds s[max(1, prevind(s, ncodeunits(s)+1, n)):end]

"""
    reverseind(v, i)

Given an index `i` in [`reverse(v)`](@ref), return the corresponding index in
`v` so that `v[reverseind(v,i)] == reverse(v)[i]`. (This can be nontrivial in
cases where `v` contains non-ASCII characters.)

# Examples
```jldoctest
julia> r = reverse("Julia")
"ailuJ"

julia> for i in 1:length(r)
           print(r[reverseind("Julia", i)])
       end
Julia
```
"""
reverseind(s::AbstractString, i::Integer) = thisind(s, ncodeunits(s)-i+1)

"""
    repeat(s::AbstractString, r::Integer)

Repeat a string `r` times. This can be written as `s^r`.

See also: [`^`](@ref)

# Examples
```jldoctest
julia> repeat("ha", 3)
"hahaha"
```
"""
repeat(s::AbstractString, r::Integer) = repeat(String(s), r)

"""
    ^(s::Union{AbstractString,AbstractChar}, n::Integer)

Repeat a string or character `n` times. This can also be written as `repeat(s, n)`.

See also: [`repeat`](@ref)

# Examples
```jldoctest
julia> "Test "^3
"Test Test Test "
```
"""
(^)(s::Union{AbstractString,AbstractChar}, r::Integer) = repeat(s, r)

# reverse-order iteration for strings and indices thereof
iterate(r::Iterators.Reverse{<:AbstractString}, i=lastindex(r.itr)) = i < firstindex(r.itr) ? nothing : (r.itr[i], prevind(r.itr, i))
iterate(r::Iterators.Reverse{<:EachStringIndex}, i=lastindex(r.itr.s)) = i < firstindex(r.itr.s) ? nothing : (i, prevind(r.itr.s, i))

## code unit access ##

"""
    CodeUnits(s::AbstractString)

Wrap a string (without copying) in an immutable vector-like object that accesses the code units
of the string's representation.
"""
struct CodeUnits{T,S<:AbstractString} <: DenseVector{T}
    s::S
    CodeUnits(s::S) where {S<:AbstractString} = new{codeunit(s),S}(s)
end

length(s::CodeUnits) = ncodeunits(s.s)
sizeof(s::CodeUnits{T}) where {T} = ncodeunits(s.s) * sizeof(T)
size(s::CodeUnits) = (length(s),)
elsize(s::CodeUnits{T}) where {T} = sizeof(T)
@propagate_inbounds getindex(s::CodeUnits, i::Int) = codeunit(s.s, i)
IndexStyle(::Type{<:CodeUnits}) = IndexLinear()
iterate(s::CodeUnits, i=1) = (@_propagate_inbounds_meta; i == length(s)+1 ? nothing : (s[i], i+1))

write(io::IO, s::CodeUnits) = write(io, s.s)

unsafe_convert(::Type{Ptr{T}},    s::CodeUnits{T}) where {T} = unsafe_convert(Ptr{T}, s.s)
unsafe_convert(::Type{Ptr{Int8}}, s::CodeUnits{UInt8}) = unsafe_convert(Ptr{Int8}, s.s)

"""
    codeunits(s::AbstractString)

Obtain a vector-like object containing the code units of a string.
Returns a `CodeUnits` wrapper by default, but `codeunits` may optionally be defined
for new string types if necessary.
"""
codeunits(s::AbstractString) = CodeUnits(s)
back to top