GPTEncoder

A Swift BPE encoder/decoder for OpenAI GPT models. It provides a programmatic interface for tokenizing text for the OpenAI GPT API.

The GPT family of models processes text using tokens, which are common sequences of characters found in text. The models understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.

You can use this library to see how a piece of text would be tokenized by the API, and to get the total count of tokens in that piece of text.
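To make the idea of "common sequences of characters" concrete, here is a minimal, self-contained sketch of a byte-pair-encoding merge loop. It is a simplified illustration, not this library's implementation: the function name `applyMerges` and the merge list `toyMerges` are made up for the example and have nothing to do with the real GPT vocabulary.

```swift
// Toy BPE merge loop. Illustrative only: `toyMerges` is an invented merge
// list, not the GPT vocabulary, and this is not the library's code.
func applyMerges(to word: String, merges: [(String, String)]) -> [String] {
    // Start from single characters.
    var symbols = word.map { String($0) }
    // Apply each merge rule in priority order (earlier rules first).
    for (left, right) in merges {
        var merged: [String] = []
        var i = 0
        while i < symbols.count {
            if i + 1 < symbols.count, symbols[i] == left, symbols[i + 1] == right {
                // Adjacent pair matches the rule: fuse it into one symbol.
                merged.append(left + right)
                i += 2
            } else {
                merged.append(symbols[i])
                i += 1
            }
        }
        symbols = merged
    }
    return symbols
}

let toyMerges = [("t", "o"), ("k", "e"), ("to", "ke"), ("n", "s")]
let pieces = applyMerges(to: "tokens", merges: toyMerges)
print(pieces) // ["toke", "ns"]
```

A real encoder then maps each resulting piece to an integer ID from its vocabulary, which is where the `Int` tokens returned by `encode` come from.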

This library is based on the Node.js gpt-3-encoder package and OpenAI's official Python GPT encoder/decoder.

Supported Platforms

  • iOS/macOS/watchOS/tvOS
  • Linux

Installation

Swift Package Manager

CocoaPods

platform :ios, '15.0'
use_frameworks!

target 'MyApp' do
  pod 'GPTEncoder', '~> 1.0.0'
end

Usage

let encoder = SwiftGPTEncoder()

let str = "The GPT family of models process text using tokens, which are common sequences of characters found in text."
let encoded = encoder.encode(text: str)
print("String: \(str)")
print("Encoded this string looks like: \(encoded)")
print("Total number of token(s): \(encoded.count) and character(s): \(str.count)")

print("We can look at each token and what it represents")
encoded.forEach { print("Token: \(encoder.decode(tokens: [$0]))") }
print(encoded)

let decoded = encoder.decode(tokens: encoded)
print("We can decode it back into:\n\(decoded)")

Encode

To encode a String into an array of Int tokens, call encode and pass the string.

let encoded = encoder.encode(text: "The GPT family of models process text using tokens, which are common sequences of characters found in text.")
// Output: [464, 402, 11571, 1641, 286, 4981, 1429, 2420, 1262, 16326, 11, 543, 389, 2219, 16311, 286, 3435, 1043, 287, 2420, 13]
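A common reason to count tokens is to check whether a prompt fits within a model's token limit. The sketch below shows one way to structure such a check. Everything in it is an assumption for illustration: `TextTokenizing` is a hypothetical protocol that mirrors the `encode(text:)` call shown above, and `WordCountTokenizer` is a stand-in that does not produce GPT token IDs. In a real project you would pass in something backed by the actual encoder instead.

```swift
// Hypothetical protocol mirroring the `encode(text:)` call shown above.
protocol TextTokenizing {
    func encode(text: String) -> [Int]
}

// Stand-in tokenizer for demonstration only: one token per space-separated
// word. It does NOT produce GPT token IDs.
struct WordCountTokenizer: TextTokenizing {
    func encode(text: String) -> [Int] {
        text.split(separator: " ").enumerated().map { $0.offset }
    }
}

// Returns true when the encoded text is within the given token budget.
func fitsBudget<T: TextTokenizing>(_ text: String, maxTokens: Int, using tokenizer: T) -> Bool {
    tokenizer.encode(text: text).count <= maxTokens
}

let budgetTokenizer = WordCountTokenizer()
let prompt = "count the tokens in this prompt"
print(fitsBudget(prompt, maxTokens: 8, using: budgetTokenizer))
```

Keeping the check behind a small protocol also makes it easy to unit test without loading the full vocabulary.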

Decode

To decode an array of Int tokens back into a String, call decode and pass the tokens array.

let decoded = encoder.decode(tokens: [464, 402, 11571, 1641, 286, 4981, 1429, 2420, 1262, 16326, 11, 543, 389, 2219, 16311, 286, 3435, 1043, 287, 2420, 13])
// Output: "The GPT family of models process text using tokens, which are common sequences of characters found in text."

Clear Cache

Internally, a cache is used to improve performance when encoding tokens. You can reset this cache at any time.

encoder.clearCache()
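The library's cache internals are not described here. As a rough illustration of why a cache helps and what clearing it does, a memoizing wrapper might look like the sketch below. `MemoizedEncoder`, `encodeWord`, and `missCount` are hypothetical names introduced for this example, and the stand-in encoding step returns Unicode scalar values rather than GPT token IDs.

```swift
// Hypothetical memoizing wrapper; not the library's actual implementation.
final class MemoizedEncoder {
    private var cache: [String: [Int]] = [:]
    private let encodeWord: (String) -> [Int]
    // Number of times the underlying encoding step actually ran.
    private(set) var missCount = 0

    init(encodeWord: @escaping (String) -> [Int]) {
        self.encodeWord = encodeWord
    }

    func encode(word: String) -> [Int] {
        // Reuse a previous result when the same word is seen again.
        if let cached = cache[word] { return cached }
        missCount += 1
        let tokens = encodeWord(word)
        cache[word] = tokens
        return tokens
    }

    // Drops all stored results, trading speed for memory.
    func clearCache() {
        cache.removeAll()
    }
}

// Stand-in for an expensive encoding step (Unicode scalar values, not GPT IDs).
let memo = MemoizedEncoder { word in word.unicodeScalars.map { Int($0.value) } }
_ = memo.encode(word: "text")   // computed
_ = memo.encode(word: "text")   // served from the cache
memo.clearCache()
_ = memo.encode(word: "text")   // computed again after the reset
print(memo.missCount) // 2
```

Clearing a cache like this is mostly useful for releasing memory after encoding a large amount of varied text.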
