SwiftSoup: Pure Swift HTML Parser, with the best of DOM, CSS, and jQuery

Sep 16, 2021 20 min read

SwiftSoup is a pure Swift library, cross-platform (macOS, iOS, tvOS, watchOS and Linux!), for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. SwiftSoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

Scrape and parse HTML from a URL, file, or string
Find and extract data, using DOM traversal or CSS selectors
Manipulate the HTML elements, attributes, and text
Clean user-submitted content against a safe white-list, to prevent XSS attacks
Output tidy HTML

SwiftSoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag-soup; SwiftSoup will create a sensible parse tree.
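
As a rough illustration of the white-list cleaning mentioned in the feature list, the sketch below runs untrusted markup through SwiftSoup.clean with the Whitelist.basic() rule set. The helper name sanitize is hypothetical and exists only for this example; the empty-string fallback is an assumption about how a caller might want to handle a nil result.

```swift
import Foundation
import SwiftSoup

// Hypothetical helper (not part of SwiftSoup): removes markup that the
// basic whitelist does not allow, such as inline event handlers.
func sanitize(_ untrusted: String) throws -> String {
    // clean(_:_:) returns an optional String; fall back to "" if it is nil.
    return try SwiftSoup.clean(untrusted, Whitelist.basic()) ?? ""
}

do {
    let unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
    // The onclick attribute is not in the basic whitelist, so it is dropped.
    print(try sanitize(unsafe))
} catch {
    print("error: \(error)")
}
```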

Swift

Swift 5: SwiftSoup >= 2.0.0

Swift 4.2: SwiftSoup 1.7.4

Installation

CocoaPods

SwiftSoup is available through CocoaPods. To install it, simply add the following line to your Podfile:

pod 'SwiftSoup'

Carthage

SwiftSoup is also available through Carthage. To install it, simply add the following line to your Cartfile:

github "scinfu/SwiftSoup"

Swift Package Manager

SwiftSoup is also available through Swift Package Manager. To install it, simply add the dependency to your Package.swift file:

...
dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "1.7.4"),
],
targets: [
    .target( name: "YourTarget", dependencies: ["SwiftSoup"]),
]
...
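
For context, a complete manifest might look like the sketch below. The package name YourPackage and the tools version are assumptions made for illustration. Note that from: "1.7.4" resolves to versions below 2.0.0, so this sketch uses from: "2.0.0" to match the Swift 5 row of the compatibility list above.

```swift
// swift-tools-version:5.0
import PackageDescription

let package = Package(
    name: "YourPackage", // hypothetical package name
    dependencies: [
        // `from:` accepts any version up to, but not including, the next major release.
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.0.0"),
    ],
    targets: [
        .target(name: "YourTarget", dependencies: ["SwiftSoup"]),
    ]
)
```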

Try

Try out the simple online CSS selectors site:

SwiftSoup Test Site

Try out the example project by opening Terminal and typing:

pod try SwiftSoup

To parse an HTML document:

do {
    let html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>"
    let doc: Document = try SwiftSoup.parse(html)
    // Print the combined text of the document
    print(try doc.text())
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

The parser attempts to create a clean parse whether or not the HTML is well-formed. It handles:

Unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
Implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
Reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)

The object model of a document

Documents consist of Elements and TextNodes.
The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
An Element contains a list of child Nodes and has one parent Element. Elements also provide a filtered list of child Elements only.
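
The difference between the full child Node list and the filtered child Element list can be seen in a small sketch. The helper inspectFirstParagraph is hypothetical and written only for this example; it relies on getChildNodes(), children(), and parent() as exposed by SwiftSoup.

```swift
import SwiftSoup

// Hypothetical helper (not part of SwiftSoup): reports how many child Nodes and
// child Elements the first <p> has, plus the tag name of its parent Element.
func inspectFirstParagraph(_ html: String) throws -> (nodes: Int, elements: Int, parentTag: String?) {
    let doc: Document = try SwiftSoup.parse(html)
    guard let p = try doc.select("p").first() else {
        return (0, 0, nil)
    }
    let allChildren: [Node] = p.getChildNodes()            // TextNodes and Elements
    let elementChildren: [Element] = p.children().array()  // Elements only
    return (allChildren.count, elementChildren.count, p.parent()?.tagName())
}

do {
    // "An " and " link." are TextNodes; <b> is the only child Element.
    let summary = try inspectFirstParagraph("<p>An <b>example</b> link.</p>")
    print(summary)
} catch {
    print("error: \(error)")
}
```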

Extract attributes, text, and HTML from elements

Problem

After parsing a document, and finding some elements, you’ll want to get at the data inside those elements.

Solution

To get the value of an attribute, use the Node.attr(_ key: String) method
For the text of an element (and its combined children), use Element.text()
For HTML, use Element.html(), or Node.outerHtml() as appropriate

do {
    let html: String = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"
    let doc: Document = try SwiftSoup.parse(html)
    let link: Element = try doc.select("a").first()!

    let text: String = try doc.body()!.text() // "An example link."
    let linkHref: String = try link.attr("href") // "http://example.com/"
    let linkText: String = try link.text() // "example"

    let linkOuterH: String = try link.outerHtml() // "<a href="http://example.com/"><b>example</b></a>"
    let linkInnerH: String = try link.html() // "<b>example</b>"
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Description

The methods above are the core element data accessors. Additional accessors include:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(_ className: String)

All of these accessor methods have corresponding setter methods to change the data.
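
A minimal sketch of the setter side, assuming the two-argument attr, the one-argument text, and addClass overloads behave as their names suggest. The helper retargetLink and the replacement values are hypothetical and exist only for this example.

```swift
import SwiftSoup

// Hypothetical helper (not part of SwiftSoup): rewrites the first link in the
// document, then reads the values back through the matching accessors.
func retargetLink(_ html: String) throws -> (href: String, text: String, isExternal: Bool)? {
    let doc: Document = try SwiftSoup.parse(html)
    guard let link = try doc.select("a").first() else {
        return nil
    }
    try link.attr("href", "https://example.org/") // set an attribute value
    try link.text("updated")                      // replace the text content
    try link.addClass("external")                 // add a class name
    return (try link.attr("href"), try link.text(), link.hasClass("external"))
}

do {
    if let result = try retargetLink("<p><a href='http://example.com/'>old</a></p>") {
        print(result.href, result.text, result.isExternal)
    }
} catch {
    print("error: \(error)")
}
```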

Parse a document from a String

Problem

You have HTML in a Swift String, and you want to parse that HTML to get at its contents, to make sure it’s well-formed, or to modify it. The String may have come from user input, a file, or the web.

Solution

Use the static SwiftSoup.parse(_ html: String) method, or SwiftSoup.parse(_ html: String, _ baseUri: String).

do {
    let html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>"
    let doc: Document = try SwiftSoup.parse(html)
    // Print the combined text of the document
    print(try doc.text())
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}
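
The two-argument form takes a base URI, which is typically what allows relative links to be resolved. The sketch below assumes Node.absUrl(_:) is available for that purpose; the helper resolveFirstLink and the example URLs are hypothetical.

```swift
import SwiftSoup

// Hypothetical helper (not part of SwiftSoup): parses the HTML against a base URI
// and returns the absolute form of the first link's href, or nil if there is no link.
func resolveFirstLink(in html: String, baseUri: String) throws -> String? {
    let doc: Document = try SwiftSoup.parse(html, baseUri)
    guard let link = try doc.select("a").first() else {
        return nil
    }
    // absUrl resolves the attribute value against the document's base URI.
    return try link.absUrl("href")
}

do {
    let resolved = try resolveFirstLink(in: "<a href='/one/'>One</a>", baseUri: "http://example.com/")
    print(resolved ?? "no link found")
} catch {
    print("error: \(error)")
}
```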
