Automate Safari Webarchives via Command Line Easily

By Mark Howell | 1 year ago | 9 min read

Today in Edworking News, we want to talk about creating a Safari webarchive from the command line. Recently, I’ve been trying to create a local archive of my bookmarked web pages. I already have tools to take screenshots, and I love them as a way to take quick snapshots and skim the history of a site. But bitmap images aren’t a great archival representation of a website. What if I also want to save the HTML, CSS, and JavaScript and keep an interactive copy of the page?
There are lots of tools in this space; for my personal stuff, I’ve come to like Safari webarchives. There are several reasons I find them appealing:

  • Each saved web page is stored as a single file. Each file includes the entire content of the page, and a single file per web page is pretty manageable. I can create backups, keep multiple copies, and so on.
  • I can easily add pages to my archive that can’t be crawled from the public web. Lots of the modern web is locked behind paywalls, login screens, and interstitial modals that are difficult for automated crawlers to get past. It’s much easier for me to get through them as a human using Safari as my default browser. Once I have a page open, I can save it as a webarchive with the File > Save As… menu item.
  • The archive can be stored locally and offline. It will always remain available to me, as long as I keep up with backups and maintenance, and I can archive private web pages that I don’t want to put in somebody else’s archive. (For example, I wouldn’t want to save private tweets in the publicly available Wayback Machine.)
  • I can read the format without Safari. Although Safari is only maintained by Apple, the Safari webarchive format can be read by non-Apple tools – it’s a binary property list that stores the raw bytes of the original files. I’m comfortable that I’ll be able to open these archives for a while, even if Safari unexpectedly goes away.

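That last point can be checked with nothing but Foundation. The sketch below is a minimal illustration and not part of the original script: `describeWebArchive` is a hypothetical helper, and the key names (`WebMainResource`, `WebResourceURL`, `WebResourceMIMEType`, `WebResourceData`) are an assumption about how WebKit lays out the property list, so verify them against one of your own archives. It builds a tiny stand-in archive in memory so it runs without Safari.

```swift
import Foundation

/// Hypothetical helper: summarise the main resource of a webarchive.
/// Returns nil if the data is not a property list with the assumed layout.
func describeWebArchive(_ archive: Data) -> String? {
    guard
        let plist = try? PropertyListSerialization.propertyList(from: archive, options: [], format: nil),
        let root = plist as? [String: Any],
        let main = root["WebMainResource"] as? [String: Any],      // assumed key
        let url = main["WebResourceURL"] as? String,               // assumed key
        let mimeType = main["WebResourceMIMEType"] as? String,     // assumed key
        let body = main["WebResourceData"] as? Data                // assumed key
    else {
        return nil
    }
    return "\(url) (\(mimeType), \(body.count) bytes)"
}

// Build a small stand-in archive (a binary property list) in memory.
let html = Data("<h1>Hello</h1>".utf8)
let sample: [String: Any] = [
    "WebMainResource": [
        "WebResourceURL": "https://example.com/",
        "WebResourceMIMEType": "text/html",
        "WebResourceData": html,
    ] as [String: Any]
]
let sampleArchive = try! PropertyListSerialization.data(
    fromPropertyList: sample, format: .binary, options: 0)

let summary = describeWebArchive(sampleArchive)
print(summary ?? "not a recognisable webarchive")
```

To inspect a real file, the same helper could be fed `try Data(contentsOf: URL(fileURLWithPath: "example.webarchive"))`.
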
The one thing that’s missing is a way to create webarchive files programmatically. Although I could open each page and save it in Safari individually, I have about 6000 bookmarks – I’d like a way to automate this process. I was able to write a short script in Swift that does this for me. In the rest of this article, I’ll explain how it works.

Prior Art: newzealandpaul/webarchiver

I found an existing tool for creating Safari webarchives on the command line, written by newzealandpaul. I did some brief testing, and it seems to work okay, but I had a few issues:

  • The error messages aren’t very helpful – some of my bookmarks failed to save with an error like “invalid URL,” even though the URL opens just fine.
  • The code is written in Objective-C and uses deprecated classes like WebView and WebArchive.

Given that it’s only about 350 lines, I wanted to see if I could rewrite it using Swift and the newest classes. I thought that might be easier than trying to understand a language and classes that I’m not super familiar with.

Playing with WKWebView and createWebArchiveData

It didn’t take much googling to learn that WebView has been replaced by WKWebView, and that class has a method `createWebArchiveData` which “creates a web archive of the web view’s current contents asynchronously.” Perfect!
I watched a WWDC session by Brady Eidson, a WebKit engineer, where the `createWebArchiveData` API was introduced. It gave me some useful context about the purpose of WKWebView – it’s for showing web content inside Mac and iOS apps. If you’ve ever used an in-app browser, there was probably an instance of WKWebView somewhere underneath. The session included some sample code for using this API, which I fashioned into an initial script:
```swift
import Foundation
import WebKit

let webview = WKWebView()
webview.load(URLRequest(url: URL(string: "https://example.com")!))

// Called straight after load(), before the page has had a chance to load.
// The completion handler receives a Result<Data, Error>.
webview.createWebArchiveData { result in
    if case .success(let data) = result {
        try? data.write(to: URL(fileURLWithPath: "example.webarchive"), options: .atomic)
    }
}
RunLoop.main.run()
```

[Image: example of a Safari webarchive file containing the elements of a saved web page]
However, the script only created an empty file. Upon debugging, I realized that `WKWebView` was never actually loading the web page.

We Need a Loop-de-Loop

A Swift script isn’t where WKWebView is normally used. Most of the time, it appears as part of a web browser inside a Mac or iOS app. In that context, you don’t want fetching web pages to be a blocking operation – you want the rest of the app to remain responsive and usable, and download the web page as a background operation.
This made me wonder if my problem was that my script doesn’t have “background operations.” When I ask WKWebView to load my page, the work gets shoved in a queue of background tasks, but nothing is picking up work from that queue. I don’t fully understand what I did next, but I think I’ve got the gist of the problem. I had another look at newzealandpaul’s code, and I found some lines that look like they’re solving the same problem. I think the run loop (NSRunLoop in the Objective-C original) is doing the work that’s on that background queue, and the loop keeps spinning until the page has finished loading. Translated to Swift, the idea looks like this:
```swift
while webview.isLoading {
    RunLoop.main.run(until: Date().addingTimeInterval(0.1))
}
```
I was able to adapt this idea for my Swift script. Here’s my updated script:
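```swift
import Foundation
import WebKit

let webview = WKWebView()
webview.load(URLRequest(url: URL(string: "https://example.com")!))

// Spin the main run loop so WebKit can make progress on the page load.
while webview.isLoading {
    RunLoop.main.run(until: Date().addingTimeInterval(0.1))
}

// The completion handler is delivered on the main thread, so wait for it by
// spinning the run loop; blocking the main thread on a semaphore would deadlock.
var isSaving = true
webview.createWebArchiveData { result in
    if case .success(let data) = result {
        try? data.write(to: URL(fileURLWithPath: "example.webarchive"), options: .atomic)
    }
    isSaving = false
}
while isSaving {
    RunLoop.main.run(until: Date().addingTimeInterval(0.1))
}
```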
This works, but there’s a fairly glaring hole – it will archive whatever got loaded into the web view, even if the page wasn’t loaded successfully. Let’s fix that next.
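As a side note before tackling failed loads: the polling loop above has no upper bound, so a page that never stops loading would hang the script. Here is a minimal sketch of the same run-loop idea with a deadline. `spinRunLoop(timeout:until:)` is a hypothetical helper name, and the timeout is an assumption of this sketch rather than something the original script is known to do. A delayed flag stands in for the page load so the example runs without WebKit.

```swift
import Foundation

/// Hypothetical helper: spin the main run loop until `condition` returns true
/// or `timeout` seconds have elapsed. Returns whether the condition was met.
func spinRunLoop(timeout: TimeInterval, until condition: () -> Bool) -> Bool {
    let deadline = Date().addingTimeInterval(timeout)
    while !condition() {
        if Date() >= deadline { return false }
        // Let the run loop service timers and the main dispatch queue for a short slice.
        RunLoop.main.run(until: Date().addingTimeInterval(0.05))
    }
    return true
}

// Stand-in for a page load: a flag flipped on the main queue shortly after start.
var finished = false
DispatchQueue.main.asyncAfter(deadline: .now() + 0.2) { finished = true }

// The flag flips well inside the two-second budget.
let completedInTime = spinRunLoop(timeout: 2.0) { finished }
// A condition that never becomes true makes the helper give up at the deadline.
let gaveUp = !spinRunLoop(timeout: 0.2) { false }
print(completedInTime, gaveUp)
```

With a web view, the call would read `spinRunLoop(timeout: 30) { !webview.isLoading }`, and a `false` result could be treated as a failed load.
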

Checking the Page Loaded Successfully with WKNavigationDelegate

If there’s some error getting the page – say, my Internet connection is down or the remote server doesn’t respond – the WKWebView will still complete loading and set `isLoading = false`. My code will then proceed to archive the error page, which is unhelpful. I’d rather the script threw an error and prompted me to investigate.
While I was reading more about WKWebView, I came across the `WKNavigationDelegate` protocol. If you implement this protocol, you can track the progress of a page load, and get detailed events like “the page has started to load” and “the page failed to load with an error.”
Here’s the delegate I wrote:
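```swift
import Foundation
import WebKit

class WebArchiveDelegate: NSObject, WKNavigationDelegate {
    // Called when a load fails after content has started arriving.
    func webView(_ webView: WKWebView, didFail navigation: WKNavigation!, withError error: Error) {
        print("Failed to load page: \(error.localizedDescription)")
        exit(1)
    }

    // Called when a load fails before any content arrives (e.g. no connection).
    func webView(_ webView: WKWebView, didFailProvisionalNavigation navigation: WKNavigation!, withError error: Error) {
        print("Failed to start loading page: \(error.localizedDescription)")
        exit(1)
    }

    // Called once the response headers are known; the HTTP status code is available here.
    func webView(_ webView: WKWebView, decidePolicyFor navigationResponse: WKNavigationResponse,
                 decisionHandler: @escaping (WKNavigationResponsePolicy) -> Void) {
        if navigationResponse.isForMainFrame,
           let response = navigationResponse.response as? HTTPURLResponse,
           response.statusCode != 200 {
            print("Invalid response: HTTP \(response.statusCode)")
            exit(1)
        }
        decisionHandler(.allow)
    }
}

let webview = WKWebView()
// navigationDelegate is a weak reference, so keep the delegate alive in a constant.
let delegate = WebArchiveDelegate()
webview.navigationDelegate = delegate

webview.load(URLRequest(url: URL(string: "https://example.com")!))
while webview.isLoading {
    RunLoop.main.run(until: Date().addingTimeInterval(0.1))
}
// Other code for creating the web archive
```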
I also wrote a method that checks the HTTP status code of the response, and terminates the script if it’s not an HTTP 200 OK. This means that 404 pages and server errors won’t be automatically archived – I can do that manually in Safari if I think they’re really important.

Adding Some Command-Line Arguments

Right now, the URL string and save location are both hard-coded; I wanted to make them command-line arguments.
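```swift
import Foundation
import WebKit

let arguments = CommandLine.arguments
guard arguments.count == 3 else {
    print("Usage: \(arguments[0]) <url> <output_file>")
    exit(1)
}
guard let url = URL(string: arguments[1]) else {
    print("Invalid URL: \(arguments[1])")
    exit(1)
}
let outputFile = URL(fileURLWithPath: arguments[2])

let webview = WKWebView()
webview.load(URLRequest(url: url))
// Spin the main run loop until the page has finished loading.
while webview.isLoading {
    RunLoop.main.run(until: Date().addingTimeInterval(0.1))
}

// Save the archive, spinning the run loop until the completion handler fires.
var isSaving = true
webview.createWebArchiveData { result in
    if case .success(let data) = result {
        try? data.write(to: outputFile, options: .atomic)
    }
    isSaving = false
}
while isSaving {
    RunLoop.main.run(until: Date().addingTimeInterval(0.1))
}
```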
I used this script to create webarchives for 6000 or so bookmarks in my Pinboard account. It worked pretty well and captured 85% of my bookmarks – the remaining 15% are broken due to link rot. I did a spot check of a few dozen archives that did get saved, and they all look good. I tweaked the script to improve error messages and ensure that existing webarchive files aren’t overwritten.
You can access the finished script in the GitHub repository: alexwlchan/safari-webarchiver. This repo will be the canonical home for this code, offering periodic updates and a small collection of tests.
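
The text above mentions making sure existing webarchive files aren’t overwritten, but doesn’t show how the finished script does it. One simple way, sketched here under that caveat, is to check for the file before writing. `writeWithoutOverwriting` and `FileAlreadyExists` are hypothetical names for this illustration, and the scratch file path is made up for the demo.

```swift
import Foundation

/// Hypothetical error type for this sketch.
struct FileAlreadyExists: Error {
    let path: String
}

/// Write `data` to `url`, refusing to replace a file that is already there.
func writeWithoutOverwriting(_ data: Data, to url: URL) throws {
    if FileManager.default.fileExists(atPath: url.path) {
        throw FileAlreadyExists(path: url.path)
    }
    try data.write(to: url, options: .atomic)
}

// Exercise the helper against a scratch file in the temporary directory.
let scratch = FileManager.default.temporaryDirectory
    .appendingPathComponent("example-\(UUID().uuidString).webarchive")
try writeWithoutOverwriting(Data("first".utf8), to: scratch)

// A second write to the same path should be refused, leaving the file untouched.
var refusedSecondWrite = false
do {
    try writeWithoutOverwriting(Data("second".utf8), to: scratch)
} catch is FileAlreadyExists {
    refusedSecondWrite = true
}
let contents = try String(contentsOf: scratch, encoding: .utf8)
print(refusedSecondWrite, contents)

// Clean up the scratch file.
try FileManager.default.removeItem(at: scratch)
```

For a one-off archiving script the gap between the check and the write is unlikely to matter; a batch run with several processes writing to the same folder would need something stricter.
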

Remember These 3 Key Ideas for Your Startup:

  1. Automate repetitive tasks: Utilize scripts to automate routine processes, such as creating webarchives of important bookmarks.
  2. Use existing tools and improve upon them: Don't reinvent the wheel; start with existing tools and tailor them to your needs.
  3. Error handling is crucial: Robust error handling can save time and frustration when dealing with large data sets or complex operations, such as setting up an effective document management workflow.

Implementing these practices can make your startup more efficient and scalable.
For more details, see the original source.