Fixing objc_initializeAfterForkError the right way

Publié le 13 décembre 2023 par Pierre dans Technique.

This is a technical blog post, about how we investigated and solved an error occurring when running Rails tests.

Sometimes, when running a Rails command on macOS, we can encounter a error, promptly followed by a crash:

+[NSCheapMutableString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

A quick web search reveals that this crash is not specific to Rails, and happens in many scripting languages.

The Rails-specific variant is documented in this Rails issue. For us, it happened when running the test suite of cocarto, our Rails app: when invoking rails test, one of the test processes would crash with this error.

The root cause

To make it short, this issue occurs in any program when:

A process is multi-threaded,
This process forks,
One of the forked processes loads an Objective-C code bundle.

This is because in that case, the Objective-C runtime can’t guarantee there won’t be any deadlock later. So it stops the program execution instead – by crashing.

In Ruby, this usually happens when require-ing a gem with a native code extension linked against the Objective-C runtime. In the past, a usual culprit was the pg gem – but any gem with native-extension is prone to the issue.

The usual kludge

There are several ways to fix the issue.

The usual workaround is to define an environment variable: export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES. And with that, the program stops crashing.

But the underlying issue is still there, although unreported: at any time, the runtime might deadlock, because the initialization was messed up.

The actual fix would be to ensure that the relevant native code extensions are loaded before the initial process forks. In that case, the initialization is consistent, and the source of the error disappears.

Finding the culprit

So we need to find which library causes the Objective-C runtime to be initialized – in order to load it before the process forks.

Luckily, the error message tells us how to do that: setting a breakpoint on the objc_initializeAfterForkError function. Let’s do that!

However a thing making this a bit tedious is precisely the forking model: the error doesn’t occur in the main process, but in one the forked processes. So we need to connect our debugger to each forked process – which is a use-case debuggers don’t handle very gracefully.

Oh well, we’ll do it manually. Let’s print out the PID of each Ruby process in the console after the fork, and connect our debugger to those.

Here’s how to print the PIDs for a Rails test suite (but any way of executing Ruby code after the fork will do):

class ActiveSupport::TestCase
  parallelize_setup do |_i|
    puts "Forked process pid: " + Process.pid.to_s
    puts "Leaving you some time to connect a debugger…"
    sleep 15
  end
  
  # …
end

Now, once we run rails test, we’ll see in the console the PID of each forked process. Let’s attach a debugger to those.

We’ll use lldb for this. Let’s open a new terminal per process, and in each Terminal type:

$ lldb
(lldb) process attach --pid <insert the printed PID here>
(lldb) breakpoint set --name objc_initializeAfterForkError
(lldb) continue

Now the test suite will start executing. And, at some point, one of the connected debuggers should break on objc_initializeAfterForkError.

We can ask for a stack trace, to see what caused the issue:

(lldb) backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 2.1
  * frame #0: 0x00007ff80f3a4822 libobjc.A.dylib`objc_initializeAfterForkError
    frame #1: 0x00007ff80f390cc7 libobjc.A.dylib`initializeNonMetaClass + 951
    frame #2: 0x00007ff80f3a595b libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, locker_mixin<lockdebug::lock_mixin<objc_lock_base_t>>&, bool) + 221
    frame #3: 0x00007ff80f390665 libobjc.A.dylib`lookUpImpOrForward + 778
    frame #4: 0x00007ff80f38fd9b libobjc.A.dylib`_objc_msgSend_uncached + 75
    frame #5: 0x00007ff81062fd56 Foundation`_NSResolveSymlinksInPathUsingCache + 78
    frame #6: 0x00007ff81062fbaf Foundation`-[NSString(NSPathUtilities) _stringByResolvingSymlinksInPathUsingCache:] + 155
    frame #7: 0x00007ff81062ea9b Foundation`-[NSBundle initWithPath:] + 162
    frame #8: 0x00007ff81062e90d Foundation`+[NSBundle mainBundle] + 145
    frame #9: 0x00007ff81285edc9 AppKit`+[NSApplication load] + 71
    frame #10: 0x00007ff80f392332 libobjc.A.dylib`load_images + 1319
    frame #11: 0x00007ff80f3d66d8 dyld`dyld4::RuntimeState::notifyObjCInit(dyld4::Loader const*) + 170
    # snip some dyld frames…
    frame #17: 0x00007ff80f3fd3f6 dyld`dyld4::APIs::dlopen_from(char const*, int, void*) + 944
    frame #18: 0x0000000203cc4908 ffi_c.bundle`library_initialize + 136
    # snip lot of Ruby VM instructions…
    frame #165: 0x00000001074d7f65 ruby`main + 101
    frame #166: 0x00007ff80f3ca41f dyld`start + 1903

From this we can gather a few interesting informations:

The issue occurs when calling AppKit’s +[NSApplication load],
AppKit is loaded following a call to dlopen_from.

So probably what happens is that:

Some Ruby code requires a dynamic library, which gets loaded through dlopen,
This dynamic library contains Objective-C code, which loads AppKit,
Loading AppKit calls +[NSCheapMutableString initialize],
Boom 💥

But which native library gets loaded at this point?

To answer this, we can put an additional breakpoint at dyld4::APIs::dlopen_from. We’ll then ask the breakpoint to print the name of the library every time it is called. (The library name is conveniently the first argument of this function).

(lldb) b dyld`dyld4::APIs::dlopen_from
(lldb) breakpoint command add 2.1
(lldb) > p (char*)$rsi
(lldb) > continue
(lldb) > DONE

Now let’s restart the whole thing: running the tests, connecting to the debuggers. This time we an output like this:

(lldb)  p (char*)$rsi
(char *) 0x00007fc2b60e7aa0 "/Users/kemenaran/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/bcrypt-3.1.19/lib/bcrypt_ext.bundle"
(lldb)  continue
Process 32989 resuming
Command #2 'continue' continued the target.
(lldb)  p (char*)$rsi
(char *) 0x00007fc2b15ee0b0 "/Users/kemenaran/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nio4r-2.5.9/lib/nio4r_ext.bundle"
(lldb)  continue
Process 32989 resuming
Command #2 'continue' continued the target.
# snip other commands…
(lldb)  p (char*)$rsi
(char *) 0x000000010e03d738 "libvips.42.dylib"
(lldb)  continue
Process 32989 resuming
Command #2 'continue' continued the target.
Process 32989 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00007ff80f3a4822 libobjc.A.dylib`objc_initializeAfterForkError
libobjc.A.dylib`objc_initializeAfterForkError:
Target 0: (ruby) stopped.

Hurray: we see that the last library loaded by dyld before the error is libvips.42.dylib.

So, we know that some code that dynamically loads libvips is responsible for our error. This is probably because libvips is the native library behind the ruby-vips gem, which is used by ActiveStorage to process thumbnails. And ActiveStorage loads the library lazily the first time an image processor is needed, i.e. after forking – which is enough to trigger our issue.

Not anyone fault though: just a bad combination of lazy loading, native extension, forking model and OS choices.

The fix

Now how to solve the issue?

Well, we just need to ensure that libvips is loaded before the process forks.

Now that we identified the native code causing the issue, a seach points us to this issue on the ruby-vips project.

Their suggestion is to add the image_processing gem explicitely to the Gemfile, and ask it to require the vips bindings. This causes Bundler to require it automatically – and voilà, problem fixed.

# Gemfile
group :development, :test
+  gem "image_processing", require: "image_processing/vips"
end

However I prefer not to add an explicit dependency in the Gemfile (as we don’t need image_processing directly; only ActiveStorage does). So we rather went with a solution in code:

# test/test_helper.rb
require "rails/test_help"
require "minitest/reporters"

+ if Rails.application.config.active_storage.variant_processor == :vips && RUBY_PLATFORM.include?("darwin")
+  # Preload `vips` *before* the test process is forked,
+  # to initialize the objc runtime properly
+  require "ruby-vips"
+ end

class ActiveSupport::TestCase
  # snip…

And that’s it: now rails test no longer crashes!

Conclusion

The debugger makes it possible to pinpoint the culprit code. However this could definitely be improved with a better way to connect lldb to child processes. (The manual process of copy-pasting pids in subshells isn’t ergonomic, to say the least).

But now, even if another gem with native extensions starts triggering the same error in the future, it becomes relatively fast to find the relevant dylib, and preload it.

Of course, even if we fixed the issue in cocarto, the best thing would be to fix the issue in Rails itself. We’ll open an issue or PR soon, to see if a generic solution is workable.