Timer fires with corrupted callback after Python object is garbage collected #251

Closed
opened 2026-02-19 23:38:20 +00:00 by john · 2 comments
Owner

Summary

When a mcrfpy.Timer is created without storing the returned Python object, Python's garbage collector destroys it. However, the C++ timer system still holds a reference and continues to fire the callback — resulting in wrong argument counts and eventual segfaults.

This is a footgun because the natural pattern mcrfpy.Timer("name", callback, 100) (without t = ...) silently breaks.

Minimal Reproduction

import mcrfpy, sys

def my_callback(timer, runtime):
    print(f"timer={timer} runtime={runtime}")
    timer.stop()
    sys.exit(0)

def start_timer():
    # Timer created but reference not stored — GC'd on function return
    mcrfpy.Timer("test", my_callback, 50)

start_timer()
mcrfpy.step(0.1)

Actual output:

Timer callback raised an exception:
TypeError: my_callback() missing 1 required positional argument: 'runtime'

Repeated firings eventually segfault.

Expected behavior (option A — prevent GC):
Timer stays alive as long as it's active. The C++ timer system prevents Python GC by holding a strong reference to the Python Timer object.

Expected behavior (option B — clean teardown):
Timer is automatically cancelled/removed from the C++ timer system when the Python object is GC'd (via tp_dealloc).

Workaround

Store the Timer reference so it isn't garbage collected:

t = mcrfpy.Timer("test", my_callback, 50)  # prevent GC

Analysis

The C++ timer system registers the timer and holds a raw reference to the Python callback. When Python GC destroys the Timer object (because no Python references remain), the C++ side still fires on schedule. The callback's Python function reference is stale/corrupted, causing the wrong number of arguments to be passed and eventually a segfault.

Option A (prevent GC) is probably what users expect — creating a Timer should keep it alive until stop() is called or the scene changes. This could be implemented by having the C++ timer registry hold a Py_INCREF'd reference to the Python Timer object, released on stop() or scene teardown.

Option B (clean teardown) is the minimal safety fix — tp_dealloc should unregister the timer from the C++ system so it never fires a dead callback.

Both options could be combined: prevent GC while active (A), and as a safety net, unregister on dealloc (B).

## Summary When a `mcrfpy.Timer` is created without storing the returned Python object, Python's garbage collector destroys it. However, the C++ timer system still holds a reference and continues to fire the callback — resulting in wrong argument counts and eventual segfaults. This is a footgun because the natural pattern `mcrfpy.Timer("name", callback, 100)` (without `t = ...`) silently breaks. ## Minimal Reproduction ```python import mcrfpy, sys def my_callback(timer, runtime): print(f"timer={timer} runtime={runtime}") timer.stop() sys.exit(0) def start_timer(): # Timer created but reference not stored — GC'd on function return mcrfpy.Timer("test", my_callback, 50) start_timer() mcrfpy.step(0.1) ``` **Actual output:** ``` Timer callback raised an exception: TypeError: my_callback() missing 1 required positional argument: 'runtime' ``` Repeated firings eventually segfault. **Expected behavior (option A — prevent GC):** Timer stays alive as long as it's active. The C++ timer system prevents Python GC by holding a strong reference to the Python Timer object. **Expected behavior (option B — clean teardown):** Timer is automatically cancelled/removed from the C++ timer system when the Python object is GC'd (via `tp_dealloc`). ## Workaround Store the Timer reference so it isn't garbage collected: ```python t = mcrfpy.Timer("test", my_callback, 50) # prevent GC ``` ## Analysis The C++ timer system registers the timer and holds a raw reference to the Python callback. When Python GC destroys the Timer object (because no Python references remain), the C++ side still fires on schedule. The callback's Python function reference is stale/corrupted, causing the wrong number of arguments to be passed and eventually a segfault. **Option A** (prevent GC) is probably what users expect — creating a Timer should keep it alive until `stop()` is called or the scene changes. This could be implemented by having the C++ timer registry hold a `Py_INCREF`'d reference to the Python Timer object, released on `stop()` or scene teardown. **Option B** (clean teardown) is the minimal safety fix — `tp_dealloc` should unregister the timer from the C++ system so it never fires a dead callback. Both options could be combined: prevent GC while active (A), and as a safety net, unregister on dealloc (B).
Author
Owner

Root Cause Found: tp_dealloc unconditionally destroys callbacks on ALL drawable types

The Timer bug reported here is one instance of a systemic issue across every UIDrawable type (Frame, Caption, Sprite, Grid, Circle, Arc, Line, Viewport3D).

The Mechanism

When you do frame.on_click = callback:

  1. PyCallable constructor calls Py_XNewRef(callback) — properly increments refcount
  2. Callback is stored as unique_ptr<PyClickCallable> on the C++ UIDrawable
  3. C++ object lives in parent's children vector via shared_ptr

But when the Python wrapper goes out of scope:

  1. tp_dealloc fires on the wrapper
  2. It unconditionally calls click_unregister(), on_enter_unregister(), etc.
  3. These destroy the unique_ptr, which calls Py_DECREF on the callback
  4. Callback is freed even though the C++ UIDrawable is still alive in the scene tree

The comment in the code even said "Clear Python references to break cycles" — this is tp_clear logic that was copy-pasted into tp_dealloc, where it's premature destruction.

The Fix

Guard callback unregistration on shared_ptr::use_count():

// Before (WRONG): always destroy callbacks
if (obj->data) {
    obj->data->click_unregister();
    // ...
}

// After (FIXED): only destroy if we're the last owner
if (obj->data && obj->data.use_count() <= 1) {
    obj->data->click_unregister();
    // ...
}

Applied to all 8 drawable types: UIFrame.h, UICaption.h, UISprite.h, UIGrid.h, UICircle.h, UIArc.h, UILine.h, Viewport3D.h. tp_clear (cycle breaker) left as-is since that's specifically called by the cyclic GC to break reference cycles.

Regression test: tests/regression/issue_251_callback_gc_test.py — 6 tests covering Frame, Caption, Sprite, multi-callback survival, standalone cleanup.

Affected Files

  • src/UIFrame.h (tp_dealloc)
  • src/UICaption.h (tp_dealloc)
  • src/UISprite.h (tp_dealloc)
  • src/UIGrid.h (tp_dealloc + cell callbacks)
  • src/UICircle.h (tp_dealloc)
  • src/UIArc.h (tp_dealloc)
  • src/UILine.h (tp_dealloc)
  • src/3d/Viewport3D.h (tp_dealloc)

Lessons Learned: Python Patterns That Were Silently Broken

This bug made perfectly reasonable Python patterns silently fail. Users would write natural code, see no error, and wonder why nothing happened. These all worked fine before this commit — they just silently lost their callbacks:

1. Factory functions returning UI trees

def make_button(parent, label, callback):
    """Standard factory pattern — build and attach a widget."""
    btn = mcrfpy.Frame(pos=(10, 10), size=(100, 30))
    btn.on_click = callback
    parent.children.append(btn)
    # btn goes out of scope — callback SILENTLY LOST

make_button(panel, "OK", handle_ok)  # Broken: button exists but doesn't respond

2. List comprehension UI building

# Build a grid of clickable cells
for i in range(64):
    cell = mcrfpy.Frame(pos=(i%8*50, i//8*50), size=(48, 48))
    cell.on_click = lambda p, b, a, idx=i: select_cell(idx)
    grid.children.append(cell)
    # cell reassigned each iteration — previous wrapper GC'd — callback LOST

3. Scene setup in a function

def setup_game():
    scene = mcrfpy.Scene("game")
    bg = mcrfpy.Frame(pos=(0,0), size=(1024, 768))
    scene.children.append(bg)
    
    play_btn = mcrfpy.Frame(pos=(400, 300), size=(200, 50))
    play_btn.on_click = start_game
    bg.children.append(play_btn)
    # ALL local variables GC'd on return — ALL callbacks LOST

setup_game()  # Scene renders fine but nothing is interactive

4. Temporary variable reuse

f = mcrfpy.Frame(pos=(0, 0), size=(100, 100))
f.on_click = handler_a
parent.children.append(f)

f = mcrfpy.Frame(pos=(100, 0), size=(100, 100))  # rebind f
f.on_click = handler_b
parent.children.append(f)
# First frame's wrapper was GC'd when f was rebound — handler_a LOST

5. The "it works if you store it" mystery

# This FAILS (callback lost):
mcrfpy.Frame(pos=(0,0), size=(100,100)).on_click = handler
parent.children.append(???)  # can't even append, but the pattern exists in loops

# This WORKS (wrapper stays alive):
btn = mcrfpy.Frame(...)
btn.on_click = handler
parent.children.append(btn)
buttons.append(btn)  # Extra reference keeps wrapper alive — WORKAROUND, not a fix

Why it wasn't caught earlier

  • The workaround (storing references) is also a common Python pattern, so code written by careful developers happened to work
  • No error, no exception, no warning — callbacks just silently disappeared
  • The bug only manifests when the Python wrapper is GC'd, which depends on CPython's reference counting timing
  • Tests that create and immediately use objects in the same scope never trigger GC
  • Interactive console use keeps references alive in _ and history

Note on Timer (original report)

The Timer issue is a related but slightly different variant: mcrfpy.Timer() stores the callback in the C++ timer system, but the Python Timer wrapper is GC'd. The fix for Timers should follow Option A from the original report (C++ timer system holds Py_INCREF'd reference to prevent GC while active). The drawable callback fix here addresses the broader class of the same problem for all UI elements.

## Root Cause Found: `tp_dealloc` unconditionally destroys callbacks on ALL drawable types The Timer bug reported here is one instance of a **systemic issue across every UIDrawable type** (Frame, Caption, Sprite, Grid, Circle, Arc, Line, Viewport3D). ### The Mechanism When you do `frame.on_click = callback`: 1. `PyCallable` constructor calls `Py_XNewRef(callback)` — properly increments refcount ✅ 2. Callback is stored as `unique_ptr<PyClickCallable>` on the C++ `UIDrawable` ✅ 3. C++ object lives in parent's children vector via `shared_ptr` ✅ But when the **Python wrapper** goes out of scope: 1. `tp_dealloc` fires on the wrapper 2. It **unconditionally** calls `click_unregister()`, `on_enter_unregister()`, etc. 3. These destroy the `unique_ptr`, which calls `Py_DECREF` on the callback 4. Callback is freed even though the C++ UIDrawable is still alive in the scene tree The comment in the code even said "Clear Python references to break cycles" — this is `tp_clear` logic that was copy-pasted into `tp_dealloc`, where it's premature destruction. ### The Fix Guard callback unregistration on `shared_ptr::use_count()`: ```cpp // Before (WRONG): always destroy callbacks if (obj->data) { obj->data->click_unregister(); // ... } // After (FIXED): only destroy if we're the last owner if (obj->data && obj->data.use_count() <= 1) { obj->data->click_unregister(); // ... } ``` Applied to all 8 drawable types: UIFrame.h, UICaption.h, UISprite.h, UIGrid.h, UICircle.h, UIArc.h, UILine.h, Viewport3D.h. `tp_clear` (cycle breaker) left as-is since that's specifically called by the cyclic GC to break reference cycles. Regression test: `tests/regression/issue_251_callback_gc_test.py` — 6 tests covering Frame, Caption, Sprite, multi-callback survival, standalone cleanup. ### Affected Files - `src/UIFrame.h` (tp_dealloc) - `src/UICaption.h` (tp_dealloc) - `src/UISprite.h` (tp_dealloc) - `src/UIGrid.h` (tp_dealloc + cell callbacks) - `src/UICircle.h` (tp_dealloc) - `src/UIArc.h` (tp_dealloc) - `src/UILine.h` (tp_dealloc) - `src/3d/Viewport3D.h` (tp_dealloc) --- ## Lessons Learned: Python Patterns That Were Silently Broken This bug made **perfectly reasonable Python patterns** silently fail. Users would write natural code, see no error, and wonder why nothing happened. These all worked fine before this commit — they just silently lost their callbacks: ### 1. Factory functions returning UI trees ```python def make_button(parent, label, callback): """Standard factory pattern — build and attach a widget.""" btn = mcrfpy.Frame(pos=(10, 10), size=(100, 30)) btn.on_click = callback parent.children.append(btn) # btn goes out of scope — callback SILENTLY LOST make_button(panel, "OK", handle_ok) # Broken: button exists but doesn't respond ``` ### 2. List comprehension UI building ```python # Build a grid of clickable cells for i in range(64): cell = mcrfpy.Frame(pos=(i%8*50, i//8*50), size=(48, 48)) cell.on_click = lambda p, b, a, idx=i: select_cell(idx) grid.children.append(cell) # cell reassigned each iteration — previous wrapper GC'd — callback LOST ``` ### 3. Scene setup in a function ```python def setup_game(): scene = mcrfpy.Scene("game") bg = mcrfpy.Frame(pos=(0,0), size=(1024, 768)) scene.children.append(bg) play_btn = mcrfpy.Frame(pos=(400, 300), size=(200, 50)) play_btn.on_click = start_game bg.children.append(play_btn) # ALL local variables GC'd on return — ALL callbacks LOST setup_game() # Scene renders fine but nothing is interactive ``` ### 4. Temporary variable reuse ```python f = mcrfpy.Frame(pos=(0, 0), size=(100, 100)) f.on_click = handler_a parent.children.append(f) f = mcrfpy.Frame(pos=(100, 0), size=(100, 100)) # rebind f f.on_click = handler_b parent.children.append(f) # First frame's wrapper was GC'd when f was rebound — handler_a LOST ``` ### 5. The "it works if you store it" mystery ```python # This FAILS (callback lost): mcrfpy.Frame(pos=(0,0), size=(100,100)).on_click = handler parent.children.append(???) # can't even append, but the pattern exists in loops # This WORKS (wrapper stays alive): btn = mcrfpy.Frame(...) btn.on_click = handler parent.children.append(btn) buttons.append(btn) # Extra reference keeps wrapper alive — WORKAROUND, not a fix ``` ### Why it wasn't caught earlier - The workaround (storing references) is also a common Python pattern, so code written by careful developers happened to work - No error, no exception, no warning — callbacks just silently disappeared - The bug only manifests when the Python wrapper is GC'd, which depends on CPython's reference counting timing - Tests that create and immediately use objects in the same scope never trigger GC - Interactive console use keeps references alive in `_` and history ### Note on Timer (original report) The Timer issue is a related but slightly different variant: `mcrfpy.Timer()` stores the callback in the C++ timer system, but the Python Timer wrapper is GC'd. The fix for Timers should follow Option A from the original report (C++ timer system holds `Py_INCREF`'d reference to prevent GC while active). The drawable callback fix here addresses the broader class of the same problem for all UI elements.
Author
Owner

Timer GC Fix Implemented (Option A)

Active timers now hold a strong reference (Py_INCREF) to their Python wrapper object, preventing garbage collection while the timer is registered in the engine.

Mechanism

Added PyObject* py_wrapper to the C++ Timer class with retainPyWrapper()/releasePyWrapper() methods:

  • INCREF at all activation points: init(start=True), start(), restart(), set_active(True)
  • DECREF at all deactivation points: stop(), one-shot fires, ~Timer() destructor

State invariant

py_wrapper is non-null if and only if the timer is active in the engine's timer map.

Edge cases handled

  • stop() from callback: Safe because callback frame holds its own references to the timer args. DECREF won't trigger immediate dealloc.
  • One-shot timers: releasePyWrapper() called alongside stopped = true in Timer::test().
  • restart() after stop(): Re-INCREFs the wrapper, re-preventing GC.
  • Double retain: retainPyWrapper() checks if (py_wrapper == wrapper) return to avoid double-INCREF.
  • Timer destructor: Calls releasePyWrapper() as safety net.

Files modified

  • src/Timer.h — added py_wrapper, retainPyWrapper(), releasePyWrapper()
  • src/Timer.cpp — implemented retain/release, added to stop(), one-shot path, destructor
  • src/PyTimer.cpp — calls retainPyWrapper() in init(), start(), restart(), set_active()

Patterns that now work

# The original bug report - this now works:
def start_timer():
    mcrfpy.Timer("test", my_callback, 50)  # No need to store!
start_timer()  # Timer stays alive and fires correctly

# One-shot timer in a function:
def delayed_action():
    mcrfpy.Timer("once", lambda t, r: do_something(), 1000, once=True)
delayed_action()  # Fires once, then allows GC

# Timer that stops itself:
def start_countdown():
    def tick(timer, runtime):
        if runtime > 5000:
            timer.stop()  # Releases strong ref, allows GC
    mcrfpy.Timer("countdown", tick, 100)
start_countdown()  # Works perfectly

Test

tests/regression/issue_251_timer_gc_test.py — 6 tests covering unstored timers, correct callback args, one-shot timers, stop-from-callback safety, and restart behavior.

Combined with the drawable callback fix from the previous comment, this fully resolves issue #251 for both UI callbacks and timers.

## Timer GC Fix Implemented (Option A) Active timers now hold a strong reference (`Py_INCREF`) to their Python wrapper object, preventing garbage collection while the timer is registered in the engine. ### Mechanism Added `PyObject* py_wrapper` to the C++ `Timer` class with `retainPyWrapper()`/`releasePyWrapper()` methods: - **INCREF** at all activation points: `init(start=True)`, `start()`, `restart()`, `set_active(True)` - **DECREF** at all deactivation points: `stop()`, one-shot fires, `~Timer()` destructor ### State invariant `py_wrapper` is non-null if and only if the timer is active in the engine's timer map. ### Edge cases handled - **stop() from callback**: Safe because callback frame holds its own references to the timer args. DECREF won't trigger immediate dealloc. - **One-shot timers**: `releasePyWrapper()` called alongside `stopped = true` in `Timer::test()`. - **restart() after stop()**: Re-INCREFs the wrapper, re-preventing GC. - **Double retain**: `retainPyWrapper()` checks `if (py_wrapper == wrapper) return` to avoid double-INCREF. - **Timer destructor**: Calls `releasePyWrapper()` as safety net. ### Files modified - `src/Timer.h` — added `py_wrapper`, `retainPyWrapper()`, `releasePyWrapper()` - `src/Timer.cpp` — implemented retain/release, added to `stop()`, one-shot path, destructor - `src/PyTimer.cpp` — calls `retainPyWrapper()` in `init()`, `start()`, `restart()`, `set_active()` ### Patterns that now work ```python # The original bug report - this now works: def start_timer(): mcrfpy.Timer("test", my_callback, 50) # No need to store! start_timer() # Timer stays alive and fires correctly # One-shot timer in a function: def delayed_action(): mcrfpy.Timer("once", lambda t, r: do_something(), 1000, once=True) delayed_action() # Fires once, then allows GC # Timer that stops itself: def start_countdown(): def tick(timer, runtime): if runtime > 5000: timer.stop() # Releases strong ref, allows GC mcrfpy.Timer("countdown", tick, 100) start_countdown() # Works perfectly ``` ### Test `tests/regression/issue_251_timer_gc_test.py` — 6 tests covering unstored timers, correct callback args, one-shot timers, stop-from-callback safety, and restart behavior. Combined with the drawable callback fix from the previous comment, this fully resolves issue #251 for both UI callbacks and timers.
john closed this issue 2026-02-20 01:49:31 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
john/McRogueFace#251
No description provided.