XPK Start: Mon Apr 20 15:44:22 UTC 2026 2026-04-20 15:44:50.457491: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303) I0420 15:44:50.671699 137435586180928 max_utils.py:273] Attempting to initialize the jax distributed system... INFO:2026-04-20 15:44:59,713:jax._src.distributed:149: Starting JAX distributed service on [::]:8482 I0420 15:44:59.713864 137435586180928 distributed.py:149] Starting JAX distributed service on [::]:8482 INFO:2026-04-20 15:44:59,716:jax._src.distributed:166: Connecting to JAX distributed service on mt-02-sft-linen-ckpt-lbw2p-slice-job-0-0.mt-02-sft-linen-ckpt-lbw2p:8482 I0420 15:44:59.716297 137435586180928 distributed.py:166] Connecting to JAX distributed service on mt-02-sft-linen-ckpt-lbw2p-slice-job-0-0.mt-02-sft-linen-ckpt-lbw2p:8482 I0420 15:45:01.825257 137435586180928 max_utils.py:284] Jax distributed system initialized! I0420 15:45:08.251928 137435586180928 max_utils.py:800] System Information: Jax Version: 0.8.3 I0420 15:45:08.252031 137435586180928 max_utils.py:801] System Information: Jaxlib Version: 0.8.3 I0420 15:45:08.252070 137435586180928 max_utils.py:802] System Information: Jax Backend: PJRT C API TFRT TPU v6 lite Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465 I0420 15:45:08.255484 137435586180928 maxtext_utils.py:1718] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0420 15:45:08.448967 137435586180928 maxtext_utils.py:1718] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0420 15:45:09.538882 137435586180928 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None I0420 15:45:09.539337 137435586180928 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7cfe81a52240>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 15:45:09.539397 137435586180928 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28 W0420 15:45:10.143502 137435586180928 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA I0420 15:45:10.695383 1938 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com I0420 15:45:12.310913 137435586180928 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items. W0420 15:45:14.310637 137435586180928 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on. I0420 15:45:14.310934 137435586180928 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/to_nnx__rngs/aqt/count, params/params/decoder/to_nnx__rngs/aqt/key, params/params/decoder/to_nnx__rngs/dropout/count, params/params/decoder/to_nnx__rngs/dropout/key, params/params/decoder/to_nnx__rngs/params/count, params/params/decoder/to_nnx__rngs/params/key I0420 15:45:14.374490 137435586180928 checkpointer.py:318] Finished restoring checkpoint in 2.89 seconds from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items. I0420 15:45:14.457303 137435586180928 config.py:112] TensorFlow version 2.20.0 available. I0420 15:45:14.457803 137435586180928 config.py:125] JAX version 0.8.3 available. /deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, smaller than number of host loading data. This is known to lead to inefficient dataloading. Seegithub.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice warnings.warn( E0420 15:45:20.053145 137435586180928 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead. I0420 15:45:20.053358 137435586180928 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform. I0420 15:45:20.385571 137435586180928 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None I0420 15:45:20.385712 137435586180928 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7cfe81a52240>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 15:45:20.385759 137435586180928 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None I0420 15:45:20.385794 137435586180928 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7cfe81a52240>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 15:45:20.385837 137435586180928 checkpoint_manager.py:702] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ce6f94ca6f0>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ce6f8c70d40>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ce6f8c776b0>}, handler_registry=None I0420 15:45:20.386042 137435586180928 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ce6f94ca6f0>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. I0420 15:45:20.386107 137435586180928 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ce6f8c70d40>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. I0420 15:45:20.386138 137435586180928 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ce6f8c776b0>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. I0420 15:45:20.386164 137435586180928 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ce6f8c70e00>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. I0420 15:45:20.386191 137435586180928 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ce6f94ca6f0>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ce6f94ca6f0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ce6f8c70d40>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ce6f8c70d40>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ce6f8c776b0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ce6f8c776b0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ce6f8c70e00>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7ce6f8c70e00>}). I0420 15:45:20.386575 137435586180928 async_checkpointer.py:177] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7ce6f8cea480> timeout: 600 secs and primary_host=0 for async checkpoint writes I0420 15:45:22.925202 137435586180928 checkpoint_manager.py:558] Created directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260420_153552/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260420_153552_02_sft_linen_ckpt/checkpoints I0420 15:45:23.358787 137435586180928 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260420_153552/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260420_153552_02_sft_linen_ckpt/checkpoints I0420 15:45:23.827020 137435586180928 checkpoint_manager.py:921] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260420_153552/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260420_153552_02_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7ce6f8c75100> I0420 15:45:23.833229 137435586180928 metrics_logger.py:64] WandbBackend skipped: 'wandb' library not installed. I0420 15:45:23.833429 137435586180928 peft_trainer.py:590] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto)) I0420 15:45:24.244656 137435586180928 peft_trainer.py:600] Compiled train_step cache size: 0 Training: 0%| | 0/5 [00:00<?, ?step/s]I0420 15:45:24.246809 137435586180928 metric_logger.py:301] number parameters: 0.000 billion I0420 15:45:24.310365 137281625114368 grain_pool.py:367] Grain pool will use 1 processes. I0420 15:45:24.335663 137281625114368 grain_pool.py:440] Grain pool will start child processes. Per train step: Total TFLOPs: 0.00 split as 54.29% learnable weight flops and 45.71% attention flops I0420 15:45:24.341269 137281625114368 grain_pool.py:448] Grain pool started all child processes. 2026-04-20 15:45:28.216405: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. 2026-04-20 15:45:28.261053: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2026-04-20 15:45:29.237578: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. 2026-04-20 15:45:33.471868: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303) I0420 15:45:39.323553 137435586180928 utils.py:86] Train loop finished in: 15.0149 seconds Traceback (most recent call last): File "<frozen runpy>", line 198, in _run_module_as_main File "<frozen runpy>", line 88, in _run_code File "/deps/src/maxtext/trainers/post_train/sft/train_sft.py", line 280, in <module> app.run(main) File "/usr/local/lib/python3.12/site-packages/absl/app.py", line 316, in run _run_main(main, args) File "/usr/local/lib/python3.12/site-packages/absl/app.py", line 261, in _run_main sys.exit(main(argv)) ^^^^^^^^^^ File "/deps/src/maxtext/trainers/post_train/sft/train_sft.py", line 276, in main train(mt_config, goodput_recorder) File "/deps/src/maxtext/trainers/post_train/sft/train_sft.py", line 253, in train trainer = train_model(mt_config, trainer, mesh) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/deps/src/maxtext/trainers/post_train/sft/train_sft.py", line 239, in train_model trainer.train(trainer.data_hooks.train_data_iterator, trainer.data_hooks.eval_data_iterator) File "/usr/local/lib/python3.12/site-packages/tunix/sft/peft_trainer.py", line 659, in train train_example = sharding_utils.shard_input( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/tunix/sft/sharding_utils.py", line 58, in shard_input return jax.tree.map( ^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/tree.py", line 155, in map return tree_util.tree_map(f, tree, *rest, is_leaf=is_leaf) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/tree_util.py", line 362, in tree_map return treedef.unflatten(f(*xs) for xs in zip(*all_leaves)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/tree_util.py", line 362, in <genexpr> return treedef.unflatten(f(*xs) for xs in zip(*all_leaves)) ^^^^^^ File "/usr/local/lib/python3.12/site-packages/tunix/sft/sharding_utils.py", line 59, in <lambda> lambda x: jax.make_array_from_process_local_data( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/array.py", line 986, in make_array_from_process_local_data out = [_array_from_process_local_data(data, s, shape) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/array.py", line 1048, in _array_from_process_local_data return make_array_from_callback(global_shape, sharding, cb) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/array.py", line 845, in make_array_from_callback per_device_values = api.device_put(per_device_values, devices) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/api.py", line 2729, in device_put out_flat = dispatch._batched_device_put_impl( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/dispatch.py", line 558, in _batched_device_put_impl y = _device_put_impl(x, device=device, src=src, copy=cp, aval=aval) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/dispatch.py", line 545, in _device_put_impl return _device_put_sharding_impl(x, aval, device, copy) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/site-packages/jax/_src/dispatch.py", line 487, in _device_put_sharding_impl raise ValueError( ValueError: device_put's first argument must be a fully addressable array, but got value with devices {TpuDevice(id=22, process_index=5, coords=(2,5,0), core_on_chip=0), TpuDevice(id=27, process_index=7, coords=(3,6,0), core_on_chip=0), TpuDevice(id=10, process_index=3, coords=(2,2,0), core_on_chip=0), TpuDevice(id=18, process_index=5, coords=(2,4,0), core_on_chip=0), TpuDevice(id=3, process_index=1, coords=(3,0,0), core_on_chip=0), TpuDevice(id=6, process_index=1, coords=(2,1,0), core_on_chip=0), TpuDevice(id=9, process_index=2, coords=(1,2,0), core_on_chip=0), TpuDevice(id=14, process_index=3, coords=(2,3,0), core_on_chip=0), TpuDevice(id=7, process_index=1, coords=(3,1,0), core_on_chip=0), TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=30, process_index=7, coords=(2,7,0), core_on_chip=0), TpuDevice(id=21, process_index=4, coords=(1,5,0), core_on_chip=0), TpuDevice(id=25, process_index=6, coords=(1,6,0), core_on_chip=0), TpuDevice(id=5, process_index=0, coords=(1,1,0), core_on_chip=0), TpuDevice(id=12, process_index=2, coords=(0,3,0), core_on_chip=0), TpuDevice(id=24, process_index=6, coords=(0,6,0), core_on_chip=0), TpuDevice(id=2, process_index=1, coords=(2,0,0), core_on_chip=0), TpuDevice(id=23, process_index=5, coords=(3,5,0), core_on_chip=0), TpuDevice(id=8, process_index=2, coords=(0,2,0), core_on_chip=0), TpuDevice(id=29, process_index=6, coords=(1,7,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=15, process_index=3, coords=(3,3,0), core_on_chip=0), TpuDevice(id=31, process_index=7, coords=(3,7,0), core_on_chip=0), TpuDevice(id=13, process_index=2, coords=(1,3,0), core_on_chip=0), TpuDevice(id=19, process_index=5, coords=(3,4,0), core_on_chip=0), TpuDevice(id=4, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=16, process_index=4, coords=(0,4,0), core_on_chip=0), TpuDevice(id=11, process_index=3, coords=(3,2,0), core_on_chip=0), TpuDevice(id=20, process_index=4, coords=(0,5,0), core_on_chip=0), TpuDevice(id=28, process_index=6, coords=(0,7,0), core_on_chip=0), TpuDevice(id=17, process_index=4, coords=(1,4,0), core_on_chip=0), TpuDevice(id=26, process_index=7, coords=(2,6,0), core_on_chip=0)} I0420 15:45:39.674383 137281625114368 grain_pool.py:542] Grain pool is exiting. I0420 15:45:39.674488 137281625114368 grain_pool.py:547] Shutting down multiprocessing system. I0420 15:45:45.571768 137281625114368 grain_pool.py:547] Shutting down multiprocessing system. Training: 0%| | 0/5 [00:24<?, ?step/s] Exception ignored in: <function GCSRecordWriter.__del__ at 0x7cfe7b6a96c0> Traceback (most recent call last): File "/usr/local/lib/python3.12/site-packages/tensorboardX/record_writer.py", line 134, in __del__ File "/usr/local/lib/python3.12/site-packages/tensorboardX/record_writer.py", line 158, in close File "/usr/local/lib/python3.12/site-packages/tensorboardX/record_writer.py", line 149, in flush File "/usr/local/lib/python3.12/copy.py", line 87, in copy ImportError: sys.meta_path is None, Python is likely shutting down /usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' Exception ignored in: <function GCSRecordWriter.__del__ at 0x7cfe7b6a96c0> Traceback (most recent call last): File "/usr/local/lib/python3.12/site-packages/tensorboardX/record_writer.py", line 134, in __del__ File "/usr/local/lib/python3.12/site-packages/tensorboardX/record_writer.py", line 158, in close File "/usr/local/lib/python3.12/site-packages/tensorboardX/record_writer.py", line 149, in flush File "/usr/local/lib/python3.12/copy.py", line 87, in copy ImportError: sys.meta_path is None, Python is likely shutting down XPK End: Mon Apr 20 15:45:57 UTC 2026 EXIT_CODE=1