2026-04-16 21:13:30.845532: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0416 21:13:31.292215 127603182345344 max_utils.py:238] Skipping jax distributed system due to skip_jax_distributed_system=True flag.
I0416 21:14:01.481170 127603182345344 max_utils.py:800] System Information: Jax Version: 0.8.3
I0416 21:14:01.481449 127603182345344 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0416 21:14:01.481501 127603182345344 max_utils.py:802] System Information: Jax Backend: PJRT C API TFRT TPU v6 lite Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0416 21:14:01.486060 127603182345344 maxtext_utils.py:1687] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0416 21:14:01.568634 127603182345344 maxtext_utils.py:1687] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0416 21:14:01.651642 127603182345344 maxtext_utils.py:1687] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0416 21:14:02.618201 127603182345344 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0416 21:14:02.618680 127603182345344 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x740d4cbfd940>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0416 21:14:02.618773 127603182345344 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.36
W0416 21:14:03.915950 127603182345344 checkpoint.py:202] Metadata file does not exist: gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items/_CHECKPOINT_METADATA
I0416 21:14:04.166777 4171481 google_auth_provider.cc:149] Using credentials at ~/.config/gcloud/application_default_credentials.json
I0416 21:14:04.166856 4171481 google_auth_provider.cc:156] Using OAuth2 AuthProvider
I0416 21:14:04.691958 127603182345344 event_tracking.py:70] [process=0] [sync] Started load checkpoint @ gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
I0416 21:14:04.821602 127603182345344 checkpointer.py:307] Restoring checkpoint from gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
I0416 21:14:04.821842 127603182345344 event_tracking.py:125] [process=0] [sync] Finished blocking load in 0.13 seconds. Continuing load @ gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
W0416 21:14:05.131629 127603182345344 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0416 21:14:05.131973 127603182345344 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: decoder/decoder_norm/bias/value, decoder/decoder_norm/scale/value, decoder/dropout/rngs/aqt/count/value, decoder/dropout/rngs/aqt/key/value, decoder/dropout/rngs/dropout/count/value, decoder/dropout/rngs/dropout/key/value, decoder/dropout/rngs/params/count/value, decoder/dropout/rngs/params/key/value, decoder/layers/dropout/rngs/aqt/count/value, decoder/layers/dropout/rngs/aqt/key/value, decoder/layers/dropout/rngs/dropout/count/value, decoder/layers/dropout/rngs/dropout/key/value, decoder/layers/dropout/rngs/params/count/value, decoder/layers/dropout/rngs/params/key/value, decoder/layers/mlp/dropout/rngs/aqt/count/value, decoder/layers/mlp/dropout/rngs/aqt/key/value, decoder/layers/mlp/dropout/rngs/dropout/count/value, decoder/layers/mlp/dropout/rngs/dropout/key/value, decoder/layers/mlp/dropout/rngs/params/count/value, decoder/layers/mlp/dropout/rngs/params/key/value, decoder/layers/mlp/mlp_layer_norm/bias/value, decoder/layers/mlp/mlp_layer_norm/scale/value, decoder/layers/mlp/wi/bias/value, decoder/layers/mlp/wi/kernel/value, decoder/layers/mlp/wo/bias/value, decoder/layers/mlp/wo/kernel/value, decoder/layers/pre_self_attention_norm/bias/value, decoder/layers/pre_self_attention_norm/scale/value, decoder/layers/rngs/aqt/count/value, decoder/layers/rngs/aqt/key/value, decoder/layers/rngs/dropout/count/value, decoder/layers/rngs/dropout/key/value, decoder/layers/rngs/params/count/value, decoder/layers/rngs/params/key/value, decoder/layers/self_attention/out/bias/value, decoder/layers/self_attention/out/kernel/value, decoder/layers/self_attention/qkv_proj/bias/value, decoder/layers/self_attention/qkv_proj/kernel/value, decoder/position_embedder/embedding/value, decoder/rngs/aqt/count/value, decoder/rngs/aqt/key/value, decoder/rngs/dropout/count/value, decoder/rngs/dropout/key/value, decoder/rngs/params/count/value, decoder/rngs/params/key/value, token_embedder/embedding/value
I0416 21:14:05.132554 127603182345344 event_tracking.py:138] [process=0] [sync] Finished load in 0.44 seconds @ gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items
I0416 21:14:05.137176 127603182345344 max_utils.py:194] tensorboardX not available; using no-op SummaryWriter.
I0416 21:14:05.161016 127603182345344 config.py:112] TensorFlow version 2.20.0 available.
I0416 21:14:05.161425 127603182345344 config.py:125] JAX version 0.8.3 available.
E0416 21:14:08.360208 127603182345344 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0416 21:14:08.743884 127603182345344 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0416 21:14:08.744020 127603182345344 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x740d4cbfd940>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0416 21:14:08.744076 127603182345344 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0416 21:14:08.744103 127603182345344 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x740d4cbfd940>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0416 21:14:08.744149 127603182345344 checkpoint_manager.py:708] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7406aeb49c40>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7406acf17f20>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7406acf15fd0>}, handler_registry=None
I0416 21:14:08.744389 127603182345344 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7406aeb49c40>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0416 21:14:08.744426 127603182345344 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7406acf17f20>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0416 21:14:08.744449 127603182345344 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7406acf15fd0>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0416 21:14:08.744468 127603182345344 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7406aeb49cd0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0416 21:14:08.744489 127603182345344 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7406aeb49c40>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7406aeb49c40>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7406acf17f20>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7406acf17f20>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7406acf15fd0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7406acf15fd0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7406aeb49cd0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7406aeb49cd0>}).
I0416 21:14:08.744904 127603182345344 async_checkpointer.py:192] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x7406aeb96980> timeout: 1200 secs and primary_host=0 for async checkpoint writes
I0416 21:14:09.551501 127603182345344 checkpoint_manager.py:564] Created directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints
I0416 21:14:11.890223 127603182345344 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints
I0416 21:14:11.890483 127603182345344 checkpoint_manager.py:929] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7406acf172f0>
I0416 21:14:13.913620 127603182345344 metrics_logger.py:64] WandbBackend skipped: 'wandb' library not installed.
I0416 21:14:13.913903 127603182345344 peft_trainer.py:590] Training with mesh: Mesh('diloco': 1, 'data': 1, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0416 21:14:14.367448 127603182345344 peft_trainer.py:600] Compiled train_step cache size: 0
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.
Training: 0%| | 0/5 [00:00<?, ?step/s]
I0416 21:14:14.370298 127603182345344 metric_logger.py:289] number parameters: 0.000 billion
Per train step: Total TFLOPs: 0.00 split as 54.29% learnable weight flops and 45.71% attention flops
2026-04-16 21:14:17.562603: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-16 21:14:17.601146: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-16 21:14:18.627113: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-16 21:14:21.040568: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0416 21:14:28.499745 127603182345344 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0416 21:14:28.499957 127603182345344 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 1
I0416 21:14:28.500022 127603182345344 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1.
I0416 21:14:28.590445 127603182345344 signaling_client.py:373] Using ThreadSafeKeyValueSignalingClient
I0416 21:14:28.614138 127603182345344 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array.
I0416 21:14:28.614198 127603182345344 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:14:28.686511 127493092673088 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1
I0416 21:14:29.381070 127493082187328 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/optimizer_state
I0416 21:14:29.381768 127493082187328 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/model_params
I0416 21:14:29.489559 127603182345344 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.876713s
I0416 21:14:29.490092 127603182345344 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0416 21:14:29.490141 127603182345344 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:14:29.520458 127603182345344 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.030797s
I0416 21:14:29.520849 127603182345344 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 220.803 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.9289424419403076 s) (per-host)
I0416 21:14:29.520964 127603182345344 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.929072s (batch_requests_ready=0.003378s, total_serialization_initiated=0.925300s, others=0.000394s)
I0416 21:14:29.521268 127603182345344 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 670.711 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.9166805744171143 s) (per-host)
I0416 21:14:29.521381 127603182345344 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.916803s (batch_requests_ready=0.002418s, total_serialization_initiated=0.914023s, others=0.000361s)
I0416 21:14:29.521443 127603182345344 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.930414s (all_items=0.000024s, per_item={'model_params': '0.00001979', 'optimizer_state': '0.00000381'}, temp_paths=0.930391)
I0416 21:14:29.522078 127603182345344 event_tracking.py:125] [process=0] [async] Finished blocking save in 1.02 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1.
I0416 21:14:29.522308 127493103158848 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-16 21:34:29.522271
I0416 21:14:29.522549 127603182345344 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0416 21:14:29.522878 127603182345344 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776374068.4997227, 'wait_for_prev_duration_secs': 0.00011301040649414062, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776374068.4999802, 'checkpointer_blocking_duration_secs': 1.0224239826202393, 'get_old_steps_start_time': 1776374069.5224185, 'get_old_steps_duration_secs': 9.393692016601562e-05, 'checkpoint_manager_blocking_start_time': 1776374068.4996448, 'checkpoint_manager_blocking_duration_secs': 1.02321195602417}
I0416 21:14:29.523034 127603182345344 profiler.py:85] Starting JAX profiler at step 1.
I0416 21:14:29.642767 127493046535744 checkpoint.py:188] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776374069299975685, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1776374069299975685, "commit_timestamp_nsecs": null, "custom_metadata": {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0416 21:14:29.644687 127492983621184 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0416 21:14:29.822497 127603182345344 peft_trainer.py:485] Train step 1 training loss: 6.004404 - training perplexity: 405.209259
Training: 0%| | 0/5 [00:15<?, ?step/s, _train_loss=6, _train_perplexity=405, _train_steps_per_sec=0.071]
Training: 20%|██ | 1/5 [00:15<01:01, 15.46s/step, _train_loss=6, _train_perplexity=405, _train_steps_per_sec=0.071]
I0416 21:14:29.823649 127603182345344 max_utils.py:750] Memstats: After params initialized:
I0416 21:14:29.823740 127603182345344 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_0(process=0,(0,0,0,0))
I0416 21:14:29.823793 127603182345344 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_1(process=0,(1,0,0,0))
I0416 21:14:29.823839 127603182345344 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_2(process=0,(0,1,0,0))
I0416 21:14:29.823879 127603182345344 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_3(process=0,(1,1,0,0))
I0416 21:14:29.823917 127603182345344 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_4(process=0,(0,2,0,0))
I0416 21:14:29.823953 127603182345344 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_5(process=0,(1,2,0,0))
I0416 21:14:29.823990 127603182345344 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_6(process=0,(0,3,0,0))
I0416 21:14:29.824023 127603182345344 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_7(process=0,(1,3,0,0))
I0416 21:14:30.001374 127603182345344 metric_logger.py:185] completed step: 1, seconds: 15.453, TFLOP/s/device: 0.000, Tokens/s/device: 66.264, total_weights: 6826, loss: 6.004
I0416 21:14:30.015195 127603182345344 peft_trainer.py:485] Train step 2 training loss: 6.137612 - training perplexity: 462.946655
Training: 20%|██ | 1/5 [00:15<01:01, 15.46s/step, _train_loss=6.07, _train_perplexity=433, _train_steps_per_sec=0.413]
Training: 40%|████ | 2/5 [00:15<00:19, 6.48s/step, _train_loss=6.07, _train_perplexity=433, _train_steps_per_sec=0.413]
I0416 21:14:30.016810 127603182345344 metric_logger.py:185] completed step: 2, seconds: 0.192, TFLOP/s/device: 0.001, Tokens/s/device: 5338.594, total_weights: 4636, loss: 6.138
I0416 21:14:30.039141 127603182345344 peft_trainer.py:485] Train step 3 training loss: 5.638189 - training perplexity: 280.953430
Training: 40%|████ | 2/5 [00:15<00:19, 6.48s/step, _train_loss=5.93, _train_perplexity=375, _train_steps_per_sec=1.98]
I0416 21:14:30.040519 127603182345344 metric_logger.py:185] completed step: 3, seconds: 0.024, TFLOP/s/device: 0.009, Tokens/s/device: 43204.800, total_weights: 5886, loss: 5.638
I0416 21:14:30.052953 127603182345344 peft_trainer.py:485] Train step 4 training loss: 5.768358 - training perplexity: 320.011780
Training: 60%|██████ | 3/5 [00:15<00:12, 6.48s/step, _train_loss=5.89, _train_perplexity=360, _train_steps_per_sec=11.9]
I0416 21:14:30.054291 127603182345344 metric_logger.py:185] completed step: 4, seconds: 0.014, TFLOP/s/device: 0.016, Tokens/s/device: 72577.227, total_weights: 4990, loss: 5.768
I0416 21:14:30.054584 127603182345344 profiler.py:113] Stopping JAX profiler at step 5.
I0416 21:14:30.650598 127493036049984 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/model_params/array_metadatas/process_0
I0416 21:14:30.910360 127493015078464 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_0
I0416 21:14:32.218957 127493004592704 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.319877s (commit=1.897046s, array_metadata_write=0.422831s)
I0416 21:14:32.220005 127493103158848 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 56.535 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 3.6280887126922607 s) (per-host)
I0416 21:14:32.388974 127492994106944 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.506206s (commit=1.950958s, array_metadata_write=0.555248s)
I0416 21:14:32.390113 127493103158848 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 162.417 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.785482883453369 s) (per-host)
I0416 21:14:32.390390 127493103158848 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 2.867794s.
I0416 21:14:32.572980 127493103158848 checkpoint.py:228] Read Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776374069299975685, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} from gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0416 21:14:32.763076 127493103158848 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:14:32.965797 127493046535744 checkpoint.py:247] Updated Metadata={'item_handlers': {'model_params': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler', 'optimizer_state': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776374069299975685, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0416 21:14:33.188506 127493103158848 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.575443s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/model_params
I0416 21:14:33.189359 127493103158848 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/model_params
I0416 21:14:33.597511 127493103158848 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:14:34.038228 127493103158848 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.570066s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/optimizer_state
I0416 21:14:34.039159 127493103158848 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1/optimizer_state
I0416 21:14:34.308415 127493103158848 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1
I0416 21:14:35.040098 127493103158848 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1`.
I0416 21:14:35.040850 127493103158848 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.54 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/1
I0416 21:14:35.040942 127493103158848 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.518348s.
I0416 21:14:35.041152 127492983621184 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0416 21:14:35.041284 127492983621184 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0416 21:14:35.041374 127492983621184 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0416 21:14:35.041436 127492983621184 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0416 21:14:35.228086 127603182345344 utils.py:86] Train loop finished in: 20.8571 seconds
I0416 21:14:35.228813 127603182345344 peft_trainer.py:485] Train step 5 training loss: 5.830140 - training perplexity: 340.406403
Training: 80%|████████ | 4/5 [00:20<00:06, 6.48s/step, _train_loss=5.88, _train_perplexity=356, _train_steps_per_sec=23.9]
Training: 100%|██████████| 5/5 [00:20<00:00, 3.08s/step, _train_loss=5.88, _train_perplexity=356, _train_steps_per_sec=23.9]
I0416 21:14:35.230207 127603182345344 metric_logger.py:185] completed step: 5, seconds: 5.176, TFLOP/s/device: 0.000, Tokens/s/device: 197.853, total_weights: 4264, loss: 5.830
I0416 21:14:35.233552 127603182345344 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0416 21:14:35.233731 127603182345344 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 5
I0416 21:14:35.233798 127603182345344 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5.
I0416 21:14:35.344758 127603182345344 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array.
I0416 21:14:35.344907 127603182345344 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:14:35.358796 127603182345344 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.015488s
I0416 21:14:35.359142 127603182345344 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0416 21:14:35.359182 127603182345344 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:14:35.390075 127603182345344 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.031216s
I0416 21:14:35.390434 127603182345344 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 2.962 MiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.06762099266052246 s) (per-host)
I0416 21:14:35.390551 127603182345344 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.067752s (batch_requests_ready=0.002405s, total_serialization_initiated=0.064987s, others=0.000360s)
I0416 21:14:35.390912 127603182345344 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 10.606 MiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.05661153793334961 s) (per-host)
I0416 21:14:35.390986 127603182345344 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.056693s (batch_requests_ready=0.002507s, total_serialization_initiated=0.053801s, others=0.000385s)
I0416 21:14:35.391068 127603182345344 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.068769s (all_items=0.000022s, per_item={'model_params': '0.00001979', 'optimizer_state': '0.00000238'}, temp_paths=0.068747)
I0416 21:14:35.391772 127603182345344 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.16 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5.
I0416 21:14:35.392015 127492941678144 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-16 21:34:35.391975
I0416 21:14:35.392315 127603182345344 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0416 21:14:35.392630 127493103158848 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0416 21:14:35.392758 127603182345344 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776374075.2335236, 'wait_for_prev_duration_secs': 6.365776062011719e-05, 'time_between_consecutive_saves_sec': 0.19207048416137695, 'checkpointer_blocking_start_time': 1776374075.2337549, 'checkpointer_blocking_duration_secs': 0.15837907791137695, 'get_old_steps_start_time': 1776374075.3921597, 'get_old_steps_duration_secs': 0.00011348724365234375, 'checkpoint_manager_blocking_start_time': 1776374075.2334738, 'checkpoint_manager_blocking_duration_secs': 0.15925145149230957}
I0416 21:14:35.392923 127603182345344 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0416 21:14:35.411346 127492983621184 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5
I0416 21:14:36.111943 127492994106944 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5/model_params
I0416 21:14:36.112873 127492994106944 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5/optimizer_state
I0416 21:14:37.388089 127492973135424 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_0
I0416 21:14:37.427177 127493015078464 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5/model_params/array_metadatas/process_0
I0416 21:14:38.943069 127492952163904 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.338260s (commit=1.915511s, array_metadata_write=0.422749s)
I0416 21:14:38.948512 127492962649664 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.348586s (commit=1.935701s, array_metadata_write=0.412885s)
I0416 21:14:38.949511 127492941678144 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 56.557 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 3.6266543865203857 s) (per-host)
I0416 21:14:38.949813 127492941678144 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 170.053 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.615516424179077 s) (per-host)
I0416 21:14:38.949895 127492941678144 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.557538s.
I0416 21:14:39.284973 127492941678144 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:14:39.711813 127492941678144 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.562474s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5/model_params
I0416 21:14:39.712685 127492941678144 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5/model_params
I0416 21:14:40.134906 127492941678144 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:14:40.517791 127492941678144 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.523288s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5/optimizer_state
I0416 21:14:40.518666 127492941678144 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5/optimizer_state
I0416 21:14:40.801910 127492941678144 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5
I0416 21:14:41.494839 127492941678144 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5`.
I0416 21:14:41.495628 127492941678144 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.26 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_02_sft_nnx_ckpt/checkpoints/5
I0416 21:14:41.495702 127492941678144 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 6.103346s.
I0416 21:14:41.495862 127493103158848 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0416 21:14:41.495973 127493103158848 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0416 21:14:41.496028 127493103158848 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0416 21:14:41.496079 127493103158848 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0416 21:14:41.496230 127603182345344 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0416 21:14:41.496460 127603182345344 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=127603182345344 count=1 at 0x7406b27dd8c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7406acf16cc0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7406acf16f30>, _write_futures=[])
I0416 21:14:41.496953 127603182345344 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=127603182345344 count=1 at 0x7406b27dd8c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7406acf16cc0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7406acf16f30>, _write_futures=[])
I0416 21:14:41.496987 127603182345344 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=127603182345344 count=1 at 0x7406b27dd8c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7406acf16cc0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7406acf16f30>, _write_futures=[])
Training: 100%|██████████| 5/5 [00:28<00:00, 5.65s/step, _train_loss=5.88, _train_perplexity=356, _train_steps_per_sec=23.9]
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.
~/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '