2026-04-20 21:01:43.299730: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303) I0420 21:01:43.867106 134480345812096 max_utils.py:238] Skipping jax distributed system due to skip_jax_distributed_system=True flag. I0420 21:02:33.152223 134480345812096 max_utils.py:800] System Information: Jax Version: 0.8.3 I0420 21:02:33.152345 134480345812096 max_utils.py:801] System Information: Jaxlib Version: 0.8.3 I0420 21:02:33.152379 134480345812096 max_utils.py:802] System Information: Jax Backend: PJRT C API TFRT TPU v6 lite Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465 I0420 21:02:33.155655 134480345812096 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0420 21:02:33.237605 134480345812096 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0420 21:02:33.320575 134480345812096 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0420 21:02:34.285350 134480345812096 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None I0420 21:02:34.285803 134480345812096 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7a4e839988f0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 21:02:34.285862 134480345812096 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.36 W0420 21:02:35.579232 134480345812096 checkpoint.py:202] Metadata file does not exist: 
gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items/_CHECKPOINT_METADATA I0420 21:02:35.861202 1303379 google_auth_provider.cc:149] Using credentials at ~/.config/gcloud/application_default_credentials.json I0420 21:02:35.861279 1303379 google_auth_provider.cc:156] Using OAuth2 AuthProvider I0420 21:02:36.313401 134480345812096 event_tracking.py:70] [process=0] [sync] Started load checkpoint @ gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items. I0420 21:02:36.430268 134480345812096 checkpointer.py:307] Restoring checkpoint from gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items. I0420 21:02:36.430491 134480345812096 event_tracking.py:125] [process=0] [sync] Finished blocking load in 0.12 seconds. Continuing load @ gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items. W0420 21:02:36.726357 134480345812096 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on. 
I0420 21:02:36.726686 134480345812096 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: decoder/decoder_norm/bias/value, decoder/decoder_norm/scale/value, decoder/dropout/rngs/aqt/count/value, decoder/dropout/rngs/aqt/key/value, decoder/dropout/rngs/dropout/count/value, decoder/dropout/rngs/dropout/key/value, decoder/dropout/rngs/params/count/value, decoder/dropout/rngs/params/key/value, decoder/layers/dropout/rngs/aqt/count/value, decoder/layers/dropout/rngs/aqt/key/value, decoder/layers/dropout/rngs/dropout/count/value, decoder/layers/dropout/rngs/dropout/key/value, decoder/layers/dropout/rngs/params/count/value, decoder/layers/dropout/rngs/params/key/value, decoder/layers/mlp/dropout/rngs/aqt/count/value, decoder/layers/mlp/dropout/rngs/aqt/key/value, decoder/layers/mlp/dropout/rngs/dropout/count/value, decoder/layers/mlp/dropout/rngs/dropout/key/value, decoder/layers/mlp/dropout/rngs/params/count/value, decoder/layers/mlp/dropout/rngs/params/key/value, decoder/layers/mlp/mlp_layer_norm/bias/value, decoder/layers/mlp/mlp_layer_norm/scale/value, decoder/layers/mlp/wi/bias/value, decoder/layers/mlp/wi/kernel/value, decoder/layers/mlp/wo/bias/value, decoder/layers/mlp/wo/kernel/value, decoder/layers/pre_self_attention_norm/bias/value, decoder/layers/pre_self_attention_norm/scale/value, decoder/layers/rngs/aqt/count/value, decoder/layers/rngs/aqt/key/value, decoder/layers/rngs/dropout/count/value, decoder/layers/rngs/dropout/key/value, decoder/layers/rngs/params/count/value, decoder/layers/rngs/params/key/value, decoder/layers/self_attention/out/bias/value, decoder/layers/self_attention/out/kernel/value, decoder/layers/self_attention/qkv_proj/bias/value, decoder/layers/self_attention/qkv_proj/kernel/value, decoder/position_embedder/embedding/value, decoder/rngs/aqt/count/value, decoder/rngs/aqt/key/value, decoder/rngs/dropout/count/value, decoder/rngs/dropout/key/value, 
decoder/rngs/params/count/value, decoder/rngs/params/key/value, token_embedder/embedding/value I0420 21:02:36.727247 134480345812096 event_tracking.py:138] [process=0] [sync] Finished load in 0.41 seconds @ gs://wanglance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items I0420 21:02:36.731760 134480345812096 max_utils.py:194] tensorboardX not available; using no-op SummaryWriter. I0420 21:02:36.753572 134480345812096 config.py:112] TensorFlow version 2.20.0 available. I0420 21:02:36.753963 134480345812096 config.py:125] JAX version 0.8.3 available. E0420 21:02:39.988869 134480345812096 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead. I0420 21:02:40.364208 134480345812096 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None I0420 21:02:40.364379 134480345812096 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7a4e839988f0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 21:02:40.364428 134480345812096 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None I0420 21:02:40.364456 134480345812096 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7a4e839988f0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 21:02:40.364507 
134480345812096 checkpoint_manager.py:708] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a480d3acbc0>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a47e7d9aab0>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a47e7d14ef0>}, handler_registry=None I0420 21:02:40.364779 134480345812096 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a480d3acbc0>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. I0420 21:02:40.364818 134480345812096 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a47e7d9aab0>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. I0420 21:02:40.364840 134480345812096 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". 
Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a47e7d14ef0>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. I0420 21:02:40.364858 134480345812096 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a47e7d9ae40>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. I0420 21:02:40.364881 134480345812096 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a480d3acbc0>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a480d3acbc0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a47e7d9aab0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a47e7d9aab0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): 
<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a47e7d14ef0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a47e7d14ef0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a47e7d9ae40>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a47e7d9ae40>}). I0420 21:02:40.365455 134480345812096 async_checkpointer.py:192] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x7a47e999f4c0> timeout: 1200 secs and primary_host=0 for async checkpoint writes I0420 21:02:41.226655 134480345812096 checkpoint_manager.py:564] Created directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints I0420 21:02:43.584155 134480345812096 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints I0420 21:02:43.584415 134480345812096 checkpoint_manager.py:929] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, 
todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7a47e7d15c10> I0420 21:02:45.628972 134480345812096 metrics_logger.py:64] WandbBackend skipped: 'wandb' library not installed. I0420 21:02:45.629308 134480345812096 peft_trainer.py:590] Training with mesh: Mesh('diloco': 1, 'data': 1, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto)) I0420 21:02:46.105168 134480345812096 peft_trainer.py:600] Compiled train_step cache size: 0 [DECOUPLED NO-OP] gcs_storage: using stubs. [DECOUPLED NO-OP] mldiagnostics: using stub. [DECOUPLED NO-OP] mldiagnostics: using stub. [DECOUPLED NO-OP] mldiagnostics: using stub. [DECOUPLED NO-OP] workload_monitor: using stub. [DECOUPLED NO-OP] vertex_tensorboard: using stub. 
Training: 0%| | 0/5 [00:00<?, ?step/s]I0420 21:02:46.108309 134480345812096 metric_logger.py:301] number parameters: 0.000 billion Per train step: Total TFLOPs: 0.00 split as 54.29% learnable weight flops and 45.71% attention flops 2026-04-20 21:02:49.303944: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. 2026-04-20 21:02:49.342173: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2026-04-20 21:02:50.356441: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. 2026-04-20 21:02:52.748632: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303) I0420 21:03:01.602701 134480345812096 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning. I0420 21:03:01.602904 134480345812096 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 1 I0420 21:03:01.602967 134480345812096 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1. 
I0420 21:03:01.690607 134480345812096 signaling_client.py:373] Using ThreadSafeKeyValueSignalingClient I0420 21:03:01.713673 134480345812096 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array. I0420 21:03:01.713737 134480345812096 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 21:03:01.806691 134370274838080 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1 I0420 21:03:02.526479 134370264352320 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/model_params I0420 21:03:02.548727 134370264352320 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/optimizer_state I0420 21:03:02.594137 134480345812096 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.881805s I0420 21:03:02.594619 134480345812096 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array. I0420 21:03:02.594668 134480345812096 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 21:03:02.625112 134480345812096 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. 
Time taken: 0.030870s I0420 21:03:02.625522 134480345812096 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 219.724 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.9335026741027832 s) (per-host) I0420 21:03:02.625641 134480345812096 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.933636s (batch_requests_ready=0.002842s, total_serialization_initiated=0.930378s, others=0.000416s) I0420 21:03:02.625976 134480345812096 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 667.108 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.9216325283050537 s) (per-host) I0420 21:03:02.626064 134480345812096 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.921715s (batch_requests_ready=0.002276s, total_serialization_initiated=0.919085s, others=0.000355s) I0420 21:03:02.626131 134480345812096 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.934956s (all_items=0.000020s, per_item={'model_params': '0.00001526', 'optimizer_state': '0.00000453'}, temp_paths=0.934936) I0420 21:03:02.626803 134480345812096 event_tracking.py:125] [process=0] [async] Finished blocking save in 1.02 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1. I0420 21:03:02.627028 134370165786176 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. 
Deadline for this save operation is 2026-04-20 21:23:02.626991 I0420 21:03:02.627285 134480345812096 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize I0420 21:03:02.627618 134480345812096 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776718981.6026824, 'wait_for_prev_duration_secs': 9.775161743164062e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776718981.6029274, 'checkpointer_blocking_duration_secs': 1.0242114067077637, 'get_old_steps_start_time': 1776718982.6271546, 'get_old_steps_duration_secs': 9.441375732421875e-05, 'checkpoint_manager_blocking_start_time': 1776718981.6026115, 'checkpoint_manager_blocking_duration_secs': 1.0249860286712646} I0420 21:03:02.627760 134480345812096 profiler.py:85] Starting JAX profiler at step 1. I0420 21:03:02.778738 134370228700736 checkpoint.py:188] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718982438514013, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1776718982438514013, "commit_timestamp_nsecs": null, "custom_metadata": {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/_CHECKPOINT_METADATA I0420 21:03:02.780307 134370285323840 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save. 
I0420 21:03:02.960199 134480345812096 peft_trainer.py:485] Train step 1 training loss: 6.004404 - training perplexity: 405.209259 Training: 0%| | 0/5 [00:16<?, ?step/s, _train_loss=6, _train_perplexity=405, _train_steps_per_sec=0.065] Training: 20%|██ | 1/5 [00:16<01:07, 16.86s/step, _train_loss=6, _train_perplexity=405, _train_steps_per_sec=0.065]I0420 21:03:02.961225 134480345812096 max_utils.py:750] Memstats: After params initialized: I0420 21:03:02.961308 134480345812096 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_0(process=0,(0,0,0,0)) I0420 21:03:02.961357 134480345812096 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_1(process=0,(1,0,0,0)) I0420 21:03:02.961400 134480345812096 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_2(process=0,(0,1,0,0)) I0420 21:03:02.961438 134480345812096 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_3(process=0,(1,1,0,0)) I0420 21:03:02.961475 134480345812096 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_4(process=0,(0,2,0,0)) I0420 21:03:02.961511 134480345812096 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_5(process=0,(1,2,0,0)) I0420 21:03:02.961546 134480345812096 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_6(process=0,(0,3,0,0)) I0420 21:03:02.961580 134480345812096 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_7(process=0,(1,3,0,0)) I0420 21:03:03.093493 134480345812096 metric_logger.py:196] completed step: 1, seconds: 16.853, TFLOP/s/device: 0.000, Tokens/s/device: 60.762, total_weights: 6826, loss: 6.004, lm_loss: 0.000, perplexity: 0.000 I0420 21:03:03.106771 134480345812096 peft_trainer.py:485] Train step 2 training loss: 6.137612 - training perplexity: 462.946655 Training: 20%|██ | 1/5 [00:17<01:07, 16.86s/step, _train_loss=6.07, _train_perplexity=433, _train_steps_per_sec=0.4] Training: 40%|████ | 2/5 [00:17<00:21, 7.03s/step, _train_loss=6.07, _train_perplexity=433, _train_steps_per_sec=0.4]I0420 
21:03:03.108383 134480345812096 metric_logger.py:196] completed step: 2, seconds: 0.146, TFLOP/s/device: 0.002, Tokens/s/device: 7007.486, total_weights: 4636, loss: 6.138, lm_loss: 0.000, perplexity: 0.000 I0420 21:03:03.130511 134480345812096 peft_trainer.py:485] Train step 3 training loss: 5.638189 - training perplexity: 280.953430 Training: 40%|████ | 2/5 [00:17<00:21, 7.03s/step, _train_loss=5.93, _train_perplexity=375, _train_steps_per_sec=2.54]I0420 21:03:03.131894 134480345812096 metric_logger.py:196] completed step: 3, seconds: 0.024, TFLOP/s/device: 0.009, Tokens/s/device: 43499.150, total_weights: 5886, loss: 5.638, lm_loss: 0.000, perplexity: 0.000 I0420 21:03:03.144440 134480345812096 peft_trainer.py:485] Train step 4 training loss: 5.768358 - training perplexity: 320.011780 Training: 60%|██████ | 3/5 [00:17<00:14, 7.03s/step, _train_loss=5.89, _train_perplexity=360, _train_steps_per_sec=12.4]I0420 21:03:03.146080 134480345812096 metric_logger.py:196] completed step: 4, seconds: 0.014, TFLOP/s/device: 0.016, Tokens/s/device: 73678.081, total_weights: 4990, loss: 5.768, lm_loss: 0.000, perplexity: 0.000 I0420 21:03:03.146410 134480345812096 profiler.py:113] Stopping JAX profiler at step 5. 
I0420 21:03:03.997662 134370218214976 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/model_params/array_metadatas/process_0 I0420 21:03:04.020177 134370197243456 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_0 I0420 21:03:05.558131 134370176271936 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.354125s (commit=1.908642s, array_metadata_write=0.445484s) I0420 21:03:05.585724 134370186757696 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.377752s (commit=1.922729s, array_metadata_write=0.455022s) I0420 21:03:05.586842 134370165786176 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 52.663 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 3.8947975635528564 s) (per-host) I0420 21:03:05.587145 134370165786176 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 158.346 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.8828063011169434 s) (per-host) I0420 21:03:05.587220 134370165786176 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 2.959889s. 
I0420 21:03:05.772107 134370165786176 checkpoint.py:228] Read Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718982438514013, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} from gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/_CHECKPOINT_METADATA I0420 21:03:05.961358 134370165786176 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0. I0420 21:03:06.197771 134370228700736 checkpoint.py:247] Updated Metadata={'item_handlers': {'model_params': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler', 'optimizer_state': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718982438514013, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/_CHECKPOINT_METADATA I0420 21:03:06.403767 134370165786176 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.584596s. 
use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/model_params I0420 21:03:06.404675 134370165786176 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/model_params I0420 21:03:06.847319 134370165786176 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0. I0420 21:03:07.235740 134370165786176 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.556724s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/optimizer_state I0420 21:03:07.236656 134370165786176 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1/optimizer_state I0420 21:03:07.519971 134370165786176 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1 I0420 21:03:08.038182 134480345812096 utils.py:86] Train loop finished in: 21.9291 seconds I0420 21:03:08.039167 134480345812096 peft_trainer.py:485] Train step 5 training loss: 5.830140 - training perplexity: 340.406403 Training: 80%|████████ | 4/5 [00:21<00:07, 7.03s/step, _train_loss=5.88, _train_perplexity=356, _train_steps_per_sec=24.2] Training: 100%|██████████| 5/5 [00:21<00:00, 3.17s/step, _train_loss=5.88, 
_train_perplexity=356, _train_steps_per_sec=24.2]I0420 21:03:08.040826 134480345812096 metric_logger.py:196] completed step: 5, seconds: 4.895, TFLOP/s/device: 0.000, Tokens/s/device: 209.213, total_weights: 4264, loss: 5.830, lm_loss: 0.000, perplexity: 0.000 I0420 21:03:08.045383 134480345812096 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. I0420 21:03:08.233853 134370165786176 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1`. I0420 21:03:08.234689 134370165786176 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.63 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/1 I0420 21:03:08.234799 134370165786176 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.607466s. I0420 21:03:08.235038 134370285323840 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save. I0420 21:03:08.235170 134370285323840 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save. I0420 21:03:08.235241 134370285323840 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts... I0420 21:03:08.235305 134370285323840 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts. 
I0420 21:03:08.235430 134480345812096 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1. I0420 21:03:08.235682 134480345812096 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 5 I0420 21:03:08.235764 134480345812096 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5. I0420 21:03:08.345758 134480345812096 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array. I0420 21:03:08.345885 134480345812096 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 21:03:08.360299 134480345812096 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.015994s I0420 21:03:08.360628 134480345812096 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array. I0420 21:03:08.360667 134480345812096 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 21:03:08.391240 134480345812096 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. 
Time taken: 0.030876s I0420 21:03:08.391582 134480345812096 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 2.938 MiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.06818485260009766 s) (per-host) I0420 21:03:08.391701 134480345812096 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.068327s (batch_requests_ready=0.002422s, total_serialization_initiated=0.065556s, others=0.000350s) I0420 21:03:08.392150 134480345812096 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 10.543 MiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.05695176124572754 s) (per-host) I0420 21:03:08.392230 134480345812096 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.057042s (batch_requests_ready=0.002362s, total_serialization_initiated=0.054209s, others=0.000472s) I0420 21:03:08.392297 134480345812096 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.069499s (all_items=0.000026s, per_item={'model_params': '0.00002289', 'optimizer_state': '0.00000310'}, temp_paths=0.069473) I0420 21:03:08.392982 134480345812096 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.16 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5. I0420 21:03:08.393232 134370134328896 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. 
Deadline for this save operation is 2026-04-20 21:23:08.393192 I0420 21:03:08.393514 134480345812096 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize I0420 21:03:08.393824 134370165786176 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save. I0420 21:03:08.393956 134480345812096 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776718988.0453446, 'wait_for_prev_duration_secs': 0.1901252269744873, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776718988.2357073, 'checkpointer_blocking_duration_secs': 0.15762639045715332, 'get_old_steps_start_time': 1776718988.3933575, 'get_old_steps_duration_secs': 0.0001163482666015625, 'checkpoint_manager_blocking_start_time': 1776718988.0451381, 'checkpoint_manager_blocking_duration_secs': 0.34877967834472656} I0420 21:03:08.394136 134480345812096 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. 
I0420 21:03:08.420858 134370285323840 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5 I0420 21:03:09.102317 134370186757696 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5/optimizer_state I0420 21:03:09.112533 134370186757696 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5/model_params I0420 21:03:10.374741 134370176271936 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_0 I0420 21:03:10.402567 134370218214976 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5/model_params/array_metadatas/process_0 I0420 21:03:11.861532 134370155300416 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. 
Time taken: 2.262427s (commit=1.874962s, array_metadata_write=0.387465s) I0420 21:03:11.862675 134370134328896 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 57.954 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 3.5392441749572754 s) (per-host) I0420 21:03:11.936437 134370144814656 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.342244s (commit=1.918696s, array_metadata_write=0.423548s) I0420 21:03:11.937525 134370134328896 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 170.677 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.6022801399230957 s) (per-host) I0420 21:03:11.937794 134370134328896 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.544236s. I0420 21:03:12.308453 134370134328896 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0. I0420 21:03:12.749393 134370134328896 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.574100s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5/model_params I0420 21:03:12.750254 134370134328896 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5/model_params I0420 21:03:13.170418 134370134328896 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0. 
I0420 21:03:13.623396 134370134328896 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.601852s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5/optimizer_state I0420 21:03:13.624299 134370134328896 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5/optimizer_state I0420 21:03:13.885002 134370134328896 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5 I0420 21:03:14.571004 134370134328896 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5`. I0420 21:03:14.571770 134370134328896 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.34 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_02_sft_nnx_ckpt/checkpoints/5 I0420 21:03:14.571840 134370134328896 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 6.178284s. I0420 21:03:14.572032 134370165786176 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save. I0420 21:03:14.572170 134370165786176 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save. 
I0420 21:03:14.572221 134370165786176 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts... I0420 21:03:14.572261 134370165786176 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts. I0420 21:03:14.572441 134480345812096 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5. I0420 21:03:14.572755 134480345812096 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=134480345812096 count=1 at 0x7a47e99c9d00>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7a47e7d17ef0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7a47e7d15bb0>, _write_futures=[]) I0420 21:03:14.573577 134480345812096 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=134480345812096 count=1 at 0x7a47e99c9d00>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7a47e7d17ef0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7a47e7d15bb0>, _write_futures=[]) I0420 21:03:14.573606 134480345812096 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=134480345812096 count=1 at 0x7a47e99c9d00>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7a47e7d17ef0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7a47e7d15bb0>, _write_futures=[]) Training: 100%|██████████| 5/5 [00:29<00:00, 5.92s/step, _train_loss=5.88, _train_perplexity=356, _train_steps_per_sec=24.2] [DECOUPLED NO-OP] gcs_storage: using stubs. [DECOUPLED NO-OP] mldiagnostics: using stub. 
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.
~/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '