MaxView

‹ —Case: 01_sft_smoke07_distill_smoke ›

Metrics: Linen vs NNX  ·  feat/nnx-post-train-fixes

MetricLinen  e27fc1e97NNX  e27fc1e97Diff (NNX − Linen)
Parameters0.000 billion0.000 billion
Final loss5.84505.8300-0.015
TFLOP/s0.0000.0000
Tok/s213.0194.3-18.705
Avg s/step4.7214.861+0.14
Memory %0.030.030
JAX0.8.30.8.3

Diff = NNX value − Linen value. Green = NNX improved. Red = NNX regressed.

Linen  ·  e27fc1e97  ·  feat_nnx_post_train_fixes_20260420_205452  ·  full log
2026-04-20 20:54:59.505566: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0420 20:54:59.970658 131277611596928 max_utils.py:238] Skipping jax distributed system due to skip_jax_distributed_system=True flag.
I0420 20:55:56.455088 131277611596928 max_utils.py:800] System Information: Jax Version: 0.8.3
I0420 20:55:56.455201 131277611596928 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0420 20:55:56.455240 131277611596928 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0420 20:55:56.458084 131277611596928 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0420 20:55:56.650534 131277611596928 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0420 20:55:57.667990 131277611596928 max_utils.py:194] tensorboardX not available; using no-op SummaryWriter.
I0420 20:55:57.690428 131277611596928 config.py:112] TensorFlow version 2.20.0 available.
I0420 20:55:57.690818 131277611596928 config.py:125] JAX version 0.8.3 available.
E0420 20:56:04.092798 131277611596928 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0420 20:56:04.436242 131277611596928 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0420 20:56:04.436644 131277611596928 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7764d19f9850>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0420 20:56:04.436690 131277611596928 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0420 20:56:04.436722 131277611596928 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7764d19f9850>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0420 20:56:04.436776 131277611596928 checkpoint_manager.py:708] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x775e3adebc80>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x775e33d24e60>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x775e33d27ce0>}, handler_registry=None
I0420 20:56:04.437022 131277611596928 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x775e3adebc80>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0420 20:56:04.437080 131277611596928 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x775e33d24e60>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0420 20:56:04.437103 131277611596928 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x775e33d27ce0>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0420 20:56:04.437121 131277611596928 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x775e33d24500>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0420 20:56:04.437143 131277611596928 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x775e3adebc80>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x775e3adebc80>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x775e33d24e60>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x775e33d24e60>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x775e33d27ce0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x775e33d27ce0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x775e33d24500>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x775e33d24500>}).
I0420 20:56:04.437305 131277611596928 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.36
I0420 20:56:04.437346 131277611596928 async_checkpointer.py:192] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x775e3597d440> timeout: 1200 secs and primary_host=0 for async checkpoint writes
I0420 20:56:06.869726 131277611596928 checkpoint_manager.py:564] Created directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints
I0420 20:56:09.237763 131277611596928 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints
I0420 20:56:09.238008 131277611596928 checkpoint_manager.py:929] [process=0][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x775e33d271a0>
I0420 20:56:11.272553 131277611596928 metrics_logger.py:64] WandbBackend skipped: 'wandb' library not installed.
I0420 20:56:11.272873 131277611596928 peft_trainer.py:590] Training with mesh: Mesh('diloco': 1, 'data': 1, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0420 20:56:11.680313 131277611596928 peft_trainer.py:600] Compiled train_step cache size: 0
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.

Training:   0%|          | 0/5 [00:00<?, ?step/s]I0420 20:56:11.682250 131277611596928 metric_logger.py:301] number parameters: 0.000 billion
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
2026-04-20 20:56:14.858519: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-20 20:56:14.894343: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-20 20:56:15.923371: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-20 20:56:18.356196: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0420 20:56:29.026290 131277611596928 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0420 20:56:29.026478 131277611596928 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 1
I0420 20:56:29.026543 131277611596928 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1.
I0420 20:56:29.099605 131277611596928 signaling_client.py:373] Using ThreadSafeKeyValueSignalingClient
I0420 20:56:29.117304 131277611596928 jax_array_handlers.py:360] Scheduling D2H of 22 prioritized jax.Array.
I0420 20:56:29.117364 131277611596928 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 20:56:29.202693 131168359941696 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1
I0420 20:56:29.934624 131168349455936 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/model_params
I0420 20:56:29.943332 131277611596928 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.826671s
I0420 20:56:29.943779 131277611596928 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0420 20:56:29.943829 131277611596928 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 20:56:29.974426 131277611596928 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.030997s
I0420 20:56:29.974759 131277611596928 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 234.480 KiB/s (total gbytes: 205.0 KiB) (time elapsed: 0.8741600513458252 s) (per-host)
I0420 20:56:29.974868 131277611596928 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.874282s (batch_requests_ready=0.001996s, total_serialization_initiated=0.871953s, others=0.000334s)
I0420 20:56:29.975244 131277611596928 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 708.434 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.8678691387176514 s) (per-host)
I0420 20:56:29.975318 131277611596928 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.867955s (batch_requests_ready=0.002631s, total_serialization_initiated=0.864923s, others=0.000401s)
I0420 20:56:29.975378 131277611596928 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.875271s (all_items=0.000018s, per_item={'model_params': '0.00001454', 'optimizer_state': '0.00000358'}, temp_paths=0.875253)
I0420 20:56:29.975962 131277611596928 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.95 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1.
I0420 20:56:29.976199 131168255084096 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-20 21:16:29.976165
I0420 20:56:29.976424 131277611596928 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0420 20:56:29.976741 131277611596928 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776718589.026272, 'wait_for_prev_duration_secs': 8.7738037109375e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776718589.0265017, 'checkpointer_blocking_duration_secs': 0.9497811794281006, 'get_old_steps_start_time': 1776718589.9762976, 'get_old_steps_duration_secs': 8.916854858398438e-05, 'checkpoint_manager_blocking_start_time': 1776718589.026192, 'checkpoint_manager_blocking_duration_secs': 0.9505290985107422}
I0420 20:56:29.976874 131277611596928 profiler.py:85] Starting JAX profiler at step 1.
I0420 20:56:30.127057 131168317998656 checkpoint.py:188] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718589833610025, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1776718589833610025, "commit_timestamp_nsecs": null, "custom_metadata": {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/_CHECKPOINT_METADATA
I0420 20:56:30.128260 131168370427456 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0420 20:56:30.128818 131168349455936 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/optimizer_state
I0420 20:56:30.313750 131277611596928 peft_trainer.py:485] Train step 1 training loss: 6.124308  - training perplexity: 456.828308

Training:   0%|          | 0/5 [00:18<?, ?step/s, _train_loss=6.12, _train_perplexity=457, _train_steps_per_sec=0.058]
Training:  20%|██        | 1/5 [00:18<01:14, 18.63s/step, _train_loss=6.12, _train_perplexity=457, _train_steps_per_sec=0.058]I0420 20:56:30.314570 131277611596928 max_utils.py:750] 
Memstats: After params initialized:
I0420 20:56:30.314659 131277611596928 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_0(process=0,(0,0,0,0))
I0420 20:56:30.314708 131277611596928 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_1(process=0,(1,0,0,0))
I0420 20:56:30.314751 131277611596928 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_2(process=0,(0,1,0,0))
I0420 20:56:30.314790 131277611596928 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_3(process=0,(1,1,0,0))
I0420 20:56:30.314826 131277611596928 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_4(process=0,(0,2,0,0))
I0420 20:56:30.314862 131277611596928 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_5(process=0,(1,2,0,0))
I0420 20:56:30.314896 131277611596928 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_6(process=0,(0,3,0,0))
I0420 20:56:30.314930 131277611596928 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_7(process=0,(1,3,0,0))
I0420 20:56:30.441781 131277611596928 metric_logger.py:196] completed step: 1, seconds: 18.632, TFLOP/s/device: 0.000, Tokens/s/device: 54.959, total_weights: 6826, loss: 6.124, lm_loss: 0.000, perplexity: 0.000
I0420 20:56:30.453015 131277611596928 peft_trainer.py:485] Train step 2 training loss: 6.129846  - training perplexity: 459.365448

Training:  20%|██        | 1/5 [00:18<01:14, 18.63s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.418]
Training:  40%|████      | 2/5 [00:18<00:23,  7.75s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.418]I0420 20:56:30.454539 131277611596928 metric_logger.py:196] completed step: 2, seconds: 0.139, TFLOP/s/device: 0.002, Tokens/s/device: 7367.427, total_weights: 4636, loss: 6.130, lm_loss: 0.000, perplexity: 0.000
I0420 20:56:30.469069 131277611596928 peft_trainer.py:485] Train step 3 training loss: 5.572245  - training perplexity: 263.023956

Training:  40%|████      | 2/5 [00:18<00:23,  7.75s/step, _train_loss=5.94, _train_perplexity=381, _train_steps_per_sec=2.62] I0420 20:56:30.470377 131277611596928 metric_logger.py:196] completed step: 3, seconds: 0.016, TFLOP/s/device: 0.014, Tokens/s/device: 64786.057, total_weights: 5886, loss: 5.572, lm_loss: 0.000, perplexity: 0.000
I0420 20:56:30.480331 131277611596928 peft_trainer.py:485] Train step 4 training loss: 5.824473  - training perplexity: 338.482697

Training:  60%|██████    | 3/5 [00:18<00:15,  7.75s/step, _train_loss=5.91, _train_perplexity=370, _train_steps_per_sec=17.5]I0420 20:56:30.481542 131277611596928 metric_logger.py:196] completed step: 4, seconds: 0.011, TFLOP/s/device: 0.020, Tokens/s/device: 91538.582, total_weights: 4990, loss: 5.824, lm_loss: 0.000, perplexity: 0.000
I0420 20:56:30.481835 131277611596928 profiler.py:113] Stopping JAX profiler at step 5.
I0420 20:56:30.643142 1292846 google_auth_provider.cc:149] Using credentials at ~/.config/gcloud/application_default_credentials.json
I0420 20:56:30.643201 1292846 google_auth_provider.cc:156] Using OAuth2 AuthProvider
I0420 20:56:31.414179 131168307512896 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/model_params/array_metadatas/process_0
I0420 20:56:31.483818 131168286541376 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/optimizer_state/array_metadatas/process_0
I0420 20:56:33.027559 131168276055616 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.382223s (commit=1.947978s, array_metadata_write=0.434245s)
I0420 20:56:33.028425 131168255084096 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 52.185 KiB/s (total gbytes: 205.0 KiB) (time elapsed: 3.9278156757354736 s) (per-host)
I0420 20:56:33.133764 131168265569856 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.487766s (commit=2.073956s, array_metadata_write=0.413810s)
I0420 20:56:33.134762 131168255084096 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 152.663 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 4.027355194091797 s) (per-host)
I0420 20:56:33.134983 131168255084096 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.158514s.
I0420 20:56:33.331231 131168255084096 checkpoint.py:228] Read Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718589833610025, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} from gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/_CHECKPOINT_METADATA
I0420 20:56:33.518449 131168255084096 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 20:56:33.719391 131168317998656 checkpoint.py:247] Updated Metadata={'item_handlers': {'model_params': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler', 'optimizer_state': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718589833610025, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/_CHECKPOINT_METADATA
I0420 20:56:33.951169 131168255084096 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.575591s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/model_params
I0420 20:56:33.951873 131168255084096 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/model_params
I0420 20:56:34.373094 131168255084096 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 20:56:34.779686 131168255084096 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.560371s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/optimizer_state
I0420 20:56:34.780461 131168255084096 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/optimizer_state
I0420 20:56:35.039479 131168255084096 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1
I0420 20:56:35.286706 131277611596928 utils.py:86] Train loop finished in: 23.6037 seconds
I0420 20:56:35.287311 131277611596928 peft_trainer.py:485] Train step 5 training loss: 5.844853  - training perplexity: 345.451904

Training:  80%|████████  | 4/5 [00:23<00:07,  7.75s/step, _train_loss=5.9, _train_perplexity=365, _train_steps_per_sec=31.7] 
Training: 100%|██████████| 5/5 [00:23<00:00,  3.36s/step, _train_loss=5.9, _train_perplexity=365, _train_steps_per_sec=31.7]I0420 20:56:35.288574 131277611596928 metric_logger.py:196] completed step: 5, seconds: 4.807, TFLOP/s/device: 0.000, Tokens/s/device: 213.034, total_weights: 4264, loss: 5.845, lm_loss: 0.000, perplexity: 0.000
I0420 20:56:35.290589 131277611596928 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0420 20:56:35.941421 131168255084096 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1`.
I0420 20:56:35.942108 131168255084096 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.92 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1
I0420 20:56:35.942180 131168255084096 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.965714s.
I0420 20:56:35.942322 131168370427456 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0420 20:56:35.942423 131168370427456 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0420 20:56:35.942477 131168370427456 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0420 20:56:35.942522 131168370427456 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0420 20:56:35.942666 131277611596928 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
I0420 20:56:35.942949 131277611596928 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 5
I0420 20:56:35.943023 131277611596928 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5.
I0420 20:56:36.044782 131277611596928 jax_array_handlers.py:360] Scheduling D2H of 22 prioritized jax.Array.
I0420 20:56:36.044890 131277611596928 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 20:56:36.056690 131277611596928 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012559s
I0420 20:56:36.056979 131277611596928 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0420 20:56:36.057019 131277611596928 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 20:56:36.087510 131277611596928 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.030762s
I0420 20:56:36.087819 131277611596928 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 3.419 MiB/s (total gbytes: 205.0 KiB) (time elapsed: 0.05854606628417969 s) (per-host)
I0420 20:56:36.087926 131277611596928 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.058665s (batch_requests_ready=0.001498s, total_serialization_initiated=0.056855s, others=0.000311s)
I0420 20:56:36.088275 131277611596928 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 11.232 MiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.05345630645751953 s) (per-host)
I0420 20:56:36.088350 131277611596928 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.053541s (batch_requests_ready=0.002615s, total_serialization_initiated=0.050549s, others=0.000378s)
I0420 20:56:36.088412 131277611596928 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.059622s (all_items=0.000012s, per_item={'model_params': '0.00000978', 'optimizer_state': '0.00000215'}, temp_paths=0.059610)
I0420 20:56:36.089056 131277611596928 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.15 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5.
I0420 20:56:36.089279 131167068096064 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-20 21:16:36.089242
I0420 20:56:36.089503 131277611596928 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0420 20:56:36.089784 131168370427456 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0420 20:56:36.089890 131277611596928 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776718595.290569, 'wait_for_prev_duration_secs': 0.6521902084350586, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776718595.942974, 'checkpointer_blocking_duration_secs': 0.14638805389404297, 'get_old_steps_start_time': 1776718596.0893788, 'get_old_steps_duration_secs': 8.797645568847656e-05, 'checkpoint_manager_blocking_start_time': 1776718595.2905304, 'checkpoint_manager_blocking_duration_secs': 0.7993361949920654}
I0420 20:56:36.090055 131277611596928 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0420 20:56:36.125337 131168255084096 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5
I0420 20:56:36.796242 131168265569856 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/optimizer_state
I0420 20:56:36.796858 131168265569856 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/model_params
I0420 20:56:38.021816 131167147787840 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/optimizer_state/array_metadatas/process_0
I0420 20:56:38.061219 131168307512896 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/model_params/array_metadatas/process_0
I0420 20:56:39.500903 131167110039104 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.229860s (commit=1.817855s, array_metadata_write=0.412005s)
I0420 20:56:39.501832 131167068096064 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 59.027 KiB/s (total gbytes: 205.0 KiB) (time elapsed: 3.472522258758545 s) (per-host)
I0420 20:56:39.540758 131167099553344 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.269204s (commit=1.858348s, array_metadata_write=0.410857s)
I0420 20:56:39.793588 131167068096064 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 163.573 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.7587482929229736 s) (per-host)
I0420 20:56:39.793927 131167068096064 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.704381s.
I0420 20:56:40.174992 131167068096064 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 20:56:40.571792 131167068096064 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.553383s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/model_params
I0420 20:56:40.572502 131167068096064 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/model_params
I0420 20:56:40.992185 131167068096064 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 20:56:41.393006 131167068096064 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.555303s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/optimizer_state
I0420 20:56:41.393746 131167068096064 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/optimizer_state
I0420 20:56:41.653888 131167068096064 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5
I0420 20:56:42.362612 131167068096064 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5`.
I0420 20:56:42.363260 131167068096064 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.42 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5
I0420 20:56:42.363331 131167068096064 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 6.273787s.
I0420 20:56:42.363502 131168370427456 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0420 20:56:42.363606 131168370427456 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0420 20:56:42.363660 131168370427456 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0420 20:56:42.363701 131168370427456 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0420 20:56:42.364266 131277611596928 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0420 20:56:42.364467 131277611596928 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=131277611596928 count=1 at 0x775e36185d40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x775e35989a60>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x775e3598ad80>, _write_futures=[])
I0420 20:56:42.364871 131277611596928 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=131277611596928 count=1 at 0x775e36185d40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x775e35989a60>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x775e3598ad80>, _write_futures=[])
I0420 20:56:42.364899 131277611596928 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=131277611596928 count=1 at 0x775e36185d40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x775e35989a60>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x775e3598ad80>, _write_futures=[])

Training: 100%|██████████| 5/5 [00:31<00:00,  6.34s/step, _train_loss=5.9, _train_perplexity=365, _train_steps_per_sec=31.7]
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.
~/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
NNX  ·  e27fc1e97  ·  feat_nnx_post_train_fixes_20260420_205452  ·  full log
2026-04-20 20:59:41.801818: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0420 20:59:42.342416 128452369886336 max_utils.py:238] Skipping jax distributed system due to skip_jax_distributed_system=True flag.
I0420 21:00:28.414965 128452369886336 max_utils.py:800] System Information: Jax Version: 0.8.3
I0420 21:00:28.415099 128452369886336 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0420 21:00:28.415136 128452369886336 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0420 21:00:28.418357 128452369886336 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0420 21:00:28.502035 128452369886336 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0420 21:00:28.586132 128452369886336 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0420 21:00:29.584421 128452369886336 max_utils.py:194] tensorboardX not available; using no-op SummaryWriter.
I0420 21:00:29.608268 128452369886336 config.py:112] TensorFlow version 2.20.0 available.
I0420 21:00:29.608665 128452369886336 config.py:125] JAX version 0.8.3 available.
E0420 21:00:32.865133 128452369886336 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0420 21:00:33.227858 128452369886336 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0420 21:00:33.228285 128452369886336 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x74d3041fd460>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0420 21:00:33.228327 128452369886336 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0420 21:00:33.228359 128452369886336 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x74d3041fd460>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0420 21:00:33.228405 128452369886336 checkpoint_manager.py:708] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x74cc6990faa0>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x74cc67d70b60>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x74cc67d0d4f0>}, handler_registry=None
I0420 21:00:33.228694 128452369886336 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x74cc6990faa0>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0420 21:00:33.228739 128452369886336 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x74cc67d70b60>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0420 21:00:33.228761 128452369886336 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x74cc67d0d4f0>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0420 21:00:33.228780 128452369886336 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x74cc67d714f0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0420 21:00:33.228802 128452369886336 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x74cc6990faa0>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x74cc6990faa0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x74cc67d70b60>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x74cc67d70b60>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x74cc67d0d4f0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x74cc67d0d4f0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x74cc67d714f0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x74cc67d714f0>}).
I0420 21:00:33.228971 128452369886336 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.36
I0420 21:00:33.229011 128452369886336 async_checkpointer.py:192] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x74cc6999b2e0> timeout: 1200 secs and primary_host=0 for async checkpoint writes
I0420 21:00:35.253094 128452369886336 checkpoint_manager.py:564] Created directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints
I0420 21:00:37.617816 128452369886336 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints
I0420 21:00:37.618103 128452369886336 checkpoint_manager.py:929] [process=0][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x74cc67d0e3c0>
I0420 21:00:39.645554 128452369886336 metrics_logger.py:64] WandbBackend skipped: 'wandb' library not installed.
I0420 21:00:39.645836 128452369886336 peft_trainer.py:590] Training with mesh: Mesh('diloco': 1, 'data': 1, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0420 21:00:40.086117 128452369886336 peft_trainer.py:600] Compiled train_step cache size: 0
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.

Training:   0%|          | 0/5 [00:00<?, ?step/s]I0420 21:00:40.089613 128452369886336 metric_logger.py:301] number parameters: 0.000 billion
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
2026-04-20 21:00:43.281176: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-20 21:00:43.319487: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-20 21:00:44.343156: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-20 21:00:46.777212: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0420 21:00:57.572515 128452369886336 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0420 21:00:57.572713 128452369886336 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 1
I0420 21:00:57.572782 128452369886336 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1.
I0420 21:00:57.668204 128452369886336 signaling_client.py:373] Using ThreadSafeKeyValueSignalingClient
I0420 21:00:57.691869 128452369886336 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array.
I0420 21:00:57.691930 128452369886336 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 21:00:57.763056 128342357444160 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1
I0420 21:00:58.464372 128342346958400 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/optimizer_state
I0420 21:00:58.465951 128342346958400 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/model_params
I0420 21:00:58.542684 128452369886336 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.852064s
I0420 21:00:58.543185 128452369886336 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0420 21:00:58.543232 128452369886336 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 21:00:58.574319 128452369886336 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.031527s
I0420 21:00:58.574725 128452369886336 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 226.649 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.9049837589263916 s) (per-host)
I0420 21:00:58.574849 128452369886336 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.905123s (batch_requests_ready=0.003015s, total_serialization_initiated=0.901695s, others=0.000413s)
I0420 21:00:58.575226 128452369886336 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 688.550 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.8929321765899658 s) (per-host)
I0420 21:00:58.575304 128452369886336 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.893020s (batch_requests_ready=0.002395s, total_serialization_initiated=0.890224s, others=0.000401s)
I0420 21:00:58.575365 128452369886336 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.906585s (all_items=0.000022s, per_item={'model_params': '0.00001812', 'optimizer_state': '0.00000381'}, temp_paths=0.906563)
I0420 21:00:58.576059 128452369886336 event_tracking.py:125] [process=0] [async] Finished blocking save in 1.00 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1.
I0420 21:00:58.576236 128342248392256 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-20 21:20:58.576203
I0420 21:00:58.576477 128452369886336 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0420 21:00:58.576820 128452369886336 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776718857.572496, 'wait_for_prev_duration_secs': 9.369850158691406e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776718857.572741, 'checkpointer_blocking_duration_secs': 1.003591537475586, 'get_old_steps_start_time': 1776718858.576348, 'get_old_steps_duration_secs': 9.226799011230469e-05, 'checkpoint_manager_blocking_start_time': 1776718857.5724196, 'checkpoint_manager_blocking_duration_secs': 1.0043795108795166}
I0420 21:00:58.576955 128452369886336 profiler.py:85] Starting JAX profiler at step 1.
I0420 21:00:58.697643 128342311306816 checkpoint.py:188] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718858371740598, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1776718858371740598, "commit_timestamp_nsecs": null, "custom_metadata": {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/_CHECKPOINT_METADATA
I0420 21:00:58.699167 128342367929920 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0420 21:00:58.902091 128452369886336 peft_trainer.py:485] Train step 1 training loss: 6.004404  - training perplexity: 405.209259

Training:   0%|          | 0/5 [00:18<?, ?step/s, _train_loss=6, _train_perplexity=405, _train_steps_per_sec=0.057]
Training:  20%|██        | 1/5 [00:18<01:15, 18.82s/step, _train_loss=6, _train_perplexity=405, _train_steps_per_sec=0.057]I0420 21:00:58.902937 128452369886336 max_utils.py:750] 
Memstats: After params initialized:
I0420 21:00:58.903025 128452369886336 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_0(process=0,(0,0,0,0))
I0420 21:00:58.903096 128452369886336 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_1(process=0,(1,0,0,0))
I0420 21:00:58.903143 128452369886336 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_2(process=0,(0,1,0,0))
I0420 21:00:58.903182 128452369886336 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_3(process=0,(1,1,0,0))
I0420 21:00:58.903220 128452369886336 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_4(process=0,(0,2,0,0))
I0420 21:00:58.903260 128452369886336 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_5(process=0,(1,2,0,0))
I0420 21:00:58.903299 128452369886336 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_6(process=0,(0,3,0,0))
I0420 21:00:58.903340 128452369886336 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_7(process=0,(1,3,0,0))
I0420 21:00:59.001490 1301256 google_auth_provider.cc:149] Using credentials at ~/.config/gcloud/application_default_credentials.json
I0420 21:00:59.001563 1301256 google_auth_provider.cc:156] Using OAuth2 AuthProvider
I0420 21:00:59.049012 128452369886336 metric_logger.py:196] completed step: 1, seconds: 18.813, TFLOP/s/device: 0.000, Tokens/s/device: 54.430, total_weights: 6826, loss: 6.004, lm_loss: 0.000, perplexity: 0.000
I0420 21:00:59.089101 128452369886336 peft_trainer.py:485] Train step 2 training loss: 6.137612  - training perplexity: 462.946655

Training:  20%|██        | 1/5 [00:19<01:15, 18.82s/step, _train_loss=6.07, _train_perplexity=433, _train_steps_per_sec=0.404]
Training:  40%|████      | 2/5 [00:19<00:23,  7.86s/step, _train_loss=6.07, _train_perplexity=433, _train_steps_per_sec=0.404]I0420 21:00:59.090628 128452369886336 metric_logger.py:196] completed step: 2, seconds: 0.187, TFLOP/s/device: 0.001, Tokens/s/device: 5487.121, total_weights: 4636, loss: 6.138, lm_loss: 0.000, perplexity: 0.000
I0420 21:00:59.110405 128452369886336 peft_trainer.py:485] Train step 3 training loss: 5.638189  - training perplexity: 280.953430

Training:  40%|████      | 2/5 [00:19<00:23,  7.86s/step, _train_loss=5.93, _train_perplexity=375, _train_steps_per_sec=2.05] I0420 21:00:59.111706 128452369886336 metric_logger.py:196] completed step: 3, seconds: 0.021, TFLOP/s/device: 0.010, Tokens/s/device: 48466.309, total_weights: 5886, loss: 5.638, lm_loss: 0.000, perplexity: 0.000
I0420 21:00:59.122963 128452369886336 peft_trainer.py:485] Train step 4 training loss: 5.768358  - training perplexity: 320.011780

Training:  60%|██████    | 3/5 [00:19<00:15,  7.86s/step, _train_loss=5.89, _train_perplexity=360, _train_steps_per_sec=13.2]I0420 21:00:59.124320 128452369886336 metric_logger.py:196] completed step: 4, seconds: 0.013, TFLOP/s/device: 0.018, Tokens/s/device: 81542.232, total_weights: 4990, loss: 5.768, lm_loss: 0.000, perplexity: 0.000
I0420 21:00:59.124614 128452369886336 profiler.py:113] Stopping JAX profiler at step 5.
I0420 21:01:00.000919 128342279849536 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/optimizer_state/array_metadatas/process_0
I0420 21:01:00.007847 128342300821056 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/model_params/array_metadatas/process_0
I0420 21:01:01.407443 128342269363776 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.395984s (commit=1.942039s, array_metadata_write=0.453944s)
I0420 21:01:01.408600 128342248392256 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 54.861 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 3.7388148307800293 s) (per-host)
I0420 21:01:01.590229 128342258878016 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.576199s (commit=2.136577s, array_metadata_write=0.439621s)
I0420 21:01:01.591369 128342248392256 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 157.284 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.909031391143799 s) (per-host)
I0420 21:01:01.591641 128342248392256 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.015121s.
I0420 21:01:01.762398 128342248392256 checkpoint.py:228] Read Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718858371740598, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} from gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/_CHECKPOINT_METADATA
I0420 21:01:01.979930 128342248392256 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 21:01:02.150733 128342311306816 checkpoint.py:247] Updated Metadata={'item_handlers': {'model_params': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler', 'optimizer_state': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718858371740598, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/_CHECKPOINT_METADATA
I0420 21:01:02.378451 128342248392256 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.581549s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/model_params
I0420 21:01:02.379293 128342248392256 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/model_params
I0420 21:01:02.809881 128342248392256 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 21:01:03.201909 128342248392256 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.543164s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/optimizer_state
I0420 21:01:03.202932 128342248392256 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1/optimizer_state
I0420 21:01:03.486571 128342248392256 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1
I0420 21:01:04.233850 128342248392256 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1`.
I0420 21:01:04.234627 128342248392256 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.66 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/1
I0420 21:01:04.234741 128342248392256 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.658220s.
I0420 21:01:04.234980 128342367929920 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0420 21:01:04.235130 128342367929920 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0420 21:01:04.235217 128342367929920 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0420 21:01:04.235282 128342367929920 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0420 21:01:04.391675 128452369886336 utils.py:86] Train loop finished in: 24.3014 seconds
I0420 21:01:04.392585 128452369886336 peft_trainer.py:485] Train step 5 training loss: 5.830140  - training perplexity: 340.406403

Training:  80%|████████  | 4/5 [00:24<00:07,  7.86s/step, _train_loss=5.88, _train_perplexity=356, _train_steps_per_sec=26.4]
Training: 100%|██████████| 5/5 [00:24<00:00,  3.50s/step, _train_loss=5.88, _train_perplexity=356, _train_steps_per_sec=26.4]I0420 21:01:04.394168 128452369886336 metric_logger.py:196] completed step: 5, seconds: 5.269, TFLOP/s/device: 0.000, Tokens/s/device: 194.329, total_weights: 4264, loss: 5.830, lm_loss: 0.000, perplexity: 0.000
I0420 21:01:04.397883 128452369886336 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0420 21:01:04.398073 128452369886336 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 5
I0420 21:01:04.398141 128452369886336 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5.
I0420 21:01:04.511922 128452369886336 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array.
I0420 21:01:04.512056 128452369886336 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 21:01:04.526903 128452369886336 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.016342s
I0420 21:01:04.527261 128452369886336 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0420 21:01:04.527301 128452369886336 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 21:01:04.557864 128452369886336 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.030896s
I0420 21:01:04.558271 128452369886336 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 2.926 MiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.0684666633605957 s) (per-host)
I0420 21:01:04.558391 128452369886336 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.068608s (batch_requests_ready=0.002524s, total_serialization_initiated=0.065667s, others=0.000417s)
I0420 21:01:04.558710 128452369886336 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 10.480 MiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.05729341506958008 s) (per-host)
I0420 21:01:04.558784 128452369886336 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.057378s (batch_requests_ready=0.002486s, total_serialization_initiated=0.054547s, others=0.000345s)
I0420 21:01:04.558849 128452369886336 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.069582s (all_items=0.000017s, per_item={'model_params': '0.00001478', 'optimizer_state': '0.00000262'}, temp_paths=0.069565)
I0420 21:01:04.559563 128452369886336 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.16 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5.
I0420 21:01:04.559805 128342141437504 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-20 21:21:04.559767
I0420 21:01:04.560103 128452369886336 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0420 21:01:04.560446 128342248392256 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0420 21:01:04.560577 128452369886336 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776718864.397854, 'wait_for_prev_duration_secs': 5.9604644775390625e-05, 'time_between_consecutive_saves_sec': 0.16255569458007812, 'checkpointer_blocking_start_time': 1776718864.3980978, 'checkpointer_blocking_duration_secs': 0.16180753707885742, 'get_old_steps_start_time': 1776718864.5599275, 'get_old_steps_duration_secs': 0.00013566017150878906, 'checkpoint_manager_blocking_start_time': 1776718864.3978074, 'checkpoint_manager_blocking_duration_secs': 0.1627340316772461}
I0420 21:01:04.560741 128452369886336 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0420 21:01:04.581285 128342367929920 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5
I0420 21:01:05.272924 128342258878016 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/optimizer_state
I0420 21:01:05.275128 128342258878016 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/model_params
I0420 21:01:06.610148 128342279849536 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/model_params/array_metadatas/process_0
I0420 21:01:06.610483 128342237906496 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/optimizer_state/array_metadatas/process_0
I0420 21:01:08.058826 128342151923264 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.252741s (commit=1.844080s, array_metadata_write=0.408661s)
I0420 21:01:08.080490 128342227420736 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.269602s (commit=1.849001s, array_metadata_write=0.420601s)
I0420 21:01:08.081575 128342141437504 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 57.107 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 3.5917415618896484 s) (per-host)
I0420 21:01:08.401475 128342141437504 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 157.647 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.900029182434082 s) (per-host)
I0420 21:01:08.401835 128342141437504 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.841688s.
I0420 21:01:08.762228 128342141437504 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 21:01:09.156631 128342141437504 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.542909s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/model_params
I0420 21:01:09.157476 128342141437504 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/model_params
I0420 21:01:09.556336 128342141437504 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 21:01:09.991431 128342141437504 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.575513s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/optimizer_state
I0420 21:01:09.992280 128342141437504 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5/optimizer_state
I0420 21:01:10.246400 128342141437504 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5
I0420 21:01:10.897143 128342141437504 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5`.
I0420 21:01:10.897917 128342141437504 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.50 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_01_sft_smoke/checkpoints/5
I0420 21:01:10.897998 128342141437504 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 6.337852s.
I0420 21:01:10.898173 128342248392256 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0420 21:01:10.898281 128342248392256 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0420 21:01:10.898333 128342248392256 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0420 21:01:10.898371 128342248392256 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0420 21:01:10.898782 128452369886336 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0420 21:01:10.898989 128452369886336 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=128452369886336 count=1 at 0x74cc6996ec40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x74cc67d0f0e0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x74cc67d0fd40>, _write_futures=[])
I0420 21:01:10.899647 128452369886336 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=128452369886336 count=1 at 0x74cc6996ec40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x74cc67d0f0e0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x74cc67d0fd40>, _write_futures=[])
I0420 21:01:10.899674 128452369886336 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=128452369886336 count=1 at 0x74cc6996ec40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x74cc67d0f0e0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x74cc67d0fd40>, _write_futures=[])

Training: 100%|██████████| 5/5 [00:31<00:00,  6.38s/step, _train_loss=5.88, _train_perplexity=356, _train_steps_per_sec=26.4]
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.
~/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '