MaxView


Log Summary

2026-04-16 21:15:10.758949: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0416 21:15:11.300541 124503439158400 max_utils.py:238] Skipping jax distributed system due to skip_jax_distributed_system=True flag.
I0416 21:15:50.413647 124503439158400 max_utils.py:800] System Information: Jax Version: 0.8.3
I0416 21:15:50.413770 124503439158400 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0416 21:15:50.413805 124503439158400 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0416 21:15:50.417059 124503439158400 maxtext_utils.py:1687] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0416 21:15:50.498402 124503439158400 maxtext_utils.py:1687] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0416 21:15:50.581346 124503439158400 maxtext_utils.py:1687] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0416 21:15:53.063977 124503439158400 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0416 21:15:53.064406 124503439158400 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x713b957fd520>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0416 21:15:53.064462 124503439158400 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.36
W0416 21:15:54.371870 124503439158400 checkpoint.py:202] Metadata file does not exist: gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA
I0416 21:15:54.564951 4175628 google_auth_provider.cc:149] Using credentials at ~/.config/gcloud/application_default_credentials.json
I0416 21:15:54.565025 4175628 google_auth_provider.cc:156] Using OAuth2 AuthProvider
I0416 21:15:54.951595 124503439158400 event_tracking.py:70] [process=0] [sync] Started load checkpoint @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0416 21:15:55.081683 124503439158400 checkpointer.py:307] Restoring checkpoint from gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0416 21:15:55.081905 124503439158400 event_tracking.py:125] [process=0] [sync] Finished blocking load in 0.13 seconds. Continuing load @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0416 21:15:55.575335 124503439158400 jax_array_handlers.py:843] [process=0] /jax/orbax/read/worker/io/requested throughput: 1.052 MiB/s (total gbytes: 204.9 KiB) (time elapsed: 0.1901836395263672 s) (per-host)
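The per-host read-throughput figure above is just total bytes over elapsed time; a quick check of the logged numbers (204.9 KiB over ~0.190 s):

```python
# Reproduce the per-host read-throughput figure from the log line above
total_kib = 204.9
elapsed_s = 0.1901836395263672

mib_per_s = (total_kib / 1024) / elapsed_s  # KiB -> MiB, then divide by seconds
print(f"{mib_per_s:.3f} MiB/s")  # matches the logged 1.052 MiB/s
```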
W0416 21:15:55.576419 124503439158400 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0416 21:15:55.576682 124503439158400 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/dropout/rngs/aqt/count, params/params/decoder/dropout/rngs/aqt/key, params/params/decoder/dropout/rngs/dropout/count, params/params/decoder/dropout/rngs/dropout/key, params/params/decoder/dropout/rngs/params/count, params/params/decoder/dropout/rngs/params/key, params/params/decoder/layers/dropout/rngs/aqt/count, params/params/decoder/layers/dropout/rngs/aqt/key, params/params/decoder/layers/dropout/rngs/dropout/count, params/params/decoder/layers/dropout/rngs/dropout/key, params/params/decoder/layers/dropout/rngs/params/count, params/params/decoder/layers/dropout/rngs/params/key, params/params/decoder/layers/mlp/dropout/rngs/aqt/count, params/params/decoder/layers/mlp/dropout/rngs/aqt/key, params/params/decoder/layers/mlp/dropout/rngs/dropout/count, params/params/decoder/layers/mlp/dropout/rngs/dropout/key, params/params/decoder/layers/mlp/dropout/rngs/params/count, params/params/decoder/layers/mlp/dropout/rngs/params/key, params/params/decoder/layers/rngs/aqt/count, params/params/decoder/layers/rngs/aqt/key, params/params/decoder/layers/rngs/dropout/count, params/params/decoder/layers/rngs/dropout/key, params/params/decoder/layers/rngs/params/count, params/params/decoder/layers/rngs/params/key, params/params/decoder/rngs/aqt/count, params/params/decoder/rngs/aqt/key, params/params/decoder/rngs/dropout/count, params/params/decoder/rngs/dropout/key, params/params/decoder/rngs/params/count, params/params/decoder/rngs/params/key
I0416 21:15:55.577250 124503439158400 event_tracking.py:138] [process=0] [sync] Finished load in 0.63 seconds @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items
I0416 21:15:55.582026 124503439158400 max_utils.py:194] tensorboardX not available; using no-op SummaryWriter.
I0416 21:15:55.606182 124503439158400 config.py:112] TensorFlow version 2.20.0 available.
I0416 21:15:55.606583 124503439158400 config.py:125] JAX version 0.8.3 available.
E0416 21:15:57.059201 124503439158400 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0416 21:15:57.448797 124503439158400 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0416 21:15:57.448932 124503439158400 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x713b957fd520>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0416 21:15:57.448976 124503439158400 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0416 21:15:57.449003 124503439158400 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x713b957fd520>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0416 21:15:57.449058 124503439158400 checkpoint_manager.py:708] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7134ff9d5f10>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7134fa7bcd40>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7134f8b2da00>}, handler_registry=None
I0416 21:15:57.449277 124503439158400 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7134ff9d5f10>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0416 21:15:57.449313 124503439158400 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7134fa7bcd40>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0416 21:15:57.449333 124503439158400 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7134f8b2da00>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0416 21:15:57.449351 124503439158400 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7134fa7be570>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0416 21:15:57.449372 124503439158400 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7134ff9d5f10>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7134ff9d5f10>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7134fa7bcd40>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7134fa7bcd40>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7134f8b2da00>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7134f8b2da00>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7134fa7be570>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7134fa7be570>}).
I0416 21:15:57.449710 124503439158400 async_checkpointer.py:192] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x7134f8b0aac0> timeout: 1200 secs and primary_host=0 for async checkpoint writes
I0416 21:15:58.298808 124503439158400 checkpoint_manager.py:564] Created directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints
I0416 21:16:00.646103 124503439158400 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints
I0416 21:16:00.646362 124503439158400 checkpoint_manager.py:929] [process=0][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7134f8b2f170>
I0416 21:16:02.683518 124503439158400 metrics_logger.py:64] WandbBackend skipped: 'wandb' library not installed.
I0416 21:16:02.683783 124503439158400 peft_trainer.py:590] Training with mesh: Mesh('diloco': 1, 'data': 1, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0416 21:16:03.154180 124503439158400 peft_trainer.py:600] Compiled train_step cache size: 0
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.

Training:   0%|          | 0/5 [00:00<?, ?step/s]I0416 21:16:03.157691 124503439158400 metric_logger.py:289] number parameters: 0.000 billion
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
2026-04-16 21:16:06.333023: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-16 21:16:06.371593: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
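As the oneDNN notice above says, the custom ops can be switched off with `TF_ENABLE_ONEDNN_OPTS=0`. One way to do that from Python, noting the variable must be set before TensorFlow is first imported for it to take effect:

```python
import os

# Disable oneDNN custom ops for bitwise-stable float ordering;
# must run before `import tensorflow` anywhere in the process
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
print(os.environ["TF_ENABLE_ONEDNN_OPTS"])
```

Equivalently, `export TF_ENABLE_ONEDNN_OPTS=0` in the launch environment covers every process of the job.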
2026-04-16 21:16:07.389838: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-16 21:16:09.793592: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0416 21:16:16.775801 124503439158400 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0416 21:16:16.775993 124503439158400 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 1
I0416 21:16:16.776074 124503439158400 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1.
I0416 21:16:16.863371 124503439158400 signaling_client.py:373] Using ThreadSafeKeyValueSignalingClient
I0416 21:16:16.887840 124503439158400 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array.
I0416 21:16:16.887938 124503439158400 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:16:16.952542 124393365702208 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1
I0416 21:16:17.634746 124393355216448 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/model_params
I0416 21:16:17.645359 124393355216448 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/optimizer_state
I0416 21:16:17.750596 124503439158400 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.864136s
I0416 21:16:17.751082 124503439158400 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0416 21:16:17.751132 124503439158400 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:16:17.782577 124503439158400 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.031875s
I0416 21:16:17.782929 124503439158400 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 223.435 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.9179990291595459 s) (per-host)
I0416 21:16:17.783076 124503439158400 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.918161s (batch_requests_ready=0.002957s, total_serialization_initiated=0.914817s, others=0.000386s)
I0416 21:16:17.783466 124503439158400 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 678.800 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.9057579040527344 s) (per-host)
I0416 21:16:17.783544 124503439158400 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.905849s (batch_requests_ready=0.002385s, total_serialization_initiated=0.903054s, others=0.000410s)
I0416 21:16:17.783608 124503439158400 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.919663s (all_items=0.000022s, per_item={'model_params': '0.00001812', 'optimizer_state': '0.00000405'}, temp_paths=0.919641)
I0416 21:16:17.784242 124503439158400 event_tracking.py:125] [process=0] [async] Finished blocking save in 1.01 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1.
I0416 21:16:17.784505 124393256650304 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-16 21:36:17.784463
I0416 21:16:17.784741 124503439158400 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0416 21:16:17.785167 124503439158400 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776374176.775781, 'wait_for_prev_duration_secs': 9.608268737792969e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776374176.7760215, 'checkpointer_blocking_duration_secs': 1.0085759162902832, 'get_old_steps_start_time': 1776374177.7846127, 'get_old_steps_duration_secs': 9.250640869140625e-05, 'checkpoint_manager_blocking_start_time': 1776374176.7757118, 'checkpoint_manager_blocking_duration_secs': 1.0094316005706787}
I0416 21:16:17.785306 124503439158400 profiler.py:85] Starting JAX profiler at step 1.
I0416 21:16:17.906919 124393319564864 checkpoint.py:188] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776374177549970043, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1776374177549970043, "commit_timestamp_nsecs": null, "custom_metadata": {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0416 21:16:17.907921 124393376187968 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0416 21:16:18.080338 124503439158400 peft_trainer.py:485] Train step 1 training loss: 6.011032  - training perplexity: 407.903900
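The two numbers in the step-1 line above are consistent: the reported training perplexity is the exponential of the mean cross-entropy loss, which can be checked directly:

```python
import math

# perplexity = exp(mean cross-entropy loss), using the step-1 values above
loss = 6.011032
perplexity = math.exp(loss)
print(round(perplexity, 4))  # ~407.904, matching the logged 407.903900
```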

Training:   0%|          | 0/5 [00:14<?, ?step/s, _train_loss=6.01, _train_perplexity=408, _train_steps_per_sec=0.073]
Training:  20%|██        | 1/5 [00:14<00:59, 14.93s/step, _train_loss=6.01, _train_perplexity=408, _train_steps_per_sec=0.073]I0416 21:16:18.081259 124503439158400 max_utils.py:750] 
Memstats: After params initialized:
I0416 21:16:18.081339 124503439158400 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_0(process=0,(0,0,0,0))
I0416 21:16:18.081389 124503439158400 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_1(process=0,(1,0,0,0))
I0416 21:16:18.081431 124503439158400 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_2(process=0,(0,1,0,0))
I0416 21:16:18.081470 124503439158400 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_3(process=0,(1,1,0,0))
I0416 21:16:18.081506 124503439158400 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_4(process=0,(0,2,0,0))
I0416 21:16:18.081544 124503439158400 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_5(process=0,(1,2,0,0))
I0416 21:16:18.081580 124503439158400 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_6(process=0,(0,3,0,0))
I0416 21:16:18.081617 124503439158400 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_7(process=0,(1,3,0,0))
I0416 21:16:18.215662 124503439158400 metric_logger.py:185] completed step: 1, seconds: 14.923, TFLOP/s/device: 0.000, Tokens/s/device: 68.617, total_weights: 6826, loss: 6.011
I0416 21:16:18.228652 124503439158400 peft_trainer.py:485] Train step 2 training loss: 6.241558  - training perplexity: 513.658264

Training:  20%|██        | 1/5 [00:15<00:59, 14.93s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.42] 
Training:  40%|████      | 2/5 [00:15<00:18,  6.23s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.42]I0416 21:16:18.230390 124503439158400 metric_logger.py:185] completed step: 2, seconds: 0.148, TFLOP/s/device: 0.001, Tokens/s/device: 6918.613, total_weights: 4636, loss: 6.242
I0416 21:16:18.252366 124503439158400 peft_trainer.py:485] Train step 3 training loss: 5.699822  - training perplexity: 298.814362

Training:  40%|████      | 2/5 [00:15<00:18,  6.23s/step, _train_loss=5.98, _train_perplexity=397, _train_steps_per_sec=2.48]I0416 21:16:18.253703 124503439158400 metric_logger.py:185] completed step: 3, seconds: 0.023, TFLOP/s/device: 0.009, Tokens/s/device: 44136.942, total_weights: 5886, loss: 5.700
I0416 21:16:18.265721 124503439158400 peft_trainer.py:485] Train step 4 training loss: 5.823575  - training perplexity: 338.178894

Training:  60%|██████    | 3/5 [00:15<00:12,  6.23s/step, _train_loss=5.94, _train_perplexity=381, _train_steps_per_sec=12.4]I0416 21:16:18.267157 124503439158400 metric_logger.py:185] completed step: 4, seconds: 0.014, TFLOP/s/device: 0.016, Tokens/s/device: 75691.559, total_weights: 4990, loss: 5.824
I0416 21:16:18.267456 124503439158400 profiler.py:113] Stopping JAX profiler at step 5.
I0416 21:16:19.346155 124393288107584 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_0
I0416 21:16:19.372929 124393309079104 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/model_params/array_metadatas/process_0
I0416 21:16:20.921711 124393277621824 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.380286s (commit=1.934036s, array_metadata_write=0.446251s)
I0416 21:16:20.922885 124393256650304 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 50.546 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 4.057915925979614 s) (per-host)
I0416 21:16:20.956583 124393267136064 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.410495s (commit=1.947642s, array_metadata_write=0.462852s)
I0416 21:16:20.957605 124393256650304 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 150.698 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 4.07986307144165 s) (per-host)
I0416 21:16:20.957798 124393256650304 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.173012s.
I0416 21:16:21.145099 124393256650304 checkpoint.py:228] Read Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776374177549970043, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} from gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0416 21:16:21.335674 124393256650304 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:16:21.547162 124393319564864 checkpoint.py:247] Updated Metadata={'item_handlers': {'model_params': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler', 'optimizer_state': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776374177549970043, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0416 21:16:21.756681 124393256650304 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.568246s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/model_params
I0416 21:16:21.757524 124393256650304 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/model_params
I0416 21:16:22.165493 124393256650304 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:16:22.602717 124393256650304 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.570594s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/optimizer_state
I0416 21:16:22.603689 124393256650304 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1/optimizer_state
I0416 21:16:22.885011 124393256650304 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1
I0416 21:16:23.183993 124503439158400 utils.py:86] Train loop finished in: 20.0256 seconds
I0416 21:16:23.184694 124503439158400 peft_trainer.py:485] Train step 5 training loss: 5.944773  - training perplexity: 381.752594

Training:  80%|████████  | 4/5 [00:20<00:06,  6.23s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=24.8]
Training: 100%|██████████| 5/5 [00:20<00:00,  2.95s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=24.8]I0416 21:16:23.186122 124503439158400 metric_logger.py:185] completed step: 5, seconds: 4.919, TFLOP/s/device: 0.000, Tokens/s/device: 208.183, total_weights: 4264, loss: 5.945
I0416 21:16:23.189604 124503439158400 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0416 21:16:23.592179 124393256650304 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1`.
I0416 21:16:23.592942 124393256650304 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.82 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/1
I0416 21:16:23.593061 124393256650304 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.808250s.
I0416 21:16:23.593278 124393376187968 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0416 21:16:23.593407 124393376187968 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0416 21:16:23.593478 124393376187968 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0416 21:16:23.593523 124393376187968 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0416 21:16:23.593695 124503439158400 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
I0416 21:16:23.594026 124503439158400 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 5
I0416 21:16:23.594117 124503439158400 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5.
I0416 21:16:23.710592 124503439158400 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array.
I0416 21:16:23.710720 124503439158400 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:16:23.725228 124503439158400 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.015991s
I0416 21:16:23.725596 124503439158400 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0416 21:16:23.725638 124503439158400 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:16:23.756597 124503439158400 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.031295s
I0416 21:16:23.756950 124503439158400 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 2.909 MiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.06885886192321777 s) (per-host)
I0416 21:16:23.757086 124503439158400 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.069007s (batch_requests_ready=0.002452s, total_serialization_initiated=0.066170s, others=0.000385s)
I0416 21:16:23.757440 124503439158400 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 10.414 MiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.057654619216918945 s) (per-host)
I0416 21:16:23.757514 124503439158400 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.057738s (batch_requests_ready=0.002550s, total_serialization_initiated=0.054812s, others=0.000376s)
I0416 21:16:23.757573 124503439158400 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.070021s (all_items=0.000012s, per_item={'model_params': '0.00000930', 'optimizer_state': '0.00000262'}, temp_paths=0.070009)
I0416 21:16:23.758219 124503439158400 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.16 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5.
I0416 21:16:23.758440 124393149695552 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-16 21:36:23.758402
I0416 21:16:23.758686 124503439158400 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0416 21:16:23.759034 124393139209792 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0416 21:16:23.759159 124503439158400 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776374183.1895819, 'wait_for_prev_duration_secs': 0.4042243957519531, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776374183.594069, 'checkpointer_blocking_duration_secs': 0.16446971893310547, 'get_old_steps_start_time': 1776374183.7585564, 'get_old_steps_duration_secs': 9.059906005859375e-05, 'checkpoint_manager_blocking_start_time': 1776374183.1895368, 'checkpoint_manager_blocking_duration_secs': 0.5695977210998535}
I0416 21:16:23.759322 124503439158400 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0416 21:16:23.772718 124393376187968 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5
I0416 21:16:24.443983 124393267136064 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5/optimizer_state
I0416 21:16:24.451905 124393267136064 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5/model_params
I0416 21:16:25.686643 124393309079104 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5/model_params/array_metadatas/process_0
I0416 21:16:25.693965 124393277621824 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_0
I0416 21:16:27.203212 124393212610112 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.291595s (commit=1.872011s, array_metadata_write=0.419584s)
I0416 21:16:27.214572 124393235678784 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.303403s (commit=1.891137s, array_metadata_write=0.412266s)
I0416 21:16:27.215649 124393149695552 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 58.147 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 3.5275211334228516 s) (per-host)
I0416 21:16:27.215944 124393149695552 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 174.858 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.516164541244507 s) (per-host)
I0416 21:16:27.216018 124393149695552 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.457286s.
I0416 21:16:27.563478 124393149695552 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:16:27.998229 124393149695552 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.575525s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5/model_params
I0416 21:16:27.999100 124393149695552 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5/model_params
I0416 21:16:28.392844 124393149695552 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:16:28.827482 124393149695552 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.569905s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5/optimizer_state
I0416 21:16:28.828361 124393149695552 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5/optimizer_state
I0416 21:16:29.102622 124393149695552 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5
I0416 21:16:29.806365 124393149695552 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5`.
I0416 21:16:29.807156 124393149695552 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.21 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_nnx_feat_nnx_post_train_fixes_20260416_210550_03_sft_linen_ckpt/checkpoints/5
I0416 21:16:29.807230 124393149695552 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 6.048503s.
I0416 21:16:29.807398 124393139209792 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0416 21:16:29.807512 124393139209792 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0416 21:16:29.807567 124393139209792 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0416 21:16:29.807607 124393139209792 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0416 21:16:29.807759 124503439158400 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0416 21:16:29.808066 124503439158400 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=124503439158400 count=1 at 0x7134fab1b8c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7134f8b2fce0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7134f8b2f890>, _write_futures=[])
I0416 21:16:29.808525 124503439158400 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=124503439158400 count=1 at 0x7134fab1b8c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7134f8b2fce0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7134f8b2f890>, _write_futures=[])
I0416 21:16:29.808554 124503439158400 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=124503439158400 count=1 at 0x7134fab1b8c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7134f8b2fce0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7134f8b2f890>, _write_futures=[])

Training: 100%|██████████| 5/5 [00:27<00:00,  5.56s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=24.8]
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.
~/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '