MaxView

Log Summary

2026-04-20 20:57:16.653092: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0420 20:57:17.111736 124365660339328 max_utils.py:238] Skipping jax distributed system due to skip_jax_distributed_system=True flag.
I0420 20:57:42.812107 124365660339328 max_utils.py:800] System Information: Jax Version: 0.8.3
I0420 20:57:42.812410 124365660339328 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0420 20:57:42.812465 124365660339328 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0420 20:57:42.817206 124365660339328 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0420 20:57:43.015409 124365660339328 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0420 20:57:44.067003 124365660339328 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0420 20:57:44.067532 124365660339328 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x711b813fd970>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0420 20:57:44.067624 124365660339328 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.36
W0420 20:57:45.395480 124365660339328 checkpoint.py:202] Metadata file does not exist: gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA
I0420 20:57:45.700141 1294976 google_auth_provider.cc:149] Using credentials at ~/.config/gcloud/application_default_credentials.json
I0420 20:57:45.700243 1294976 google_auth_provider.cc:156] Using OAuth2 AuthProvider
I0420 20:57:46.105418 124365660339328 event_tracking.py:70] [process=0] [sync] Started load checkpoint @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0420 20:57:46.226369 124365660339328 checkpointer.py:307] Restoring checkpoint from gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0420 20:57:46.226611 124365660339328 event_tracking.py:125] [process=0] [sync] Finished blocking load in 0.12 seconds. Continuing load @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0420 20:57:46.791587 124365660339328 jax_array_handlers.py:843] [process=0] /jax/orbax/read/worker/io/requested throughput: 737.139 KiB/s (total gbytes: 204.9 KiB) (time elapsed: 0.278017520904541 s) (per-host)
W0420 20:57:46.792751 124365660339328 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0420 20:57:46.792959 124365660339328 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/to_nnx__rngs/aqt/count, params/params/decoder/to_nnx__rngs/aqt/key, params/params/decoder/to_nnx__rngs/dropout/count, params/params/decoder/to_nnx__rngs/dropout/key, params/params/decoder/to_nnx__rngs/params/count, params/params/decoder/to_nnx__rngs/params/key
I0420 20:57:46.793427 124365660339328 event_tracking.py:138] [process=0] [sync] Finished load in 0.69 seconds @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items
I0420 20:57:46.795868 124365660339328 max_utils.py:194] tensorboardX not available; using no-op SummaryWriter.
I0420 20:57:46.820671 124365660339328 config.py:112] TensorFlow version 2.20.0 available.
I0420 20:57:46.821082 124365660339328 config.py:125] JAX version 0.8.3 available.
E0420 20:57:50.059655 124365660339328 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0420 20:57:50.419497 124365660339328 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0420 20:57:50.419641 124365660339328 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x711b813fd970>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0420 20:57:50.419680 124365660339328 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0420 20:57:50.419708 124365660339328 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x711b813fd970>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0420 20:57:50.419763 124365660339328 checkpoint_manager.py:708] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7115125dc950>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7114e4377860>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7114e4377620>}, handler_registry=None
I0420 20:57:50.420059 124365660339328 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7115125dc950>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0420 20:57:50.420100 124365660339328 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7114e4377860>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0420 20:57:50.420123 124365660339328 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7114e4377620>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0420 20:57:50.420142 124365660339328 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7114e43ba180>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0420 20:57:50.420163 124365660339328 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7115125dc950>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7115125dc950>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7114e4377860>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7114e4377860>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7114e4377620>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7114e4377620>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7114e43ba180>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7114e43ba180>}).
I0420 20:57:50.420511 124365660339328 async_checkpointer.py:192] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x7114e4360720> timeout: 1200 secs and primary_host=0 for async checkpoint writes
I0420 20:57:51.281138 124365660339328 checkpoint_manager.py:564] Created directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints
I0420 20:57:53.611415 124365660339328 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints
I0420 20:57:53.611720 124365660339328 checkpoint_manager.py:929] [process=0][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7114e43777d0>
I0420 20:57:55.640522 124365660339328 metrics_logger.py:64] WandbBackend skipped: 'wandb' library not installed.
I0420 20:57:55.640824 124365660339328 peft_trainer.py:590] Training with mesh: Mesh('diloco': 1, 'data': 1, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0420 20:57:56.095821 124365660339328 peft_trainer.py:600] Compiled train_step cache size: 0
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.

Training:   0%|          | 0/5 [00:00<?, ?step/s]
I0420 20:57:56.097879 124365660339328 metric_logger.py:301] number parameters: 0.000 billion
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
2026-04-20 20:57:59.262188: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-20 20:57:59.300156: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-20 20:58:00.308061: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-20 20:58:02.725814: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0420 20:58:11.552586 124365660339328 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0420 20:58:11.552798 124365660339328 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 1
I0420 20:58:11.552864 124365660339328 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1.
I0420 20:58:11.630252 124365660339328 signaling_client.py:373] Using ThreadSafeKeyValueSignalingClient
I0420 20:58:11.649092 124365660339328 jax_array_handlers.py:360] Scheduling D2H of 22 prioritized jax.Array.
I0420 20:58:11.649190 124365660339328 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 20:58:11.724645 124255620564544 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1
I0420 20:58:12.395016 124255610078784 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/optimizer_state
I0420 20:58:12.400226 124255610078784 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/model_params
I0420 20:58:12.486662 124365660339328 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.838304s
I0420 20:58:12.487158 124365660339328 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0420 20:58:12.487207 124365660339328 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 20:58:12.517328 124365660339328 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.030555s
I0420 20:58:12.517753 124365660339328 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 231.262 KiB/s (total gbytes: 205.0 KiB) (time elapsed: 0.8863224983215332 s) (per-host)
I0420 20:58:12.517888 124365660339328 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.886471s (batch_requests_ready=0.001957s, total_serialization_initiated=0.884067s, others=0.000446s)
I0420 20:58:12.518318 124365660339328 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 698.969 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.8796217441558838 s) (per-host)
I0420 20:58:12.518396 124365660339328 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.879710s (batch_requests_ready=0.002478s, total_serialization_initiated=0.876822s, others=0.000410s)
I0420 20:58:12.518457 124365660339328 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.887615s (all_items=0.000019s, per_item={'model_params': '0.00001502', 'optimizer_state': '0.00000358'}, temp_paths=0.887596)
I0420 20:58:12.519095 124365660339328 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.97 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1.
I0420 20:58:12.519276 124255511512640 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-20 21:18:12.519241
I0420 20:58:12.519520 124365660339328 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0420 20:58:12.519847 124365660339328 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776718691.552567, 'wait_for_prev_duration_secs': 0.00010228157043457031, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776718691.552822, 'checkpointer_blocking_duration_secs': 0.9665517807006836, 'get_old_steps_start_time': 1776718692.5193915, 'get_old_steps_duration_secs': 9.298324584960938e-05, 'checkpoint_manager_blocking_start_time': 1776718691.552487, 'checkpoint_manager_blocking_duration_secs': 0.9673388004302979}
I0420 20:58:12.519970 124365660339328 profiler.py:85] Starting JAX profiler at step 1.
I0420 20:58:12.625614 124255574427200 checkpoint.py:188] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718692302856210, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1776718692302856210, "commit_timestamp_nsecs": null, "custom_metadata": {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0420 20:58:12.626695 124255631050304 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0420 20:58:12.793934 124365660339328 peft_trainer.py:485] Train step 1 training loss: 6.011032  - training perplexity: 407.903900

Training:   0%|          | 0/5 [00:16<?, ?step/s, _train_loss=6.01, _train_perplexity=408, _train_steps_per_sec=0.065]
Training:  20%|██        | 1/5 [00:16<01:06, 16.70s/step, _train_loss=6.01, _train_perplexity=408, _train_steps_per_sec=0.065]
I0420 20:58:12.794969 124365660339328 max_utils.py:750]
Memstats: After params initialized:
I0420 20:58:12.795065 124365660339328 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_0(process=0,(0,0,0,0))
I0420 20:58:12.795116 124365660339328 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_1(process=0,(1,0,0,0))
I0420 20:58:12.795157 124365660339328 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_2(process=0,(0,1,0,0))
I0420 20:58:12.795195 124365660339328 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_3(process=0,(1,1,0,0))
I0420 20:58:12.795233 124365660339328 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_4(process=0,(0,2,0,0))
I0420 20:58:12.795271 124365660339328 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_5(process=0,(1,2,0,0))
I0420 20:58:12.795306 124365660339328 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_6(process=0,(0,3,0,0))
I0420 20:58:12.795341 124365660339328 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_7(process=0,(1,3,0,0))
I0420 20:58:12.928371 124365660339328 metric_logger.py:196] completed step: 1, seconds: 16.697, TFLOP/s/device: 0.000, Tokens/s/device: 61.329, total_weights: 6826, loss: 6.011, lm_loss: 0.000, perplexity: 0.000
I0420 20:58:12.940074 124365660339328 peft_trainer.py:485] Train step 2 training loss: 6.241558  - training perplexity: 513.658264

Training:  20%|██        | 1/5 [00:16<01:06, 16.70s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.436]
Training:  40%|████      | 2/5 [00:16<00:20,  6.96s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.436]
I0420 20:58:12.941626 124365660339328 metric_logger.py:196] completed step: 2, seconds: 0.146, TFLOP/s/device: 0.002, Tokens/s/device: 7030.368, total_weights: 4636, loss: 6.242, lm_loss: 0.000, perplexity: 0.000
I0420 20:58:12.957458 124365660339328 peft_trainer.py:485] Train step 3 training loss: 5.699822  - training perplexity: 298.814362

Training:  40%|████      | 2/5 [00:16<00:20,  6.96s/step, _train_loss=5.98, _train_perplexity=397, _train_steps_per_sec=2.52]
I0420 20:58:12.958688 124365660339328 metric_logger.py:196] completed step: 3, seconds: 0.017, TFLOP/s/device: 0.013, Tokens/s/device: 59886.823, total_weights: 5886, loss: 5.700, lm_loss: 0.000, perplexity: 0.000
I0420 20:58:12.969920 124365660339328 peft_trainer.py:485] Train step 4 training loss: 5.823575  - training perplexity: 338.178894

Training:  60%|██████    | 3/5 [00:16<00:13,  6.96s/step, _train_loss=5.94, _train_perplexity=381, _train_steps_per_sec=16.2]
I0420 20:58:12.971336 124365660339328 metric_logger.py:196] completed step: 4, seconds: 0.013, TFLOP/s/device: 0.018, Tokens/s/device: 81417.297, total_weights: 4990, loss: 5.824, lm_loss: 0.000, perplexity: 0.000
I0420 20:58:12.971647 124365660339328 profiler.py:113] Stopping JAX profiler at step 5.
I0420 20:58:13.852616 124255563941440 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/model_params/array_metadatas/process_0
I0420 20:58:13.857678 124255542969920 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_0
I0420 20:58:15.289704 124255532484160 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.211747s (commit=1.789488s, array_metadata_write=0.422259s)
I0420 20:58:15.290928 124255511512640 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 56.012 KiB/s (total gbytes: 205.0 KiB) (time elapsed: 3.659451484680176 s) (per-host)
I0420 20:58:15.307704 124255521998400 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.229424s (commit=1.807405s, array_metadata_write=0.422020s)
I0420 20:58:15.308484 124255511512640 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 167.539 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.669772148132324 s) (per-host)
I0420 20:58:15.308643 124255511512640 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 2.789079s.
I0420 20:58:15.504720 124255511512640 checkpoint.py:228] Read Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718692302856210, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} from gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0420 20:58:15.688384 124255511512640 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 20:58:15.886350 124255574427200 checkpoint.py:247] Updated Metadata={'item_handlers': {'model_params': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler', 'optimizer_state': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776718692302856210, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0420 20:58:16.088394 124255511512640 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.543653s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/model_params
I0420 20:58:16.089267 124255511512640 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/model_params
I0420 20:58:16.484429 124255511512640 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 20:58:16.878884 124255511512640 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.537851s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/optimizer_state
I0420 20:58:16.879859 124255511512640 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1/optimizer_state
I0420 20:58:17.143618 124255511512640 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1
I0420 20:58:17.647283 124365660339328 utils.py:86] Train loop finished in: 21.5487 seconds
I0420 20:58:17.647976 124365660339328 peft_trainer.py:485] Train step 5 training loss: 5.944773  - training perplexity: 381.752594

Training:  80%|████████  | 4/5 [00:21<00:06,  6.96s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=29]  
Training: 100%|██████████| 5/5 [00:21<00:00,  3.10s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=29]
I0420 20:58:17.649384 124365660339328 metric_logger.py:196] completed step: 5, seconds: 4.678, TFLOP/s/device: 0.000, Tokens/s/device: 218.911, total_weights: 4264, loss: 5.945, lm_loss: 0.000, perplexity: 0.000
I0420 20:58:17.651589 124365660339328 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0420 20:58:17.868098 124255511512640 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1`.
I0420 20:58:17.868926 124255511512640 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.32 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/1
I0420 20:58:17.869023 124255511512640 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.349460s.
I0420 20:58:17.869273 124255631050304 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0420 20:58:17.869406 124255631050304 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0420 20:58:17.869469 124255631050304 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0420 20:58:17.869514 124255631050304 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0420 20:58:17.869707 124365660339328 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
I0420 20:58:17.870058 124365660339328 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 5
I0420 20:58:17.870133 124365660339328 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5.
I0420 20:58:17.974214 124365660339328 jax_array_handlers.py:360] Scheduling D2H of 22 prioritized jax.Array.
I0420 20:58:17.974408 124365660339328 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 20:58:17.986112 124365660339328 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012670s
I0420 20:58:17.986437 124365660339328 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0420 20:58:17.986477 124365660339328 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 20:58:18.015385 124365660339328 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.029210s
I0420 20:58:18.015737 124365660339328 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 3.497 MiB/s (total gbytes: 205.0 KiB) (time elapsed: 0.057241201400756836 s) (per-host)
I0420 20:58:18.015853 124365660339328 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.057369s (batch_requests_ready=0.001556s, total_serialization_initiated=0.055459s, others=0.000354s)
I0420 20:58:18.016223 124365660339328 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 11.615 MiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.05169510841369629 s) (per-host)
I0420 20:58:18.016306 124365660339328 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.051790s (batch_requests_ready=0.002491s, total_serialization_initiated=0.048894s, others=0.000404s)
I0420 20:58:18.016369 124365660339328 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.058455s (all_items=0.000017s, per_item={'model_params': '0.00001478', 'optimizer_state': '0.00000191'}, temp_paths=0.058438)
I0420 20:58:18.017146 124365660339328 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.15 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5.
I0420 20:58:18.017371 124255415043648 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-20 21:18:18.017333
I0420 20:58:18.017639 124365660339328 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0420 20:58:18.017984 124255511512640 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0420 20:58:18.018108 124365660339328 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776718697.6515658, 'wait_for_prev_duration_secs': 0.21825647354125977, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776718697.8700838, 'checkpointer_blocking_duration_secs': 0.14739537239074707, 'get_old_steps_start_time': 1776718698.0174963, 'get_old_steps_duration_secs': 0.00010609626770019531, 'checkpoint_manager_blocking_start_time': 1776718697.6515229, 'checkpoint_manager_blocking_duration_secs': 0.3665597438812256}
I0420 20:58:18.018259 124365660339328 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0420 20:58:18.038488 124255631050304 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5
I0420 20:58:18.691740 124255501026880 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5/optimizer_state
I0420 20:58:18.696786 124255501026880 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5/model_params
I0420 20:58:19.908465 124255563941440 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5/model_params/array_metadatas/process_0
I0420 20:58:19.912632 124255532484160 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_0
I0420 20:58:21.369640 124255490541120 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.185486s (commit=1.804176s, array_metadata_write=0.381310s)
I0420 20:58:21.370378 124255456986688 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.193252s (commit=1.780271s, array_metadata_write=0.412981s)
I0420 20:58:21.372007 124255415043648 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 60.048 KiB/s (total gbytes: 205.0 KiB) (time elapsed: 3.4134654998779297 s) (per-host)
I0420 20:58:21.372424 124255415043648 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 180.413 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.4079012870788574 s) (per-host)
I0420 20:58:21.372502 124255415043648 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.354818s.
I0420 20:58:21.725642 124255415043648 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 20:58:22.121615 124255415043648 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.535907s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5/model_params
I0420 20:58:22.122534 124255415043648 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5/model_params
I0420 20:58:22.564457 124255415043648 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0420 20:58:22.941363 124255415043648 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.524868s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5/optimizer_state
I0420 20:58:22.942295 124255415043648 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5/optimizer_state
I0420 20:58:23.205485 124255415043648 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5
I0420 20:58:23.878858 124255415043648 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5`.
I0420 20:58:23.879642 124255415043648 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.01 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_linen_feat_nnx_post_train_fixes_20260420_205452_02_sft_linen_ckpt/checkpoints/5
I0420 20:58:23.879721 124255415043648 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.862038s.
I0420 20:58:23.879944 124255511512640 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0420 20:58:23.880088 124255511512640 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0420 20:58:23.880144 124255511512640 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0420 20:58:23.880185 124255511512640 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0420 20:58:23.880374 124365660339328 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0420 20:58:23.880617 124365660339328 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=124365660339328 count=1 at 0x7114e4bc9e00>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7114e7b361e0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7114e4377230>, _write_futures=[])
I0420 20:58:23.881337 124365660339328 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=124365660339328 count=1 at 0x7114e4bc9e00>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7114e7b361e0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7114e4377230>, _write_futures=[])
I0420 20:58:23.881366 124365660339328 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=124365660339328 count=1 at 0x7114e4bc9e00>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7114e7b361e0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7114e4377230>, _write_futures=[])

Training: 100%|██████████| 5/5 [00:28<00:00,  5.78s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=29]
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.
~/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '