MaxView

Log Summary

2026-04-16 21:07:39.512687: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0416 21:07:39.955577 127608205884544 max_utils.py:238] Skipping jax distributed system due to skip_jax_distributed_system=True flag.
I0416 21:08:33.603207 127608205884544 max_utils.py:800] System Information: Jax Version: 0.8.3
I0416 21:08:33.603489 127608205884544 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0416 21:08:33.603541 127608205884544 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0416 21:08:33.608148 127608205884544 maxtext_utils.py:1687] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0416 21:08:33.797785 127608205884544 maxtext_utils.py:1687] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0416 21:08:34.839942 127608205884544 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0416 21:08:34.840439 127608205884544 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x740e77ff9580>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0416 21:08:34.840527 127608205884544 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.36
W0416 21:08:36.123785 127608205884544 checkpoint.py:202] Metadata file does not exist: gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA
I0416 21:08:36.423229 4160929 google_auth_provider.cc:149] Using credentials at ~/.config/gcloud/application_default_credentials.json
I0416 21:08:36.423308 4160929 google_auth_provider.cc:156] Using OAuth2 AuthProvider
I0416 21:08:36.841008 127608205884544 event_tracking.py:70] [process=0] [sync] Started load checkpoint @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0416 21:08:36.969334 127608205884544 checkpointer.py:307] Restoring checkpoint from gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0416 21:08:36.969554 127608205884544 event_tracking.py:125] [process=0] [sync] Finished blocking load in 0.13 seconds. Continuing load @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0416 21:08:37.575159 127608205884544 jax_array_handlers.py:843] [process=0] /jax/orbax/read/worker/io/requested throughput: 737.363 KiB/s (total gbytes: 204.9 KiB) (time elapsed: 0.27793288230895996 s) (per-host)
W0416 21:08:37.576175 127608205884544 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0416 21:08:37.576381 127608205884544 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/to_nnx__rngs/aqt/count, params/params/decoder/to_nnx__rngs/aqt/key, params/params/decoder/to_nnx__rngs/dropout/count, params/params/decoder/to_nnx__rngs/dropout/key, params/params/decoder/to_nnx__rngs/params/count, params/params/decoder/to_nnx__rngs/params/key
I0416 21:08:37.576624 127608205884544 event_tracking.py:138] [process=0] [sync] Finished load in 0.74 seconds @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items
I0416 21:08:37.579165 127608205884544 max_utils.py:194] tensorboardX not available; using no-op SummaryWriter.
I0416 21:08:37.603823 127608205884544 config.py:112] TensorFlow version 2.20.0 available.
I0416 21:08:37.604247 127608205884544 config.py:125] JAX version 0.8.3 available.
E0416 21:08:40.556811 127608205884544 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0416 21:08:40.921984 127608205884544 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0416 21:08:40.922126 127608205884544 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x740e77ff9580>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0416 21:08:40.922167 127608205884544 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0416 21:08:40.922194 127608205884544 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x740e77ff9580>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0416 21:08:40.922241 127608205884544 checkpoint_manager.py:708] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7407e1d5d6a0>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7407daf2e180>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7407dcb6be90>}, handler_registry=None
I0416 21:08:40.922478 127608205884544 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7407e1d5d6a0>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0416 21:08:40.922514 127608205884544 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7407daf2e180>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0416 21:08:40.922535 127608205884544 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7407dcb6be90>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0416 21:08:40.922553 127608205884544 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7407daf2dbb0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0416 21:08:40.922573 127608205884544 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7407e1d5d6a0>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7407e1d5d6a0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7407daf2e180>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7407daf2e180>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7407dcb6be90>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7407dcb6be90>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7407daf2dbb0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7407daf2dbb0>}).
I0416 21:08:40.922904 127608205884544 async_checkpointer.py:192] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x7407dcbc7ba0> timeout: 1200 secs and primary_host=0 for async checkpoint writes
I0416 21:08:41.803540 127608205884544 checkpoint_manager.py:564] Created directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints
I0416 21:08:44.154294 127608205884544 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints
I0416 21:08:44.154560 127608205884544 checkpoint_manager.py:929] [process=0][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7407dcb6bf50>
I0416 21:08:46.184395 127608205884544 metrics_logger.py:64] WandbBackend skipped: 'wandb' library not installed.
I0416 21:08:46.184666 127608205884544 peft_trainer.py:590] Training with mesh: Mesh('diloco': 1, 'data': 1, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0416 21:08:46.664179 127608205884544 peft_trainer.py:600] Compiled train_step cache size: 0
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.

Training:   0%|          | 0/5 [00:00<?, ?step/s]
I0416 21:08:46.666158 127608205884544 metric_logger.py:289] number parameters: 0.000 billion
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
2026-04-16 21:08:49.848359: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-16 21:08:49.887068: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-16 21:08:50.908573: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-16 21:08:53.301696: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0416 21:09:02.353843 127608205884544 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0416 21:09:02.354060 127608205884544 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 1
I0416 21:09:02.354128 127608205884544 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1.
I0416 21:09:02.437186 127608205884544 signaling_client.py:373] Using ThreadSafeKeyValueSignalingClient
I0416 21:09:02.457288 127608205884544 jax_array_handlers.py:360] Scheduling D2H of 22 prioritized jax.Array.
I0416 21:09:02.457382 127608205884544 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:09:02.532084 127498132129344 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1
I0416 21:09:03.238056 127498121643584 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/model_params
I0416 21:09:03.245420 127498121643584 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/optimizer_state
I0416 21:09:03.294971 127608205884544 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.838415s
I0416 21:09:03.295512 127608205884544 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0416 21:09:03.295559 127608205884544 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:09:03.326086 127608205884544 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.030986s
I0416 21:09:03.326435 127608205884544 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 230.878 KiB/s (total gbytes: 205.0 KiB) (time elapsed: 0.8877980709075928 s) (per-host)
I0416 21:09:03.326561 127608205884544 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.887937s (batch_requests_ready=0.001986s, total_serialization_initiated=0.885591s, others=0.000360s)
I0416 21:09:03.327255 127608205884544 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 698.423 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.8803095817565918 s) (per-host)
I0416 21:09:03.327337 127608205884544 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.880405s (batch_requests_ready=0.002496s, total_serialization_initiated=0.877187s, others=0.000722s)
I0416 21:09:03.327400 127608205884544 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.889618s (all_items=0.000020s, per_item={'model_params': '0.00001645', 'optimizer_state': '0.00000381'}, temp_paths=0.889598)
I0416 21:09:03.328113 127608205884544 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.97 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1.
I0416 21:09:03.328395 127498023077440 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-16 21:29:03.328351
I0416 21:09:03.328642 127608205884544 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0416 21:09:03.329000 127608205884544 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776373742.3538237, 'wait_for_prev_duration_secs': 9.870529174804688e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776373742.354085, 'checkpointer_blocking_duration_secs': 0.9744117259979248, 'get_old_steps_start_time': 1776373743.3285122, 'get_old_steps_duration_secs': 9.34600830078125e-05, 'checkpoint_manager_blocking_start_time': 1776373742.3537557, 'checkpoint_manager_blocking_duration_secs': 0.9752223491668701}
I0416 21:09:03.329139 127608205884544 profiler.py:85] Starting JAX profiler at step 1.
I0416 21:09:03.448317 127498085992000 checkpoint.py:188] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776373743156570192, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1776373743156570192, "commit_timestamp_nsecs": null, "custom_metadata": {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0416 21:09:03.449562 127498142615104 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0416 21:09:03.616842 127608205884544 peft_trainer.py:485] Train step 1 training loss: 6.011032  - training perplexity: 407.903900

Training:   0%|          | 0/5 [00:16<?, ?step/s, _train_loss=6.01, _train_perplexity=408, _train_steps_per_sec=0.064]
Training:  20%|██        | 1/5 [00:16<01:07, 16.95s/step, _train_loss=6.01, _train_perplexity=408, _train_steps_per_sec=0.064]
I0416 21:09:03.617772 127608205884544 max_utils.py:750]
Memstats: After params initialized:
I0416 21:09:03.617856 127608205884544 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_0(process=0,(0,0,0,0))
I0416 21:09:03.617907 127608205884544 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_1(process=0,(1,0,0,0))
I0416 21:09:03.617949 127608205884544 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_2(process=0,(0,1,0,0))
I0416 21:09:03.617986 127608205884544 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_3(process=0,(1,1,0,0))
I0416 21:09:03.618023 127608205884544 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_4(process=0,(0,2,0,0))
I0416 21:09:03.618081 127608205884544 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_5(process=0,(1,2,0,0))
I0416 21:09:03.618119 127608205884544 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_6(process=0,(0,3,0,0))
I0416 21:09:03.618158 127608205884544 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_7(process=0,(1,3,0,0))
I0416 21:09:03.748554 127608205884544 metric_logger.py:185] completed step: 1, seconds: 16.952, TFLOP/s/device: 0.000, Tokens/s/device: 60.408, total_weights: 6826, loss: 6.011
I0416 21:09:03.760497 127608205884544 peft_trainer.py:485] Train step 2 training loss: 6.241558  - training perplexity: 513.658264

Training:  20%|██        | 1/5 [00:17<01:07, 16.95s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.428]
Training:  40%|████      | 2/5 [00:17<00:21,  7.06s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.428]
I0416 21:09:03.762164 127608205884544 metric_logger.py:185] completed step: 2, seconds: 0.143, TFLOP/s/device: 0.002, Tokens/s/device: 7164.230, total_weights: 4636, loss: 6.242
I0416 21:09:03.778423 127608205884544 peft_trainer.py:485] Train step 3 training loss: 5.699822  - training perplexity: 298.814362

Training:  40%|████      | 2/5 [00:17<00:21,  7.06s/step, _train_loss=5.98, _train_perplexity=397, _train_steps_per_sec=2.55]
I0416 21:09:03.779865 127608205884544 metric_logger.py:185] completed step: 3, seconds: 0.018, TFLOP/s/device: 0.012, Tokens/s/device: 56513.888, total_weights: 5886, loss: 5.700
I0416 21:09:03.790905 127608205884544 peft_trainer.py:485] Train step 4 training loss: 5.823575  - training perplexity: 338.178894

Training:  60%|██████    | 3/5 [00:17<00:14,  7.06s/step, _train_loss=5.94, _train_perplexity=381, _train_steps_per_sec=15.8]
I0416 21:09:03.792300 127608205884544 metric_logger.py:185] completed step: 4, seconds: 0.012, TFLOP/s/device: 0.018, Tokens/s/device: 83251.700, total_weights: 4990, loss: 5.824
I0416 21:09:03.792600 127608205884544 profiler.py:113] Stopping JAX profiler at step 5.
I0416 21:09:04.680009 127498075506240 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/model_params/array_metadatas/process_0
I0416 21:09:04.697410 127498054534720 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_0
I0416 21:09:06.190823 127498033563200 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.331827s (commit=1.897069s, array_metadata_write=0.434758s)
I0416 21:09:06.227377 127498044048960 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.372943s (commit=1.930304s, array_metadata_write=0.442639s)
I0416 21:09:06.228282 127498023077440 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 54.088 KiB/s (total gbytes: 205.0 KiB) (time elapsed: 3.789638042449951 s) (per-host)
I0416 21:09:06.228529 127498023077440 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 162.584 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.7815933227539062 s) (per-host)
I0416 21:09:06.228601 127498023077440 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 2.899917s.
I0416 21:09:06.412028 127498023077440 checkpoint.py:228] Read Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776373743156570192, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} from gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0416 21:09:06.594783 127498023077440 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:09:06.782159 127498085992000 checkpoint.py:247] Updated Metadata={'item_handlers': {'model_params': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler', 'optimizer_state': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776373743156570192, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA
I0416 21:09:07.010770 127498023077440 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.556287s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/model_params
I0416 21:09:07.011615 127498023077440 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/model_params
I0416 21:09:07.422727 127498023077440 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:09:07.849965 127498023077440 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.567127s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/optimizer_state
I0416 21:09:07.850825 127498023077440 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1/optimizer_state
I0416 21:09:08.121306 127498023077440 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1
I0416 21:09:08.539182 127608205884544 utils.py:86] Train loop finished in: 21.8723 seconds
I0416 21:09:08.539885 127608205884544 peft_trainer.py:485] Train step 5 training loss: 5.944773  - training perplexity: 381.752594

Training:  80%|████████  | 4/5 [00:21<00:07,  7.06s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=28.6]
Training: 100%|██████████| 5/5 [00:21<00:00,  3.15s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=28.6]
I0416 21:09:08.541347 127608205884544 metric_logger.py:185] completed step: 5, seconds: 4.749, TFLOP/s/device: 0.000, Tokens/s/device: 215.626, total_weights: 4264, loss: 5.945
I0416 21:09:08.543557 127608205884544 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0416 21:09:08.826288 127498023077440 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1`.
I0416 21:09:08.827091 127498023077440 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.47 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/1
I0416 21:09:08.827173 127498023077440 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.498488s.
I0416 21:09:08.827389 127498142615104 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0416 21:09:08.827517 127498142615104 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0416 21:09:08.827574 127498142615104 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0416 21:09:08.827624 127498142615104 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0416 21:09:08.827796 127608205884544 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
I0416 21:09:08.828136 127608205884544 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 5
I0416 21:09:08.828207 127608205884544 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5.
I0416 21:09:08.935629 127608205884544 jax_array_handlers.py:360] Scheduling D2H of 22 prioritized jax.Array.
I0416 21:09:08.935754 127608205884544 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:09:08.947572 127608205884544 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012598s
I0416 21:09:08.947873 127608205884544 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array.
I0416 21:09:08.947912 127608205884544 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0416 21:09:08.976626 127608205884544 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.028993s
I0416 21:09:08.976939 127608205884544 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 3.496 MiB/s (total gbytes: 205.0 KiB) (time elapsed: 0.057257652282714844 s) (per-host)
I0416 21:09:08.977074 127608205884544 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.057405s (batch_requests_ready=0.001536s, total_serialization_initiated=0.055530s, others=0.000339s)
I0416 21:09:08.977361 127608205884544 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 11.587 MiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.05182003974914551 s) (per-host)
I0416 21:09:08.977436 127608205884544 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.051904s (batch_requests_ready=0.002493s, total_serialization_initiated=0.049098s, others=0.000312s)
I0416 21:09:08.977495 127608205884544 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.058342s (all_items=0.000013s, per_item={'model_params': '0.00001049', 'optimizer_state': '0.00000238'}, temp_paths=0.058329)
I0416 21:09:08.978153 127608205884544 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.15 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5.
I0416 21:09:08.978409 127497991620160 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-16 21:29:08.978368
I0416 21:09:08.978648 127608205884544 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0416 21:09:08.978959 127498023077440 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0416 21:09:08.979096 127608205884544 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776373748.5435326, 'wait_for_prev_duration_secs': 0.28437066078186035, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776373748.828162, 'checkpointer_blocking_duration_secs': 0.15034127235412598, 'get_old_steps_start_time': 1776373748.9785194, 'get_old_steps_duration_secs': 9.107589721679688e-05, 'checkpoint_manager_blocking_start_time': 1776373748.5434911, 'checkpoint_manager_blocking_duration_secs': 0.4355807304382324}
I0416 21:09:08.979254 127608205884544 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0416 21:09:09.005186 127498142615104 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5
I0416 21:09:09.670554 127498044048960 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5/model_params
I0416 21:09:09.683208 127498044048960 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5/optimizer_state
I0416 21:09:11.267300 127498100672064 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5/model_params/array_metadatas/process_0
I0416 21:09:11.267516 127498054534720 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_0
I0416 21:09:12.428680 127498033563200 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.289626s (commit=1.890224s, array_metadata_write=0.399401s)
I0416 21:09:12.429738 127497991620160 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 58.396 KiB/s (total gbytes: 205.0 KiB) (time elapsed: 3.5100529193878174 s) (per-host)
I0416 21:09:12.437009 127498002105920 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.290381s (commit=1.896379s, array_metadata_write=0.394002s)
I0416 21:09:12.437801 127497991620160 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 175.052 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.5122580528259277 s) (per-host)
I0416 21:09:12.437942 127497991620160 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.459249s.
I0416 21:09:12.786947 127497991620160 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:09:13.178365 127497991620160 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.524791s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5/model_params
I0416 21:09:13.179279 127497991620160 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5/model_params
I0416 21:09:13.589695 127497991620160 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0.
I0416 21:09:13.984010 127497991620160 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.533510s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5/optimizer_state
I0416 21:09:13.984931 127497991620160 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5/optimizer_state
I0416 21:09:14.243039 127497991620160 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5
I0416 21:09:14.938582 127497991620160 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5`.
I0416 21:09:14.939331 127497991620160 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.11 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260416_210550/pt_sft_linen_feat_nnx_post_train_fixes_20260416_210550_02_sft_linen_ckpt/checkpoints/5
I0416 21:09:14.939400 127497991620160 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.960709s.
I0416 21:09:14.939507 127498023077440 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0416 21:09:14.939556 127498023077440 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0416 21:09:14.939601 127498023077440 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0416 21:09:14.939639 127498023077440 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0416 21:09:14.939797 127608205884544 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0416 21:09:14.940087 127608205884544 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=127608205884544 count=1 at 0x7407dd73d800>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7407dcb6a090>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7407dcb69fa0>, _write_futures=[])
I0416 21:09:14.940523 127608205884544 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=127608205884544 count=1 at 0x7407dd73d800>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7407dcb6a090>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7407dcb69fa0>, _write_futures=[])
I0416 21:09:14.940550 127608205884544 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=127608205884544 count=1 at 0x7407dd73d800>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7407dcb6a090>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7407dcb69fa0>, _write_futures=[])

Training: 100%|██████████| 5/5 [00:29<00:00,  5.87s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=28.6]
[DECOUPLED NO-OP] gcs_storage: using stubs.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.
~/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '