2026-04-20 21:03:47.848443: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303) I0420 21:03:48.296000 131873516874880 max_utils.py:238] Skipping jax distributed system due to skip_jax_distributed_system=True flag. I0420 21:04:16.783522 131873516874880 max_utils.py:800] System Information: Jax Version: 0.8.3 I0420 21:04:16.783797 131873516874880 max_utils.py:801] System Information: Jaxlib Version: 0.8.3 I0420 21:04:16.783848 131873516874880 max_utils.py:802] System Information: Jax Backend: PJRT C API TFRT TPU v6 lite Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465 I0420 21:04:16.788512 131873516874880 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0420 21:04:16.872640 131873516874880 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0420 21:04:16.956612 131873516874880 maxtext_utils.py:1718] Num_devices: 8, shape (1, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0420 21:04:17.916663 131873516874880 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None I0420 21:04:17.917165 131873516874880 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x77ef8fbf5490>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 21:04:17.917256 131873516874880 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.36 W0420 21:04:19.190178 131873516874880 checkpoint.py:202] Metadata file does not exist: gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA I0420 21:04:19.476494 1306525 google_auth_provider.cc:149] Using credentials at 
~/.config/gcloud/application_default_credentials.json I0420 21:04:19.476571 1306525 google_auth_provider.cc:156] Using OAuth2 AuthProvider I0420 21:04:19.925404 131873516874880 event_tracking.py:70] [process=0] [sync] Started load checkpoint @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items. I0420 21:04:20.055178 131873516874880 checkpointer.py:307] Restoring checkpoint from gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items. I0420 21:04:20.055392 131873516874880 event_tracking.py:125] [process=0] [sync] Finished blocking load in 0.13 seconds. Continuing load @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items. I0420 21:04:20.546633 131873516874880 jax_array_handlers.py:843] [process=0] /jax/orbax/read/worker/io/requested throughput: 1.060 MiB/s (total gbytes: 204.9 KiB) (time elapsed: 0.18884563446044922 s) (per-host) W0420 21:04:20.547648 131873516874880 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on. 
I0420 21:04:20.547920 131873516874880 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/dropout/rngs/aqt/count, params/params/decoder/dropout/rngs/aqt/key, params/params/decoder/dropout/rngs/dropout/count, params/params/decoder/dropout/rngs/dropout/key, params/params/decoder/dropout/rngs/params/count, params/params/decoder/dropout/rngs/params/key, params/params/decoder/layers/dropout/rngs/aqt/count, params/params/decoder/layers/dropout/rngs/aqt/key, params/params/decoder/layers/dropout/rngs/dropout/count, params/params/decoder/layers/dropout/rngs/dropout/key, params/params/decoder/layers/dropout/rngs/params/count, params/params/decoder/layers/dropout/rngs/params/key, params/params/decoder/layers/mlp/dropout/rngs/aqt/count, params/params/decoder/layers/mlp/dropout/rngs/aqt/key, params/params/decoder/layers/mlp/dropout/rngs/dropout/count, params/params/decoder/layers/mlp/dropout/rngs/dropout/key, params/params/decoder/layers/mlp/dropout/rngs/params/count, params/params/decoder/layers/mlp/dropout/rngs/params/key, params/params/decoder/layers/rngs/aqt/count, params/params/decoder/layers/rngs/aqt/key, params/params/decoder/layers/rngs/dropout/count, params/params/decoder/layers/rngs/dropout/key, params/params/decoder/layers/rngs/params/count, params/params/decoder/layers/rngs/params/key, params/params/decoder/rngs/aqt/count, params/params/decoder/rngs/aqt/key, params/params/decoder/rngs/dropout/count, params/params/decoder/rngs/dropout/key, params/params/decoder/rngs/params/count, params/params/decoder/rngs/params/key
I0420 21:04:20.548696 131873516874880 event_tracking.py:138] [process=0] [sync] Finished load in 0.62 seconds @ gs://wanglance-maxtext/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items
I0420 21:04:20.553404 131873516874880 max_utils.py:194] tensorboardX not available; using no-op SummaryWriter.
I0420 21:04:20.580897 131873516874880 config.py:112] TensorFlow version 2.20.0 available. I0420 21:04:20.581341 131873516874880 config.py:125] JAX version 0.8.3 available. E0420 21:04:23.499619 131873516874880 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead. I0420 21:04:23.871343 131873516874880 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None I0420 21:04:23.871483 131873516874880 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x77ef8fbf5490>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 21:04:23.871523 131873516874880 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None I0420 21:04:23.871550 131873516874880 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x77ef8fbf5490>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 21:04:23.871600 131873516874880 checkpoint_manager.py:708] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x77e8f3389a00>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x77f031bfab40>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 
0x77e8f173edb0>}, handler_registry=None I0420 21:04:23.871852 131873516874880 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x77e8f3389a00>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. I0420 21:04:23.871890 131873516874880 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x77f031bfab40>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. I0420 21:04:23.871911 131873516874880 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x77e8f173edb0>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. I0420 21:04:23.871929 131873516874880 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". 
Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x77e8f173c740>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. I0420 21:04:23.871951 131873516874880 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x77e8f3389a00>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x77e8f3389a00>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x77f031bfab40>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x77f031bfab40>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x77e8f173edb0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x77e8f173edb0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x77e8f173c740>, ('metrics', <class 
'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x77e8f173c740>}). I0420 21:04:23.872320 131873516874880 async_checkpointer.py:192] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x77e8f170f600> timeout: 1200 secs and primary_host=0 for async checkpoint writes I0420 21:04:24.722367 131873516874880 checkpoint_manager.py:564] Created directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints I0420 21:04:27.086986 131873516874880 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints I0420 21:04:27.087255 131873516874880 checkpoint_manager.py:929] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, 
enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x77e8f173ec60> I0420 21:04:29.130269 131873516874880 metrics_logger.py:64] WandbBackend skipped: 'wandb' library not installed. I0420 21:04:29.130562 131873516874880 peft_trainer.py:590] Training with mesh: Mesh('diloco': 1, 'data': 1, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto)) I0420 21:04:29.581172 131873516874880 peft_trainer.py:600] Compiled train_step cache size: 0 [DECOUPLED NO-OP] gcs_storage: using stubs. [DECOUPLED NO-OP] mldiagnostics: using stub. [DECOUPLED NO-OP] mldiagnostics: using stub. [DECOUPLED NO-OP] mldiagnostics: using stub. [DECOUPLED NO-OP] workload_monitor: using stub. [DECOUPLED NO-OP] vertex_tensorboard: using stub. Training: 0%| | 0/5 [00:00<?, ?step/s]I0420 21:04:29.584689 131873516874880 metric_logger.py:301] number parameters: 0.000 billion Per train step: Total TFLOPs: 0.00 split as 54.29% learnable weight flops and 45.71% attention flops 2026-04-20 21:04:32.756694: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. 2026-04-20 21:04:32.794978: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. 
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. 2026-04-20 21:04:33.808351: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. 2026-04-20 21:04:36.250076: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303) I0420 21:04:45.128105 131873516874880 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning. I0420 21:04:45.128302 131873516874880 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 1 I0420 21:04:45.128365 131873516874880 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1. I0420 21:04:45.213103 131873516874880 signaling_client.py:373] Using ThreadSafeKeyValueSignalingClient I0420 21:04:45.237936 131873516874880 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array. 
I0420 21:04:45.238058 131873516874880 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 21:04:45.300506 131763454084672 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1 I0420 21:04:46.006393 131763443598912 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/optimizer_state I0420 21:04:46.006622 131763443598912 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/model_params I0420 21:04:46.094156 131873516874880 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.857539s I0420 21:04:46.094642 131873516874880 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array. I0420 21:04:46.094691 131873516874880 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 21:04:46.125881 131873516874880 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. 
Time taken: 0.031616s I0420 21:04:46.126270 131873516874880 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 224.971 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.9117329120635986 s) (per-host) I0420 21:04:46.126398 131873516874880 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.911876s (batch_requests_ready=0.002888s, total_serialization_initiated=0.908589s, others=0.000398s) I0420 21:04:46.126832 131873516874880 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 683.704 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.8992607593536377 s) (per-host) I0420 21:04:46.126910 131873516874880 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.899349s (batch_requests_ready=0.002306s, total_serialization_initiated=0.896586s, others=0.000456s) I0420 21:04:46.126976 131873516874880 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.913297s (all_items=0.000019s, per_item={'model_params': '0.00001526', 'optimizer_state': '0.00000358'}, temp_paths=0.913278) I0420 21:04:46.127654 131873516874880 event_tracking.py:125] [process=0] [async] Finished blocking save in 1.00 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1. I0420 21:04:46.127839 131763345032768 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. 
Deadline for this save operation is 2026-04-20 21:24:46.127809 I0420 21:04:46.128099 131873516874880 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize I0420 21:04:46.128518 131873516874880 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776719085.1280856, 'wait_for_prev_duration_secs': 9.775161743164062e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776719085.1283252, 'checkpointer_blocking_duration_secs': 0.9996049404144287, 'get_old_steps_start_time': 1776719086.127945, 'get_old_steps_duration_secs': 9.512901306152344e-05, 'checkpoint_manager_blocking_start_time': 1776719085.1279964, 'checkpoint_manager_blocking_duration_secs': 1.0004997253417969} I0420 21:04:46.128663 131873516874880 profiler.py:85] Starting JAX profiler at step 1. I0420 21:04:46.237052 131763407947328 checkpoint.py:188] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776719085914058654, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1776719085914058654, "commit_timestamp_nsecs": null, "custom_metadata": {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA I0420 21:04:46.238389 131763464570432 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save. 
I0420 21:04:46.408756 131873516874880 peft_trainer.py:485] Train step 1 training loss: 6.011032 - training perplexity: 407.903900
Training: 0%| | 0/5 [00:16<?, ?step/s, _train_loss=6.01, _train_perplexity=408, _train_steps_per_sec=0.064]
Training: 20%|██ | 1/5 [00:16<01:07, 16.83s/step, _train_loss=6.01, _train_perplexity=408, _train_steps_per_sec=0.064]
I0420 21:04:46.409672 131873516874880 max_utils.py:750] Memstats: After params initialized:
I0420 21:04:46.409748 131873516874880 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_0(process=0,(0,0,0,0))
I0420 21:04:46.409797 131873516874880 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_1(process=0,(1,0,0,0))
I0420 21:04:46.409839 131873516874880 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_2(process=0,(0,1,0,0))
I0420 21:04:46.409878 131873516874880 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_3(process=0,(1,1,0,0))
I0420 21:04:46.409917 131873516874880 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_4(process=0,(0,2,0,0))
I0420 21:04:46.409953 131873516874880 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_5(process=0,(1,2,0,0))
I0420 21:04:46.409987 131873516874880 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_6(process=0,(0,3,0,0))
I0420 21:04:46.410022 131873516874880 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_7(process=0,(1,3,0,0))
I0420 21:04:46.543381 131873516874880 metric_logger.py:196] completed step: 1, seconds: 16.825, TFLOP/s/device: 0.000, Tokens/s/device: 60.862, total_weights: 6826, loss: 6.011, lm_loss: 0.000, perplexity: 0.000
I0420 21:04:46.556790 131873516874880 peft_trainer.py:485] Train step 2 training loss: 6.241558 - training perplexity: 513.658264
Training: 20%|██ | 1/5 [00:16<01:07, 16.83s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.422]
Training: 40%|████ | 2/5 [00:16<00:21, 7.02s/step, _train_loss=6.13, _train_perplexity=458, _train_steps_per_sec=0.422]
I0420 21:04:46.558586 131873516874880 metric_logger.py:196] completed step: 2, seconds: 0.148, TFLOP/s/device: 0.001, Tokens/s/device: 6931.021, total_weights: 4636, loss: 6.242, lm_loss: 0.000, perplexity: 0.000
I0420 21:04:46.580828 131873516874880 peft_trainer.py:485] Train step 3 training loss: 5.699822 - training perplexity: 298.814362
Training: 40%|████ | 2/5 [00:16<00:21, 7.02s/step, _train_loss=5.98, _train_perplexity=397, _train_steps_per_sec=2.53]
I0420 21:04:46.582228 131873516874880 metric_logger.py:196] completed step: 3, seconds: 0.023, TFLOP/s/device: 0.009, Tokens/s/device: 43703.104, total_weights: 5886, loss: 5.700, lm_loss: 0.000, perplexity: 0.000
I0420 21:04:46.595162 131873516874880 peft_trainer.py:485] Train step 4 training loss: 5.823575 - training perplexity: 338.178894
Training: 60%|██████ | 3/5 [00:17<00:14, 7.02s/step, _train_loss=5.94, _train_perplexity=381, _train_steps_per_sec=12.3]
I0420 21:04:46.596443 131873516874880 metric_logger.py:196] completed step: 4, seconds: 0.014, TFLOP/s/device: 0.015, Tokens/s/device: 72004.067, total_weights: 4990, loss: 5.824, lm_loss: 0.000, perplexity: 0.000
I0420 21:04:46.596736 131873516874880 profiler.py:113] Stopping JAX profiler at step 5.
I0420 21:04:47.525168 131763376490048 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_0
I0420 21:04:47.560147 131763397461568 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/model_params/array_metadatas/process_0
I0420 21:04:48.973123 131763366004288 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.241225s (commit=1.832331s, array_metadata_write=0.408895s)
I0420 21:04:48.974226 131763345032768 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 54.556 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 3.7596495151519775 s) (per-host)
I0420 21:04:49.056294 131763355518528 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.324271s (commit=1.909623s, array_metadata_write=0.414648s)
I0420 21:04:49.057367 131763345032768 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 160.540 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.8297529220581055 s) (per-host)
I0420 21:04:49.057667 131763345032768 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 2.929524s.
I0420 21:04:49.240959 131763345032768 checkpoint.py:228] Read Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776719085914058654, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} from gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA I0420 21:04:49.440397 131763345032768 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0. I0420 21:04:49.642983 131763407947328 checkpoint.py:247] Updated Metadata={'item_handlers': {'model_params': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler', 'optimizer_state': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776719085914058654, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/_CHECKPOINT_METADATA I0420 21:04:49.819579 131763345032768 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.535805s. 
use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/model_params I0420 21:04:49.820410 131763345032768 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/model_params I0420 21:04:50.243130 131763345032768 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0. I0420 21:04:50.655386 131763345032768 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.577794s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/optimizer_state I0420 21:04:50.656301 131763345032768 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1/optimizer_state I0420 21:04:50.943783 131763345032768 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1 I0420 21:04:51.456123 131873516874880 utils.py:86] Train loop finished in: 21.8707 seconds I0420 21:04:51.456758 131873516874880 peft_trainer.py:485] Train step 5 training loss: 5.944773 - training perplexity: 381.752594 Training: 80%|████████ | 4/5 [00:21<00:07, 7.02s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=23.8] Training: 100%|██████████| 5/5 [00:21<00:00, 3.16s/step, _train_loss=5.94, 
_train_perplexity=382, _train_steps_per_sec=23.8]I0420 21:04:51.458138 131873516874880 metric_logger.py:196] completed step: 5, seconds: 4.862, TFLOP/s/device: 0.000, Tokens/s/device: 210.630, total_weights: 4264, loss: 5.945, lm_loss: 0.000, perplexity: 0.000 I0420 21:04:51.461281 131873516874880 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. I0420 21:04:51.676808 131763345032768 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1`. I0420 21:04:51.677610 131763345032768 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.55 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/1 I0420 21:04:51.677688 131763345032768 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 5.549546s. I0420 21:04:51.677888 131763464570432 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save. I0420 21:04:51.678011 131763464570432 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save. I0420 21:04:51.678081 131763464570432 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts... I0420 21:04:51.678125 131763464570432 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts. 
I0420 21:04:51.678231 131873516874880 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1. I0420 21:04:51.678451 131873516874880 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 5 I0420 21:04:51.678518 131873516874880 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5. I0420 21:04:51.783766 131873516874880 jax_array_handlers.py:360] Scheduling D2H of 46 prioritized jax.Array. I0420 21:04:51.783893 131873516874880 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 21:04:51.798094 131873516874880 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.015690s I0420 21:04:51.798437 131873516874880 jax_array_handlers.py:360] Scheduling D2H of 52 prioritized jax.Array. I0420 21:04:51.798478 131873516874880 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 21:04:51.829219 131873516874880 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. 
Time taken: 0.031059s I0420 21:04:51.829579 131873516874880 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 2.971 MiB/s (total gbytes: 205.1 KiB) (time elapsed: 0.0674278736114502 s) (per-host) I0420 21:04:51.829699 131873516874880 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.067561s (batch_requests_ready=0.002261s, total_serialization_initiated=0.064925s, others=0.000375s) I0420 21:04:51.830076 131873516874880 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 10.626 MiB/s (total gbytes: 614.8 KiB) (time elapsed: 0.05650496482849121 s) (per-host) I0420 21:04:51.830153 131873516874880 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.056593s (batch_requests_ready=0.002362s, total_serialization_initiated=0.053828s, others=0.000403s) I0420 21:04:51.830213 131873516874880 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.068579s (all_items=0.000012s, per_item={'model_params': '0.00000978', 'optimizer_state': '0.00000262'}, temp_paths=0.068567) I0420 21:04:51.830872 131873516874880 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.15 seconds. Continuing save @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5. I0420 21:04:51.831089 131763324061248 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. 
Deadline for this save operation is 2026-04-20 21:24:51.831051 I0420 21:04:51.831340 131873516874880 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize I0420 21:04:51.831647 131763345032768 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save. I0420 21:04:51.831766 131873516874880 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776719091.4612584, 'wait_for_prev_duration_secs': 0.21701765060424805, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776719091.6784754, 'checkpointer_blocking_duration_secs': 0.15271663665771484, 'get_old_steps_start_time': 1776719091.8312087, 'get_old_steps_duration_secs': 9.34600830078125e-05, 'checkpoint_manager_blocking_start_time': 1776719091.4612193, 'checkpoint_manager_blocking_duration_secs': 0.37052273750305176} I0420 21:04:51.831926 131873516874880 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. 
I0420 21:04:51.851367 131763464570432 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5 I0420 21:04:52.498360 131763355518528 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5/model_params I0420 21:04:52.503310 131763355518528 atomicity.py:140] Creating tmp directory gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5/optimizer_state I0420 21:04:53.760020 131763376490048 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_0 I0420 21:04:53.799901 131763422627392 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5/model_params/array_metadatas/process_0 I0420 21:04:55.283547 131763366004288 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. 
Time taken: 2.302789s (commit=1.841069s, array_metadata_write=0.461720s) I0420 21:04:55.284662 131763324061248 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 58.230 KiB/s (total gbytes: 205.1 KiB) (time elapsed: 3.522468090057373 s) (per-host) I0420 21:04:55.285392 131763334547008 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 2.305214s (commit=1.868789s, array_metadata_write=0.436425s) I0420 21:04:55.286468 131763324061248 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 175.020 KiB/s (total gbytes: 614.8 KiB) (time elapsed: 3.5128984451293945 s) (per-host) I0420 21:04:55.286567 131763324061248 async_checkpointer.py:90] [process=0][thread=async_save] 4 Handler Commit operations completed. Time taken: 3.455184s. I0420 21:04:55.671319 131763324061248 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0. I0420 21:04:56.083130 131763324061248 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.568144s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5/model_params I0420 21:04:56.083960 131763324061248 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5/model_params I0420 21:04:56.508147 131763324061248 array_metadata_store.py:367] [process=0][thread=async_save] Skipped cross-host ArrayMetadata validation because only one process is found: process_index=0. 
I0420 21:04:56.900245 131763324061248 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 0.547238s. use_zarr3=False, enable_post_merge_validation=True, directory=gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5/optimizer_state I0420 21:04:56.901096 131763324061248 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5/optimizer_state I0420 21:04:57.190911 131763324061248 atomicity.py:666] Finalizing gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5 I0420 21:04:57.888430 131763324061248 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5`. I0420 21:04:57.889200 131763324061248 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 6.21 seconds @ gs://wanglance-maxtext/pt_ckpt_feat_nnx_post_train_fixes_20260420_205452/pt_sft_nnx_feat_nnx_post_train_fixes_20260420_205452_03_sft_linen_ckpt/checkpoints/5 I0420 21:04:57.889271 131763324061248 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 6.057889s. I0420 21:04:57.889441 131763345032768 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save. I0420 21:04:57.889553 131763345032768 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save. 
I0420 21:04:57.889608 131763345032768 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts... I0420 21:04:57.889647 131763345032768 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts. I0420 21:04:57.889809 131873516874880 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5. I0420 21:04:57.890040 131873516874880 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=131873516874880 count=1 at 0x77e8f6b72ec0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x77e8f173e810>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x77e8f475f5f0>, _write_futures=[]) I0420 21:04:57.890451 131873516874880 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=131873516874880 count=1 at 0x77e8f6b72ec0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x77e8f173e810>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x77e8f475f5f0>, _write_futures=[]) I0420 21:04:57.890478 131873516874880 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=131873516874880 count=1 at 0x77e8f6b72ec0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x77e8f173e810>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x77e8f475f5f0>, _write_futures=[]) Training: 100%|██████████| 5/5 [00:29<00:00, 5.89s/step, _train_loss=5.94, _train_perplexity=382, _train_steps_per_sec=23.8] [DECOUPLED NO-OP] gcs_storage: using stubs. [DECOUPLED NO-OP] mldiagnostics: using stub. 
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] mldiagnostics: using stub.
[DECOUPLED NO-OP] workload_monitor: using stub.
[DECOUPLED NO-OP] vertex_tensorboard: using stub.
~/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '