XPK Start: Fri Apr 24 12:23:28 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-24 12:23:57.280360: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0424 12:23:57.529889 132675596343104 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-24 12:24:06,572:jax._src.distributed:149: Starting JAX distributed service on [::]:8482
I0424 12:24:06.572643 132675596343104 distributed.py:149] Starting JAX distributed service on [::]:8482
INFO:2026-04-24 12:24:06,575:jax._src.distributed:166: Connecting to JAX distributed service on mt-02-sft-nnx-ckpt-bnyu1-slice-job-0-0.mt-02-sft-nnx-ckpt-bnyu1:8482
I0424 12:24:06.575020 132675596343104 distributed.py:166] Connecting to JAX distributed service on mt-02-sft-nnx-ckpt-bnyu1-slice-job-0-0.mt-02-sft-nnx-ckpt-bnyu1:8482
I0424 12:24:07.956301 132675596343104 max_utils.py:284] Jax distributed system initialized!
I0424 12:24:13.960306 132675596343104 max_utils.py:800] System Information: Jax Version: 0.8.3
I0424 12:24:13.960413 132675596343104 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0424 12:24:13.960452 132675596343104 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0424 12:24:13.963897 132675596343104 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0424 12:24:14.060910 132675596343104 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0424 12:24:14.164172 132675596343104 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0424 12:24:15.221263 132675596343104 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0424 12:24:15.221702 132675596343104 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x78aa3bf97d10>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0424 12:24:15.221766 132675596343104 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
W0424 12:24:15.726036 132675596343104 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items/_CHECKPOINT_METADATA
I0424 12:24:16.370584 1942 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0424 12:24:17.444863 132675596343104 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
W0424 12:24:18.338762 132675596343104 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0424 12:24:18.339182 132675596343104 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: decoder/decoder_norm/bias/value, decoder/decoder_norm/scale/value, decoder/dropout/rngs/aqt/count/value, decoder/dropout/rngs/aqt/key/value, decoder/dropout/rngs/dropout/count/value, decoder/dropout/rngs/dropout/key/value, decoder/dropout/rngs/params/count/value, decoder/dropout/rngs/params/key/value, decoder/layers/dropout/rngs/aqt/count/value, decoder/layers/dropout/rngs/aqt/key/value, decoder/layers/dropout/rngs/dropout/count/value, decoder/layers/dropout/rngs/dropout/key/value, decoder/layers/dropout/rngs/params/count/value, decoder/layers/dropout/rngs/params/key/value, decoder/layers/mlp/dropout/rngs/aqt/count/value, decoder/layers/mlp/dropout/rngs/aqt/key/value, decoder/layers/mlp/dropout/rngs/dropout/count/value, decoder/layers/mlp/dropout/rngs/dropout/key/value, decoder/layers/mlp/dropout/rngs/params/count/value, decoder/layers/mlp/dropout/rngs/params/key/value, decoder/layers/mlp/mlp_layer_norm/bias/value, decoder/layers/mlp/mlp_layer_norm/scale/value, decoder/layers/mlp/wi/bias/value, decoder/layers/mlp/wi/kernel/value, decoder/layers/mlp/wo/bias/value, decoder/layers/mlp/wo/kernel/value, decoder/layers/pre_self_attention_norm/bias/value, decoder/layers/pre_self_attention_norm/scale/value, decoder/layers/rngs/aqt/count/value, decoder/layers/rngs/aqt/key/value, decoder/layers/rngs/dropout/count/value, decoder/layers/rngs/dropout/key/value, decoder/layers/rngs/params/count/value, decoder/layers/rngs/params/key/value, decoder/layers/self_attention/out/bias/value, decoder/layers/self_attention/out/kernel/value, decoder/layers/self_attention/qkv_proj/bias/value, decoder/layers/self_attention/qkv_proj/kernel/value, decoder/position_embedder/embedding/value, decoder/rngs/aqt/count/value, decoder/rngs/aqt/key/value, decoder/rngs/dropout/count/value, decoder/rngs/dropout/key/value, decoder/rngs/params/count/value, decoder/rngs/params/key/value, token_embedder/embedding/value
I0424 12:24:18.616578 132675596343104 checkpointer.py:318] Finished restoring checkpoint in 1.55 seconds from gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
I0424 12:24:18.685434 132675596343104 config.py:112] TensorFlow version 2.20.0 available.
I0424 12:24:18.685977 132675596343104 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, fewer than the number of hosts loading data. This is known to lead to inefficient dataloading. See github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
warnings.warn(
E0424 12:24:23.959339 132675596343104 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0424 12:24:23.959660 132675596343104 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0424 12:24:24.344260 132675596343104 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0424 12:24:24.344407 132675596343104 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x78aa3bf97d10>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0424 12:24:24.344454 132675596343104 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0424 12:24:24.344489 132675596343104 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x78aa3bf97d10>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0424 12:24:24.344533 132675596343104 checkpoint_manager.py:702] [process=4][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7892f57a06b0>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7891c6dc4560>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7891c6dc5490>}, handler_registry=None
I0424 12:24:24.344745 132675596343104 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7892f57a06b0>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0424 12:24:24.344788 132675596343104 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7891c6dc4560>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0424 12:24:24.344818 132675596343104 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7891c6dc5490>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0424 12:24:24.344842 132675596343104 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7891c6dbd760>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0424 12:24:24.344882 132675596343104 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7892f57a06b0>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7892f57a06b0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7891c6dc4560>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7891c6dc4560>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7891c6dc5490>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7891c6dc5490>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7891c6dbd760>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7891c6dbd760>}).
I0424 12:24:24.345343 132675596343104 async_checkpointer.py:177] [process=4][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7891c76bfe20> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0424 12:24:26.654997 132675596343104 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints
I0424 12:24:26.673562 132675596343104 checkpoint_manager.py:921] [process=4][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7891c6dc66c0>
I0424 12:24:26.673831 132675596343104 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0424 12:24:27.092058 132675596343104 peft_trainer.py:594] Compiled train_step cache size: 0
I0424 12:24:27.096236 132675596343104 metric_logger.py:301] number parameters: 0.000 billion
I0424 12:24:27.098466 132521861826304 grain_pool.py:367] Grain pool will use 1 processes.
I0424 12:24:27.124796 132521861826304 grain_pool.py:440] Grain pool will start child processes.
Per train step:
Total TFLOPs: 0.00
split as 54.29% learnable weight flops and 45.71% attention flops
I0424 12:24:27.130064 132521861826304 grain_pool.py:448] Grain pool started all child processes.
2026-04-24 12:24:31.110774: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-24 12:24:31.155568: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-24 12:24:32.326212: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-24 12:24:36.516700: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0424 12:24:44.931921 132675596343104 checkpoint_manager.py:1983] [process=4][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0424 12:24:44.933806 132675596343104 checkpoint_manager.py:1501] [process=4] Saving checkpoint at step 1
I0424 12:24:44.936892 132675596343104 async_checkpointer.py:452] [process=4] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints/1.
I0424 12:24:45.523601 132675596343104 signaling_client.py:364] Using JaxDistributedSignalingClient
I0424 12:24:45.524628 132675596343104 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0424 12:24:45.524687 132675596343104 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0424 12:24:46.220772 132675596343104 base_pytree_checkpoint_handler.py:153] [process=4][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.698264s
I0424 12:24:46.222906 132675596343104 base_pytree_checkpoint_handler.py:128] [process=4] /jax/checkpoint/write/blocking_gbytes_per_sec: 26.522 KiB/s (total gbytes: 25.6 KiB) (time elapsed: 963 milliseconds) (per-host)
I0424 12:24:46.222979 132675596343104 base_pytree_checkpoint_handler.py:732] [process=4][thread=MainThread] Initiated Pytree async_save. Time taken: 0.963964s (batch_requests_ready=0.257650s, total_serialization_initiated=0.706177s, others=0.000137s)
I0424 12:24:46.223821 132675596343104 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0424 12:24:46.223903 132675596343104 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0424 12:24:46.236112 132675596343104 base_pytree_checkpoint_handler.py:153] [process=4][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.013007s
I0424 12:24:46.236238 132675596343104 base_pytree_checkpoint_handler.py:128] [process=4] /jax/checkpoint/write/blocking_gbytes_per_sec: 78.736 KiB/s (total gbytes: 76.7 KiB) (time elapsed: 973 milliseconds) (per-host)
I0424 12:24:46.236294 132675596343104 base_pytree_checkpoint_handler.py:732] [process=4][thread=MainThread] Initiated Pytree async_save. Time taken: 0.974066s (batch_requests_ready=0.958735s, total_serialization_initiated=0.015246s, others=0.000085s)
I0424 12:24:46.236424 132675596343104 composite_checkpoint_handler.py:715] [process=4][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.981599s (all_items=0.000023s, per_item={'model_params': '0.00001907', 'optimizer_state': '0.00000429'}, temp_paths=0.981576)
I0424 12:24:46.237369 132519194257152 async_checkpointer.py:79] [process=4][thread=async_save] Background save thread started.
I0424 12:24:46.237492 132675596343104 async_checkpointer.py:561] Finished blocking save. Time taken: 1.303620s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints/1.
I0424 12:24:46.248348 132675596343104 checkpoint_manager.py:1549] [process=4][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0424 12:24:46.248609 132519722735360 async_checkpointer.py:265] [process=4][thread=save_finalize] Waiting for background save thread=async_save.
I0424 12:24:46.248756 132675596343104 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777033484.9318917, 'wait_for_prev_duration_secs': 0.00012564659118652344, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777033484.9338455, 'checkpointer_blocking_duration_secs': 1.3038203716278076, 'get_old_steps_start_time': 1777033486.237685, 'get_old_steps_duration_secs': 8.273124694824219e-05, 'checkpoint_manager_blocking_start_time': 1777033484.877951, 'checkpoint_manager_blocking_duration_secs': 1.370772123336792}
I0424 12:24:46.390444 132675596343104 peft_trainer.py:474] Train step 1 training loss: 5.871947 - training perplexity: 354.939331
I0424 12:24:46.390706 132675596343104 max_utils.py:750]
Memstats: After params initialized:
I0424 12:24:46.390766 132675596343104 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_16(process=4,(0,4,0,0))
I0424 12:24:46.390804 132675596343104 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_17(process=4,(1,4,0,0))
I0424 12:24:46.390833 132675596343104 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_20(process=4,(0,5,0,0))
I0424 12:24:46.390859 132675596343104 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_21(process=4,(1,5,0,0))
I0424 12:24:46.520955 132675596343104 metric_logger.py:196] completed step: 1, seconds: 19.295, TFLOP/s/device: 0.000, Tokens/s/device: 53.072, total_weights: 21054, loss: 5.872, lm_loss: 0.000, perplexity: 0.000
I0424 12:24:46.529908 132675596343104 peft_trainer.py:474] Train step 2 training loss: 5.539500 - training perplexity: 254.550751
I0424 12:24:46.530623 132675596343104 metric_logger.py:196] completed step: 2, seconds: 0.139, TFLOP/s/device: 0.002, Tokens/s/device: 7356.635, total_weights: 21455, loss: 5.540, lm_loss: 0.000, perplexity: 0.000
I0424 12:24:46.611675 132675596343104 peft_trainer.py:474] Train step 3 training loss: 5.440901 - training perplexity: 230.649963
I0424 12:24:46.612773 132675596343104 metric_logger.py:196] completed step: 3, seconds: 0.082, TFLOP/s/device: 0.003, Tokens/s/device: 12511.956, total_weights: 22025, loss: 5.441, lm_loss: 0.000, perplexity: 0.000
I0424 12:24:46.636216 132675596343104 peft_trainer.py:474] Train step 4 training loss: 5.596125 - training perplexity: 269.380554
I0424 12:24:46.636936 132675596343104 metric_logger.py:196] completed step: 4, seconds: 0.024, TFLOP/s/device: 0.009, Tokens/s/device: 41862.062, total_weights: 23787, loss: 5.596, lm_loss: 0.000, perplexity: 0.000
I0424 12:24:46.652180 132675596343104 peft_trainer.py:733] Train loop finished in: 19.5548 seconds
I0424 12:24:46.653108 132675596343104 peft_trainer.py:474] Train step 5 training loss: 5.670155 - training perplexity: 290.079498
I0424 12:24:46.653800 132675596343104 metric_logger.py:196] completed step: 5, seconds: 0.017, TFLOP/s/device: 0.013, Tokens/s/device: 60737.090, total_weights: 20141, loss: 5.670, lm_loss: 0.000, perplexity: 0.000
I0424 12:24:50.880285 132519697557248 array_metadata_store.py:203] [process=4][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints/1/model_params/array_metadatas/process_4
I0424 12:24:50.881574 132519194257152 base_pytree_checkpoint_handler.py:128] [process=4] /jax/checkpoint/write/gbytes_per_sec: 4.546 KiB/s (total gbytes: 25.6 KiB) (time elapsed: 5 seconds) (per-host)
I0424 12:24:50.885418 132519202649856 array_metadata_store.py:203] [process=4][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_4
I0424 12:24:50.886358 132519194257152 base_pytree_checkpoint_handler.py:128] [process=4] /jax/checkpoint/write/gbytes_per_sec: 13.636 KiB/s (total gbytes: 76.7 KiB) (time elapsed: 5 seconds) (per-host)
I0424 12:24:50.886421 132519194257152 async_checkpointer.py:90] [process=4][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.648848s.
I0424 12:24:55.428988 132675596343104 checkpoint_manager.py:1994] [process=4][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0424 12:25:01.712959 132519194257152 async_checkpointer.py:144] [process=4][thread=async_save] Background save thread done. Time taken: 15.475373s.
I0424 12:25:01.713266 132519722735360 async_checkpointer.py:273] [process=4][thread=save_finalize] Done with waiting for background save thread=async_save.
I0424 12:25:01.713390 132519722735360 async_checkpointer.py:283] [process=4][thread=save_finalize] No errors found in background save thread=async_save.
I0424 12:25:01.713443 132519722735360 checkpoint_manager.py:2103] [process=4][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0424 12:25:01.714682 132519722735360 checkpoint_manager.py:2112] [process=4][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0424 12:25:01.714856 132675596343104 checkpoint_manager.py:2006] [process=4][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0424 12:25:01.715013 132675596343104 checkpoint_manager.py:1441] Waiting for previous save to complete took 6.286043 seconds. If this number is high, consider checkpointing less frequently.
I0424 12:25:01.716577 132675596343104 checkpoint_manager.py:1501] [process=4] Saving checkpoint at step 5
I0424 12:25:01.720058 132675596343104 async_checkpointer.py:452] [process=4] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints/5.
I0424 12:25:02.480861 132675596343104 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0424 12:25:02.480966 132675596343104 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0424 12:25:02.495921 132675596343104 base_pytree_checkpoint_handler.py:153] [process=4][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.016232s
I0424 12:25:02.499624 132675596343104 base_pytree_checkpoint_handler.py:128] [process=4] /jax/checkpoint/write/blocking_gbytes_per_sec: 285.786 KiB/s (total gbytes: 76.7 KiB) (time elapsed: 268 milliseconds) (per-host)
I0424 12:25:02.499685 132675596343104 base_pytree_checkpoint_handler.py:732] [process=4][thread=MainThread] Initiated Pytree async_save. Time taken: 0.268435s (batch_requests_ready=0.246407s, total_serialization_initiated=0.021929s, others=0.000100s)
I0424 12:25:02.501470 132675596343104 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0424 12:25:02.501526 132675596343104 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0424 12:25:02.507006 132675596343104 base_pytree_checkpoint_handler.py:153] [process=4][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.007238s
I0424 12:25:02.507108 132675596343104 base_pytree_checkpoint_handler.py:128] [process=4] /jax/checkpoint/write/blocking_gbytes_per_sec: 91.792 KiB/s (total gbytes: 25.6 KiB) (time elapsed: 278 milliseconds) (per-host)
I0424 12:25:02.507150 132675596343104 base_pytree_checkpoint_handler.py:732] [process=4][thread=MainThread] Initiated Pytree async_save. Time taken: 0.278543s (batch_requests_ready=0.267433s, total_serialization_initiated=0.011046s, others=0.000064s)
I0424 12:25:02.507238 132675596343104 composite_checkpoint_handler.py:715] [process=4][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.282600s (all_items=0.000014s, per_item={'model_params': '0.00001097', 'optimizer_state': '0.00000286'}, temp_paths=0.282586)
I0424 12:25:02.508320 132519722735360 async_checkpointer.py:79] [process=4][thread=async_save] Background save thread started.
I0424 12:25:02.508451 132675596343104 async_checkpointer.py:561] Finished blocking save. Time taken: 0.791806s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints/5.
I0424 12:25:02.880069 132675596343104 checkpoint_manager.py:1549] [process=4][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0424 12:25:02.880442 132519194257152 async_checkpointer.py:265] [process=4][thread=save_finalize] Waiting for background save thread=async_save.
I0424 12:25:02.880630 132675596343104 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777033495.4289477, 'wait_for_prev_duration_secs': 6.2860426902771, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777033501.7166176, 'checkpointer_blocking_duration_secs': 0.7920107841491699, 'get_old_steps_start_time': 1777033502.50865, 'get_old_steps_duration_secs': 8.225440979003906e-05, 'checkpoint_manager_blocking_start_time': 1777033486.658235, 'checkpoint_manager_blocking_duration_secs': 16.222357511520386}
I0424 12:25:02.880831 132675596343104 checkpoint_manager.py:1994] [process=4][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0424 12:25:07.641768 132519202649856 array_metadata_store.py:203] [process=4][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints/5/model_params/array_metadatas/process_4
I0424 12:25:07.642987 132519722735360 base_pytree_checkpoint_handler.py:128] [process=4] /jax/checkpoint/write/gbytes_per_sec: 4.721 KiB/s (total gbytes: 25.6 KiB) (time elapsed: 5 seconds) (per-host)
I0424 12:25:07.678002 132519697557248 array_metadata_store.py:203] [process=4][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_02_sft_nnx_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_4
I0424 12:25:07.679237 132519722735360 base_pytree_checkpoint_handler.py:128] [process=4] /jax/checkpoint/write/gbytes_per_sec: 14.076 KiB/s (total gbytes: 76.7 KiB) (time elapsed: 5 seconds) (per-host)
I0424 12:25:07.679342 132519722735360 async_checkpointer.py:90] [process=4][thread=async_save] 4 Handler Commit operations completed. Time taken: 5.170805s.
I0424 12:25:16.860482 132519722735360 async_checkpointer.py:144] [process=4][thread=async_save] Background save thread done. Time taken: 14.351932s.
I0424 12:25:16.860782 132519194257152 async_checkpointer.py:273] [process=4][thread=save_finalize] Done with waiting for background save thread=async_save.
I0424 12:25:16.860968 132519194257152 async_checkpointer.py:283] [process=4][thread=save_finalize] No errors found in background save thread=async_save.
I0424 12:25:16.861021 132519194257152 checkpoint_manager.py:2103] [process=4][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0424 12:25:16.862541 132519194257152 checkpoint_manager.py:2112] [process=4][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0424 12:25:16.862709 132675596343104 checkpoint_manager.py:2006] [process=4][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0424 12:25:16.862882 132675596343104 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=132675596343104 count=1 at 0x789f76067980>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7891c6dc6720>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7891c6dc7d40>, _write_futures=[])
I0424 12:25:16.863337 132675596343104 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=132675596343104 count=1 at 0x789f76067980>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7891c6dc6720>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7891c6dc7d40>, _write_futures=[])
I0424 12:25:16.863369 132675596343104 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=132675596343104 count=1 at 0x789f76067980>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7891c6dc6720>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7891c6dc7d40>, _write_futures=[])
I0424 12:25:17.146060 132521861826304 grain_pool.py:542] Grain pool is exiting.
I0424 12:25:17.146159 132521861826304 grain_pool.py:547] Shutting down multiprocessing system.
I0424 12:25:19.212762 132521861826304 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
XPK End: Fri Apr 24 12:25:29 UTC 2026
EXIT_CODE=0