XPK Start: Sat Apr 25 12:30:09 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-25 12:30:38.315764: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0425 12:30:38.563244 137721040312128 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-25 12:30:47,604:jax._src.distributed:149: Starting JAX distributed service on [::]:8482
I0425 12:30:47.604779 137721040312128 distributed.py:149] Starting JAX distributed service on [::]:8482
INFO:2026-04-25 12:30:47,607:jax._src.distributed:166: Connecting to JAX distributed service on mt-02-sft-nnx-ckpt-1gyo0-slice-job-0-0.mt-02-sft-nnx-ckpt-1gyo0:8482
I0425 12:30:47.607276 137721040312128 distributed.py:166] Connecting to JAX distributed service on mt-02-sft-nnx-ckpt-1gyo0-slice-job-0-0.mt-02-sft-nnx-ckpt-1gyo0:8482
I0425 12:30:48.964869 137721040312128 max_utils.py:284] Jax distributed system initialized!
I0425 12:30:53.980385 137721040312128 max_utils.py:800] System Information: Jax Version: 0.8.3
I0425 12:30:53.980487 137721040312128 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0425 12:30:53.980527 137721040312128 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0425 12:30:53.983950 137721040312128 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:30:54.079760 137721040312128 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:30:54.181114 137721040312128 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:30:55.234821 137721040312128 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:30:55.235283 137721040312128 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7d40f7dfc140>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:30:55.235349 137721040312128 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
W0425 12:30:55.796433 137721040312128 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items/_CHECKPOINT_METADATA
I0425 12:30:56.328379 1892 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0425 12:30:57.583790 137721040312128 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
W0425 12:30:58.499098 137721040312128 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0425 12:30:58.499504 137721040312128 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: decoder/decoder_norm/bias/value, decoder/decoder_norm/scale/value, decoder/dropout/rngs/aqt/count/value, decoder/dropout/rngs/aqt/key/value, decoder/dropout/rngs/dropout/count/value, decoder/dropout/rngs/dropout/key/value, decoder/dropout/rngs/params/count/value, decoder/dropout/rngs/params/key/value, decoder/layers/dropout/rngs/aqt/count/value, decoder/layers/dropout/rngs/aqt/key/value, decoder/layers/dropout/rngs/dropout/count/value, decoder/layers/dropout/rngs/dropout/key/value, decoder/layers/dropout/rngs/params/count/value, decoder/layers/dropout/rngs/params/key/value, decoder/layers/mlp/dropout/rngs/aqt/count/value, decoder/layers/mlp/dropout/rngs/aqt/key/value, decoder/layers/mlp/dropout/rngs/dropout/count/value, decoder/layers/mlp/dropout/rngs/dropout/key/value, decoder/layers/mlp/dropout/rngs/params/count/value, decoder/layers/mlp/dropout/rngs/params/key/value, decoder/layers/mlp/mlp_layer_norm/bias/value, decoder/layers/mlp/mlp_layer_norm/scale/value, decoder/layers/mlp/wi/bias/value, decoder/layers/mlp/wi/kernel/value, decoder/layers/mlp/wo/bias/value, decoder/layers/mlp/wo/kernel/value, decoder/layers/pre_self_attention_norm/bias/value, decoder/layers/pre_self_attention_norm/scale/value, decoder/layers/rngs/aqt/count/value, decoder/layers/rngs/aqt/key/value, decoder/layers/rngs/dropout/count/value, decoder/layers/rngs/dropout/key/value, decoder/layers/rngs/params/count/value, decoder/layers/rngs/params/key/value, decoder/layers/self_attention/out/bias/value, decoder/layers/self_attention/out/kernel/value, decoder/layers/self_attention/qkv_proj/bias/value, decoder/layers/self_attention/qkv_proj/kernel/value, decoder/position_embedder/embedding/value, decoder/rngs/aqt/count/value, decoder/rngs/aqt/key/value, decoder/rngs/dropout/count/value, decoder/rngs/dropout/key/value, decoder/rngs/params/count/value, decoder/rngs/params/key/value, token_embedder/embedding/value
I0425 12:30:59.288969 137721040312128 checkpointer.py:318] Finished restoring checkpoint in 2.08 seconds from gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
I0425 12:30:59.357434 137721040312128 config.py:112] TensorFlow version 2.20.0 available.
I0425 12:30:59.357958 137721040312128 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, smaller than number of host loading data. This is known to lead to inefficient dataloading. Seegithub.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
warnings.warn(
E0425 12:31:04.575771 137721040312128 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0425 12:31:04.576097 137721040312128 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0425 12:31:04.967498 137721040312128 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:31:04.967644 137721040312128 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7d40f7dfc140>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:31:04.967708 137721040312128 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:31:04.967744 137721040312128 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7d40f7dfc140>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:31:04.967787 137721040312128 checkpoint_manager.py:702] [process=5][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d29a9a1e780>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d29a9a2d8b0>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d29a9a2e450>}, handler_registry=None
I0425 12:31:04.968002 137721040312128 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d29a9a1e780>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0425 12:31:04.968044 137721040312128 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d29a9a2d8b0>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0425 12:31:04.968073 137721040312128 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d29a9a2e450>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0425 12:31:04.968100 137721040312128 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d29a9a1e990>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0425 12:31:04.968127 137721040312128 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d29a9a1e780>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d29a9a1e780>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d29a9a2d8b0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d29a9a2d8b0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d29a9a2e450>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d29a9a2e450>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d29a9a1e990>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d29a9a1e990>}).
I0425 12:31:04.968585 137721040312128 async_checkpointer.py:177] [process=5][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7d29a9a83e20> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0425 12:31:07.369135 137721040312128 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints
I0425 12:31:07.797802 137721040312128 checkpoint_manager.py:921] [process=5][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7d29a9a2c800>
I0425 12:31:07.798166 137721040312128 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0425 12:31:08.217335 137721040312128 peft_trainer.py:594] Compiled train_step cache size: 0
I0425 12:31:08.221392 137721040312128 metric_logger.py:301] number parameters: 0.000 billion
I0425 12:31:08.223742 137571074037504 grain_pool.py:367] Grain pool will use 1 processes.
I0425 12:31:08.254468 137571074037504 grain_pool.py:440] Grain pool will start child processes.
Per train step:
Total TFLOPs: 0.00
split as 54.29% learnable weight flops and 45.71% attention flops
I0425 12:31:08.259462 137571074037504 grain_pool.py:448] Grain pool started all child processes.
2026-04-25 12:31:12.253010: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-25 12:31:12.298020: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-25 12:31:13.460745: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-25 12:31:17.678476: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0425 12:31:26.068405 137721040312128 checkpoint_manager.py:1983] [process=5][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0425 12:31:26.070258 137721040312128 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 1
I0425 12:31:26.073368 137721040312128 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints/1.
I0425 12:31:26.658532 137721040312128 signaling_client.py:364] Using JaxDistributedSignalingClient
I0425 12:31:26.659600 137721040312128 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0425 12:31:26.659676 137721040312128 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:31:27.335042 137721040312128 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.677598s
I0425 12:31:27.337116 137721040312128 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 27.188 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 939 milliseconds) (per-host)
I0425 12:31:27.337187 137721040312128 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.939168s (batch_requests_ready=0.253523s, total_serialization_initiated=0.685529s, others=0.000117s)
I0425 12:31:27.338021 137721040312128 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0425 12:31:27.338081 137721040312128 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:31:27.349412 137721040312128 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012095s
I0425 12:31:27.349560 137721040312128 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 80.769 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 948 milliseconds) (per-host)
I0425 12:31:27.349619 137721040312128 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.948400s (batch_requests_ready=0.934012s, total_serialization_initiated=0.014290s, others=0.000097s)
I0425 12:31:27.349794 137721040312128 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.956071s (all_items=0.000023s, per_item={'model_params': '0.00001907', 'optimizer_state': '0.00000405'}, temp_paths=0.956048)
I0425 12:31:27.351283 137568892983040 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0425 12:31:27.351410 137721040312128 async_checkpointer.py:561] Finished blocking save. Time taken: 1.281094s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints/1.
I0425 12:31:27.377780 137721040312128 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0425 12:31:27.378117 137568901375744 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0425 12:31:27.379422 137721040312128 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777120286.0683715, 'wait_for_prev_duration_secs': 0.0001308917999267578, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777120286.0702977, 'checkpointer_blocking_duration_secs': 1.2811775207519531, 'get_old_steps_start_time': 1777120287.3514962, 'get_old_steps_duration_secs': 7.963180541992188e-05, 'checkpoint_manager_blocking_start_time': 1777120285.9975955, 'checkpoint_manager_blocking_duration_secs': 1.3817920684814453}
I0425 12:31:27.523934 137721040312128 peft_trainer.py:474] Train step 1 training loss: 5.871947 - training perplexity: 354.939331
I0425 12:31:27.524199 137721040312128 max_utils.py:750]
Memstats: After params initialized:
I0425 12:31:27.524266 137721040312128 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_18(process=5,(2,4,0,0))
I0425 12:31:27.524302 137721040312128 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_19(process=5,(3,4,0,0))
I0425 12:31:27.524330 137721040312128 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_22(process=5,(2,5,0,0))
I0425 12:31:27.524353 137721040312128 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_23(process=5,(3,5,0,0))
I0425 12:31:27.654816 137721040312128 metric_logger.py:196] completed step: 1, seconds: 19.303, TFLOP/s/device: 0.000, Tokens/s/device: 53.049, total_weights: 21054, loss: 5.872, lm_loss: 0.000, perplexity: 0.000
I0425 12:31:27.664032 137721040312128 peft_trainer.py:474] Train step 2 training loss: 5.539500 - training perplexity: 254.550751
I0425 12:31:27.664710 137721040312128 metric_logger.py:196] completed step: 2, seconds: 0.140, TFLOP/s/device: 0.002, Tokens/s/device: 7322.598, total_weights: 21455, loss: 5.540, lm_loss: 0.000, perplexity: 0.000
I0425 12:31:27.739207 137721040312128 peft_trainer.py:474] Train step 3 training loss: 5.440901 - training perplexity: 230.649963
I0425 12:31:27.740149 137721040312128 metric_logger.py:196] completed step: 3, seconds: 0.075, TFLOP/s/device: 0.003, Tokens/s/device: 13613.856, total_weights: 22025, loss: 5.441, lm_loss: 0.000, perplexity: 0.000
I0425 12:31:27.763957 137721040312128 peft_trainer.py:474] Train step 4 training loss: 5.596125 - training perplexity: 269.380554
I0425 12:31:27.764742 137721040312128 metric_logger.py:196] completed step: 4, seconds: 0.025, TFLOP/s/device: 0.009, Tokens/s/device: 41463.150, total_weights: 23787, loss: 5.596, lm_loss: 0.000, perplexity: 0.000
I0425 12:31:27.780138 137721040312128 peft_trainer.py:733] Train loop finished in: 19.5577 seconds
I0425 12:31:27.781143 137721040312128 peft_trainer.py:474] Train step 5 training loss: 5.670155 - training perplexity: 290.079498
I0425 12:31:27.781814 137721040312128 metric_logger.py:196] completed step: 5, seconds: 0.017, TFLOP/s/device: 0.013, Tokens/s/device: 59674.835, total_weights: 20141, loss: 5.670, lm_loss: 0.000, perplexity: 0.000
I0425 12:31:32.161107 137567861204736 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints/1/model_params/array_metadatas/process_5
I0425 12:31:32.162317 137568892983040 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.429 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:31:32.180125 137567844419328 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_5
I0425 12:31:32.181078 137568892983040 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 13.252 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:31:32.181135 137568892983040 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.829761s.
I0425 12:31:36.709986 137721040312128 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0425 12:31:43.671213 137568892983040 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 16.319822s.
I0425 12:31:43.671505 137568901375744 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0425 12:31:43.671625 137568901375744 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0425 12:31:43.671744 137568901375744 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0425 12:31:43.673094 137568901375744 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0425 12:31:43.673240 137721040312128 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0425 12:31:43.673326 137721040312128 checkpoint_manager.py:1441] Waiting for previous save to complete took 6.963352 seconds. If this number is high, consider checkpointing less frequently.
I0425 12:31:43.674959 137721040312128 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 5
I0425 12:31:43.678565 137721040312128 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints/5.
I0425 12:31:44.217561 137721040312128 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0425 12:31:44.217673 137721040312128 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:31:44.229588 137721040312128 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012979s
I0425 12:31:44.232914 137721040312128 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 284.628 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 269 milliseconds) (per-host)
I0425 12:31:44.232975 137721040312128 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.269195s (batch_requests_ready=0.250702s, total_serialization_initiated=0.018395s, others=0.000098s)
I0425 12:31:44.234755 137721040312128 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0425 12:31:44.234809 137721040312128 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:31:44.239823 137721040312128 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.006745s
I0425 12:31:44.239924 137721040312128 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 91.587 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 278 milliseconds) (per-host)
I0425 12:31:44.239965 137721040312128 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.278822s (batch_requests_ready=0.268578s, total_serialization_initiated=0.010182s, others=0.000062s)
I0425 12:31:44.240078 137721040312128 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.282825s (all_items=0.000014s, per_item={'model_params': '0.00001097', 'optimizer_state': '0.00000310'}, temp_paths=0.282811)
I0425 12:31:44.241064 137568901375744 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0425 12:31:44.241217 137721040312128 async_checkpointer.py:561] Finished blocking save. Time taken: 0.566188s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints/5.
I0425 12:31:44.603729 137721040312128 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0425 12:31:44.604118 137568892983040 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0425 12:31:44.604295 137721040312128 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777120296.709948, 'wait_for_prev_duration_secs': 6.96335244178772, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777120303.675, 'checkpointer_blocking_duration_secs': 0.566321611404419, 'get_old_steps_start_time': 1777120304.2413425, 'get_old_steps_duration_secs': 9.083747863769531e-05, 'checkpoint_manager_blocking_start_time': 1777120287.7862802, 'checkpoint_manager_blocking_duration_secs': 16.817979335784912}
I0425 12:31:44.604500 137721040312128 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0425 12:31:49.284825 137567861204736 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_5
I0425 12:31:49.812424 137567836026624 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_nnx_ckpt/checkpoints/5/model_params/array_metadatas/process_5
I0425 12:31:49.813562 137568901375744 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.363 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:31:49.813674 137568901375744 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 13.093 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:31:49.813712 137568901375744 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 5.572540s.
I0425 12:31:59.988075 137568901375744 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 15.746887s.
I0425 12:31:59.988394 137568892983040 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0425 12:31:59.988511 137568892983040 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0425 12:31:59.988557 137568892983040 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0425 12:31:59.990241 137568892983040 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0425 12:31:59.990422 137721040312128 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0425 12:31:59.990575 137721040312128 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=137721040312128 count=1 at 0x7d28303629c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7d2830553740>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7d2830553710>, _write_futures=[])
I0425 12:31:59.991041 137721040312128 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=137721040312128 count=1 at 0x7d28303629c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7d2830553740>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7d2830553710>, _write_futures=[])
I0425 12:31:59.991074 137721040312128 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=137721040312128 count=1 at 0x7d28303629c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7d2830553740>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7d2830553710>, _write_futures=[])
I0425 12:32:00.274681 137571074037504 grain_pool.py:542] Grain pool is exiting.
I0425 12:32:00.274775 137571074037504 grain_pool.py:547] Shutting down multiprocessing system.
I0425 12:32:02.383213 137571074037504 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
XPK End: Sat Apr 25 12:32:15 UTC 2026
EXIT_CODE=0