XPK Start: Wed Apr 22 12:59:42 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
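The `rope_scaling` warnings above all point the same way: numeric fields supplied as ints, and a `rope_theta` key the 'yarn' rope_type does not accept. A hedged sketch of a config cleanup that would satisfy those checks (the helper name and the allowed-key set are illustrative, not the validator's actual implementation):

```python
def sanitize_yarn_rope_scaling(cfg: dict) -> dict:
    # Assumed key set for rope_type='yarn'; the warning only tells us
    # that `rope_theta` is NOT in it.
    allowed = {"rope_type", "factor", "beta_fast", "beta_slow",
               "original_max_position_embeddings"}
    out = {k: v for k, v in cfg.items() if k in allowed}
    for key in ("factor", "beta_fast", "beta_slow"):
        if key in out:
            out[key] = float(out[key])  # ints like 40, 32, 1 become floats
    if out.get("factor", 1.0) < 1.0:
        raise ValueError("`rope_scaling`'s factor must be a float >= 1")
    return out

# Values taken from the warnings above; `rope_theta` is the stray key.
clean = sanitize_yarn_rope_scaling(
    {"rope_type": "yarn", "factor": 40, "beta_fast": 32,
     "beta_slow": 1, "rope_theta": 10000.0}
)
```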
2026-04-22 13:00:11.551352: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0422 13:00:11.794559 133038495418176 max_utils.py:273] Attempting to initialize the jax distributed system...
I0422 13:00:20.837136 133038495418176 distributed.py:149] Starting JAX distributed service on [::]:8482
I0422 13:00:20.839601 133038495418176 distributed.py:166] Connecting to JAX distributed service on mt-02-sft-nnx-ckpt-93ive-slice-job-0-0.mt-02-sft-nnx-ckpt-93ive:8482
I0422 13:00:22.278229 133038495418176 max_utils.py:284] Jax distributed system initialized!
I0422 13:00:28.535176 133038495418176 max_utils.py:800] System Information: Jax Version: 0.8.3
I0422 13:00:28.535288 133038495418176 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0422 13:00:28.535328 133038495418176 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0422 13:00:28.538643 133038495418176 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0422 13:00:28.634830 133038495418176 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0422 13:00:28.737160 133038495418176 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0422 13:00:29.780133 133038495418176 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0422 13:00:29.780568 133038495418176 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x78febd790fb0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0422 13:00:29.780627 133038495418176 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
W0422 13:00:30.300580 133038495418176 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items/_CHECKPOINT_METADATA
I0422 13:00:30.846841 1951 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0422 13:00:32.119863 133038495418176 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
W0422 13:00:33.056910 133038495418176 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0422 13:00:33.057310 133038495418176 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: decoder/decoder_norm/bias/value, decoder/decoder_norm/scale/value, decoder/dropout/rngs/aqt/count/value, decoder/dropout/rngs/aqt/key/value, decoder/dropout/rngs/dropout/count/value, decoder/dropout/rngs/dropout/key/value, decoder/dropout/rngs/params/count/value, decoder/dropout/rngs/params/key/value, decoder/layers/dropout/rngs/aqt/count/value, decoder/layers/dropout/rngs/aqt/key/value, decoder/layers/dropout/rngs/dropout/count/value, decoder/layers/dropout/rngs/dropout/key/value, decoder/layers/dropout/rngs/params/count/value, decoder/layers/dropout/rngs/params/key/value, decoder/layers/mlp/dropout/rngs/aqt/count/value, decoder/layers/mlp/dropout/rngs/aqt/key/value, decoder/layers/mlp/dropout/rngs/dropout/count/value, decoder/layers/mlp/dropout/rngs/dropout/key/value, decoder/layers/mlp/dropout/rngs/params/count/value, decoder/layers/mlp/dropout/rngs/params/key/value, decoder/layers/mlp/mlp_layer_norm/bias/value, decoder/layers/mlp/mlp_layer_norm/scale/value, decoder/layers/mlp/wi/bias/value, decoder/layers/mlp/wi/kernel/value, decoder/layers/mlp/wo/bias/value, decoder/layers/mlp/wo/kernel/value, decoder/layers/pre_self_attention_norm/bias/value, decoder/layers/pre_self_attention_norm/scale/value, decoder/layers/rngs/aqt/count/value, decoder/layers/rngs/aqt/key/value, decoder/layers/rngs/dropout/count/value, decoder/layers/rngs/dropout/key/value, decoder/layers/rngs/params/count/value, decoder/layers/rngs/params/key/value, decoder/layers/self_attention/out/bias/value, decoder/layers/self_attention/out/kernel/value, decoder/layers/self_attention/qkv_proj/bias/value, decoder/layers/self_attention/qkv_proj/kernel/value, decoder/position_embedder/embedding/value, decoder/rngs/aqt/count/value, decoder/rngs/aqt/key/value, decoder/rngs/dropout/count/value, decoder/rngs/dropout/key/value, 
decoder/rngs/params/count/value, decoder/rngs/params/key/value, token_embedder/embedding/value
I0422 13:00:33.934225 133038495418176 checkpointer.py:318] Finished restoring checkpoint in 2.21 seconds from gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
I0422 13:00:34.003111 133038495418176 config.py:112] TensorFlow version 2.20.0 available.
I0422 13:00:34.003601 133038495418176 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, smaller than the number of hosts loading data. This is known to lead to inefficient dataloading. See github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
warnings.warn(
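The UserWarning above fires because 3 dataset shards cannot keep every data-loading host busy. A minimal version of that check (the helper is ours, not MaxText's actual code in input_pipeline_utils.py; the 8-host count is an assumption consistent with the 32-device job):

```python
def dataloading_is_efficient(num_shards: int, num_data_hosts: int) -> bool:
    # Each data-loading host should receive at least one shard;
    # fewer shards than hosts leaves some hosts idle or contended.
    return num_shards >= num_data_hosts

# 3 shards across an assumed 8 data-loading hosts trips the warning.
ok = dataloading_is_efficient(num_shards=3, num_data_hosts=8)
```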
E0422 13:00:39.338555 133038495418176 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0422 13:00:39.338905 133038495418176 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0422 13:00:39.727427 133038495418176 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0422 13:00:39.727566 133038495418176 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x78febd790fb0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0422 13:00:39.727612 133038495418176 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0422 13:00:39.727647 133038495418176 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x78febd790fb0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0422 13:00:39.727719 133038495418176 checkpoint_manager.py:702] [process=5][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x78feb4a03a40>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x78e6854394f0>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x78e68543b0b0>}, handler_registry=None
I0422 13:00:39.727968 133038495418176 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x78feb4a03a40>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0422 13:00:39.728015 133038495418176 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x78e6854394f0>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0422 13:00:39.728044 133038495418176 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x78e68543b0b0>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0422 13:00:39.728070 133038495418176 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x78e6858b6750>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0422 13:00:39.728096 133038495418176 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x78feb4a03a40>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x78feb4a03a40>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x78e6854394f0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x78e6854394f0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x78e68543b0b0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x78e68543b0b0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x78e6858b6750>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x78e6858b6750>}).
I0422 13:00:39.728562 133038495418176 async_checkpointer.py:177] [process=5][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x78e685803e20> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0422 13:00:42.562472 133038495418176 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints
I0422 13:00:42.945302 133038495418176 checkpoint_manager.py:921] [process=5][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x78e68543bc50>
I0422 13:00:42.945680 133038495418176 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
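The mesh above factors the job's 32 devices as data=4 × fsdp=8, with every other axis equal to 1. A quick sanity check that the axis sizes multiply out to the device count reported earlier in the log (Num_devices: 32):

```python
import math

# Axis names and sizes copied from the Mesh(...) line above.
mesh_axes = {
    "diloco": 1, "data": 4, "stage": 1, "fsdp": 8, "fsdp_transpose": 1,
    "sequence": 1, "context": 1, "context_autoregressive": 1,
    "tensor": 1, "tensor_transpose": 1, "tensor_sequence": 1,
    "expert": 1, "autoregressive": 1,
}
num_devices = math.prod(mesh_axes.values())
```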
I0422 13:00:43.368405 133038495418176 peft_trainer.py:594] Compiled train_step cache size: 0
I0422 13:00:43.372548 133038495418176 metric_logger.py:301] number parameters: 0.000 billion
I0422 13:00:43.374942 132884006434560 grain_pool.py:367] Grain pool will use 1 processes.
I0422 13:00:43.400787 132884006434560 grain_pool.py:440] Grain pool will start child processes.
Per train step:
Total TFLOPs: 0.00
split as 54.29% learnable weight flops and 45.71% attention flops
I0422 13:00:43.405350 132884006434560 grain_pool.py:448] Grain pool started all child processes.
2026-04-22 13:00:47.382670: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-22 13:00:47.426949: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-22 13:00:48.582610: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-22 13:00:52.778406: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0422 13:01:02.084625 133038495418176 checkpoint_manager.py:1983] [process=5][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0422 13:01:02.087452 133038495418176 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 1
I0422 13:01:02.091103 133038495418176 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints/1.
I0422 13:01:02.657137 133038495418176 signaling_client.py:364] Using JaxDistributedSignalingClient
I0422 13:01:02.658152 133038495418176 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0422 13:01:02.658216 133038495418176 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0422 13:01:03.316769 133038495418176 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.659725s
I0422 13:01:03.320426 133038495418176 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 82.957 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 923 milliseconds) (per-host)
I0422 13:01:03.320497 133038495418176 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.923402s (batch_requests_ready=0.255707s, total_serialization_initiated=0.667578s, others=0.000116s)
I0422 13:01:03.322631 133038495418176 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0422 13:01:03.322708 133038495418176 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0422 13:01:03.327596 133038495418176 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.006978s
I0422 13:01:03.327731 133038495418176 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 27.340 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 933 milliseconds) (per-host)
I0422 13:01:03.327789 133038495418176 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.933911s (batch_requests_ready=0.923072s, total_serialization_initiated=0.010750s, others=0.000089s)
I0422 13:01:03.327937 133038495418176 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.937804s (all_items=0.000023s, per_item={'model_params': '0.00001836', 'optimizer_state': '0.00000429'}, temp_paths=0.937781)
I0422 13:01:03.328971 132881858950912 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0422 13:01:03.329144 133038495418176 async_checkpointer.py:561] Finished blocking save. Time taken: 1.241615s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints/1.
I0422 13:01:03.769786 133038495418176 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0422 13:01:03.770144 132882370643712 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0422 13:01:03.770300 133038495418176 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776862862.0845919, 'wait_for_prev_duration_secs': 0.0001461505889892578, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776862862.0874946, 'checkpointer_blocking_duration_secs': 1.2417566776275635, 'get_old_steps_start_time': 1776862863.3292751, 'get_old_steps_duration_secs': 8.225440979003906e-05, 'checkpoint_manager_blocking_start_time': 1776862861.0605001, 'checkpoint_manager_blocking_duration_secs': 2.709761381149292}
I0422 13:01:03.912183 133038495418176 peft_trainer.py:474] Train step 1 training loss: 5.871947 - training perplexity: 354.939331
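The trainer logs loss and perplexity side by side; perplexity is simply exp(loss). Reproducing step 1's numbers from the line above:

```python
import math

loss = 5.871947          # training loss at step 1
perplexity = math.exp(loss)  # matches the logged 354.939331
```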
I0422 13:01:03.912446 133038495418176 max_utils.py:750]
Memstats: After params initialized:
I0422 13:01:03.912510 133038495418176 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_18(process=5,(2,4,0,0))
I0422 13:01:03.912547 133038495418176 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_19(process=5,(3,4,0,0))
I0422 13:01:03.912575 133038495418176 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_22(process=5,(2,5,0,0))
I0422 13:01:03.912609 133038495418176 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_23(process=5,(3,5,0,0))
I0422 13:01:04.043869 133038495418176 metric_logger.py:196] completed step: 1, seconds: 20.540, TFLOP/s/device: 0.000, Tokens/s/device: 49.854, total_weights: 21054, loss: 5.872, lm_loss: 0.000, perplexity: 0.000
I0422 13:01:04.052904 133038495418176 peft_trainer.py:474] Train step 2 training loss: 5.539500 - training perplexity: 254.550751
I0422 13:01:04.053633 133038495418176 metric_logger.py:196] completed step: 2, seconds: 0.140, TFLOP/s/device: 0.002, Tokens/s/device: 7290.637, total_weights: 21455, loss: 5.540, lm_loss: 0.000, perplexity: 0.000
I0422 13:01:04.150990 133038495418176 peft_trainer.py:474] Train step 3 training loss: 5.440901 - training perplexity: 230.649963
I0422 13:01:04.152022 133038495418176 metric_logger.py:196] completed step: 3, seconds: 0.098, TFLOP/s/device: 0.002, Tokens/s/device: 10433.529, total_weights: 22025, loss: 5.441, lm_loss: 0.000, perplexity: 0.000
I0422 13:01:04.176780 133038495418176 peft_trainer.py:474] Train step 4 training loss: 5.596125 - training perplexity: 269.380554
I0422 13:01:04.177545 133038495418176 metric_logger.py:196] completed step: 4, seconds: 0.026, TFLOP/s/device: 0.009, Tokens/s/device: 39828.445, total_weights: 23787, loss: 5.596, lm_loss: 0.000, perplexity: 0.000
I0422 13:01:04.193574 133038495418176 peft_trainer.py:733] Train loop finished in: 20.8199 seconds
I0422 13:01:04.194503 133038495418176 peft_trainer.py:474] Train step 5 training loss: 5.670155 - training perplexity: 290.079498
I0422 13:01:04.195142 133038495418176 metric_logger.py:196] completed step: 5, seconds: 0.018, TFLOP/s/device: 0.012, Tokens/s/device: 57865.164, total_weights: 20141, loss: 5.670, lm_loss: 0.000, perplexity: 0.000
I0422 13:01:08.143780 132882345465600 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints/1/model_params/array_metadatas/process_5
I0422 13:01:08.144964 132881858950912 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.439 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0422 13:01:08.162787 132864515471104 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_5
I0422 13:01:08.163979 132881858950912 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 13.282 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0422 13:01:08.164059 132881858950912 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.834972s.
I0422 13:01:13.503256 133038495418176 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0422 13:01:20.030346 132881858950912 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 16.701244s.
I0422 13:01:20.030576 132882370643712 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0422 13:01:20.030634 132882370643712 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0422 13:01:20.030700 132882370643712 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0422 13:01:20.032581 132882370643712 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0422 13:01:20.032747 133038495418176 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0422 13:01:20.032877 133038495418176 checkpoint_manager.py:1441] Waiting for previous save to complete took 6.529642 seconds. If this number is high, consider checkpointing less frequently.
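The warning above reports 6.53 seconds spent blocked on the previous async save. A rough model of the trade-off it hints at (numbers from the log; the helper is illustrative, not an Orbax API): with an effective save cost of w seconds every k steps of s seconds each, checkpointing consumes about w / (w + k·s) of the wall clock.

```python
def save_overhead_fraction(wait_secs: float, step_secs: float,
                           save_interval_steps: int) -> float:
    # Fraction of wall-clock time spent waiting on saves, assuming one
    # wait of `wait_secs` per `save_interval_steps` training steps.
    return wait_secs / (wait_secs + save_interval_steps * step_secs)

# Saving every 4 steps at ~0.1 s/step would make the wait dominate...
frequent = save_overhead_fraction(6.53, 0.1, 4)
# ...while the configured save_interval_steps=10000 makes it negligible.
rare = save_overhead_fraction(6.53, 0.1, 10000)
```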
I0422 13:01:20.034513 133038495418176 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 5
I0422 13:01:20.037888 133038495418176 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints/5.
I0422 13:01:20.986932 133038495418176 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0422 13:01:20.987034 133038495418176 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0422 13:01:20.999386 133038495418176 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.013571s
I0422 13:01:21.002694 133038495418176 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 290.193 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 263 milliseconds) (per-host)
I0422 13:01:21.002757 133038495418176 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.264053s (batch_requests_ready=0.245099s, total_serialization_initiated=0.018839s, others=0.000115s)
I0422 13:01:21.004592 133038495418176 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0422 13:01:21.004646 133038495418176 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0422 13:01:21.009467 133038495418176 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.006607s
I0422 13:01:21.009573 133038495418176 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 93.361 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 273 milliseconds) (per-host)
I0422 13:01:21.009615 133038495418176 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.273527s (batch_requests_ready=0.263419s, total_serialization_initiated=0.010043s, others=0.000064s)
I0422 13:01:21.009745 133038495418176 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.277868s (all_items=0.000014s, per_item={'model_params': '0.00001121', 'optimizer_state': '0.00000286'}, temp_paths=0.277854)
I0422 13:01:21.010637 132882370643712 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0422 13:01:21.010800 133038495418176 async_checkpointer.py:561] Finished blocking save. Time taken: 0.976210s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints/5.
I0422 13:01:21.388980 133038495418176 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0422 13:01:21.389331 132881858950912 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0422 13:01:21.389513 133038495418176 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776862873.5032136, 'wait_for_prev_duration_secs': 6.529641628265381, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776862880.0345595, 'checkpointer_blocking_duration_secs': 0.9763450622558594, 'get_old_steps_start_time': 1776862881.010927, 'get_old_steps_duration_secs': 7.987022399902344e-05, 'checkpoint_manager_blocking_start_time': 1776862864.199604, 'checkpoint_manager_blocking_duration_secs': 17.189870595932007}
I0422 13:01:21.389722 133038495418176 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0422 13:01:26.534847 132882345465600 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_5
I0422 13:01:26.563265 132882379036416 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123915/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260422_123915_02_sft_nnx_ckpt/checkpoints/5/model_params/array_metadatas/process_5
I0422 13:01:26.564480 132882370643712 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.381 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0422 13:01:26.564625 132882370643712 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 13.147 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0422 13:01:26.564673 132882370643712 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 5.553916s.
I0422 13:01:36.822934 132882370643712 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 15.812162s.
I0422 13:01:36.823228 132881858950912 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0422 13:01:36.823352 132881858950912 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0422 13:01:36.823400 132881858950912 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0422 13:01:36.825092 132881858950912 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0422 13:01:36.825284 133038495418176 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0422 13:01:36.825439 133038495418176 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=133038495418176 count=1 at 0x78e686bb3e40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x78e685439940>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x78e685439610>, _write_futures=[])
I0422 13:01:36.825894 133038495418176 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=133038495418176 count=1 at 0x78e686bb3e40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x78e685439940>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x78e685439610>, _write_futures=[])
I0422 13:01:36.825930 133038495418176 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=133038495418176 count=1 at 0x78e686bb3e40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x78e685439940>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x78e685439610>, _write_futures=[])
I0422 13:01:37.189992 132884006434560 grain_pool.py:542] Grain pool is exiting.
I0422 13:01:37.190103 132884006434560 grain_pool.py:547] Shutting down multiprocessing system.
I0422 13:01:39.287593 132884006434560 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
XPK End: Wed Apr 22 13:01:49 UTC 2026
EXIT_CODE=0