XPK Start: Thu Apr 23 13:07:26 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-23 13:07:55.310949: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0423 13:07:55.556886 134955252619072 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-23 13:08:04,598:jax._src.distributed:149: Starting JAX distributed service on [::]:8482
I0423 13:08:04.598017 134955252619072 distributed.py:149] Starting JAX distributed service on [::]:8482
INFO:2026-04-23 13:08:04,600:jax._src.distributed:166: Connecting to JAX distributed service on mt-02-sft-nnx-ckpt-50b92-slice-job-0-0.mt-02-sft-nnx-ckpt-50b92:8482
I0423 13:08:04.600408 134955252619072 distributed.py:166] Connecting to JAX distributed service on mt-02-sft-nnx-ckpt-50b92-slice-job-0-0.mt-02-sft-nnx-ckpt-50b92:8482
I0423 13:08:05.820332 134955252619072 max_utils.py:284] Jax distributed system initialized!
I0423 13:08:11.827588 134955252619072 max_utils.py:800] System Information: Jax Version: 0.8.3
I0423 13:08:11.827695 134955252619072 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0423 13:08:11.827735 134955252619072 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0423 13:08:11.831160 134955252619072 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0423 13:08:11.928582 134955252619072 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0423 13:08:12.029806 134955252619072 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0423 13:08:13.076778 134955252619072 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0423 13:08:13.077231 134955252619072 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7abd02150350>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 13:08:13.077297 134955252619072 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
W0423 13:08:13.626614 134955252619072 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items/_CHECKPOINT_METADATA
I0423 13:08:14.203839 1943 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0423 13:08:15.430728 134955252619072 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
W0423 13:08:16.341614 134955252619072 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0423 13:08:16.342019 134955252619072 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: decoder/decoder_norm/bias/value, decoder/decoder_norm/scale/value, decoder/dropout/rngs/aqt/count/value, decoder/dropout/rngs/aqt/key/value, decoder/dropout/rngs/dropout/count/value, decoder/dropout/rngs/dropout/key/value, decoder/dropout/rngs/params/count/value, decoder/dropout/rngs/params/key/value, decoder/layers/dropout/rngs/aqt/count/value, decoder/layers/dropout/rngs/aqt/key/value, decoder/layers/dropout/rngs/dropout/count/value, decoder/layers/dropout/rngs/dropout/key/value, decoder/layers/dropout/rngs/params/count/value, decoder/layers/dropout/rngs/params/key/value, decoder/layers/mlp/dropout/rngs/aqt/count/value, decoder/layers/mlp/dropout/rngs/aqt/key/value, decoder/layers/mlp/dropout/rngs/dropout/count/value, decoder/layers/mlp/dropout/rngs/dropout/key/value, decoder/layers/mlp/dropout/rngs/params/count/value, decoder/layers/mlp/dropout/rngs/params/key/value, decoder/layers/mlp/mlp_layer_norm/bias/value, decoder/layers/mlp/mlp_layer_norm/scale/value, decoder/layers/mlp/wi/bias/value, decoder/layers/mlp/wi/kernel/value, decoder/layers/mlp/wo/bias/value, decoder/layers/mlp/wo/kernel/value, decoder/layers/pre_self_attention_norm/bias/value, decoder/layers/pre_self_attention_norm/scale/value, decoder/layers/rngs/aqt/count/value, decoder/layers/rngs/aqt/key/value, decoder/layers/rngs/dropout/count/value, decoder/layers/rngs/dropout/key/value, decoder/layers/rngs/params/count/value, decoder/layers/rngs/params/key/value, decoder/layers/self_attention/out/bias/value, decoder/layers/self_attention/out/kernel/value, decoder/layers/self_attention/qkv_proj/bias/value, decoder/layers/self_attention/qkv_proj/kernel/value, decoder/position_embedder/embedding/value, decoder/rngs/aqt/count/value, decoder/rngs/aqt/key/value, decoder/rngs/dropout/count/value, decoder/rngs/dropout/key/value, decoder/rngs/params/count/value, decoder/rngs/params/key/value, token_embedder/embedding/value
I0423 13:08:16.813759 134955252619072 checkpointer.py:318] Finished restoring checkpoint in 1.76 seconds from gs://lance-maxtext/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_ckpt_feat_nnx_trainstate_and_training_loop_20260411_044231/nnx_feat_nnx_trainstate_and_training_loop_20260411_044231_08_checkpoint_async_true/checkpoints/9/items.
I0423 13:08:16.882718 134955252619072 config.py:112] TensorFlow version 2.20.0 available.
I0423 13:08:16.883208 134955252619072 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, smaller than number of host loading data. This is known to lead to inefficient dataloading. Seegithub.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
warnings.warn(
E0423 13:08:22.281252 134955252619072 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0423 13:08:22.281587 134955252619072 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0423 13:08:22.667058 134955252619072 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0423 13:08:22.667204 134955252619072 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7abd02150350>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 13:08:22.667250 134955252619072 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0423 13:08:22.667289 134955252619072 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7abd02150350>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 13:08:22.667333 134955252619072 checkpoint_manager.py:702] [process=3][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab25ca62cf0>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7aa4e94bd790>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7aa4e94c3c20>}, handler_registry=None
I0423 13:08:22.667552 134955252619072 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab25ca62cf0>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0423 13:08:22.667598 134955252619072 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7aa4e94bd790>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0423 13:08:22.667628 134955252619072 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7aa4e94c3c20>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0423 13:08:22.667653 134955252619072 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7aa4e94bc920>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0423 13:08:22.667681 134955252619072 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab25ca62cf0>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7ab25ca62cf0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7aa4e94bd790>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7aa4e94bd790>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7aa4e94c3c20>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7aa4e94c3c20>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7aa4e94bc920>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7aa4e94bc920>}).
I0423 13:08:22.668148 134955252619072 async_checkpointer.py:177] [process=3][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7aa4e9a13e20> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0423 13:08:24.825660 134955252619072 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints
I0423 13:08:24.846487 134955252619072 checkpoint_manager.py:921] [process=3][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7aa4e94c0590>
I0423 13:08:24.846777 134955252619072 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0423 13:08:25.260251 134955252619072 peft_trainer.py:594] Compiled train_step cache size: 0
I0423 13:08:25.264273 134955252619072 metric_logger.py:301] number parameters: 0.000 billion
I0423 13:08:25.266629 134801595098880 grain_pool.py:367] Grain pool will use 1 processes.
I0423 13:08:25.293027 134801595098880 grain_pool.py:440] Grain pool will start child processes.
Per train step:
Total TFLOPs: 0.00
split as 54.29% learnable weight flops and 45.71% attention flops
I0423 13:08:25.298132 134801595098880 grain_pool.py:448] Grain pool started all child processes.
2026-04-23 13:08:29.303966: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-23 13:08:29.349242: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-23 13:08:30.528807: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-23 13:08:34.710101: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0423 13:08:43.262109 134955252619072 checkpoint_manager.py:1983] [process=3][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0423 13:08:43.265223 134955252619072 checkpoint_manager.py:1501] [process=3] Saving checkpoint at step 1
I0423 13:08:43.268623 134955252619072 async_checkpointer.py:452] [process=3] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints/1.
I0423 13:08:43.839519 134955252619072 signaling_client.py:364] Using JaxDistributedSignalingClient
I0423 13:08:43.840767 134955252619072 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0423 13:08:43.840828 134955252619072 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 13:08:44.495863 134955252619072 base_pytree_checkpoint_handler.py:153] [process=3][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.657433s
I0423 13:08:44.497904 134955252619072 base_pytree_checkpoint_handler.py:128] [process=3] /jax/checkpoint/write/blocking_gbytes_per_sec: 27.925 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 914 milliseconds) (per-host)
I0423 13:08:44.497971 134955252619072 base_pytree_checkpoint_handler.py:732] [process=3][thread=MainThread] Initiated Pytree async_save. Time taken: 0.914386s (batch_requests_ready=0.249076s, total_serialization_initiated=0.665193s, others=0.000116s)
I0423 13:08:44.498962 134955252619072 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0423 13:08:44.499023 134955252619072 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 13:08:44.511044 134955252619072 base_pytree_checkpoint_handler.py:153] [process=3][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012954s
I0423 13:08:44.511161 134955252619072 base_pytree_checkpoint_handler.py:128] [process=3] /jax/checkpoint/write/blocking_gbytes_per_sec: 82.876 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 924 milliseconds) (per-host)
I0423 13:08:44.511208 134955252619072 base_pytree_checkpoint_handler.py:732] [process=3][thread=MainThread] Initiated Pytree async_save. Time taken: 0.924269s (batch_requests_ready=0.909109s, total_serialization_initiated=0.015088s, others=0.000072s)
I0423 13:08:44.511330 134955252619072 composite_checkpoint_handler.py:715] [process=3][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.931881s (all_items=0.000025s, per_item={'model_params': '0.00002074', 'optimizer_state': '0.00000405'}, temp_paths=0.931856)
I0423 13:08:44.512389 134799486347008 async_checkpointer.py:79] [process=3][thread=async_save] Background save thread started.
I0423 13:08:44.512561 134955252619072 async_checkpointer.py:561] Finished blocking save. Time taken: 1.247263s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints/1.
I0423 13:08:44.569146 134955252619072 checkpoint_manager.py:1549] [process=3][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0423 13:08:44.569508 134800006432512 async_checkpointer.py:265] [process=3][thread=save_finalize] Waiting for background save thread=async_save.
I0423 13:08:44.569673 134955252619072 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776949723.2620754, 'wait_for_prev_duration_secs': 0.00014328956604003906, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776949723.265266, 'checkpointer_blocking_duration_secs': 1.2474048137664795, 'get_old_steps_start_time': 1776949724.5126932, 'get_old_steps_duration_secs': 7.939338684082031e-05, 'checkpoint_manager_blocking_start_time': 1776949723.0629992, 'checkpoint_manager_blocking_duration_secs': 1.5066375732421875}
I0423 13:08:44.712443 134955252619072 peft_trainer.py:474] Train step 1 training loss: 5.871947 - training perplexity: 354.939331
I0423 13:08:44.712718 134955252619072 max_utils.py:750]
Memstats: After params initialized:
I0423 13:08:44.712782 134955252619072 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_10(process=3,(2,2,0,0))
I0423 13:08:44.712822 134955252619072 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_11(process=3,(3,2,0,0))
I0423 13:08:44.712849 134955252619072 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_14(process=3,(2,3,0,0))
I0423 13:08:44.712877 134955252619072 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_15(process=3,(3,3,0,0))
I0423 13:08:44.845169 134955252619072 metric_logger.py:196] completed step: 1, seconds: 19.449, TFLOP/s/device: 0.000, Tokens/s/device: 52.652, total_weights: 21054, loss: 5.872, lm_loss: 0.000, perplexity: 0.000
I0423 13:08:44.855197 134955252619072 peft_trainer.py:474] Train step 2 training loss: 5.539500 - training perplexity: 254.550751
I0423 13:08:44.856035 134955252619072 metric_logger.py:196] completed step: 2, seconds: 0.143, TFLOP/s/device: 0.002, Tokens/s/device: 7183.760, total_weights: 21455, loss: 5.540, lm_loss: 0.000, perplexity: 0.000
I0423 13:08:44.942371 134955252619072 peft_trainer.py:474] Train step 3 training loss: 5.440901 - training perplexity: 230.649963
I0423 13:08:44.943309 134955252619072 metric_logger.py:196] completed step: 3, seconds: 0.087, TFLOP/s/device: 0.003, Tokens/s/device: 11742.941, total_weights: 22025, loss: 5.441, lm_loss: 0.000, perplexity: 0.000
I0423 13:08:44.967655 134955252619072 peft_trainer.py:474] Train step 4 training loss: 5.596125 - training perplexity: 269.380554
I0423 13:08:44.968360 134955252619072 metric_logger.py:196] completed step: 4, seconds: 0.025, TFLOP/s/device: 0.009, Tokens/s/device: 40654.801, total_weights: 23787, loss: 5.596, lm_loss: 0.000, perplexity: 0.000
I0423 13:08:44.983994 134955252619072 peft_trainer.py:733] Train loop finished in: 19.7186 seconds
I0423 13:08:44.984918 134955252619072 peft_trainer.py:474] Train step 5 training loss: 5.670155 - training perplexity: 290.079498
I0423 13:08:44.985591 134955252619072 metric_logger.py:196] completed step: 5, seconds: 0.017, TFLOP/s/device: 0.013, Tokens/s/device: 59402.296, total_weights: 20141, loss: 5.670, lm_loss: 0.000, perplexity: 0.000
I0423 13:08:49.070760 134799989647104 array_metadata_store.py:203] [process=3][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints/1/model_params/array_metadatas/process_3
I0423 13:08:49.071955 134799486347008 base_pytree_checkpoint_handler.py:128] [process=3] /jax/checkpoint/write/gbytes_per_sec: 4.652 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0423 13:08:49.526509 134799494739712 array_metadata_store.py:203] [process=3][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_3
I0423 13:08:49.527657 134799486347008 base_pytree_checkpoint_handler.py:128] [process=3] /jax/checkpoint/write/gbytes_per_sec: 12.893 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0423 13:08:49.527762 134799486347008 async_checkpointer.py:90] [process=3][thread=async_save] 4 Handler Commit operations completed. Time taken: 5.015254s.
I0423 13:08:53.812693 134955252619072 checkpoint_manager.py:1994] [process=3][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0423 13:09:00.246749 134799486347008 async_checkpointer.py:144] [process=3][thread=async_save] Background save thread done. Time taken: 15.734227s.
I0423 13:09:00.247071 134800006432512 async_checkpointer.py:273] [process=3][thread=save_finalize] Done with waiting for background save thread=async_save.
I0423 13:09:00.247189 134800006432512 async_checkpointer.py:283] [process=3][thread=save_finalize] No errors found in background save thread=async_save.
I0423 13:09:00.247238 134800006432512 checkpoint_manager.py:2103] [process=3][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0423 13:09:00.248691 134800006432512 checkpoint_manager.py:2112] [process=3][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0423 13:09:00.248866 134955252619072 checkpoint_manager.py:2006] [process=3][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0423 13:09:00.249000 134955252619072 checkpoint_manager.py:1441] Waiting for previous save to complete took 6.436326 seconds. If this number is high, consider checkpointing less frequently.
I0423 13:09:00.250657 134955252619072 checkpoint_manager.py:1501] [process=3] Saving checkpoint at step 5
I0423 13:09:00.254475 134955252619072 async_checkpointer.py:452] [process=3] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints/5.
I0423 13:09:00.818815 134955252619072 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0423 13:09:00.818915 134955252619072 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 13:09:00.831692 134955252619072 base_pytree_checkpoint_handler.py:153] [process=3][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.013743s
I0423 13:09:00.835181 134955252619072 base_pytree_checkpoint_handler.py:128] [process=3] /jax/checkpoint/write/blocking_gbytes_per_sec: 275.679 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 277 milliseconds) (per-host)
I0423 13:09:00.835244 134955252619072 base_pytree_checkpoint_handler.py:732] [process=3][thread=MainThread] Initiated Pytree async_save. Time taken: 0.277937s (batch_requests_ready=0.258338s, total_serialization_initiated=0.019495s, others=0.000103s)
I0423 13:09:00.837083 134955252619072 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0423 13:09:00.837138 134955252619072 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 13:09:00.842110 134955252619072 base_pytree_checkpoint_handler.py:153] [process=3][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.006763s
I0423 13:09:00.842211 134955252619072 base_pytree_checkpoint_handler.py:128] [process=3] /jax/checkpoint/write/blocking_gbytes_per_sec: 88.796 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 287 milliseconds) (per-host)
I0423 13:09:00.842254 134955252619072 base_pytree_checkpoint_handler.py:732] [process=3][thread=MainThread] Initiated Pytree async_save. Time taken: 0.287586s (batch_requests_ready=0.277141s, total_serialization_initiated=0.010381s, others=0.000064s)
I0423 13:09:00.842367 134955252619072 composite_checkpoint_handler.py:715] [process=3][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.291587s (all_items=0.000013s, per_item={'model_params': '0.00001001', 'optimizer_state': '0.00000262'}, temp_paths=0.291574)
I0423 13:09:00.843444 134800006432512 async_checkpointer.py:79] [process=3][thread=async_save] Background save thread started.
I0423 13:09:00.843601 134955252619072 async_checkpointer.py:561] Finished blocking save. Time taken: 0.592874s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints/5.
I0423 13:09:01.225340 134955252619072 checkpoint_manager.py:1549] [process=3][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0423 13:09:01.225759 134799486347008 async_checkpointer.py:265] [process=3][thread=save_finalize] Waiting for background save thread=async_save.
I0423 13:09:01.225935 134955252619072 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776949733.812652, 'wait_for_prev_duration_secs': 6.436326026916504, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776949740.250698, 'checkpointer_blocking_duration_secs': 0.5930197238922119, 'get_old_steps_start_time': 1776949740.8437412, 'get_old_steps_duration_secs': 8.058547973632812e-05, 'checkpoint_manager_blocking_start_time': 1776949724.9900837, 'checkpoint_manager_blocking_duration_secs': 16.23581576347351}
I0423 13:09:01.226135 134955252619072 checkpoint_manager.py:1994] [process=3][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0423 13:09:05.559466 134799989647104 array_metadata_store.py:203] [process=3][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_3
I0423 13:09:05.570657 134799981254400 array_metadata_store.py:203] [process=3][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_nnx_ckpt/checkpoints/5/model_params/array_metadatas/process_3
I0423 13:09:05.571907 134800006432512 base_pytree_checkpoint_handler.py:128] [process=3] /jax/checkpoint/write/gbytes_per_sec: 5.089 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0423 13:09:05.572052 134800006432512 base_pytree_checkpoint_handler.py:128] [process=3] /jax/checkpoint/write/gbytes_per_sec: 15.274 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0423 13:09:05.572090 134800006432512 async_checkpointer.py:90] [process=3][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.728535s.
I0423 13:09:16.298198 134800006432512 async_checkpointer.py:144] [process=3][thread=async_save] Background save thread done. Time taken: 15.454628s.
I0423 13:09:16.298518 134799486347008 async_checkpointer.py:273] [process=3][thread=save_finalize] Done with waiting for background save thread=async_save.
I0423 13:09:16.298638 134799486347008 async_checkpointer.py:283] [process=3][thread=save_finalize] No errors found in background save thread=async_save.
I0423 13:09:16.298687 134799486347008 checkpoint_manager.py:2103] [process=3][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0423 13:09:16.300201 134799486347008 checkpoint_manager.py:2112] [process=3][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0423 13:09:16.300373 134955252619072 checkpoint_manager.py:2006] [process=3][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0423 13:09:16.300557 134955252619072 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=134955252619072 count=1 at 0x7ab25cc3e3c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7aa4e94c12b0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7aa4e94c1550>, _write_futures=[])
I0423 13:09:16.301013 134955252619072 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=134955252619072 count=1 at 0x7ab25cc3e3c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7aa4e94c12b0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7aa4e94c1550>, _write_futures=[])
I0423 13:09:16.301045 134955252619072 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=134955252619072 count=1 at 0x7ab25cc3e3c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7aa4e94c12b0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7aa4e94c1550>, _write_futures=[])
I0423 13:09:16.477388 134801595098880 grain_pool.py:542] Grain pool is exiting.
I0423 13:09:16.477519 134801595098880 grain_pool.py:547] Shutting down multiprocessing system.
I0423 13:09:18.606884 134801595098880 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
XPK End: Thu Apr 23 13:09:32 UTC 2026
EXIT_CODE=0