XPK Start: Sat Apr 25 12:17:36 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
2026-04-25 12:18:05.584852: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0425 12:18:05.830179 139769434462016 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-25 12:18:14,870:jax._src.distributed:149: Starting JAX distributed service on [::]:8482
I0425 12:18:14.870754 139769434462016 distributed.py:149] Starting JAX distributed service on [::]:8482
INFO:2026-04-25 12:18:14,873:jax._src.distributed:166: Connecting to JAX distributed service on mt-02-sft-linen-ckpt-vl73s-slice-job-0-0.mt-02-sft-linen-ckpt-vl73s:8482
I0425 12:18:14.873006 139769434462016 distributed.py:166] Connecting to JAX distributed service on mt-02-sft-linen-ckpt-vl73s-slice-job-0-0.mt-02-sft-linen-ckpt-vl73s:8482
I0425 12:18:16.022411 139769434462016 max_utils.py:284] Jax distributed system initialized!
I0425 12:18:22.105628 139769434462016 max_utils.py:800] System Information: Jax Version: 0.8.3
I0425 12:18:22.105753 139769434462016 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0425 12:18:22.105796 139769434462016 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0425 12:18:22.110053 139769434462016 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:18:22.303284 139769434462016 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:18:23.392350 139769434462016 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:18:23.392816 139769434462016 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7f1de5b10320>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:18:23.392874 139769434462016 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
W0425 12:18:23.907688 139769434462016 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA
I0425 12:18:24.500037 1939 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0425 12:18:25.628355 139769434462016 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
W0425 12:18:27.449693 139769434462016 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0425 12:18:27.450071 139769434462016 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/to_nnx__rngs/aqt/count, params/params/decoder/to_nnx__rngs/aqt/key, params/params/decoder/to_nnx__rngs/dropout/count, params/params/decoder/to_nnx__rngs/dropout/key, params/params/decoder/to_nnx__rngs/params/count, params/params/decoder/to_nnx__rngs/params/key
I0425 12:18:28.097709 139769434462016 checkpointer.py:318] Finished restoring checkpoint in 2.84 seconds from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0425 12:18:28.162408 139769434462016 config.py:112] TensorFlow version 2.20.0 available.
I0425 12:18:28.162910 139769434462016 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, smaller than number of host loading data. This is known to lead to inefficient dataloading. See github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
warnings.warn(
E0425 12:18:33.447805 139769434462016 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0425 12:18:33.448134 139769434462016 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0425 12:18:33.833352 139769434462016 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:18:33.833494 139769434462016 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7f1de5b10320>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:18:33.833540 139769434462016 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:18:33.833575 139769434462016 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7f1de5b10320>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:18:33.833617 139769434462016 checkpoint_manager.py:702] [process=5][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f05dce895b0>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f05dedb2fc0>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f05dce8d490>}, handler_registry=None
I0425 12:18:33.833873 139769434462016 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f05dce895b0>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0425 12:18:33.833925 139769434462016 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f05dedb2fc0>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0425 12:18:33.833959 139769434462016 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f05dce8d490>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0425 12:18:33.833985 139769434462016 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f05dce8b0e0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0425 12:18:33.834013 139769434462016 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f05dce895b0>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f05dce895b0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f05dedb2fc0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f05dedb2fc0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f05dce8d490>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f05dce8d490>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f05dce8b0e0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f05dce8b0e0>}).
I0425 12:18:33.834420 139769434462016 async_checkpointer.py:177] [process=5][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7f05dcf2fd80> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0425 12:18:36.225932 139769434462016 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints
I0425 12:18:36.255024 139769434462016 checkpoint_manager.py:921] [process=5][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7f05dce8f2c0>
I0425 12:18:36.255306 139769434462016 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0425 12:18:36.663978 139769434462016 peft_trainer.py:594] Compiled train_step cache size: 0
I0425 12:18:36.665943 139769434462016 metric_logger.py:301] number parameters: 0.000 billion
I0425 12:18:36.668756 139616459925248 grain_pool.py:367] Grain pool will use 1 processes.
I0425 12:18:36.694976 139616459925248 grain_pool.py:440] Grain pool will start child processes.
Per train step:
Total TFLOPs: 0.00
split as 54.29% learnable weight flops and 45.71% attention flops
I0425 12:18:36.700595 139616459925248 grain_pool.py:448] Grain pool started all child processes.
2026-04-25 12:18:40.753964: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-25 12:18:40.798838: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-25 12:18:41.970702: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-25 12:18:46.260373: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0425 12:18:54.613207 139769434462016 checkpoint_manager.py:1983] [process=5][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0425 12:18:54.615351 139769434462016 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 1
I0425 12:18:54.618593 139769434462016 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints/1.
I0425 12:18:55.156753 139769434462016 signaling_client.py:364] Using JaxDistributedSignalingClient
I0425 12:18:55.157744 139769434462016 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0425 12:18:55.157804 139769434462016 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:18:55.853554 139769434462016 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.696912s
I0425 12:18:55.855179 139769434462016 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 80.818 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 947 milliseconds) (per-host)
I0425 12:18:55.855257 139769434462016 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.947856s (batch_requests_ready=0.244694s, total_serialization_initiated=0.703034s, others=0.000128s)
I0425 12:18:55.856377 139769434462016 jax_array_handlers.py:347] Scheduling D2H of 22 prioritized jax.Array.
I0425 12:18:55.856436 139769434462016 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:18:55.861139 139769434462016 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.005749s
I0425 12:18:55.861254 139769434462016 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 26.706 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 956 milliseconds) (per-host)
I0425 12:18:55.861302 139769434462016 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.956072s (batch_requests_ready=0.948519s, total_serialization_initiated=0.007478s, others=0.000076s)
I0425 12:18:55.861418 139769434462016 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.960459s (all_items=0.000023s, per_item={'model_params': '0.00001812', 'optimizer_state': '0.00000453'}, temp_paths=0.960436)
I0425 12:18:55.862297 139613817534208 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0425 12:18:55.862394 139769434462016 async_checkpointer.py:561] Finished blocking save. Time taken: 1.246978s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints/1.
I0425 12:18:55.905444 139769434462016 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0425 12:18:55.905696 139614346012416 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0425 12:18:55.905788 139769434462016 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777119534.6131828, 'wait_for_prev_duration_secs': 0.00011229515075683594, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777119534.6153963, 'checkpointer_blocking_duration_secs': 1.247114658355713, 'get_old_steps_start_time': 1777119535.8625305, 'get_old_steps_duration_secs': 8.130073547363281e-05, 'checkpoint_manager_blocking_start_time': 1777119534.611785, 'checkpoint_manager_blocking_duration_secs': 1.2939724922180176}
I0425 12:18:56.048702 139769434462016 peft_trainer.py:474] Train step 1 training loss: 5.926076 - training perplexity: 374.681519
I0425 12:18:56.048969 139769434462016 max_utils.py:750]
Memstats: After params initialized:
I0425 12:18:56.049033 139769434462016 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_18(process=5,(2,4,0,0))
I0425 12:18:56.049070 139769434462016 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_19(process=5,(3,4,0,0))
I0425 12:18:56.049098 139769434462016 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_22(process=5,(2,5,0,0))
I0425 12:18:56.049124 139769434462016 max_utils.py:756] Using (GB) 0.01 / 31.25 (0.032000%) on TPU_23(process=5,(3,5,0,0))
I0425 12:18:56.179115 139769434462016 metric_logger.py:196] completed step: 1, seconds: 19.383, TFLOP/s/device: 0.000, Tokens/s/device: 52.829, total_weights: 21054, loss: 5.926, lm_loss: 0.000, perplexity: 0.000
I0425 12:18:56.187133 139769434462016 peft_trainer.py:474] Train step 2 training loss: 5.643813 - training perplexity: 282.537903
I0425 12:18:56.187939 139769434462016 metric_logger.py:196] completed step: 2, seconds: 0.138, TFLOP/s/device: 0.002, Tokens/s/device: 7411.249, total_weights: 21455, loss: 5.644, lm_loss: 0.000, perplexity: 0.000
I0425 12:18:56.261382 139769434462016 peft_trainer.py:474] Train step 3 training loss: 5.503054 - training perplexity: 245.440414
I0425 12:18:56.262305 139769434462016 metric_logger.py:196] completed step: 3, seconds: 0.074, TFLOP/s/device: 0.003, Tokens/s/device: 13783.153, total_weights: 22025, loss: 5.503, lm_loss: 0.000, perplexity: 0.000
I0425 12:18:56.281050 139769434462016 peft_trainer.py:474] Train step 4 training loss: 5.624704 - training perplexity: 277.190186
I0425 12:18:56.281896 139769434462016 metric_logger.py:196] completed step: 4, seconds: 0.020, TFLOP/s/device: 0.011, Tokens/s/device: 52232.477, total_weights: 23787, loss: 5.625, lm_loss: 0.000, perplexity: 0.000
I0425 12:18:56.295188 139769434462016 peft_trainer.py:733] Train loop finished in: 19.6278 seconds
I0425 12:18:56.296157 139769434462016 peft_trainer.py:474] Train step 5 training loss: 5.765533 - training perplexity: 319.109253
I0425 12:18:56.296886 139769434462016 metric_logger.py:196] completed step: 5, seconds: 0.015, TFLOP/s/device: 0.015, Tokens/s/device: 67824.393, total_weights: 20141, loss: 5.766, lm_loss: 0.000, perplexity: 0.000
I0425 12:18:59.938611 139614304048896 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints/1/model_params/array_metadatas/process_5
I0425 12:18:59.939819 139613817534208 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 5.071 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:18:59.943275 139614320834304 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_5
I0425 12:18:59.944218 139613817534208 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 15.207 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:18:59.944284 139613817534208 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.081836s.
I0425 12:19:05.894576 139769434462016 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0425 12:19:08.368883 139613817534208 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 12.506420s.
I0425 12:19:08.369139 139614346012416 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0425 12:19:08.369195 139614346012416 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0425 12:19:08.369243 139614346012416 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0425 12:19:08.370776 139614346012416 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0425 12:19:08.370877 139769434462016 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0425 12:19:08.370945 139769434462016 checkpoint_manager.py:1441] Waiting for previous save to complete took 2.476385 seconds. If this number is high, consider checkpointing less frequently.
I0425 12:19:08.372597 139769434462016 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 5
I0425 12:19:08.376286 139769434462016 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints/5.
I0425 12:19:08.948278 139769434462016 jax_array_handlers.py:347] Scheduling D2H of 22 prioritized jax.Array.
I0425 12:19:08.948373 139769434462016 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:19:08.953382 139769434462016 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.006279s
I0425 12:19:08.953504 139769434462016 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 98.388 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 259 milliseconds) (per-host)
I0425 12:19:08.953553 139769434462016 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.259575s (batch_requests_ready=0.251612s, total_serialization_initiated=0.007879s, others=0.000084s)
I0425 12:19:08.967609 139769434462016 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0425 12:19:08.967722 139769434462016 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:19:08.979769 139769434462016 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.013069s
I0425 12:19:08.979897 139769434462016 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 269.277 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 284 milliseconds) (per-host)
I0425 12:19:08.979944 139769434462016 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.284520s (batch_requests_ready=0.269254s, total_serialization_initiated=0.015183s, others=0.000083s)
I0425 12:19:08.980091 139769434462016 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.289893s (all_items=0.000014s, per_item={'model_params': '0.00001168', 'optimizer_state': '0.00000262'}, temp_paths=0.289879)
I0425 12:19:08.981052 139613817534208 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0425 12:19:08.981207 139769434462016 async_checkpointer.py:561] Finished blocking save. Time taken: 0.608542s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints/5.
I0425 12:19:09.399024 139769434462016 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0425 12:19:09.399408 139614346012416 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0425 12:19:09.399587 139769434462016 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777119545.8945355, 'wait_for_prev_duration_secs': 2.4763853549957275, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777119548.3726351, 'checkpointer_blocking_duration_secs': 0.6086733341217041, 'get_old_steps_start_time': 1777119548.9813287, 'get_old_steps_duration_secs': 7.987022399902344e-05, 'checkpoint_manager_blocking_start_time': 1777119536.2992046, 'checkpoint_manager_blocking_duration_secs': 13.100345373153687}
I0425 12:19:09.399780 139769434462016 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0425 12:19:13.710893 139613809141504 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_5
I0425 12:19:14.002370 139614304048896 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260425_121405_02_sft_linen_ckpt/checkpoints/5/model_params/array_metadatas/process_5
I0425 12:19:14.003470 139613817534208 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.809 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:19:14.003562 139613817534208 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 14.430 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:19:14.003599 139613817534208 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 5.022441s.
I0425 12:19:21.467305 139613817534208 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 12.486125s.
I0425 12:19:21.467682 139614346012416 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0425 12:19:21.467812 139614346012416 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0425 12:19:21.467862 139614346012416 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0425 12:19:21.469372 139614346012416 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0425 12:19:21.469512 139769434462016 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0425 12:19:21.469615 139769434462016 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=139769434462016 count=1 at 0x7f092c21fe40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7f05dce8e360>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7f05dce8f980>, _write_futures=[])
I0425 12:19:21.470059 139769434462016 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=139769434462016 count=1 at 0x7f092c21fe40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7f05dce8e360>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7f05dce8f980>, _write_futures=[])
I0425 12:19:21.470092 139769434462016 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=139769434462016 count=1 at 0x7f092c21fe40>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7f05dce8e360>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7f05dce8f980>, _write_futures=[])
I0425 12:19:21.790943 139616459925248 grain_pool.py:542] Grain pool is exiting.
I0425 12:19:21.791041 139616459925248 grain_pool.py:547] Shutting down multiprocessing system.
I0425 12:19:23.882455 139616459925248 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
XPK End: Sat Apr 25 12:19:37 UTC 2026
EXIT_CODE=0