Log Summary

XPK Start: Thu Apr 23 12:55:12 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
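The four messages above are config-validation complaints about a yarn `rope_scaling` dict: integer values where floats are expected, plus a stray `rope_theta` key. A minimal stdlib sketch of such a check (the validator and its allowed-key set are ours, not the library's; only the message wording is taken from the log):

```python
# Hypothetical sketch of a yarn `rope_scaling` validator that reproduces the
# warnings in the log above. ALLOWED_YARN_KEYS is an assumption, not the
# library's actual whitelist.
ALLOWED_YARN_KEYS = {"rope_type", "factor", "beta_fast", "beta_slow",
                     "original_max_position_embeddings"}

def check_yarn_rope_scaling(cfg):
    warnings = []
    extra = set(cfg) - ALLOWED_YARN_KEYS
    if extra:
        warnings.append(
            f"Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {extra}")
    factor = cfg.get("factor")
    if not isinstance(factor, float) or factor < 1:
        warnings.append(
            f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")
    for field in ("beta_fast", "beta_slow"):
        if not isinstance(cfg.get(field), float):
            warnings.append(
                f"`rope_scaling`'s {field} field must be a float, got {cfg.get(field)}")
    return warnings

# Integer-valued fields plus an unexpected `rope_theta` key trigger all four
# messages, matching the log:
bad = {"rope_type": "yarn", "factor": 40, "beta_fast": 32,
       "beta_slow": 1, "rope_theta": 10000.0}
assert len(check_yarn_rope_scaling(bad)) == 4

# Writing the same values as floats and dropping `rope_theta` passes cleanly:
good = {"rope_type": "yarn", "factor": 40.0, "beta_fast": 32.0, "beta_slow": 1.0}
assert check_yarn_rope_scaling(good) == []
```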
2026-04-23 12:55:41.857005: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0423 12:55:42.104268 135903583459136 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-23 12:55:51,145:jax._src.distributed:149: Starting JAX distributed service on [::]:8482
I0423 12:55:51.145292 135903583459136 distributed.py:149] Starting JAX distributed service on [::]:8482
INFO:2026-04-23 12:55:51,147:jax._src.distributed:166: Connecting to JAX distributed service on mt-02-sft-linen-ckpt-8izk2-slice-job-0-0.mt-02-sft-linen-ckpt-8izk2:8482
I0423 12:55:51.147675 135903583459136 distributed.py:166] Connecting to JAX distributed service on mt-02-sft-linen-ckpt-8izk2-slice-job-0-0.mt-02-sft-linen-ckpt-8izk2:8482
I0423 12:55:52.332746 135903583459136 max_utils.py:284] Jax distributed system initialized!
I0423 12:55:57.559135 135903583459136 max_utils.py:800] System Information: Jax Version: 0.8.3
I0423 12:55:57.559240 135903583459136 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0423 12:55:57.559279 135903583459136 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0423 12:55:57.562673 135903583459136 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0423 12:55:57.750952 135903583459136 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0423 12:55:58.836117 135903583459136 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0423 12:55:58.836592 135903583459136 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7b99cf0332f0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 12:55:58.836666 135903583459136 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
W0423 12:55:59.384341 135903583459136 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA
I0423 12:56:00.769885    1887 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0423 12:56:01.911513 135903583459136 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
W0423 12:56:04.184477 135903583459136 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0423 12:56:04.184868 135903583459136 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/to_nnx__rngs/aqt/count, params/params/decoder/to_nnx__rngs/aqt/key, params/params/decoder/to_nnx__rngs/dropout/count, params/params/decoder/to_nnx__rngs/dropout/key, params/params/decoder/to_nnx__rngs/params/count, params/params/decoder/to_nnx__rngs/params/key
I0423 12:56:04.237839 135903583459136 checkpointer.py:318] Finished restoring checkpoint in 2.70 seconds from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0423 12:56:04.303481 135903583459136 config.py:112] TensorFlow version 2.20.0 available.
I0423 12:56:04.304032 135903583459136 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, smaller than the number of hosts loading data. This is known to lead to inefficient dataloading. See github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
  warnings.warn(
E0423 12:56:09.624505 135903583459136 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0423 12:56:09.624843 135903583459136 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0423 12:56:10.009964 135903583459136 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0423 12:56:10.010113 135903583459136 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7b99cf0332f0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 12:56:10.010160 135903583459136 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0423 12:56:10.010195 135903583459136 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7b99cf0332f0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 12:56:10.010238 135903583459136 checkpoint_manager.py:702] [process=5][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b8187bd79e0>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b8187bd5e50>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b8187bd57f0>}, handler_registry=None
I0423 12:56:10.010451 135903583459136 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b8187bd79e0>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0423 12:56:10.010497 135903583459136 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b8187bd5e50>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0423 12:56:10.010525 135903583459136 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b8187bd57f0>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0423 12:56:10.010551 135903583459136 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b8187bd4c50>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0423 12:56:10.010579 135903583459136 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b8187bd79e0>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b8187bd79e0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b8187bd5e50>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b8187bd5e50>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b8187bd57f0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b8187bd57f0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b8187bd4c50>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b8187bd4c50>}).
I0423 12:56:10.011001 135903583459136 async_checkpointer.py:177] [process=5][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7b8187c63d80> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0423 12:56:12.333786 135903583459136 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints
I0423 12:56:12.635106 135903583459136 checkpoint_manager.py:921] [process=5][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7b8187bd5a90>
I0423 12:56:12.635457 135903583459136 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
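The mesh axis sizes in the line above multiply out to the 32 devices reported at startup (`Num_devices: 32`): only `data` (4) and `fsdp` (8) are greater than 1. A quick stdlib check of that arithmetic:

```python
import math

# Axis sizes copied from the Mesh(...) line in the log above.
mesh_axes = {"diloco": 1, "data": 4, "stage": 1, "fsdp": 8,
             "fsdp_transpose": 1, "sequence": 1, "context": 1,
             "context_autoregressive": 1, "tensor": 1, "tensor_transpose": 1,
             "tensor_sequence": 1, "expert": 1, "autoregressive": 1}

# The product of all axis sizes must equal the device count reported by
# maxtext_utils (Num_devices: 32, shape (1, 4, 1, 8, 1, ...)).
assert math.prod(mesh_axes.values()) == 32
```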
I0423 12:56:13.045555 135903583459136 peft_trainer.py:594] Compiled train_step cache size: 0
I0423 12:56:13.047518 135903583459136 metric_logger.py:301] number parameters: 0.000 billion
I0423 12:56:13.049959 135753891825408 grain_pool.py:367] Grain pool will use 1 processes.
I0423 12:56:13.075951 135753891825408 grain_pool.py:440] Grain pool will start child processes.
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
I0423 12:56:13.081192 135753891825408 grain_pool.py:448] Grain pool started all child processes.
2026-04-23 12:56:17.116479: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-23 12:56:17.160918: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-23 12:56:18.333896: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-23 12:56:22.611823: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0423 12:56:30.890399 135903583459136 checkpoint_manager.py:1983] [process=5][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0423 12:56:30.892240 135903583459136 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 1
I0423 12:56:30.895316 135903583459136 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints/1.
I0423 12:56:31.511891 135903583459136 signaling_client.py:364] Using JaxDistributedSignalingClient
I0423 12:56:31.512871 135903583459136 jax_array_handlers.py:347] Scheduling D2H of 22 prioritized jax.Array.
I0423 12:56:31.512930 135903583459136 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 12:56:32.193057 135903583459136 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.681499s
I0423 12:56:32.195153 135903583459136 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 26.632 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 958 milliseconds) (per-host)
I0423 12:56:32.195230 135903583459136 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.958775s (batch_requests_ready=0.271569s, total_serialization_initiated=0.687082s, others=0.000124s)
I0423 12:56:32.196119 135903583459136 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0423 12:56:32.196177 135903583459136 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 12:56:32.207704 135903583459136 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012381s
I0423 12:56:32.207825 135903583459136 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 79.029 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 969 milliseconds) (per-host)
I0423 12:56:32.207874 135903583459136 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.969262s (batch_requests_ready=0.954632s, total_serialization_initiated=0.014554s, others=0.000076s)
I0423 12:56:32.208020 135903583459136 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.975690s (all_items=0.000023s, per_item={'model_params': '0.00001907', 'optimizer_state': '0.00000405'}, temp_paths=0.975667)
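The `/jax/checkpoint/write/blocking_gbytes_per_sec` figures above are simply total bytes written divided by elapsed time; e.g. 76.6 KiB over ~0.969 s is about 79 KiB/s. A sanity check with the logged values (both are rounded in the log, so the comparison is loose):

```python
# Blocking-write throughput from the log above: total KiB / elapsed seconds.
# Both quantities are rounded in the log, so agreement is checked to ~1%.
def throughput_kib_per_s(total_kib, elapsed_s):
    return total_kib / elapsed_s

# model_params save: 25.5 KiB in 958 ms, logged as 26.632 KiB/s.
assert abs(throughput_kib_per_s(25.5, 0.958) - 26.632) / 26.632 < 0.01
# optimizer_state save: 76.6 KiB in 969 ms, logged as 79.029 KiB/s.
assert abs(throughput_kib_per_s(76.6, 0.969) - 79.029) / 79.029 < 0.01
```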
I0423 12:56:32.208954 135751224256256 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0423 12:56:32.209120 135903583459136 async_checkpointer.py:561] Finished blocking save. Time taken: 1.316809s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints/1.
I0423 12:56:32.252032 135903583459136 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0423 12:56:32.252336 135751744341760 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0423 12:56:32.252473 135903583459136 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776948990.8903651, 'wait_for_prev_duration_secs': 0.00013947486877441406, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776948990.8922815, 'checkpointer_blocking_duration_secs': 1.3169474601745605, 'get_old_steps_start_time': 1776948992.2092519, 'get_old_steps_duration_secs': 8.344650268554688e-05, 'checkpoint_manager_blocking_start_time': 1776948990.8095233, 'checkpoint_manager_blocking_duration_secs': 1.4429125785827637}
I0423 12:56:32.392016 135903583459136 peft_trainer.py:474] Train step 1 training loss: 5.926076  - training perplexity: 374.681519
I0423 12:56:32.392288 135903583459136 max_utils.py:750] 
Memstats: After params initialized:
I0423 12:56:32.392351 135903583459136 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_18(process=5,(2,4,0,0))
I0423 12:56:32.392389 135903583459136 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_19(process=5,(3,4,0,0))
I0423 12:56:32.392419 135903583459136 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_22(process=5,(2,5,0,0))
I0423 12:56:32.392445 135903583459136 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_23(process=5,(3,5,0,0))
I0423 12:56:32.524714 135903583459136 metric_logger.py:196] completed step: 1, seconds: 19.345, TFLOP/s/device: 0.000, Tokens/s/device: 52.934, total_weights: 21054, loss: 5.926, lm_loss: 0.000, perplexity: 0.000
I0423 12:56:32.533430 135903583459136 peft_trainer.py:474] Train step 2 training loss: 5.643813  - training perplexity: 282.537903
I0423 12:56:32.534193 135903583459136 metric_logger.py:196] completed step: 2, seconds: 0.141, TFLOP/s/device: 0.002, Tokens/s/device: 7254.885, total_weights: 21455, loss: 5.644, lm_loss: 0.000, perplexity: 0.000
I0423 12:56:32.598898 135903583459136 peft_trainer.py:474] Train step 3 training loss: 5.503054  - training perplexity: 245.440414
I0423 12:56:32.599704 135903583459136 metric_logger.py:196] completed step: 3, seconds: 0.066, TFLOP/s/device: 0.003, Tokens/s/device: 15633.223, total_weights: 22025, loss: 5.503, lm_loss: 0.000, perplexity: 0.000
I0423 12:56:32.617624 135903583459136 peft_trainer.py:474] Train step 4 training loss: 5.624704  - training perplexity: 277.190186
I0423 12:56:32.618371 135903583459136 metric_logger.py:196] completed step: 4, seconds: 0.019, TFLOP/s/device: 0.012, Tokens/s/device: 54792.990, total_weights: 23787, loss: 5.625, lm_loss: 0.000, perplexity: 0.000
I0423 12:56:32.632104 135903583459136 peft_trainer.py:733] Train loop finished in: 19.5834 seconds
I0423 12:56:32.633059 135903583459136 peft_trainer.py:474] Train step 5 training loss: 5.765533  - training perplexity: 319.109253
I0423 12:56:32.633754 135903583459136 metric_logger.py:196] completed step: 5, seconds: 0.015, TFLOP/s/device: 0.014, Tokens/s/device: 66495.743, total_weights: 20141, loss: 5.766, lm_loss: 0.000, perplexity: 0.000
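The training perplexity that peft_trainer prints next to each loss is just exp(loss). The five (loss, perplexity) pairs logged above are consistent to about four decimal places:

```python
import math

# (training loss, training perplexity) pairs copied from the peft_trainer
# lines above, steps 1 through 5.
steps = [(5.926076, 374.681519),
         (5.643813, 282.537903),
         (5.503054, 245.440414),
         (5.624704, 277.190186),
         (5.765533, 319.109253)]

# Perplexity is exp(cross-entropy loss); each logged pair agrees to ~1e-4.
for loss, ppl in steps:
    assert abs(math.exp(loss) - ppl) / ppl < 1e-4
```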
I0423 12:56:36.719671 135751232648960 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_5
I0423 12:56:37.082970 135751727556352 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints/1/model_params/array_metadatas/process_5
I0423 12:56:37.083972 135751224256256 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.366 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0423 12:56:37.084114 135751224256256 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 13.103 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0423 12:56:37.084153 135751224256256 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.875091s.
I0423 12:56:41.351712 135903583459136 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0423 12:56:48.389304 135751224256256 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 16.180225s.
I0423 12:56:48.389528 135751744341760 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0423 12:56:48.389583 135751744341760 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0423 12:56:48.389629 135751744341760 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0423 12:56:48.391534 135751744341760 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0423 12:56:48.391699 135903583459136 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0423 12:56:48.391826 135903583459136 checkpoint_manager.py:1441] Waiting for previous save to complete took 7.040169 seconds. If this number is high, consider checkpointing less frequently.
I0423 12:56:48.393448 135903583459136 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 5
I0423 12:56:48.397028 135903583459136 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints/5.
I0423 12:56:49.355170 135903583459136 jax_array_handlers.py:347] Scheduling D2H of 22 prioritized jax.Array.
I0423 12:56:49.355265 135903583459136 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 12:56:49.360384 135903583459136 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.006392s
I0423 12:56:49.362153 135903583459136 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 97.553 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 261 milliseconds) (per-host)
I0423 12:56:49.362211 135903583459136 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.261810s (batch_requests_ready=0.252060s, total_serialization_initiated=0.009654s, others=0.000096s)
I0423 12:56:49.363061 135903583459136 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0423 12:56:49.363114 135903583459136 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 12:56:49.374779 135903583459136 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012480s
I0423 12:56:49.374887 135903583459136 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 280.535 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 273 milliseconds) (per-host)
I0423 12:56:49.374943 135903583459136 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.273101s (batch_requests_ready=0.258651s, total_serialization_initiated=0.014373s, others=0.000077s)
I0423 12:56:49.375067 135903583459136 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.278583s (all_items=0.000013s, per_item={'model_params': '0.00001049', 'optimizer_state': '0.00000238'}, temp_paths=0.278570)
I0423 12:56:49.375909 135751224256256 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0423 12:56:49.376086 135903583459136 async_checkpointer.py:561] Finished blocking save. Time taken: 0.982568s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints/5.
I0423 12:56:49.821777 135903583459136 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0423 12:56:49.822171 135751744341760 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0423 12:56:49.822358 135903583459136 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776949001.351634, 'wait_for_prev_duration_secs': 7.0401694774627686, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776949008.3934894, 'checkpointer_blocking_duration_secs': 0.982698917388916, 'get_old_steps_start_time': 1776949009.3762083, 'get_old_steps_duration_secs': 7.963180541992188e-05, 'checkpoint_manager_blocking_start_time': 1776948992.6361942, 'checkpoint_manager_blocking_duration_secs': 17.18612790107727}
I0423 12:56:49.822553 135903583459136 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0423 12:56:54.917907 135751719163648 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_5
I0423 12:56:54.986263 135751232648960 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 22 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_linen_xpk_feat_nnx_post_train_fixes_20260423_124550_02_sft_linen_ckpt/checkpoints/5/model_params/array_metadatas/process_5
I0423 12:56:54.987387 135751224256256 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.337 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0423 12:56:54.987561 135751224256256 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 13.014 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0423 12:56:54.987613 135751224256256 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 5.611578s.
I0423 12:57:04.134320 135751224256256 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 14.758272s.
I0423 12:57:04.134606 135751744341760 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0423 12:57:04.134748 135751744341760 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0423 12:57:04.134851 135751744341760 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0423 12:57:04.136623 135751744341760 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0423 12:57:04.136830 135903583459136 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0423 12:57:04.136983 135903583459136 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=135903583459136 count=1 at 0x7b972f1f7480>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7b8187bd5e20>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7b8187bd48f0>, _write_futures=[])
I0423 12:57:04.137419 135903583459136 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=135903583459136 count=1 at 0x7b972f1f7480>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7b8187bd5e20>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7b8187bd48f0>, _write_futures=[])
I0423 12:57:04.137453 135903583459136 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=135903583459136 count=1 at 0x7b972f1f7480>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7b8187bd5e20>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7b8187bd48f0>, _write_futures=[])
I0423 12:57:04.627272 135753891825408 grain_pool.py:542] Grain pool is exiting.
I0423 12:57:04.627384 135753891825408 grain_pool.py:547] Shutting down multiprocessing system.
I0423 12:57:06.724222 135753891825408 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
XPK End: Thu Apr 23 12:57:18 UTC 2026
EXIT_CODE=0