MaxView

← Back to run

Log Summary

XPK Start: Thu Apr 23 13:10:35 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-23 13:11:05.017014: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0423 13:11:05.265665 140641550411584 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-23 13:11:14,308:jax._src.distributed:149: Starting JAX distributed service on [::]:8482
I0423 13:11:14.308335 140641550411584 distributed.py:149] Starting JAX distributed service on [::]:8482
INFO:2026-04-23 13:11:14,310:jax._src.distributed:166: Connecting to JAX distributed service on mt-03-sft-linen-ckpt-dkv66-slice-job-0-0.mt-03-sft-linen-ckpt-dkv66:8482
I0423 13:11:14.310850 140641550411584 distributed.py:166] Connecting to JAX distributed service on mt-03-sft-linen-ckpt-dkv66-slice-job-0-0.mt-03-sft-linen-ckpt-dkv66:8482
I0423 13:11:16.019471 140641550411584 max_utils.py:284] Jax distributed system initialized!
I0423 13:11:21.202701 140641550411584 max_utils.py:800] System Information: Jax Version: 0.8.3
I0423 13:11:21.202809 140641550411584 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0423 13:11:21.202850 140641550411584 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0423 13:11:21.206285 140641550411584 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0423 13:11:21.303313 140641550411584 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0423 13:11:21.406452 140641550411584 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0423 13:11:22.451458 140641550411584 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0423 13:11:22.451913 140641550411584 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7fe8f3d44290>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 13:11:22.451978 140641550411584 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
W0423 13:11:22.993803 140641550411584 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA
I0423 13:11:23.530677    1899 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0423 13:11:24.688009 140641550411584 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
W0423 13:11:26.565143 140641550411584 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0423 13:11:26.565584 140641550411584 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/dropout/rngs/aqt/count, params/params/decoder/dropout/rngs/aqt/key, params/params/decoder/dropout/rngs/dropout/count, params/params/decoder/dropout/rngs/dropout/key, params/params/decoder/dropout/rngs/params/count, params/params/decoder/dropout/rngs/params/key, params/params/decoder/layers/dropout/rngs/aqt/count, params/params/decoder/layers/dropout/rngs/aqt/key, params/params/decoder/layers/dropout/rngs/dropout/count, params/params/decoder/layers/dropout/rngs/dropout/key, params/params/decoder/layers/dropout/rngs/params/count, params/params/decoder/layers/dropout/rngs/params/key, params/params/decoder/layers/mlp/dropout/rngs/aqt/count, params/params/decoder/layers/mlp/dropout/rngs/aqt/key, params/params/decoder/layers/mlp/dropout/rngs/dropout/count, params/params/decoder/layers/mlp/dropout/rngs/dropout/key, params/params/decoder/layers/mlp/dropout/rngs/params/count, params/params/decoder/layers/mlp/dropout/rngs/params/key, params/params/decoder/layers/rngs/aqt/count, params/params/decoder/layers/rngs/aqt/key, params/params/decoder/layers/rngs/dropout/count, params/params/decoder/layers/rngs/dropout/key, params/params/decoder/layers/rngs/params/count, params/params/decoder/layers/rngs/params/key, params/params/decoder/rngs/aqt/count, params/params/decoder/rngs/aqt/key, params/params/decoder/rngs/dropout/count, params/params/decoder/rngs/dropout/key, params/params/decoder/rngs/params/count, params/params/decoder/rngs/params/key
I0423 13:11:27.221196 140641550411584 checkpointer.py:318] Finished restoring checkpoint in 2.93 seconds from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0423 13:11:27.290212 140641550411584 config.py:112] TensorFlow version 2.20.0 available.
I0423 13:11:27.290718 140641550411584 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, smaller than number of host loading data. This is known to lead to inefficient dataloading. Seegithub.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
  warnings.warn(
E0423 13:11:32.667534 140641550411584 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0423 13:11:32.667858 140641550411584 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0423 13:11:33.057585 140641550411584 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0423 13:11:33.057733 140641550411584 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7fe8f3d44290>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 13:11:33.057780 140641550411584 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0423 13:11:33.057814 140641550411584 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7fe8f3d44290>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 13:11:33.057857 140641550411584 checkpoint_manager.py:702] [process=6][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7fd1b15fc2f0>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7fd1b1fa3bf0>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7fd1b0887c20>}, handler_registry=None
I0423 13:11:33.058066 140641550411584 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7fd1b15fc2f0>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0423 13:11:33.058120 140641550411584 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7fd1b1fa3bf0>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0423 13:11:33.058150 140641550411584 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7fd1b0887c20>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0423 13:11:33.058175 140641550411584 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7fd1b0885490>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0423 13:11:33.058203 140641550411584 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7fd1b15fc2f0>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7fd1b15fc2f0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7fd1b1fa3bf0>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7fd1b1fa3bf0>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7fd1b0887c20>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7fd1b0887c20>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7fd1b0885490>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7fd1b0885490>}).
I0423 13:11:33.058603 140641550411584 async_checkpointer.py:177] [process=6][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7fd1b08d3ec0> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0423 13:11:35.496193 140641550411584 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints
I0423 13:11:35.953405 140641550411584 checkpoint_manager.py:921] [process=6][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7fd1b088fcb0>
I0423 13:11:35.953773 140641550411584 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0423 13:11:36.372918 140641550411584 peft_trainer.py:594] Compiled train_step cache size: 0
I0423 13:11:36.376967 140641550411584 metric_logger.py:301] number parameters: 0.000 billion
I0423 13:11:36.379363 140490519340800 grain_pool.py:367] Grain pool will use 1 processes.
I0423 13:11:36.405623 140490519340800 grain_pool.py:440] Grain pool will start child processes.
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
I0423 13:11:36.410967 140490519340800 grain_pool.py:448] Grain pool started all child processes.
2026-04-23 13:11:40.459070: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-23 13:11:40.504168: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-23 13:11:41.664553: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-23 13:11:45.861248: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0423 13:11:54.250796 140641550411584 checkpoint_manager.py:1983] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0423 13:11:54.252628 140641550411584 checkpoint_manager.py:1501] [process=6] Saving checkpoint at step 1
I0423 13:11:54.255739 140641550411584 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints/1.
I0423 13:11:54.803625 140641550411584 signaling_client.py:364] Using JaxDistributedSignalingClient
I0423 13:11:54.804674 140641550411584 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0423 13:11:54.804737 140641550411584 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 13:11:55.483221 140641550411584 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.680707s
I0423 13:11:55.485344 140641550411584 base_pytree_checkpoint_handler.py:128] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 27.182 KiB/s (total gbytes: 25.6 KiB) (time elapsed: 940 milliseconds) (per-host)
I0423 13:11:55.485412 140641550411584 base_pytree_checkpoint_handler.py:732] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.940540s (batch_requests_ready=0.251788s, total_serialization_initiated=0.688635s, others=0.000117s)
I0423 13:11:55.486345 140641550411584 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0423 13:11:55.486407 140641550411584 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 13:11:55.498747 140641550411584 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.013209s
I0423 13:11:55.498876 140641550411584 base_pytree_checkpoint_handler.py:128] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 80.662 KiB/s (total gbytes: 76.7 KiB) (time elapsed: 950 milliseconds) (per-host)
I0423 13:11:55.498925 140641550411584 base_pytree_checkpoint_handler.py:732] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.950804s (batch_requests_ready=0.935284s, total_serialization_initiated=0.015439s, others=0.000081s)
I0423 13:11:55.499058 140641550411584 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.958207s (all_items=0.000023s, per_item={'model_params': '0.00001907', 'optimizer_state': '0.00000429'}, temp_paths=0.958183)
I0423 13:11:55.499912 140487818200832 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started.
I0423 13:11:55.500024 140641550411584 async_checkpointer.py:561] Finished blocking save. Time taken: 1.247335s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints/1.
I0423 13:11:55.588564 140641550411584 checkpoint_manager.py:1549] [process=6][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0423 13:11:55.588954 140488875157248 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0423 13:11:55.589135 140641550411584 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776949914.2507691, 'wait_for_prev_duration_secs': 0.00012111663818359375, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776949914.252669, 'checkpointer_blocking_duration_secs': 1.2474925518035889, 'get_old_steps_start_time': 1776949915.5001848, 'get_old_steps_duration_secs': 7.915496826171875e-05, 'checkpoint_manager_blocking_start_time': 1776949914.2185733, 'checkpoint_manager_blocking_duration_secs': 1.370504379272461}
I0423 13:11:55.736466 140641550411584 peft_trainer.py:474] Train step 1 training loss: 5.926076  - training perplexity: 374.681519
I0423 13:11:55.736734 140641550411584 max_utils.py:750] 
Memstats: After params initialized:
I0423 13:11:55.736798 140641550411584 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_24(process=6,(0,6,0,0))
I0423 13:11:55.736835 140641550411584 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_25(process=6,(1,6,0,0))
I0423 13:11:55.736865 140641550411584 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_28(process=6,(0,7,0,0))
I0423 13:11:55.736892 140641550411584 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_29(process=6,(1,7,0,0))
I0423 13:11:55.867836 140641550411584 metric_logger.py:196] completed step: 1, seconds: 19.360, TFLOP/s/device: 0.000, Tokens/s/device: 52.893, total_weights: 21054, loss: 5.926, lm_loss: 0.000, perplexity: 0.000
I0423 13:11:55.876813 140641550411584 peft_trainer.py:474] Train step 2 training loss: 5.643813  - training perplexity: 282.537903
I0423 13:11:55.877580 140641550411584 metric_logger.py:196] completed step: 2, seconds: 0.140, TFLOP/s/device: 0.002, Tokens/s/device: 7310.038, total_weights: 21455, loss: 5.644, lm_loss: 0.000, perplexity: 0.000
I0423 13:11:55.971868 140641550411584 peft_trainer.py:474] Train step 3 training loss: 5.503054  - training perplexity: 245.440414
I0423 13:11:55.972823 140641550411584 metric_logger.py:196] completed step: 3, seconds: 0.095, TFLOP/s/device: 0.002, Tokens/s/device: 10763.617, total_weights: 22025, loss: 5.503, lm_loss: 0.000, perplexity: 0.000
I0423 13:11:55.997295 140641550411584 peft_trainer.py:474] Train step 4 training loss: 5.624704  - training perplexity: 277.190186
I0423 13:11:55.998137 140641550411584 metric_logger.py:196] completed step: 4, seconds: 0.025, TFLOP/s/device: 0.009, Tokens/s/device: 40405.906, total_weights: 23787, loss: 5.625, lm_loss: 0.000, perplexity: 0.000
I0423 13:11:56.013625 140641550411584 peft_trainer.py:733] Train loop finished in: 19.6355 seconds
I0423 13:11:56.014583 140641550411584 peft_trainer.py:474] Train step 5 training loss: 5.765533  - training perplexity: 319.109253
I0423 13:11:56.015265 140641550411584 metric_logger.py:196] completed step: 5, seconds: 0.017, TFLOP/s/device: 0.013, Tokens/s/device: 59347.804, total_weights: 20141, loss: 5.766, lm_loss: 0.000, perplexity: 0.000
I0423 13:12:00.266500 140487826593536 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_6
I0423 13:12:00.268362 140487843378944 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints/1/model_params/array_metadatas/process_6
I0423 13:12:00.269484 140487818200832 base_pytree_checkpoint_handler.py:128] [process=6] /jax/checkpoint/write/gbytes_per_sec: 4.465 KiB/s (total gbytes: 25.6 KiB) (time elapsed: 5 seconds) (per-host)
I0423 13:12:00.269636 140487818200832 base_pytree_checkpoint_handler.py:128] [process=6] /jax/checkpoint/write/gbytes_per_sec: 13.403 KiB/s (total gbytes: 76.7 KiB) (time elapsed: 5 seconds) (per-host)
I0423 13:12:00.269675 140487818200832 async_checkpointer.py:90] [process=6][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.769593s.
I0423 13:12:05.656271 140641550411584 checkpoint_manager.py:1994] [process=6][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0423 13:12:11.034541 140487818200832 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 15.534442s.
I0423 13:12:11.034847 140488875157248 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0423 13:12:11.034969 140488875157248 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0423 13:12:11.035020 140488875157248 checkpoint_manager.py:2103] [process=6][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0423 13:12:11.036684 140488875157248 checkpoint_manager.py:2112] [process=6][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0423 13:12:11.036825 140641550411584 checkpoint_manager.py:2006] [process=6][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0423 13:12:11.036899 140641550411584 checkpoint_manager.py:1441] Waiting for previous save to complete took 5.380648 seconds. If this number is high, consider checkpointing less frequently.
I0423 13:12:11.038457 140641550411584 checkpoint_manager.py:1501] [process=6] Saving checkpoint at step 5
I0423 13:12:11.041944 140641550411584 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints/5.
I0423 13:12:11.572984 140641550411584 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0423 13:12:11.573086 140641550411584 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 13:12:11.586603 140641550411584 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.014606s
I0423 13:12:11.590186 140641550411584 base_pytree_checkpoint_handler.py:128] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 284.904 KiB/s (total gbytes: 76.7 KiB) (time elapsed: 269 milliseconds) (per-host)
I0423 13:12:11.590251 140641550411584 base_pytree_checkpoint_handler.py:732] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.269274s (batch_requests_ready=0.248946s, total_serialization_initiated=0.020220s, others=0.000108s)
I0423 13:12:11.592324 140641550411584 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0423 13:12:11.592386 140641550411584 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 13:12:11.597414 140641550411584 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.007008s
I0423 13:12:11.597523 140641550411584 base_pytree_checkpoint_handler.py:128] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 91.558 KiB/s (total gbytes: 25.6 KiB) (time elapsed: 279 milliseconds) (per-host)
I0423 13:12:11.597574 140641550411584 base_pytree_checkpoint_handler.py:732] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.279261s (batch_requests_ready=0.268490s, total_serialization_initiated=0.010701s, others=0.000070s)
I0423 13:12:11.597733 140641550411584 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.283224s (all_items=0.000013s, per_item={'model_params': '0.00001073', 'optimizer_state': '0.00000238'}, temp_paths=0.283211)
I0423 13:12:11.598688 140488875157248 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started.
I0423 13:12:11.598813 140641550411584 async_checkpointer.py:561] Finished blocking save. Time taken: 0.560285s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints/5.
I0423 13:12:11.962810 140641550411584 checkpoint_manager.py:1549] [process=6][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0423 13:12:11.963227 140487818200832 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0423 13:12:11.963405 140641550411584 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776949925.6562304, 'wait_for_prev_duration_secs': 5.380648136138916, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776949931.038496, 'checkpointer_blocking_duration_secs': 0.5604267120361328, 'get_old_steps_start_time': 1776949931.5989494, 'get_old_steps_duration_secs': 8.20159912109375e-05, 'checkpoint_manager_blocking_start_time': 1776949916.019793, 'checkpoint_manager_blocking_duration_secs': 15.943576097488403}
I0423 13:12:11.963615 140641550411584 checkpoint_manager.py:1994] [process=6][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0423 13:12:16.700416 140487826593536 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_6
I0423 13:12:16.740413 140488866764544 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124550/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260423_124550_03_sft_linen_ckpt/checkpoints/5/model_params/array_metadatas/process_6
I0423 13:12:16.741426 140488875157248 base_pytree_checkpoint_handler.py:128] [process=6] /jax/checkpoint/write/gbytes_per_sec: 4.714 KiB/s (total gbytes: 25.6 KiB) (time elapsed: 5 seconds) (per-host)
I0423 13:12:16.741531 140488875157248 base_pytree_checkpoint_handler.py:128] [process=6] /jax/checkpoint/write/gbytes_per_sec: 14.148 KiB/s (total gbytes: 76.7 KiB) (time elapsed: 5 seconds) (per-host)
I0423 13:12:16.741571 140488875157248 async_checkpointer.py:90] [process=6][thread=async_save] 4 Handler Commit operations completed. Time taken: 5.142803s.
I0423 13:12:26.212680 140488875157248 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 14.613897s.
I0423 13:12:26.213005 140487818200832 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0423 13:12:26.213146 140487818200832 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0423 13:12:26.213194 140487818200832 checkpoint_manager.py:2103] [process=6][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0423 13:12:26.214863 140487818200832 checkpoint_manager.py:2112] [process=6][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0423 13:12:26.215034 140641550411584 checkpoint_manager.py:2006] [process=6][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0423 13:12:26.215198 140641550411584 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=140641550411584 count=1 at 0x7fd3a41714c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7fd1b088d4c0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7fd1b088c8f0>, _write_futures=[])
I0423 13:12:26.215652 140641550411584 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=140641550411584 count=1 at 0x7fd3a41714c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7fd1b088d4c0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7fd1b088c8f0>, _write_futures=[])
I0423 13:12:26.215685 140641550411584 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=140641550411584 count=1 at 0x7fd3a41714c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7fd1b088d4c0>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7fd1b088c8f0>, _write_futures=[])
I0423 13:12:26.507674 140490519340800 grain_pool.py:542] Grain pool is exiting.
I0423 13:12:26.507772 140490519340800 grain_pool.py:547] Shutting down multiprocessing system.
I0423 13:12:28.630166 140490519340800 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
XPK End: Thu Apr 23 13:12:41 UTC 2026
EXIT_CODE=0