
Log Summary

XPK Start: Fri Apr 24 12:26:27 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
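The repeated warnings above indicate the loaded model config passes integer values where the YaRN `rope_scaling` validator expects floats (`factor`, `beta_fast`, `beta_slow`), plus an extra `rope_theta` key it does not recognize. A minimal sketch of normalizing such a config before loading — `normalize_rope_scaling` and the allowed-key set are assumptions for illustration, not a MaxText or Transformers API:

```python
def normalize_rope_scaling(rope_scaling: dict) -> dict:
    """Coerce numeric YaRN fields to float and drop keys the
    validator warned about (here: 'rope_theta')."""
    allowed = {"rope_type", "factor", "beta_fast", "beta_slow",
               "original_max_position_embeddings"}
    cleaned = {k: v for k, v in rope_scaling.items() if k in allowed}
    for key in ("factor", "beta_fast", "beta_slow"):
        if key in cleaned:
            cleaned[key] = float(cleaned[key])
    return cleaned

# Mirrors the warnings: integer fields plus a stray 'rope_theta'.
cfg = {"rope_type": "yarn", "factor": 40, "beta_fast": 32,
       "beta_slow": 1, "rope_theta": 10000}
print(normalize_rope_scaling(cfg))
```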
2026-04-24 12:26:56.188336: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0424 12:26:56.435254 136590955468608 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-24 12:27:05,476:jax._src.distributed:149: Starting JAX distributed service on [::]:8482
I0424 12:27:05.476542 136590955468608 distributed.py:149] Starting JAX distributed service on [::]:8482
INFO:2026-04-24 12:27:05,478:jax._src.distributed:166: Connecting to JAX distributed service on mt-03-sft-linen-ckpt-w9280-slice-job-0-0.mt-03-sft-linen-ckpt-w9280:8482
I0424 12:27:05.478980 136590955468608 distributed.py:166] Connecting to JAX distributed service on mt-03-sft-linen-ckpt-w9280-slice-job-0-0.mt-03-sft-linen-ckpt-w9280:8482
I0424 12:27:06.503857 136590955468608 max_utils.py:284] Jax distributed system initialized!
I0424 12:27:11.606677 136590955468608 max_utils.py:800] System Information: Jax Version: 0.8.3
I0424 12:27:11.606784 136590955468608 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0424 12:27:11.606824 136590955468608 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0424 12:27:11.610430 136590955468608 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0424 12:27:11.705820 136590955468608 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0424 12:27:11.807459 136590955468608 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0424 12:27:12.855010 136590955468608 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0424 12:27:12.855462 136590955468608 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7c39d991e600>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0424 12:27:12.855521 136590955468608 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
W0424 12:27:13.368437 136590955468608 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA
I0424 12:27:13.896970    1892 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0424 12:27:15.118502 136590955468608 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
W0424 12:27:16.966301 136590955468608 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0424 12:27:16.966746 136590955468608 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/dropout/rngs/aqt/count, params/params/decoder/dropout/rngs/aqt/key, params/params/decoder/dropout/rngs/dropout/count, params/params/decoder/dropout/rngs/dropout/key, params/params/decoder/dropout/rngs/params/count, params/params/decoder/dropout/rngs/params/key, params/params/decoder/layers/dropout/rngs/aqt/count, params/params/decoder/layers/dropout/rngs/aqt/key, params/params/decoder/layers/dropout/rngs/dropout/count, params/params/decoder/layers/dropout/rngs/dropout/key, params/params/decoder/layers/dropout/rngs/params/count, params/params/decoder/layers/dropout/rngs/params/key, params/params/decoder/layers/mlp/dropout/rngs/aqt/count, params/params/decoder/layers/mlp/dropout/rngs/aqt/key, params/params/decoder/layers/mlp/dropout/rngs/dropout/count, params/params/decoder/layers/mlp/dropout/rngs/dropout/key, params/params/decoder/layers/mlp/dropout/rngs/params/count, params/params/decoder/layers/mlp/dropout/rngs/params/key, params/params/decoder/layers/rngs/aqt/count, params/params/decoder/layers/rngs/aqt/key, params/params/decoder/layers/rngs/dropout/count, params/params/decoder/layers/rngs/dropout/key, params/params/decoder/layers/rngs/params/count, params/params/decoder/layers/rngs/params/key, params/params/decoder/rngs/aqt/count, params/params/decoder/rngs/aqt/key, params/params/decoder/rngs/dropout/count, params/params/decoder/rngs/dropout/key, params/params/decoder/rngs/params/count, params/params/decoder/rngs/params/key
I0424 12:27:17.251979 136590955468608 checkpointer.py:318] Finished restoring checkpoint in 2.56 seconds from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0424 12:27:17.320402 136590955468608 config.py:112] TensorFlow version 2.20.0 available.
I0424 12:27:17.320900 136590955468608 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, fewer than the number of hosts loading data. This is known to lead to inefficient dataloading. See github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
  warnings.warn(
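The UserWarning above fires because the dataset has only 3 shards while more hosts are loading data, so some hosts cannot own a full shard. The condition it checks can be sketched as follows (the helper name and the host count of 8 are assumptions for illustration; see the linked Data_Input_Pipeline.md for the actual guidance):

```python
def shards_are_sufficient(num_shards: int, num_data_hosts: int) -> bool:
    """Each data-loading host should own at least one dataset shard;
    with fewer shards than hosts, loading becomes inefficient."""
    return num_shards >= num_data_hosts

# This run: 3 shards spread across (hypothetically) 8 data-loading hosts.
print(shards_are_sufficient(3, 8))
```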
E0424 12:27:22.488858 136590955468608 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0424 12:27:22.489177 136590955468608 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0424 12:27:22.881239 136590955468608 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0424 12:27:22.881378 136590955468608 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7c39d991e600>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0424 12:27:22.881425 136590955468608 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0424 12:27:22.881463 136590955468608 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7c39d991e600>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0424 12:27:22.881505 136590955468608 checkpoint_manager.py:702] [process=5][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7c2360929e80>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7c22c31dd280>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7c22c28bc140>}, handler_registry=None
I0424 12:27:22.881724 136590955468608 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7c2360929e80>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0424 12:27:22.881770 136590955468608 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7c22c31dd280>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0424 12:27:22.881799 136590955468608 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7c22c28bc140>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0424 12:27:22.881824 136590955468608 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7c22c28b7ef0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0424 12:27:22.881852 136590955468608 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7c2360929e80>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7c2360929e80>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7c22c31dd280>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7c22c31dd280>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7c22c28bc140>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7c22c28bc140>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7c22c28b7ef0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7c22c28b7ef0>}).
I0424 12:27:22.882240 136590955468608 async_checkpointer.py:177] [process=5][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7c22c2913ec0> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0424 12:27:25.296669 136590955468608 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints
I0424 12:27:25.298867 136590955468608 checkpoint_manager.py:921] [process=5][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7c22c28bcb90>
I0424 12:27:25.299125 136590955468608 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0424 12:27:25.718780 136590955468608 peft_trainer.py:594] Compiled train_step cache size: 0
I0424 12:27:25.722997 136590955468608 metric_logger.py:301] number parameters: 0.000 billion
I0424 12:27:25.725665 136440924317440 grain_pool.py:367] Grain pool will use 1 processes.
I0424 12:27:25.751952 136440924317440 grain_pool.py:440] Grain pool will start child processes.
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
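The per-step summary splits total FLOPs into a learnable-weight portion and an attention portion (the 0.00 TFLOPs here follows from the "0.000 billion" parameter count logged below). A common back-of-envelope estimate for the weight portion is the 6*N*D rule — an assumption about the usual accounting, not necessarily what maxtext_utils computes:

```python
def weight_tflops_per_step(num_params: int, tokens_per_step: int) -> float:
    """6 * N * D matmul FLOPs (forward + backward) for N learnable
    parameters over D tokens, expressed in TFLOPs."""
    return 6.0 * num_params * tokens_per_step / 1e12

# Hypothetical 1B-parameter model at 21054 tokens per step
# (the total_weights reported at step 1 below).
print(round(weight_tflops_per_step(1_000_000_000, 21054), 2))
```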
I0424 12:27:25.757046 136440924317440 grain_pool.py:448] Grain pool started all child processes.
2026-04-24 12:27:29.741741: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-24 12:27:29.786172: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-24 12:27:30.957985: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-24 12:27:35.177876: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0424 12:27:43.526279 136590955468608 checkpoint_manager.py:1983] [process=5][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0424 12:27:43.527974 136590955468608 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 1
I0424 12:27:43.531137 136590955468608 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints/1.
I0424 12:27:44.126005 136590955468608 signaling_client.py:364] Using JaxDistributedSignalingClient
I0424 12:27:44.127046 136590955468608 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0424 12:27:44.127115 136590955468608 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0424 12:27:44.838782 136590955468608 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.713913s
I0424 12:27:44.839620 136590955468608 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0424 12:27:44.839703 136590955468608 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0424 12:27:44.851091 136590955468608 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012199s
I0424 12:27:44.851253 136590955468608 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 25.551 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 999 milliseconds) (per-host)
I0424 12:27:44.851307 136590955468608 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.999332s (batch_requests_ready=0.264521s, total_serialization_initiated=0.734706s, others=0.000104s)
I0424 12:27:44.851439 136590955468608 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 76.882 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 996 milliseconds) (per-host)
I0424 12:27:44.851475 136590955468608 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.996307s (batch_requests_ready=0.267933s, total_serialization_initiated=0.728320s, others=0.000054s)
I0424 12:27:44.851594 136590955468608 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 1.003643s (all_items=0.000022s, per_item={'model_params': '0.00001764', 'optimizer_state': '0.00000453'}, temp_paths=1.003621)
I0424 12:27:44.852594 136438788105984 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0424 12:27:44.852728 136590955468608 async_checkpointer.py:561] Finished blocking save. Time taken: 1.324679s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints/1.
I0424 12:27:44.854674 136590955468608 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0424 12:27:44.854891 136438830069504 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0424 12:27:44.855025 136590955468608 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777033663.5262442, 'wait_for_prev_duration_secs': 0.00013971328735351562, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777033663.528016, 'checkpointer_blocking_duration_secs': 1.3248798847198486, 'get_old_steps_start_time': 1777033664.8529193, 'get_old_steps_duration_secs': 8.20159912109375e-05, 'checkpoint_manager_blocking_start_time': 1777033663.4335577, 'checkpoint_manager_blocking_duration_secs': 1.4214341640472412}
I0424 12:27:44.999363 136590955468608 peft_trainer.py:474] Train step 1 training loss: 5.926076  - training perplexity: 374.681519
I0424 12:27:44.999629 136590955468608 max_utils.py:750] 
Memstats: After params initialized:
I0424 12:27:44.999709 136590955468608 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_18(process=5,(2,4,0,0))
I0424 12:27:44.999747 136590955468608 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_19(process=5,(3,4,0,0))
I0424 12:27:44.999778 136590955468608 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_22(process=5,(2,5,0,0))
I0424 12:27:44.999803 136590955468608 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_23(process=5,(3,5,0,0))
I0424 12:27:45.129494 136590955468608 metric_logger.py:196] completed step: 1, seconds: 19.277, TFLOP/s/device: 0.000, Tokens/s/device: 53.121, total_weights: 21054, loss: 5.926, lm_loss: 0.000, perplexity: 0.000
I0424 12:27:45.138618 136590955468608 peft_trainer.py:474] Train step 2 training loss: 5.643813  - training perplexity: 282.537903
I0424 12:27:45.139401 136590955468608 metric_logger.py:196] completed step: 2, seconds: 0.139, TFLOP/s/device: 0.002, Tokens/s/device: 7367.444, total_weights: 21455, loss: 5.644, lm_loss: 0.000, perplexity: 0.000
I0424 12:27:45.237837 136590955468608 peft_trainer.py:474] Train step 3 training loss: 5.503054  - training perplexity: 245.440414
I0424 12:27:45.238982 136590955468608 metric_logger.py:196] completed step: 3, seconds: 0.099, TFLOP/s/device: 0.002, Tokens/s/device: 10315.056, total_weights: 22025, loss: 5.503, lm_loss: 0.000, perplexity: 0.000
I0424 12:27:45.263424 136590955468608 peft_trainer.py:474] Train step 4 training loss: 5.624704  - training perplexity: 277.190186
I0424 12:27:45.264200 136590955468608 metric_logger.py:196] completed step: 4, seconds: 0.026, TFLOP/s/device: 0.009, Tokens/s/device: 40148.033, total_weights: 23787, loss: 5.625, lm_loss: 0.000, perplexity: 0.000
I0424 12:27:45.279541 136590955468608 peft_trainer.py:733] Train loop finished in: 19.5552 seconds
I0424 12:27:45.280506 136590955468608 peft_trainer.py:474] Train step 5 training loss: 5.765533  - training perplexity: 319.109253
I0424 12:27:45.281220 136590955468608 metric_logger.py:196] completed step: 5, seconds: 0.017, TFLOP/s/device: 0.013, Tokens/s/device: 59969.411, total_weights: 20141, loss: 5.766, lm_loss: 0.000, perplexity: 0.000
I0424 12:27:49.273944 136438796498688 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_5
I0424 12:27:49.340873 136438813284096 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints/1/model_params/array_metadatas/process_5
I0424 12:27:49.342025 136438788105984 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.651 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0424 12:27:49.342163 136438788105984 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 13.959 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0424 12:27:49.342200 136438788105984 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.489398s.
I0424 12:27:54.285919 136590955468608 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0424 12:28:01.307132 136438788105984 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 16.454312s.
I0424 12:28:01.307445 136438830069504 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0424 12:28:01.307563 136438830069504 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0424 12:28:01.307614 136438830069504 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0424 12:28:01.309198 136438830069504 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0424 12:28:01.309383 136590955468608 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0424 12:28:01.309510 136590955468608 checkpoint_manager.py:1441] Waiting for previous save to complete took 7.023609 seconds. If this number is high, consider checkpointing less frequently.
I0424 12:28:01.311133 136590955468608 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 5
I0424 12:28:01.314544 136590955468608 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints/5.
I0424 12:28:01.944785 136590955468608 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0424 12:28:01.944881 136590955468608 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0424 12:28:01.956896 136590955468608 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.013300s
I0424 12:28:01.960194 136590955468608 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 287.442 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 266 milliseconds) (per-host)
I0424 12:28:01.960261 136590955468608 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.266568s (batch_requests_ready=0.247829s, total_serialization_initiated=0.018635s, others=0.000104s)
I0424 12:28:01.962092 136590955468608 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0424 12:28:01.962147 136590955468608 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0424 12:28:01.967167 136590955468608 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.006803s
I0424 12:28:01.967273 136590955468608 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 92.438 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 276 milliseconds) (per-host)
I0424 12:28:01.967318 136590955468608 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.276261s (batch_requests_ready=0.265951s, total_serialization_initiated=0.010244s, others=0.000066s)
I0424 12:28:01.967433 136590955468608 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.280154s (all_items=0.000014s, per_item={'model_params': '0.00001097', 'optimizer_state': '0.00000286'}, temp_paths=0.280140)
I0424 12:28:01.968325 136438830069504 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0424 12:28:01.968484 136590955468608 async_checkpointer.py:561] Finished blocking save. Time taken: 0.657277s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints/5.
I0424 12:28:02.394473 136590955468608 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0424 12:28:02.394829 136438788105984 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0424 12:28:02.395047 136590955468608 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777033674.2858784, 'wait_for_prev_duration_secs': 7.023609161376953, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777033681.311173, 'checkpointer_blocking_duration_secs': 0.6574175357818604, 'get_old_steps_start_time': 1777033681.9686146, 'get_old_steps_duration_secs': 9.417533874511719e-05, 'checkpoint_manager_blocking_start_time': 1777033665.2858217, 'checkpoint_manager_blocking_duration_secs': 17.109188318252563}
I0424 12:28:02.395247 136590955468608 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0424 12:28:07.144883 136438821676800 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints/5/model_params/array_metadatas/process_5
I0424 12:28:07.146089 136438830069504 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.680 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0424 12:28:07.198767 136438796498688 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260424_120707/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260424_120707_03_sft_linen_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_5
I0424 12:28:07.199903 136438830069504 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 13.911 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0424 12:28:07.199986 136438830069504 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 5.231550s.
I0424 12:28:16.685146 136438830069504 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 14.716696s.
I0424 12:28:16.685462 136438788105984 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0424 12:28:16.685580 136438788105984 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0424 12:28:16.685625 136438788105984 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0424 12:28:16.688576 136438788105984 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0424 12:28:16.688764 136590955468608 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0424 12:28:16.688919 136590955468608 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=136590955468608 count=1 at 0x7c2678245d00>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7c22c28bd940>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7c22c28bda00>, _write_futures=[])
I0424 12:28:16.689359 136590955468608 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=136590955468608 count=1 at 0x7c2678245d00>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7c22c28bd940>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7c22c28bda00>, _write_futures=[])
I0424 12:28:16.689392 136590955468608 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=136590955468608 count=1 at 0x7c2678245d00>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7c22c28bd940>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7c22c28bda00>, _write_futures=[])
I0424 12:28:16.773425 136440924317440 grain_pool.py:542] Grain pool is exiting.
I0424 12:28:16.773521 136440924317440 grain_pool.py:547] Shutting down multiprocessing system.
I0424 12:28:18.838199 136440924317440 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
XPK End: Fri Apr 24 12:28:32 UTC 2026
EXIT_CODE=0