
Log Summary

XPK Start: Sat Apr 25 12:27:03 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-25 12:27:32.197075: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0425 12:27:32.441744 135998268753728 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-25 12:27:41,481:jax._src.distributed:149: Starting JAX distributed service on [::]:8482
I0425 12:27:41.481559 135998268753728 distributed.py:149] Starting JAX distributed service on [::]:8482
INFO:2026-04-25 12:27:41,484:jax._src.distributed:166: Connecting to JAX distributed service on mt-01-sft-smoke-b630o-slice-job-0-0.mt-01-sft-smoke-b630o:8482
I0425 12:27:41.484089 135998268753728 distributed.py:166] Connecting to JAX distributed service on mt-01-sft-smoke-b630o-slice-job-0-0.mt-01-sft-smoke-b630o:8482
I0425 12:27:42.826454 135998268753728 max_utils.py:284] Jax distributed system initialized!
I0425 12:27:48.874725 135998268753728 max_utils.py:800] System Information: Jax Version: 0.8.3
I0425 12:27:48.874832 135998268753728 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0425 12:27:48.874871 135998268753728 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0425 12:27:48.878229 135998268753728 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:27:48.974111 135998268753728 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:27:49.076429 135998268753728 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:27:50.189987 135998268753728 config.py:112] TensorFlow version 2.20.0 available.
I0425 12:27:50.190494 135998268753728 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, smaller than number of hosts loading data. This is known to lead to inefficient dataloading. See github.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
  warnings.warn(
E0425 12:27:55.482349 135998268753728 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0425 12:27:55.482689 135998268753728 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0425 12:27:55.879122 135998268753728 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:27:55.879639 135998268753728 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7bafdac1c1d0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:27:55.879713 135998268753728 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:27:55.879768 135998268753728 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7bafdac1c1d0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:27:55.879834 135998268753728 checkpoint_manager.py:702] [process=5][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b97a60b7a70>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b97a6a17b90>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b97a4171a90>}, handler_registry=None
I0425 12:27:55.880072 135998268753728 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b97a60b7a70>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0425 12:27:55.880120 135998268753728 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b97a6a17b90>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0425 12:27:55.880150 135998268753728 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b97a4171a90>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0425 12:27:55.880183 135998268753728 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7bafd4e4a0c0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0425 12:27:55.880211 135998268753728 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b97a60b7a70>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b97a60b7a70>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b97a6a17b90>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b97a6a17b90>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b97a4171a90>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b97a4171a90>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7bafd4e4a0c0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7bafd4e4a0c0>}).
I0425 12:27:55.880425 135998268753728 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
I0425 12:27:55.880476 135998268753728 async_checkpointer.py:177] [process=5][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7b9313737420> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0425 12:27:58.303019 135998268753728 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints
I0425 12:27:58.305217 135998268753728 checkpoint_manager.py:921] [process=5][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7b97a4172870>
I0425 12:27:58.305477 135998268753728 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0425 12:27:58.731130 135998268753728 peft_trainer.py:594] Compiled train_step cache size: 0
I0425 12:27:58.735312 135998268753728 metric_logger.py:301] number parameters: 0.000 billion
I0425 12:27:58.737664 135845059364608 grain_pool.py:367] Grain pool will use 1 processes.
I0425 12:27:58.763820 135845059364608 grain_pool.py:440] Grain pool will start child processes.
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
I0425 12:27:58.769814 135845059364608 grain_pool.py:448] Grain pool started all child processes.
2026-04-25 12:28:02.770731: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-25 12:28:02.815126: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-25 12:28:03.978769: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-25 12:28:08.195098: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0425 12:28:16.636291 135998268753728 checkpoint_manager.py:1983] [process=5][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0425 12:28:16.638491 135998268753728 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 1
I0425 12:28:16.641695 135998268753728 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints/1.
I0425 12:28:17.178191 135998268753728 signaling_client.py:364] Using JaxDistributedSignalingClient
I0425 12:28:17.179117 135998268753728 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0425 12:28:17.179177 135998268753728 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:28:17.851454 135998268753728 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.674390s
I0425 12:28:17.853544 135998268753728 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 27.280 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 935 milliseconds) (per-host)
I0425 12:28:17.853617 135998268753728 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.936000s (batch_requests_ready=0.253554s, total_serialization_initiated=0.682325s, others=0.000121s)
I0425 12:28:17.854393 135998268753728 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0425 12:28:17.854454 135998268753728 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:28:17.866173 135998268753728 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012419s
I0425 12:28:17.866305 135998268753728 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 81.020 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 945 milliseconds) (per-host)
I0425 12:28:17.866354 135998268753728 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.945448s (batch_requests_ready=0.930740s, total_serialization_initiated=0.014628s, others=0.000080s)
I0425 12:28:17.866508 135998268753728 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.952790s (all_items=0.000023s, per_item={'model_params': '0.00001812', 'optimizer_state': '0.00000477'}, temp_paths=0.952768)
I0425 12:28:17.867467 135842416809728 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0425 12:28:17.867583 135998268753728 async_checkpointer.py:561] Finished blocking save. Time taken: 1.229027s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints/1.
I0425 12:28:17.935294 135998268753728 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0425 12:28:17.935625 135842408417024 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0425 12:28:17.935801 135998268753728 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777120096.636256, 'wait_for_prev_duration_secs': 0.00014090538024902344, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777120096.638534, 'checkpointer_blocking_duration_secs': 1.2292020320892334, 'get_old_steps_start_time': 1777120097.867751, 'get_old_steps_duration_secs': 6.270408630371094e-05, 'checkpoint_manager_blocking_start_time': 1777120096.5404134, 'checkpoint_manager_blocking_duration_secs': 1.3953511714935303}
I0425 12:28:18.082525 135998268753728 peft_trainer.py:474] Train step 1 training loss: 5.871947  - training perplexity: 354.939331
I0425 12:28:18.082810 135998268753728 max_utils.py:750] 
Memstats: After params initialized:
I0425 12:28:18.082877 135998268753728 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_18(process=5,(2,4,0,0))
I0425 12:28:18.082931 135998268753728 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_19(process=5,(3,4,0,0))
I0425 12:28:18.082963 135998268753728 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_22(process=5,(2,5,0,0))
I0425 12:28:18.082989 135998268753728 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_23(process=5,(3,5,0,0))
I0425 12:28:18.213337 135998268753728 metric_logger.py:196] completed step: 1, seconds: 19.348, TFLOP/s/device: 0.000, Tokens/s/device: 52.926, total_weights: 21054, loss: 5.872, lm_loss: 0.000, perplexity: 0.000
I0425 12:28:18.222801 135998268753728 peft_trainer.py:474] Train step 2 training loss: 5.539500  - training perplexity: 254.550751
I0425 12:28:18.223585 135998268753728 metric_logger.py:196] completed step: 2, seconds: 0.140, TFLOP/s/device: 0.002, Tokens/s/device: 7315.501, total_weights: 21455, loss: 5.540, lm_loss: 0.000, perplexity: 0.000
I0425 12:28:18.302320 135998268753728 peft_trainer.py:474] Train step 3 training loss: 5.440901  - training perplexity: 230.649963
I0425 12:28:18.303347 135998268753728 metric_logger.py:196] completed step: 3, seconds: 0.080, TFLOP/s/device: 0.003, Tokens/s/device: 12869.743, total_weights: 22025, loss: 5.441, lm_loss: 0.000, perplexity: 0.000
I0425 12:28:18.327070 135998268753728 peft_trainer.py:474] Train step 4 training loss: 5.596125  - training perplexity: 269.380554
I0425 12:28:18.327916 135998268753728 metric_logger.py:196] completed step: 4, seconds: 0.025, TFLOP/s/device: 0.009, Tokens/s/device: 41479.491, total_weights: 23787, loss: 5.596, lm_loss: 0.000, perplexity: 0.000
I0425 12:28:18.343186 135998268753728 peft_trainer.py:733] Train loop finished in: 19.6067 seconds
I0425 12:28:18.344542 135998268753728 peft_trainer.py:474] Train step 5 training loss: 5.670155  - training perplexity: 290.079498
I0425 12:28:18.345238 135998268753728 metric_logger.py:196] completed step: 5, seconds: 0.017, TFLOP/s/device: 0.013, Tokens/s/device: 58643.230, total_weights: 20141, loss: 5.670, lm_loss: 0.000, perplexity: 0.000
I0425 12:28:20.414726    3034 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0425 12:28:22.329795 135842391631616 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints/1/model_params/array_metadatas/process_5
I0425 12:28:22.331203 135842416809728 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.716 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:28:22.372895 135841896724224 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints/1/optimizer_state/array_metadatas/process_5
I0425 12:28:22.374100 135842416809728 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 14.046 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:28:22.374196 135842416809728 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.506542s.
I0425 12:28:27.097889 135998268753728 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0425 12:28:33.671194 135842416809728 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 15.803529s.
I0425 12:28:33.671523 135842408417024 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0425 12:28:33.671643 135842408417024 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0425 12:28:33.671710 135842408417024 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0425 12:28:33.672874 135842408417024 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0425 12:28:33.673032 135998268753728 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0425 12:28:33.673115 135998268753728 checkpoint_manager.py:1441] Waiting for previous save to complete took 6.575246 seconds. If this number is high, consider checkpointing less frequently.
I0425 12:28:33.674823 135998268753728 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 5
I0425 12:28:33.678349 135998268753728 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints/5.
I0425 12:28:34.206106 135998268753728 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0425 12:28:34.206205 135998268753728 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:28:34.211559 135998268753728 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.007397s
I0425 12:28:34.211695 135998268753728 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 97.842 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 260 milliseconds) (per-host)
I0425 12:28:34.211744 135998268753728 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.261037s (batch_requests_ready=0.249746s, total_serialization_initiated=0.011194s, others=0.000097s)
I0425 12:28:34.319498 135998268753728 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0425 12:28:34.319600 135998268753728 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:28:34.331629 135998268753728 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.013183s
I0425 12:28:34.331770 135998268753728 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 202.401 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 378 milliseconds) (per-host)
I0425 12:28:34.331817 135998268753728 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.378503s (batch_requests_ready=0.363192s, total_serialization_initiated=0.015229s, others=0.000083s)
I0425 12:28:34.331944 135998268753728 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.385009s (all_items=0.000014s, per_item={'model_params': '0.00001121', 'optimizer_state': '0.00000286'}, temp_paths=0.384995)
I0425 12:28:34.332883 135842416809728 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0425 12:28:34.332982 135998268753728 async_checkpointer.py:561] Finished blocking save. Time taken: 0.658089s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints/5.
I0425 12:28:34.577162 135998268753728 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0425 12:28:34.577541 135842408417024 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0425 12:28:34.577748 135998268753728 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777120107.0978477, 'wait_for_prev_duration_secs': 6.575246334075928, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777120113.6748617, 'checkpointer_blocking_duration_secs': 0.6582307815551758, 'get_old_steps_start_time': 1777120114.333116, 'get_old_steps_duration_secs': 8.058547973632812e-05, 'checkpoint_manager_blocking_start_time': 1777120098.3495228, 'checkpoint_manager_blocking_duration_secs': 16.228188514709473}
I0425 12:28:34.577941 135998268753728 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0425 12:28:39.034674 135842391631616 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints/5/model_params/array_metadatas/process_5
I0425 12:28:39.035699 135841854760704 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_01_sft_smoke/checkpoints/5/optimizer_state/array_metadatas/process_5
I0425 12:28:39.037090 135842416809728 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 5.020 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:28:39.037627 135842416809728 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 15.065 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:28:39.037705 135842416809728 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.704759s.
I0425 12:28:47.560019 135842416809728 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 13.227063s.
I0425 12:28:47.560353 135842408417024 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0425 12:28:47.560473 135842408417024 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0425 12:28:47.560522 135842408417024 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0425 12:28:47.562061 135842408417024 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0425 12:28:47.562262 135998268753728 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0425 12:28:47.562421 135998268753728 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=135998268753728 count=1 at 0x7b97a40f41c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7b97a417e150>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7b97a417e510>, _write_futures=[])
I0425 12:28:47.562889 135998268753728 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=135998268753728 count=1 at 0x7b97a40f41c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7b97a417e150>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7b97a417e510>, _write_futures=[])
I0425 12:28:47.562922 135998268753728 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=135998268753728 count=1 at 0x7b97a40f41c0>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7b97a417e150>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7b97a417e510>, _write_futures=[])
I0425 12:28:47.837933 135845059364608 grain_pool.py:542] Grain pool is exiting.
I0425 12:28:47.838032 135845059364608 grain_pool.py:547] Shutting down multiprocessing system.
I0425 12:28:49.931824 135845059364608 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
XPK End: Sat Apr 25 12:29:04 UTC 2026
EXIT_CODE=0