MaxView

← Back to run

Log Summary

XPK Start: Sat Apr 25 12:33:14 UTC 2026
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-25 12:33:43.743553: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0425 12:33:43.989053 134338315638592 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-25 12:33:53,029:jax._src.distributed:149: Starting JAX distributed service on [::]:8482
I0425 12:33:53.029986 134338315638592 distributed.py:149] Starting JAX distributed service on [::]:8482
INFO:2026-04-25 12:33:53,032:jax._src.distributed:166: Connecting to JAX distributed service on mt-03-sft-linen-ckpt-aj3lh-slice-job-0-0.mt-03-sft-linen-ckpt-aj3lh:8482
I0425 12:33:53.032344 134338315638592 distributed.py:166] Connecting to JAX distributed service on mt-03-sft-linen-ckpt-aj3lh-slice-job-0-0.mt-03-sft-linen-ckpt-aj3lh:8482
I0425 12:33:54.206395 134338315638592 max_utils.py:284] Jax distributed system initialized!
I0425 12:34:00.153825 134338315638592 max_utils.py:800] System Information: Jax Version: 0.8.3
I0425 12:34:00.153934 134338315638592 max_utils.py:801] System Information: Jaxlib Version: 0.8.3
I0425 12:34:00.153983 134338315638592 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Dec 15 2025 14:03:46 (1765836226) cl/844590465
I0425 12:34:00.157410 134338315638592 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:34:00.254342 134338315638592 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:34:00.357298 134338315638592 maxtext_utils.py:1771] Num_devices: 32, shape (1, 4, 1, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0425 12:34:01.407374 134338315638592 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:34:01.407833 134338315638592 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7a2d5dc5c230>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:34:01.407896 134338315638592 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.28
W0425 12:34:01.904027 134338315638592 checkpoint.py:202] Metadata file does not exist: gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items/_CHECKPOINT_METADATA
I0425 12:34:02.438083    1947 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0425 12:34:03.605929 134338315638592 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
W0425 12:34:05.750140 134338315638592 transform_utils.py:230] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I0425 12:34:05.750559 134338315638592 transform_utils.py:288] The following keys are not loaded from the original tree after applying specified transforms: params/params/decoder/dropout/rngs/aqt/count, params/params/decoder/dropout/rngs/aqt/key, params/params/decoder/dropout/rngs/dropout/count, params/params/decoder/dropout/rngs/dropout/key, params/params/decoder/dropout/rngs/params/count, params/params/decoder/dropout/rngs/params/key, params/params/decoder/layers/dropout/rngs/aqt/count, params/params/decoder/layers/dropout/rngs/aqt/key, params/params/decoder/layers/dropout/rngs/dropout/count, params/params/decoder/layers/dropout/rngs/dropout/key, params/params/decoder/layers/dropout/rngs/params/count, params/params/decoder/layers/dropout/rngs/params/key, params/params/decoder/layers/mlp/dropout/rngs/aqt/count, params/params/decoder/layers/mlp/dropout/rngs/aqt/key, params/params/decoder/layers/mlp/dropout/rngs/dropout/count, params/params/decoder/layers/mlp/dropout/rngs/dropout/key, params/params/decoder/layers/mlp/dropout/rngs/params/count, params/params/decoder/layers/mlp/dropout/rngs/params/key, params/params/decoder/layers/rngs/aqt/count, params/params/decoder/layers/rngs/aqt/key, params/params/decoder/layers/rngs/dropout/count, params/params/decoder/layers/rngs/dropout/key, params/params/decoder/layers/rngs/params/count, params/params/decoder/layers/rngs/params/key, params/params/decoder/rngs/aqt/count, params/params/decoder/rngs/aqt/key, params/params/decoder/rngs/dropout/count, params/params/decoder/rngs/dropout/key, params/params/decoder/rngs/params/count, params/params/decoder/rngs/params/key
I0425 12:34:05.802088 134338315638592 checkpointer.py:318] Finished restoring checkpoint in 2.58 seconds from gs://lance-maxtext/pt_seed_ckpts/pt_seed_ckpts/pt_seed_ckpt_gpt352k_linen/checkpoints/9/items.
I0425 12:34:05.870365 134338315638592 config.py:112] TensorFlow version 2.20.0 available.
I0425 12:34:05.870892 134338315638592 config.py:125] JAX version 0.8.3 available.
/deps/src/maxtext/input_pipeline/input_pipeline_utils.py:467: UserWarning: WARNING: Inefficient dataloading. Your train or eval dataset contains 3 shards, smaller than number of host loading data. This is known to lead to inefficient dataloading. Seegithub.com/google/maxtext/blob/main/getting_started/Data_Input_Pipeline.md#multihost-dataloading-best-practice
  warnings.warn(
E0425 12:34:10.967048 134338315638592 packing.py:209] PackAndBatchOperation is deprecated. Please use lazy_dataset.FirstFitPackIterDataset instead.
I0425 12:34:10.967362 134338315638592 data_loader.py:408] Adding CopyNumPyArrayToSharedMemory MapTransform.
I0425 12:34:11.359445 134338315638592 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:34:11.359587 134338315638592 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7a2d5dc5c230>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:34:11.359633 134338315638592 pytree_checkpoint_handler.py:577] save_device_host_concurrent_bytes=None
I0425 12:34:11.359684 134338315638592 base_pytree_checkpoint_handler.py:411] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7a2d5dc5c230>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0425 12:34:11.359730 134338315638592 checkpoint_manager.py:702] [process=5][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'model_params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a1539e7ae70>, 'optimizer_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a1538b93f50>, 'custom_metadata': <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a1538b99730>}, handler_registry=None
I0425 12:34:11.359937 134338315638592 composite_checkpoint_handler.py:237] Deferred registration for item: "model_params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a1539e7ae70>` for item "model_params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0425 12:34:11.359983 134338315638592 composite_checkpoint_handler.py:237] Deferred registration for item: "optimizer_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a1538b93f50>` for item "optimizer_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0425 12:34:11.360017 134338315638592 composite_checkpoint_handler.py:237] Deferred registration for item: "custom_metadata". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a1538b99730>` for item "custom_metadata" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0425 12:34:11.360045 134338315638592 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a1538ba04d0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0425 12:34:11.360074 134338315638592 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a1539e7ae70>, ('model_params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a1539e7ae70>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a1538b93f50>, ('optimizer_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a1538b93f50>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a1538b99730>, ('custom_metadata', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a1538b99730>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a1538ba04d0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a1538ba04d0>}).
I0425 12:34:11.360462 134338315638592 async_checkpointer.py:177] [process=5][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7a153938fec0> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0425 12:34:13.876579 134338315638592 checkpoint_manager.py:1788] Found 0 checkpoint steps in gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints
I0425 12:34:13.888211 134338315638592 checkpoint_manager.py:921] [process=5][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=10000, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_hns=False, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None, preservation_policy=None, prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False), root_directory=gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7a1538b99dc0>
I0425 12:34:13.888492 134338315638592 peft_trainer.py:584] Training with mesh: Mesh('diloco': 1, 'data': 4, 'stage': 1, 'fsdp': 8, 'fsdp_transpose': 1, 'sequence': 1, 'context': 1, 'context_autoregressive': 1, 'tensor': 1, 'tensor_transpose': 1, 'tensor_sequence': 1, 'expert': 1, 'autoregressive': 1, axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto))
I0425 12:34:14.309596 134338315638592 peft_trainer.py:594] Compiled train_step cache size: 0
I0425 12:34:14.313778 134338315638592 metric_logger.py:301] number parameters: 0.000 billion
I0425 12:34:14.316217 134184735581952 grain_pool.py:367] Grain pool will use 1 processes.
I0425 12:34:14.342571 134184735581952 grain_pool.py:440] Grain pool will start child processes.
Per train step:
 Total TFLOPs: 0.00 
 split as 54.29% learnable weight flops and 45.71% attention flops
I0425 12:34:14.347743 134184735581952 grain_pool.py:448] Grain pool started all child processes.
2026-04-25 12:34:18.359777: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2026-04-25 12:34:18.404448: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2026-04-25 12:34:19.566097: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
`rope_scaling`'s factor field must be a float >= 1, got 40
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'rope_theta'}
2026-04-25 12:34:23.785622: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0425 12:34:32.270708 134338315638592 checkpoint_manager.py:1983] [process=5][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0425 12:34:32.272511 134338315638592 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 1
I0425 12:34:32.275578 134338315638592 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints/1.
I0425 12:34:32.849740 134338315638592 signaling_client.py:364] Using JaxDistributedSignalingClient
I0425 12:34:32.850780 134338315638592 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0425 12:34:32.850841 134338315638592 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:34:33.510119 134338315638592 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.661470s
I0425 12:34:33.512147 134338315638592 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 27.644 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 923 milliseconds) (per-host)
I0425 12:34:33.512217 134338315638592 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.923677s (batch_requests_ready=0.254299s, total_serialization_initiated=0.669265s, others=0.000113s)
I0425 12:34:33.513137 134338315638592 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0425 12:34:33.513197 134338315638592 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:34:33.525048 134338315638592 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.012718s
I0425 12:34:33.525169 134338315638592 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 82.057 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 933 milliseconds) (per-host)
I0425 12:34:33.525214 134338315638592 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.933494s (batch_requests_ready=0.918576s, total_serialization_initiated=0.014844s, others=0.000074s)
I0425 12:34:33.525328 134338315638592 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.940928s (all_items=0.000023s, per_item={'model_params': '0.00001884', 'optimizer_state': '0.00000453'}, temp_paths=0.940905)
I0425 12:34:33.526535 134182059620096 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0425 12:34:33.526720 134338315638592 async_checkpointer.py:561] Finished blocking save. Time taken: 1.254133s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints/1.
I0425 12:34:33.597684 134338315638592 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=1] Starting CheckpointManager Save Finalize thread=save_finalize
I0425 12:34:33.598034 134182588098304 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0425 12:34:33.598193 134338315638592 standard_logger.py:34] {'step': 1, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777120472.2706735, 'wait_for_prev_duration_secs': 0.0001404285430908203, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777120472.2725534, 'checkpointer_blocking_duration_secs': 1.2542719841003418, 'get_old_steps_start_time': 1777120473.5268497, 'get_old_steps_duration_secs': 7.915496826171875e-05, 'checkpoint_manager_blocking_start_time': 1777120472.0225291, 'checkpoint_manager_blocking_duration_secs': 1.5756280422210693}
I0425 12:34:33.743766 134338315638592 peft_trainer.py:474] Train step 1 training loss: 5.926076  - training perplexity: 374.681519
I0425 12:34:33.744035 134338315638592 max_utils.py:750] 
Memstats: After params initialized:
I0425 12:34:33.744100 134338315638592 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_18(process=5,(2,4,0,0))
I0425 12:34:33.744138 134338315638592 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_19(process=5,(3,4,0,0))
I0425 12:34:33.744168 134338315638592 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_22(process=5,(2,5,0,0))
I0425 12:34:33.744193 134338315638592 max_utils.py:756] 	Using (GB) 0.01 / 31.25 (0.032000%) on TPU_23(process=5,(3,5,0,0))
I0425 12:34:33.874877 134338315638592 metric_logger.py:196] completed step: 1, seconds: 19.430, TFLOP/s/device: 0.000, Tokens/s/device: 52.701, total_weights: 21054, loss: 5.926, lm_loss: 0.000, perplexity: 0.000
I0425 12:34:33.883975 134338315638592 peft_trainer.py:474] Train step 2 training loss: 5.643813  - training perplexity: 282.537903
I0425 12:34:33.884693 134338315638592 metric_logger.py:196] completed step: 2, seconds: 0.140, TFLOP/s/device: 0.002, Tokens/s/device: 7317.213, total_weights: 21455, loss: 5.644, lm_loss: 0.000, perplexity: 0.000
I0425 12:34:33.977726 134338315638592 peft_trainer.py:474] Train step 3 training loss: 5.503054  - training perplexity: 245.440414
I0425 12:34:33.978687 134338315638592 metric_logger.py:196] completed step: 3, seconds: 0.094, TFLOP/s/device: 0.002, Tokens/s/device: 10916.163, total_weights: 22025, loss: 5.503, lm_loss: 0.000, perplexity: 0.000
I0425 12:34:34.003647 134338315638592 peft_trainer.py:474] Train step 4 training loss: 5.624704  - training perplexity: 277.190186
I0425 12:34:34.004412 134338315638592 metric_logger.py:196] completed step: 4, seconds: 0.026, TFLOP/s/device: 0.009, Tokens/s/device: 39582.210, total_weights: 23787, loss: 5.625, lm_loss: 0.000, perplexity: 0.000
I0425 12:34:34.020099 134338315638592 peft_trainer.py:733] Train loop finished in: 19.7052 seconds
I0425 12:34:34.021065 134338315638592 peft_trainer.py:474] Train step 5 training loss: 5.765533  - training perplexity: 319.109253
I0425 12:34:34.021786 134338315638592 metric_logger.py:196] completed step: 5, seconds: 0.017, TFLOP/s/device: 0.013, Tokens/s/device: 58929.751, total_weights: 20141, loss: 5.766, lm_loss: 0.000, perplexity: 0.000
I0425 12:34:37.928854 134182068012800 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints/1/optimizer_state/array_metadatas/process_5
I0425 12:34:37.986475 134182084798208 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints/1/model_params/array_metadatas/process_5
I0425 12:34:37.987620 134182059620096 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.729 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:34:37.987781 134182059620096 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 14.194 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:34:37.987821 134182059620096 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.461174s.
I0425 12:34:42.953957 134338315638592 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=1][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0425 12:34:48.563957 134182059620096 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 15.037288s.
I0425 12:34:48.564296 134182588098304 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0425 12:34:48.564416 134182588098304 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0425 12:34:48.564465 134182588098304 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is syncing with other hosts...
I0425 12:34:48.565907 134182588098304 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=1] CheckpointManager Save Finalize is done on all hosts.
I0425 12:34:48.566092 134338315638592 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=1][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=1.
W0425 12:34:48.566224 134338315638592 checkpoint_manager.py:1441] Waiting for previous save to complete took 5.612286 seconds. If this number is high, consider checkpointing less frequently.
I0425 12:34:48.568126 134338315638592 checkpoint_manager.py:1501] [process=5] Saving checkpoint at step 5
I0425 12:34:48.571631 134338315638592 async_checkpointer.py:452] [process=5] Started async saving checkpoint to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints/5.
I0425 12:34:49.103825 134338315638592 jax_array_handlers.py:347] Scheduling D2H of 52 prioritized jax.Array.
I0425 12:34:49.103918 134338315638592 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:34:49.115984 134338315638592 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.013139s
I0425 12:34:49.116113 134338315638592 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 283.557 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 270 milliseconds) (per-host)
I0425 12:34:49.116160 134338315638592 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.270197s (batch_requests_ready=0.254848s, total_serialization_initiated=0.015265s, others=0.000084s)
I0425 12:34:49.402866 134338315638592 jax_array_handlers.py:347] Scheduling D2H of 46 prioritized jax.Array.
I0425 12:34:49.402964 134338315638592 replica_slices.py:410] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0425 12:34:49.408667 134338315638592 base_pytree_checkpoint_handler.py:153] [process=5][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.007868s
I0425 12:34:49.408788 134338315638592 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/blocking_gbytes_per_sec: 45.155 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 565 milliseconds) (per-host)
I0425 12:34:49.408835 134338315638592 base_pytree_checkpoint_handler.py:732] [process=5][thread=MainThread] Initiated Pytree async_save. Time taken: 0.565491s (batch_requests_ready=0.553819s, total_serialization_initiated=0.011591s, others=0.000081s)
I0425 12:34:49.408967 134338315638592 composite_checkpoint_handler.py:715] [process=5][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.569785s (all_items=0.000015s, per_item={'model_params': '0.00001168', 'optimizer_state': '0.00000310'}, temp_paths=0.569770)
I0425 12:34:49.409972 134182588098304 async_checkpointer.py:79] [process=5][thread=async_save] Background save thread started.
I0425 12:34:49.410128 134338315638592 async_checkpointer.py:561] Finished blocking save. Time taken: 0.841929s. Continuing background save to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints/5.
I0425 12:34:49.518273 134338315638592 checkpoint_manager.py:1549] [process=5][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0425 12:34:49.518616 134182059620096 async_checkpointer.py:265] [process=5][thread=save_finalize] Waiting for background save thread=async_save.
I0425 12:34:49.518809 134338315638592 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1777120482.9539163, 'wait_for_prev_duration_secs': 5.612285614013672, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1777120488.5681665, 'checkpointer_blocking_duration_secs': 0.8420722484588623, 'get_old_steps_start_time': 1777120489.4102619, 'get_old_steps_duration_secs': 7.891654968261719e-05, 'checkpoint_manager_blocking_start_time': 1777120474.0263126, 'checkpoint_manager_blocking_duration_secs': 15.49246072769165}
I0425 12:34:49.519005 134338315638592 checkpoint_manager.py:1994] [process=5][thread=MainThread][step=5][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0425 12:34:54.014326 134182068012800 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 46 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints/5/model_params/array_metadatas/process_5
I0425 12:34:54.015551 134182588098304 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 4.936 KiB/s (total gbytes: 25.5 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:34:54.058274 134182093190912 array_metadata_store.py:203] [process=5][thread=array_type_handler] Wrote 52 array_metadata.ArrayMetadata to gs://lance-maxtext/pt_ckpt_xpk_feat_nnx_post_train_fixes_20260425_121405/pt_sft_nnx_xpk_feat_nnx_post_train_fixes_20260425_121405_03_sft_linen_ckpt/checkpoints/5/optimizer_state/array_metadatas/process_5
I0425 12:34:54.059400 134182588098304 base_pytree_checkpoint_handler.py:128] [process=5] /jax/checkpoint/write/gbytes_per_sec: 14.692 KiB/s (total gbytes: 76.6 KiB) (time elapsed: 5 seconds) (per-host)
I0425 12:34:54.059492 134182588098304 async_checkpointer.py:90] [process=5][thread=async_save] 4 Handler Commit operations completed. Time taken: 4.649411s.
I0425 12:35:02.480609 134182588098304 async_checkpointer.py:144] [process=5][thread=async_save] Background save thread done. Time taken: 13.070515s.
I0425 12:35:02.480968 134182059620096 async_checkpointer.py:273] [process=5][thread=save_finalize] Done with waiting for background save thread=async_save.
I0425 12:35:02.481083 134182059620096 async_checkpointer.py:283] [process=5][thread=save_finalize] No errors found in background save thread=async_save.
I0425 12:35:02.481132 134182059620096 checkpoint_manager.py:2103] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0425 12:35:02.482530 134182059620096 checkpoint_manager.py:2112] [process=5][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0425 12:35:02.482746 134338315638592 checkpoint_manager.py:2006] [process=5][thread=MainThread][step=5][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=5.
I0425 12:35:02.482925 134338315638592 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=134338315638592 count=1 at 0x7a1539c6c080>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7a1538b99d90>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7a1538b98c80>, _write_futures=[])
I0425 12:35:02.483447 134338315638592 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=134338315638592 count=1 at 0x7a1539c6c080>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7a1538b99d90>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7a1538b98c80>, _write_futures=[])
I0425 12:35:02.483482 134338315638592 checkpoint.py:459] Closing _NonBlockingMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=134338315638592 count=1 at 0x7a1539c6c080>, _store_impl=<orbax.checkpoint._src.metadata.checkpoint._MetadataStoreImpl object at 0x7a1538b99d90>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7a1538b98c80>, _write_futures=[])
I0425 12:35:02.512422 134184735581952 grain_pool.py:542] Grain pool is exiting.
I0425 12:35:02.512493 134184735581952 grain_pool.py:547] Shutting down multiprocessing system.
I0425 12:35:04.593304 134184735581952 grain_pool.py:547] Shutting down multiprocessing system.
/usr/local/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 15 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
XPK End: Sat Apr 25 12:35:15 UTC 2026
EXIT_CODE=0