MaxView

← Back to run

Log Summary

XPK Start: Wed Apr 22 15:20:02 UTC 2026
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2026-04-22 15:20:26.733254: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0422 15:20:26.913036 138312735962944 max_utils.py:273] Attempting to initialize the jax distributed system...
I0422 15:20:35.955260 138312735962944 distributed.py:149] Starting JAX distributed service on [::]:8482
I0422 15:20:35.957521 138312735962944 distributed.py:172] Connecting to JAX distributed service on mt-15-ocdbt-false-sv-ozh9q-slice-job-0-0.mt-15-ocdbt-false-sv-ozh9q:8482
I0422 15:20:37.466193 138312735962944 max_utils.py:284] Jax distributed system initialized!
I0422 15:20:43.714442 138312735962944 max_utils.py:800] System Information: Jax Version: 0.9.2
I0422 15:20:43.714546 138312735962944 max_utils.py:801] System Information: Jaxlib Version: 0.9.2
I0422 15:20:43.714586 138312735962944 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Mar 4 2026 11:32:08 (1772652728) cl/878335365
I0422 15:20:43.714620 138312735962944 train_utils.py:378] WARNING: Sequence packing is essentially ignored for synthetic data. Please use a real dataset to use sequence packing.
I0422 15:20:44.413091 138312735962944 maxtext_utils.py:1718] Num_devices: 32, shape (1, 1, 1, 32, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0422 15:20:44.525709 138312735962944 checkpointing.py:688] Setting up checkpoint logger...
I0422 15:20:44.525825 138312735962944 checkpointing.py:234] Creating checkpoint manager with ocdbt=False and zarr3=True
I0422 15:20:44.525867 138312735962944 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0422 15:20:44.526066 138312735962944 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=False, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7dcab9c53f20>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0422 15:20:47.485955 138312735962944 checkpointing.py:266] Enabling policy for fixed interval checkpointing.
I0422 15:20:47.486208 138312735962944 checkpoint_manager.py:708] [process=6][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=('items',), item_handlers={'items': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7db6504d4aa0>}, handler_registry=None
I0422 15:20:47.486624 138312735962944 composite_checkpoint_handler.py:237] Deferred registration for item: "items". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7db6504d4aa0>` for item "items" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0422 15:20:47.486761 138312735962944 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7db5706b73b0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0422 15:20:47.486803 138312735962944 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7db6504d4aa0>, ('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7db6504d4aa0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7db5706b73b0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7db5706b73b0>}).
I0422 15:20:47.487165 138312735962944 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.34
I0422 15:20:47.487263 138312735962944 async_checkpointer.py:192] [process=6][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7db5705b9da0> timeout: 1200 secs and primary_host=0 for async checkpoint writes
I0422 15:20:48.266815 138312735962944 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints
I0422 15:20:48.277862 138312735962944 checkpoint_manager.py:929] [process=6][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=FixedIntervalPolicy(interval=5), preservation_policy=LatestN(n=None), prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7db5705143e0>
I0422 15:20:48.277982 138312735962944 checkpointing.py:302] Checkpoint manager created!
I0422 15:20:49.244451 138312735962944 checkpointing.py:578] checkpoint manager exists so trying to load this run's existing checkpoint
I0422 15:20:49.244571 138312735962944 checkpointing.py:676] No existing checkpoints found, not restoring checkpoint.
fsdp: 32
I0422 15:20:50.612196 138312735962944 maxtext_utils.py:1836]  decoder/decoder_norm/scale/value
    Shape:     float32[2048]
    Physical:  (None,)
I0422 15:20:50.612319 138312735962944 maxtext_utils.py:1836]  decoder/layers/mlp/wi_0/kernel/value
    Shape:     float32[2048,16,7168]
    Physical:  ('fsdp', None, None)
I0422 15:20:50.612370 138312735962944 maxtext_utils.py:1836]  decoder/layers/mlp/wi_1/kernel/value
    Shape:     float32[2048,16,7168]
    Physical:  ('fsdp', None, None)
I0422 15:20:50.612412 138312735962944 maxtext_utils.py:1836]  decoder/layers/mlp/wo/kernel/value
    Shape:     float32[7168,16,2048]
    Physical:  (None, None, 'fsdp')
I0422 15:20:50.612447 138312735962944 maxtext_utils.py:1836]  decoder/layers/post_self_attention_layer_norm/scale/value
    Shape:     float32[2048,16]
    Physical:  (None, None)
I0422 15:20:50.612482 138312735962944 maxtext_utils.py:1836]  decoder/layers/pre_self_attention_layer_norm/scale/value
    Shape:     float32[2048,16]
    Physical:  (None, None)
I0422 15:20:50.612517 138312735962944 maxtext_utils.py:1836]  decoder/layers/self_attention/key/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, None, None)
I0422 15:20:50.612551 138312735962944 maxtext_utils.py:1836]  decoder/layers/self_attention/out/kernel/value
    Shape:     float32[16,16,128,2048]
    Physical:  (None, None, None, 'fsdp')
I0422 15:20:50.612583 138312735962944 maxtext_utils.py:1836]  decoder/layers/self_attention/query/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, None, None)
I0422 15:20:50.612615 138312735962944 maxtext_utils.py:1836]  decoder/layers/self_attention/value/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, None, None)
I0422 15:20:50.612646 138312735962944 maxtext_utils.py:1836]  decoder/logits_dense/kernel/value
    Shape:     float32[2048,32000]
    Physical:  ('fsdp', None)
I0422 15:20:50.612676 138312735962944 maxtext_utils.py:1836]  token_embedder/embedding/value
    Shape:     float32[32000,2048]

    Physical:  (None, 'fsdp')
I0422 15:20:51.018665 138312735962944 nnx_decoders.py:465] nnx_decoders/carry Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed').
I0422 15:20:51.018757 138312735962944 nnx_decoders.py:465] nnx_decoders/carry Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0422 15:20:51.024131 138312735962944 nnx_decoders.py:465] Unknown Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed').
I0422 15:20:51.024186 138312735962944 nnx_decoders.py:465] Unknown Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0422 15:20:51.040461 138312735962944 attentions.py:1088] attentions/inputs_q Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed').
I0422 15:20:51.040527 138312735962944 attentions.py:1088] attentions/inputs_q Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0422 15:20:51.056207 138312735962944 attentions.py:1089] attentions/inputs_kv Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed').
I0422 15:20:51.056262 138312735962944 attentions.py:1089] attentions/inputs_kv Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0422 15:20:51.081263 138312735962944 attentions.py:1154] attentions/query Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0422 15:20:51.081331 138312735962944 attentions.py:1154] attentions/query Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0422 15:20:51.097140 138312735962944 attentions.py:1155] attentions/key Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0422 15:20:51.097199 138312735962944 attentions.py:1155] attentions/key Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0422 15:20:51.112874 138312735962944 attentions.py:1156] attentions/value Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0422 15:20:51.112934 138312735962944 attentions.py:1156] attentions/value Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0422 15:20:51.143393 138312735962944 attentions.py:1197] attentions/out Logical: bfloat16[32,2048,16,128].................................... ('activation_batch', 'activation_attn_length', 'activation_heads', 'activation_kv').
I0422 15:20:51.143465 138312735962944 attentions.py:1197] attentions/out Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0422 15:20:51.163464 138312735962944 linears.py:525] linears/x Logical: bfloat16[32,2048,7168]...................................... ('activation_batch', 'activation_length', 'activation_mlp').
I0422 15:20:51.163529 138312735962944 linears.py:525] linears/x Physical: bfloat16[32,2048,7168]...................................... ('fsdp', None, None).
I0422 15:21:02.575182 138312735962944 max_utils.py:791] Total memory size: 1.5 GB, Output size: 0.4 GB, Temp size: 1.1 GB, Argument size: 0.4 GB, Host temp size: 0.0 GB.
I0422 15:21:02.577647 138312735962944 metric_logger.py:301] number parameters: 1.104 billion
I0422 15:21:13.766510 138312735962944 checkpointing.py:794] Waiting for step 0 to finish before checkpoint...
I0422 15:21:13.917872 138312735962944 checkpointing.py:798] Waited 0.15134763717651367 seconds for step 0 to finish before starting checkpointing.
I0422 15:21:13.920130 138312735962944 checkpoint_manager.py:2009] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0422 15:21:13.921750 138312735962944 checkpoint_manager.py:1512] [process=6] Saving checkpoint at step 0
I0422 15:21:13.923183 138312735962944 event_tracking.py:70] [process=6] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints/0.
I0422 15:21:14.215824 138312735962944 signaling_client.py:364] Using JaxDistributedSignalingClient
I0422 15:21:14.218310 138312735962944 jax_array_handlers.py:360] Scheduling D2H of 69 prioritized jax.Array.
I0422 15:21:14.218371 138312735962944 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0422 15:21:14.496780 138312735962944 base_pytree_checkpoint_handler.py:154] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.281742s
I0422 15:21:14.496949 138312735962944 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/blocking_gbytes_per_sec: 5.300 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.2910315990447998 s) (per-host)
I0422 15:21:14.496999 138312735962944 base_pytree_checkpoint_handler.py:768] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.291091s (batch_requests_ready=0.003177s, total_serialization_initiated=0.287847s, others=0.000067s)
I0422 15:21:14.497109 138312735962944 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.295311s (all_items=0.000019s, per_item={'items': '0.00001860'}, temp_paths=0.295293)
I0422 15:21:14.497911 138312735962944 event_tracking.py:125] [process=6] [async] Finished blocking save in 0.58 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints/0.
I0422 15:21:14.498201 138183948490496 async_checkpointer.py:76] [process=6][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-22 15:41:14.498177
I0422 15:21:14.500192 138312735962944 checkpoint_manager.py:1560] [process=6][thread=MainThread][step=0] Starting CheckpointManager Save Finalize thread=save_finalize
I0422 15:21:14.500432 138182334293760 async_checkpointer.py:280] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0422 15:21:14.500587 138312735962944 standard_logger.py:34] {'step': 0, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776871273.9201117, 'wait_for_prev_duration_secs': 6.175041198730469e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776871273.9217887, 'checkpointer_blocking_duration_secs': 0.5765297412872314, 'get_old_steps_start_time': 1776871274.498345, 'get_old_steps_duration_secs': 2.8371810913085938e-05, 'checkpoint_manager_blocking_start_time': 1776871273.9182725, 'checkpoint_manager_blocking_duration_secs': 0.5822737216949463}
I0422 15:21:14.500692 138312735962944 checkpointing.py:409] Started an asynchronous checkpoint save for step 0
I0422 15:21:14.500762 138312735962944 max_utils.py:750] 
Memstats: After params initialized:
I0422 15:21:14.500808 138312735962944 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_24(process=6,(0,6,0,0))
I0422 15:21:14.500838 138312735962944 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_25(process=6,(1,6,0,0))
I0422 15:21:14.500863 138312735962944 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_28(process=6,(0,7,0,0))
I0422 15:21:14.500885 138312735962944 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_29(process=6,(1,7,0,0))
I0422 15:21:14.816908 138312735962944 metric_logger.py:196] completed step: 0, seconds: 11.188, TFLOP/s/device: 1.214, Tokens/s/device: 183.057, total_weights: 65536, loss: 10.874, lm_loss: 10.874, perplexity: 52786.469
I0422 15:21:14.995031 138312735962944 metric_logger.py:196] completed step: 1, seconds: 1.049, TFLOP/s/device: 12.959, Tokens/s/device: 1953.263, total_weights: 65536, loss: 10.874, lm_loss: 10.874, perplexity: 52786.469
I0422 15:21:15.438939 138312735962944 metric_logger.py:196] completed step: 2, seconds: 0.032, TFLOP/s/device: 424.996, Tokens/s/device: 64060.056, total_weights: 65536, loss: 10.560, lm_loss: 10.560, perplexity: 38565.637
I0422 15:21:15.586430 138312735962944 metric_logger.py:196] completed step: 3, seconds: 0.434, TFLOP/s/device: 31.313, Tokens/s/device: 4719.818, total_weights: 65536, loss: 9.979, lm_loss: 9.979, perplexity: 21564.932
I0422 15:21:15.592374 138312735962944 checkpointing.py:794] Waiting for step 5 to finish before checkpoint...
I0422 15:21:15.881382 138312735962944 checkpointing.py:798] Waited 0.28897929191589355 seconds for step 5 to finish before starting checkpointing.
I0422 15:21:15.884084 138312735962944 checkpoint_manager.py:2020] [process=6][thread=MainThread][step=0][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0422 15:21:18.109868    2808 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0422 15:21:20.216924 138182342686464 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints/0/items/array_metadatas/process_6
I0422 15:21:25.271635 138183948490496 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/gbytes_per_sec: 142.747 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 11.065678119659424 s) (per-host)
I0422 15:21:25.271760 138183948490496 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 10.773483s.
I0422 15:21:32.800895 138183948490496 async_checkpointer.py:160] [process=6][thread=async_save] Background save thread done. Time taken: 18.302601s.
I0422 15:21:32.801196 138182334293760 async_checkpointer.py:288] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0422 15:21:32.801318 138182334293760 async_checkpointer.py:298] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0422 15:21:32.801376 138182334293760 checkpoint_manager.py:2137] [process=6][thread=save_finalize][step=0] CheckpointManager Save Finalize is syncing with other hosts...
I0422 15:21:32.802951 138182334293760 checkpoint_manager.py:2146] [process=6][thread=save_finalize][step=0] CheckpointManager Save Finalize is done on all hosts.
I0422 15:21:32.803165 138312735962944 checkpoint_manager.py:2032] [process=6][thread=MainThread][step=0][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=0.
W0422 15:21:32.803301 138312735962944 checkpoint_manager.py:1452] Waiting for previous save to complete took 16.919218 seconds. If this number is high, consider checkpointing less frequently.
I0422 15:21:32.805385 138312735962944 checkpoint_manager.py:1512] [process=6] Saving checkpoint at step 5
I0422 15:21:32.807931 138312735962944 event_tracking.py:70] [process=6] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints/5.
I0422 15:21:33.098139 138312735962944 jax_array_handlers.py:360] Scheduling D2H of 69 prioritized jax.Array.
I0422 15:21:33.098243 138312735962944 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0422 15:21:33.143848 138312735962944 base_pytree_checkpoint_handler.py:154] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.047304s
I0422 15:21:33.144010 138312735962944 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/blocking_gbytes_per_sec: 28.179 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.054741621017456055 s) (per-host)
I0422 15:21:33.144056 138312735962944 base_pytree_checkpoint_handler.py:768] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.054798s (batch_requests_ready=0.002786s, total_serialization_initiated=0.051949s, others=0.000063s)
I0422 15:21:33.144169 138312735962944 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.059770s (all_items=0.000015s, per_item={'items': '0.00001478'}, temp_paths=0.059755)
I0422 15:21:33.144887 138312735962944 event_tracking.py:125] [process=6] [async] Finished blocking save in 0.34 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints/5.
I0422 15:21:33.145253 138182334293760 async_checkpointer.py:76] [process=6][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-22 15:41:33.145216
I0422 15:21:33.151180 138312735962944 checkpoint_manager.py:1560] [process=6][thread=MainThread][step=5] Starting CheckpointManager Save Finalize thread=save_finalize
I0422 15:21:33.151437 138183425636096 async_checkpointer.py:280] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0422 15:21:33.151601 138312735962944 standard_logger.py:34] {'step': 5, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776871275.884054, 'wait_for_prev_duration_secs': 16.91921830177307, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776871292.8054235, 'checkpointer_blocking_duration_secs': 0.33997201919555664, 'get_old_steps_start_time': 1776871293.1454182, 'get_old_steps_duration_secs': 3.0279159545898438e-05, 'checkpoint_manager_blocking_start_time': 1776871275.8817532, 'checkpoint_manager_blocking_duration_secs': 17.269814014434814}
I0422 15:21:33.151713 138312735962944 checkpointing.py:409] Started an asynchronous checkpoint save for step 5
I0422 15:21:33.152729 138312735962944 metric_logger.py:196] completed step: 4, seconds: 0.164, TFLOP/s/device: 82.745, Tokens/s/device: 12472.291, total_weights: 65536, loss: 9.457, lm_loss: 9.457, perplexity: 12802.436
I0422 15:21:33.160605 138312735962944 metric_logger.py:196] completed step: 5, seconds: 0.147, TFLOP/s/device: 92.289, Tokens/s/device: 13910.870, total_weights: 65536, loss: 8.955, lm_loss: 8.955, perplexity: 7748.778
I0422 15:21:38.770417 138183434028800 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints/5/items/array_metadatas/process_6
I0422 15:21:43.800827 138182334293760 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/gbytes_per_sec: 147.467 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 10.711519479751587 s) (per-host)
I0422 15:21:43.800950 138182334293760 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 10.655584s.
I0422 15:21:50.607041 138182334293760 async_checkpointer.py:160] [process=6][thread=async_save] Background save thread done. Time taken: 17.461658s.
I0422 15:21:50.607362 138183425636096 async_checkpointer.py:288] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0422 15:21:50.607481 138183425636096 async_checkpointer.py:298] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0422 15:21:50.607529 138183425636096 checkpoint_manager.py:2137] [process=6][thread=save_finalize][step=5] CheckpointManager Save Finalize is syncing with other hosts...
I0422 15:21:50.608883 138183425636096 checkpoint_manager.py:2146] [process=6][thread=save_finalize][step=5] CheckpointManager Save Finalize is done on all hosts.
I0422 15:21:56.279375 138312735962944 metric_logger.py:196] completed step: 6, seconds: 17.566, TFLOP/s/device: 0.773, Tokens/s/device: 116.587, total_weights: 65536, loss: 8.465, lm_loss: 8.465, perplexity: 4743.984
I0422 15:21:56.427033 138312735962944 metric_logger.py:196] completed step: 7, seconds: 22.974, TFLOP/s/device: 0.591, Tokens/s/device: 89.145, total_weights: 65536, loss: 7.998, lm_loss: 7.998, perplexity: 2976.266
I0422 15:21:56.574681 138312735962944 metric_logger.py:196] completed step: 8, seconds: 0.153, TFLOP/s/device: 88.710, Tokens/s/device: 13371.288, total_weights: 65536, loss: 7.567, lm_loss: 7.567, perplexity: 1933.479
I0422 15:21:56.722948 138312735962944 checkpointing.py:794] Waiting for step 9 to finish before checkpoint...
I0422 15:21:56.723655 138312735962944 checkpointing.py:798] Waited 0.0007200241088867188 seconds for step 9 to finish before starting checkpointing.
I0422 15:21:56.725569 138312735962944 checkpoint_manager.py:2009] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0422 15:21:56.727136 138312735962944 checkpoint_manager.py:1512] [process=6] Saving checkpoint at step 9
I0422 15:21:56.728460 138312735962944 event_tracking.py:70] [process=6] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints/9.
I0422 15:21:57.041834 138312735962944 jax_array_handlers.py:360] Scheduling D2H of 69 prioritized jax.Array.
I0422 15:21:57.042175 138312735962944 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0422 15:21:57.089255 138312735962944 base_pytree_checkpoint_handler.py:154] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.048947s
I0422 15:21:57.089421 138312735962944 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/blocking_gbytes_per_sec: 27.564 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.05596351623535156 s) (per-host)
I0422 15:21:57.089467 138312735962944 base_pytree_checkpoint_handler.py:768] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.056020s (batch_requests_ready=0.002816s, total_serialization_initiated=0.053143s, others=0.000061s)
I0422 15:21:57.089564 138312735962944 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.060070s (all_items=0.000010s, per_item={'items': '0.00001025'}, temp_paths=0.060060)
I0422 15:21:57.090322 138312735962944 event_tracking.py:125] [process=6] [async] Finished blocking save in 0.36 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints/9.
I0422 15:21:57.090665 138161342621440 async_checkpointer.py:76] [process=6][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-22 15:41:57.090627
I0422 15:21:57.092609 138312735962944 checkpoint_manager.py:1560] [process=6][thread=MainThread][step=9] Starting CheckpointManager Save Finalize thread=save_finalize
I0422 15:21:57.092818 138182905550592 async_checkpointer.py:280] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0422 15:21:57.092974 138312735962944 standard_logger.py:34] {'step': 9, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776871316.7255297, 'wait_for_prev_duration_secs': 7.724761962890625e-05, 'time_between_consecutive_saves_sec': 6.116607189178467, 'checkpointer_blocking_start_time': 1776871316.7271729, 'checkpointer_blocking_duration_secs': 0.36363792419433594, 'get_old_steps_start_time': 1776871317.0908282, 'get_old_steps_duration_secs': 2.384185791015625e-05, 'checkpoint_manager_blocking_start_time': 1776871316.7238908, 'checkpoint_manager_blocking_duration_secs': 0.3690526485443115}
I0422 15:21:57.093085 138312735962944 checkpointing.py:409] Started an asynchronous checkpoint save for step 9
I0422 15:21:57.093145 138312735962944 checkpoint_manager.py:2020] [process=6][thread=MainThread][step=9][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0422 15:22:02.278915 138161111955200 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260422_123906/nnx_xpk_feat_nnx_post_train_fixes_20260422_123906_15_ocdbt_false/checkpoints/9/items/array_metadatas/process_6
I0422 15:22:07.689666 138161342621440 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/gbytes_per_sec: 148.233 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 10.656170129776001 s) (per-host)
I0422 15:22:07.689791 138161342621440 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 10.599012s.
I0422 15:22:14.021647 138161342621440 async_checkpointer.py:160] [process=6][thread=async_save] Background save thread done. Time taken: 16.930850s.
I0422 15:22:14.021927 138182905550592 async_checkpointer.py:288] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0422 15:22:14.022046 138182905550592 async_checkpointer.py:298] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0422 15:22:14.022086 138182905550592 checkpoint_manager.py:2137] [process=6][thread=save_finalize][step=9] CheckpointManager Save Finalize is syncing with other hosts...
I0422 15:22:14.023571 138182905550592 checkpoint_manager.py:2146] [process=6][thread=save_finalize][step=9] CheckpointManager Save Finalize is done on all hosts.
I0422 15:22:14.023742 138312735962944 checkpoint_manager.py:2032] [process=6][thread=MainThread][step=9][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=9.
I0422 15:22:14.023904 138312735962944 checkpoint_manager.py:2009] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0422 15:22:14.024975 138312735962944 metric_logger.py:196] completed step: 9, seconds: 0.148, TFLOP/s/device: 91.818, Tokens/s/device: 13839.802, total_weights: 65536, loss: 7.176, lm_loss: 7.176, perplexity: 1307.366
Per train step:
 Total TFLOPs: 13.59 
 split as 93.93% learnable weight flops and 6.07% attention flops
XPK End: Wed Apr 22 15:22:22 UTC 2026
EXIT_CODE=0