MaxView

← Back to run

Log Summary

XPK Start: Wed Apr 22 17:59:52 UTC 2026
2026-04-22 17:59:56.690667: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1776880796.703445      10 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1776880796.707162      10 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1776880796.718695      10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1776880796.718714      10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1776880796.718716      10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1776880796.718719      10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2026-04-22 18:00:16.077419: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0422 18:00:16.611833 133349453268800 max_utils.py:273] Attempting to initialize the jax distributed system...
INFO:2026-04-22 18:00:25,653:jax._src.distributed:140: Starting JAX distributed service on [::]:8482
I0422 18:00:25.653745 133349453268800 distributed.py:140] Starting JAX distributed service on [::]:8482
INFO:2026-04-22 18:00:25,656:jax._src.distributed:157: Connecting to JAX distributed service on mt-09-pdb-lt-1-5qx1h-slice-job-0-0.mt-09-pdb-lt-1-5qx1h:8482
I0422 18:00:25.656208 133349453268800 distributed.py:157] Connecting to JAX distributed service on mt-09-pdb-lt-1-5qx1h-slice-job-0-0.mt-09-pdb-lt-1-5qx1h:8482
I0422 18:00:26.403593 133349453268800 max_utils.py:284] Jax distributed system initialized!
I0422 18:00:33.449366 133349453268800 max_utils.py:800] System Information: Jax Version: 0.8.1
I0422 18:00:33.449473 133349453268800 max_utils.py:801] System Information: Jaxlib Version: 0.8.1
I0422 18:00:33.449520 133349453268800 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Nov 12 2025 14:16:36 (1762985796) cl/831091709
I0422 18:00:33.449558 133349453268800 train_utils.py:377] WARNING: Sequence packing is essentially ignored for synthetic data. Please use a real dataset to use sequence packing.
I0422 18:00:34.084209 133349453268800 maxtext_utils.py:1631] Num_devices: 32, shape (1, 1, 1, 8, 1, 1, 1, 1, 4, 1, 1, 1, 1)
I0422 18:00:34.201411 133349453268800 checkpointing.py:688] Setting up checkpoint logger...
I0422 18:00:34.201530 133349453268800 checkpointing.py:234] Creating checkpoint manager with ocdbt=True and zarr3=True
I0422 18:00:34.201575 133349453268800 pytree_checkpoint_handler.py:589] save_device_host_concurrent_bytes=None
I0422 18:00:34.201796 133349453268800 base_pytree_checkpoint_handler.py:415] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7947527751c0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0422 18:00:37.180336 133349453268800 checkpointing.py:266] Enabling policy for fixed interval checkpointing.
I0422 18:00:37.180696 133349453268800 checkpoint_manager.py:709] [process=6][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=('items',), item_handlers={'items': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7932e022ef90>}, handler_registry=None
I0422 18:00:37.180937 133349453268800 composite_checkpoint_handler.py:237] Deferred registration for item: "items". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7932e022ef90>` for item "items" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0422 18:00:37.180986 133349453268800 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7932e046e000>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0422 18:00:37.181022 133349453268800 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7932e022ef90>, ('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7932e022ef90>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7932e046e000>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7932e046e000>}).
I0422 18:00:37.181381 133349453268800 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.33
I0422 18:00:37.181460 133349453268800 async_checkpointer.py:177] [process=6][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7932e03f0f40> timeout: 600 secs and primary_host=0 for async checkpoint writes
I0422 18:00:38.734173 133349453268800 checkpoint_manager.py:1818] Found 0 checkpoint steps in gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints
I0422 18:00:39.170388 133349453268800 checkpoint_manager.py:929] [process=6][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=FixedIntervalPolicy(interval=10), preservation_policy=LatestN(n=None), prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7932e01ef200>
I0422 18:00:39.170575 133349453268800 checkpointing.py:302] Checkpoint manager created!
I0422 18:00:40.112866 133349453268800 checkpointing.py:578] checkpoint manager exists so trying to load this run's existing checkpoint
I0422 18:00:40.112993 133349453268800 checkpointing.py:676] No existing checkpoints found, not restoring checkpoint.
fsdp: 8
tensor: 4
I0422 18:00:41.582069 133349453268800 maxtext_utils.py:1749]  decoder/decoder_norm/scale/value
    Shape:     float32[2048]
    Physical:  ('tensor',)
I0422 18:00:41.582204 133349453268800 maxtext_utils.py:1749]  decoder/layers/mlp/wi_0/kernel/value
    Shape:     float32[2048,16,7168]
    Physical:  ('fsdp', None, 'tensor')
I0422 18:00:41.582259 133349453268800 maxtext_utils.py:1749]  decoder/layers/mlp/wi_1/kernel/value
    Shape:     float32[2048,16,7168]
    Physical:  ('fsdp', None, 'tensor')
I0422 18:00:41.582303 133349453268800 maxtext_utils.py:1749]  decoder/layers/mlp/wo/kernel/value
    Shape:     float32[7168,16,2048]
    Physical:  ('tensor', None, 'fsdp')
I0422 18:00:41.582344 133349453268800 maxtext_utils.py:1749]  decoder/layers/post_self_attention_layer_norm/scale/value
    Shape:     float32[2048,16]
    Physical:  ('tensor', None)
I0422 18:00:41.582380 133349453268800 maxtext_utils.py:1749]  decoder/layers/pre_self_attention_layer_norm/scale/value
    Shape:     float32[2048,16]
    Physical:  ('tensor', None)
I0422 18:00:41.582419 133349453268800 maxtext_utils.py:1749]  decoder/layers/self_attention/key/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, 'tensor', None)
I0422 18:00:41.582458 133349453268800 maxtext_utils.py:1749]  decoder/layers/self_attention/out/kernel/value
    Shape:     float32[16,16,128,2048]
    Physical:  ('tensor', None, None, 'fsdp')
I0422 18:00:41.582497 133349453268800 maxtext_utils.py:1749]  decoder/layers/self_attention/query/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, 'tensor', None)
I0422 18:00:41.582532 133349453268800 maxtext_utils.py:1749]  decoder/layers/self_attention/value/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, 'tensor', None)

I0422 18:00:41.582570 133349453268800 maxtext_utils.py:1749]  decoder/logits_dense/kernel/value
    Shape:     float32[2048,32000]
    Physical:  ('fsdp', 'tensor')
I0422 18:00:41.582606 133349453268800 maxtext_utils.py:1749]  token_embedder/embedding/value
    Shape:     float32[32000,2048]
    Physical:  ('tensor', 'fsdp')
I0422 18:00:41.993775 133349453268800 nnx_decoders.py:616] nnx_decoders/carry Logical: bfloat16[8,2048,2048]....................................... ('activation_batch', 'activation_norm_length', 'activation_embed').
I0422 18:00:41.993927 133349453268800 nnx_decoders.py:616] nnx_decoders/carry Physical: bfloat16[8,2048,2048]....................................... ('fsdp', None, 'tensor').
I0422 18:00:41.999363 133349453268800 nnx_decoders.py:616] Unknown Logical: bfloat16[8,2048,2048]....................................... ('activation_batch', 'activation_norm_length', 'activation_embed').
I0422 18:00:41.999476 133349453268800 nnx_decoders.py:616] Unknown Physical: bfloat16[8,2048,2048]....................................... ('fsdp', None, 'tensor').
I0422 18:00:42.148849 133349453268800 attentions.py:1084] attentions/inputs_q Logical: bfloat16[8,2048,2048]....................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed').
I0422 18:00:42.148983 133349453268800 attentions.py:1084] attentions/inputs_q Physical: bfloat16[8,2048,2048]....................................... ('fsdp', None, 'tensor').
I0422 18:00:42.304125 133349453268800 attentions.py:1085] attentions/inputs_kv Logical: bfloat16[8,2048,2048]....................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed').
I0422 18:00:42.304262 133349453268800 attentions.py:1085] attentions/inputs_kv Physical: bfloat16[8,2048,2048]....................................... ('fsdp', None, 'tensor').
I0422 18:00:42.461086 133349453268800 attentions.py:1150] attentions/query Logical: bfloat16[8,2048,16,128]..................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0422 18:00:42.461253 133349453268800 attentions.py:1150] attentions/query Physical: bfloat16[8,2048,16,128]..................................... ('fsdp', None, 'tensor', None).
I0422 18:00:42.610657 133349453268800 attentions.py:1151] attentions/key Logical: bfloat16[8,2048,16,128]..................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0422 18:00:42.610794 133349453268800 attentions.py:1151] attentions/key Physical: bfloat16[8,2048,16,128]..................................... ('fsdp', None, 'tensor', None).
I0422 18:00:42.761353 133349453268800 attentions.py:1152] attentions/value Logical: bfloat16[8,2048,16,128]..................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0422 18:00:42.761485 133349453268800 attentions.py:1152] attentions/value Physical: bfloat16[8,2048,16,128]..................................... ('fsdp', None, 'tensor', None).
I0422 18:00:42.925987 133349453268800 attentions.py:1193] attentions/out Logical: bfloat16[8,2048,16,128]..................................... ('activation_batch', 'activation_attn_length', 'activation_heads', 'activation_kv').
I0422 18:00:42.926192 133349453268800 attentions.py:1193] attentions/out Physical: bfloat16[8,2048,16,128]..................................... ('fsdp', None, 'tensor', None).
I0422 18:00:42.949281 133349453268800 linears.py:541] linears/x Logical: bfloat16[8,2048,7168]....................................... ('activation_batch', 'activation_length', 'activation_mlp').
I0422 18:00:42.949380 133349453268800 linears.py:541] linears/x Physical: bfloat16[8,2048,7168]....................................... ('fsdp', None, 'tensor').
I0422 18:00:53.193205 133349453268800 max_utils.py:791] Total memory size: 0.9 GB, Output size: 0.4 GB, Temp size: 0.5 GB, Argument size: 0.4 GB, Host temp size: 0.0 GB.
I0422 18:00:53.196251 133349453268800 metric_logger.py:289] number parameters: 1.104 billion
I0422 18:00:54.509411 133349453268800 checkpointing.py:794] Waiting for step 0 to finish before checkpoint...
I0422 18:01:02.925433 133349453268800 checkpointing.py:798] Waited 8.41600775718689 seconds for step 0 to finish before starting checkpointing.
I0422 18:01:02.928195 133349453268800 checkpoint_manager.py:2013] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0422 18:01:02.930031 133349453268800 checkpoint_manager.py:1518] [process=6] Saving checkpoint at step 0
I0422 18:01:02.932645 133349453268800 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints/0.
I0422 18:01:03.894528 133349453268800 signaling_client.py:364] Using JaxDistributedSignalingClient
I0422 18:01:03.895617 133349453268800 jax_array_handlers.py:358] Scheduling D2H of 69 prioritized jax.Array.
I0422 18:01:03.895675 133349453268800 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0422 18:01:04.123656 133349453268800 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.230303s
I0422 18:01:04.123831 133349453268800 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 3.123 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.49393677711486816 s) (per-host)
I0422 18:01:04.123886 133349453268800 base_pytree_checkpoint_handler.py:737] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.493999s (batch_requests_ready=0.257075s, total_serialization_initiated=0.236854s, others=0.000070s)
I0422 18:01:04.123980 133349453268800 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.498178s (all_items=0.000019s, per_item={'items': '0.00001884'}, temp_paths=0.498160)
I0422 18:01:04.124959 133226983053056 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started.
I0422 18:01:04.125138 133349453268800 async_checkpointer.py:561] Finished blocking save. Time taken: 1.195036s. Continuing background save to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints/0.
I0422 18:01:04.147249 133349453268800 checkpoint_manager.py:1566] [process=6][thread=MainThread][step=0] Starting CheckpointManager Save Finalize thread=save_finalize
I0422 18:01:04.147546 133224265127680 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0422 18:01:04.147708 133349453268800 standard_logger.py:34] {'step': 0, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776880862.928176, 'wait_for_prev_duration_secs': 6.794929504394531e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776880862.930073, 'checkpointer_blocking_duration_secs': 1.1951730251312256, 'get_old_steps_start_time': 1776880864.125268, 'get_old_steps_duration_secs': 2.4318695068359375e-05, 'checkpoint_manager_blocking_start_time': 1776880862.9258268, 'checkpoint_manager_blocking_duration_secs': 1.221843957901001}
I0422 18:01:04.147813 133349453268800 checkpointing.py:409] Started an asynchronous checkpoint save for step 0
I0422 18:01:04.147881 133349453268800 max_utils.py:750] 
Memstats: After params initialized:
I0422 18:01:04.147930 133349453268800 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_24(process=6,(0,6,0,0))
I0422 18:01:04.147961 133349453268800 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_25(process=6,(1,6,0,0))
I0422 18:01:04.147986 133349453268800 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_28(process=6,(0,7,0,0))
I0422 18:01:04.148011 133349453268800 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_29(process=6,(1,7,0,0))
I0422 18:01:04.417350 133349453268800 metric_logger.py:185] completed step: 0, seconds: 1.312, TFLOP/s/device: 2.589, Tokens/s/device: 390.236, total_weights: 16384, loss: 10.864
I0422 18:01:04.513408 133349453268800 metric_logger.py:185] completed step: 1, seconds: 9.906, TFLOP/s/device: 0.343, Tokens/s/device: 51.684, total_weights: 16384, loss: 10.864
I0422 18:01:04.951598 133349453268800 metric_logger.py:185] completed step: 2, seconds: 0.040, TFLOP/s/device: 84.122, Tokens/s/device: 12679.858, total_weights: 16384, loss: 9.760
I0422 18:01:05.010111 133349453268800 metric_logger.py:185] completed step: 3, seconds: 0.428, TFLOP/s/device: 7.944, Tokens/s/device: 1197.333, total_weights: 16384, loss: 8.820
I0422 18:01:05.127884 133349453268800 metric_logger.py:185] completed step: 4, seconds: 0.074, TFLOP/s/device: 45.896, Tokens/s/device: 6917.984, total_weights: 16384, loss: 7.947
I0422 18:01:05.135129 133349453268800 metric_logger.py:185] completed step: 5, seconds: 0.058, TFLOP/s/device: 58.752, Tokens/s/device: 8855.680, total_weights: 16384, loss: 7.209
I0422 18:01:17.755862 133349453268800 metric_logger.py:185] completed step: 6, seconds: 0.119, TFLOP/s/device: 28.637, Tokens/s/device: 4316.450, total_weights: 16384, loss: 6.651
I0422 18:01:17.814476 133349453268800 metric_logger.py:185] completed step: 7, seconds: 12.564, TFLOP/s/device: 0.270, Tokens/s/device: 40.752, total_weights: 16384, loss: 6.267
I0422 18:01:17.872926 133349453268800 metric_logger.py:185] completed step: 8, seconds: 0.064, TFLOP/s/device: 53.019, Tokens/s/device: 7991.634, total_weights: 16384, loss: 6.030
I0422 18:01:17.932477 133349453268800 checkpointing.py:794] Waiting for step 9 to finish before checkpoint...
I0422 18:01:17.933064 133349453268800 checkpointing.py:798] Waited 0.0005941390991210938 seconds for step 9 to finish before starting checkpointing.
I0422 18:01:17.935065 133349453268800 checkpoint_manager.py:2024] [process=6][thread=MainThread][step=0][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0422 18:01:19.348208    2439 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0422 18:01:21.312361 133226991445760 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints/0/items/array_metadatas/process_6
I0422 18:01:51.371142 133226983053056 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/gbytes_per_sec: 33.087 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 47.74119710922241 s) (per-host)
I0422 18:01:51.371264 133226983053056 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 47.246192s.
I0422 18:01:59.821460 133226983053056 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 55.696373s.
I0422 18:01:59.821778 133224265127680 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0422 18:01:59.821905 133224265127680 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0422 18:01:59.821956 133224265127680 checkpoint_manager.py:2133] [process=6][thread=save_finalize][step=0] CheckpointManager Save Finalize is syncing with other hosts...
I0422 18:01:59.823482 133224265127680 checkpoint_manager.py:2142] [process=6][thread=save_finalize][step=0] CheckpointManager Save Finalize is done on all hosts.
I0422 18:01:59.823695 133349453268800 checkpoint_manager.py:2036] [process=6][thread=MainThread][step=0][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=0.
W0422 18:01:59.823834 133349453268800 checkpoint_manager.py:1458] Waiting for previous save to complete took 41.888771 seconds. If this number is high, consider checkpointing less frequently.
I0422 18:01:59.825687 133349453268800 checkpoint_manager.py:1518] [process=6] Saving checkpoint at step 9
I0422 18:01:59.829336 133349453268800 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints/9.
I0422 18:02:00.812497 133349453268800 jax_array_handlers.py:358] Scheduling D2H of 69 prioritized jax.Array.
I0422 18:02:00.812665 133349453268800 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0422 18:02:00.855729 133349453268800 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.045344s
I0422 18:02:00.855895 133349453268800 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 5.065 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.30458545684814453 s) (per-host)
I0422 18:02:00.855949 133349453268800 base_pytree_checkpoint_handler.py:737] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.304648s (batch_requests_ready=0.254455s, total_serialization_initiated=0.050124s, others=0.000068s)
I0422 18:02:00.856040 133349453268800 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.308977s (all_items=0.000015s, per_item={'items': '0.00001526'}, temp_paths=0.308961)
I0422 18:02:00.856973 133226991445760 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started.
I0422 18:02:00.857142 133349453268800 async_checkpointer.py:561] Finished blocking save. Time taken: 1.031384s. Continuing background save to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints/9.
I0422 18:02:00.864908 133349453268800 checkpoint_manager.py:1566] [process=6][thread=MainThread][step=9] Starting CheckpointManager Save Finalize thread=save_finalize
I0422 18:02:00.865171 133224265127680 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0422 18:02:00.865329 133349453268800 standard_logger.py:34] {'step': 9, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776880877.9350376, 'wait_for_prev_duration_secs': 41.888771295547485, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776880919.8257263, 'checkpointer_blocking_duration_secs': 1.031529426574707, 'get_old_steps_start_time': 1776880920.8572764, 'get_old_steps_duration_secs': 2.4557113647460938e-05, 'checkpoint_manager_blocking_start_time': 1776880877.9332993, 'checkpoint_manager_blocking_duration_secs': 42.931997299194336}
I0422 18:02:00.865438 133349453268800 checkpointing.py:409] Started an asynchronous checkpoint save for step 9
I0422 18:02:00.865480 133349453268800 checkpoint_manager.py:2024] [process=6][thread=MainThread][step=9][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0422 18:02:06.226340 133216245634816 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/nnx_xpk_feat_nnx_set_defaults_true_20260422_154647_09_pdb_lt_1/checkpoints/9/items/array_metadatas/process_6
I0422 18:02:42.135187 133226991445760 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/gbytes_per_sec: 37.986 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 41.583836793899536 s) (per-host)
I0422 18:02:42.135305 133226991445760 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 41.278220s.
I0422 18:02:50.408351 133226991445760 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 49.551251s.
I0422 18:02:50.408682 133224265127680 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0422 18:02:50.408801 133224265127680 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0422 18:02:50.408851 133224265127680 checkpoint_manager.py:2133] [process=6][thread=save_finalize][step=9] CheckpointManager Save Finalize is syncing with other hosts...
I0422 18:02:50.411233 133224265127680 checkpoint_manager.py:2142] [process=6][thread=save_finalize][step=9] CheckpointManager Save Finalize is done on all hosts.
I0422 18:02:50.411412 133349453268800 checkpoint_manager.py:2036] [process=6][thread=MainThread][step=9][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=9.
I0422 18:02:50.411571 133349453268800 checkpoint_manager.py:2013] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0422 18:02:50.412184 133349453268800 metric_logger.py:185] completed step: 9, seconds: 0.058, TFLOP/s/device: 58.516, Tokens/s/device: 8820.135, total_weights: 16384, loss: 5.890
Per train step:
 Total TFLOPs: 3.40 
 split as 93.93% learnable weight flops and 6.07% attention flops
XPK End: Wed Apr 22 18:03:02 UTC 2026
EXIT_CODE=0