MaxView

← Back to run

Log Summary

XPK Start: Fri Apr 17 06:22:49 UTC 2026
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2026-04-17 06:23:13.414371: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0417 06:23:13.591925 133150336943936 max_utils.py:273] Attempting to initialize the jax distributed system...
I0417 06:23:22.631889 133150336943936 distributed.py:149] Starting JAX distributed service on [::]:8482
I0417 06:23:22.634140 133150336943936 distributed.py:172] Connecting to JAX distributed service on mt-16-zarr3-true-rs-22fsd-slice-job-0-0.mt-16-zarr3-true-rs-22fsd:8482
I0417 06:23:23.618781 133150336943936 max_utils.py:284] Jax distributed system initialized!
I0417 06:23:29.743468 133150336943936 max_utils.py:800] System Information: Jax Version: 0.9.2
I0417 06:23:29.743576 133150336943936 max_utils.py:801] System Information: Jaxlib Version: 0.9.2
I0417 06:23:29.743617 133150336943936 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Mar 4 2026 11:32:08 (1772652728) cl/878335365
I0417 06:23:29.743652 133150336943936 train_utils.py:377] WARNING: Sequence packing is essentially ignored for synthetic data. Please use a real dataset to use sequence packing.
I0417 06:23:30.432708 133150336943936 maxtext_utils.py:1687] Num_devices: 32, shape (1, 1, 1, 32, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0417 06:23:30.545346 133150336943936 checkpointing.py:688] Setting up checkpoint logger...
I0417 06:23:30.545458 133150336943936 checkpointing.py:234] Creating checkpoint manager with ocdbt=True and zarr3=True
I0417 06:23:30.545502 133150336943936 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0417 06:23:30.545690 133150336943936 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7918e1918e30>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0417 06:23:31.785422 133150336943936 checkpointing.py:266] Enabling policy for fixed interval checkpointing.
I0417 06:23:31.785657 133150336943936 checkpoint_manager.py:708] [process=6][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=('items',), item_handlers={'items': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7917b756e0c0>}, handler_registry=None
I0417 06:23:31.785924 133150336943936 composite_checkpoint_handler.py:237] Deferred registration for item: "items". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7917b756e0c0>` for item "items" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0417 06:23:31.785975 133150336943936 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7904a41178c0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0417 06:23:31.786012 133150336943936 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7917b756e0c0>, ('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7917b756e0c0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7904a41178c0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7904a41178c0>}).
I0417 06:23:31.786320 133150336943936 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.34
I0417 06:23:31.786392 133150336943936 async_checkpointer.py:192] [process=6][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x79048c7e6020> timeout: 1200 secs and primary_host=0 for async checkpoint writes
I0417 06:23:34.125181 133028013664000 checkpoint.py:228] Read Metadata={'item_handlers': {'items': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776406842010384561, 'commit_timestamp_nsecs': 1776406888250617569, 'custom_metadata': {}} from gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/9/_CHECKPOINT_METADATA
I0417 06:23:34.379333 133028013664000 checkpoint_manager.py:1871] Missing metrics for step 9
I0417 06:23:34.795706 133150336943936 checkpoint_manager.py:1812] Found 3 checkpoint steps in gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints
I0417 06:23:35.235527 133150336943936 checkpoint_manager.py:929] [process=6][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=FixedIntervalPolicy(interval=5), preservation_policy=LatestN(n=None), prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x79048c34ba70>
I0417 06:23:35.235695 133150336943936 checkpointing.py:302] Checkpoint manager created!
I0417 06:23:36.199442 133150336943936 checkpointing.py:578] checkpoint manager exists so trying to load this run's existing checkpoint
I0417 06:23:36.199559 133150336943936 checkpointing.py:582] restoring from this run's directory step 9
I0417 06:23:36.447325 133150336943936 event_tracking.py:70] [process=6] [sync] Started load checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/9.
I0417 06:23:36.690117 133150336943936 checkpointer.py:307] Restoring checkpoint from gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/9.
I0417 06:23:36.690531 133150336943936 event_tracking.py:125] [process=6] [sync] Finished blocking load in 0.24 seconds. Continuing load @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/9.
I0417 06:23:38.111926    1461 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0417 06:23:46.137741 133150336943936 jax_array_handlers.py:843] [process=6] /jax/orbax/read/worker/io/requested throughput: 196.732 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 8.03264856338501 s) (per-host)
I0417 06:23:46.145918 133150336943936 base_pytree_checkpoint_handler.py:130] [process=6] /jax/checkpoint/read/gbytes_per_sec: 186.911 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 8.467152118682861 s) (per-host)
I0417 06:23:48.129401 133150336943936 event_tracking.py:138] [process=6] [sync] Finished load in 11.68 seconds @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/9
I0417 06:23:48.129564 133150336943936 standard_logger.py:34] {'step': 9, 'event_type': 'restore', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints', 'checkpointer_start_time': 1776407016.44723, 'checkpointer_duration_secs': 11.682301759719849, 'checkpoint_manager_start_time': 1776407016.200167, 'checkpoint_manager_duration_secs': 11.929367542266846}
fsdp: 32
I0417 06:23:48.204765 133150336943936 maxtext_utils.py:1805]  decoder/decoder_norm/scale/value
    Shape:     float32[2048]
    Physical:  (None,)
I0417 06:23:48.204898 133150336943936 maxtext_utils.py:1805]  decoder/layers/mlp/wi_0/kernel/value
    Shape:     float32[2048,16,7168]
    Physical:  ('fsdp', None, None)
I0417 06:23:48.204963 133150336943936 maxtext_utils.py:1805]  decoder/layers/mlp/wi_1/kernel/value
    Shape:     float32[2048,16,7168]
    Physical:  ('fsdp', None, None)
I0417 06:23:48.205016 133150336943936 maxtext_utils.py:1805]  decoder/layers/mlp/wo/kernel/value
    Shape:     float32[7168,16,2048]
    Physical:  (None, None, 'fsdp')
I0417 06:23:48.205063 133150336943936 maxtext_utils.py:1805]  decoder/layers/post_self_attention_layer_norm/scale/value
    Shape:     float32[2048,16]
    Physical:  (None, None)
I0417 06:23:48.205107 133150336943936 maxtext_utils.py:1805]  decoder/layers/pre_self_attention_layer_norm/scale/value
    Shape:     float32[2048,16]
    Physical:  (None, None)
I0417 06:23:48.205153 133150336943936 maxtext_utils.py:1805]  decoder/layers/self_attention/key/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, None, None)
I0417 06:23:48.205201 133150336943936 maxtext_utils.py:1805]  decoder/layers/self_attention/out/kernel/value
    Shape:     float32[16,16,128,2048]
    Physical:  (None, None, None, 'fsdp')
I0417 06:23:48.205249 133150336943936 maxtext_utils.py:1805]  decoder/layers/self_attention/query/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, None, None)
I0417 06:23:48.205287 133150336943936 maxtext_utils.py:1805]  decoder/layers/self_attention/value/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, None, None)
I0417 06:23:48.205320 133150336943936 maxtext_utils.py:1805]  decoder/logits_dense/kernel/value
    Shape:     float32[2048,32000]
    Physical:  ('fsdp', None)
I0417 06:23:48.205352 133150336943936 maxtext_utils.py:1805]  token_embedder/embedding/value
    Shape:     float32[32000,2048]
    Physical:  (None, 'fsdp')

I0417 06:23:48.607445 133150336943936 nnx_decoders.py:465] nnx_decoders/carry Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed').
I0417 06:23:48.607537 133150336943936 nnx_decoders.py:465] nnx_decoders/carry Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0417 06:23:48.613113 133150336943936 nnx_decoders.py:465] Unknown Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed').
I0417 06:23:48.613173 133150336943936 nnx_decoders.py:465] Unknown Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0417 06:23:48.629772 133150336943936 attentions.py:1088] attentions/inputs_q Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed').
I0417 06:23:48.629845 133150336943936 attentions.py:1088] attentions/inputs_q Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0417 06:23:48.645780 133150336943936 attentions.py:1089] attentions/inputs_kv Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed').
I0417 06:23:48.645851 133150336943936 attentions.py:1089] attentions/inputs_kv Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0417 06:23:48.670162 133150336943936 attentions.py:1154] attentions/query Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0417 06:23:48.670229 133150336943936 attentions.py:1154] attentions/query Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0417 06:23:48.686246 133150336943936 attentions.py:1155] attentions/key Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0417 06:23:48.686306 133150336943936 attentions.py:1155] attentions/key Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0417 06:23:48.702299 133150336943936 attentions.py:1156] attentions/value Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0417 06:23:48.702356 133150336943936 attentions.py:1156] attentions/value Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0417 06:23:48.732521 133150336943936 attentions.py:1197] attentions/out Logical: bfloat16[32,2048,16,128].................................... ('activation_batch', 'activation_attn_length', 'activation_heads', 'activation_kv').
I0417 06:23:48.732597 133150336943936 attentions.py:1197] attentions/out Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0417 06:23:48.752547 133150336943936 linears.py:525] linears/x Logical: bfloat16[32,2048,7168]...................................... ('activation_batch', 'activation_length', 'activation_mlp').
I0417 06:23:48.752613 133150336943936 linears.py:525] linears/x Physical: bfloat16[32,2048,7168]...................................... ('fsdp', None, None).
I0417 06:23:59.866646 133150336943936 max_utils.py:791] Total memory size: 1.5 GB, Output size: 0.4 GB, Temp size: 1.1 GB, Argument size: 0.4 GB, Host temp size: 0.0 GB.
I0417 06:23:59.869027 133150336943936 metric_logger.py:289] number parameters: 1.104 billion
I0417 06:24:10.917815 133150336943936 checkpointing.py:794] Waiting for step 10 to finish before checkpoint...
I0417 06:24:11.070817 133150336943936 checkpointing.py:798] Waited 0.1529848575592041 seconds for step 10 to finish before starting checkpointing.
I0417 06:24:11.073296 133150336943936 checkpoint_manager.py:2009] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0417 06:24:11.075230 133150336943936 checkpoint_manager.py:1512] [process=6] Saving checkpoint at step 10
I0417 06:24:11.076632 133150336943936 event_tracking.py:70] [process=6] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/10.
I0417 06:24:11.397045 133150336943936 signaling_client.py:364] Using JaxDistributedSignalingClient
I0417 06:24:11.398028 133150336943936 jax_array_handlers.py:360] Scheduling D2H of 69 prioritized jax.Array.
I0417 06:24:11.398087 133150336943936 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0417 06:24:11.676733 133150336943936 base_pytree_checkpoint_handler.py:154] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.280461s
I0417 06:24:11.676935 133150336943936 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/blocking_gbytes_per_sec: 5.362 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.28771018981933594 s) (per-host)
I0417 06:24:11.676994 133150336943936 base_pytree_checkpoint_handler.py:768] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.287782s (batch_requests_ready=0.002868s, total_serialization_initiated=0.284832s, others=0.000083s)
I0417 06:24:11.677094 133150336943936 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.292080s (all_items=0.000018s, per_item={'items': '0.00001812'}, temp_paths=0.292062)
I0417 06:24:11.677855 133150336943936 event_tracking.py:125] [process=6] [async] Finished blocking save in 0.60 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/10.
I0417 06:24:11.678193 133026923136768 async_checkpointer.py:76] [process=6][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-17 06:44:11.678156
I0417 06:24:11.682785 133150336943936 checkpoint_manager.py:1560] [process=6][thread=MainThread][step=10] Starting CheckpointManager Save Finalize thread=save_finalize
I0417 06:24:11.683107 133025169913600 async_checkpointer.py:280] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0417 06:24:11.683257 133150336943936 standard_logger.py:34] {'step': 10, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776407051.0732772, 'wait_for_prev_duration_secs': 6.0558319091796875e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776407051.0752697, 'checkpointer_blocking_duration_secs': 0.6030831336975098, 'get_old_steps_start_time': 1776407051.678379, 'get_old_steps_duration_secs': 3.123283386230469e-05, 'checkpoint_manager_blocking_start_time': 1776407051.0711951, 'checkpoint_manager_blocking_duration_secs': 0.6120278835296631}
I0417 06:24:11.683366 133150336943936 checkpointing.py:409] Started an asynchronous checkpoint save for step 10
I0417 06:24:11.683435 133150336943936 max_utils.py:750] 
Memstats: After params initialized:
I0417 06:24:11.683483 133150336943936 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_24(process=6,(0,6,0,0))
I0417 06:24:11.683515 133150336943936 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_25(process=6,(1,6,0,0))
I0417 06:24:11.683542 133150336943936 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_28(process=6,(0,7,0,0))
I0417 06:24:11.683566 133150336943936 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_29(process=6,(1,7,0,0))
I0417 06:24:12.007062 133150336943936 metric_logger.py:185] completed step: 10, seconds: 11.048, TFLOP/s/device: 1.230, Tokens/s/device: 185.379, total_weights: 65536, loss: 6.827
I0417 06:24:12.178483 133150336943936 metric_logger.py:185] completed step: 11, seconds: 1.088, TFLOP/s/device: 12.491, Tokens/s/device: 1882.846, total_weights: 65536, loss: 6.522
I0417 06:24:12.326054 133150336943936 metric_logger.py:185] completed step: 12, seconds: 0.015, TFLOP/s/device: 935.753, Tokens/s/device: 141046.832, total_weights: 65536, loss: 6.260
I0417 06:24:12.474053 133150336943936 metric_logger.py:185] completed step: 13, seconds: 0.165, TFLOP/s/device: 82.318, Tokens/s/device: 12407.835, total_weights: 65536, loss: 6.039
I0417 06:24:12.480563 133150336943936 checkpointing.py:794] Waiting for step 15 to finish before checkpoint...
I0417 06:24:12.769528 133150336943936 checkpointing.py:798] Waited 0.28893566131591797 seconds for step 15 to finish before starting checkpointing.
I0417 06:24:12.772284 133150336943936 checkpoint_manager.py:2020] [process=6][thread=MainThread][step=10][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0417 06:24:17.427339 133025178306304 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/10/items/array_metadatas/process_6
I0417 06:24:54.164615 133026923136768 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/gbytes_per_sec: 36.928 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 42.775357484817505 s) (per-host)
I0417 06:24:54.164742 133026923136768 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 42.486434s.
I0417 06:25:02.675019 133026923136768 async_checkpointer.py:160] [process=6][thread=async_save] Background save thread done. Time taken: 50.996696s.
I0417 06:25:02.675293 133025169913600 async_checkpointer.py:288] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0417 06:25:02.675413 133025169913600 async_checkpointer.py:298] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0417 06:25:02.675462 133025169913600 checkpoint_manager.py:2137] [process=6][thread=save_finalize][step=10] CheckpointManager Save Finalize is syncing with other hosts...
I0417 06:25:02.676939 133025169913600 checkpoint_manager.py:2146] [process=6][thread=save_finalize][step=10] CheckpointManager Save Finalize is done on all hosts.
I0417 06:25:02.677101 133150336943936 checkpoint_manager.py:2032] [process=6][thread=MainThread][step=10][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=10.
W0417 06:25:02.677233 133150336943936 checkpoint_manager.py:1452] Waiting for previous save to complete took 49.904951 seconds. If this number is high, consider checkpointing less frequently.
I0417 06:25:02.679199 133150336943936 checkpoint_manager.py:1512] [process=6] Saving checkpoint at step 15
I0417 06:25:02.681517 133150336943936 event_tracking.py:70] [process=6] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/15.
I0417 06:25:02.982021 133150336943936 jax_array_handlers.py:360] Scheduling D2H of 69 prioritized jax.Array.
I0417 06:25:02.982115 133150336943936 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0417 06:25:03.009311 133150336943936 base_pytree_checkpoint_handler.py:154] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.028837s
I0417 06:25:03.009476 133150336943936 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/blocking_gbytes_per_sec: 42.630 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.036185503005981445 s) (per-host)
I0417 06:25:03.009528 133150336943936 base_pytree_checkpoint_handler.py:768] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.036245s (batch_requests_ready=0.002808s, total_serialization_initiated=0.033372s, others=0.000066s)
I0417 06:25:03.009638 133150336943936 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.041195s (all_items=0.000012s, per_item={'items': '0.00001240'}, temp_paths=0.041182)
I0417 06:25:03.010328 133150336943936 event_tracking.py:125] [process=6] [async] Finished blocking save in 0.33 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/15.
I0417 06:25:03.010658 133025169913600 async_checkpointer.py:76] [process=6][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-17 06:45:03.010622
I0417 06:25:03.030062 133150336943936 checkpoint_manager.py:1560] [process=6][thread=MainThread][step=15] Starting CheckpointManager Save Finalize thread=save_finalize
I0417 06:25:03.030363 133021524645632 async_checkpointer.py:280] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0417 06:25:03.030539 133150336943936 standard_logger.py:34] {'step': 15, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776407052.7722535, 'wait_for_prev_duration_secs': 49.9049506187439, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776407102.67924, 'checkpointer_blocking_duration_secs': 0.3315558433532715, 'get_old_steps_start_time': 1776407103.010838, 'get_old_steps_duration_secs': 3.4809112548828125e-05, 'checkpoint_manager_blocking_start_time': 1776407052.769907, 'checkpoint_manager_blocking_duration_secs': 50.260597705841064}
I0417 06:25:03.030661 133150336943936 checkpointing.py:409] Started an asynchronous checkpoint save for step 15
I0417 06:25:03.031547 133150336943936 metric_logger.py:185] completed step: 14, seconds: 0.147, TFLOP/s/device: 92.411, Tokens/s/device: 13929.130, total_weights: 65536, loss: 5.858
I0417 06:25:03.039195 133150336943936 metric_logger.py:185] completed step: 15, seconds: 0.148, TFLOP/s/device: 91.669, Tokens/s/device: 13817.392, total_weights: 65536, loss: 5.713
I0417 06:25:09.052990 133025178306304 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/15/items/array_metadatas/process_6
I0417 06:25:24.981702 133150336943936 metric_logger.py:185] completed step: 16, seconds: 50.557, TFLOP/s/device: 0.269, Tokens/s/device: 40.509, total_weights: 65536, loss: 5.599
I0417 06:25:25.129344 133150336943936 metric_logger.py:185] completed step: 17, seconds: 21.797, TFLOP/s/device: 0.623, Tokens/s/device: 93.956, total_weights: 65536, loss: 5.511
I0417 06:25:25.276650 133150336943936 metric_logger.py:185] completed step: 18, seconds: 0.153, TFLOP/s/device: 88.901, Tokens/s/device: 13400.072, total_weights: 65536, loss: 5.442
I0417 06:25:25.425119 133150336943936 checkpointing.py:794] Waiting for step 19 to finish before checkpoint...
I0417 06:25:25.425807 133150336943936 checkpointing.py:798] Waited 0.0006990432739257812 seconds for step 19 to finish before starting checkpointing.
I0417 06:25:25.427943 133150336943936 checkpoint_manager.py:2020] [process=6][thread=MainThread][step=15][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0417 06:25:44.728803 133025169913600 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/gbytes_per_sec: 37.830 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 41.75546836853027 s) (per-host)
I0417 06:25:44.728923 133025169913600 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 41.718157s.
I0417 06:25:53.065778 133025169913600 async_checkpointer.py:160] [process=6][thread=async_save] Background save thread done. Time taken: 50.054998s.
I0417 06:25:53.066113 133021524645632 async_checkpointer.py:288] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0417 06:25:53.066228 133021524645632 async_checkpointer.py:298] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0417 06:25:53.066275 133021524645632 checkpoint_manager.py:2137] [process=6][thread=save_finalize][step=15] CheckpointManager Save Finalize is syncing with other hosts...
I0417 06:25:53.067981 133021524645632 checkpoint_manager.py:2146] [process=6][thread=save_finalize][step=15] CheckpointManager Save Finalize is done on all hosts.
I0417 06:25:53.068145 133150336943936 checkpoint_manager.py:2032] [process=6][thread=MainThread][step=15][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=15.
W0417 06:25:53.068269 133150336943936 checkpoint_manager.py:1452] Waiting for previous save to complete took 27.640332 seconds. If this number is high, consider checkpointing less frequently.
I0417 06:25:53.070613 133150336943936 checkpoint_manager.py:1512] [process=6] Saving checkpoint at step 19
I0417 06:25:53.072711 133150336943936 event_tracking.py:70] [process=6] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/19.
I0417 06:25:54.208344 133150336943936 jax_array_handlers.py:360] Scheduling D2H of 69 prioritized jax.Array.
I0417 06:25:54.208437 133150336943936 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0417 06:25:54.234950 133150336943936 base_pytree_checkpoint_handler.py:154] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.028081s
I0417 06:25:54.235143 133150336943936 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/blocking_gbytes_per_sec: 43.647 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.03534197807312012 s) (per-host)
I0417 06:25:54.235201 133150336943936 base_pytree_checkpoint_handler.py:768] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.035413s (batch_requests_ready=0.002888s, total_serialization_initiated=0.032449s, others=0.000076s)
I0417 06:25:54.235324 133150336943936 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.039670s (all_items=0.000010s, per_item={'items': '0.00001049'}, temp_paths=0.039659)
I0417 06:25:54.236050 133150336943936 event_tracking.py:125] [process=6] [async] Finished blocking save in 1.17 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/19.
I0417 06:25:54.236377 133021524645632 async_checkpointer.py:76] [process=6][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-17 06:45:54.236340
I0417 06:25:54.238812 133150336943936 checkpoint_manager.py:1560] [process=6][thread=MainThread][step=19] Starting CheckpointManager Save Finalize thread=save_finalize
I0417 06:25:54.239069 133025178306304 async_checkpointer.py:280] [process=6][thread=save_finalize] Waiting for background save thread=async_save.
I0417 06:25:54.239200 133150336943936 standard_logger.py:34] {'step': 19, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776407125.4279127, 'wait_for_prev_duration_secs': 27.640332221984863, 'time_between_consecutive_saves_sec': 22.750930786132812, 'checkpointer_blocking_start_time': 1776407153.070652, 'checkpointer_blocking_duration_secs': 1.1658692359924316, 'get_old_steps_start_time': 1776407154.2365386, 'get_old_steps_duration_secs': 2.8133392333984375e-05, 'checkpoint_manager_blocking_start_time': 1776407125.4260738, 'checkpoint_manager_blocking_duration_secs': 28.813093423843384}
I0417 06:25:54.239311 133150336943936 checkpointing.py:409] Started an asynchronous checkpoint save for step 19
I0417 06:25:54.239353 133150336943936 checkpoint_manager.py:2020] [process=6][thread=MainThread][step=19][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0417 06:26:00.221971 133021009164032 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260417_023710/nnx_xpk_feat_nnx_post_train_fixes_20260417_023710_16_zarr3_true/checkpoints/19/items/array_metadatas/process_6
I0417 06:26:35.746130 133021524645632 base_pytree_checkpoint_handler.py:130] [process=6] /jax/orbax/write/gbytes_per_sec: 38.020 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 41.54629635810852 s) (per-host)
I0417 06:26:35.746252 133021524645632 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 41.509767s.
I0417 06:26:44.775292 133021524645632 async_checkpointer.py:160] [process=6][thread=async_save] Background save thread done. Time taken: 50.538790s.
I0417 06:26:44.775597 133025178306304 async_checkpointer.py:288] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save.
I0417 06:26:44.775714 133025178306304 async_checkpointer.py:298] [process=6][thread=save_finalize] No errors found in background save thread=async_save.
I0417 06:26:44.775755 133025178306304 checkpoint_manager.py:2137] [process=6][thread=save_finalize][step=19] CheckpointManager Save Finalize is syncing with other hosts...
I0417 06:26:44.777263 133025178306304 checkpoint_manager.py:2146] [process=6][thread=save_finalize][step=19] CheckpointManager Save Finalize is done on all hosts.
I0417 06:26:44.777465 133150336943936 checkpoint_manager.py:2032] [process=6][thread=MainThread][step=19][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=19.
I0417 06:26:44.777680 133150336943936 checkpoint_manager.py:2009] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0417 06:26:44.778454 133150336943936 metric_logger.py:185] completed step: 19, seconds: 0.148, TFLOP/s/device: 91.838, Tokens/s/device: 13842.889, total_weights: 65536, loss: 5.385
Per train step:
 Total TFLOPs: 13.59 
 split as 93.93% learnable weight flops and 6.07% attention flops
XPK End: Fri Apr 17 06:26:54 UTC 2026
EXIT_CODE=0