MaxView

← Back to run

Log Summary

XPK Start: Thu Apr 23 15:38:00 UTC 2026
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
`rope_parameters`'s factor field must be a float >= 1, got 40
`rope_parameters`'s beta_fast field must be a float, got 32
`rope_parameters`'s beta_slow field must be a float, got 1
DeepseekV32Config got `key=rope_scaling` in kwargs but hasn't set it as attribute. For RoPE standardization you need to set `self.rope_parameters` in model's config. 
2026-04-23 15:38:25.178972: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0423 15:38:25.388626 137609729292096 max_utils.py:273] Attempting to initialize the jax distributed system...
I0423 15:38:34.429597 137609729292096 distributed.py:149] Starting JAX distributed service on [::]:8482
I0423 15:38:34.432058 137609729292096 distributed.py:172] Connecting to JAX distributed service on mt-16-zarr3-true-rs-nr6ax-slice-job-0-0.mt-16-zarr3-true-rs-nr6ax:8482
I0423 15:38:35.612976 137609729292096 max_utils.py:284] Jax distributed system initialized!
I0423 15:38:41.608181 137609729292096 max_utils.py:800] System Information: Jax Version: 0.9.2
I0423 15:38:41.608287 137609729292096 max_utils.py:801] System Information: Jaxlib Version: 0.9.2
I0423 15:38:41.608327 137609729292096 max_utils.py:802] System Information: Jax Backend: PJRT C API
TFRT TPU v6 lite
Built on Mar 4 2026 11:32:08 (1772652728) cl/878335365
I0423 15:38:41.608362 137609729292096 train_utils.py:391] WARNING: Sequence packing is essentially ignored for synthetic data. Please use a real dataset to use sequence packing.
I0423 15:38:42.303873 137609729292096 maxtext_utils.py:1771] Num_devices: 32, shape (1, 1, 1, 32, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0423 15:38:42.416491 137609729292096 checkpointing.py:688] Setting up checkpoint logger...
I0423 15:38:42.416611 137609729292096 checkpointing.py:234] Creating checkpoint manager with ocdbt=True and zarr3=True
I0423 15:38:42.416656 137609729292096 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None
I0423 15:38:42.416864 137609729292096 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7d2731118290>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB)
I0423 15:38:43.745061 137609729292096 checkpointing.py:266] Enabling policy for fixed interval checkpointing.
I0423 15:38:43.745298 137609729292096 checkpoint_manager.py:708] [process=3][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=('items',), item_handlers={'items': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d1394386780>}, handler_registry=None
I0423 15:38:43.745546 137609729292096 composite_checkpoint_handler.py:237] Deferred registration for item: "items". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d1394386780>` for item "items" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`.
I0423 15:38:43.745595 137609729292096 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d139428e960>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`.
I0423 15:38:43.745630 137609729292096 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d1394386780>, ('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7d1394386780>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d139428e960>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7d139428e960>}).
I0423 15:38:43.745989 137609729292096 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.34
I0423 15:38:43.746060 137609729292096 async_checkpointer.py:192] [process=3][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7d13645336a0> timeout: 1200 secs and primary_host=0 for async checkpoint writes
I0423 15:38:45.626029 137492128835328 checkpoint.py:228] Read Metadata={'item_handlers': {'items': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776958463207390213, 'commit_timestamp_nsecs': 1776958505863953376, 'custom_metadata': {}} from gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/0/_CHECKPOINT_METADATA
I0423 15:38:45.879197 137492128835328 checkpoint_manager.py:1871] Missing metrics for step 0
I0423 15:38:45.915119 137609729292096 checkpoint_manager.py:1812] Found 3 checkpoint steps in gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints
I0423 15:38:46.327409 137609729292096 checkpoint_manager.py:929] [process=3][thread=MainThread] CheckpointManager created,  primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=FixedIntervalPolicy(interval=5), preservation_policy=LatestN(n=None), prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7d13645d0b30>
I0423 15:38:46.327578 137609729292096 checkpointing.py:302] Checkpoint manager created!
I0423 15:38:47.310011 137609729292096 checkpointing.py:578] checkpoint manager exists so trying to load this run's existing checkpoint
I0423 15:38:47.310134 137609729292096 checkpointing.py:582] restoring from this run's directory step 9
I0423 15:38:47.574054 137609729292096 event_tracking.py:70] [process=3] [sync] Started load checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/9.
I0423 15:38:47.830199 137609729292096 checkpointer.py:307] Restoring checkpoint from gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/9.
I0423 15:38:47.830614 137609729292096 event_tracking.py:125] [process=3] [sync] Finished blocking load in 0.26 seconds. Continuing load @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/9.
I0423 15:38:49.253900    1413 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com
I0423 15:38:56.641557 137609729292096 jax_array_handlers.py:843] [process=3] /jax/orbax/read/worker/io/requested throughput: 213.723 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 7.394039154052734 s) (per-host)
I0423 15:38:56.650302 137609729292096 base_pytree_checkpoint_handler.py:130] [process=3] /jax/checkpoint/read/gbytes_per_sec: 202.461 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 7.816838026046753 s) (per-host)
I0423 15:38:56.748632 137609729292096 event_tracking.py:138] [process=3] [sync] Finished load in 9.17 seconds @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/9
I0423 15:38:56.748785 137609729292096 standard_logger.py:34] {'step': 9, 'event_type': 'restore', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints', 'checkpointer_start_time': 1776958727.5739548, 'checkpointer_duration_secs': 9.174798011779785, 'checkpoint_manager_start_time': 1776958727.3107326, 'checkpoint_manager_duration_secs': 9.43802285194397}
fsdp: 32
I0423 15:38:56.824505 137609729292096 maxtext_utils.py:1889]  decoder/decoder_norm/scale/value
    Shape:     float32[2048]
    Physical:  (None,)
I0423 15:38:56.824619 137609729292096 maxtext_utils.py:1889]  decoder/layers/mlp/wi_0/kernel/value
    Shape:     float32[2048,16,7168]
    Physical:  ('fsdp', None, None)
I0423 15:38:56.824671 137609729292096 maxtext_utils.py:1889]  decoder/layers/mlp/wi_1/kernel/value
    Shape:     float32[2048,16,7168]
    Physical:  ('fsdp', None, None)
I0423 15:38:56.824719 137609729292096 maxtext_utils.py:1889]  decoder/layers/mlp/wo/kernel/value
    Shape:     float32[7168,16,2048]
    Physical:  (None, None, 'fsdp')
I0423 15:38:56.824757 137609729292096 maxtext_utils.py:1889]  decoder/layers/post_self_attention_layer_norm/scale/value
    Shape:     float32[2048,16]
    Physical:  (None, None)
I0423 15:38:56.824791 137609729292096 maxtext_utils.py:1889]  decoder/layers/pre_self_attention_layer_norm/scale/value
    Shape:     float32[2048,16]
    Physical:  (None, None)
I0423 15:38:56.824827 137609729292096 maxtext_utils.py:1889]  decoder/layers/self_attention/key/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, None, None)
I0423 15:38:56.824863 137609729292096 maxtext_utils.py:1889]  decoder/layers/self_attention/out/kernel/value
    Shape:     float32[16,16,128,2048]
    Physical:  (None, None, None, 'fsdp')
I0423 15:38:56.824898 137609729292096 maxtext_utils.py:1889]  decoder/layers/self_attention/query/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, None, None)
I0423 15:38:56.824944 137609729292096 maxtext_utils.py:1889]  decoder/layers/self_attention/value/kernel/value
    Shape:     float32[2048,16,16,128]
    Physical:  ('fsdp', None, None, None)
I0423 15:38:56.824982 137609729292096 maxtext_utils.py:1889]  decoder/logits_dense/kernel/value
    Shape:     float32[2048,32000]
    Physical:  ('fsdp', None)

I0423 15:38:56.825016 137609729292096 maxtext_utils.py:1889]  token_embedder/embedding/value
    Shape:     float32[32000,2048]
    Physical:  (None, 'fsdp')
I0423 15:38:57.228991 137609729292096 nnx_decoders.py:465] nnx_decoders/carry Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed').
I0423 15:38:57.229092 137609729292096 nnx_decoders.py:465] nnx_decoders/carry Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0423 15:38:57.234522 137609729292096 nnx_decoders.py:465] Unknown Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed').
I0423 15:38:57.234580 137609729292096 nnx_decoders.py:465] Unknown Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0423 15:38:57.251201 137609729292096 attentions.py:1088] attentions/inputs_q Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed').
I0423 15:38:57.251262 137609729292096 attentions.py:1088] attentions/inputs_q Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0423 15:38:57.267141 137609729292096 attentions.py:1089] attentions/inputs_kv Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed').
I0423 15:38:57.267205 137609729292096 attentions.py:1089] attentions/inputs_kv Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None).
I0423 15:38:57.292326 137609729292096 attentions.py:1154] attentions/query Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0423 15:38:57.292396 137609729292096 attentions.py:1154] attentions/query Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0423 15:38:57.308544 137609729292096 attentions.py:1155] attentions/key Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0423 15:38:57.308606 137609729292096 attentions.py:1155] attentions/key Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0423 15:38:57.324710 137609729292096 attentions.py:1156] attentions/value Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim').
I0423 15:38:57.324769 137609729292096 attentions.py:1156] attentions/value Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0423 15:38:57.354346 137609729292096 attentions.py:1198] attentions/out Logical: bfloat16[32,2048,16,128].................................... ('activation_batch', 'activation_attn_length', 'activation_heads', 'activation_kv').
I0423 15:38:57.354420 137609729292096 attentions.py:1198] attentions/out Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None).
I0423 15:38:57.374432 137609729292096 linears.py:525] linears/x Logical: bfloat16[32,2048,7168]...................................... ('activation_batch', 'activation_length', 'activation_mlp').
I0423 15:38:57.374496 137609729292096 linears.py:525] linears/x Physical: bfloat16[32,2048,7168]...................................... ('fsdp', None, None).
I0423 15:39:08.663573 137609729292096 max_utils.py:791] Total memory size: 1.5 GB, Output size: 0.4 GB, Temp size: 1.1 GB, Argument size: 0.4 GB, Host temp size: 0.0 GB.
I0423 15:39:08.666023 137609729292096 metric_logger.py:301] number parameters: 1.104 billion
I0423 15:39:19.923452 137609729292096 checkpointing.py:794] Waiting for step 10 to finish before checkpoint...
I0423 15:39:20.076506 137609729292096 checkpointing.py:798] Waited 0.15303778648376465 seconds for step 10 to finish before starting checkpointing.
I0423 15:39:20.078785 137609729292096 checkpoint_manager.py:2009] [process=3][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0423 15:39:20.080667 137609729292096 checkpoint_manager.py:1512] [process=3] Saving checkpoint at step 10
I0423 15:39:20.082089 137609729292096 event_tracking.py:70] [process=3] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/10.
I0423 15:39:20.858580 137609729292096 signaling_client.py:364] Using JaxDistributedSignalingClient
I0423 15:39:20.859530 137609729292096 jax_array_handlers.py:360] Scheduling D2H of 69 prioritized jax.Array.
I0423 15:39:20.859588 137609729292096 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 15:39:21.139741 137609729292096 base_pytree_checkpoint_handler.py:154] [process=3][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.281943s
I0423 15:39:21.139921 137609729292096 base_pytree_checkpoint_handler.py:130] [process=3] /jax/orbax/write/blocking_gbytes_per_sec: 5.341 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.2888181209564209 s) (per-host)
I0423 15:39:21.139995 137609729292096 base_pytree_checkpoint_handler.py:768] [process=3][thread=MainThread] Initiated Pytree async_save. Time taken: 0.288903s (batch_requests_ready=0.002582s, total_serialization_initiated=0.286228s, others=0.000093s)
I0423 15:39:21.140098 137609729292096 composite_checkpoint_handler.py:715] [process=3][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.293303s (all_items=0.000018s, per_item={'items': '0.00001788'}, temp_paths=0.293285)
I0423 15:39:21.140855 137609729292096 event_tracking.py:125] [process=3] [async] Finished blocking save in 1.06 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/10.
I0423 15:39:21.141161 137491533248256 async_checkpointer.py:76] [process=3][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-23 15:59:21.141129
I0423 15:39:21.148822 137609729292096 checkpoint_manager.py:1560] [process=3][thread=MainThread][step=10] Starting CheckpointManager Save Finalize thread=save_finalize
I0423 15:39:21.149110 137490438534912 async_checkpointer.py:280] [process=3][thread=save_finalize] Waiting for background save thread=async_save.
I0423 15:39:21.149265 137609729292096 standard_logger.py:34] {'step': 10, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776958760.0787678, 'wait_for_prev_duration_secs': 5.793571472167969e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776958760.0807056, 'checkpointer_blocking_duration_secs': 1.0606071949005127, 'get_old_steps_start_time': 1776958761.1413379, 'get_old_steps_duration_secs': 3.0994415283203125e-05, 'checkpoint_manager_blocking_start_time': 1776958760.0769384, 'checkpoint_manager_blocking_duration_secs': 1.0722908973693848}
I0423 15:39:21.149382 137609729292096 checkpointing.py:409] Started an asynchronous checkpoint save for step 10
I0423 15:39:21.149451 137609729292096 max_utils.py:750] 
Memstats: After params initialized:
I0423 15:39:21.149503 137609729292096 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_10(process=3,(2,2,0,0))
I0423 15:39:21.149534 137609729292096 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_11(process=3,(3,2,0,0))
I0423 15:39:21.149560 137609729292096 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_14(process=3,(2,3,0,0))
I0423 15:39:21.149583 137609729292096 max_utils.py:756] 	Using (GB) 0.43 / 31.25 (1.376000%) on TPU_15(process=3,(3,3,0,0))
I0423 15:39:21.472480 137609729292096 metric_logger.py:196] completed step: 10, seconds: 11.256, TFLOP/s/device: 1.207, Tokens/s/device: 181.942, total_weights: 65536, loss: 6.827, lm_loss: 6.827, perplexity: 922.537
I0423 15:39:21.654293 137609729292096 metric_logger.py:196] completed step: 11, seconds: 1.547, TFLOP/s/device: 8.781, Tokens/s/device: 1323.626, total_weights: 65536, loss: 6.522, lm_loss: 6.522, perplexity: 679.821
I0423 15:39:21.801898 137609729292096 metric_logger.py:196] completed step: 12, seconds: 0.024, TFLOP/s/device: 556.987, Tokens/s/device: 83955.071, total_weights: 65536, loss: 6.260, lm_loss: 6.260, perplexity: 523.013
I0423 15:39:21.949458 137609729292096 metric_logger.py:196] completed step: 13, seconds: 0.165, TFLOP/s/device: 82.108, Tokens/s/device: 12376.193, total_weights: 65536, loss: 6.039, lm_loss: 6.039, perplexity: 419.520
I0423 15:39:21.955972 137609729292096 checkpointing.py:794] Waiting for step 15 to finish before checkpoint...
I0423 15:39:22.244645 137609729292096 checkpointing.py:798] Waited 0.2886478900909424 seconds for step 15 to finish before starting checkpointing.
I0423 15:39:22.247422 137609729292096 checkpoint_manager.py:2020] [process=3][thread=MainThread][step=10][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0423 15:39:26.199762 137490484684544 array_metadata_store.py:203] [process=3][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/10/items/array_metadatas/process_3
I0423 15:40:02.901855 137491533248256 base_pytree_checkpoint_handler.py:130] [process=3] /jax/orbax/write/gbytes_per_sec: 37.564 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 42.05071687698364 s) (per-host)
I0423 15:40:02.902005 137491533248256 async_checkpointer.py:90] [process=3][thread=async_save] 3 Handler Commit operations completed. Time taken: 41.760734s.
I0423 15:40:11.672261 137491533248256 async_checkpointer.py:160] [process=3][thread=async_save] Background save thread done. Time taken: 50.530975s.
I0423 15:40:11.672542 137490438534912 async_checkpointer.py:288] [process=3][thread=save_finalize] Done with waiting for background save thread=async_save.
I0423 15:40:11.672662 137490438534912 async_checkpointer.py:298] [process=3][thread=save_finalize] No errors found in background save thread=async_save.
I0423 15:40:11.672711 137490438534912 checkpoint_manager.py:2137] [process=3][thread=save_finalize][step=10] CheckpointManager Save Finalize is syncing with other hosts...
I0423 15:40:11.675336 137490438534912 checkpoint_manager.py:2146] [process=3][thread=save_finalize][step=10] CheckpointManager Save Finalize is done on all hosts.
I0423 15:40:11.675481 137609729292096 checkpoint_manager.py:2032] [process=3][thread=MainThread][step=10][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=10.
W0423 15:40:11.675568 137609729292096 checkpoint_manager.py:1452] Waiting for previous save to complete took 49.428145 seconds. If this number is high, consider checkpointing less frequently.
I0423 15:40:11.677836 137609729292096 checkpoint_manager.py:1512] [process=3] Saving checkpoint at step 15
I0423 15:40:11.680400 137609729292096 event_tracking.py:70] [process=3] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/15.
I0423 15:40:11.984390 137609729292096 jax_array_handlers.py:360] Scheduling D2H of 69 prioritized jax.Array.
I0423 15:40:11.984482 137609729292096 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 15:40:12.010105 137609729292096 base_pytree_checkpoint_handler.py:154] [process=3][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.027362s
I0423 15:40:12.010234 137609729292096 base_pytree_checkpoint_handler.py:130] [process=3] /jax/orbax/write/blocking_gbytes_per_sec: 45.148 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.03416728973388672 s) (per-host)
I0423 15:40:12.010280 137609729292096 base_pytree_checkpoint_handler.py:768] [process=3][thread=MainThread] Initiated Pytree async_save. Time taken: 0.034222s (batch_requests_ready=0.002603s, total_serialization_initiated=0.031560s, others=0.000059s)
I0423 15:40:12.010355 137609729292096 composite_checkpoint_handler.py:715] [process=3][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.039294s (all_items=0.000013s, per_item={'items': '0.00001287'}, temp_paths=0.039281)
I0423 15:40:12.011076 137609729292096 event_tracking.py:125] [process=3] [async] Finished blocking save in 0.33 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/15.
I0423 15:40:12.011404 137490438534912 async_checkpointer.py:76] [process=3][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-23 16:00:12.011368
I0423 15:40:12.033333 137609729292096 checkpoint_manager.py:1560] [process=3][thread=MainThread][step=15] Starting CheckpointManager Save Finalize thread=save_finalize
I0423 15:40:12.033637 137485107582720 async_checkpointer.py:280] [process=3][thread=save_finalize] Waiting for background save thread=async_save.
I0423 15:40:12.033819 137609729292096 standard_logger.py:34] {'step': 15, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776958762.247392, 'wait_for_prev_duration_secs': 49.42814540863037, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776958811.6778743, 'checkpointer_blocking_duration_secs': 0.33368396759033203, 'get_old_steps_start_time': 1776958812.0115845, 'get_old_steps_duration_secs': 3.24249267578125e-05, 'checkpoint_manager_blocking_start_time': 1776958762.2450135, 'checkpoint_manager_blocking_duration_secs': 49.78877234458923}
I0423 15:40:12.033942 137609729292096 checkpointing.py:409] Started an asynchronous checkpoint save for step 15
I0423 15:40:12.035055 137609729292096 metric_logger.py:196] completed step: 14, seconds: 0.147, TFLOP/s/device: 92.304, Tokens/s/device: 13913.043, total_weights: 65536, loss: 5.858, lm_loss: 5.858, perplexity: 350.067
I0423 15:40:12.043062 137609729292096 metric_logger.py:196] completed step: 15, seconds: 0.148, TFLOP/s/device: 91.883, Tokens/s/device: 13849.535, total_weights: 65536, loss: 5.713, lm_loss: 5.713, perplexity: 302.780
I0423 15:40:17.226884 137490484684544 array_metadata_store.py:203] [process=3][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/15/items/array_metadatas/process_3
I0423 15:40:33.563119 137609729292096 metric_logger.py:196] completed step: 16, seconds: 50.085, TFLOP/s/device: 0.271, Tokens/s/device: 40.890, total_weights: 65536, loss: 5.599, lm_loss: 5.599, perplexity: 270.201
I0423 15:40:33.710981 137609729292096 metric_logger.py:196] completed step: 17, seconds: 21.375, TFLOP/s/device: 0.636, Tokens/s/device: 95.814, total_weights: 65536, loss: 5.511, lm_loss: 5.511, perplexity: 247.443
I0423 15:40:33.858129 137609729292096 metric_logger.py:196] completed step: 18, seconds: 0.154, TFLOP/s/device: 88.217, Tokens/s/device: 13296.974, total_weights: 65536, loss: 5.442, lm_loss: 5.442, perplexity: 230.909
I0423 15:40:34.006568 137609729292096 checkpointing.py:794] Waiting for step 19 to finish before checkpoint...
I0423 15:40:34.007264 137609729292096 checkpointing.py:798] Waited 0.0007097721099853516 seconds for step 19 to finish before starting checkpointing.
I0423 15:40:34.009727 137609729292096 checkpoint_manager.py:2020] [process=3][thread=MainThread][step=15][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0423 15:40:54.017243 137490438534912 base_pytree_checkpoint_handler.py:130] [process=3] /jax/orbax/write/gbytes_per_sec: 37.573 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 42.04112982749939 s) (per-host)
I0423 15:40:54.017358 137490438534912 async_checkpointer.py:90] [process=3][thread=async_save] 3 Handler Commit operations completed. Time taken: 42.005834s.
I0423 15:41:02.854766 137490438534912 async_checkpointer.py:160] [process=3][thread=async_save] Background save thread done. Time taken: 50.843226s.
I0423 15:41:02.855100 137485107582720 async_checkpointer.py:288] [process=3][thread=save_finalize] Done with waiting for background save thread=async_save.
I0423 15:41:02.855222 137485107582720 async_checkpointer.py:298] [process=3][thread=save_finalize] No errors found in background save thread=async_save.
I0423 15:41:02.855271 137485107582720 checkpoint_manager.py:2137] [process=3][thread=save_finalize][step=15] CheckpointManager Save Finalize is syncing with other hosts...
I0423 15:41:02.856620 137485107582720 checkpoint_manager.py:2146] [process=3][thread=save_finalize][step=15] CheckpointManager Save Finalize is done on all hosts.
I0423 15:41:02.856790 137609729292096 checkpoint_manager.py:2032] [process=3][thread=MainThread][step=15][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=15.
W0423 15:41:02.856942 137609729292096 checkpoint_manager.py:1452] Waiting for previous save to complete took 28.847203 seconds. If this number is high, consider checkpointing less frequently.
I0423 15:41:02.858655 137609729292096 checkpoint_manager.py:1512] [process=3] Saving checkpoint at step 19
I0423 15:41:02.860832 137609729292096 event_tracking.py:70] [process=3] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/19.
I0423 15:41:03.174234 137609729292096 jax_array_handlers.py:360] Scheduling D2H of 69 prioritized jax.Array.
I0423 15:41:03.174335 137609729292096 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0423 15:41:03.200186 137609729292096 base_pytree_checkpoint_handler.py:154] [process=3][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.027609s
I0423 15:41:03.200343 137609729292096 base_pytree_checkpoint_handler.py:130] [process=3] /jax/orbax/write/blocking_gbytes_per_sec: 44.626 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.034566640853881836 s) (per-host)
I0423 15:41:03.200404 137609729292096 base_pytree_checkpoint_handler.py:768] [process=3][thread=MainThread] Initiated Pytree async_save. Time taken: 0.034639s (batch_requests_ready=0.002746s, total_serialization_initiated=0.031816s, others=0.000077s)
I0423 15:41:03.200501 137609729292096 composite_checkpoint_handler.py:715] [process=3][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.038838s (all_items=0.000012s, per_item={'items': '0.00001192'}, temp_paths=0.038826)
I0423 15:41:03.201234 137609729292096 event_tracking.py:125] [process=3] [async] Finished blocking save in 0.34 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/19.
I0423 15:41:03.201529 137485107582720 async_checkpointer.py:76] [process=3][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-23 16:01:03.201497
I0423 15:41:03.207523 137609729292096 checkpoint_manager.py:1560] [process=3][thread=MainThread][step=19] Starting CheckpointManager Save Finalize thread=save_finalize
I0423 15:41:03.207805 137486679000832 async_checkpointer.py:280] [process=3][thread=save_finalize] Waiting for background save thread=async_save.
I0423 15:41:03.207980 137609729292096 standard_logger.py:34] {'step': 19, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776958834.0096989, 'wait_for_prev_duration_secs': 28.847203493118286, 'time_between_consecutive_saves_sec': 22.334322929382324, 'checkpointer_blocking_start_time': 1776958862.8586965, 'checkpointer_blocking_duration_secs': 0.3429732322692871, 'get_old_steps_start_time': 1776958863.201687, 'get_old_steps_duration_secs': 2.956390380859375e-05, 'checkpoint_manager_blocking_start_time': 1776958834.00752, 'checkpoint_manager_blocking_duration_secs': 29.200428009033203}
I0423 15:41:03.208094 137609729292096 checkpointing.py:409] Started an asynchronous checkpoint save for step 19
I0423 15:41:03.208137 137609729292096 checkpoint_manager.py:2020] [process=3][thread=MainThread][step=19][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0423 15:41:08.649863 137490484684544 array_metadata_store.py:203] [process=3][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_post_train_fixes_20260423_124541/nnx_xpk_feat_nnx_post_train_fixes_20260423_124541_16_zarr3_true/checkpoints/19/items/array_metadatas/process_3
I0423 15:41:46.509124 137485107582720 base_pytree_checkpoint_handler.py:130] [process=3] /jax/orbax/write/gbytes_per_sec: 36.444 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 43.343308448791504 s) (per-host)
I0423 15:41:46.509242 137485107582720 async_checkpointer.py:90] [process=3][thread=async_save] 3 Handler Commit operations completed. Time taken: 43.307602s.
I0423 15:41:53.580085 137485107582720 async_checkpointer.py:160] [process=3][thread=async_save] Background save thread done. Time taken: 50.378428s.
I0423 15:41:53.580378 137486679000832 async_checkpointer.py:288] [process=3][thread=save_finalize] Done with waiting for background save thread=async_save.
I0423 15:41:53.580501 137486679000832 async_checkpointer.py:298] [process=3][thread=save_finalize] No errors found in background save thread=async_save.
I0423 15:41:53.580543 137486679000832 checkpoint_manager.py:2137] [process=3][thread=save_finalize][step=19] CheckpointManager Save Finalize is syncing with other hosts...
I0423 15:41:53.582122 137486679000832 checkpoint_manager.py:2146] [process=3][thread=save_finalize][step=19] CheckpointManager Save Finalize is done on all hosts.
I0423 15:41:53.582291 137609729292096 checkpoint_manager.py:2032] [process=3][thread=MainThread][step=19][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=19.
I0423 15:41:53.582456 137609729292096 checkpoint_manager.py:2009] [process=3][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0423 15:41:53.583457 137609729292096 metric_logger.py:196] completed step: 19, seconds: 0.147, TFLOP/s/device: 92.178, Tokens/s/device: 13894.071, total_weights: 65536, loss: 5.385, lm_loss: 5.385, perplexity: 218.045
Per train step:
 Total TFLOPs: 13.59 
 split as 93.93% learnable weight flops and 6.07% attention flops
XPK End: Thu Apr 23 15:42:05 UTC 2026
EXIT_CODE=0