XPK Start: Thu Apr 23 19:03:36 UTC 2026 2026-04-23 19:03:40.644490: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1776971020.658052 10 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1776971020.661878 10 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered W0000 00:00:1776971020.673551 10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. W0000 00:00:1776971020.673570 10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. W0000 00:00:1776971020.673572 10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. W0000 00:00:1776971020.673574 10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. 2026-04-23 19:03:59.823025: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303) I0423 19:04:00.348956 135856350082880 max_utils.py:273] Attempting to initialize the jax distributed system... INFO:2026-04-23 19:04:09,390:jax._src.distributed:140: Starting JAX distributed service on [::]:8482 I0423 19:04:09.390919 135856350082880 distributed.py:140] Starting JAX distributed service on [::]:8482 INFO:2026-04-23 19:04:09,393:jax._src.distributed:157: Connecting to JAX distributed service on mt-16-zarr3-false-rs-bxx84-slice-job-0-0.mt-16-zarr3-false-rs-bxx84:8482 I0423 19:04:09.393266 135856350082880 distributed.py:157] Connecting to JAX distributed service on mt-16-zarr3-false-rs-bxx84-slice-job-0-0.mt-16-zarr3-false-rs-bxx84:8482 I0423 19:04:10.723255 135856350082880 max_utils.py:284] Jax distributed system initialized! I0423 19:04:17.761762 135856350082880 max_utils.py:800] System Information: Jax Version: 0.8.1 I0423 19:04:17.761870 135856350082880 max_utils.py:801] System Information: Jaxlib Version: 0.8.1 I0423 19:04:17.761911 135856350082880 max_utils.py:802] System Information: Jax Backend: PJRT C API TFRT TPU v6 lite Built on Nov 12 2025 14:16:36 (1762985796) cl/831091709 I0423 19:04:17.761947 135856350082880 train_utils.py:377] WARNING: Sequence packing is essentially ignored for synthetic data. Please use a real dataset to use sequence packing. I0423 19:04:18.382504 135856350082880 maxtext_utils.py:1631] Num_devices: 32, shape (1, 1, 1, 32, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0423 19:04:18.497086 135856350082880 checkpointing.py:688] Setting up checkpoint logger... I0423 19:04:18.497221 135856350082880 checkpointing.py:234] Creating checkpoint manager with ocdbt=True and zarr3=False I0423 19:04:18.497267 135856350082880 pytree_checkpoint_handler.py:589] save_device_host_concurrent_bytes=None I0423 19:04:18.497480 135856350082880 base_pytree_checkpoint_handler.py:415] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=False, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7b8f011793a0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0423 19:04:20.192857 135856350082880 checkpointing.py:266] Enabling policy for fixed interval checkpointing. I0423 19:04:20.193233 135856350082880 checkpoint_manager.py:709] [process=6][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=('items',), item_handlers={'items': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b7abc3105f0>}, handler_registry=None I0423 19:04:20.193472 135856350082880 composite_checkpoint_handler.py:237] Deferred registration for item: "items". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b7abc3105f0>` for item "items" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. I0423 19:04:20.193520 135856350082880 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b7bb3789d60>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. I0423 19:04:20.193562 135856350082880 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b7abc3105f0>, ('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7b7abc3105f0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b7bb3789d60>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7b7bb3789d60>}). I0423 19:04:20.193902 135856350082880 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.33 I0423 19:04:20.193980 135856350082880 async_checkpointer.py:177] [process=6][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7b7abc2c8f40> timeout: 600 secs and primary_host=0 for async checkpoint writes I0423 19:04:22.009827 135736292517632 checkpoint.py:228] Read Metadata={'item_handlers': {'items': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776970842133493218, 'commit_timestamp_nsecs': 1776970890403191398, 'custom_metadata': {}} from gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/5/_CHECKPOINT_METADATA I0423 19:04:22.260193 135736292517632 checkpoint_manager.py:1877] Missing metrics for step 5 I0423 19:04:22.662282 135856350082880 checkpoint_manager.py:1818] Found 3 checkpoint steps in gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints I0423 19:04:23.077258 135856350082880 checkpoint_manager.py:929] [process=6][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=FixedIntervalPolicy(interval=5), preservation_policy=LatestN(n=None), prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7b7abc311a30> I0423 19:04:23.077433 135856350082880 checkpointing.py:302] Checkpoint manager created! I0423 19:04:24.019190 135856350082880 checkpointing.py:578] checkpoint manager exists so trying to load this run's existing checkpoint I0423 19:04:24.019318 135856350082880 checkpointing.py:582] restoring from this run's directory step 9 I0423 19:04:24.521383 135856350082880 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/9. I0423 19:04:25.972190 1412 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com I0423 19:04:33.886941 135856350082880 jax_array_handlers.py:837] [process=6] /jax/orbax/read/worker/io/requested throughput: 199.519 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 7.92043137550354 s) (per-host) I0423 19:04:33.895061 135856350082880 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/read/gbytes_per_sec: 189.305 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 8.360077857971191 s) (per-host) I0423 19:04:34.500573 135856350082880 checkpointer.py:318] Finished restoring checkpoint in 10.22 seconds from gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/9. I0423 19:04:34.500731 135856350082880 standard_logger.py:34] {'step': 9, 'event_type': 'restore', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints', 'checkpointer_start_time': 1776971064.277159, 'checkpointer_duration_secs': 10.223541498184204, 'checkpoint_manager_start_time': 1776971064.019926, 'checkpoint_manager_duration_secs': 10.480776309967041} fsdp: 32 I0423 19:04:34.581925 135856350082880 maxtext_utils.py:1749] decoder/decoder_norm/scale/value Shape: float32[2048] Physical: (None,) I0423 19:04:34.582044 135856350082880 maxtext_utils.py:1749] decoder/layers/mlp/wi_0/kernel/value Shape: float32[2048,16,7168] Physical: ('fsdp', None, None) I0423 19:04:34.582108 135856350082880 maxtext_utils.py:1749] decoder/layers/mlp/wi_1/kernel/value Shape: float32[2048,16,7168] Physical: ('fsdp', None, None) I0423 19:04:34.582156 135856350082880 maxtext_utils.py:1749] decoder/layers/mlp/wo/kernel/value Shape: float32[7168,16,2048] Physical: (None, None, 'fsdp') I0423 19:04:34.582195 135856350082880 maxtext_utils.py:1749] decoder/layers/post_self_attention_layer_norm/scale/value Shape: float32[2048,16] Physical: (None, None) I0423 19:04:34.582231 135856350082880 maxtext_utils.py:1749] decoder/layers/pre_self_attention_layer_norm/scale/value Shape: float32[2048,16] Physical: (None, None) I0423 19:04:34.582268 135856350082880 maxtext_utils.py:1749] decoder/layers/self_attention/key/kernel/value Shape: float32[2048,16,16,128] Physical: ('fsdp', None, None, None) I0423 19:04:34.582311 135856350082880 maxtext_utils.py:1749] decoder/layers/self_attention/out/kernel/value Shape: float32[16,16,128,2048] Physical: (None, None, None, 'fsdp') I0423 19:04:34.582348 135856350082880 maxtext_utils.py:1749] decoder/layers/self_attention/query/kernel/value Shape: float32[2048,16,16,128] Physical: ('fsdp', None, None, None) I0423 19:04:34.582381 135856350082880 maxtext_utils.py:1749] decoder/layers/self_attention/value/kernel/value Shape: float32[2048,16,16,128] Physical: ('fsdp', None, None, None) I0423 19:04:34.582414 135856350082880 maxtext_utils.py:1749] decoder/logits_dense/kernel/value Shape: float32[2048,32000] Physical: ('fsdp', None) I0423 19:04:34.582447 135856350082880 maxtext_utils.py:1749] token_embedder/embedding/value Shape: float32[32000,2048] Physical: (None, 'fsdp') I0423 19:04:34.983242 135856350082880 nnx_decoders.py:616] nnx_decoders/carry Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed'). I0423 19:04:34.983399 135856350082880 nnx_decoders.py:616] nnx_decoders/carry Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0423 19:04:34.988906 135856350082880 nnx_decoders.py:616] Unknown Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed'). I0423 19:04:34.989001 135856350082880 nnx_decoders.py:616] Unknown Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0423 19:04:35.144797 135856350082880 attentions.py:1084] attentions/inputs_q Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed'). I0423 19:04:35.144933 135856350082880 attentions.py:1084] attentions/inputs_q Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0423 19:04:35.299021 135856350082880 attentions.py:1085] attentions/inputs_kv Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed'). I0423 19:04:35.299167 135856350082880 attentions.py:1085] attentions/inputs_kv Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0423 19:04:35.463897 135856350082880 attentions.py:1150] attentions/query Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim'). I0423 19:04:35.464032 135856350082880 attentions.py:1150] attentions/query Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0423 19:04:35.617974 135856350082880 attentions.py:1151] attentions/key Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim'). I0423 19:04:35.618118 135856350082880 attentions.py:1151] attentions/key Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0423 19:04:35.769998 135856350082880 attentions.py:1152] attentions/value Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim'). I0423 19:04:35.770185 135856350082880 attentions.py:1152] attentions/value Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0423 19:04:35.930687 135856350082880 attentions.py:1193] attentions/out Logical: bfloat16[32,2048,16,128].................................... ('activation_batch', 'activation_attn_length', 'activation_heads', 'activation_kv'). I0423 19:04:35.930825 135856350082880 attentions.py:1193] attentions/out Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0423 19:04:35.953727 135856350082880 linears.py:541] linears/x Logical: bfloat16[32,2048,7168]...................................... ('activation_batch', 'activation_length', 'activation_mlp'). I0423 19:04:35.953818 135856350082880 linears.py:541] linears/x Physical: bfloat16[32,2048,7168]...................................... ('fsdp', None, None). I0423 19:04:47.014523 135856350082880 max_utils.py:791] Total memory size: 1.5 GB, Output size: 0.4 GB, Temp size: 1.1 GB, Argument size: 0.4 GB, Host temp size: 0.0 GB. I0423 19:04:47.016882 135856350082880 metric_logger.py:289] number parameters: 1.104 billion I0423 19:04:58.025578 135856350082880 checkpointing.py:794] Waiting for step 10 to finish before checkpoint... I0423 19:04:58.179202 135856350082880 checkpointing.py:798] Waited 0.1536102294921875 seconds for step 10 to finish before starting checkpointing. I0423 19:04:58.181477 135856350082880 checkpoint_manager.py:2013] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning. I0423 19:04:58.183356 135856350082880 checkpoint_manager.py:1518] [process=6] Saving checkpoint at step 10 I0423 19:04:58.186070 135856350082880 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/10. I0423 19:04:58.787178 135856350082880 signaling_client.py:364] Using JaxDistributedSignalingClient I0423 19:04:58.788153 135856350082880 jax_array_handlers.py:358] Scheduling D2H of 69 prioritized jax.Array. I0423 19:04:58.788213 135856350082880 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0423 19:04:59.029444 135856350082880 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.243407s I0423 19:04:59.029618 135856350082880 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 3.052 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.5054008960723877 s) (per-host) I0423 19:04:59.029670 135856350082880 base_pytree_checkpoint_handler.py:737] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.505461s (batch_requests_ready=0.257308s, total_serialization_initiated=0.248085s, others=0.000068s) I0423 19:04:59.029761 135856350082880 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.509620s (all_items=0.000018s, per_item={'items': '0.00001812'}, temp_paths=0.509602) I0423 19:04:59.030635 135736183478016 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started. I0423 19:04:59.030802 135856350082880 async_checkpointer.py:561] Finished blocking save. Time taken: 0.847375s. Continuing background save to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/10. I0423 19:04:59.471703 135856350082880 checkpoint_manager.py:1566] [process=6][thread=MainThread][step=10] Starting CheckpointManager Save Finalize thread=save_finalize I0423 19:04:59.472091 135730919937792 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save. I0423 19:04:59.472275 135856350082880 standard_logger.py:34] {'step': 10, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776971098.1814592, 'wait_for_prev_duration_secs': 6.365776062011719e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776971098.1833937, 'checkpointer_blocking_duration_secs': 0.8475151062011719, 'get_old_steps_start_time': 1776971099.0309277, 'get_old_steps_duration_secs': 2.5987625122070312e-05, 'checkpoint_manager_blocking_start_time': 1776971098.1795933, 'checkpoint_manager_blocking_duration_secs': 1.2926414012908936} I0423 19:04:59.472404 135856350082880 checkpointing.py:409] Started an asynchronous checkpoint save for step 10 I0423 19:04:59.472481 135856350082880 max_utils.py:750] Memstats: After params initialized: I0423 19:04:59.472539 135856350082880 max_utils.py:756] Using (GB) 0.43 / 31.25 (1.376000%) on TPU_24(process=6,(0,6,0,0)) I0423 19:04:59.472571 135856350082880 max_utils.py:756] Using (GB) 0.43 / 31.25 (1.376000%) on TPU_25(process=6,(1,6,0,0)) I0423 19:04:59.472596 135856350082880 max_utils.py:756] Using (GB) 0.43 / 31.25 (1.376000%) on TPU_28(process=6,(0,7,0,0)) I0423 19:04:59.472620 135856350082880 max_utils.py:756] Using (GB) 0.43 / 31.25 (1.376000%) on TPU_29(process=6,(1,7,0,0)) I0423 19:04:59.743274 135856350082880 metric_logger.py:185] completed step: 10, seconds: 11.008, TFLOP/s/device: 1.234, Tokens/s/device: 186.052, total_weights: 65536, loss: 6.827 I0423 19:04:59.922927 135856350082880 metric_logger.py:185] completed step: 11, seconds: 1.716, TFLOP/s/device: 7.918, Tokens/s/device: 1193.553, total_weights: 65536, loss: 6.522 I0423 19:05:00.070808 135856350082880 metric_logger.py:185] completed step: 12, seconds: 0.024, TFLOP/s/device: 562.940, Tokens/s/device: 84852.502, total_weights: 65536, loss: 6.260 I0423 19:05:00.218227 135856350082880 metric_logger.py:185] completed step: 13, seconds: 0.164, TFLOP/s/device: 82.897, Tokens/s/device: 12495.119, total_weights: 65536, loss: 6.039 I0423 19:05:00.225749 135856350082880 checkpointing.py:794] Waiting for step 15 to finish before checkpoint... I0423 19:05:00.513127 135856350082880 checkpointing.py:798] Waited 0.2873241901397705 seconds for step 15 to finish before starting checkpointing. I0423 19:05:00.515805 135856350082880 checkpoint_manager.py:2024] [process=6][thread=MainThread][step=10][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. I0423 19:05:03.676539 135731997550336 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/10/items/array_metadatas/process_6 I0423 19:05:40.596443 135736183478016 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/gbytes_per_sec: 37.545 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 42.07217264175415 s) (per-host) I0423 19:05:40.596573 135736183478016 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 41.565815s. I0423 19:05:48.541818 135736183478016 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 49.511054s. I0423 19:05:48.542127 135730919937792 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save. I0423 19:05:48.542252 135730919937792 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save. I0423 19:05:48.542319 135730919937792 checkpoint_manager.py:2133] [process=6][thread=save_finalize][step=10] CheckpointManager Save Finalize is syncing with other hosts... I0423 19:05:48.543857 135730919937792 checkpoint_manager.py:2142] [process=6][thread=save_finalize][step=10] CheckpointManager Save Finalize is done on all hosts. I0423 19:05:48.544025 135856350082880 checkpoint_manager.py:2036] [process=6][thread=MainThread][step=10][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=10. W0423 19:05:48.544175 135856350082880 checkpoint_manager.py:1458] Waiting for previous save to complete took 48.028376 seconds. If this number is high, consider checkpointing less frequently. I0423 19:05:48.546210 135856350082880 checkpoint_manager.py:1518] [process=6] Saving checkpoint at step 15 I0423 19:05:48.550246 135856350082880 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/15. I0423 19:05:49.524022 135856350082880 jax_array_handlers.py:358] Scheduling D2H of 69 prioritized jax.Array. I0423 19:05:49.524141 135856350082880 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0423 19:05:49.565014 135856350082880 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.042973s I0423 19:05:49.565199 135856350082880 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 5.080 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.303652286529541 s) (per-host) I0423 19:05:49.565259 135856350082880 base_pytree_checkpoint_handler.py:737] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.303716s (batch_requests_ready=0.256115s, total_serialization_initiated=0.047529s, others=0.000072s) I0423 19:05:49.565361 135856350082880 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.308496s (all_items=0.000014s, per_item={'items': '0.00001359'}, temp_paths=0.308482) I0423 19:05:49.566234 135731997550336 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started. I0423 19:05:49.566366 135856350082880 async_checkpointer.py:561] Finished blocking save. Time taken: 1.020081s. Continuing background save to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/15. I0423 19:05:49.573666 135856350082880 checkpoint_manager.py:1566] [process=6][thread=MainThread][step=15] Starting CheckpointManager Save Finalize thread=save_finalize I0423 19:05:49.573887 135730919937792 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save. I0423 19:05:49.574018 135856350082880 standard_logger.py:34] {'step': 15, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776971100.5157752, 'wait_for_prev_duration_secs': 48.02837610244751, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776971148.5462515, 'checkpointer_blocking_duration_secs': 1.020296573638916, 'get_old_steps_start_time': 1776971149.5665689, 'get_old_steps_duration_secs': 2.47955322265625e-05, 'checkpoint_manager_blocking_start_time': 1776971100.5134943, 'checkpoint_manager_blocking_duration_secs': 49.060492515563965} I0423 19:05:49.574139 135856350082880 checkpointing.py:409] Started an asynchronous checkpoint save for step 15 I0423 19:05:49.574869 135856350082880 metric_logger.py:185] completed step: 14, seconds: 0.147, TFLOP/s/device: 92.397, Tokens/s/device: 13927.141, total_weights: 65536, loss: 5.858 I0423 19:05:49.582585 135856350082880 metric_logger.py:185] completed step: 15, seconds: 0.149, TFLOP/s/device: 91.259, Tokens/s/device: 13755.491, total_weights: 65536, loss: 5.713 I0423 19:06:08.817505 135856350082880 metric_logger.py:185] completed step: 16, seconds: 49.355, TFLOP/s/device: 0.275, Tokens/s/device: 41.495, total_weights: 65536, loss: 5.599 I0423 19:06:08.964855 135856350082880 metric_logger.py:185] completed step: 17, seconds: 19.090, TFLOP/s/device: 0.712, Tokens/s/device: 107.283, total_weights: 65536, loss: 5.511 I0423 19:06:09.112289 135856350082880 metric_logger.py:185] completed step: 18, seconds: 0.154, TFLOP/s/device: 88.476, Tokens/s/device: 13336.112, total_weights: 65536, loss: 5.442 I0423 19:06:09.260868 135856350082880 checkpointing.py:794] Waiting for step 19 to finish before checkpoint... I0423 19:06:09.261522 135856350082880 checkpointing.py:798] Waited 0.0006659030914306641 seconds for step 19 to finish before starting checkpointing. I0423 19:06:09.263474 135856350082880 checkpoint_manager.py:2024] [process=6][thread=MainThread][step=15][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. I0423 19:06:13.384612 135729841673984 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/15/items/array_metadatas/process_6 I0423 19:06:50.599579 135731997550336 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/gbytes_per_sec: 25.752 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 61.337979316711426 s) (per-host) I0423 19:06:50.599703 135731997550336 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 61.033258s. I0423 19:06:58.883352 135731997550336 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 69.316892s. I0423 19:06:58.883662 135730919937792 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save. I0423 19:06:58.883783 135730919937792 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save. I0423 19:06:58.883832 135730919937792 checkpoint_manager.py:2133] [process=6][thread=save_finalize][step=15] CheckpointManager Save Finalize is syncing with other hosts... I0423 19:06:58.885311 135730919937792 checkpoint_manager.py:2142] [process=6][thread=save_finalize][step=15] CheckpointManager Save Finalize is done on all hosts. I0423 19:06:58.885559 135856350082880 checkpoint_manager.py:2036] [process=6][thread=MainThread][step=15][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=15. W0423 19:06:58.885690 135856350082880 checkpoint_manager.py:1458] Waiting for previous save to complete took 49.622228 seconds. If this number is high, consider checkpointing less frequently. I0423 19:06:58.887148 135856350082880 checkpoint_manager.py:1518] [process=6] Saving checkpoint at step 19 I0423 19:06:58.890384 135856350082880 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/19. I0423 19:06:59.440909 135856350082880 jax_array_handlers.py:358] Scheduling D2H of 69 prioritized jax.Array. I0423 19:06:59.441011 135856350082880 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0423 19:06:59.485158 135856350082880 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.046250s I0423 19:06:59.485330 135856350082880 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 4.996 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.3087649345397949 s) (per-host) I0423 19:06:59.485381 135856350082880 base_pytree_checkpoint_handler.py:737] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.308825s (batch_requests_ready=0.257934s, total_serialization_initiated=0.050826s, others=0.000066s) I0423 19:06:59.485474 135856350082880 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.312760s (all_items=0.000012s, per_item={'items': '0.00001192'}, temp_paths=0.312748) I0423 19:06:59.486315 135736175085312 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started. I0423 19:06:59.486471 135856350082880 async_checkpointer.py:561] Finished blocking save. Time taken: 0.599251s. Continuing background save to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/19. I0423 19:06:59.516753 135856350082880 checkpoint_manager.py:1566] [process=6][thread=MainThread][step=19] Starting CheckpointManager Save Finalize thread=save_finalize I0423 19:06:59.517079 135730919937792 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save. I0423 19:06:59.517254 135856350082880 standard_logger.py:34] {'step': 19, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776971169.2634463, 'wait_for_prev_duration_secs': 49.622227907180786, 'time_between_consecutive_saves_sec': 20.71955156326294, 'checkpointer_blocking_start_time': 1776971218.8871872, 'checkpointer_blocking_duration_secs': 0.5994000434875488, 'get_old_steps_start_time': 1776971219.4866023, 'get_old_steps_duration_secs': 1.9311904907226562e-05, 'checkpoint_manager_blocking_start_time': 1776971169.2617595, 'checkpoint_manager_blocking_duration_secs': 50.25546431541443} I0423 19:06:59.517364 135856350082880 checkpointing.py:409] Started an asynchronous checkpoint save for step 19 I0423 19:06:59.517406 135856350082880 checkpoint_manager.py:2024] [process=6][thread=MainThread][step=19][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. I0423 19:07:04.745246 135736183478016 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 69 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_set_defaults_true_20260423_155242/nnx_xpk_feat_nnx_set_defaults_true_20260423_155242_16_zarr3_false/checkpoints/19/items/array_metadatas/process_6 I0423 19:07:41.719283 135736175085312 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/gbytes_per_sec: 37.130 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 42.54268026351929 s) (per-host) I0423 19:07:41.719396 135736175085312 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 42.232973s. I0423 19:07:49.110614 135736175085312 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 49.624170s. I0423 19:07:49.110911 135730919937792 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save. I0423 19:07:49.111024 135730919937792 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save. I0423 19:07:49.111068 135730919937792 checkpoint_manager.py:2133] [process=6][thread=save_finalize][step=19] CheckpointManager Save Finalize is syncing with other hosts... I0423 19:07:49.112581 135730919937792 checkpoint_manager.py:2142] [process=6][thread=save_finalize][step=19] CheckpointManager Save Finalize is done on all hosts. I0423 19:07:49.112751 135856350082880 checkpoint_manager.py:2036] [process=6][thread=MainThread][step=19][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=19. I0423 19:07:49.112921 135856350082880 checkpoint_manager.py:2013] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning. I0423 19:07:49.113591 135856350082880 metric_logger.py:185] completed step: 19, seconds: 0.147, TFLOP/s/device: 92.489, Tokens/s/device: 13940.982, total_weights: 65536, loss: 5.385 Per train step: Total TFLOPs: 13.59 split as 93.93% learnable weight flops and 6.07% attention flops XPK End: Thu Apr 23 19:08:02 UTC 2026 EXIT_CODE=0