XPK Start: Mon Apr 20 14:01:37 UTC 2026
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2026-04-20 14:02:01.251006: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
I0420 14:02:01.434056 139964065179456 max_utils.py:273] Attempting to initialize the jax distributed system...
I0420 14:02:10.475604 139964065179456 distributed.py:149] Starting JAX distributed service on [::]:8482
I0420 14:02:10.477990 139964065179456 distributed.py:172] Connecting to JAX distributed service on mt-05-fp8-5hhwj-slice-job-0-0.mt-05-fp8-5hhwj:8482
I0420 14:02:11.928024 139964065179456 max_utils.py:284] Jax distributed system initialized!
I0420 14:02:18.189408 139964065179456 max_utils.py:800] System Information: Jax Version: 0.9.2
I0420 14:02:18.189508 139964065179456 max_utils.py:801] System Information: Jaxlib Version: 0.9.2
I0420 14:02:18.189547 139964065179456 max_utils.py:802] System Information: Jax Backend: PJRT C API TFRT TPU v6 lite Built on Mar 4 2026 11:32:08 (1772652728) cl/878335365
I0420 14:02:18.189580 139964065179456 train_utils.py:378] WARNING: Sequence packing is essentially ignored for synthetic data. Please use a real dataset to use sequence packing.
I0420 14:02:19.232235 139964065179456 maxtext_utils.py:1718] Num_devices: 32, shape (1, 1, 1, 32, 1, 1, 1, 1, 1, 1, 1, 1, 1)
I0420 14:02:19.666052 139964065179456 checkpointing.py:688] Setting up checkpoint logger...
I0420 14:02:19.666179 139964065179456 checkpointing.py:234] Creating checkpoint manager with ocdbt=True and zarr3=True I0420 14:02:19.666224 139964065179456 pytree_checkpoint_handler.py:592] save_device_host_concurrent_bytes=None I0420 14:02:19.666433 139964065179456 base_pytree_checkpoint_handler.py:441] Created BasePyTreeCheckpointHandler: use_ocdbt=True, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7f4b657be6c0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0420 14:02:22.538281 139964065179456 checkpointing.py:266] Enabling policy for fixed interval checkpointing. I0420 14:02:22.538469 139964065179456 checkpoint_manager.py:708] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=('items',), item_handlers={'items': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f360c65ef60>}, handler_registry=None I0420 14:02:22.538707 139964065179456 composite_checkpoint_handler.py:237] Deferred registration for item: "items". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f360c65ef60>` for item "items" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. I0420 14:02:22.538754 139964065179456 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". 
Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f36e4416180>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. I0420 14:02:22.538790 139964065179456 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f360c65ef60>, ('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7f360c65ef60>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f36e4416180>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7f36e4416180>}). 
I0420 14:02:22.539135 139964065179456 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.34 I0420 14:02:22.539207 139964065179456 async_checkpointer.py:192] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7f36e470f4c0> timeout: 1200 secs and primary_host=0 for async checkpoint writes I0420 14:02:24.055034 139964065179456 checkpoint_manager.py:1812] Found 0 checkpoint steps in gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints I0420 14:02:24.492157 139964065179456 checkpoint_manager.py:929] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=FixedIntervalPolicy(interval=10), preservation_policy=LatestN(n=None), prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), 
root_directory=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7f37142090a0> I0420 14:02:24.492331 139964065179456 checkpointing.py:302] Checkpoint manager created! I0420 14:02:25.748666 139964065179456 checkpointing.py:578] checkpoint manager exists so trying to load this run's existing checkpoint I0420 14:02:25.748784 139964065179456 checkpointing.py:676] No existing checkpoints found, not restoring checkpoint. fsdp: 32 I0420 14:02:27.814679 139964065179456 maxtext_utils.py:1836] decoder/decoder_norm/scale/value Shape: float32[2048] Physical: (None,) I0420 14:02:27.814792 139964065179456 maxtext_utils.py:1836] decoder/layers/mlp/wi_0/kernel/value Shape: float32[2048,16,7168] Physical: ('fsdp', None, None) I0420 14:02:27.814842 139964065179456 maxtext_utils.py:1836] decoder/layers/mlp/wi_1/kernel/value Shape: float32[2048,16,7168] Physical: ('fsdp', None, None) I0420 14:02:27.814891 139964065179456 maxtext_utils.py:1836] decoder/layers/mlp/wo/kernel/value Shape: float32[7168,16,2048] Physical: (None, None, 'fsdp') I0420 14:02:27.814930 139964065179456 maxtext_utils.py:1836] decoder/layers/post_self_attention_layer_norm/scale/value Shape: float32[2048,16] Physical: (None, None) I0420 14:02:27.814965 139964065179456 maxtext_utils.py:1836] decoder/layers/pre_self_attention_layer_norm/scale/value Shape: float32[2048,16] Physical: (None, None) I0420 14:02:27.815002 139964065179456 maxtext_utils.py:1836] decoder/layers/self_attention/key/kernel/value Shape: float32[2048,16,16,128] Physical: ('fsdp', None, None, None) I0420 14:02:27.815036 139964065179456 maxtext_utils.py:1836] decoder/layers/self_attention/out/kernel/value Shape: float32[16,16,128,2048] Physical: (None, None, None, 'fsdp') I0420 14:02:27.815068 139964065179456 maxtext_utils.py:1836] 
decoder/layers/self_attention/query/kernel/value Shape: float32[2048,16,16,128] Physical: ('fsdp', None, None, None) I0420 14:02:27.815114 139964065179456 maxtext_utils.py:1836] decoder/layers/self_attention/value/kernel/value Shape: float32[2048,16,16,128] Physical: ('fsdp', None, None, None) I0420 14:02:27.815149 139964065179456 maxtext_utils.py:1836] decoder/logits_dense/kernel/value Shape: float32[2048,32000] Physical: ('fsdp', None) I0420 14:02:27.815179 139964065179456 maxtext_utils.py:1836] token_embedder/embedding/value Shape: float32[32000,2048] Physical: (None, 'fsdp') I0420 14:02:28.230260 139964065179456 nnx_decoders.py:465] nnx_decoders/carry Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed'). I0420 14:02:28.230354 139964065179456 nnx_decoders.py:465] nnx_decoders/carry Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0420 14:02:28.235920 139964065179456 nnx_decoders.py:465] Unknown Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed'). I0420 14:02:28.235978 139964065179456 nnx_decoders.py:465] Unknown Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0420 14:02:28.252326 139964065179456 attentions.py:1088] attentions/inputs_q Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed'). I0420 14:02:28.252384 139964065179456 attentions.py:1088] attentions/inputs_q Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0420 14:02:28.268123 139964065179456 attentions.py:1089] attentions/inputs_kv Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed'). 
I0420 14:02:28.268181 139964065179456 attentions.py:1089] attentions/inputs_kv Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0420 14:02:28.329220 139964065179456 attentions.py:1154] attentions/query Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim'). I0420 14:02:28.329307 139964065179456 attentions.py:1154] attentions/query Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0420 14:02:28.345108 139964065179456 attentions.py:1155] attentions/key Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim'). I0420 14:02:28.345167 139964065179456 attentions.py:1155] attentions/key Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0420 14:02:28.360882 139964065179456 attentions.py:1156] attentions/value Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim'). I0420 14:02:28.360940 139964065179456 attentions.py:1156] attentions/value Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0420 14:02:28.391326 139964065179456 attentions.py:1197] attentions/out Logical: bfloat16[32,2048,16,128].................................... ('activation_batch', 'activation_attn_length', 'activation_heads', 'activation_kv'). I0420 14:02:28.391399 139964065179456 attentions.py:1197] attentions/out Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0420 14:02:28.452562 139964065179456 linears.py:525] linears/x Logical: bfloat16[32,2048,7168]...................................... 
('activation_batch', 'activation_length', 'activation_mlp'). I0420 14:02:28.452640 139964065179456 linears.py:525] linears/x Physical: bfloat16[32,2048,7168]...................................... ('fsdp', None, None). I0420 14:02:42.351416 139964065179456 max_utils.py:791] Total memory size: 1.5 GB, Output size: 0.4 GB, Temp size: 1.1 GB, Argument size: 0.4 GB, Host temp size: 0.0 GB. I0420 14:02:42.362466 139964065179456 metric_logger.py:301] number parameters: 1.104 billion I0420 14:02:56.408733 139964065179456 checkpointing.py:794] Waiting for step 0 to finish before checkpoint... I0420 14:02:56.571110 139964065179456 checkpointing.py:798] Waited 0.16234612464904785 seconds for step 0 to finish before starting checkpointing. I0420 14:02:56.573878 139964065179456 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning. I0420 14:02:56.575989 139964065179456 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 0 I0420 14:02:56.577405 139964065179456 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0. I0420 14:02:57.321866 139964065179456 signaling_client.py:364] Using JaxDistributedSignalingClient I0420 14:02:57.352531 139964065179456 jax_array_handlers.py:360] Scheduling D2H of 153 prioritized jax.Array. 
I0420 14:02:57.352631 139964065179456 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 14:02:57.746535 139833147520768 atomicity.py:140] Creating tmp directory gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0 I0420 14:02:57.748683 139964065179456 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.398605s I0420 14:02:57.749533 139964065179456 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 3.679 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.41936540603637695 s) (per-host) I0420 14:02:57.749596 139964065179456 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.419447s (batch_requests_ready=0.005757s, total_serialization_initiated=0.412930s, others=0.000760s) I0420 14:02:57.749682 139964065179456 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.423426s (all_items=0.000018s, per_item={'items': '0.00001812'}, temp_paths=0.423408) I0420 14:02:57.750500 139964065179456 event_tracking.py:125] [process=0] [async] Finished blocking save in 1.17 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0. I0420 14:02:57.750774 139833196410624 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. 
Deadline for this save operation is 2026-04-20 14:22:57.750747 I0420 14:02:57.770353 139964065179456 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=0] Starting CheckpointManager Save Finalize thread=save_finalize I0420 14:02:57.770632 139832625968896 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save. I0420 14:02:57.770800 139964065179456 standard_logger.py:34] {'step': 0, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776693776.5738583, 'wait_for_prev_duration_secs': 6.246566772460938e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776693776.5760279, 'checkpointer_blocking_duration_secs': 1.1748647689819336, 'get_old_steps_start_time': 1776693777.7509243, 'get_old_steps_duration_secs': 3.695487976074219e-05, 'checkpoint_manager_blocking_start_time': 1776693776.5715928, 'checkpoint_manager_blocking_duration_secs': 1.1991596221923828} I0420 14:02:57.770938 139964065179456 checkpointing.py:409] Started an asynchronous checkpoint save for step 0 I0420 14:02:57.771041 139964065179456 max_utils.py:750] Memstats: After params initialized: I0420 14:02:57.771129 139964065179456 max_utils.py:756] Using (GB) 0.45 / 31.25 (1.440000%) on TPU_0(process=0,(0,0,0,0)) I0420 14:02:57.771175 139964065179456 max_utils.py:756] Using (GB) 0.45 / 31.25 (1.440000%) on TPU_1(process=0,(1,0,0,0)) I0420 14:02:57.771213 139964065179456 max_utils.py:756] Using (GB) 0.45 / 31.25 (1.440000%) on TPU_4(process=0,(0,1,0,0)) I0420 14:02:57.771245 139964065179456 max_utils.py:756] Using (GB) 0.45 / 31.25 (1.440000%) on TPU_5(process=0,(1,1,0,0)) I0420 14:02:58.124622 139964065179456 metric_logger.py:196] completed 
step: 0, seconds: 13.604, TFLOP/s/device: 0.999, Tokens/s/device: 150.543, total_weights: 65536, loss: 10.872, lm_loss: 10.872, perplexity: 52680.742 I0420 14:02:58.128280 139964065179456 metric_logger.py:281] To see full metrics 'tensorboard --logdir=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/tensorboard/' I0420 14:02:58.604866 139964065179456 metric_logger.py:196] completed step: 1, seconds: 1.710, TFLOP/s/device: 7.947, Tokens/s/device: 1197.838, total_weights: 65536, loss: 10.872, lm_loss: 10.872, perplexity: 52680.742 I0420 14:02:58.750865 139964065179456 metric_logger.py:196] completed step: 2, seconds: 0.482, TFLOP/s/device: 28.199, Tokens/s/device: 4250.479, total_weights: 65536, loss: 10.856, lm_loss: 10.856, perplexity: 51864.762 I0420 14:02:58.908476 139964065179456 metric_logger.py:196] completed step: 3, seconds: 0.018, TFLOP/s/device: 774.858, Tokens/s/device: 116794.981, total_weights: 65536, loss: 10.824, lm_loss: 10.824, perplexity: 50225.512 I0420 14:02:59.635020 139832659539712 atomicity.py:140] Creating tmp directory gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0/items I0420 14:02:59.805788 139832116557568 checkpoint.py:188] Wrote Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776693779334465532, 'commit_timestamp_nsecs': None, 'custom_metadata': {}}, json={"item_handlers": null, "metrics": {}, "performance_metrics": {}, "init_timestamp_nsecs": 1776693779334465532, "commit_timestamp_nsecs": null, "custom_metadata": {}} to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0/_CHECKPOINT_METADATA I0420 
14:03:01.253201 2852 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com I0420 14:03:03.083230 139832642754304 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 153 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0/items/array_metadatas/process_0 I0420 14:03:21.792752 139832634361600 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 20.551848s (commit=18.813279s, array_metadata_write=1.738569s) I0420 14:03:21.794176 139833196410624 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 64.575 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 24.46397376060486 s) (per-host) I0420 14:03:21.794308 139833196410624 async_checkpointer.py:90] [process=0][thread=async_save] 3 Handler Commit operations completed. Time taken: 24.043454s. 
I0420 14:03:22.908364 139833196410624 checkpoint.py:228] Read Metadata={'item_handlers': None, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776693779334465532, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} from gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0/_CHECKPOINT_METADATA
I0420 14:03:23.579654 139964065179456 metric_logger.py:196] completed step: 4, seconds: 0.147, TFLOP/s/device: 92.601, Tokens/s/device: 13957.799, total_weights: 65536, loss: 10.793, lm_loss: 10.793, perplexity: 48689.156
I0420 14:03:23.592822 139964065179456 metric_logger.py:196] completed step: 5, seconds: 0.158, TFLOP/s/device: 86.043, Tokens/s/device: 12969.331, total_weights: 65536, loss: 10.763, lm_loss: 10.763, perplexity: 47229.012
I0420 14:03:23.744588 139964065179456 metric_logger.py:196] completed step: 6, seconds: 24.669, TFLOP/s/device: 0.551, Tokens/s/device: 83.020, total_weights: 65536, loss: 10.733, lm_loss: 10.733, perplexity: 45864.258
I0420 14:03:23.836689 139833196410624 array_metadata_store.py:411] [process=0][thread=async_save] Validated ArrayMetadata from all 8 hosts. Time taken: 0.000990s.
I0420 14:03:23.838351 139832116557568 checkpoint.py:247] Updated Metadata={'item_handlers': {'items': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776693779334465532, 'commit_timestamp_nsecs': None, 'custom_metadata': {}} to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0/_CHECKPOINT_METADATA I0420 14:03:23.901501 139964065179456 metric_logger.py:196] completed step: 7, seconds: 0.011, TFLOP/s/device: 1252.155, Tokens/s/device: 188738.365, total_weights: 65536, loss: 10.706, lm_loss: 10.706, perplexity: 44620.242 I0420 14:03:24.058454 139964065179456 metric_logger.py:196] completed step: 8, seconds: 0.153, TFLOP/s/device: 88.581, Tokens/s/device: 13351.849, total_weights: 65536, loss: 10.680, lm_loss: 10.680, perplexity: 43477.055 I0420 14:03:24.069046 139964065179456 checkpointing.py:794] Waiting for step 10 to finish before checkpoint... I0420 14:03:24.373065 139964065179456 checkpointing.py:798] Waited 0.3039894104003906 seconds for step 10 to finish before starting checkpointing. I0420 14:03:24.375853 139964065179456 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=0][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. I0420 14:03:24.415832 139833196410624 ocdbt_utils.py:49] Param validation support for Zarr3 will be added later (b/362328389). I0420 14:03:25.384994 139833196410624 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 2.352871s. 
use_zarr3=True, enable_post_merge_validation=True, directory=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0/items I0420 14:03:25.386730 139833196410624 atomicity.py:666] Finalizing gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0/items I0420 14:03:26.347493 139833196410624 atomicity.py:666] Finalizing gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0 I0420 14:03:27.905006 139833196410624 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0`. I0420 14:03:27.905767 139833196410624 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 31.33 seconds @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/0 I0420 14:03:27.907296 139833196410624 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 30.156441s. I0420 14:03:27.907485 139832625968896 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save. I0420 14:03:27.907605 139832625968896 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save. 
I0420 14:03:27.907673 139832625968896 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=0] CheckpointManager Save Finalize is syncing with other hosts... I0420 14:03:27.910468 139832625968896 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=0] CheckpointManager Save Finalize is done on all hosts. I0420 14:03:27.910583 139964065179456 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=0][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=0. W0420 14:03:27.910664 139964065179456 checkpoint_manager.py:1452] Waiting for previous save to complete took 3.534814 seconds. If this number is high, consider checkpointing less frequently. I0420 14:03:27.912584 139964065179456 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 10 I0420 14:03:27.914739 139964065179456 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10. I0420 14:03:28.253246 139964065179456 jax_array_handlers.py:360] Scheduling D2H of 153 prioritized jax.Array. I0420 14:03:28.253347 139964065179456 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0420 14:03:28.292059 139964065179456 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. 
Time taken: 0.041215s I0420 14:03:28.292907 139964065179456 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 26.318 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.05861926078796387 s) (per-host) I0420 14:03:28.292979 139964065179456 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.058709s (batch_requests_ready=0.005378s, total_serialization_initiated=0.052583s, others=0.000748s) I0420 14:03:28.293070 139964065179456 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.062898s (all_items=0.000016s, per_item={'items': '0.00001574'}, temp_paths=0.062882) I0420 14:03:28.293787 139964065179456 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.38 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10. I0420 14:03:28.294090 139832625968896 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-20 14:23:28.294048 I0420 14:03:28.481216 139833196410624 atomicity.py:140] Creating tmp directory gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10 I0420 14:03:28.745937 139964065179456 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=10] Starting CheckpointManager Save Finalize thread=save_finalize I0420 14:03:28.746304 139832074594048 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save. 
I0420 14:03:28.746488 139964065179456 standard_logger.py:34] {'step': 10, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776693804.3758209, 'wait_for_prev_duration_secs': 3.5348143577575684, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776693807.9126234, 'checkpointer_blocking_duration_secs': 0.3816068172454834, 'get_old_steps_start_time': 1776693808.2942526, 'get_old_steps_duration_secs': 3.24249267578125e-05, 'checkpoint_manager_blocking_start_time': 1776693804.3735178, 'checkpoint_manager_blocking_duration_secs': 4.372931480407715}
I0420 14:03:28.746633 139964065179456 checkpointing.py:409] Started an asynchronous checkpoint save for step 10
I0420 14:03:28.747515 139964065179456 metric_logger.py:196] completed step: 9, seconds: 0.157, TFLOP/s/device: 86.589, Tokens/s/device: 13051.569, total_weights: 65536, loss: 10.656, lm_loss: 10.656, perplexity: 42467.320
I0420 14:03:28.761905 139964065179456 metric_logger.py:196] completed step: 10, seconds: 0.157, TFLOP/s/device: 86.769, Tokens/s/device: 13078.741, total_weights: 65536, loss: 10.636, lm_loss: 10.636, perplexity: 41601.941
I0420 14:03:28.911574 139964065179456 metric_logger.py:196] completed step: 11, seconds: 4.690, TFLOP/s/device: 2.897, Tokens/s/device: 436.640, total_weights: 65536, loss: 10.618, lm_loss: 10.618, perplexity: 40849.570
I0420 14:03:29.068411 139964065179456 metric_logger.py:196] completed step: 12, seconds: 0.012, TFLOP/s/device: 1163.580, Tokens/s/device: 175387.514, total_weights: 65536, loss: 10.602, lm_loss: 10.602, perplexity: 40203.926
I0420 14:03:29.225285 139964065179456 metric_logger.py:196] completed step: 13, seconds: 0.151, TFLOP/s/device: 89.965, Tokens/s/device: 13560.489, total_weights: 65536, loss: 10.588, lm_loss: 10.588, perplexity: 39652.145
I0420 14:03:29.741595 139964065179456 metric_logger.py:196] completed step: 14, seconds: 0.157, TFLOP/s/device: 86.803, Tokens/s/device: 13083.921, total_weights: 65536, loss: 10.577, lm_loss: 10.577, perplexity: 39212.879
I0420 14:03:29.893865 139964065179456 metric_logger.py:196] completed step: 15, seconds: 0.662, TFLOP/s/device: 20.528, Tokens/s/device: 3094.221, total_weights: 65536, loss: 10.568, lm_loss: 10.568, perplexity: 38879.223
I0420 14:03:30.050596 139964065179456 metric_logger.py:196] completed step: 16, seconds: 0.010, TFLOP/s/device: 1344.861, Tokens/s/device: 202712.066, total_weights: 65536, loss: 10.562, lm_loss: 10.562, perplexity: 38619.105
I0420 14:03:30.207287 139964065179456 metric_logger.py:196] completed step: 17, seconds: 0.154, TFLOP/s/device: 88.478, Tokens/s/device: 13336.372, total_weights: 65536, loss: 10.556, lm_loss: 10.556, perplexity: 38395.816
I0420 14:03:30.364294 139964065179456 metric_logger.py:196] completed step: 18, seconds: 0.157, TFLOP/s/device: 86.574, Tokens/s/device: 13049.324, total_weights: 65536, loss: 10.552, lm_loss: 10.552, perplexity: 38236.652
I0420 14:03:30.523972 139964065179456 checkpointing.py:794] Waiting for step 19 to finish before checkpoint...
I0420 14:03:30.525608 139964065179456 checkpointing.py:798] Waited 0.0016434192657470703 seconds for step 19 to finish before starting checkpointing.
I0420 14:03:30.527606 139964065179456 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=10][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0420 14:03:30.777917 139828901443328 atomicity.py:140] Creating tmp directory gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10/items
I0420 14:03:34.236716 139832091379456 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 153 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10/items/array_metadatas/process_0
I0420 14:03:52.852701 139832082986752 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 20.869170s (commit=19.211290s, array_metadata_write=1.657881s)
I0420 14:03:52.854016 139832625968896 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 64.167 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 24.61968684196472 s) (per-host)
I0420 14:03:52.854161 139832625968896 async_checkpointer.py:90] [process=0][thread=async_save] 3 Handler Commit operations completed. Time taken: 24.559954s.
I0420 14:03:54.552562 139832625968896 array_metadata_store.py:411] [process=0][thread=async_save] Validated ArrayMetadata from all 8 hosts. Time taken: 0.001003s.
I0420 14:03:54.926602 139832625968896 ocdbt_utils.py:49] Param validation support for Zarr3 will be added later (b/362328389).
I0420 14:03:56.031366 139832625968896 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 2.071139s. use_zarr3=True, enable_post_merge_validation=True, directory=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10/items
I0420 14:03:56.033099 139832625968896 atomicity.py:666] Finalizing gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10/items
I0420 14:03:56.561558 139832625968896 atomicity.py:666] Finalizing gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10
I0420 14:03:57.988236 139832625968896 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10`.
I0420 14:03:57.989114 139832625968896 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 30.08 seconds @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/10
I0420 14:03:57.990679 139832625968896 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 29.696470s.
I0420 14:03:57.990880 139832074594048 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0420 14:03:57.991009 139832074594048 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0420 14:03:57.991073 139832074594048 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=10] CheckpointManager Save Finalize is syncing with other hosts...
I0420 14:03:57.993508 139832074594048 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=10] CheckpointManager Save Finalize is done on all hosts.
I0420 14:03:57.993680 139964065179456 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=10][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=10.
W0420 14:03:57.993764 139964065179456 checkpoint_manager.py:1452] Waiting for previous save to complete took 27.466167 seconds. If this number is high, consider checkpointing less frequently.
I0420 14:03:57.995343 139964065179456 checkpoint_manager.py:1512] [process=0] Saving checkpoint at step 19
I0420 14:03:57.997458 139964065179456 event_tracking.py:70] [process=0] [async] Started save checkpoint @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19.
I0420 14:03:58.366484 139964065179456 jax_array_handlers.py:360] Scheduling D2H of 153 prioritized jax.Array.
I0420 14:03:58.366647 139964065179456 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False
I0420 14:03:58.413035 139964065179456 base_pytree_checkpoint_handler.py:154] [process=0][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.048898s
I0420 14:03:58.413818 139964065179456 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/blocking_gbytes_per_sec: 23.308 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.06618833541870117 s) (per-host)
I0420 14:03:58.413877 139964065179456 base_pytree_checkpoint_handler.py:768] [process=0][thread=MainThread] Initiated Pytree async_save. Time taken: 0.066263s (batch_requests_ready=0.005347s, total_serialization_initiated=0.060238s, others=0.000679s)
I0420 14:03:58.413961 139964065179456 composite_checkpoint_handler.py:715] [process=0][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.070261s (all_items=0.000011s, per_item={'items': '0.00001121'}, temp_paths=0.070250)
I0420 14:03:58.414700 139964065179456 event_tracking.py:125] [process=0] [async] Finished blocking save in 0.42 seconds. Continuing save @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19.
I0420 14:03:58.414989 139832074594048 async_checkpointer.py:76] [process=0][thread=async_save] Background save thread started. Deadline for this save operation is 2026-04-20 14:23:58.414956
I0420 14:03:58.417010 139964065179456 checkpoint_manager.py:1560] [process=0][thread=MainThread][step=19] Starting CheckpointManager Save Finalize thread=save_finalize
I0420 14:03:58.417262 139831971665664 async_checkpointer.py:280] [process=0][thread=save_finalize] Waiting for background save thread=async_save.
I0420 14:03:58.417423 139964065179456 standard_logger.py:34] {'step': 19, 'event_type': 'save', 'directory': 'gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776693810.5275745, 'wait_for_prev_duration_secs': 27.466167449951172, 'time_between_consecutive_saves_sec': 2.617067575454712, 'checkpointer_blocking_start_time': 1776693837.9953814, 'checkpointer_blocking_duration_secs': 0.41975951194763184, 'get_old_steps_start_time': 1776693838.4151623, 'get_old_steps_duration_secs': 2.86102294921875e-05, 'checkpoint_manager_blocking_start_time': 1776693810.525958, 'checkpoint_manager_blocking_duration_secs': 27.891432285308838}
I0420 14:03:58.417564 139964065179456 checkpointing.py:409] Started an asynchronous checkpoint save for step 19
I0420 14:03:58.417614 139964065179456 checkpoint_manager.py:2020] [process=0][thread=MainThread][step=19][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete.
I0420 14:03:58.594633 139832625968896 atomicity.py:140] Creating tmp directory gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19
I0420 14:04:00.423653 139839630468864 atomicity.py:140] Creating tmp directory gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19/items
I0420 14:04:03.968017 139832091379456 array_metadata_store.py:203] [process=0][thread=array_type_handler] Wrote 153 array_metadata.ArrayMetadata to gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19/items/array_metadatas/process_0
I0420 14:04:22.334448 139832082986752 base_pytree_checkpoint_handler.py:1282] [process=0][thread=write_metadata_after_commits] Commit + Array metadata written. Time taken: 20.660774s (commit=18.944826s, array_metadata_write=1.715948s)
I0420 14:04:22.335714 139832074594048 base_pytree_checkpoint_handler.py:130] [process=0] /jax/orbax/write/gbytes_per_sec: 65.857 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 23.988043546676636 s) (per-host)
I0420 14:04:22.335831 139832074594048 async_checkpointer.py:90] [process=0][thread=async_save] 3 Handler Commit operations completed. Time taken: 23.920717s.
I0420 14:04:23.992747 139832074594048 ocdbt_utils.py:49] Param validation support for Zarr3 will be added later (b/362328389).
I0420 14:04:24.011210 139832074594048 array_metadata_store.py:411] [process=0][thread=async_save] Validated ArrayMetadata from all 8 hosts. Time taken: 0.000814s.
I0420 14:04:25.097712 139832074594048 base_pytree_checkpoint_handler.py:1406] [process=0][thread=async_save] Pytree save finalize (merge_ocdbt + ArrayMetadata validation) completed. Time taken: 1.947426s. use_zarr3=True, enable_post_merge_validation=True, directory=gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19/items
I0420 14:04:25.099436 139832074594048 atomicity.py:666] Finalizing gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19/items
I0420 14:04:25.647530 139832074594048 atomicity.py:666] Finalizing gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19
I0420 14:04:27.107902 139832074594048 atomicity.py:847] [process=0][thread=async_save] Finished saving checkpoint (finalized tmp dir) to `gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19`.
I0420 14:04:27.108630 139832074594048 event_tracking.py:138] [process=0] [async] Finished save (blocking + background) in 29.11 seconds @ gs://lance-maxtext/nnx_ckpt_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158/nnx_xpk_feat_nnx_linen_converter_and_sharding_tools_20260420_112158_05_fp8/checkpoints/19
I0420 14:04:27.110114 139832074594048 async_checkpointer.py:160] [process=0][thread=async_save] Background save thread done. Time taken: 28.694998s.
I0420 14:04:27.110359 139831971665664 async_checkpointer.py:288] [process=0][thread=save_finalize] Done with waiting for background save thread=async_save.
I0420 14:04:27.110472 139831971665664 async_checkpointer.py:298] [process=0][thread=save_finalize] No errors found in background save thread=async_save.
I0420 14:04:27.110523 139831971665664 checkpoint_manager.py:2137] [process=0][thread=save_finalize][step=19] CheckpointManager Save Finalize is syncing with other hosts...
I0420 14:04:27.112106 139831971665664 checkpoint_manager.py:2146] [process=0][thread=save_finalize][step=19] CheckpointManager Save Finalize is done on all hosts.
I0420 14:04:27.112263 139964065179456 checkpoint_manager.py:2032] [process=0][thread=MainThread][step=19][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=19.
I0420 14:04:27.112396 139964065179456 checkpoint_manager.py:2009] [process=0][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning.
I0420 14:04:27.113339 139964065179456 metric_logger.py:196] completed step: 19, seconds: 0.157, TFLOP/s/device: 86.733, Tokens/s/device: 13073.397, total_weights: 65536, loss: 10.549, lm_loss: 10.549, perplexity: 38125.074
Per train step:
 Total TFLOPs: 13.59
 split as 93.93% learnable weight flops and 6.07% attention flops
XPK End: Mon Apr 20 14:04:37 UTC 2026
EXIT_CODE=0