XPK Start: Wed Apr 22 17:07:39 UTC 2026 2026-04-22 17:07:43.794114: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered WARNING: All log messages before absl::InitializeLog() is called are written to STDERR E0000 00:00:1776877663.807344 10 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered E0000 00:00:1776877663.811069 10 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered W0000 00:00:1776877663.822606 10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. W0000 00:00:1776877663.822624 10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. W0000 00:00:1776877663.822626 10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. W0000 00:00:1776877663.822628 10 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once. 2026-04-22 17:08:03.004980: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303) I0422 17:08:03.537950 134837345322816 max_utils.py:273] Attempting to initialize the jax distributed system... INFO:2026-04-22 17:08:12,580:jax._src.distributed:140: Starting JAX distributed service on [::]:8482 I0422 17:08:12.580386 134837345322816 distributed.py:140] Starting JAX distributed service on [::]:8482 INFO:2026-04-22 17:08:12,582:jax._src.distributed:157: Connecting to JAX distributed service on mt-15-ocdbt-false-rs-s9zh2-slice-job-0-0.mt-15-ocdbt-false-rs-s9zh2:8482 I0422 17:08:12.582698 134837345322816 distributed.py:157] Connecting to JAX distributed service on mt-15-ocdbt-false-rs-s9zh2-slice-job-0-0.mt-15-ocdbt-false-rs-s9zh2:8482 I0422 17:08:13.599870 134837345322816 max_utils.py:284] Jax distributed system initialized! I0422 17:08:19.307269 134837345322816 max_utils.py:800] System Information: Jax Version: 0.8.1 I0422 17:08:19.307376 134837345322816 max_utils.py:801] System Information: Jaxlib Version: 0.8.1 I0422 17:08:19.307417 134837345322816 max_utils.py:802] System Information: Jax Backend: PJRT C API TFRT TPU v6 lite Built on Nov 12 2025 14:16:36 (1762985796) cl/831091709 I0422 17:08:19.307452 134837345322816 train_utils.py:377] WARNING: Sequence packing is essentially ignored for synthetic data. Please use a real dataset to use sequence packing. I0422 17:08:19.930192 134837345322816 maxtext_utils.py:1631] Num_devices: 32, shape (1, 1, 1, 32, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0422 17:08:19.930790 134837345322816 maxtext_utils.py:1631] Num_devices: 32, shape (1, 1, 1, 32, 1, 1, 1, 1, 1, 1, 1, 1, 1) I0422 17:08:19.930982 134837345322816 checkpointing.py:688] Setting up checkpoint logger... I0422 17:08:19.931032 134837345322816 checkpointing.py:234] Creating checkpoint manager with ocdbt=False and zarr3=True I0422 17:08:19.931076 134837345322816 pytree_checkpoint_handler.py:589] save_device_host_concurrent_bytes=None I0422 17:08:19.931444 134837345322816 base_pytree_checkpoint_handler.py:415] Created BasePyTreeCheckpointHandler: use_ocdbt=False, use_zarr3=True, pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=<orbax.checkpoint._src.metadata.array_metadata_store.Store object at 0x7aa1b63755b0>, enable_pinned_host_transfer=False, save_concurrent_bytes: 96000000000 (89.4 GiB), restore_concurrent_bytes: 96000000000 (89.4 GiB) I0422 17:08:21.237255 134837345322816 checkpointing.py:266] Enabling policy for fixed interval checkpointing. I0422 17:08:21.237495 134837345322816 checkpoint_manager.py:709] [process=6][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=('items',), item_handlers={'items': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a8f68ad17c0>}, handler_registry=None I0422 17:08:21.237733 134837345322816 composite_checkpoint_handler.py:237] Deferred registration for item: "items". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a8f68ad17c0>` for item "items" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. I0422 17:08:21.237782 134837345322816 composite_checkpoint_handler.py:237] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a8f68ad61b0>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. I0422 17:08:21.237819 134837345322816 composite_checkpoint_handler.py:505] Initialized registry DefaultCheckpointHandlerRegistry({('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a8f68ad17c0>, ('items', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x7a8f68ad17c0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a8f68ad61b0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x7a8f68ad61b0>}). I0422 17:08:21.238340 134837345322816 abstract_checkpointer.py:35] orbax-checkpoint version: 0.11.33 I0422 17:08:21.238420 134837345322816 async_checkpointer.py:177] [process=6][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>._fn at 0x7a8f47538180> timeout: 600 secs and primary_host=0 for async checkpoint writes I0422 17:08:23.512652 134718334945024 checkpoint.py:228] Read Metadata={'item_handlers': {'items': 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler'}, 'metrics': {}, 'performance_metrics': {}, 'init_timestamp_nsecs': 1776877515350191643, 'commit_timestamp_nsecs': 1776877529083709951, 'custom_metadata': {}} from gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/0/_CHECKPOINT_METADATA I0422 17:08:23.756558 134718334945024 checkpoint_manager.py:1877] Missing metrics for step 0 I0422 17:08:23.805739 134837345322816 checkpoint_manager.py:1818] Found 3 checkpoint steps in gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints I0422 17:08:24.211255 134837345322816 checkpoint_manager.py:929] [process=6][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=None, keep_time_interval=None, keep_period=None, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=True, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, todelete_full_path=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=None, multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=FixedIntervalPolicy(interval=5), preservation_policy=LatestN(n=None), prevent_write_metrics=False, enable_should_save_is_saving_in_progress_check=True, enable_per_process_directory_creation=False, lightweight_initialize=False), root_directory=gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x7a8f68ad6c00> I0422 17:08:24.211435 134837345322816 checkpointing.py:302] Checkpoint manager created! I0422 17:08:25.110423 134837345322816 nnx_wrappers.py:455] Unknown Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_norm_length', 'activation_embed'). I0422 17:08:25.110536 134837345322816 nnx_wrappers.py:455] Unknown Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0422 17:08:25.480924 134837345322816 attentions.py:1084] attentions/inputs_q Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed'). I0422 17:08:25.481023 134837345322816 attentions.py:1084] attentions/inputs_q Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0422 17:08:25.496819 134837345322816 attentions.py:1085] attentions/inputs_kv Logical: bfloat16[32,2048,2048]...................................... ('activation_batch', 'activation_attn_length', 'activation_attn_embed'). I0422 17:08:25.496878 134837345322816 attentions.py:1085] attentions/inputs_kv Physical: bfloat16[32,2048,2048]...................................... ('fsdp', None, None). I0422 17:08:25.519520 134837345322816 attentions.py:1150] attentions/query Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim'). I0422 17:08:25.519590 134837345322816 attentions.py:1150] attentions/query Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0422 17:08:25.535494 134837345322816 attentions.py:1151] attentions/key Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim'). I0422 17:08:25.535558 134837345322816 attentions.py:1151] attentions/key Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0422 17:08:25.551251 134837345322816 attentions.py:1152] attentions/value Logical: bfloat16[32,2048,16,128].................................... ('activation_kv_batch', 'activation_attn_length', 'activation_kv_heads', 'activation_kv_head_dim'). I0422 17:08:25.551311 134837345322816 attentions.py:1152] attentions/value Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0422 17:08:25.575836 134837345322816 attentions.py:1193] attentions/out Logical: bfloat16[32,2048,16,128].................................... ('activation_batch', 'activation_attn_length', 'activation_heads', 'activation_kv'). I0422 17:08:25.575908 134837345322816 attentions.py:1193] attentions/out Physical: bfloat16[32,2048,16,128].................................... ('fsdp', None, None, None). I0422 17:08:25.601146 134837345322816 linears.py:541] linears/x Logical: bfloat16[32,2048,7168]...................................... ('activation_batch', 'activation_length', 'activation_mlp'). I0422 17:08:25.601248 134837345322816 linears.py:541] linears/x Physical: bfloat16[32,2048,7168]...................................... ('fsdp', None, None). I0422 17:08:25.797341 134837345322816 checkpointing.py:578] checkpoint manager exists so trying to load this run's existing checkpoint I0422 17:08:25.797452 134837345322816 checkpointing.py:582] restoring from this run's directory step 9 I0422 17:08:26.301240 134837345322816 checkpointer.py:304] Restoring checkpoint from gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/9. I0422 17:08:29.273479 1427 google_auth_provider.cc:181] Running on GCE, using service account 562977990677-compute@developer.gserviceaccount.com I0422 17:08:37.668361 134837345322816 jax_array_handlers.py:837] [process=6] /jax/orbax/read/worker/io/requested throughput: 163.921 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 9.640438079833984 s) (per-host) I0422 17:08:37.675615 134837345322816 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/read/gbytes_per_sec: 157.739 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 10.033003568649292 s) (per-host) I0422 17:08:39.958580 134837345322816 checkpointer.py:318] Finished restoring checkpoint in 13.91 seconds from gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/9. I0422 17:08:39.958740 134837345322816 standard_logger.py:34] {'step': 9, 'event_type': 'restore', 'directory': 'gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints', 'checkpointer_start_time': 1776877706.0465221, 'checkpointer_duration_secs': 13.912187814712524, 'checkpoint_manager_start_time': 1776877705.7976992, 'checkpoint_manager_duration_secs': 14.161012649536133} fsdp: 32 I0422 17:08:40.361580 134837345322816 maxtext_utils.py:1740] params/params/decoder/decoder_norm/scale Shape: float32[2048] Logical: PartitionSpec('norm',) Physical: (None,) I0422 17:08:40.361712 134837345322816 maxtext_utils.py:1740] params/params/decoder/layers/mlp/wi_0/kernel Shape: float32[2048,16,7168] Logical: PartitionSpec('embed', 'layers', 'mlp') Physical: ('fsdp', None, None) I0422 17:08:40.361768 134837345322816 maxtext_utils.py:1740] params/params/decoder/layers/mlp/wi_1/kernel Shape: float32[2048,16,7168] Logical: PartitionSpec('embed', 'layers', 'mlp') Physical: ('fsdp', None, None) I0422 17:08:40.361830 134837345322816 maxtext_utils.py:1740] params/params/decoder/layers/mlp/wo/kernel Shape: float32[7168,16,2048] Logical: PartitionSpec('mlp', 'layers', 'embed') Physical: (None, None, 'fsdp') I0422 17:08:40.361885 134837345322816 maxtext_utils.py:1740] params/params/decoder/layers/post_self_attention_layer_norm/scale Shape: float32[2048,16] Logical: PartitionSpec('norm', 'layers') Physical: (None, None) I0422 17:08:40.361926 134837345322816 maxtext_utils.py:1740] params/params/decoder/layers/pre_self_attention_layer_norm/scale Shape: float32[2048,16] Logical: PartitionSpec('norm', 'layers') Physical: (None, None) I0422 17:08:40.361980 134837345322816 maxtext_utils.py:1740] params/params/decoder/layers/self_attention/key/kernel Shape: float32[2048,16,16,128] Logical: PartitionSpec('embed', 'layers', 'kv_heads', 'kv_head_dim') Physical: ('fsdp', None, None, None) I0422 17:08:40.362032 134837345322816 maxtext_utils.py:1740] params/params/decoder/layers/self_attention/out/kernel Shape: float32[16,16,128,2048] Logical: PartitionSpec('heads', 'layers', 'kv', 'embed') Physical: (None, None, None, 'fsdp') I0422 17:08:40.362074 134837345322816 maxtext_utils.py:1740] params/params/decoder/layers/self_attention/query/kernel Shape: float32[2048,16,16,128] Logical: PartitionSpec('embed', 'layers', 'q_heads', 'kv') Physical: ('fsdp', None, None, None) I0422 17:08:40.362126 134837345322816 maxtext_utils.py:1740] params/params/decoder/layers/self_attention/value/kernel Shape: float32[2048,16,16,128] Logical: PartitionSpec('embed', 'layers', 'kv_heads', 'kv_head_dim') Physical: ('fsdp', None, None, None) I0422 17:08:40.362178 134837345322816 maxtext_utils.py:1740] params/params/decoder/logits_dense/kernel Shape: float32[2048,32000] Logical: PartitionSpec('embed', 'vocab') Physical: ('fsdp', None) I0422 17:08:40.362226 134837345322816 maxtext_utils.py:1740] params/params/token_embedder/embedding Shape: float32[32000,2048] Logical: PartitionSpec('vocab', 'embed') Physical: (None, 'fsdp') I0422 17:08:40.856237 134837345322816 train.py:158] train/xent Logical: float32[32,2048]............................................ ('activation_embed_and_logits_batch', 'activation_length'). I0422 17:08:40.856337 134837345322816 train.py:158] train/xent Physical: float32[32,2048]............................................ ('fsdp', None). I0422 17:08:40.871309 134837345322816 train.py:165] train/z_loss Logical: float32[32,2048]............................................ ('activation_embed_and_logits_batch', 'activation_length'). I0422 17:08:40.871370 134837345322816 train.py:165] train/z_loss Physical: float32[32,2048]............................................ ('fsdp', None). I0422 17:08:51.760745 134837345322816 max_utils.py:791] Total memory size: 1.5 GB, Output size: 0.4 GB, Temp size: 1.1 GB, Argument size: 0.4 GB, Host temp size: 0.0 GB. I0422 17:08:51.761596 134837345322816 metric_logger.py:289] number parameters: 1.104 billion I0422 17:09:03.150597 134837345322816 checkpointing.py:794] Waiting for step 10 to finish before checkpoint... I0422 17:09:03.308497 134837345322816 checkpointing.py:798] Waited 0.15788555145263672 seconds for step 10 to finish before starting checkpointing. I0422 17:09:03.311216 134837345322816 checkpoint_manager.py:2013] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning. I0422 17:09:03.313145 134837345322816 checkpoint_manager.py:1518] [process=6] Saving checkpoint at step 10 I0422 17:09:03.315889 134837345322816 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/10. I0422 17:09:03.954466 134837345322816 signaling_client.py:364] Using JaxDistributedSignalingClient I0422 17:09:03.955465 134837345322816 jax_array_handlers.py:358] Scheduling D2H of 39 prioritized jax.Array. I0422 17:09:03.955523 134837345322816 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0422 17:09:04.211194 134837345322816 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.256815s I0422 17:09:04.211370 134837345322816 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 2.787 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.5534152984619141 s) (per-host) I0422 17:09:04.211428 134837345322816 base_pytree_checkpoint_handler.py:737] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.553483s (batch_requests_ready=0.294574s, total_serialization_initiated=0.258833s, others=0.000076s) I0422 17:09:04.211515 134837345322816 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.557881s (all_items=0.000018s, per_item={'items': '0.00001788'}, temp_paths=0.557863) I0422 17:09:04.212404 134716237784832 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started. I0422 17:09:04.212584 134837345322816 async_checkpointer.py:561] Finished blocking save. Time taken: 0.899362s. Continuing background save to gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/10. I0422 17:09:04.214488 134837345322816 checkpoint_manager.py:1566] [process=6][thread=MainThread][step=10] Starting CheckpointManager Save Finalize thread=save_finalize I0422 17:09:04.214728 134713595393792 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save. I0422 17:09:04.214842 134837345322816 standard_logger.py:34] {'step': 10, 'event_type': 'save', 'directory': 'gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776877743.311198, 'wait_for_prev_duration_secs': 6.890296936035156e-05, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776877743.3131824, 'checkpointer_blocking_duration_secs': 0.8995122909545898, 'get_old_steps_start_time': 1776877744.2127135, 'get_old_steps_duration_secs': 2.6702880859375e-05, 'checkpoint_manager_blocking_start_time': 1776877743.3090413, 'checkpoint_manager_blocking_duration_secs': 0.9057652950286865} I0422 17:09:04.214949 134837345322816 checkpointing.py:409] Started an asynchronous checkpoint save for step 10 I0422 17:09:04.215000 134837345322816 max_utils.py:750] Memstats: After params initialized: I0422 17:09:04.215050 134837345322816 max_utils.py:756] Using (GB) 0.43 / 31.25 (1.376000%) on TPU_24(process=6,(0,6,0,0)) I0422 17:09:04.215085 134837345322816 max_utils.py:756] Using (GB) 0.43 / 31.25 (1.376000%) on TPU_25(process=6,(1,6,0,0)) I0422 17:09:04.215131 134837345322816 max_utils.py:756] Using (GB) 0.43 / 31.25 (1.376000%) on TPU_28(process=6,(0,7,0,0)) I0422 17:09:04.215160 134837345322816 max_utils.py:756] Using (GB) 0.43 / 31.25 (1.376000%) on TPU_29(process=6,(1,7,0,0)) I0422 17:09:04.481401 134837345322816 metric_logger.py:185] completed step: 10, seconds: 11.389, TFLOP/s/device: 1.193, Tokens/s/device: 179.824, total_weights: 65536, loss: 6.833 I0422 17:09:04.657545 134837345322816 metric_logger.py:185] completed step: 11, seconds: 1.329, TFLOP/s/device: 10.220, Tokens/s/device: 1540.474, total_weights: 65536, loss: 6.528 I0422 17:09:04.805066 134837345322816 metric_logger.py:185] completed step: 12, seconds: 0.029, TFLOP/s/device: 465.743, Tokens/s/device: 70201.899, total_weights: 65536, loss: 6.265 I0422 17:09:04.952531 134837345322816 metric_logger.py:185] completed step: 13, seconds: 0.153, TFLOP/s/device: 88.793, Tokens/s/device: 13383.784, total_weights: 65536, loss: 6.045 I0422 17:09:04.957276 134837345322816 checkpointing.py:794] Waiting for step 15 to finish before checkpoint... I0422 17:09:05.247176 134837345322816 checkpointing.py:798] Waited 0.28986573219299316 seconds for step 15 to finish before starting checkpointing. I0422 17:09:05.250820 134837345322816 checkpoint_manager.py:2024] [process=6][thread=MainThread][step=10][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. I0422 17:09:08.677929 134714115479296 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 39 array_metadata.ArrayMetadata to gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/10/items/array_metadatas/process_6 I0422 17:09:14.548980 134716237784832 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/gbytes_per_sec: 145.037 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 10.89098834991455 s) (per-host) I0422 17:09:14.549139 134716237784832 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 10.336617s. I0422 17:09:20.484399 134716237784832 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 16.271860s. I0422 17:09:20.484718 134713595393792 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save. I0422 17:09:20.484848 134713595393792 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save. I0422 17:09:20.484895 134713595393792 checkpoint_manager.py:2133] [process=6][thread=save_finalize][step=10] CheckpointManager Save Finalize is syncing with other hosts... I0422 17:09:20.486278 134713595393792 checkpoint_manager.py:2142] [process=6][thread=save_finalize][step=10] CheckpointManager Save Finalize is done on all hosts. I0422 17:09:20.486450 134837345322816 checkpoint_manager.py:2036] [process=6][thread=MainThread][step=10][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=10. W0422 17:09:20.486589 134837345322816 checkpoint_manager.py:1458] Waiting for previous save to complete took 15.235768 seconds. If this number is high, consider checkpointing less frequently. I0422 17:09:20.488380 134837345322816 checkpoint_manager.py:1518] [process=6] Saving checkpoint at step 15 I0422 17:09:20.492407 134837345322816 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/15. I0422 17:09:21.059790 134837345322816 jax_array_handlers.py:358] Scheduling D2H of 39 prioritized jax.Array. I0422 17:09:21.059890 134837345322816 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0422 17:09:21.106796 134837345322816 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.048272s I0422 17:09:21.106961 134837345322816 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 4.756 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.32434582710266113 s) (per-host) I0422 17:09:21.107013 134837345322816 base_pytree_checkpoint_handler.py:737] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.324406s (batch_requests_ready=0.274194s, total_serialization_initiated=0.050145s, others=0.000067s) I0422 17:09:21.107119 134837345322816 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.329309s (all_items=0.000012s, per_item={'items': '0.00001168'}, temp_paths=0.329297) I0422 17:09:21.109620 134713595393792 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started. I0422 17:09:21.109733 134837345322816 async_checkpointer.py:561] Finished blocking save. Time taken: 0.621282s. Continuing background save to gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/15. I0422 17:09:21.528935 134837345322816 checkpoint_manager.py:1566] [process=6][thread=MainThread][step=15] Starting CheckpointManager Save Finalize thread=save_finalize I0422 17:09:21.529337 134716229392128 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save. I0422 17:09:21.529508 134837345322816 standard_logger.py:34] {'step': 15, 'event_type': 'save', 'directory': 'gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776877745.2507906, 'wait_for_prev_duration_secs': 15.23576807975769, 'time_between_consecutive_saves_sec': None, 'checkpointer_blocking_start_time': 1776877760.488418, 'checkpointer_blocking_duration_secs': 0.6214289665222168, 'get_old_steps_start_time': 1776877761.1098664, 'get_old_steps_duration_secs': 2.5987625122070312e-05, 'checkpoint_manager_blocking_start_time': 1776877745.2478538, 'checkpoint_manager_blocking_duration_secs': 16.281620740890503} I0422 17:09:21.529629 134837345322816 checkpointing.py:409] Started an asynchronous checkpoint save for step 15 I0422 17:09:21.530234 134837345322816 metric_logger.py:185] completed step: 14, seconds: 0.147, TFLOP/s/device: 92.129, Tokens/s/device: 13886.629, total_weights: 65536, loss: 5.864 I0422 17:09:21.535934 134837345322816 metric_logger.py:185] completed step: 15, seconds: 0.147, TFLOP/s/device: 92.157, Tokens/s/device: 13890.867, total_weights: 65536, loss: 5.719 I0422 17:09:39.130065 134837345322816 metric_logger.py:185] completed step: 16, seconds: 16.578, TFLOP/s/device: 0.820, Tokens/s/device: 123.541, total_weights: 65536, loss: 5.605 I0422 17:09:39.277518 134837345322816 metric_logger.py:185] completed step: 17, seconds: 17.448, TFLOP/s/device: 0.779, Tokens/s/device: 117.378, total_weights: 65536, loss: 5.517 I0422 17:09:39.424981 134837345322816 metric_logger.py:185] completed step: 18, seconds: 0.152, TFLOP/s/device: 89.486, Tokens/s/device: 13488.326, total_weights: 65536, loss: 5.448 I0422 17:09:39.572431 134837345322816 checkpointing.py:794] Waiting for step 19 to finish before checkpoint... I0422 17:09:39.572993 134837345322816 checkpointing.py:798] Waited 0.000583648681640625 seconds for step 19 to finish before starting checkpointing. I0422 17:09:39.575222 134837345322816 checkpoint_manager.py:2024] [process=6][thread=MainThread][step=15][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. I0422 17:09:42.839631 134714115479296 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 39 array_metadata.ArrayMetadata to gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/15/items/array_metadatas/process_6 I0422 17:09:48.643439 134713595393792 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/gbytes_per_sec: 56.696 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 27.86079168319702 s) (per-host) I0422 17:09:48.643544 134713595393792 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 27.533860s. I0422 17:09:53.846757 134713595393792 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 32.737058s. I0422 17:09:53.846987 134716229392128 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save. I0422 17:09:53.847043 134716229392128 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save. I0422 17:09:53.847087 134716229392128 checkpoint_manager.py:2133] [process=6][thread=save_finalize][step=15] CheckpointManager Save Finalize is syncing with other hosts... I0422 17:09:53.848625 134716229392128 checkpoint_manager.py:2142] [process=6][thread=save_finalize][step=15] CheckpointManager Save Finalize is done on all hosts. I0422 17:09:53.848814 134837345322816 checkpoint_manager.py:2036] [process=6][thread=MainThread][step=15][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=15. W0422 17:09:53.848937 134837345322816 checkpoint_manager.py:1458] Waiting for previous save to complete took 14.273730 seconds. If this number is high, consider checkpointing less frequently. I0422 17:09:53.850452 134837345322816 checkpoint_manager.py:1518] [process=6] Saving checkpoint at step 19 I0422 17:09:53.853940 134837345322816 async_checkpointer.py:452] [process=6] Started async saving checkpoint to gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/19. I0422 17:09:54.813975 134837345322816 jax_array_handlers.py:358] Scheduling D2H of 39 prioritized jax.Array. I0422 17:09:54.814073 134837345322816 replica_slices.py:424] Transferring arrays to host memory with options: use_replica_parallel=True, min_slice_bytes_for_replica_parallel=None, max_replicas_for_replica_parallel=None, enable_pinned_host_transfer=False I0422 17:09:54.859962 134837345322816 base_pytree_checkpoint_handler.py:153] [process=6][thread=MainThread] Initiated "orbax.checkpoint._src.serialization.jax_array_handlers.ArrayHandler".serialize. Time taken: 0.046864s I0422 17:09:54.860139 134837345322816 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/blocking_gbytes_per_sec: 5.129 GiB/s (total gbytes: 1.5 GiB) (time elapsed: 0.3007526397705078 s) (per-host) I0422 17:09:54.860191 134837345322816 base_pytree_checkpoint_handler.py:737] [process=6][thread=MainThread] Initiated Pytree async_save. Time taken: 0.300815s (batch_requests_ready=0.251785s, total_serialization_initiated=0.048962s, others=0.000068s) I0422 17:09:54.860279 134837345322816 composite_checkpoint_handler.py:715] [process=6][thread=MainThread] Initiated CompositeCheckpointHandler.async_save. Time taken: 0.304823s (all_items=0.000010s, per_item={'items': '0.00001025'}, temp_paths=0.304813) I0422 17:09:54.861076 134711942817536 async_checkpointer.py:79] [process=6][thread=async_save] Background save thread started. I0422 17:09:54.861256 134837345322816 async_checkpointer.py:561] Finished blocking save. Time taken: 1.010731s. Continuing background save to gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/19. I0422 17:09:54.887249 134837345322816 checkpoint_manager.py:1566] [process=6][thread=MainThread][step=19] Starting CheckpointManager Save Finalize thread=save_finalize I0422 17:09:54.887544 134716229392128 async_checkpointer.py:265] [process=6][thread=save_finalize] Waiting for background save thread=async_save. I0422 17:09:54.887693 134837345322816 standard_logger.py:34] {'step': 19, 'event_type': 'save', 'directory': 'gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints', 'reached_preemption': False, 'preemption_received_at': None, 'synchronous': False, 'wait_for_prev_start_time': 1776877779.5751922, 'wait_for_prev_duration_secs': 14.2737295627594, 'time_between_consecutive_saves_sec': 19.088873386383057, 'checkpointer_blocking_start_time': 1776877793.8504906, 'checkpointer_blocking_duration_secs': 1.0108754634857178, 'get_old_steps_start_time': 1776877794.8613782, 'get_old_steps_duration_secs': 2.193450927734375e-05, 'checkpoint_manager_blocking_start_time': 1776877779.5732462, 'checkpoint_manager_blocking_duration_secs': 15.314415693283081} I0422 17:09:54.887797 134837345322816 checkpointing.py:409] Started an asynchronous checkpoint save for step 19 I0422 17:09:54.887841 134837345322816 checkpoint_manager.py:2024] [process=6][thread=MainThread][step=19][wait_until_finished] Waiting for Save Finalize thread (save_finalize) to complete. I0422 17:09:59.525480 134714115479296 array_metadata_store.py:203] [process=6][thread=array_type_handler] Wrote 39 array_metadata.ArrayMetadata to gs://lance-maxtext/linen_ckpt_xpk_feat_nnx_set_defaults_true_20260422_154647/linen_xpk_feat_nnx_set_defaults_true_20260422_154647_15_ocdbt_false/checkpoints/19/items/array_metadatas/process_6 I0422 17:10:04.865564 134711942817536 base_pytree_checkpoint_handler.py:129] [process=6] /jax/checkpoint/write/gbytes_per_sec: 153.268 MiB/s (total gbytes: 1.5 GiB) (time elapsed: 10.306128740310669 s) (per-host) I0422 17:10:04.865685 134711942817536 async_checkpointer.py:90] [process=6][thread=async_save] 3 Handler Commit operations completed. Time taken: 10.004482s. I0422 17:10:09.966732 134711942817536 async_checkpointer.py:144] [process=6][thread=async_save] Background save thread done. Time taken: 15.105514s. I0422 17:10:09.966968 134716229392128 async_checkpointer.py:273] [process=6][thread=save_finalize] Done with waiting for background save thread=async_save. I0422 17:10:09.967032 134716229392128 async_checkpointer.py:283] [process=6][thread=save_finalize] No errors found in background save thread=async_save. I0422 17:10:09.967071 134716229392128 checkpoint_manager.py:2133] [process=6][thread=save_finalize][step=19] CheckpointManager Save Finalize is syncing with other hosts... I0422 17:10:09.968594 134716229392128 checkpoint_manager.py:2142] [process=6][thread=save_finalize][step=19] CheckpointManager Save Finalize is done on all hosts. I0422 17:10:09.968759 134837345322816 checkpoint_manager.py:2036] [process=6][thread=MainThread][step=19][wait_until_finished] Done waiting for Save Finalize thread (save_finalize) running at step=19. I0422 17:10:09.968902 134837345322816 checkpoint_manager.py:2013] [process=6][thread=MainThread][wait_until_finished] No Save Finalize thread to wait for. Returning. I0422 17:10:09.969556 134837345322816 metric_logger.py:185] completed step: 19, seconds: 0.147, TFLOP/s/device: 92.117, Tokens/s/device: 13884.934, total_weights: 65536, loss: 5.391 Per train step: Total TFLOPs: 13.59 split as 93.93% learnable weight flops and 6.07% attention flops XPK End: Wed Apr 22 17:10:19 UTC 2026 EXIT_CODE=0