|
Refactor _report_ready_if_needed() to work for both Savable PPS
and Runnable PPS:
* rename _report_ready_if_needed() to _report_ready_for_pps()
* return the interface name from the DHCP lease to support _poll_imds()
behavior without changing it.
* fix an issue where the report-ready return value was silently
ignored for Savable PPS.
* raise sources.InvalidMetaDataException explicitly when a DHCP lease
cannot be obtained (see the sketch below).
Refactor _poll_imds():
* use _report_ready_for_pps() for reporting ready, removing that logic
from the loop to simplify it.
* move netlink and vnetswitch handling out of the while loop, leaving
only reprovision polling in the loop.
* handle failure to obtain a DHCP lease explicitly and retry in the
next iteration.
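As a rough illustration of the _report_ready_for_pps() refactor, a minimal
sketch follows; the callable parameters and the simplified exception class
are stand-ins for cloud-init internals, not the actual implementation:
```
# Hedged sketch of the refactored helper; the callables passed in stand in
# for cloud-init internals.
class InvalidMetaDataException(Exception):
    """Stand-in for cloudinit.sources.InvalidMetaDataException."""


def report_ready_for_pps(obtain_dhcp_lease, report_ready) -> str:
    """Report ready for PPS and return the interface name from the lease."""
    lease = obtain_dhcp_lease()
    if lease is None:
        # Failure to obtain a DHCP lease now surfaces as an explicit
        # exception instead of being handled ad hoc by callers.
        raise InvalidMetaDataException("failed to obtain DHCP lease")
    if not report_ready(lease=lease):
        # Previously the return value was silently ignored for Savable PPS.
        raise InvalidMetaDataException("failed to report ready")
    return lease["interface"]
```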
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
Consolidate _should_reprovision_after_nic_attach() with
_should_reprovision() into the following:
_write_reprovision_marker() to write the provisioning marker for the
reboot-during-provisioning case.
PPSType enum and _determine_pps_type() for determining which
provisioning mode, if any, we're running under.
PPSType.UNKNOWN is used when the reprovisioning marker is found but we
do not have the context to know what the original mode was. In this
scenario, we must resort to polling for reprovision data.
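A hedged sketch of what such an enum and dispatcher could look like; the
member values and the inputs consulted here are assumptions for
illustration only:
```
from enum import Enum
from typing import Optional


class PPSType(Enum):
    NONE = "None"
    RUNNING = "Running"
    SAVABLE = "Savable"
    UNKNOWN = "Unknown"


def determine_pps_type(imds_pps_type: Optional[str], marker_exists: bool) -> PPSType:
    """Pick the PPS mode from IMDS metadata, falling back to the marker."""
    if imds_pps_type == "Running":
        return PPSType.RUNNING
    if imds_pps_type == "Savable":
        return PPSType.SAVABLE
    if marker_exists:
        # Marker present but no context about the original mode: we must
        # poll for reprovision data.
        return PPSType.UNKNOWN
    return PPSType.NONE
```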
Tests:
Introduce a simple data source fixture for fine-grained
control of mocking with pytest, without unittest.
Migrate relevant _should_reprovision() tests into a combination of
TestDeterminePPSTypeScenarios cases.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
According to the documentation in the tests:
```
We expect 3 calls to report_failure_to_fabric,
because we try 3 different methods of calling report failure.
The different methods are attempted in the following order:
1. Using cached ephemeral dhcp context to report failure to Azure
2. Using new ephemeral dhcp to report failure to Azure
3. Using fallback lease to report failure to Azure
```
Cases 1 and 2 make sense. If networking is established, use it.
Should failure occur using the current network configuration, retry
with fresh DHCP.
Case 3 suggests that we can fall back to a lease file and retry.
Given that:
1. The wireserver address has never changed to date.
2. The wireserver address should be in the DHCP lease.
3. Parsing the lease file does not improve connectivity over the
prior attempts.
...we can safely remove this case without regression.
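With the third case removed, the reporting logic reduces to two attempts.
A hedged sketch (helper names are placeholders, not the real cloud-init API):
```
def report_failure(report_to_fabric, cached_dhcp_ctx, new_dhcp_ctx):
    """Report failure over cached ephemeral DHCP first, then fresh DHCP."""
    # 1. Reuse the cached ephemeral DHCP context if networking is already up.
    if cached_dhcp_ctx is not None:
        try:
            report_to_fabric(cached_dhcp_ctx)
            return True
        except Exception:
            pass
    # 2. Otherwise (or on failure), bring up a fresh ephemeral DHCP context.
    try:
        with new_dhcp_ctx() as dhcp_ctx:
            report_to_fabric(dhcp_ctx)
            return True
    except Exception:
        # No third attempt: parsing a fallback lease file would not improve
        # connectivity over the two attempts above.
        return False
```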
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
Avoid the need for getattr() and ensure _ephemeral_dhcp_ctx isn't
persisted in the cache.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
load_azure_ds_dir() always returns a tuple. Instead of saving this
tuple as ret, expand it immediately as md, userdata_raw, cfg, files.
This allows for more fine-grained passing of data before getting
expanded later.
- Update _should_reprovision methods to use cfg instead of tuple.
- Update _should_reprovision methods to remove the ovf_md guard.
This should be a safe refactor as the OVF is not required, and the
config is initialized to an empty dict. In practice, a mount failure
would have initialized ret anyway if the OVF was not found. If a
mount failure wasn't seen and ret was None, this guard could be
causing other failures by ignoring the PPS state that should be
available from IMDS metadata.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
The if-statement set ovf_is_accessible to True if the OVF is read
from /dev/sr0, but not from other data sources. It defaults to
True, but may get flipped to False while processing an invalid
source, and never get set back to True when reading from the data
directory.
Instead, default ovf_is_accessible to False, and only set it to
True once we've read an OVF successfully (and end the search).
This fixes an error when OVF is read from data_dir and IMDS
data is unavailable (failing with "No OVF or IMDS available").
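The fix amounts to inverting the flag's default. A simplified sketch of
the search, with an illustrative source list and reader callable:
```
def find_ovf(candidate_sources, read_ovf):
    """Return the first readable OVF, tracking whether one was accessible."""
    ovf_is_accessible = False  # default to False instead of True
    for source in candidate_sources:
        try:
            ovf = read_ovf(source)
        except (OSError, ValueError):
            # A bad source no longer flips the flag for good; later sources
            # such as the data directory can still succeed.
            continue
        ovf_is_accessible = True  # set only once an OVF was actually read
        return ovf, ovf_is_accessible
    return None, ovf_is_accessible
```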
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
- Update EphemeralDHCPv4WithReporting to subclass EphemeralDHCPv4 for
consistency (non-functional change).
- Replace all usage of EphemeralDHCPv4 with EphemeralDHCPv4WithReporting.
- Converging to one DHCP class exposed an issue with ExitStack patches
being mixed with decorators. Specifically, it appeared that tests
that did not enable azure.EphemeralDHCPv4WithReporting mocks had it
applied anyway from previous tests.
Presumably ExitStack was overwriting the actual value with the
mock provided by the decorator? For now, remove some mock patches
that trigger failures, but future work should move towards a
consistent approach to prevent undetected effects.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
Format tweak to match naming conventions for classes & enums.
No functional changes.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
If get_imds_data_with_api_fallback() falls back to the minimum required
API version, it is effectively pinned to the old API version forever.
Remove the failed_desired_api_version property to prevent persistence of
the flag between calls and/or reboots.
The continued presence of this flag in obj.pkl should be harmless.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
Applied Black and isort, fixed any linting issues, updated tox.ini
and CI.
|
|
Thanks to [1], the hostname is set prior to network bring-up.
The Azure data source has been bouncing the hostname during
setup(), occurring after the hostname has already been
properly configured.
Note that this doesn't prevent leaking the image's hostname
during Azure's _get_data() when it brings up ephemeral DHCP.
However, as we are not guaranteed to have the hostname metadata
available from a truly "local" source, this behavior is to
be expected unless we disable `send host-name` from dhclient
config.
[1]: https://github.com/canonical/cloud-init/commit/133ad2cb327ad17b7b81319fac8f9f14577c04df
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
Some references were missed in the removal of the agent command
in PR #799. This simply removes the remaining references.
Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
|
|
During reprovisioning, the VM network will change. The fallback nic
should be cleared after use so that it can be re-evaluated after
reprovisioning.
|
|
Without UDF support, DS Azure cannot mount the provisioning ISO,
which contains platform metadata necessary to support
pre-provisioning. The required metadata is made available in IMDS
starting with api version 2021-08-01. This change leverages IMDS
to obtain the required metadata to support pre-provisioning if the
provisioning ISO is not available.
|
|
When self.failed_desired_api_version was added to DataSourceAzure, the
attribute was never added to the _unpickle method using the upgrade
framework. This commit adds the attribute.
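For reference, the upgrade framework backfills missing attributes in
_unpickle(); a minimal sketch with stand-in classes (the chosen default
value is an assumption):
```
class DataSource:
    def _unpickle(self, ci_pkl_version: int) -> None:
        """Base hook called after an instance is restored from obj.pkl."""


class DataSourceAzure(DataSource):
    def _unpickle(self, ci_pkl_version: int) -> None:
        super()._unpickle(ci_pkl_version)
        # Backfill attributes that pickles written before the attribute
        # existed will not contain.
        if not hasattr(self, "failed_desired_api_version"):
            self.failed_desired_api_version = False
```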
LP: #1946644
|
|
In #1006, we set Azure to apply networking config on every
BOOT_NEW_INSTANCE because the BOOT_LEGACY option was causing problems
applying networking the second time per boot. However,
BOOT_NEW_INSTANCE is also wrong, as Azure needs to apply networking
once per boot, during the init-local phase.
|
|
In #834, we refactored the handling of events for fetching new metadata.
Previously, in Azure's __init__, the BOOT event was added to the
update_events, so it was assumed that Azure required the standard BOOT
behavior, which is to apply metadata twice every boot: once during
local-init, then again during standard init phase.
https://github.com/canonical/cloud-init/blob/21.2/cloudinit/sources/DataSourceAzure.py#L356
However, this line was effectively meaningless. After the metadata was
fetched in local-init, it was then pickled out to disk. Because
"update_events" was a class variable, the EventType.BOOT was not
persisted into the pickle. When the pickle was then unpickled in the
init phase, metadata did not get re-fetched because EventType.BOOT was
not present, so Azure is effectively only BOOT_NEW_INSTANCE.
Fetching metadata twice during boot causes issues for
pre-provisioning on Azure because updating metadata during
re-provisioning will cause cloud-init to poll for reprovisiondata again
in DataSourceAzure, which will infinitely return 404 (reprovisiondata
is deleted from IMDS after the health signal is sent by cloud-init during
init-local). This leaves cloud-init stuck in the 'init' stage.
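The underlying Python behavior is easy to demonstrate: a mutation of a
class attribute is not captured in the instance's pickled state, so the
added event disappears once the pickle is restored. A minimal,
self-contained illustration (not cloud-init code):
```
import pickle


class DataSource:
    # Class attribute shared by all instances (like update_events).
    update_events = {"BOOT_NEW_INSTANCE"}

    def __init__(self):
        # Mutates the class-level set; no instance attribute is created.
        self.update_events.add("BOOT")


ds = DataSource()
assert "BOOT" in ds.update_events

blob = pickle.dumps(ds)
# Simulate the init stage running in a fresh process, where the class is
# re-imported with its original attribute value.
DataSource.update_events = {"BOOT_NEW_INSTANCE"}
restored = pickle.loads(blob)

# Only the (empty) instance __dict__ was pickled, so BOOT is gone.
assert "BOOT" not in restored.update_events
```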
|
|
In the nic attach path, we skip doing dhcp since we already did it
when bringing the interface up. However, when polling for
reprovisiondata, it is possible for the request to time out due to
platform issues. In those cases we still need to do dhcp and try again
since we tear down the context. We can only skip the first dhcp
attempt.
|
|
before rebinding again (#990)
Add a 10 second polling loop in wait_for_link_up after performing
an unbind and re-bind of primary NIC in hv_netvsc driver.
Also reduce cloud-init logging levels to debug for these operations.
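A hedged sketch of such a polling loop; the sysfs path and timing values
are assumptions, not the actual implementation:
```
import time
from pathlib import Path


def wait_for_link_up(interface: str, timeout: float = 10.0, step: float = 1.0) -> bool:
    """Poll sysfs until the link reports 'up' or the timeout expires."""
    deadline = time.monotonic() + timeout
    operstate = Path("/sys/class/net/{}/operstate".format(interface))
    while time.monotonic() < deadline:
        try:
            if operstate.read_text().strip() == "up":
                return True
        except OSError:
            # The interface may briefly disappear while hv_netvsc rebinds.
            pass
        time.sleep(step)
    return False
```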
|
|
When bringing interface up by unbinding and then binding hv_netvsc
driver, there might be a short delay after binding before the link is
up. So before trying unbind/bind again after sleep, check if the link
is up. This is a corner case when a preprovisioned VM is reused and
the NICs are hot-attached.
|
|
|
|
Azure Linux Agent (WaLinuxAgent) waits for the ovf-env.xml file
to be written by cloud-init when cloud-init provisions the VM. This
file is written whenever cloud-init reads its contents from the
provisioning ISO.
With this change, when there is no provisioning ISO,
DataSourceAzure will generate the ovf-env.xml file based on the
metadata obtained from Azure IMDS.
|
|
With a few exceptions, Azure VM deployments receive provisioning
metadata through the provisioning iso presented as a cdrom device
(/dev/sr0). The existing code attempts to find this device by calling
blkid to find all devices that have either type iso9660 or udf. This
can be very expensive if the VM has a lot of disks. This commit
attempts to mount the default iso location first and only falls back to
blkid to locate the iso if mounting the default location fails.
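Roughly, the lookup becomes: try the well-known device first, and only
scan with blkid if that fails. A simplified sketch (the real code uses
cloud-init's mount/subp helpers and also matches iso9660):
```
import subprocess

DEFAULT_PROVISIONING_ISO_DEV = "/dev/sr0"


def find_provisioning_iso(try_mount) -> str:
    """Return a mountable ISO device, preferring the default cdrom path."""
    # Fast path: almost every deployment exposes the ISO at /dev/sr0.
    if try_mount(DEFAULT_PROVISIONING_ISO_DEV):
        return DEFAULT_PROVISIONING_ISO_DEV
    # Slow path: fall back to blkid, which probes every disk and can be
    # expensive on VMs with many data disks.
    out = subprocess.run(
        ["blkid", "-o", "device", "-t", "TYPE=udf"],
        capture_output=True, text=True, check=False,
    ).stdout
    for dev in out.splitlines():
        if try_mount(dev):
            return dev
    raise RuntimeError("provisioning ISO not found")
```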
|
|
Control is currently limited to boot events, though this should
allow us to more easily incorporate HOTPLUG support. Disabling
'instance-first-boot' is not supported as we apply networking config
too early in boot to have processed userdata (along with the fact
that this would be a pretty big foot-gun).
The concept of update events on datasource has been split into
supported update events and default update events. Defaults will be
used if there are no user-defined update events, but user-defined
events won't be supplied if they aren't supported.
When applying the networking config, we now check to see if the event
is supported by the datasource as well as if it is enabled.
Configuration looks like:
updates:
  network:
    when: ['boot']
|
|
See https://bugs.launchpad.net/cloud-init/+bug/1910835
|
|
|
|
When network interfaces are hot-attached to the VM, attempting to get
network metadata might return 410 (or 500, 503 etc) because the info
is not yet available. In those cases, we retry getting the metadata
before giving up. The only case where we move on to wait for more
nic attach events is if the call times out despite retries, which
means the interface is likely not a primary interface.
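A hedged sketch of the retry policy described above; the helper and the
exception types are placeholders for cloud-init's url_helper behavior:
```
import time


def fetch_nic_metadata(fetch, retries: int = 10, delay: float = 2.0):
    """Retry transient IMDS errors; a timeout means 'likely not primary'."""
    for attempt in range(retries):
        try:
            return fetch()
        except TimeoutError:
            # A timeout suggests this is not the primary interface;
            # propagate so the caller can wait for more nic-attach events.
            raise
        except Exception:
            # 410/500/503-style responses: metadata not ready yet, retry.
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```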
|
|
This change allows us to retrieve the username and hostname from
IMDS instead of having to rely on the mounted OVF.
|
|
Due to hyper-v implementations, iso ejection is more efficient if performed
from within the guest. The code will attempt to perform a best-effort ejection.
Failure during ejection will not prevent reporting ready from happening. If iso
ejection is successful, later iso ejection from the platform will be a no-op.
In the event the iso ejection from the guest fails, iso ejection will still happen at
the platform level.
|
|
Invoking walinuxagent from within cloud-init is no longer
supported/necessary
|
|
Add flexibility to IMDS api-version by having both a desired IMDS
api-version and a minimum api-version. The desired api-version will
be used first, and if that fails it will fall back to the minimum
api-version.
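In sketch form (the version strings and helper are illustrative, not the
exact constants used):
```
IMDS_VER_WANT = "2021-08-01"  # desired API version (illustrative value)
IMDS_VER_MIN = "2019-06-01"   # minimum supported API version (illustrative)


def get_imds_data_with_api_fallback(fetch_imds):
    """Prefer the desired IMDS api-version, falling back to the minimum."""
    try:
        return fetch_imds(api_version=IMDS_VER_WANT)
    except Exception:
        # The desired version may not be available everywhere yet; retry
        # with the minimum version that is always supported.
        return fetch_imds(api_version=IMDS_VER_MIN)
```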
|
|
Changes:
* Only merge in default Azure cloud ephemeral disk configs
during DataSourceAzure._get_data() if the ephemeral disk
exists.
* DataSourceAzure.address_ephemeral_resize() (which is
invoked in DataSourceAzure.activate()) should only set up
the ephemeral disk if the disk exists.
Azure VMs may or may not come with ephemeral resource disks
depending on the VM SKU. For VM SKUs that come with
ephemeral resource disks, the Azure platform guarantees that
the ephemeral resource disk is attached to the VM before
the VM is booted. For VM SKUs that do not come with
ephemeral resource disks, cloud-init currently attempts
to wait and set up a non-existent ephemeral resource
disk, which wastes boot time. It also causes disk setup
modules to fail (due to non-existent references to the
ephemeral resource disk).
udevadm settle is invoked by cloud-init very early in boot,
before DataSourceAzure's _get_data() and activate() methods.
Within DataSourceAzure's _get_data() and activate() methods,
the ephemeral resource disk path should exist if the
VM SKU comes with an ephemeral resource disk.
The ephemeral resource disk path should not exist if the
VM SKU does not come with an ephemeral resource disk.
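A hedged sketch of the gating check; the device path constant reflects
the Azure resource-disk symlink but should be treated as illustrative:
```
import os

RESOURCE_DISK_PATH = "/dev/disk/cloud/azure_resource"  # illustrative path


def maybe_setup_ephemeral_disk(setup_disk) -> bool:
    """Only configure the ephemeral resource disk if it actually exists."""
    # udevadm settle has already run well before _get_data()/activate(),
    # so a missing path means this VM SKU has no ephemeral resource disk.
    if not os.path.exists(RESOURCE_DISK_PATH):
        return False
    setup_disk(RESOURCE_DISK_PATH)
    return True
```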
LP: #1901011
|
|
Kernels newer than 4.15 present /sys/dmi/id/product_uuid as a
lowercase value. Previously UUID was uppercase.
Azure datasource reads the product_uuid directly as the platform's
instance-id. This presents a problem if a kernel is either
upgraded or downgraded across the 4.15 kernel version boundary because
the case of the UUID will change, resulting in cloud-init seeing a
"new" instance id and re-running all modules.
Re-running cc_ssh in cloud-init deletes and regenerates ssh_host keys
on a system, which can cause concern on long-running instances that
something nefarious has happened.
Also add:
- An integration test for this for Azure Bionic Ubuntu FIPS upgrading from
a FIPS kernel with uppercase UUID to a lowercase UUID in linux-azure
- A new pytest.mark.sru_next to collect all integration tests related to our
next SRU
LP: #1835584
|
|
With the changes for SSH public keys to be retrieved from IMDS as a
first option, a key passed through in something other than the raw SSH
public key format causes an issue and the key is not added to the user's
authorized_keys file.
This PR will temporarily disable this behavior until a permanent fix is
put in place.
|
|
|
|
Prevent network interfaces without IP addresses from being added to the
generated network configuration.
|
|
Adds the ability to run Azure preprovisioned VMs as NIC-less and
then hot-attach NICs when the VM is assigned for reprovision.
The NIC on the preprovisioned VM is hot-detached as soon as the VM reports
ready, and the VM waits for one or more interfaces to be hot-attached.
Once they are attached, cloud-init gets the expected number of NICs (in
case there are more than one) that will be attached from IMDS and waits
until all of them are attached. After all the NICs are attached,
reprovision proceeds as usual.
|
|
cc_set_password will only update the password for the default user if
cfg['password'] is set. The existing code of datasource Azure will fail
to update the default user's password because it does not set that
metadata. If the default user doesn't exist in the image, the current
code works fine because the password is set during user creation and
not in cc_set_password.
|
|
On systems where the Azure datasource is a viable platform for
crawling metadata, cloud-init occasionally encounters fatal,
irrecoverable errors while crawling the Azure datasource.
When this happens, cloud-init crashes and Azure VM provisioning fails.
However, instead of failing immediately, the user continues to see
provisioning for a long time until it times out with an
"OS Provisioning Timed Out" message.
In these situations, cloud-init should report failure to the Azure
datasource endpoint, indicating provisioning failure. The user will
immediately see provisioning terminate, giving them a much better
failure experience instead of pointlessly waiting for the OS
provisioning timeout.
|
|
This just separates the reading of dmi values into its own file.
Some things of note:
* left the import of util in dmi.py only for 'is_container'.
It'd be good if is_container was not in util.
* moved just the use of 'util.is_x86' to dmi.py.
* open() is used directly rather than load_file.
|
|
DataSourceAzure previously wrote the preprovisioning
reported-ready marker file before going through the
report ready workflow. On certain VM instances, the
marker file is successfully written but reporting
ready then fails.
Upon rare VM reboots by the platform, cloud-init sees
that the report ready marker file already exists.
The existence of this marker file tells cloud-init
not to report ready again (because it mistakenly
assumes that it already reported ready in
preprovisioning).
In this scenario, cloud-init erroneously
takes the reprovisioning workflow instead of
reporting ready again.
|
|
This moves logging into `report_diagnostic_event`, to clean up its callsites.
|
|
enumeration of physical network devices (#591)
|
|
fails (#549)
Azure datasource's `parse_network_config` throws a fatal uncaught exception when an exception is raised during generation of network config from IMDS metadata. This happens when IMDS metadata is invalid/corrupted (such as when it is missing network or interface metadata). This causes the rest of provisioning to fail.
This changes `parse_network_config` to be a non-fatal implementation. Additionally, when generating network config from IMDS metadata fails, fall back on generating fallback network config (`_generate_network_config_from_fallback_config`).
This also changes fallback network config generation (`_generate_network_config_from_fallback_config`) to blacklist an additional driver: `mlx5_core`.
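The shape of the change, as a hedged sketch; the generator callables
stand in for the real helpers named above:
```
import logging

LOG = logging.getLogger(__name__)


def parse_network_config(imds_metadata, generate_from_imds, generate_fallback):
    """Build network config from IMDS metadata, never raising fatally."""
    try:
        return generate_from_imds(imds_metadata)
    except Exception as error:
        # Invalid or corrupted IMDS metadata must not abort provisioning.
        LOG.warning(
            "Failed to generate network config from IMDS metadata: %s; "
            "falling back to fallback network config",
            error,
        )
        # Stand-in for _generate_network_config_from_fallback_config(),
        # which now also blacklists the mlx5_core driver.
        return generate_fallback(blacklist_drivers=["mlx4_core", "mlx5_core"])
```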
|
|
* pull ssh keys from imds first and fall back to ovf if unavailable
* refactor log and diagnostic messages
* refactor the OpenSSLManager instantiation and certificate usage
* fix unit test where exception was being silenced for generate cert
* fix tests now that certificate is not always generated
* add documentation for ssh key retrieval
* add ability to check if http client has security enabled
* refactor certificate logic to GoalState
|
|
Changes:
tox: bump the pylint version to 2.6.0 in the default run
Fix pylint 2.6.0 W0707 warnings (raise-missing-from)
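For reference, W0707 (raise-missing-from) flags re-raising without
exception chaining; the fix is `raise ... from ...`, for example:
```
import json


def load_config(text: str) -> dict:
    try:
        return json.loads(text)
    except ValueError as error:
        # W0707 fires when the original exception is not chained;
        # "raise ... from error" preserves the traceback context.
        raise RuntimeError("invalid config") from error
```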
|
|
This fixes a long delay during boot of some instances. For Azure instance types using SR-IOV via the Hyper-V netvsc network driver, two network interfaces are created that share the same MAC, but only the virtual device should be configured and used. Updating the netplan configuration to filter on the hv_netvsc driver prevents netplan from trying to configure both devices.
LP: #1830740
|
|
Push the cloud-init.log file (up to 500KB at once) to the KVP before reporting ready to the Azure platform.
Based on the analysis done on a large sample of cloud-init.log files, here are the statistics collected on log file size:
P50 P90 P95 P99 P99.9 P99.99
137K 423K 537K 3.5MB 6MB 16MB
This change limits the size of cloud-init.log file data that gets dumped to KVP to 500KB. So for ~95% of the cases, the whole log file will be dumped and for the remaining ~5%, we will get the last 500KB of the cloud-init.log file.
To assess the performance of the 500KB limit, 250 VMs were deployed with a 500KB cloud-init.log file and the time taken to compress, encode and dump the entries to KVP was measured. Here are the timing percentiles in milliseconds:
P50 P99 P999
75.705 232.701 1169.636
Another 250 VMs were deployed with this logic dumping their normal cloud-init.log file to KVP, and the same timing was measured as above. Here are the timing percentiles in milliseconds:
P50 P99 P999
1.88 5.277 6.992
Added excluded_handlers to the report_event function to be able to opt out of reporting the events of the compressed cloud-init.log file to the cloud-init.log file.
The KVP break_down logic had a bug where it reused the same key for all the split chunks of a KVP, which resulted in the split KVPs being overwritten by the last one when consumed by Hyper-V. I added the split chunk index as a differentiator to the KVP key.
Hyper-V consumes the KVPs from the KVP file as chunks whose key is 512 bytes and value is 2048 bytes, but the Azure platform expects the value to be 1024 bytes, so I introduced the Azure value limit.
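A hedged sketch of chunking with the index baked into each key; the size
constant and key format are illustrative:
```
HV_KVP_AZURE_MAX_VALUE_SIZE = 1024  # bytes, per the Azure platform limit


def break_down(key, value):
    """Split an oversized KVP value into chunks with unique, indexed keys."""
    size = HV_KVP_AZURE_MAX_VALUE_SIZE
    if len(value) <= size:
        return [(key, value)]
    chunks = []
    for index, start in enumerate(range(0, len(value), size)):
        # The chunk index keeps every key unique so Hyper-V does not
        # overwrite earlier chunks with the last one.
        chunks.append(("{}-{}".format(key, index), value[start:start + size]))
    return chunks
```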
|
|
DataSourceAzure: Gracefully handle the case of set hostname failure during provisioning
|
|
JSONDecodeError is only available in Python 3.5+. When it isn't available (i.e. on Python 3.4, which cloud-init still supports) use the more generic ValueError.
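Since json.JSONDecodeError subclasses ValueError, catching ValueError
behaves the same on both versions; for example:
```
import json


def safe_load(blob):
    try:
        return json.loads(blob)
    except ValueError:  # JSONDecodeError on 3.5+, plain ValueError on 3.4
        return None
```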
|