summaryrefslogtreecommitdiff
path: root/cloudinit/sources/DataSourceAzure.py
AgeCommit message (Collapse)Author
2022-02-15sources/azure: report ready in local phase (#1265)Chris Patterson
Pre-provisioned instances report ready early in the local phase and again in the non-local phase, during setup(). Non-PPS only reports ready during non-local phase. Update the process to report ready during the local phase for all cases. Only attempt to do so if networking is up to prevent stalling boot. We've already waited at least 20 minutes for DHCP if we're provisioning, or 5 minutes for DHCP on normal boot requesting updated network configuration. - Extend _report_ready() with pubkey_info and raise exception on error to consolidate reporting done in _negotiate() and _report_ready(). - Remove setup(), moving relevant logic into crawl_metadata(). - Move remaining _negotiate() logic into _cleanup_markers() and _determine_wireserver_pubkey_info(). These changes effectively fix two issues that were present: (1) _negotiated is incorrectly set to True When failing to report ready. _negotiate() squashed the exception and the return value was not checked. This was probably masked due to the forced removal of obj.pkl on Ubuntu instances, but would be preferable once we start persisting it to prevent unnecessary re-negotiation. (2) provisioning media is not ejected for non-PPS _negotiate() did not pass iso_dev parameter when reporting ready. The host will ensure this operation takes place, but it is preferable to eject /dev/sr0 from within the guest when we're done with it. Lastly, this removes any need for lease file parsing as the wireserver addressed is tracked for ephemeral DHCP. A follow-up PR will remove this now-unused logic. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-02-14sources/azure: validate IMDS network configuration metadata (#1257)Chris Patterson
Due to race conditions and caching, IMDS may return stale or incomplete metadata. Add some validation to detect these scenarios and report appropriate telemetry. Introduce normalize_mac_address() to allow for comparison of mac addresses, replacing that found inline in: _generate_network_config_from_imds_metadata() Add validation of final fetch of IMDS metadata. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-02-11sources/azure: address mypy/pyright typing complaints (#1245)Chris Patterson
Raise runtime errors for unhandled cases which would cause other exceptions. Ignore types for a few cases where a non-trivial refactor would be required to prevent the warning. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-02-10sources/azure: refactor ssh key handling (#1248)Chris Patterson
Split _get_public_ssh_keys_and_source() into _get_public_keys_from_imds() and _get_public_keys_from_ovf(). Set _get_public_keys_from_imds() to take a parameter of the IMDS metadata rather than assuming it is already set in self.metadata. This will allow us to move negotation into local phase where self.metadata may not be set yet. Update this method to raise KeyError if IMDS metadata is missing/malformed, and ValueError if SSH key format is not supported. Update get_public_ssh_keys() to catch these errors and fall back to the OVF/Wireserver keys as needed. To improve clarity, update register_with_azure_and_fetch_data() to return the list of SSH keys, rather than bundling them into a dictionary for updating against the metadata dictionary. There should be no change in behavior with this refactor. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-02-04sources/azure: consolidate ephemeral DHCP configuration (#1229)Chris Patterson
Introduce: - _setup_ephemeral_networking() to bring up networking. If no iface is specified, it will use net.find_fallback_nic() which is consistent with the previous usage of fallback_interface. This method now tracks the encoded address of the wireserver with a new property `_wireserver_endpoint`. Introduce a timeout parameter to allow for retrying for a specified amount of time. - _teardown_ephemeral_networking() to bring down networking. - _is_ephemeral_networking_up() to check status. Ephemeral networking is now: - Brought up prior to checking IMDS. - Torn down following metadata crawl. - For Savable PPS, torn down prior to waiting for NIC detach. The link must be torn down in advance or we will see errors from cleaning up network after the interface is unplugged. - For Running PPS, torn down after waiting for media switch. The link must be up for media switch to be detected. - For all PPS, after network switch is complete, networking is brought back up to poll for reprovision data and report ready. It will be torn down after metadata crawl is complete like non-PPS paths. Additionally: - Remove EphemeralDHCPv4WithReporting variant in favor of directly using EphemeralDHCPv4. The reporting was only for __enter__ usage which is no longer a used path. Continue to use dhcp_log_cb callback. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-02-04sources/azure: fix metadata check in _check_if_nic_is_primary() (#1232)Chris Patterson
Currently _check_if_nic_is_primary() checks for imds_md is None, but imds_md is returned as an empty dictionary on error fetching metdata. Fix this check and the tests that are incorrectly vetting IMDS polling code. Additionally, use response.contents instead of str(response) when loding the JSON data returned from readurl. This helps simplify the mocking and avoids an unncessary conversion. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-01-31sources/azure: refactor _report_ready_if_needed and _poll_imds (#1222)Chris Patterson
Refactor _report_ready_if_needed() to work for both Savable PPS and Runnable PPS: * rename _report_ready_if_needed() to _report_ready_for_pps() * return interface name from lease to support _poll_imds() behavior without changing it. * fixes an issue where reporting ready return value was silently ignored for Savable PPS. * add explicit handling for failure to obtain DHCP lease to result in sources.InvalidMetaDataException. Refactor _poll_imds(): * use _report_ready_for_pps() for reporting ready, removing this logic to simplify loop logic. * move netlink and vnetswitch out of while loop to simplify loop logic, leaving only reprovision polling in loop. * add explicit handling for failure to obtain DHCP lease and retry in the next iteration. Signed-off-by: Chris Patterson cpatterson@microsoft.com
2022-01-28sources/azure: refactor _should_reprovision[_after_nic_attach]() logic (#1206)Chris Patterson
Consolidate _should_reprovision_after_nic_attach() with _should_reprovision() into the following: _write_reprovision_marker() to write provisioning marker for reboot-during-provisioning case. PPSType enum and _determine_pps_type() for determining which to provisioning mode, if any, we're running under. PPSType.UNKNOWN is when the reprovisioning marker is found and we do not have the context to know what the original mode was. In this scenario, we must resort to polling for reprovision data. Tests: Introduce a simple data source fixture to for fine-grain control of mocking with pytest without unittest. Migrate relevant _should_reprovision() tests into a combination of TestDeterminePPSTypeScenarios cases. Signed-off-by: Chris Patterson cpatterson@microsoft.com
2022-01-20sources/azure: drop unused case in _report_failure() (#1200)Chris Patterson
According to the documentation in the tests: ``` We expect 3 calls to report_failure_to_fabric, because we try 3 different methods of calling report failure. The different methods are attempted in the following order: 1. Using cached ephemeral dhcp context to report failure to Azure 2. Using new ephemeral dhcp to report failure to Azure 3. Using fallback lease to report failure to Azure ``` Case 1 and 2 make sense. If networking is established, use it. Should failure occur using current network configuration, retry with fresh DHCP. Case 3 suggests that we can fall back to a lease file and retry. Given that: 1. The wireserver address has never changed to date. 2. The wireserver address should be in the DHCP lease. 3. Parsing the lease file does not improve connectivity over the prior attempts. ...we can safely remove this case without regression. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-01-20sources/azure: always initialize _ephemeral_dhcp_ctx on unpickle (#1199)Chris Patterson
Avoid requirement of getattr() and ensure _ephemeral_dhcp_ctx isn't persisted in the cache. Signed-off-by: Chris Patterson cpatterson@microsoft.com
2022-01-20sources/azure: unpack ret tuple in crawl_metadata() (#1194)Chris Patterson
load_azure_ds_dir() always returns a tuple. Instead of saving this tuple as ret, expand it immediately as md, userdata_raw, cfg, files. This allows for more fine-grained passing of data before getting expanded later. - Update _should_reprovision methods to use cfg instead of tuple. - Update _should_reprovision methods to remove the ovf_md guard. This should be a safe refactor as the OVF is not required, and the config is initialized to an empty dict. In practice, a mount failure would have initialized ret anyways if the OVF was not found. If a mount failure wasn't seen and ret was None, this guard could be causing other failures by ignoring the PPS state that should be available from IMDS metadata. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-01-18sources/azure: set ovf_is_accessible when OVF is read successfully (#1193)Chris Patterson
The if-statement set ovf_is_accessible to True if the OVF is read from /dev/sr0, but not from other data sources. It defaults to True, but may get flipped to False while processing an invalid source, and never get set back to True when reading from the data directory. Instead, default ovf_is_accessible to False, and only set it to True once we've read an OVF successfully (and end the search). This fixes an error when OVF is read from data_dir and IMDS data is unavailable (failing with "No OVF or IMDS available"). Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-01-18sources/azure: consolidate DHCP variants to EphemeralDHCPv4WithReporting (#1190)Chris Patterson
- Update EphemeralDHCPv4WithReporting to subclass EphemeralDHCPv4 for consistency (non-functional change). - Replace all usage of EphemeralDHCPv4 with EphemeralDHCPv4WithReporting. - Converging to one DHCP class exposed an issue with ExitStack patches being mixed with decorators. Specifically, it appeared that tests that did not enable azure.EphemeralDHCPv4WithReporting mocks had it applied anyways from previous tests. Presumably ExitStack was overwriting the actual value with the mock provided by the decorator? For now, remove some mock patches that trigger failures, but future work should move towards a consistent approach to prevent undetected effects. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-01-11sources/azure: rename metadata_type -> MetadataType (#1181)Chris Patterson
Format tweak to match naming conventions for classes & enums. No functional changes. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2022-01-04sources/azure: do not persist failed_desired_api_version flag (#1159)Chris Patterson
If get_imds_data_with_api_fallback() falls back to the minimum required API version, it is effectively pinned to the old API version forever. Remove the failed_desired_api_version property to prevent persistence of the flag between calls and/or reboots. The continued presence of this flag in obj.pkl should be harmless. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2021-12-15Adopt Black and isort (SC-700) (#1157)James Falcon
Applied Black and isort, fixed any linting issues, updated tox.ini and CI.
2021-12-10sources/azure: remove unnecessary hostname bounce (#1143)Chris Patterson
Thanks to [1], the hostname is set prior to network bring-up. The Azure data source has been bouncing the hostname during setup(), occurring after the hostname has already been properly configured. Note that this doesn't prevent leaking the image's hostname during Azure's _get_data() when it brings up ephemeral DHCP. However, as are not guaranteed to have the hostname metadata available from a truly "local" source, this behavior is to be expected unless we disable `send host-name` from dhclient config. [1]: https://github.com/canonical/cloud-init/commit/133ad2cb327ad17b7b81319fac8f9f14577c04df Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2021-11-29sources/azure: remove unused remnants related to agent command (#1119)Chris Patterson
Some references were missed in the removal of the agent command in PR #799. This simply removes the remaining references. Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
2021-11-02Azure: fallback nic needs to be reevaluated during reprovisioning (#1094)Anh Vo
During reprovisioning, VM network will change. fallback nic should be cleared after use so that it can be re-evaluated after reprovisioning
2021-11-02azure: pps imds (#1093)Anh Vo
Without UDF support, DS Azure cannot mount the provisioning ISO, which contains platform metadata necessary to support pre-provisioning. The required metadata is made available in IMDS starting with api version 2021-08-01. This change will leverage IMDS to obtain the required metadata to support pre-preprovisioning if provisioning ISO was not available.
2021-10-19Update Azure _unpickle (SC-500) (#1067)James Falcon
When self.failed_desired_api_version was added to DataSourceAzure, the attribute was never added to the _unpickle method using the upgrade framework. This commit adds the attribute. LP: #1946644
2021-09-17Set Azure to apply networking config every BOOT (#1023)James Falcon
In #1006, we set Azure to apply networking config every BOOT_NEW_INSTANCE because the BOOT_LEGACY option was causing problems applying networking the second time per boot. However, BOOT_NEW_INSTANCE is also wrong as Azure needs to apply networking once per boot, during init-local phase.
2021-09-03Set Azure to only update metadata on BOOT_NEW_INSTANCE (#1006)James Falcon
In #834, we refactored the handling of events for fetching new metadata. Previously, in Azure's __init__, the BOOT event was added to the update_events, so it was assumed that Azure required the standard BOOT behavior, which is to apply metadata twice every boot: once during local-init, then again during standard init phase. https://github.com/canonical/cloud-init/blob/21.2/cloudinit/sources/DataSourceAzure.py#L356 However, this line was effectively meaningless. After the metadata was fetched in local-init, it was then pickled out to disk. Because "update_events" was a class variable, the EventType.BOOT was not persisted into the pickle. When the pickle was then unpickled in the init phase, metadata did not get re-fetched because EventType.BOOT was not present, so Azure is effectely only BOOT_NEW_INSTANCE. Fetching metadata twice during boot causes some issue for pre-provisioning on Azure because updating metadata during re-provisioning will cause cloud-init to poll for reprovisiondata again in DataSourceAzure, which will infinitely return 404(reprovisiondata is deleted from IMDS after health signal was sent by cloud-init during init-local). This makes cloud-init stuck in 'init'
2021-08-24Azure: Retry dhcp on timeouts when polling reprovisiondata (#998)aswinrajamannar
In the nic attach path, we skip doing dhcp since we already did it when bringing the interface up. However when polling for reprovisiondata, it is possible for the request to timeout due to platform issues. In those cases we still need to do dhcp and try again since we tear down the context. We can only skip the first dhcp attempt.
2021-08-20Azure: During primary nic detection, check interface status continuously ↵aswinrajamannar
before rebinding again (#990) Add 10 second polling loop in wait_for_link_up after performing an unbind and re-bind of primary NIC in hv_netvsc driver. Also reduce cloud-init logging levels to debug for these operations.
2021-08-12Azure: Check if interface is up after sleep when trying to bring it up (#972)aswinrajamannar
When bringing interface up by unbinding and then binding hv_netvsc driver, it might take a short delay after binding for the link to be up. So before trying unbind/bind again after sleep, check if the link is up. This is a corner case when a preprovisioned VM is reused and the NICs are hot-attached.
2021-08-10Azure: Limit polling network metadata on connection errors (#961)aswinrajamannar
2021-08-05generate contents for ovf-env.xml when provisioning via IMDS (#959)Anh Vo
Azure Linux Agent (WaLinuxAgent) waits for the ovf-env.xml file to be written by cloud-init when cloud-init provisions the VM. This file is written whenever cloud-init reads its contents from the provisioning ISO. With this change, when there is no provisioning ISO, DataSourceAzure will generate the ovf-env.xml file based on the metadata obtained from Azure IMDS.
2021-07-19Azure: mount default provisioning iso before try device listing (#870)Anh Vo
With a few exceptions, Azure VM deployments receive provisioning metadata through the provisioning iso presented as a cdrom device (/dev/sr0). The existing code attempts to find this device by calling blkid to find all devices that have either type iso9660 or udf. This can be very expensive if the VM has a lot of disks. This commit will attempt to mount the default iso location first and only tries to use blkid to locate the iso location if the default mounting location fails
2021-05-13Allow user control over update events (#834)James Falcon
Control is currently limited to boot events, though this should allow us to more easily incorporate HOTPLUG support. Disabling 'instance-first-boot' is not supported as we apply networking config too early in boot to have processed userdata (along with the fact that this would be a pretty big foot-gun). The concept of update events on datasource has been split into supported update events and default update events. Defaults will be used if there is no user-defined update events, but user-defined events won't be supplied if they aren't supported. When applying the networking config, we now check to see if the event is supported by the datasource as well as if it is enabled. Configuration looks like: updates: network: when: ['boot']
2021-05-05Add \r\n check for SSH keys in Azure (#889)James Falcon
See https://bugs.launchpad.net/cloud-init/+bug/1910835
2021-04-27Azure: adding support for consuming userdata from IMDS (#884)Anh Vo
2021-04-26Azure: Retry net metadata during nic attach for non-timeout errs (#878)aswinrajamannar
When network interfaces are hot-attached to the VM, attempting to get network metadata might return 410 (or 500, 503 etc) because the info is not yet available. In those cases, we retry getting the metadata before giving up. The only case where we can move on to wait for more nic attach events is if the call times out despite retries, which means the interface is not likely a primary interface, and we should try for more nic attach events.
2021-04-26Azure: Retrieve username and hostname from IMDS (#865)Thomas Stringer
This change allows us to retrieve the username and hostname from IMDS instead of having to rely on the mounted OVF.
2021-04-23Azure: eject the provisioning iso before reporting ready (#861)Anh Vo
Due to hyper-v implementations, iso ejection is more efficient if performed from within the guest. The code will attempt to perform a best-effort ejection. Failure during ejection will not prevent reporting ready from happening. If iso ejection is successful, later iso ejection from the platform will be a no-op. In the event the iso ejection from the guest fails, iso ejection will still happen at the platform level.
2021-04-13azure: Removing ability to invoke walinuxagent (#799)Anh Vo
Invoking walinuxagent from within cloud-init is no longer supported/necessary
2021-03-03Add flexibility to IMDS api-version (#793)Thomas Stringer
Add flexibility to IMDS api-version by having both a desired IMDS api-version and a minimum api-version. The desired api-version will be used first, and if that fails it will fall back to the minimum api-version.
2021-02-22Azure: Support for VMs without ephemeral resource disks. (#800)Johnson Shi
Changes: * Only merge in default Azure cloud ephemeral disk configs during DataSourceAzure._get_data() if the ephemeral disk exists. * DataSourceAzure.address_ephemeral_resize() (which is invoked in DataSourceAzure.activate() should only set up the ephemeral disk if the disk exists. Azure VMs may or may not come with ephemeral resource disks depending on the VM SKU. For VM SKUs that come with ephemeral resource disks, the Azure platform guarantees that the ephemeral resource disk is attached to the VM before the VM is booted. For VM SKUs that do not come with ephemeral resource disks, cloud-init currently attempts to wait and set up a non-existent ephemeral resource disk, which wastes boot time. It also causes disk setup modules to fail (due to non-existent references to the ephemeral resource disk). udevadm settle is invoked by cloud-init very early in boot. udevadm settle is invoked very early, before DataSourceAzure's _get_data() and activate() methods. Within DataSourceAzure's _get_data() and activate() methods, the ephemeral resource disk path should exist if the VM SKU comes with an ephemeral resource disk. The ephemeral resource disk path should not exist if the VM SKU does not come with an ephemeral resource disk. LP: #1901011
2021-02-19azure: case-insensitive UUID to avoid new IID during kernel upgrade (#798)Chad Smith
Kernel's newer than 4.15 present /sys/dmi/id/product_uuid as a lowercase value. Previously UUID was uppercase. Azure datasource reads the product_uuid directly as their platform's instance-id. This presents a problem if a kernel is either upgraded or downgraded across the 4.15 kernel version boundary because the case of the UUID will change, resulting in cloud-init seeing a "new" instance id and re-running all modules. Re-running cc_ssh in cloud-init deletes and regenerates ssh_host keys on a system which can cause concern on long-running instances that somethingnefarious has happened. Also add: - An integration test for this for Azure Bionic Ubuntu FIPS upgrading from a FIPS kernel with uppercase UUID to a lowercase UUID in linux-azure - A new pytest.mark.sru_next to collect all integration tests related to our next SRU LP: #1835584
2021-01-11Fix regression with handling of IMDS ssh keys (#760)Thomas Stringer
With the changes for SSH public keys to be retrieved from IMDS as a first option, when a key is passed through not in the raw SSH public key format it causes an issue and the key is not added to the user's authorized_keys file. This PR will temporarily disable this behavior until a permanent fix is put in place.
2021-01-04Azure: Add telemetry for poll IMDS (#741)Johnson Shi
2020-12-16Azure: only generate config for NICs with addresses (#709)Thomas Stringer
Prevent network interfaces without IP addresses from being added to the generated network configuration.
2020-11-23Ability to hot-attach NICs to preprovisioned VMs before reprovisioning (#613)aswinrajamannar
Adds the ability to run the Azure preprovisioned VMs as NIC-less and then hot-attach them when assigned for reprovision. The NIC on the preprovisioned VM is hot-detached as soon as it reports ready and goes into wait for one or more interfaces to be hot-attached. Once they are attached, cloud-init gets the expected number of NICs (in case there are more than one) that will be attached from IMDS and waits until all of them are attached. After all the NICs are attached, reprovision proceeds as usual.
2020-11-18DataSourceAzure: update password for defuser if exists (#671)Anh Vo
cc_set_password will only update the password for the default user if cfg['password'] is set. The existing code of datasource Azure will fail to update the default user's password because it does not set that metadata. If the default user doesn't exist in the image, the current code works fine because the password is set during user create and not in cc_set_password
2020-11-18DataSourceAzure: send failure signal on Azure datasource failure (#594)Johnson Shi
On systems where the Azure datasource is a viable platform for crawling metadata, cloud-init occasionally encounters fatal irrecoverable errors during the crawling of the Azure datasource. When this happens, cloud-init crashes, and Azure VM provisioning would fail. However, instead of failing immediately, the user will continue seeing provisioning for a long time until it times out with "OS Provisioning Timed Out" message. In these situations, cloud-init should report failure to the Azure datasource endpoint indicating provisioning failure. The user will immediately see provisioning terminate, giving them a much better failure experience instead of pointlessly waiting for OS provisioning timeout.
2020-11-02cloudinit: move dmi functions out of util (#622)Scott Moser
This just separates the reading of dmi values into its own file. Some things of note: * left import of util in dmi.py only for 'is_container' It'd be good if is_container was not in util. * just the use of 'util.is_x86' to dmi.py * open() is used directly rather than load_file.
2020-10-16DataSourceAzure: write marker file after report ready in preprovisioning (#590)Johnson Shi
DataSourceAzure previously writes the preprovisioning reported ready marker file before it goes through the report ready workflow. On certain VM instances, the marker file is successfully written but then reporting ready fails. Upon rare VM reboots by the platform, cloud-init sees that the report ready marker file already exists. The existence of this marker file tells cloud-init not to report ready again (because it mistakenly assumes that it already reported ready in preprovisioning). In this scenario, cloud-init instead erroneously takes the reprovisioning workflow instead of reporting ready again.
2020-10-15azure: clean up and refactor report_diagnostic_event (#563)Johnson Shi
This moves logging into `report_diagnostic_event`, to clean up its callsites.
2020-10-13net: add the ability to blacklist network interfaces based on driver during ↵Anh Vo
enumeration of physical network devices (#591)
2020-09-24Azure parse_network_config uses fallback cfg when generate IMDS network cfg ↵Johnson Shi
fails (#549) Azure datasource's `parse_network_config` throws a fatal uncaught exception when an exception is raised during generation of network config from IMDS metadata. This happens when IMDS metadata is invalid/corrupted (such as when it is missing network or interface metadata). This causes the rest of provisioning to fail. This changes `parse_network_config` to be a non-fatal implementation. Additionally, when generating network config from IMDS metadata fails, fall back on generating fallback network config (`_generate_network_config_from_fallback_config`). This also changes fallback network config generation (`_generate_network_config_from_fallback_config`) to blacklist an additional driver: `mlx5_core`.