Flash Translation Layer (FTL) - Intermittently unable to read previously stored data

Hi,

Here is some background information:
Using Flash Translation Layer (FTL) to store configuration data including SSID, pwd, firmware parameters. The size of the data structure is about 500 bytes.
FTL is initialized as follows:

ftl_init(ftl_phy_page_start_addr, ftl_phy_page_num);
//where ftl_phy_page_start_addr is USER_DATA_START_ADDR
#define USER_DATA_START_ADDR	0x00100000
//and ftl_phy_page_num = 3

Flash structure has 2 bytes to start with “flash_magic” to signify that flash data is valid:

// Structure that contains all user data, note the maximum size is 8K (or maybe 4K, untested)
typedef struct af_flash__user_data_t {

	// Flash Parameters
	uint16_t flash_magic;				// used to determine if flash is "good" or "bad"

	//more parameters go here...
} af_flash__user_data_t;

Application logic is:

  1. read flash
  2. if error then “write defaults” in flash (flash_magic = 0xABCD)
  3. if flash_magic != 0xABCD then “write defaults”
  4. Firmware business logic starts here using values from af_flash__user_data_t structure

Problem:
The code and logic above run fine, and tests on several tags showed no issues. However, when we increased the sample size and the duration of the tests, we saw cases where flash had been reset to default values, which should not happen under normal circumstances: 7 tags out of 200 lost their configuration after about 2 days. We did not re-flash the tags. We have a de-activation and activation process, and configuration is pushed to the tags during activation. Once the tags were re-activated, they worked fine.
This seems to be an intermittent problem, and because of that I am not sure how to troubleshoot and fix it.

To provide more info: flash is read after every startup. Flash is written only during activation and when an AP switch occurs (the tag connects to a new AP); the idea is to write to flash only when values change. The duty cycle for the test of 200 tags was to report every 1 minute: the main MCU (not the RTL8722 chip) powers on the RTL8722, which reads flash, connects, sends a packet, writes to flash if needed, and responds to the main MCU. The RTL8722 is then powered off.

Code to read FTL:

// initialize FTL subsystem
	ftl_init(ftl_phy_page_start_addr, ftl_phy_page_num);

	// Read what's currently stored in flash
	ret = af_flash__read_user_data();

	// Check if the "magic" value is valid, if not let's write defaults
	if (__global_flash_user_data.flash_magic != AF_FLASH_DEFAULT_FLASH_MAGIC) 
{							/* default write forced by configuration */

		// Write defaults stored in `airista_flash.h`
		ret = af_flash__write_defaults(&__global_flash_user_data);
	}

// Read user data into global buffer, returns pdPASS or pdFAIL
int af_flash__read_user_data(void) {

	// read from storage
	int err = ftl_load_from_storage(&__global_flash_user_data, 0x0000, sizeof(af_flash__user_data_t));

	// check for errors (for debug print purposes)
	if (err == FTL_WRITE_ERROR_INVALID_ADDR) {
		AF_DEBUG(AF_CONFIG_DEBUG_FLASH, "[Warning]: found uninitialized FTL layer");  // this return code is not necessarily an error, just means FTL has never been initialized before
	}
	else if ((!err) != pdPASS) {
		AF_DEBUG(AF_CONFIG_DEBUG_FLASH, "[Error]: failed to read user data from flash! err=%d", err);  // everything here is an actual error
	}
	return ((!err) == pdPASS); // return failure (pdFAIL) or success (pdPASS)
}

// Write user data buffer into flash, and afterward read flash into global buffer, returns pdPASS or pdFAIL
int af_flash__write_user_data(af_flash__user_data_t* data) {

	// write data to storage
	int err = ftl_save_to_storage((void*) data, 0x0000, sizeof(af_flash__user_data_t));

	AF_DEBUG(AF_CONFIG_DEBUG_FLASH || AF_CONFIG_DEBUG_WLAN_FAST_CONNECT | AF_CONFIG_DEBUG_WIFI, "af_flash__write_user_data, err=%d",err);

	AF_ASSERT(err,"af_flash__write_user_data failed");

	// check for error (for debug print purposes)
	if ((!err) != pdPASS) {
		AF_DEBUG(AF_CONFIG_DEBUG_FLASH, "[Error]: failed to write user data to flash! err=%d", err);
	}

	return ((!err) == pdPASS); // return failure (pdFAIL) or success (pdPASS)
}

Flash Memory shouldn’t fail to read, since the firmware is also stored in it, so it sounds more like a problem with writing.
Modifying data works something like:
Read Flash → Store in RAM → Modify in RAM → Erase Flash → Write to Flash
You say you power the chip off after using it. Can this happen between erasing and writing to flash? Maybe you can try reading flash_magic after writing as an extra check? If you have 8KB of Flash, you can write a copy of the data to the second page (addr 4096) as a backup, because pages are normally erased separately, but that would be more of a workaround.
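
A rough sketch of the read-back check, in case it helps. It runs on a PC against an in-memory buffer: `mock_save` and `mock_load` are hypothetical stand-ins for the FTL save/load calls, and `user_data_t`, `write_verified` and `mock_drop_writes` are made-up names for illustration only. It assumes a return value of 0 means success.

```c
#include <stdint.h>
#include <string.h>

#define MAGIC 0xABCDu

typedef struct {
    uint16_t flash_magic;
    uint8_t  payload[32];   /* placeholder for the real parameters */
} user_data_t;

/* In-memory stand-in for the FTL logical space (assumption, not the SDK). */
static uint8_t mock_store[256];
static int mock_drop_writes = 0;   /* set to 1 to simulate a lost write */

static int mock_save(const void *src, uint16_t offset, uint16_t size) {
    if ((uint32_t)offset + size > sizeof mock_store) return -1;
    if (!mock_drop_writes) memcpy(&mock_store[offset], src, size);
    return 0;
}

static int mock_load(void *dst, uint16_t offset, uint16_t size) {
    if ((uint32_t)offset + size > sizeof mock_store) return -1;
    memcpy(dst, &mock_store[offset], size);
    return 0;
}

/* Returns 0 only if the data was written AND reads back identical;
 * retries up to max_attempts times, otherwise returns -1. */
static int write_verified(const user_data_t *data, int max_attempts) {
    user_data_t check;
    for (int i = 0; i < max_attempts; i++) {
        if (mock_save(data, 0, sizeof *data) != 0) continue;
        if (mock_load(&check, 0, sizeof check) != 0) continue;
        if (memcmp(&check, data, sizeof check) == 0) return 0;
    }
    return -1;
}
```

The point is that the chip would only report "done" to the main MCU after the read-back matches, so a write that silently did not land is noticed before power is cut.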

Thanks for your response. I understand it should not fail. It is an intermittent issue, which is why I need some advice on what to check.

To answer your questions:

  1. “Can this happen between erasing and writing to flash?” - No, the chip is powered off after the flash write calls. Do you think a delay is needed after “af_flash__write_user_data”?
  2. “Maybe you can try reading flash_magic after writing as an extra check?” - We do have the 2-byte flash magic and we are checking it. There are two ways flash contents can be reset to defaults: either “ftl_load_from_storage” returns an error, or it reads data but the magic does not match. Unfortunately, I have no way of knowing which one happened, as these were production tags that were not being logged.
  3. “you can write a copy of the data to the second page” - The FTL layer has read and write functions only. It is intended to manage where data is stored and to erase only when needed (erasing is transparent to the user). This is exactly why we chose FTL over the raw flash API. Our data is 400-500 bytes, so it did not make sense to erase an entire page every time we need to update it. The way I understand it, this is the whole reason behind FTL.
  4. Workaround - Yes, we will basically store the data twice, and if reading fails from one logical location, we will check the backup location. This should reduce the chance of this happening, although it will increase the frequency of the internal page erases that FTL must be making when a page is full and it needs to move to the next page. We are passing 3 to ftl_init(xx,3), so I am assuming FTL has 3 physical pages to work with.
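
On item 2, one way to tell the two reset causes apart in future test runs might be to classify the outcome of the load and pass it back to the main MCU in the response, since flash itself is the suspect. This is only a sketch: `load_result_t` and `classify_load` are hypothetical names, and it assumes the FTL read returns 0 on success.

```c
#include <stdint.h>

#define MAGIC 0xABCDu   /* same value the defaults are written with */

typedef enum {
    LOAD_OK = 0,        /* read succeeded and magic matched */
    LOAD_READ_ERROR,    /* the FTL read itself returned an error */
    LOAD_BAD_MAGIC      /* read succeeded but the magic did not match */
} load_result_t;

/* load_err: return code of the FTL read (0 assumed to mean success).
 * magic:    flash_magic field of the buffer that was just read. */
static load_result_t classify_load(int load_err, uint16_t magic) {
    if (load_err != 0)  return LOAD_READ_ERROR;
    if (magic != MAGIC) return LOAD_BAD_MAGIC;
    return LOAD_OK;
}
```

The result would be computed before defaults are written, so the cause is not lost when the buffer is overwritten.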

We will implement #4 above and try to re-run the same test.
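
For reference, the two-copy idea could look roughly like the sketch below. It is not our production code: it runs against an in-memory buffer, `mock_save`/`mock_load` are hypothetical stand-ins for the FTL save/load calls, and the offsets, `record_t` layout and added CRC field are assumptions for illustration. The CRC (CRC-16 with polynomial 0x1021, initial value 0xFFFF) is an addition on top of the magic, so that partially corrupted data is also rejected.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAGIC          0xABCDu
#define PRIMARY_OFFSET 0u     /* hypothetical logical offsets */
#define BACKUP_OFFSET  128u

typedef struct {
    uint16_t flash_magic;
    uint8_t  payload[30];     /* placeholder for the real parameters */
    uint16_t crc;             /* CRC over everything before this field */
} record_t;

static uint8_t mock_store[256];   /* in-memory stand-in for FTL space */

static int mock_save(const void *src, uint16_t offset, uint16_t size) {
    if ((uint32_t)offset + size > sizeof mock_store) return -1;
    memcpy(&mock_store[offset], src, size);
    return 0;
}

static int mock_load(void *dst, uint16_t offset, uint16_t size) {
    if ((uint32_t)offset + size > sizeof mock_store) return -1;
    memcpy(dst, &mock_store[offset], size);
    return 0;
}

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16(const uint8_t *p, size_t n) {
    uint16_t crc = 0xFFFF;
    while (n--) {
        crc ^= (uint16_t)((uint16_t)(*p++) << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

static int record_ok(const record_t *r) {
    return r->flash_magic == MAGIC &&
           r->crc == crc16((const uint8_t *)r, offsetof(record_t, crc));
}

/* Stamps magic and CRC, then writes both copies. Returns 0 if both succeed. */
static int save_both(record_t *r) {
    r->flash_magic = MAGIC;
    r->crc = crc16((const uint8_t *)r, offsetof(record_t, crc));
    int e1 = mock_save(r, PRIMARY_OFFSET, sizeof *r);
    int e2 = mock_save(r, BACKUP_OFFSET, sizeof *r);
    return (e1 == 0 && e2 == 0) ? 0 : -1;
}

/* Returns 0 if the primary copy was valid, 1 if the backup was used,
 * -1 if neither copy is valid (caller would then write defaults). */
static int load_any(record_t *out) {
    if (mock_load(out, PRIMARY_OFFSET, sizeof *out) == 0 && record_ok(out)) return 0;
    if (mock_load(out, BACKUP_OFFSET, sizeof *out) == 0 && record_ok(out)) return 1;
    return -1;
}
```

When `load_any` returns 1, the primary copy could be rewritten from the backup, at the cost of one more write.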

If there are any other suggestions, please let me know.

Thanks,
Ivan S.

Hey, thanks for the thorough explanation. You are right, FTL should take over the flash commands. A delay after writing is not necessary… if FTL was properly implemented that is. There aren’t many more things I can think of.
Are you using something like: wifi_disable_powersave();? That could be a problem.
What happens if the chip can’t connect to WiFi because it’s in a location with a temporarily bad connection?
I’m saying this because WiFi is using a separate thread from the main functions and powering down the chip while the WiFi is busy sending or reconnecting can lead to a corrupted state and undefined behavior on the next startup. This would only affect devices with poor connection.
You can test this by switching off WiFi first, then shutting down the chip while it attempts reconnecting.