Basic analyzing and plotting¶

This tutorial will go over the basics of analyzing eggs, the primary data structure used in quail. To learn about how an egg is set up, see the egg tutorial.

An egg is made up of (at minimum) the stimuli presented to a subject and the stimuli recalled by the subject. With these, two components we can perform a number of analyses:

Recall Accuracy - the proportion of stimuli presented that were later recalled
Serial Position Curve - recall accuracy as a function of the encoding position of the stimulus
Probability of First Recall - the probability that a stimulus will be recalled first as a function of its encoding position
Lag-CRP - given the recall of word n, the probability of recalling stimuli at neighboring positions (n+/-1, 2, 3 etc).
Temporal Clustering - a measure of recall clustering by temporal proximity during encoding

If we have a set of features for the stimuli, we can also compute a Memory Fingerprint, which is an estimate of how a subject clusters their recall responses with respect to features of a stimulus (see the fingerprint tutorial for more on this).

Let’s get to analyzing some eggs. First, we’ll load in some example data:

[1]:

import quail
%matplotlib inline

egg = quail.load_example_data()

This dataset is comprised of 30 subjects, who each performed 8 study/test blocks of 16 words each. Here are some of the presented words:

[2]:

egg.get_pres_items().head()

[2]:

		0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
Subject	List
0	0	BROCCOLI	CAULIFLOWER	ONION	PICKLE	STRAINER	SAUCER	DISH	BUTTERCUP	GRIDDLE	CARPET	FLOOR	FOUNDATION	ELEVATOR	AZALEA	DAHLIA	LOG
	1	POTATO	CHIMNEY	GERMANY	BLOUSE	EGYPT	LOBBY	JACKET	ARTICHOKE	CLOSET	SUIT	CUBA	GARLIC	CAMISOLE	SPINACH	IRAN	FURNACE
	2	OVEN	TUBA	MONTREAL	MUG	HIP	BROILER	PICCOLO	ARMS	DALLAS	ROME	TRUMPET	PELVIS	THERMOMETER	TAMBOURINE	PARIS	STOMACH
	3	MOOSE	MICHIGAN	CLEMENTINE	ANTELOPE	MONKEY	RIB	RACOON	FLORIDA	TONGUE	POMEGRANATE	PEAR	IOWA	PANCREAS	KANSAS	LEMON	TOOTH
	4	KITCHEN	ROSE	DOG	CARNATION	BARN	DONKEY	TIGER	EAR	FACE	GAZEBO	HEART	PETUNIA	HIPPOPOTAMUS	ALCOVE	TULIP	KNUCKLE

and some of the recalled words:

[3]:

egg.get_rec_items().head()

[3]:

		0	1	2	3	4	5	6	7	8	9	...	12	13	14	15	16	17	18	19	20	21
Subject	List
0	0	BROCCOLI	CAULIFLOWER	ONION	DISH	GRIDDLE	DAHLIA	SAUCER	AZALEA	[]	[]	...	[]	[]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	1	FURNACE	CHIMNEY	CUBA	GERMANY	ARTICHOKE	SPINACH	POTATO	SUIT	CLOSET	CHIMNEY	...	[]	[]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	2	MAINE	ARMS	PARIS	[]	[]	[]	[]	[]	[]	[]	...	[]	[]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	3	IS	RIB	PANCREAS	CLEMENTINE	LEMON	IOWA	FLORIDA	MICHIGAN	MOOSE	MONKEY	...	[]	[]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
	4	CARNATION	ROSE	TULIP	HIPPOPOTAMUS	ALCOVE	[]	[]	[]	[]	[]	...	[]	[]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 22 columns

We can start with the simplest analysis - recall accuracy - which is just the proportion of stimuli recalled that were in the encoding lists. To compute accuracy, simply call the analyze method, with the analysis key word argument set to accuracy:

Recall Accuracy¶

[4]:

acc = egg.analyze('accuracy')
acc.get_data().head()

[4]:

		0
b'Subject'	List
0	0	0.5000
	1	0.5625
	2	0.1250
	3	0.5625
	4	0.3125

The result is a FriedEgg data object. The accuracy data can be retrieved using the get_data method, which returns a multi-index Pandas DataFrame where the first-level index is the subject identifier and the second level index is the list number. By default, note that each list is analyzed separately. However, you can easily return the average over lists using the listgroup kew word argument:

[5]:

accuracy_avg = egg.analyze('accuracy', listgroup=['average']*8)
accuracy_avg.get_data().head()

[5]:

		0
b'Subject'	List
0	average	0.367188
1	average	0.601562
2	average	0.742188
3	average	0.546875
4	average	0.867188

Now, the result is a single value for each subject representing the average accuracy across the 16 lists. The listgroup kwarg can also be used to do some fancier groupings, like splitting the data into the first and second half of the experiment:

[6]:

accuracy_split = egg.analyze('accuracy', listgroup=['First Half']*4+['Second Half']*4)
accuracy_split.get_data().head()

[6]:

		0
b'Subject'	List
0	Second Half	0.296875
0	First Half	0.437500
1	Second Half	0.656250
1	First Half	0.546875
2	Second Half	0.750000

These analysis results can be passed directly into the plot function like so:

[7]:

accuracy_split.plot()

[7]:

<Axes: xlabel='List', ylabel='Accuracy'>

../_images/tutorial_basic_analyze_and_plot_14_1.png

For more details on plotting, see the advanced plotting tutorial. Next, lets take a look at the serial position curve analysis. As stated above the serial position curve (or spc) computes recall accuracy as a function of the encoding position of the stimulus. To use it, use the same analyze method illustrated above, but set the analysis kwarg to spc. Let’s also average across lists within subject:

Serial Position Curve¶

[8]:

spc = egg.analyze('spc', listgroup=['average']*8)
spc.get_data().head()

[8]:

		0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
b'Subject'	List
0	average	0.625	0.625	0.375	0.250	0.250	0.375	0.125	0.375	0.250	0.375	0.250	0.250	0.375	0.625	0.500	0.250
1	average	0.875	0.625	0.375	0.625	0.625	0.625	0.750	0.625	0.375	0.500	0.375	0.875	0.750	0.375	0.625	0.625
2	average	0.875	1.000	0.750	0.875	0.500	0.750	0.625	1.000	0.750	0.625	0.625	0.625	0.875	0.625	0.750	0.625
3	average	0.875	1.000	0.750	0.750	0.625	0.625	0.500	0.500	0.250	0.500	0.000	0.375	0.625	0.375	0.375	0.625
4	average	1.000	1.000	1.000	1.000	0.750	0.875	0.875	0.875	1.000	0.750	0.750	0.625	0.750	0.750	1.000	0.875

The result is a df where each row is a subject and each column is the encoding position of the word. To plot, simply pass the result of the analysis function to the plot function:

[9]:

spc.plot(ylim=[0, 1])

[9]:

<Axes: xlabel='Position', ylabel='Proportion recalled'>

../_images/tutorial_basic_analyze_and_plot_19_1.png

Probability of First Recall¶

The next analysis we’ll take a look at is the probability of first recall, which is the probability that a word will be recalled first as a function of its encoding position. To compute this, call the analyze method with the analysis kwarg set to pfr. Again, we’ll average over lists:

[10]:

pfr = egg.analyze('pfr', listgroup=['average']*8)
pfr.get_data().head()

[10]:

		0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
b'Subject'	List
0	average	0.250	0.000	0.000	0.125	0.0	0.0	0.000	0.0	0.0	0.125	0.125	0.0	0.000	0.000	0.000	0.125
1	average	0.250	0.000	0.000	0.000	0.0	0.0	0.000	0.0	0.0	0.000	0.000	0.0	0.250	0.000	0.375	0.000
2	average	0.375	0.000	0.000	0.000	0.0	0.0	0.125	0.0	0.0	0.000	0.125	0.0	0.125	0.125	0.000	0.125
3	average	0.625	0.125	0.000	0.000	0.0	0.0	0.000	0.0	0.0	0.000	0.000	0.0	0.000	0.000	0.125	0.125
4	average	0.250	0.000	0.125	0.000	0.0	0.0	0.000	0.0	0.0	0.000	0.000	0.0	0.375	0.125	0.000	0.125

This df is set up just like the serial position curve. To plot:

[11]:

pfr.plot()

[11]:

<Axes: xlabel='Position', ylabel='Probability of recall: position 0'>

../_images/tutorial_basic_analyze_and_plot_23_1.png

Lag-CRP¶

The next analysis to consider is the lag-CRP, which again is a function that given the recall of word n, returns the probability of recalling words at neighboring positions (n+/-1, 2, 3 etc). To use it? You guessed it: call the analyze method with the analysis kwarg set to lagcrp:

[12]:

lagcrp = egg.analyze('lagcrp', listgroup=['average']*8)
lagcrp.get_data().head()

[12]:

		-16	-15	-14	-13	-12	-11	-10	-9	-8	-7	...	7	8	9	10	11	12	13	14	15	16
b'Subject'	List
0	average	0.0	0.125	0.1250	0.0625	0.000000	0.000000	0.041667	0.187500	0.104167	0.031250	...	0.229167	0.087500	0.083333	0.000000	0.0	0.062500	0.250000	0.125	0.0	0.0
1	average	0.0	0.125	0.0000	0.0000	0.041667	0.041667	0.062500	0.072917	0.129167	0.041667	...	0.041667	0.114583	0.062500	0.062500	0.0	0.104167	0.187500	0.000	0.0	0.0
2	average	0.0	0.125	0.0625	0.0000	0.041667	0.041667	0.000000	0.000000	0.187500	0.000000	...	0.040625	0.087500	0.020833	0.031250	0.0	0.000000	0.000000	0.000	0.0	0.0
3	average	0.0	0.250	0.0000	0.0000	0.000000	0.166667	0.062500	0.062500	0.250000	0.093750	...	0.062500	0.093750	0.099107	0.020833	0.0	0.031250	0.000000	0.000	0.0	0.0
4	average	0.0	0.125	0.1250	0.0000	0.000000	0.000000	0.125000	0.000000	0.000000	0.013889	...	0.013889	0.000000	0.000000	0.000000	0.0	0.000000	0.041667	0.000	0.0	0.0

5 rows × 33 columns

Unlike the previous two analyses, the result of this analysis returns a df where the number of columns are double the length of the lists. To view the results:

[13]:

lagcrp.plot()

[13]:

<Axes: xlabel='Lag', ylabel='Conditional response probability'>

../_images/tutorial_basic_analyze_and_plot_27_1.png

Temporal clustering¶

Another way to evaluate temporal clustering is to measure the temporal distance of each transition made with respect to where on a list the subject could have transitioned. This ‘temporal clustering score’ is a good summary of how strongly participants are clustering their responses according to temporal proximity during encoding.

[14]:

temporal = egg.analyze('temporal', listgroup=['First Half']*4+['Second Half']*4)
temporal.plot(plot_style='violin', ylim=[0,1])

[14]:

<Axes: xlabel='List', ylabel='Temporal clustering score'>

../_images/tutorial_basic_analyze_and_plot_29_1.png

Memory Fingerprint¶

Last but not least is the memory fingerprint analysis. For a detailed treatment of this analysis, see the fingerprint tutorial.

As described in the fingerprint tutorial, the features data structure is used to estimate how subjects cluster their recall responses with respect to the features of the encoded stimuli. Briefly, these estimates are derived by computing the similarity of neighboring recall words along each feature dimension. For example, if you recall “dog”, and then the next word you recall is “cat”, your clustering by category score would increase because the two recalled words are in the same category. Similarly, if after you recall “cat” you recall the word “can”, your clustering by starting letter score would increase, since both words share the first letter “c”. This logic can be extended to any number of feature dimensions.

Here is a glimpse of the features df:

[15]:

egg.feature_names

[15]:

['color',
 'location',
 'category',
 'firstLetter',
 'size',
 'wordLength',
 'Temporal']

Like the other analyses, computing the memory fingerprint can be done using the analyze method with the analysis kwarg set to fingerprint:

[16]:

fingerprint = egg.analyze('fingerprint', listgroup=['average']*8)
fingerprint.get_data().head()

[16]:

		color	location	category	firstLetter	size	wordLength	Temporal
b'Subject'	List
0	average	0.539366	0.539366	0.622640	0.574667	0.688298	0.564399	0.446074
1	average	0.547466	0.547466	0.630292	0.532948	0.588362	0.546864	0.607850
2	average	0.556769	0.556769	0.666886	0.587325	0.629711	0.523147	0.729072
3	average	0.545835	0.545835	0.595446	0.544486	0.581789	0.483173	0.615956
4	average	0.576243	0.576243	0.587332	0.578217	0.614837	0.568500	0.865006

The result of this analysis is a df, where each row is a subject’s fingerprint and each column is a feature dimensions. The values represent a subjects tendency to cluster their recall responses along a particular feature dimensions. They are probability values, and thus, greater values indicate more clustering along that feature dimension. To plot, simply pass the result to the plot function:

[17]:

order=sorted(egg.feature_names)
fingerprint.plot(order=order, ylim=[0, 1])

[17]:

<Axes: xlabel='Feature', ylabel='Clustering score'>

../_images/tutorial_basic_analyze_and_plot_35_1.png

This result suggests that subjects in this example dataset tended to cluster their recall responses by category as well as the size (bigger or smaller than a shoebox) of the word. List length and other properties of your experiment can bias these clustering scores. To help with this, we implemented a permutation clustering procedure which shuffles the order of each recall list and recomputes the clustering score with respect to that distribution. Note: this also works with the temporal clustering analysis.

[18]:

# warning: this can take a little while.  Setting parallel=True will help speed up the permutation computation
# fingerprint = quail.analyze(egg, analysis='fingerprint', listgroup=['average']*8, permute=True, n_perms=100)
# ax = quail.plot(fingerprint, ylim=[0,1.2])